Search for publications in DiVA (liu.se)
Results 1-50 of 154
  • 1.
    Abrahamsson, Peder
    Linköping University, Department of Computer and Information Science.
    Mer lättläst: Påbyggnad av ett automatiskt omskrivningsverktyg till lätt svenska (2011). Independent thesis, Basic level (degree of Bachelor), 12 credits / 18 HE credits. Student thesis.
    Abstract [sv]

    The Swedish language should be accessible to everyone who lives and works in Sweden. It is therefore important that easy-to-read alternatives exist for those who have difficulty reading Swedish text. This work builds on earlier results showing that it is possible to create an automatic rewriting program that makes texts easier to read. It is based on CogFLUX, a tool for automatic rewriting into easy-to-read Swedish. CogFLUX contains functions for syntactically rewriting texts into more readable Swedish, using rewriting rules developed in an earlier project. In this work, additional rewriting rules are implemented, together with a new module for handling synonyms. With these new rules and the synonym module, the work investigates whether it is possible to build a system that produces more readable text according to established readability measures such as LIX, OVIX and the nominal ratio. The rewriting rules and the synonym handler are tested on three different texts with a total length of roughly one hundred thousand words. The work shows that both the LIX value and the nominal ratio can be lowered significantly with the help of the rewriting rules and the synonym handler, and that more work remains before a really good program for automatic rewriting into easy-to-read Swedish can be produced.
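    The readability measures named above have simple closed forms: LIX is words per sentence plus 100 times the share of words longer than six letters, and OVIX is a length-insensitive type/token measure. As a rough illustration (not part of the thesis or of CogFLUX), a minimal Python sketch with naive tokenization might look as follows; the nominal ratio is omitted since it requires part-of-speech tagging.

        import math
        import re

        def lix(text):
            """LIX: words per sentence plus 100 times the share of
            long words (more than six letters)."""
            words = re.findall(r"\w+", text, re.UNICODE)
            sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
            long_words = [w for w in words if len(w) > 6]
            return len(words) / len(sentences) + 100 * len(long_words) / len(words)

        def ovix(text):
            """OVIX word variation index, from type (v) and token (n) counts."""
            words = [w.lower() for w in re.findall(r"\w+", text, re.UNICODE)]
            n, v = len(words), len(set(words))
            return math.log(n) / math.log(2 - math.log(v) / math.log(n))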

  • 2.
    Ahrenberg, Lars
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    A Simple Hybrid Aligner for Generating Lexical Correspondences in Parallel Texts (1998). In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL'98) / [ed] Pierre Isabelle, Stroudsburg, PA, USA: The Association for Computational Linguistics, 1998, p. 29-35. Conference paper (Refereed).
  • 3.
    Ahrenberg, Lars
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Alignment-based profiling of Europarl data in an English-Swedish parallel corpus (2010). In: Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10) / [ed] Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner and Daniel Tapias, Paris, France: European Language Resources Association (ELRA), 2010, p. 3398-3404. Conference paper (Refereed).
    Abstract [en]

    This paper profiles the Europarl part of an English-Swedish parallel corpus and compares it with three other subcorpora of the same parallel corpus. We first describe our method for comparison, which is based on alignments at both the token level and the structural level. Although two of the other subcorpora contain fiction, it is found that the Europarl part is the one having the highest proportion of many types of restructurings, including additions, deletions and long-distance reorderings. We explain this by the fact that the majority of Europarl segments are parallel translations.

  • 4.
    Ahrenberg, Lars
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering.
    Comparing machine translation and human translation: A case study (2017). In: Proceedings of The First Workshop on Human-Informed Translation and Interpreting Technology (HiT-IT) / [ed] Irina Temnikova, Constantin Orasan, Gloria Corpas and Stephan Vogel, Wolverhampton, UK, 2017, p. 21-28. Conference paper (Refereed).
    Abstract [en]

    As machine translation technology improves, comparisons to human performance are often made in quite general and exaggerated terms. Thus, it is important to be able to account for differences accurately. This paper reports a simple, descriptive scheme for comparing translations and applies it to two translations of a British opinion article published in March 2017. One is a human translation (HT) into Swedish, and the other a machine translation (MT). While the comparison is limited to one text, the results are indicative of current limitations in MT.

  • 5.
    Ahrenberg, Lars
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering.
    Converting an English-Swedish Parallel Treebank to Universal Dependencies (2015). In: Proceedings of the Third International Conference on Dependency Linguistics (DepLing 2015), Association for Computational Linguistics, 2015, p. 10-19, article id W15-2103. Conference paper (Refereed).
    Abstract [en]

    The paper reports experiences of automatically converting the dependency analysis of the LinES English-Swedish parallel treebank to universal dependencies (UD). The most tangible result is a version of the treebank that actually employs the relations and parts-of-speech categories required by UD, and no other. It is also more complete in that punctuation marks have received dependencies, which is not the case in the original version. We discuss our method in the light of problems that arise from the desire to keep the syntactic analyses of a parallel treebank internally consistent, while available monolingual UD treebanks for English and Swedish diverge somewhat in their use of UD annotations. Finally, we compare the output from the conversion program with the existing UD treebanks.

  • 6.
    Ahrenberg, Lars
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering.
    Towards a research infrastructure for translation studies (2014). Conference paper (Other academic).
    Abstract [en]

    In principle the CLARIN research infrastructure provides a good environment to support research on translation. In reality, the progress within CLARIN in this area seems to be fairly slow. In this paper I will give examples of the resources currently available, and suggest what is needed to achieve a relevant research infrastructure for translation studies. Also, I argue that translation studies has more to gain from language technology, and statistical machine translation in particular, than what is generally assumed, and give some examples.

  • 7.
    Ahrenberg, Lars
    et al.
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Merkel, Magnus
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    A knowledge-lite approach to word alignment (2000). In: Parallel Text Processing: Alignment and Use of Translation Corpora / [ed] Jean Veronis, Dordrecht, The Netherlands: Kluwer Academic Publishers, 2000, p. 97-116. Chapter in book (Other academic).
    Abstract [en]

    The most promising approach to word alignment is to combine statistical methods with non-statistical information sources. Some of the proposed non-statistical sources, including bilingual dictionaries, POS-taggers and lemmatizers, rely on considerable linguistic knowledge, while other knowledge-lite sources such as cognate heuristics and word order heuristics can be implemented relatively easily. While knowledge-heavy sources might be expected to give better performance, knowledge-lite systems are easier to port to new language pairs and text types, and they can give sufficiently good results for many purposes, e.g. if the output is to be used by a human user for the creation of a complete word-aligned bitext. In this paper we describe the current status of the Linköping Word Aligner (LWA), which combines the use of statistical measures of co-occurrence with four knowledge-lite modules for (i) word categorization, (ii) morphological variation, (iii) word order, and (iv) phrase recognition. We demonstrate the portability of the system (from English-Swedish texts to French-English texts) and present results for these two language pairs. Finally, we report observations from an error analysis of system output, and identify the major strengths and weaknesses of the system.
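    As a rough sketch of the knowledge-lite idea (not the LWA system itself), the following Python combines a Dice co-occurrence score over sentence-aligned bitext with a cognate heuristic based on string similarity; the bonus weight and similarity threshold are invented illustration values.

        from collections import Counter
        from difflib import SequenceMatcher

        def dice(pair_count, src_count, tgt_count):
            """Dice coefficient of co-occurrence for a candidate word pair."""
            return 2 * pair_count / (src_count + tgt_count)

        def cognate_bonus(src_word, tgt_word, threshold=0.7):
            """Knowledge-lite cognate heuristic: reward string similarity."""
            sim = SequenceMatcher(None, src_word.lower(), tgt_word.lower()).ratio()
            return 0.5 if sim >= threshold else 0.0

        def score_pairs(bitext):
            """Score source/target word pairs in sentence-aligned bitext,
            given as (source_tokens, target_tokens) sentence pairs."""
            src_c, tgt_c, pair_c = Counter(), Counter(), Counter()
            for src_sent, tgt_sent in bitext:
                for s in set(src_sent):
                    src_c[s] += 1
                for t in set(tgt_sent):
                    tgt_c[t] += 1
                for s in set(src_sent):
                    for t in set(tgt_sent):
                        pair_c[(s, t)] += 1
            return {(s, t): dice(c, src_c[s], tgt_c[t]) + cognate_bonus(s, t)
                    for (s, t), c in pair_c.items()}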

  • 8.
    Ahrenberg, Lars
    et al.
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Merkel, Magnus
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Correspondence measures for MT evaluation (2000). In: Proceedings of the Second International Conference on Linguistic Resources and Evaluation (LREC-2000), Paris, France: European Language Resources Association (ELRA), 2000, p. 41-46. Conference paper (Refereed).
  • 9.
    Ahrenberg, Lars
    et al.
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Merkel, Magnus
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Sågvall Hein, Anna
    Institutionen för lingvistik, Uppsala universitet.
    Tiedemann, Jörg
    Institutionen för lingvistik, Uppsala universitet.
    Evaluation of word alignment systems (2000). In: Proceedings of the Second International Conference on Linguistic Resources and Evaluation (LREC-2000), Paris, France: European Language Resources Association (ELRA), 2000, p. 1255-1261. Conference paper (Refereed).
  • 10.
    Albertsson, Sarah
    et al.
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering. SICS East Swedish ICT AB, Linköping, Sweden.
    Rennes, Evelina
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering. SICS East Swedish ICT AB, Linköping, Sweden.
    Jönsson, Arne
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Arts and Sciences. SICS East Swedish ICT AB, Linköping, Sweden.
    Similarity-Based Alignment of Monolingual Corpora for Text Simplification (2016). In: CL4LC 2016 - Computational Linguistics for Linguistic Complexity: Proceedings of the Workshop, 2016, p. 154-163. Conference paper (Refereed).
    Abstract [en]

    Comparable or parallel corpora are beneficial for many NLP tasks. The automatic collection of corpora enables large-scale resources, even for less-resourced languages, which in turn can be useful for deducing rules and patterns for text rewriting algorithms, a subtask of automatic text simplification. We present two methods for the alignment of Swedish easy-to-read text segments to text segments from a reference corpus. The first method (M1) was originally developed for the task of text reuse detection, measuring sentence similarity by a modified version of a TF-IDF vector space model. A second method (M2), also accounting for part-of-speech tags, was developed, and the methods were compared. For evaluation, a crowdsourcing platform was built for human judgement data collection, and preliminary results showed that cosine similarity relates better to human ranks than the Dice coefficient. We also saw a tendency that including syntactic context in the TF-IDF vector space model is beneficial for this kind of paraphrase alignment task.
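    A minimal sketch of the general technique (standard TF-IDF cosine similarity, not the modified model of M1 or the crowdsourced evaluation), assuming scikit-learn and two placeholder segment lists:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        # Placeholder data: easy-to-read segments and reference-corpus segments.
        easy = ["en kort och enkel mening", "en annan enkel mening"]
        reference = ["en betydligt längre och mer komplicerad mening",
                     "en enkel mening"]

        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform(easy + reference)
        # Rows: easy segments; columns: reference segments.
        similarities = cosine_similarity(matrix[:len(easy)], matrix[len(easy):])

        # Align each easy segment with its most similar reference segment.
        for i, j in enumerate(similarities.argmax(axis=1)):
            print(easy[i], "<->", reference[j])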

  • 11.
    Askarieh, Sona
    Linköping University, Department of Culture and Communication. Linköping University, Faculty of Arts and Sciences.
    Cohesion and Comprehensibility in Swedish-English Machine Translated Texts (2014). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    Access to various texts in different languages causes an increasing demand for fast, multi-purpose, and cheap translators. Pervasive internet use intensifies the necessity for intelligent and cheap translators, since traditional translation methods are excessively slow for translating different texts. Over the past years, scientists have carried out much research in order to add human and artificial intelligence to older machine translation systems; the idea of developing a machine translation system came into existence around the days of the Second World War (Koehn, 2010). The new invention was useful for helping human translators and many other people who need to translate different types of texts according to their needs. Since machine translation systems vary in the quality of their output, their performance should be evaluated from a linguistic point of view in order to reach a fair judgment about the quality of the systems' output. To achieve this goal, two Swedish texts were translated by two different machine translation systems in the thesis. The translated texts were evaluated to examine the extent to which errors affect the comprehensibility of the translations. The performance of the systems was evaluated using three approaches. Firstly, the most common linguistic errors appearing in the machine translation systems' output were analyzed (e.g. word alignment of the translated texts). Secondly, the influence of different types of errors on cohesion chains was evaluated. Finally, the effect of the errors on the comprehensibility of the translations was investigated.

    Numerical results showed that some types of errors have more effect on the comprehensibility of the systems' output than others. The obtained data illustrated that the subjects' comprehension of the translated texts depends on the type of error, but not on its frequency. The analysis also showed which translation system had the best performance.

  • 12.
    Auer, Cornelia
    et al.
    Zuse Institut Berlin, Germany.
    Hotz, Ingrid
    Zuse Institut Berlin, Germany.
    Complete Tensor Field Topology on 2D Triangulated Manifolds embedded in 3D (2011). In: Computer graphics forum (Print), ISSN 0167-7055, E-ISSN 1467-8659, Vol. 30, no 3, p. 831-840. Article in journal (Refereed).
    Abstract [en]

    This paper is concerned with the extraction of the surface topology of tensor fields on 2D triangulated manifolds embedded in 3D. In scientific visualization, topology is a meaningful instrument to get a hold on the structure of a given dataset. Due to the discontinuity of tensor fields on a piecewise planar domain, standard topology extraction methods result in an incomplete topological skeleton. In particular with regard to the high computational costs of the extraction, this is not satisfactory. This paper provides a method for topology extraction of tensor fields that leads to complete results. The core idea is to include the locations of discontinuity in the topological analysis. For this purpose the model of continuous transition bridges is introduced, which allows the entire topology to be captured on the discontinuous field. The proposed method is applied to piecewise linear three-dimensional tensor fields defined on the vertices of the triangulation, and to piecewise constant two- or three-dimensional tensor fields given per triangle, e.g. rate-of-strain tensors of piecewise linear flow fields.

  • 13.
    Axelsson, Nils
    Linköping University, Department of Computer and Information Science, Human-Centered systems.
    Dynamic Programming Algorithms for Semantic Dependency Parsing (2017). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    Dependency parsing can be a useful tool for allowing computers to parse text. In 2015, Kuhlmann and Jonsson proposed a logical deduction system that parses to non-crossing dependency graphs with an asymptotic time complexity of O(n³), where n is the length of the sentence to parse. This thesis extends the deduction system by Kuhlmann and Jonsson; the extended deduction system introduces certain crossing edges, while maintaining an asymptotic time complexity of O(n⁴).

    In order to extend the deduction system by Kuhlmann and Jonsson, fifteen logical item types are added to the five proposed by Kuhlmann and Jonsson. These item types allow the deduction system to introduce crossing edges while acyclicity can be guaranteed. The number of inference rules in the deduction system is increased from the 19 proposed by Kuhlmann and Jonsson to 172, mainly because of the larger number of combinations of the 20 item types.

    The results are a modest increase in coverage on test data (by roughly 10% absolutely, i.e. approx. from 70% to 80%), and a comparable placement to that of Kuhlmann and Jonsson by the SemEval 2015 task 18 metrics. With the method employed to introduce crossing edges, derivational uniqueness is impossible to maintain. It is hard to define the graph class to which the extended algorithm, QAC, parses; it is therefore empirically compared to 1-endpoint-crossing graphs and graphs with a page number of two or less, compared to which it achieves lower coverage on test data. The QAC graph class is not limited by page number or crossings.

    The takeaway of the thesis is that extending a very minimal deduction system is not necessarily the best approach, and that it may be better to start off with a strong idea of to which graph class the extended algorithm should parse. Additionally, several alternative ways of extending Kuhlmann and Jonsson are proposed.
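    To give a flavour of the O(n³) interval dynamic programming involved (a simplification, not the deduction system of the thesis: it ignores edge direction and acyclicity, which is precisely what the extra item types handle), consider maximizing the total weight of a set of pairwise non-crossing arcs:

        from functools import lru_cache

        def best_noncrossing(score, n):
            """Maximum-weight set of pairwise non-crossing arcs over
            positions 0..n-1, where score(i, j) is the (possibly negative)
            weight of an arc between i and j. O(n^2) intervals times an
            O(n) split point gives O(n^3), as in chart parsing."""
            @lru_cache(maxsize=None)
            def best(i, j):  # best total weight inside the interval [i, j]
                if j <= i:
                    return 0.0
                # Either position i is no arc endpoint inside [i, j] ...
                result = best(i + 1, j)
                # ... or i links to some k in (i, j], splitting the interval.
                for k in range(i + 1, j + 1):
                    gain = max(score(i, k), 0.0)
                    result = max(result, gain + best(i + 1, k - 1) + best(k, j))
                return result
            return best(0, n - 1)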

  • 14.
    Axelsson, Robin
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, The Institute of Technology.
    Implementation och utvärdering av termlänkare i Java (2013). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
    Abstract [en]

    Aligning parallel terms in a parallel corpus can be done by aligning all words and phrases in the corpus and then performing term extraction on the aligned set of word pairs. Alternatively, term extraction in the source and target text can be done separately, after which the resulting term candidates can be aligned, forming aligned parallel terms. This thesis describes an implementation of a word aligner that is applied to extracted term candidates in both the source and the target texts. The term aligner uses statistical measures, the tool Giza++ and heuristics in the search for alignments. The evaluation reveals that the best results are obtained when the term alignment relies heavily on the Giza++ tool and the Levenshtein heuristic.
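    The Levenshtein heuristic mentioned here is standard edit distance; a sketch (with a made-up normalized-distance threshold, not the thesis's Java implementation) of how it could filter term-candidate pairs:

        def levenshtein(a, b):
            """Classic dynamic-programming edit distance."""
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                curr = [i]
                for j, cb in enumerate(b, 1):
                    curr.append(min(prev[j] + 1,                 # deletion
                                    curr[j - 1] + 1,             # insertion
                                    prev[j - 1] + (ca != cb)))   # substitution
                prev = curr
            return prev[-1]

        def similar_terms(src_term, tgt_term, max_norm_dist=0.4):
            """Accept a candidate pair if the normalized edit distance
            is small enough."""
            d = levenshtein(src_term.lower(), tgt_term.lower())
            return d / max(len(src_term), len(tgt_term)) <= max_norm_dist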

  • 15. Bremin, Sofia
    et al.
    Hu, Hongzhan
    Karlsson, Johanna
    Prytz Lillkull, Anna
    Wester, Martin
    Danielsson, Henrik
    Linköping University, The Swedish Institute for Disability Research. Linköping University, Department of Behavioural Sciences and Learning, Disability Research. Linköping University, Faculty of Arts and Sciences.
    Stymne, Sara
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Methods for human evaluation of machine translation (2010). In: Proceedings of the Swedish Language Technology Conference (SLTC2010), 2010, p. 47-48. Conference paper (Other academic).
  • 16.
    Bretan, Ivan
    et al.
    Telia Research AB, Haninge, SWEDEN.
    Eklund, Robert
    Telia Research AB, Haninge, SWEDEN.
    MacDermid, Catriona
    Telia Research AB, Haninge, SWEDEN.
    Approaches to gathering realistic training data for speech translation systems (1996). In: Proceedings of Third IEEE Workshop on Interactive Voice Technology for Telecommunications Applications, Institute of Electrical and Electronics Engineers (IEEE), 1996, p. 97-100. Conference paper (Refereed).
    Abstract [en]

    The Spoken Language Translator (SLT) is a multi-lingual speech-to-speech translation prototype supporting English, Swedish and French within the air traffic information system (ATIS) domain. The design of SLT is characterized by a strongly corpus-driven approach, which accentuates the need for cost-efficient collection procedures to obtain training data. This paper discusses various approaches to the data collection issue pursued within a speech translation framework. Original American English speech and language data have been collected using traditional Wizard-of-Oz (WOZ) techniques, a relatively costly procedure yielding high-quality results. The resulting corpus has been translated textually into Swedish by a large number of native speakers (427) and used as prompts for training the target language speech model. This "budget" collection method is compared to the accepted method, i.e., gathering data by means of a full-blown WOZ simulation. The results indicate that although translation in this case proved economical and produced considerable data, the method is not sensitive to certain features typical of spoken language, for which WOZ is superior.

  • 17.
    Capshaw, Riley
    Linköping University, Department of Computer and Information Science, Human-Centered systems.
    Relation Classification using Semantically-Enhanced Syntactic Dependency Paths: Combining Semantic and Syntactic Dependencies for Relation Classification using Long Short-Term Memory Networks (2018). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    Many approaches to solving tasks in the field of Natural Language Processing (NLP) use syntactic dependency trees (SDTs) as a feature to represent the latent nonlinear structure within sentences. Recently, work in parsing sentences to graph-based structures which encode semantic relationships between words—called semantic dependency graphs (SDGs)—has gained interest. This thesis seeks to explore the use of SDGs in place of and alongside SDTs within a relation classification system based on long short-term memory (LSTM) neural networks. Two methods for handling the information in these graphs are presented and compared between two SDG formalisms. Three new relation extraction system architectures have been created based on these methods and are compared to a recent state-of-the-art LSTM-based system, showing comparable results when semantic dependencies are used to enhance syntactic dependencies, but with significantly fewer training parameters.
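    A minimal PyTorch sketch of the kind of LSTM-based path classifier described (an illustration of the general architecture, not the systems evaluated in the thesis; all dimensions are placeholders):

        import torch
        import torch.nn as nn

        class PathLSTMClassifier(nn.Module):
            """Encode the token sequence along the dependency path between
            two entities with an LSTM, then classify the final hidden state."""
            def __init__(self, vocab_size, embed_dim, hidden_dim, n_relations):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, embed_dim)
                self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
                self.out = nn.Linear(hidden_dim, n_relations)

            def forward(self, path_ids):          # (batch, path_length)
                embedded = self.embed(path_ids)   # (batch, path_length, embed_dim)
                _, (h_n, _) = self.lstm(embedded)
                return self.out(h_n[-1])          # unnormalized relation scores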

  • 18.
    Carlsson, Bertil
    et al.
    Linköping University, Department of Computer and Information Science. Linköping University, The Institute of Technology.
    Jönsson, Arne
    Linköping University, Department of Computer and Information Science. Linköping University, The Institute of Technology.
    Using the pyramid method to create gold standards for evaluation of extraction based text summarization techniques (2010). In: Proceedings of the Third Swedish Language Technology Conference (SLTC-2010), 2010. Conference paper (Refereed).
  • 19.
    Cederblad, Gustav
    Linköping University, Department of Computer and Information Science.
    Finding Synonyms in Medical Texts: Creating a system for automatic synonym extraction from medical texts (2018). Independent thesis, Basic level (degree of Bachelor), 12 credits / 18 HE credits. Student thesis.
    Abstract [en]

    This thesis describes the work of creating an automatic system for identifying synonyms and semantically related words in medical texts. Before this work, as a part of the project E-care@home, medical texts had been classified as either lay or specialized by both a lay annotator and an expert annotator. The lay annotator, in this case, is a person without any medical knowledge, whereas the expert annotator has professional knowledge in medicine. Using these texts made it possible to create co-occurrence matrices from which the related words could be identified. Fifteen medical terms were chosen as system input. The Dice similarity of these words in a context window of ten words around them was calculated. As output, five candidate related terms for each medical term were returned. Only unigrams were considered. The candidate related terms were evaluated using a questionnaire, in which 223 healthcare professionals rated the similarity on a scale from one to five. A Fleiss kappa test showed that the agreement among these raters was 0.28, which is a fair agreement. The evaluation further showed that there was a significant correlation between the human ratings and the relatedness score (Dice similarity). That is, words with higher Dice similarity tended to get a higher human rating. However, the Dice similarity interval in which the words got the highest average human rating was 0.35-0.39. This result means that there is much room for improving the system. Further development of the system should remove the unigram limitation and expand the corpus to provide more accurate and reliable results.
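    A sketch of the core computation (context-window co-occurrence followed by a Dice coefficient over the resulting profiles); the window size of ten follows the abstract, everything else is a simplification:

        from collections import Counter, defaultdict

        def context_profiles(tokens, window=10):
            """Count, for every word, the words occurring within a
            symmetric context window around it."""
            profiles = defaultdict(Counter)
            for i, w in enumerate(tokens):
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for c in tokens[lo:i] + tokens[i + 1:hi]:
                    profiles[w][c] += 1
            return profiles

        def dice_similarity(p, q):
            """Dice coefficient between two context profiles (as sets)."""
            a, b = set(p), set(q)
            return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0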

  • 20.
    De Bona, Fabio
    et al.
    Friedrich Miescher Laboratory of the Max Planck Society, Tübingen, Germany.
    Riezler, Stefan
    Hall, Keith
    Ciaramita, Massimiliano
    Herdagdelen, Amac
    University of Trento, Rovereto, Italy.
    Holmqvist, Maria
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Learning dense models of query similarity from user click logs (2010). In: HLT '10: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010, p. 474-482. Conference paper (Refereed).
  • 21.
    Debusmann, Ralph
    et al.
    Saarland University, Saarbrücken, Germany.
    Kuhlmann, Marco
    Uppsala universitet, Institutionen för lingvistik och filologi.
    Dependency Grammar: Classification and Exploration (2010). In: Resource-Adaptive Cognitive Processes / [ed] Matthew W. Crocker, Jörg Siekmann, Springer Berlin/Heidelberg, 2010, p. 365-388. Chapter in book (Other academic).
    Abstract [en]

    Syntactic representations based on word-to-word dependencies have a long tradition in descriptive linguistics [29]. In recent years, they have also become increasingly used in computational tasks, such as information extraction [5], machine translation [43], and parsing [42]. Among the purported advantages of dependency over phrase structure representations are conciseness, intuitive appeal, and closeness to semantic representations such as predicate-argument structures. On the more practical side, dependency representations are attractive due to the increasing availability of large corpora of dependency analyses, such as the Prague Dependency Treebank [19].

  • 22.
    Dienes, Péter
    et al.
    Saarland University, Saarbrücken, Germany.
    Koller, Alexander
    Saarland University, Saarbrücken, Germany.
    Kuhlmann, Marco
    Saarland University, Saarbrücken, Germany.
    Statistical A-Star Dependency Parsing (2003). In: Proceedings of the Workshop on Prospects and Advances in the Syntax/Semantics Interface / [ed] Denys Duchier and Geert-Jan Kruijff, 2003, p. 85-89. Conference paper (Refereed).
    Abstract [en]

    Extensible Dependency Grammar (XDG; Duchier and Debusmann (2001)) is a recently developed dependency grammar formalism that allows the characterization of linguistic structures along multiple dimensions of description. It can be implemented efficiently using constraint programming (CP; Koller and Niehren 2002). In the CP context, parsing is cast as a search problem: The states of the search are partial parse trees, successful end states are complete and valid parses. In this paper, we propose a probability model for XDG dependency trees and an A-Star search control regime for the XDG parsing algorithm that guarantees the best parse to be found first. Extending XDG with a statistical component has the benefit of bringing the formalism further into the grammatical mainstream; it also enables XDG to efficiently deal with large, corpus-induced grammars that come with a high degree of ambiguity.
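    The control regime itself is ordinary A*: with arc costs taken as negative log-probabilities and an admissible (never overestimating) heuristic, the first complete parse popped from the agenda is guaranteed to be the best one. A generic sketch, not the XDG implementation; the state representation and the expand/is_goal/heuristic functions are left abstract:

        import heapq

        def astar_parse(initial, expand, is_goal, heuristic):
            """Best-first search over partial parses. `expand` yields
            (cost, successor) pairs; `heuristic` must never overestimate
            the remaining cost."""
            tiebreak = 0
            frontier = [(heuristic(initial), 0.0, tiebreak, initial)]
            while frontier:
                _, g, _, state = heapq.heappop(frontier)
                if is_goal(state):
                    return state, g      # first goal popped is optimal
                for cost, nxt in expand(state):
                    tiebreak += 1        # avoids comparing states on ties
                    heapq.heappush(frontier,
                                   (g + cost + heuristic(nxt), g + cost,
                                    tiebreak, nxt))
            return None, float("inf")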

  • 23.
    Drewes, Frank
    et al.
    Umeå University.
    Knight, Kevin
    University of Southern California, Information Sciences Institute.
    Kuhlmann, Marco
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering.
    Formal Models of Graph Transformation in Natural Language Processing (Dagstuhl Seminar 15122) (2015). In: Dagstuhl Reports, ISSN 2192-5283, Vol. 5, no 3, p. 143-161. Article in journal (Other academic).
    Abstract [en]

    In natural language processing (NLP) there is an increasing interest in formal models for processing graphs rather than more restricted structures such as strings or trees. Such models of graph transformation have previously been studied and applied in various other areas of computer science, including formal language theory, term rewriting, theory and implementation of programming languages, concurrent processes, and software engineering. However, few researchers from NLP are familiar with this work, and at the same time, few researchers from the theory of graph transformation are aware of the specific desiderata, possibilities and challenges that one faces when applying the theory of graph transformation to NLP problems. The Dagstuhl Seminar 15122 “Formal Models of Graph Transformation in Natural Language Processing” brought researchers from the two areas together. It initiated an interdisciplinary exchange about existing work, open problems, and interesting applications.

  • 24.
    Drewes, Frank
    et al.
    Umeå University, Umeå, Sweden.
    Kuhlmann, Marco
    Uppsala universitet, Institutionen för lingvistik och filologi.
    ATANLP 2012 Workshop on Applications of Tree Automata Techniques in Natural Language Processing: Proceedings of the Workshop (2012). Conference proceedings (editor) (Other academic).
  • 25.
    Drewes, Frank
    et al.
    Umeå University, Umeå, Sweden.
    Kuhlmann, Marco
    Uppsala universitet, Institutionen för lingvistik och filologi.
    Workshop on Applications of Tree Automata in Natural Language Processing 2010 (ATANLP 2010) (2010). Conference proceedings (editor) (Other academic).
  • 26.
    Edholm, Lars
    Linköping University, Department of Computer and Information Science.
    Automatisk kvalitetskontroll av terminologi i översättningar (2007). Independent thesis, Advanced level (degree of Magister), 20 points / 30 hp. Student thesis.
    Abstract [en]

    Quality in translations depends on the correct use of specialized terms, which can make the translation easier to understand as well as reduce the required time and costs for the translation (Lommel, 2007). Consistent use of terminology is important, and should be taken into account during quality checks of, for example, translated documentation (Esselink, 2000). Today, several commercial programs have functions for automatic quality checking of terminology. The aim of this study is to evaluate such functions, since no earlier major study of this has been found.

    To get some insight into quality checking in practice, two qualitative interviews were initially carried out with individuals involved in this at a translation agency. The results were compared to current theories in the subject field and revealed a general agreement with, for example, the recommendations of Bass (2006).

    The evaluations started with an examination of the recall of a genuine terminology database compared to subjectively marked terms in a test corpus based on an authentic translation memory. The examination, however, revealed a relatively low recall. To increase the recall, the terminology database was modified; for example, it was extended with longer terms from the test corpus.

    After that, the function for checking terminology in four different commercial programs was run on the test corpus using the modified terminology database. Finally, the test corpus was also modified by planting a number of errors in it, to produce a more idealized evaluation. The results from the programs, in the form of alarms for potential errors, were categorized and judged as true or false alarms. This constitutes a base for measures of precision of the checks, and in the last evaluation also of their recall.

    The evaluations showed that for terminology in translations from English to Swedish, it was advantageous to match terms from the terminology database using partial matching of words in the source and target segments of the translation. In that way, terms with different inflected forms could be matched without support for language-specific morphology. A cause of many problems in the matching process was the form of the entries in the terminology database, which were more suited to being read by human translators than by a machine.

    Recommendations regarding the introduction of tools for automatic checking of terminology were formulated, based on the results from the interviews and evaluations. Due to factors of uncertainty in the automatic checking, a manual review of its results is motivated. By running the check on a sample that has already been manually checked in other aspects, a reasonable number of results to manually review can be obtained. The quality of the terminology database is crucial for its recall on translations, and in the long run also for the value of using it for automatic checking.
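    A toy sketch of the partial-matching idea the thesis found advantageous (prefix matching so that inflected forms are caught without language-specific morphology); the prefix length and the term-base format are invented for illustration:

        def term_in_segment(term, segment_tokens, min_prefix=5):
            """Match a (possibly multi-word) term against a segment:
            every term word must share a prefix with some segment token."""
            def word_match(tw, st):
                k = min(len(tw), min_prefix)
                return st.lower().startswith(tw.lower()[:k])
            return all(any(word_match(tw, st) for st in segment_tokens)
                       for tw in term.split())

        def check_segment(src_tokens, tgt_tokens, termbase):
            """Alarm when a source term occurs but its target term does not.
            `termbase` is a list of (source_term, target_term) pairs."""
            return [(s, t) for s, t in termbase
                    if term_in_segment(s, src_tokens)
                    and not term_in_segment(t, tgt_tokens)]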

  • 27.
    Eklund, Robert
    Stockholm University, Department of Computational Linguistics, Institute of Linguistics.
    A Probabilistic Tagging Module Based on Surface Pattern Matching (1993). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
    Abstract [en]

    A problem with automatic tagging and lexical analysis is that it is never 100 % accurate. In order to arrive at better figures, one needs to study the character of what is left untagged by automatic taggers. In this paper untagged residue outputted by the automatic analyser SWETWOL (Karlsson 1992) at Helsinki is studied. SWETWOL assigns tags to words in Swedish texts mainly through dictionary lookup. The contents of the untagged residue files are described and discussed, and possible ways of solving different problems are proposed. One method of tagging residual output is proposed and implemented: the left-stripping method, through which untagged words are bereaved of their left-most letters, searched in a dictionary, and if found, tagged according to the information found in the said dictionary. If the stripped word is not found in the dictionary, a match is searched in ending lexica containing statistical information about word classes associated with that particular word form (i.e., final letter cluster, be this a grammatical suffix or not), and the relative frequency of each word class. If a match is found, the word is given graduated tagging according to the statistical information in the ending lexicon. If a match is not found, the word is stripped of what is now its left-most letter and is recursively searched in a dictionary and ending lexica (in that order). The ending lexica employed in this paper are retrieved from a reversed version of Nusvensk Frekvensordbok (Allén 1970), and contain endings of between one and seven letters. The contents of the ending lexica are to a certain degree described and discussed. The programs working according to the principles described are run on files of untagged residual output. Appendices include, among other things, LISP source code, untagged and tagged files, the ending lexica containing one and two letter endings and excerpts from ending lexica containing three to seven letters.
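    A condensed sketch of the left-stripping method as described (dictionary lookup first, then ending lexica from the longest ending down to one letter, then strip the leftmost letter and recurse); the data structures are assumed for illustration, not taken from the paper:

        def left_strip_tag(word, dictionary, ending_lexica):
            """dictionary: word -> tags.
            ending_lexica: ending length n -> {ending: tag distribution}."""
            w = word.lower()
            while w:
                if w in dictionary:
                    return dictionary[w]      # exact tags from the dictionary
                for n in range(min(7, len(w)), 0, -1):
                    ending = w[-n:]
                    if ending in ending_lexica.get(n, {}):
                        # graduated tagging: distribution over word classes
                        return ending_lexica[n][ending]
                w = w[1:]                     # strip the leftmost letter
            return None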

  • 28.
    Eklund, Robert
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Disfluency in Swedish human–human and human–machine travel booking dialogues (2004). Doctoral thesis, monograph (Other academic).
    Abstract [en]

    This thesis studies disfluency in spontaneous Swedish speech, i.e., the occurrence of hesitation phenomena like eh, öh, truncated words, repetitions and repairs, mispronunciations and so on. The thesis is divided into three parts:

    PART I provides the background, concerning scientific, personal and industrial–academic aspects, in the Tuning In quotes and the Preamble and Introduction (chapter 1).

    PART II consists of one chapter only, chapter 2, which dives into the etiology of disfluency. Consequently it describes previous research on disfluencies, also including areas that are not the main focus of the present tome, like stuttering, psychotherapy, philosophy, neurology, discourse perspectives, speech production, application-driven perspectives, cognitive aspects, and so on. A discussion on terminology and definitions is also provided. The goal of this chapter is to provide as broad a picture as possible of the phenomenon of disfluency, and how all those different and varying perspectives are related to each other.

    PART III describes the linguistic data studied and analyzed in this thesis, with the following structure: Chapter 3 describes how the speech data were collected, and for what reason. Sum totals of the data and the post-processing method are also described. Chapter 4 describes how the data were transcribed, annotated and analyzed. The labeling method is described in detail, as is the method employed to do frequency counts. Chapter 5 presents the analysis and results for all different categories of disfluencies. Besides general frequency and distribution of the different types of disfluencies, both inter- and intra-corpus results are presented, as are co-occurrences of different types of disfluencies. Also, inter- and intra-speaker differences are discussed. Chapter 6 discusses the results, mainly in light of previous research. Reasons for the observed frequencies and distribution are proposed, as are their relation to language typology, as well as syntactic, morphological and phonetic reasons for the observed phenomena. Future work is also envisaged: work that is possible on the present data set, work that is possible on the present data set given extended labeling, and work that I think should be carried out, but where the present data set fails, in one way or another, to meet the requirements of such studies.

    Appendices 1–4 list the sum total of all data analyzed in this thesis (apart from Tok Pisin data). Appendix 5 provides an example of a full human–computer dialogue.

  • 29.
    Eklund, Robert
    Stockholms universitet, Stockholm, Sweden.
    En introduktion till programmering i prolog (1996). Other (Other academic).
    Abstract [sv]

    This compendium is intended as a basic introduction to the programming language PROLOG. Since the operative word here is "basic", it should be understood that the compendium makes no claim to satisfy all the desires of a professional hacker. Special consideration has instead been given to people with no prior programming experience whatsoever. This means that people already familiar with other programming languages may find the presentation partly, and in some sense, trivial (and perhaps bordering on incorrect, a danger in all attempts at simplification). It also means that much has been left out, and thus people who already know Prolog may exclaim "But why didn't you include this?!". I have tried to include the things that indispensably form a kind of base for moving on. What has been left out is of course not unimportant, just not required in order to play and have fun with Prolog as a first acquaintance. If you tell everything at the first meeting, there are no secrets left to discover!

  • 30. Fagerlund, Martin
    et al.
    Merkel, Magnus
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Eldén, Lars
    Linköping University, Department of Mathematics, Scientific Computing. Linköping University, The Institute of Technology.
    Ahrenberg, Lars
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Computing Word Senses by Semantic Mirroring and Spectral Graph Partitioning (2010). In: Proceedings of TextGraphs-5 - 2010 Workshop on Graph-based Methods for Natural Language Processing / [ed] Carmen Banea, Alessandro Moschitti, Swapna Somasundaran and Fabio Massimo Zanzotto, Stroudsburg, PA, USA: The Association for Computational Linguistics, 2010, p. 103-107. Conference paper (Refereed).
    Abstract [en]

    Using the technique of "semantic mirroring", a graph is obtained that represents words and their translations from a parallel corpus or a bilingual lexicon. The connectedness of the graph holds information about the different meanings of words that occur in the translations. Spectral graph theory is used to partition the graph, which leads to a grouping of the words according to different senses. We also report results from an evaluation using a small sample of seed words from a lexicon of Swedish and English adjectives.
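    The spectral step can be illustrated with the Fiedler vector of the graph Laplacian: its sign pattern splits the translation graph into two sense groups. A minimal NumPy sketch (the semantic-mirroring graph construction itself is not shown, and real pipelines would recurse or use more eigenvectors):

        import numpy as np

        def spectral_bipartition(adjacency):
            """Partition a graph into two groups using the sign of the
            Fiedler vector (eigenvector of the graph Laplacian with the
            second-smallest eigenvalue)."""
            A = np.asarray(adjacency, dtype=float)
            L = np.diag(A.sum(axis=1)) - A       # unnormalized Laplacian
            eigvals, eigvecs = np.linalg.eigh(L) # ascending eigenvalues
            fiedler = eigvecs[:, 1]
            return fiedler >= 0                  # boolean group membership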

  • 31.
    Fahlborg, Daniel
    et al.
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering.
    Rennes, Evelina
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering. RISE SICS East, Linköping, Sweden.
    Introducing SAPIS – an API Service for Text Analysis and Simplification (2016). Conference paper (Refereed).
    Abstract [en]

    In several projects, we are developing tools and techniques for simplifying and analyzing textual data, aiming to enhance the accessibility of texts. We present SAPIS, an API service by which these techniques can be reached from a remote server. The API currently involves four running services, and is designed for easy implementation of new services. SAPIS aims to reach professional or daily users interested in the simplification and analysis of texts.

  • 32.
    Falkenjack, Johan
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering.
    Towards a Model of General Text Complexity for Swedish (2018). Licentiate thesis, monograph (Other academic).
    Abstract [en]

    In an increasingly networked world, where the amount of written information is growing at a rate never before seen, the ability to read and absorb written information is of utmost importance for anything but a superficial understanding of life's complexities. That is an example of a sentence which is not very easy to read. It can be said to have a relatively high degree of text complexity. Nevertheless, the sentence is also true. It is important to be able to read and understand written materials. While not everyone might have a job where they have to read a lot, access to written material is necessary in order to participate in modern society. Most information, from news reporting to medical information to governmental information, comes primarily in written form.

    But what makes the sentence at the start of this abstract so complex? We can probably all agree that the length is part of it. But then what? Researchers in the field of readability and text complexity analysis have been studying this question for almost 100 years. That research has over time come to include many computational and data-driven methods within the field of computational linguistics.

    This thesis covers some of my contributions to this field of research, though with a main focus on Swedish rather than English text. It aims to explore two primary questions: (1) Which linguistic features are most important when assessing text complexity in Swedish? and (2) How can we deal with the problem of data sparsity with regard to complexity-annotated texts in Swedish?

    The first issue is tackled by exploring the task of identifying easy-to-read ("lättläst") text using classification with Support Vector Machines. A large set of linguistic features is evaluated with regard to predictive performance and is shown to separate easy-to-read texts from regular texts with very high accuracy. Meanwhile, using a genetic algorithm for variable selection, we find that almost the same accuracy can be reached with only 8 features. This implies that this classification problem is not very hard and that the results might not generalize to comparing less easy-to-read texts.

    This, in turn, brings us to the second question. Apart from easy-to-read labeled texts, data with text complexity annotations is very sparse. It consists of multiple small corpora using different scales to label documents. To deal with this problem, we propose a novel statistical model. The model belongs to the larger family of Probit models, is implemented in a Bayesian fashion, and is estimated using a Gibbs sampler that extends a well-established Gibbs sampler for the Ordered Probit model. This model is evaluated using both simulated and real-world readability data with very promising results.

  • 33.
    Falkenjack, Johan
    et al.
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, The Institute of Technology. Santa Anna IT Research Institute AB, Linköping, Sweden.
    Heimann Mühlenbock, Katarina
    Språkbanken, University of Gothenburg, Gothenburg.
    Using the probability of readability to order Swedish texts (2012). In: Proceedings of the Fourth Swedish Language Technology Conference, 2012, p. 27-28. Conference paper (Refereed).
    Abstract [en]

    In this study we present a new approach to ranking readability in Swedish texts based on lexical, morpho-syntactic and syntactic analysis of text as well as machine learning. The basic premise and theory are presented, as well as a small experiment testing the feasibility, but not the actual performance, of the approach. The experiment shows that it is possible to implement a system based on the approach; however, the actual performance of such a system has not been evaluated, as the necessary resources for such an evaluation do not yet exist for Swedish. The experiment also shows that a classifier based on the aforementioned linguistic analysis, on our limited test set, outperforms classifiers based on established metrics used to assess readability such as LIX, OVIX and Nominal Ratio.

  • 34.
    Falkenjack, Johan
    et al.
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, The Institute of Technology. SICS East Swedish ICT AB .
    Heimann Mühlenbock, Katarina
    Göteborgs Universitet.
    Jönsson, Arne
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, The Institute of Technology. SICS East Swedish ICT AB .
    Features indicating readability in Swedish text (2013). In: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013) / [ed] Stephan Oepen, Kristin Hagen, Janne Bondi Johannessen, Linköping, 2013, p. 27-40. Conference paper (Refereed).
    Abstract [en]

    Studies have shown that modern methods of readability assessment, using automated linguistic analysis and machine learning (ML), are a viable road forward for readability classification and ranking. In this paper we present a study of different levels of analysis and a large number of features and how they affect an ML system's accuracy when it comes to readability assessment. We test a large number of features proposed for different languages (mainly English), evaluate their usefulness for readability assessment for Swedish, and compare their performance to that of established metrics. We find that the best performing features are language models based on part-of-speech and dependency type.

  • 35.
    Falkenjack, Johan
    et al.
    Linköping University, Department of Computer and Information Science. Linköping University, The Institute of Technology.
    Jönsson, Arne
    Linköping University, Department of Computer and Information Science. Linköping University, The Institute of Technology.
    Classifying easy-to-read texts without parsing (2014). In: Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), Association for Computational Linguistics, 2014, p. 114-122. Conference paper (Refereed).
    Abstract [en]

    Document classification using automated linguistic analysis and machine learning (ML) has been shown to be a viable road forward for readability assessment. The best models can be trained to decide if a text is easy to read or not with very high accuracy; e.g. a model using 117 parameters from shallow, lexical, morphological and syntactic analyses achieves 98.9% accuracy. In this paper we compare models created by parameter optimization over subsets of that total model to find out to what extent different high-performing models tend to consist of the same parameters, and whether it is possible to find models that only use features not requiring parsing. We used a genetic algorithm to systematically optimize parameter sets of fixed sizes, using the accuracy of a Support Vector Machine classifier as fitness function. Our results show that it is possible to find models almost as good as the currently best models while omitting parsing-based features.
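    A toy sketch of the search procedure (a genetic algorithm over fixed-size feature subsets with cross-validated SVM accuracy as fitness); the population size, number of generations and crossover scheme are invented for illustration, and mutation is omitted:

        import random
        from sklearn.svm import SVC
        from sklearn.model_selection import cross_val_score

        def fitness(mask, X, y):
            """Cross-validated SVM accuracy on the selected feature subset.
            X is a NumPy feature matrix, y the class labels."""
            cols = [i for i, keep in enumerate(mask) if keep]
            return cross_val_score(SVC(), X[:, cols], y, cv=5).mean()

        def ga_select(X, y, size, pop=20, generations=30):
            """Evolve fixed-size feature subsets, keeping the fitter half
            and breeding children by sampling from parent unions."""
            n = X.shape[1]
            def random_mask():
                idx = random.sample(range(n), size)
                return tuple(i in idx for i in range(n))
            population = [random_mask() for _ in range(pop)]
            for _ in range(generations):
                scored = sorted(population, key=lambda m: fitness(m, X, y),
                                reverse=True)
                parents = scored[: pop // 2]
                children = []
                for _ in range(pop - len(parents)):
                    a, b = random.sample(parents, 2)
                    union = [i for i in range(n) if a[i] or b[i]]
                    idx = random.sample(union, size)  # crossover, size kept fixed
                    children.append(tuple(i in idx for i in range(n)))
                population = parents + children
            return max(population, key=lambda m: fitness(m, X, y))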

  • 36.
    Falkenjack, Johan
    et al.
    SICS East Swedish ICT AB.
    Jönsson, Arne
    SICS East Swedish ICT AB.
    Implicit readability ranking using the latent variable of a Bayesian Probit model (2016). In: CL4LC 2016 - Computational Linguistics for Linguistic Complexity: Proceedings of the Workshop, 2016, p. 104-112. Conference paper (Refereed).
    Abstract [en]

    Data-driven approaches to readability analysis for languages other than English have been plagued by a scarcity of suitable corpora. Often, relevant corpora consist only of easy-to-read texts with no rank information or empirical readability scores, making only binary approaches, such as classification, applicable. We propose a Bayesian, latent variable approach to get the most out of these kinds of corpora. In this paper we present results on using such a model for readability ranking. The model is evaluated on a preliminary corpus of ranked student texts with encouraging results. We also assess the model by showing that it performs readability classification on par with a state-of-the-art classifier while at the same time being transparent enough to allow more sophisticated interpretations.
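    The latent-variable trick goes back to Albert and Chib's (1993) Gibbs sampler for Probit models: each document gets a continuous latent score whose posterior draws can be used for ranking. A sketch for the binary case with a flat prior on the coefficients (the paper extends an Ordered Probit sampler, which is more involved):

        import numpy as np
        from scipy.stats import truncnorm

        def probit_gibbs(X, y, n_iter=2000):
            """Binary Probit Gibbs sampler: latent z_i ~ N(x_i'beta, 1),
            truncated positive when y_i = 1 and negative when y_i = 0."""
            n, p = X.shape
            beta = np.zeros(p)
            XtX_inv = np.linalg.inv(X.T @ X)   # flat prior for simplicity
            draws = []
            for _ in range(n_iter):
                mu = X @ beta
                # Standardized truncation bounds for z - mu.
                lo = np.where(y == 1, -mu, -np.inf)
                hi = np.where(y == 1, np.inf, -mu)
                z = mu + truncnorm.rvs(lo, hi)
                # beta | z is Gaussian: N((X'X)^-1 X'z, (X'X)^-1).
                mean = XtX_inv @ X.T @ z
                beta = np.random.multivariate_normal(mean, XtX_inv)
                draws.append(beta.copy())
            return np.array(draws)

    The posterior draws of the latent z (or of X @ beta) give exactly the kind of continuous, implicit readability score the title refers to.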

  • 37.
    Falkenjack, Johan
    et al.
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering.
    Rennes, Evelina
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering.
    Fahlborg, Daniel
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering.
    Johansson, Vida
    Linköping University, Department of Computer and Information Science. Linköping University, Faculty of Arts and Sciences.
    Jönsson, Arne
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Arts and Sciences.
    Services for text simplification and analysis (2017). In: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 2017. Conference paper (Refereed).
  • 38.
    Falkenjack, Johan
    et al.
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering. SICS East Swedish ICT AB, Linköping, Sweden.
    Santini, Marina
    Linköping University, Department of Computer and Information Science, Human-Centered systems. SICS East Swedish ICT AB, Linköping, Sweden.
    Jönsson, Arne
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Arts and Sciences. SICS East Swedish ICT AB, Linköping, Sweden.
    An Exploratory Study on Genre Classification using Readability Features (2016). Conference paper (Other academic).
    Abstract [en]

    We present a preliminary study that explores whether text features used for readability assessment are reliable genre-revealing features. We empirically explore the difference between genre and domain. We carry out two sets of experiments with both supervised and unsupervised methods. Findings on the Swedish national corpus (the SUC) show that readability cues are good indicators of genre variation.

  • 39.
    Fallgren, Per
    Linköping University, Department of Computer and Information Science. Linköping University, Faculty of Arts and Sciences.
    Användning av Self Organizing Maps som en metod att skapa semantiska representationer ur text (2015). Independent thesis, Basic level (degree of Bachelor), 12 credits / 18 HE credits. Student thesis.
    Abstract [sv]

    This study is a cognitive science thesis that aims to create a model that produces semantic representations using a more biologically plausible approach than traditional methods. The model can be seen as a first step in investigating the approach described below. The study examines the assumption that Self Organizing Maps can be used to create semantic representations from large amounts of text, using a distributionally inspired approach. The results indicate a potentially working system, which, however, needs to be examined further in future studies for a higher degree of verification.
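    For readers unfamiliar with the method, a minimal Self Organizing Map training loop looks roughly as follows (a generic sketch, not the model built in the thesis); the inputs would be context vectors derived from text, and all hyperparameters here are placeholders:

        import numpy as np

        def train_som(data, grid=(10, 10), epochs=20, lr0=0.5, radius0=3.0):
            """For each input vector, find the best-matching unit (BMU) and
            pull the BMU and its grid neighbours towards the input, with
            learning rate and neighbourhood radius decaying over time."""
            rows, cols = grid
            weights = np.random.rand(rows, cols, data.shape[1])
            coords = np.dstack(np.mgrid[0:rows, 0:cols])  # unit grid positions
            for epoch in range(epochs):
                lr = lr0 * (1 - epoch / epochs)
                radius = radius0 * (1 - epoch / epochs) + 1e-9
                for x in data:
                    dists = np.linalg.norm(weights - x, axis=2)
                    bmu = np.unravel_index(dists.argmin(), dists.shape)
                    grid_d = np.linalg.norm(coords - np.array(bmu), axis=2)
                    influence = np.exp(-(grid_d ** 2) / (2 * radius ** 2))
                    weights += lr * influence[..., None] * (x - weights)
            return weights  # each input now maps to a cell on the 2D grid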

  • 40.
    Fallgren, Per
    Linköping University, Department of Computer and Information Science.
    Thoughts don't have Colour, do they?: Finding Semantic Categories of Nouns and Adjectives in Text Through Automatic Language Processing (2017). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
    Abstract [en]

    Not all combinations of nouns and adjectives are possible, and some are clearly more frequent than others. With this in mind, this study aims to construct semantic representations of the two parts of speech, based on how they occur with each other. By investigating these ideas via automatic natural language processing paradigms, the study aims to find evidence for a semantic mutuality between nouns and adjectives; this notion suggests that the semantics of a noun can be captured by its corresponding adjectives, and vice versa. Furthermore, a set of proposed categories of adjectives and nouns, based on the ideas of Gärdenfors (2014), is presented that is hypothesized to fall in line with the produced representations. Four evaluation methods were used to analyze the result, ranging from subjective discussion of nearest neighbours in vector space to accuracy computed from manual annotation. The result provided some evidence for the hypothesis, which suggests that further research is of value.

  • 41.
    Fallgren, Per
    et al.
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering.
    Segeblad, Jesper
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering.
    Kuhlmann, Marco
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering.
    Towards a Standard Dataset of Swedish Word Vectors2016In: Proceedings of the Sixth Swedish Language Technology Conference (SLTC), 2016Conference paper (Refereed)
    Abstract [en]

    Word vectors, embeddings of words into a low-dimensional space, have been shown to be useful for a large number of natural language processing tasks. Our goal with this paper is to provide a useful dataset of such vectors for Swedish. To this end, we investigate three standard embedding methods: the continuous bag-of-words and the skip-gram model with negative sampling of Mikolov et al. (2013a), and the global vectors of Pennington et al. (2014). We compare these methods using QVEC-CCA (Tsvetkov et al., 2016), an intrinsic evaluation measure that quantifies the correlation of learned word vectors with external linguistic resources. For this purpose we use SALDO, the Swedish Association Lexicon (Borin et al., 2013). Our experiments show that the continuous bag-of-words model produces vectors that are most highly correlated to SALDO, with the skip-gram model very close behind. Our learned vectors will be provided for download at the paper's website.
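
    The two Mikolov-style models can be trained with, for example, the Gensim library. The sketch below uses the Gensim 4.x API on a two-sentence toy corpus; the hyperparameters are illustrative, not those used in the paper.

        from gensim.models import Word2Vec

        corpus = [["katten", "sitter", "på", "mattan"],
                  ["hunden", "ligger", "på", "golvet"]]

        # sg=0 gives the continuous bag-of-words model, sg=1 the skip-gram
        # model; negative=5 enables negative sampling.
        cbow = Word2Vec(corpus, vector_size=100, window=5, sg=0, negative=5, min_count=1)
        skipgram = Word2Vec(corpus, vector_size=100, window=5, sg=1, negative=5, min_count=1)

        print(cbow.wv["katten"][:5])  # first dimensions of one learned vector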

  • 42.
    Ferrara Boston, Marisa
    et al.
    Department of Linguistics, Cornell University, Ithaca, NY, USA.
    Hale, John
    Department of Linguistics, Cornell University, Ithaca, NY, USA.
    Kuhlmann, Marco
    Uppsala universitet, Institutionen för lingvistik och filologi.
    Dependency Structures Derived from Minimalist Grammars2010In: The Mathematics of Language: 10th and 11th Biennial Conference, MOL 10, Los Angeles, CA, USA, July 28–30, 2007, and MOL 11, Bielefeld, Germany, August 20–21, 2009, Revised Selected Papers, Springer Berlin/Heidelberg, 2010, p. 1-12Conference paper (Refereed)
    Abstract [en]

    This paper provides an interpretation of Minimalist Grammars (Stabler, 1997; Stabler & Keenan, 2003) in terms of dependency structures. Under this interpretation, merge operations derive projective dependency structures, and movement operations create both non-projective and ill-nested structures. This provides a new characterization of the generative capacity of Minimalist Grammars, and makes it possible to discuss the linguistic relevance of non-projectivity and ill-nestedness based on grammars that derive structures with these properties.
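
    To make the projectivity notion concrete, the sketch below tests whether a dependency tree, encoded as a head array, is projective: every arc must dominate all words it spans. The encoding and the toy trees are illustrative assumptions.

        # heads[i] is the head of token i+1; 0 denotes the artificial root.
        def is_projective(heads):
            def dominates(h, d):                 # does h transitively dominate d?
                while d != 0:
                    d = heads[d - 1]
                    if d == h:
                        return True
                return False
            for dep in range(1, len(heads) + 1):
                head = heads[dep - 1]
                lo, hi = sorted((head, dep))
                for k in range(lo + 1, hi):
                    if not dominates(head, k):   # a word inside the span escapes the arc
                        return False
            return True

        print(is_projective([2, 0, 2]))      # simple chain -> True
        print(is_projective([3, 4, 0, 3]))   # crossing arcs -> False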

  • 43.
    Foo, Jody
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Computational Terminology: Exploring Bilingual and Monolingual Term Extraction2012Licentiate thesis, comprehensive summary (Other academic)
    Abstract [en]

    Terminologies are becoming more important to modern-day society as technology and science continue to grow at an accelerating rate in a globalized environment. Agreeing upon which terms should be used to represent which concepts, and how those terms should be translated into different languages, is important if we wish to be able to communicate with as little confusion and misunderstanding as possible.

    Since the 1990s, an increasing amount of terminology research has been devoted to facilitating and augmenting terminology-related tasks by using computers and computational methods. One focus for this research is Automatic Term Extraction (ATE).

    In this compilation thesis, studies on both bilingual and monolingual ATE are presented. First, two publications report on how bilingual ATE using the align-extract approach can be used to extract patent terms. The result in this case was 181,000 manually validated English-Swedish patent terms, which were to be used in a machine translation system for patent documents. A critical component of the method used is the Q-value metric, presented in the third paper, which can be used to rank extracted term candidates (TC) in an order that correlates with TC precision. The use of Machine Learning (ML) in monolingual ATE is the topic of the two final contributions. The first ML-related publication shows that rule-induction-based ML can be used to generate linguistic term selection patterns, and in the second ML-related publication, contrastive n-gram language models are used in conjunction with SVM ML to improve the precision of term candidates selected using linguistic patterns.
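
    The exact Q-value definition is given in the third paper; the sketch below is only a plausible reconstruction of a Q-value-style ranking, scoring a pair by its frequency penalized by how many alternative translations each side participates in. Both the formula and the data are assumptions for illustration.

        from collections import Counter

        pairs = [("term bank", "termbank"), ("term bank", "termbank"),
                 ("term bank", "termdatabas"), ("word alignment", "ordlänkning")]

        pair_freq = Counter(pairs)
        src_fanout = Counter(s for s, _ in set(pairs))  # distinct targets per source
        tgt_fanout = Counter(t for _, t in set(pairs))  # distinct sources per target

        def q_value(s, t):
            return pair_freq[(s, t)] / (src_fanout[s] + tgt_fanout[t])

        for s, t in sorted(set(pairs), key=lambda p: -q_value(*p)):
            print(f"{q_value(s, t):.2f}  {s} -> {t}")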

    List of papers
    1. Computer aided term bank creation and standardization: Building standardized term banks through automated term extraction and advanced editing tools
    2010 (English)In: Terminology in Everyday Life / [ed] Marcel Thelen and Frieda Steurs, John Benjamins Publishing Company , 2010, p. 163-180Chapter in book (Other academic)
    Abstract [en]

    Using a standardized term bank in both authoring and translation processes can facilitate the use of consistent terminology, which in turn minimizes confusion and frustration for the readers. One of the problems of creating a standardized term bank is the time and effort required. Recent developments in term extraction techniques based on word alignment can improve extraction of term candidates when parallel texts are available. The aligned units are processed automatically, but a large quantity of term candidates will still have to be processed by a terminologist to select which candidates should be promoted to standardized terms. To minimize the work needed to process the extracted term candidates, we propose a method based on using efficient editing tools, as well as ranking the extracted set of term candidates by quality. This sorted set of term candidates can then be edited, categorized and filtered in a more effective way. In this paper, the process and methods used to arrive at a standardized term bank are presented and discussed.

    Place, publisher, year, edition, pages
    John Benjamins Publishing Company, 2010
    Series
    Terminology and Lexicography Research and Practice, ISSN 1388-8455 ; 13
    Keywords
    terminology, extraction, term bank, automation
    National Category
    Language Technology (Computational Linguistics) Computer Sciences
    Identifiers
    urn:nbn:se:liu:diva-59842 (URN) 978 90 272 2337 1 (ISBN)
    Available from: 2010-09-27 Created: 2010-09-27 Last updated: 2018-01-12 Bibliographically approved
    2. Automatic Extraction and Manual Validation of Hierarchical Patent Terminology
    2009 (English)In: NORDTERM 16. Ontologier og taksonomier.: Rapport fra NORDTERM 2009 / [ed] B. Nistrup Madsen & H. Erdman Thomsen, Copenhagen, Denmark: Copenhagen Business School Press, 2009, p. 249-262Conference paper, Published paper (Refereed)
    Abstract [en]

    Several methods can be applied to create a set of validated terms from existing documents. In this paper we describe an automatic bilingual term candidate extraction method, and the validation process used to create a hierarchical patent terminology. The process described was used to extract terms from patent texts, commissioned by the Swedish Patent Office with the purpose of using the terms for machine translation. Information on the correct linguistic inflection patterns and hierarchical partitioning of terms based on their use is of utmost importance. The process contains six phases: 1) analysis of the source material and system configuration; 2) term candidate extraction; 3) term candidate filtering and initial linguistic validation; 4) manual validation by domain experts; 5) final linguistic validation; and 6) publishing the validated terms. Input to the extraction process consisted of more than 91,000 patent document pairs in English and Swedish, 565 million words in English and 450 million words in Swedish. The English documents were supplied in EBD SGML format and the Swedish documents were supplied as OCR-processed scans of patent documents. After grammatical and statistical analysis, the documents were word-aligned. Using the word-aligned material, candidate terms were extracted based on linguistic patterns. 750,000 term candidates were extracted and stored in a relational database. The term candidates were processed in 8 months, resulting in 181,000 unique validated term pairs that were exported into several hierarchically organized OLIF files.

    Place, publisher, year, edition, pages
    Copenhagen, Denmark: Copenhagen Business School Press, 2009
    Keywords
    automatic term extraction, computational terminology, patent terminology
    National Category
    Language Technology (Computational Linguistics)
    Identifiers
    urn:nbn:se:liu:diva-75236 (URN) 978-87-994577-0-0 (ISBN)
    Conference
    NORDTERM 2009, København, Danmark 9‐12. juni 2009
    Available from: 2012-02-23 Created: 2012-02-22 Last updated: 2018-01-12 Bibliographically approved
    3. Terminology extraction and term ranking for standardizing term banks
    2007 (English)In: Proceedings of 16th Nordic Conference of Computational Linguistics Nodalida,2007 / [ed] Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit, Tartu, Estonia: University of Tartu , 2007, p. 349-354Conference paper, Published paper (Refereed)
    Abstract [en]

    This paper presents how word alignment techniques could be used for building standardized term banks. It is shown that time and effort could be saved by a relatively simple evaluation metric based on frequency data from term pairs, and source and target distributions inside the alignment results. The proposed Q-value metric is shown to outperform other tested metrics such as Dice's coefficient, and simple pair frequency.

     

    Place, publisher, year, edition, pages
    Tartu, Estonia: University of Tartu, 2007
    Keywords
    terminology extraction, metric, word alignment
    National Category
    Computer Sciences
    Identifiers
    urn:nbn:se:liu:diva-41011 (URN) 54924 (Local ID) 978-9985-4-0513-0 (ISBN) 54924 (Archive number) 54924 (OAI)
    Conference
    NODALIDA 2007, 16th Nordic Conference of Computational Linguistics, 24-26 May 2007, University of Tartu, Estonia
    Available from: 2010-09-29 Created: 2009-10-10 Last updated: 2018-01-13 Bibliographically approved
    4. Using machine learning to perform automatic term recognition
    2010 (English)In: Proceedings of the LREC 2010 Workshop on Methods for automatic acquisition of Language Resources and their evaluation methods / [ed] Núria Bel, Béatrice Daille, Andrejs Vasiljevs, European Language Resources Association, 2010, p. 49-54Conference paper, Published paper (Refereed)
    Abstract [en]

    In this paper a machine learning approach is applied to Automatic Term Recognition (ATR). Similar approaches have been successfully used in Automatic Keyword Extraction (AKE). Using a dataset consisting of Swedish patent texts and validated terms belonging to these texts, unigrams and bigrams are extracted and annotated with linguistic and statistical feature values. Experiments using a varying ratio between positive and negative examples in the training data are conducted using the annotated n-grams. The results indicate that a machine learning approach is viable for ATR. Furthermore, a machine learning approach for bilingual ATR is discussed. Preliminary analysis, however, indicates that some modifications have to be made to apply the monolingual machine learning approach in a bilingual context.

    Place, publisher, year, edition, pages
    European Language Resources Association, 2010
    National Category
    Language Technology (Computational Linguistics)
    Identifiers
    urn:nbn:se:liu:diva-75237 (URN) 000356879501100 () 978-2-9517408-6-0 (ISBN)
    Conference
    LREC 2010 Workshop on Methods for automatic acquisition of Language Resources and their evaluation methods, 23 May 2010, Valletta, Malta
    Available from: 2012-03-01 Created: 2012-02-22 Last updated: 2018-01-12 Bibliographically approved
    5. Exploring termhood using language models
    2011 (English)In: Proceedings of the Workshop CHAT 2011: Creation, Harmonization and Application of Terminology Resources / [ed] Tatiana Gornostay, Andrejs Vasiljevs, Tartu University Library (Estonia): Northern European Association for Language Technology (NEALT) , 2011, p. 32-35Conference paper, Published paper (Refereed)
    Abstract [en]

    Term extraction metrics are mostly based on frequency counts. This can be a problem when trying to extract previously unseen multi-word terms. This paper explores whether smoothed language models can be used instead. Although a simplistic use of language models is examined in this paper, the results indicate that with more refinement, smoothed language models may be used instead of unsmoothed frequency-count based termhood metrics.

    Place, publisher, year, edition, pages
    Tartu University Library (Estonia): Northern European Association for Language Technology (NEALT), 2011
    Series
    NEALT Proceedings Series, ISSN 1736-8197, E-ISSN 1736-6305 ; Vol. 12
    Keywords
    automatic term extraction, computational terminology, machine learning
    National Category
    Language Technology (Computational Linguistics)
    Identifiers
    urn:nbn:se:liu:diva-75238 (URN)
    Conference
    NODALIDA 2011 Workshop Creation, Harmonization and Application of Terminology Resources, May 11, 2011, Riga, Latvia
    Available from: 2012-02-23 Created: 2012-02-22 Last updated: 2018-01-12 Bibliographically approved
  • 44.
    Foo, Jody
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Exploring termhood using language models2011In: Proceedings of the Workshop CHAT 2011: Creation, Harmonization and Application of Terminology Resources / [ed] Tatiana Gornostay, Andrejs Vasiljevs, Tartu University Library (Estonia): Northern European Association for Language Technology (NEALT) , 2011, p. 32-35Conference paper (Refereed)
    Abstract [en]

    Term extraction metrics are mostly based on frequency counts. This can be a problem when trying to extract previously unseen multi-word terms. This paper explores whether smoothed language models can be used instead. Although a simplistic use of language models is examined in this paper, the results indicate that with more refinement, smoothed language models may be used instead of unsmoothed frequency-count based termhood metrics.
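
    A minimal sketch of the idea, assuming add-one smoothing and a toy corpus (the paper's smoothing choices may differ): a smoothed bigram model assigns a usable score even to a previously unseen candidate, which a raw frequency count cannot.

        import math
        from collections import Counter

        corpus = "term extraction uses language models for term ranking".split()
        unigrams = Counter(corpus)
        bigrams = Counter(zip(corpus, corpus[1:]))
        vocab = len(unigrams)

        def smoothed_logprob(candidate):
            score = 0.0
            tokens = candidate.split()
            for w1, w2 in zip(tokens, tokens[1:]):
                # add-one smoothed conditional probability P(w2 | w1)
                score += math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab))
            return score

        print(smoothed_logprob("term extraction"))  # seen bigram
        print(smoothed_logprob("term models"))      # unseen, but still scored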

  • 45.
    Foo, Jody
    et al.
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Merkel, Magnus
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Computer aided term bank creation and standardization: Building standardized term banks through automated term extraction and advanced editing tools2010In: Terminology in Everyday Life / [ed] Marcel Thelen and Frieda Steurs, John Benjamins Publishing Company , 2010, p. 163-180Chapter in book (Other academic)
    Abstract [en]

    Using a standardized term bank in both authoring and translation processes can facilitate the use of consistent terminology, which in turn minimizes confusion and frustration for the readers. One of the problems of creating a standardized term bank is the time and effort required. Recent developments in term extraction techniques based on word alignment can improve extraction of term candidates when parallel texts are available. The aligned units are processed automatically, but a large quantity of term candidates will still have to be processed by a terminologist to select which candidates should be promoted to standardized terms. To minimize the work needed to process the extracted term candidates, we propose a method based on using efficient editing tools, as well as ranking the extracted set of term candidates by quality. This sorted set of term candidates can then be edited, categorized and filtered in a more effective way. In this paper, the process and methods used to arrive at a standardized term bank are presented and discussed.

  • 46.
    Foo, Jody
    et al.
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Merkel, Magnus
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Using machine learning to perform automatic term recognition2010In: Proceedings of the LREC 2010 Workshop on Methods for automatic acquisition of Language Resources and their evaluation methods / [ed] Núria Bel, Béatrice Daille, Andrejs Vasiljevs, European Language Resources Association, 2010, p. 49-54Conference paper (Refereed)
    Abstract [en]

    In this paper a machine learning approach is applied to Automatic Term Recognition (ATR). Similar approaches have been successfully used in Automatic Keyword Extraction (AKE). Using a dataset consisting of Swedish patent texts and validated terms belonging to these texts, unigrams and bigrams are extracted and annotated with linguistic and statistical feature values. Experiments using a varying ratio between positive and negative examples in the training data are conducted using the annotated n-grams. The results indicate that a machine learning approach is viable for ATR. Furthermore, a machine learning approach for bilingual ATR is discussed. Preliminary analysis, however, indicates that some modifications have to be made to apply the monolingual machine learning approach in a bilingual context.
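
    A minimal sketch of such a setup, with invented feature values and logistic regression as a stand-in classifier; the paper's actual features and learner differ.

        from sklearn.linear_model import LogisticRegression

        # each row: [candidate frequency, length in tokens, contains-digit flag]
        X = [[42, 2, 0], [3, 1, 0], [17, 2, 0], [1, 3, 1], [25, 1, 0], [2, 2, 1]]
        y = [1, 0, 1, 0, 1, 0]   # 1 = validated term, 0 = non-term

        clf = LogisticRegression().fit(X, y)
        print(clf.predict([[30, 2, 0]]))  # classify a new n-gram candidate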

  • 47.
    Gavin, Jacob
    et al.
    Linköping University.
    Hammarbäck, Jimmy
    Linköping University.
    Hammarbäck, Madeleine
    Linköping University.
    Helmersson, Benjamin
    Linköping University.
    Nyberg, Martina
    Linköping University.
    Svensson, Cassandra
    Linköping University.
    Foo, Jody
    Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering.
    Koch, Felix-Sebastian
    Linköping University, Department of Behavioural Sciences and Learning, Psychology. Linköping University, Faculty of Arts and Sciences.
    An eye-tracking study on the importance of consistent terminology2014In: Proceedings of the Fifth Swedish Language Technology Conference (SLTC-14), 2014Conference paper (Refereed)
    Abstract [en]

    Using inconsistent terminology, e.g. different terms in documentation than in user interface labels and menu items, is believed to be confusing to users. However, few empirical studies exist on this particular topic. In this paper we show how users' interaction with an interface is affected by inconsistent terminology. An experimental eye-tracking study with 30 participants was conducted in which the participants were shown a user interface and a task description. The terminology in the interface and the task description was manipulated to be either consistent or inconsistent. The results show that terminological inconsistencies led to a significantly higher number of visual fixations, more time needed to perform the task, and more returns to the task description. The conclusion is that inconsistent use of terms creates unnecessary cognitive workload for the user, which can be avoided by ensuring terminological consistency within a system.

  • 48.
    Gorrell, Genevieve
    Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
    Generalized Hebbian Algorithm for Dimensionality Reduction in Natural Language Processing2006Doctoral thesis, monograph (Other academic)
    Abstract [en]

    The current surge of interest in search and comparison tasks in natural language processing has brought with it a focus on vector space approaches and vector space dimensionality reduction techniques. Presenting data as points in hyperspace provides opportunities to use a variety of well-developed tools pertinent to this representation. Dimensionality reduction allows data to be compressed and generalised. Eigen decomposition and related algorithms are one category of approaches to dimensionality reduction, providing a principled way to reduce data dimensionality that has time and again shown itself capable of enabling access to powerful generalisations in the data. Issues with the approach, however, include computational complexity and limitations on the size of dataset that can reasonably be processed in this way. Large datasets are a persistent feature of natural language processing tasks. This thesis focuses on two main questions. Firstly, in what ways can eigen decomposition and related techniques be extended to larger datasets? Secondly, this having been achieved, of what value is the resulting approach to information retrieval and to statistical language modelling at the n-gram level? The applicability of eigen decomposition is shown to be extendable through the use of an extant algorithm, the Generalized Hebbian Algorithm (GHA), and the novel extension of this algorithm to paired data, the Asymmetric Generalized Hebbian Algorithm (AGHA). Several original extensions to these algorithms are also presented, improving their applicability in various domains. The applicability of GHA to Latent Semantic Analysis-style tasks is investigated. Finally, AGHA is used to investigate the value of singular value decomposition, an eigen decomposition variant, to n-gram language modelling. A sizeable perplexity reduction is demonstrated.
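
    The core GHA update is Sanger's rule. The sketch below runs it on random data to recover a set of orthonormal leading components; the dimensions, learning rate, and data are illustrative assumptions, not the thesis's experimental setup.

        import numpy as np

        rng = np.random.default_rng(0)
        dim, k, lr = 20, 3, 0.01
        W = rng.normal(size=(k, dim)) * 0.1   # k component estimates

        for _ in range(5000):
            x = rng.normal(size=dim)          # stand-in for one data vector
            y = W @ x
            # Sanger's rule: Hebbian term minus projections onto earlier components
            W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

        print(np.round(W @ W.T, 2))           # approximately the identity matrix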

  • 49.
    Gustavsson, Pär
    et al.
    Linköping University, Department of Computer and Information Science. Linköping University, The Institute of Technology.
    Jönsson, Arne
    Linköping University, Department of Computer and Information Science. Linköping University, The Institute of Technology.
    Text Summarization using Random Indexing and PageRank2010In: Proceedings of the Third Swedish Language Technology Conference (SLTC-2010), 2010Conference paper (Refereed)
  • 50.
    Gómez-Rodriguez, Carlos
    et al.
    Departamento de Computación, Universidade da Coruña, A Coruña, Spain.
    Kuhlmann, Marco
    Uppsala universitet, Institutionen för lingvistik och filologi.
    Satta, Giorgio
    Department of Information Engineering, University of Padua, Padua, Italy.
    Weir, David
    Department of Informatics, University of Sussex, East Sussex, United Kingdom.
    Optimal Reduction of Rule Length in Linear Context-Free Rewriting Systems2009In: Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Stroudsburg, PA, USA: Association for Computational Linguistics, 2009, p. 539-547Conference paper (Other academic)
    Abstract [en]

    Linear Context-free Rewriting Systems (LCFRS) is an expressive grammar formalism with applications in syntax-based machine translation. The parsing complexity of an LCFRS is exponential both in the rank of a production, defined as the number of nonterminals on its right-hand side, and in a measure for the discontinuity of a phrase, called fan-out. In this paper, we present an algorithm that transforms an LCFRS into a strongly equivalent form in which all productions have rank at most 2 and fan-out is minimal. Our results generalize previous work on Synchronous Context-Free Grammar, and are particularly relevant for machine translation from or to languages that require syntactic analyses with discontinuous constituents.
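
    As an illustration of the simpler context-free special case (the paper's LCFRS construction must additionally track and minimize fan-out), a production of rank r > 2 can be binarized into a chain of rank-2 productions by introducing fresh nonterminals:

        def binarize(lhs, rhs):
            rules, count = [], 0
            while len(rhs) > 2:
                count += 1
                fresh = f"{lhs}_{count}"      # fresh intermediate nonterminal
                rules.append((lhs if count == 1 else f"{lhs}_{count - 1}",
                              [rhs[0], fresh]))
                rhs = rhs[1:]
            rules.append((f"{lhs}_{count}" if count else lhs, list(rhs)))
            return rules

        # S -> A B C D becomes S -> A S_1, S_1 -> B S_2, S_2 -> C D
        for rule in binarize("S", ["A", "B", "C", "D"]):
            print(rule)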
