Stymne, Sara
Publications (10 of 28)
Stymne, S., Cancedda, N. & Ahrenberg, L. (2013). Generation of Compound Words in Statistical Machine Translation into Compounding Languages. Computational linguistics - Association for Computational Linguistics (Print), 39(4), 1067-1108
2013 (English). In: Computational linguistics - Association for Computational Linguistics (Print), ISSN 0891-2017, E-ISSN 1530-9312, Vol. 39, no 4, p. 1067-1108. Article in journal (Refereed), Published
Abstract [en]

In this article we investigate statistical machine translation (SMT) into Germanic languages, with a focus on compound processing. Our main goal is to enable the generation of novel compounds that have not been seen in the training data. We adopt a split-merge strategy, where compounds are split before training the SMT system, and merged after the translation step. This approach reduces sparsity in the training data, but runs the risk of placing translations of compound parts in non-consecutive positions. It also requires a postprocessing step of compound merging, where compounds are reconstructed in the translation output. We present a method for increasing the chances that components that should be merged are translated into contiguous positions and in the right order and show that it can lead to improvements both by direct inspection and in terms of standard translation evaluation metrics. We also propose several new methods for compound merging, based on heuristics and machine learning, which outperform previously suggested algorithms. These methods can produce novel compounds and a translation with at least the same overall quality as the baseline. For all subtasks we show that it is useful to include part-of-speech based information in the translation process, in order to handle compounds.

Place, publisher, year, edition, pages
MIT Press, 2013
Keywords
Machine translation, compound words, compounding languages
National Category
Language Technology (Computational Linguistics); General Language Studies and Linguistics; Computer Sciences
Identifiers
urn:nbn:se:liu:diva-76689 (URN); 10.1162/COLI_a_00162 (DOI); 000327124700008 ()
Available from: 2012-04-16 Created: 2012-04-16 Last updated: 2018-01-12
Holmqvist, M., Stymne, S., Ahrenberg, L. & Merkel, M. (2012). Alignment-based reordering for SMT. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). Paper presented at The Eighth International Conference on Language Resources and Evaluation (LREC'12), May 2012, Istanbul, Turkey (pp. 3436-3440).
2012 (English). In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012, p. 3436-3440. Conference paper, Published paper (Other academic)
Abstract [en]

We present a method for improving word alignment quality for phrase-based statistical machine translation by reordering the source text according to the target word order suggested by an initial word alignment. The reordered text is used to create a second word alignment which can be an improvement of the first alignment, since the word order is more similar. The method requires no other pre-processing such as part-of-speech tagging or parsing. We report improved Bleu scores for English-to-German and English-to-Swedish translation. We also examined the effect on word alignment quality and found that the reordering method increased recall while lowering precision, which partly can explain the improved Bleu scores. A manual evaluation of the translation output was also performed to understand what effect our reordering method has on the translation system. We found that where the system employing reordering differed from the baseline in terms of having more words, or a different word order, this generally led to an improvement in translation quality.

Keywords
Machine translation, statistical machine translation, word alignment, reordering
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:liu:diva-80355 (URN)
Conference
The Eighth International Conference on Language Resources and Evaluation (LREC'12), May 2012, Istanbul, Turkey
Available from: 2012-08-23 Created: 2012-08-23 Last updated: 2018-01-12
Stymne, S. (2012). Clustered Word Classes for Preordering in Statistical Machine Translation. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Paper presented at The 13th Conference of the European Chapter of the Association for Computational Linguistics, April 24, Avignon, France (pp. 28-34). Association for Computational Linguistics
2012 (English). In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2012, p. 28-34. Conference paper, Published paper (Refereed)
Abstract [en]

Clustered word classes have been used in connection with statistical machine translation, for instance for improving word alignments. In this work we investigate if clustered word classes can be used in a preordering strategy, where the source language is reordered prior to training and translation. Part-of-speech tagging has previously been successfully used for learning reordering rules that can be applied before training and translation. We show that we can use word clusters for learning rules, and significantly improve on a baseline with only slightly worse performance than for standard POS-tags on an English–German translation task. We also show the usefulness of the approach for the less-resourced language Haitian Creole, for translation into English, where the suggested approach is significantly better than the baseline.

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2012
Keywords
Statistical machine translation, reordering, clustering, unsupervised learning
National Category
Language Technology (Computational Linguistics); Computer Sciences
Identifiers
urn:nbn:se:liu:diva-76706 (URN)
Conference
The 13th Conference of the European Chapter of the Association for Computational Linguistics, April 24, Avignon, France
Available from: 2012-04-17 Created: 2012-04-17 Last updated: 2018-01-12. Bibliographically approved
Stymne, S., Danielsson, H., Bremin, S., Hu, H., Karlsson, J., Prytz Lillkull, A. & Wester, M. (2012). Eye Tracking as a Tool for Machine Translation Error Analysis. In: Proceedings of the eighth international conference on Language Resources and Evaluation (LREC). Paper presented at The eighth international conference on Language Resources and Evaluation (LREC), 21-27 May 2012, Istanbul.
2012 (English). In: Proceedings of the eighth international conference on Language Resources and Evaluation (LREC), 2012. Conference paper, Published paper (Other academic)
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:liu:diva-78179 (URN)
Conference
The eighth international conference on Language Resources and Evaluation (LREC), 21-27 May 2012, Istanbul
Available from: 2012-06-07 Created: 2012-06-07 Last updated: 2018-01-12
Stymne, S. & Ahrenberg, L. (2012). On the practice of error analysis for machine translation evaluation. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). Paper presented at The Eighth International Conference on Language Resources and Evaluation (LREC'12) (pp. 1786-1790). European Language Resources Association
2012 (English). In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), European Language Resources Association, 2012, p. 1786-1790. Conference paper, Published paper (Refereed)
Abstract [en]

Error analysis is a means to assess machine translation output in qualitative terms, which can be used as a basis for the generation of error profiles for different systems. As for other subjective approaches to evaluation it runs the risk of low inter-annotator agreement, but very often in papers applying error analysis to MT, this aspect is not even discussed. In this paper, we report results from a comparative evaluation of two systems where agreement initially was low, and discuss the different ways we used to improve it. We compared the effects of using more or less fine-grained taxonomies, and the possibility to restrict analysis to short sentences only. We report results on inter-annotator agreement before and after measures were taken, on error categories that are most likely to be confused, and on the possibility to establish error profiles also in the absence of a high inter-annotator agreement.

Place, publisher, year, edition, pages
European Language Resources Association, 2012
Keywords
Machine translation, statistical machine translation, error analysis, inter-annotator agreement
National Category
Language Technology (Computational Linguistics); General Language Studies and Linguistics; Computer Sciences
Identifiers
urn:nbn:se:liu:diva-80353 (URN); 000323927701143 (); 978-2-9517408-7-7 (ISBN)
Conference
The Eighth International Conference on Language Resources and Evaluation (LREC'12)
Available from: 2012-08-23 Created: 2012-08-23 Last updated: 2018-01-12
Stymne, S. (2012). Text Harmonization Strategies for Phrase-Based Statistical Machine Translation. (Doctoral dissertation). Linköping: Linköping University Electronic Press
2012 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In this thesis I aim to improve phrase-based statistical machine translation (PBSMT) in a number of ways by the use of text harmonization strategies. PBSMT systems are built by training statistical models on large corpora of human translations. This architecture generally performs well for languages with similar structure. If the languages differ, for example with respect to word order or morphological complexity, the standard methods tend not to work well. I address this problem through text harmonization, by making texts more similar before training and applying a PBSMT system.

I investigate how text harmonization can be used to improve PBSMT with a focus on four areas: compounding, definiteness, word order, and unknown words. For the first three areas, the focus is on linguistic differences between languages, which I address by applying transformation rules, using either rule-based or machine learning-based techniques, to the source or target data. For the last area, unknown words, I harmonize the translation input to the training data by replacing unknown words with known alternatives.

I show that translation into languages with closed compounds can be improved by splitting and merging compounds. I develop new merging algorithms that outperform previously suggested algorithms and show how part-of-speech tags can be used to improve the order of compound parts. Scandinavian definite noun phrases are identified as a problem for PBSMT in translation into Scandinavian languages and I propose a preprocessing approach that addresses this problem and gives large improvements over a baseline. Several previous proposals for how to handle differences in reordering exist; I propose two types of extensions, iterating reordering and word alignment and using automatically induced word classes, which allow these methods to be used for less-resourced languages. Finally I identify several ways of replacing unknown words in the translation input, most notably a spell checking-inspired algorithm, which can be trained using character-based PBSMT techniques.

Overall I present several approaches for extending PBSMT by the use of pre- and postprocessing techniques for text harmonization, and show experimentally that these methods work. Text harmonization methods are an efficient way to improve statistical machine translation within the phrase-based approach, without resorting to more complex models.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2012. p. 95
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1451
Keywords
Statistical machine translation, text harmonization, compound words, definiteness, reordering, unknown words
National Category
Language Technology (Computational Linguistics); Computer Sciences; General Language Studies and Linguistics
Identifiers
urn:nbn:se:liu:diva-76766 (URN); 978-91-7519-887-3 (ISBN)
Public defence
2012-06-11, Visionen, Hus B, Campus Valla, Linköpings universitet, Linköping, 13:15 (English)
Available from: 2012-05-14 Created: 2012-04-19 Last updated: 2018-01-12. Bibliographically approved
Stymne, S. (2011). Blast: A Tool for Error Analysis of Machine Translation Output. In: Sadao Kurohashi (Ed.), Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, system demonstrations. Paper presented at The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, system demonstrations (pp. 56-61). Association for Computational Linguistics
2011 (English). In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, system demonstrations / [ed] Sadao Kurohashi, Association for Computational Linguistics, 2011, p. 56-61. Conference paper, Published paper (Refereed)
Abstract [en]

We present BLAST, an open source tool for error analysis of machine translation (MT) output. We believe that error analysis, i.e., to identify and classify MT errors, should be an integral part of MT development, since it gives a qualitative view, which is not obtained by standard evaluation methods. BLAST can aid MT researchers and users in this process, by providing an easy-to-use graphical user interface. It is designed to be flexible, and can be used with any MT system, language pair, and error typology. The annotation task can be aided by highlighting similarities with a reference translation.

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2011
Keywords
Machine translation, error analysis, MT evaluation
National Category
Language Technology (Computational Linguistics); Computer Sciences
Identifiers
urn:nbn:se:liu:diva-70125 (URN); 9781932432909 (ISBN)
Conference
The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, system demonstrations
Available from: 2011-08-19 Created: 2011-08-19 Last updated: 2018-01-12
Stymne, S. (2011). Definite Noun Phrases in Statistical Machine Translation into Scandinavian Languages. In: Mikel L. Forcada, Heidi Depraetere, Vincent Vandeghinste (Ed.), Proceedings of the 15th conference of the European Association for Machine Translation (EAMT 2011). Paper presented at EAMT-2011: the 15th Annual Conference of the European Association for Machine Translation, 30-31 May 2011, Leuven, Belgium (pp. 289-296).
2011 (English). In: Proceedings of the 15th conference of the European Association for Machine Translation (EAMT 2011) / [ed] Mikel L. Forcada, Heidi Depraetere, Vincent Vandeghinste, 2011, p. 289-296. Conference paper, Published paper (Refereed)
Abstract [en]

The Scandinavian languages have an unusual structure of definite noun phrases (NPs), with a noun suffix as one possibility of expressing definiteness, which is problematic for statistical machine translation from languages with different NP structures. We show that translation can be improved by simple source side transformations of definite NPs, for translation from English and Italian, into Danish, Swedish, and Norwegian, with small adjustments of the preprocessing strategy, depending on the language pair. We also explored target side transformations, with mixed results.

Keywords
Machine translation, definiteness, Scandinavian languages
National Category
Language Technology (Computational Linguistics); Computer Sciences
Identifiers
urn:nbn:se:liu:diva-70123 (URN)
Conference
EAMT-2011: the 15th Annual Conference of the European Association for Machine Translation, 30-31 May 2011, Leuven, Belgium
Available from: 2011-08-19 Created: 2011-08-19 Last updated: 2018-01-12. Bibliographically approved
Holmqvist, M., Stymne, S. & Ahrenberg, L. (2011). Experiments with word alignment, normalization and clause reordering for SMT between English and German. In: Chris Callison-Burch, Philipp Koehn, Christof Monz, Omar F. Zaidan (Ed.), Proceedings of the Sixth Workshop on Statistical Machine Translation (WMT 2011). Paper presented at The Sixth Workshop on Statistical Machine Translation (WMT 2011) (pp. 393-398).
2011 (English). In: Proceedings of the Sixth Workshop on Statistical Machine Translation (WMT 2011) / [ed] Chris Callison-Burch, Philipp Koehn, Christof Monz, Omar F. Zaidan, 2011, p. 393-398. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents the LIU system for the WMT 2011 shared task for translation between German and English. For English–German we attempted to improve the translation tables with a combination of standard statistical word alignments and phrase-based word alignments. For German–English translation we tried to make the German text more similar to the English text by normalizing German morphology and performing rule-based clause reordering of the German text. This resulted in small improvements for both translation directions.

Keywords
Machine translation, word alignment, reordering, normalization
National Category
Language Technology (Computational Linguistics); Computer Sciences
Identifiers
urn:nbn:se:liu:diva-70129 (URN)
Conference
The Sixth Workshop on Statistical Machine Translation (WMT 2011)
Available from: 2011-08-19 Created: 2011-08-19 Last updated: 2018-01-12
Stymne, S. (2011). Iterative reordering and word alignment for statistical MT. In: Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa (Ed.), Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011). Paper presented at The 18th Nordic Conference of Computational Linguistics, May 11–13, Riga, Latvia (pp. 315-318).
2011 (English). In: Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011) / [ed] Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa, 2011, p. 315-318. Conference paper, Published paper (Refereed)
Abstract [en]

Word alignment is necessary for statistical machine translation (SMT), and reordering as a preprocessing step has been shown to improve SMT for many language pairs. In this initial study we investigate if both word alignment and reordering can be improved by iterating these two steps, since they both depend on each other. Overall no consistent improvements were seen on the translation task, but the reordering rules contain different information in the different iterations, leading us to believe that the iterative strategy can be useful.

Series
NEALT Proceedings Series, ISSN 1736-6305 ; 11
Keywords
Machine translation, reordering
National Category
Language Technology (Computational Linguistics); Computer Sciences
Identifiers
urn:nbn:se:liu:diva-70122 (URN)
Conference
The 18th Nordic Conference of Computational Linguistics, May 11–13, Riga, Latvia
Available from: 2011-08-19 Created: 2011-08-19 Last updated: 2018-01-12. Bibliographically approved