Text Harmonization Strategies for Phrase-Based Statistical Machine Translation
Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
2012 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In this thesis I aim to improve phrase-based statistical machine translation (PBSMT) in a number of ways by the use of text harmonization strategies. PBSMT systems are built by training statistical models on large corpora of human translations. This architecture generally performs well for languages with similar structure. However, if the languages differ, for example with respect to word order or morphological complexity, the standard methods tend not to work well. I address this problem through text harmonization, by making texts more similar before training and applying a PBSMT system.

I investigate how text harmonization can be used to improve PBSMT with a focus on four areas: compounding, definiteness, word order, and unknown words. For the first three areas, the focus is on linguistic differences between languages, which I address by applying transformation rules, using either rule-based or machine learning-based techniques, to the source or target data. For the last area, unknown words, I harmonize the translation input to the training data by replacing unknown words with known alternatives.

I show that translation into languages with closed compounds can be improved by splitting and merging compounds. I develop new merging algorithms that outperform previously suggested algorithms and show how part-of-speech tags can be used to improve the order of compound parts. Scandinavian definite noun phrases are identified as a problem for PBSMT in translation into Scandinavian languages, and I propose a preprocessing approach that addresses this problem and gives large improvements over a baseline. Several previous proposals for how to handle differences in reordering exist; I propose two types of extensions, iterating reordering and word alignment and using automatically induced word classes, which allow these methods to be used for less-resourced languages. Finally, I identify several ways of replacing unknown words in the translation input, most notably a spell checking-inspired algorithm, which can be trained using character-based PBSMT techniques.

Overall I present several approaches for extending PBSMT by the use of pre- and postprocessing techniques for text harmonization, and show experimentally that these methods work. Text harmonization methods are an efficient way to improve statistical machine translation within the phrase-based approach, without resorting to more complex models.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2012. 95 p.
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1451
Keyword [en]
Statistical machine translation, text harmonization, compound words, definiteness, reordering, unknown words
National Category
Language Technology (Computational Linguistics) Computer Science General Language Studies and Linguistics
Identifiers
URN: urn:nbn:se:liu:diva-76766; ISBN: 978-91-7519-887-3 (print); OAI: oai:DiVA.org:liu-76766; DiVA: diva2:516686
Public defence
2012-06-11, Visionen, Hus B, Campus Valla, Linköpings universitet, Linköping, 13:15 (English)
Available from: 2012-05-14 Created: 2012-04-19 Last updated: 2012-05-15 Bibliographically approved
List of papers
1. Generation of Compound Words in Statistical Machine Translation into Compounding Languages
2013 (English). In: Computational linguistics - Association for Computational Linguistics (Print), ISSN 0891-2017, E-ISSN 1530-9312, Vol. 39, no 4, 1067-1108 p. Article in journal (Refereed), Published
Abstract [en]

In this article we investigate statistical machine translation (SMT) into Germanic languages, with a focus on compound processing. Our main goal is to enable the generation of novel compounds that have not been seen in the training data. We adopt a split-merge strategy, where compounds are split before training the SMT system, and merged after the translation step. This approach reduces sparsity in the training data, but runs the risk of placing translations of compound parts in non-consecutive positions. It also requires a postprocessing step of compound merging, where compounds are reconstructed in the translation output. We present a method for increasing the chances that components that should be merged are translated into contiguous positions and in the right order and show that it can lead to improvements both by direct inspection and in terms of standard translation evaluation metrics. We also propose several new methods for compound merging, based on heuristics and machine learning, which outperform previously suggested algorithms. These methods can produce novel compounds and a translation with at least the same overall quality as the baseline. For all subtasks we show that it is useful to include part-of-speech based information in the translation process, in order to handle compounds.
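
The split-merge strategy described above can be illustrated with a minimal sketch. It is not the article's algorithm: the marker symbol `@@`, the two-part vocabulary-based split, and the function names are all assumptions made for illustration, and real compounds often involve linking elements and more than two parts, which this sketch ignores.

```python
MARK = "@@"  # hypothetical marker placed on non-final compound parts


def split_token(token, vocab, min_len=3):
    """Split an unknown token into two known parts, marking the first part."""
    for i in range(min_len, len(token) - min_len + 1):
        head, tail = token[:i], token[i:]
        if head in vocab and tail in vocab:
            return [head + MARK, tail]
    return [token]  # no split found: keep the token unchanged


def split_sentence(tokens, vocab):
    """Split every out-of-vocabulary token before training or translation."""
    out = []
    for token in tokens:
        out.extend([token] if token in vocab else split_token(token, vocab))
    return out


def merge_sentence(tokens):
    """Postprocessing: glue each marked part onto the token that follows it."""
    out, buffer = [], ""
    for token in tokens:
        if token.endswith(MARK):
            buffer += token[: -len(MARK)]
        else:
            out.append(buffer + token)
            buffer = ""
    if buffer:  # a dangling marked part at the end of the sentence
        out.append(buffer)
    return out


vocab = {"die", "haus", "tür"}  # toy lowercased vocabulary
split = split_sentence(["die", "haustür"], vocab)
merged = merge_sentence(split)
```

Because merging only looks at the marker, parts that the decoder outputs next to each other can form a compound that never occurred in the training data, which is the property the abstract highlights.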

Place, publisher, year, edition, pages
MIT Press, 2013
Keyword
Machine translation, compound words, compounding languages
National Category
Language Technology (Computational Linguistics) General Language Studies and Linguistics Computer Science
Identifiers
urn:nbn:se:liu:diva-76689 (URN); 10.1162/COLI_a_00162 (DOI); 000327124700008 ()
Available from: 2012-04-16 Created: 2012-04-16 Last updated: 2017-12-07
2. Definite Noun Phrases in Statistical Machine Translation into Danish
2009 (English). In: Proceedings of the Workshop on Extracting and Using Constructions in NLP / [ed] Magnus Sahlgren and Ola Knutsson, 2009, 4-9 p. Conference paper, Published paper (Refereed)
Abstract [en]

There are two ways to express definiteness in Danish, which makes it problematic for statistical machine translation (SMT) from English, since the wrong realisation can be chosen. We present a part-of-speech-based method for identifying and transforming English definite NPs that would likely be expressed in a different way in Danish. The transformed English is used for training a phrase-based SMT system. This technique gives significant improvements in translation quality, of up to 22.1% relative on Bleu, compared to a baseline trained on original English, in two different domains.
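
As an illustration of what a part-of-speech-based source-side transformation could look like, the sketch below rewrites English "the + noun" into a single marked token, mirroring the Danish suffixed article (as in "huset"), and leaves NPs with a premodifier untouched, since Danish then uses a free-standing article (as in "det store hus"). The marker `#def`, the Penn Treebank style tags, and the single rule are assumptions for illustration; the paper's actual rules are not given in the abstract.

```python
DEF_MARK = "#def"  # hypothetical marker standing in for the suffixed article


def transform_definite_nps(tagged):
    """Rewrite 'the NOUN' as 'NOUN#def' in a list of (word, pos) pairs.

    Assumes Penn Treebank style tags: DT for determiners, NN/NNS for nouns.
    """
    out = []
    i = 0
    while i < len(tagged):
        word, pos = tagged[i]
        nxt = tagged[i + 1] if i + 1 < len(tagged) else None
        if (word.lower() == "the" and pos == "DT"
                and nxt is not None and nxt[1] in ("NN", "NNS")):
            out.append(nxt[0] + DEF_MARK)  # drop the article, mark the noun
            i += 2
        else:
            out.append(word)
            i += 1
    return out


bare = transform_definite_nps(
    [("the", "DT"), ("house", "NN"), ("is", "VBZ"), ("old", "JJ")])
modified = transform_definite_nps(
    [("the", "DT"), ("old", "JJ"), ("house", "NN")])
```

The transformed English would then replace the original English side of the parallel corpus for training, and the same transformation would be applied to the input at translation time.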

Series
SICS Technical Report, ISSN 1100-3154 ; T2009:10
Keyword
Statistical machine translation, definiteness, nouns, Scandinavian languages
National Category
Language Technology (Computational Linguistics) Computer Science
Identifiers
urn:nbn:se:liu:diva-53955 (URN)
Conference
Workshop on Extracting and Using Constructions in NLP, May 14, Odense, Denmark
Available from: 2010-02-15 Created: 2010-02-15 Last updated: 2012-05-14 Bibliographically approved
3. Definite Noun Phrases in Statistical Machine Translation into Scandinavian Languages.
2011 (English). In: Proceedings of the 15th conference of the European Association for Machine Translation (EAMT 2011) / [ed] Mikel L. Forcada, Heidi Depraetere, Vincent Vandeghinste, 2011, 289-296 p. Conference paper, Published paper (Refereed)
Abstract [en]

The Scandinavian languages have an unusual structure of definite noun phrases (NPs), with a noun suffix as one possibility of expressing definiteness, which is problematic for statistical machine translation from languages with different NP structures. We show that translation can be improved by simple source side transformations of definite NPs, for translation from English and Italian, into Danish, Swedish, and Norwegian, with small adjustments of the preprocessing strategy, depending on the language pair. We also explored target side transformations, with mixed results.

Keyword
Machine translation, definiteness, Scandinavian languages
National Category
Language Technology (Computational Linguistics) Computer Science
Identifiers
urn:nbn:se:liu:diva-70123 (URN)
Conference
EAMT-2011: the 15th Annual Conference of the European Association for Machine Translation, 30-31 May 2011, Leuven, Belgium
Available from: 2011-08-19 Created: 2011-08-19 Last updated: 2012-05-14 Bibliographically approved
4. Iterative reordering and word alignment for statistical MT
2011 (English). In: Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011) / [ed] Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa, 2011, 315-318 p. Conference paper, Published paper (Refereed)
Abstract [en]

Word alignment is necessary for statistical machine translation (SMT), and reordering as a preprocessing step has been shown to improve SMT for many language pairs. In this initial study we investigate if both word alignment and reordering can be improved by iterating these two steps, since they both depend on each other. Overall no consistent improvements were seen on the translation task, but the reordering rules contain different information in the different iterations, leading us to believe that the iterative strategy can be useful.

Series
NEALT Proceedings Series, ISSN 1736-6305 ; 11
Keyword
Machine translation, reordering
National Category
Language Technology (Computational Linguistics) Computer Science
Identifiers
urn:nbn:se:liu:diva-70122 (URN)
Conference
The 18th Nordic Conference of Computational Linguistics, May 11–13, Riga, Latvia
Available from: 2011-08-19 Created: 2011-08-19 Last updated: 2013-07-19 Bibliographically approved
5. Clustered Word Classes for Preordering in Statistical Machine Translation
2012 (English). In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2012, 28-34 p. Conference paper, Published paper (Refereed)
Abstract [en]

Clustered word classes have been used in connection with statistical machine translation, for instance for improving word alignments. In this work we investigate if clustered word classes can be used in a preordering strategy, where the source language is reordered prior to training and translation. Part-of-speech tagging has previously been successfully used for learning reordering rules that can be applied before training and translation. We show that we can use word clusters for learning rules, and significantly improve on a baseline with only slightly worse performance than for standard POS-tags on an English–German translation task. We also show the usefulness of the approach for the less-resourced language Haitian Creole, for translation into English, where the suggested approach is significantly better than the baseline.

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2012
Keyword
Statistical machine translation, reordering, clustering, unsupervised learning
National Category
Language Technology (Computational Linguistics) Computer Science
Identifiers
urn:nbn:se:liu:diva-76706 (URN)
Conference
The 13th Conference of the European Chapter of the Association for Computational Linguistics, April 24, Avignon, France
Available from: 2012-04-17 Created: 2012-04-17 Last updated: 2016-08-22 Bibliographically approved
6. Vs and OOVs: Two Problems for Translation between German and English
2010 (English). In: Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR (WMT'10), 2010, 183-188 p. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper we report on experiments with three preprocessing strategies for improving translation output in a statistical MT system. In training, two reordering strategies were studied: (i) reorder on the basis of the alignments from Giza++, and (ii) reorder by moving all verbs to the end of segments. In translation, out-of-vocabulary words were preprocessed in a knowledge-lite fashion to identify a likely equivalent. All three strategies were implemented for our English-German systems submitted to the WMT10 shared task. Combining them led to improvements in both language directions.
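
Reordering strategy (ii) is simple enough to sketch. The version below is only an illustration under stated assumptions: the input is already part-of-speech tagged, verbs are recognised by tags starting with "V" (Penn Treebank style), and segments are approximated as stretches between punctuation tokens. How the paper defines segments is not stated in the abstract, and the function name is hypothetical.

```python
PUNCT = frozenset({",", ".", ";", ":", "!", "?"})  # assumed segment boundaries


def move_verbs_to_end(tagged, boundaries=PUNCT):
    """Move all verbs to the end of each segment, keeping relative order.

    `tagged` is a list of (word, pos) pairs.
    """
    out, others, verbs = [], [], []
    for word, pos in tagged:
        if word in boundaries:
            out.extend(others + verbs)  # flush the finished segment
            out.append(word)
            others, verbs = [], []
        elif pos.startswith("V"):
            verbs.append(word)
        else:
            others.append(word)
    out.extend(others + verbs)  # flush a final segment without punctuation
    return out


reordered = move_verbs_to_end([
    ("I", "PRP"), ("have", "VBP"), ("seen", "VBN"),
    ("the", "DT"), ("film", "NN"), (".", "."),
])
```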

Keyword
Machine translation, reordering, Out-of-vocabulary words
National Category
Language Technology (Computational Linguistics) Computer Science
Identifiers
urn:nbn:se:liu:diva-58979 (URN); 978-1-932432-71-8 (ISBN); 1-932432-71-X (ISBN)
Conference
The Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, 15-16 July 2010, Uppsala, Sweden
Available from: 2010-09-03 Created: 2010-09-03 Last updated: 2012-05-14 Bibliographically approved
7. Spell Checking Techniques for Replacement of Unknown Words and Data Cleaning for Haitian Creole SMS Translation
2011 (English). In: Proceedings of the Sixth Workshop on Statistical Machine Translation (WMT 2011) / [ed] Chris Callison-Burch, Philipp Koehn, Christof Monz, Omar F. Zaidan, Stroudsburg: Association for Computational Linguistics, 2011, 470-477 p. Conference paper, Published paper (Refereed)
Abstract [en]

We report results on translation of SMS messages from Haitian Creole to English. We show improvements by applying spell checking techniques to unknown words and creating a lattice with the best known spelling equivalents. We also used a small cleaned corpus to train a cleaning model that we applied to the noisy corpora.
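
The idea of replacing an unknown word with known spelling equivalents can be sketched with plain Levenshtein distance. This is a stand-in technique, not the paper's method: the distance threshold, the n-best cut-off, the function names, and the English toy vocabulary are all assumptions, and the paper's scoring of candidates and its lattice construction are not reproduced here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def candidates(word, vocab, max_dist=1, n_best=3):
    """Return up to n_best known words within max_dist edits of `word`.

    Known words are returned unchanged; an empty list means no replacement
    was found and the caller should keep the original word.
    """
    if word in vocab:
        return [word]
    scored = sorted((edit_distance(word, v), v) for v in vocab)
    return [v for d, v in scored if d <= max_dist][:n_best]


vocab = {"hello", "help", "world"}  # toy vocabulary, not from the paper
alternatives = candidates("helo", vocab)
```

Each list of alternatives could then be offered to the decoder as parallel paths for that input position, which is roughly what a lattice of spelling equivalents amounts to.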

Place, publisher, year, edition, pages
Stroudsburg: Association for Computational Linguistics, 2011
Keyword
Machine translation, unknown words, spell checking, data cleaning, Haitian Creole
National Category
Language Technology (Computational Linguistics) Computer Science
Identifiers
urn:nbn:se:liu:diva-70127 (URN); 978-1-937284-12-1 (ISBN); 1-937284-12-3 (ISBN)
Conference
The Sixth Workshop on Statistical Machine Translation (WMT 2011), July 30-31, Edinburgh, UK
Available from: 2011-08-19 Created: 2011-08-19 Last updated: 2012-05-14 Bibliographically approved

Open Access in DiVA

Text Harmonization Strategies for Phrase-Based Statistical Machine Translation (985 kB), 2095 downloads
File information
File name: FULLTEXT01.pdf
File size: 985 kB
Checksum (SHA-512): fa71a0e8522a05f2276433b92c410ed4dc37c7777d317b61fe01036cd620e1e54fb1b6c53a4424802a3eab6007786b854db8470faec94cdf23101c71484de8b3
Type: fulltext
Mimetype: application/pdf
Cover (183 kB), 61 downloads
File information
File name: COVER01.pdf
File size: 183 kB
Checksum (SHA-512): 586d53e550c9bfbf41fb82adefe6974f94075c09d05a8ea6fd727d62d451613598807420eb0fc883dae4527e4b2ba3cafc71db7dc1c0bc6571fea78f8c31d4b9
Type: cover
Mimetype: application/pdf

Authority records BETA

Stymne, Sara
