Vs and OOVs: Two Problems for Translation between German and English
Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
2010 (English). In: Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR (WMT'10), 2010, pp. 183-188. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper we report on experiments with three preprocessing strategies for improving translation output in a statistical MT system. In training, two reordering strategies were studied: (i) reorder on the basis of the alignments from Giza++, and (ii) reorder by moving all verbs to the end of segments. In translation, out-of-vocabulary words were preprocessed in a knowledge-lite fashion to identify a likely equivalent. All three strategies were implemented for our English-German systems submitted to the WMT10 shared task. Combining them led to improvements in both language directions.
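As a rough illustration of strategy (ii), the sketch below moves verbs to the end of each segment in a POS-tagged sentence. The abstract does not give the actual rules, so everything specific here is an assumption: the function name `move_verbs_to_segment_end`, the convention that verb tags start with "V", and the choice of punctuation marks as segment boundaries are all hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag)

# Assumption: a segment ends at one of these punctuation tokens.
SEGMENT_BREAKS = {",", ".", ";", ":", "?", "!"}


def move_verbs_to_segment_end(tokens: List[Token]) -> List[Token]:
    """Reorder each segment so that its verbs follow all other tokens.

    Assumption: any tag starting with "V" marks a verb.
    The relative order of verbs and of non-verbs is preserved.
    """
    out: List[Token] = []
    non_verbs: List[Token] = []
    verbs: List[Token] = []
    for word, tag in tokens:
        if word in SEGMENT_BREAKS:
            # Close the current segment: non-verbs first, then verbs.
            out.extend(non_verbs + verbs)
            out.append((word, tag))
            non_verbs, verbs = [], []
        elif tag.startswith("V"):
            verbs.append((word, tag))
        else:
            non_verbs.append((word, tag))
    # Flush a final segment that has no closing punctuation.
    out.extend(non_verbs + verbs)
    return out


sentence = [("I", "PRP"), ("have", "VBP"), ("seen", "VBN"),
            ("the", "DT"), ("film", "NN"), (".", ".")]
reordered = [word for word, _ in move_verbs_to_segment_end(sentence)]
```

For the example sentence, the reordered English side places "have seen" after "the film", which is closer to German verb-final order.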

Place, publisher, year, edition, pages
2010. 183-188 p.
Keyword [en]
Machine translation, reordering, Out-of-vocabulary words
National Category
Language Technology (Computational Linguistics); Computer Science
Identifiers
URN: urn:nbn:se:liu:diva-58979
ISBN: 978-1-932432-71-8 (print)
ISBN: 1-932432-71-X (print)
OAI: oai:DiVA.org:liu-58979
DiVA: diva2:347920
Conference
The Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, 15-16 July 2010 Uppsala, Sweden
Available from: 2010-09-03. Created: 2010-09-03. Last updated: 2012-05-14. Bibliographically approved.
In thesis
1. Text Harmonization Strategies for Phrase-Based Statistical Machine Translation
2012 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In this thesis I aim to improve phrase-based statistical machine translation (PBSMT) in a number of ways by the use of text harmonization strategies. PBSMT systems are built by training statistical models on large corpora of human translations. This architecture generally performs well for languages with similar structure. If the languages differ, however, for example with respect to word order or morphological complexity, the standard methods tend not to work well. I address this problem through text harmonization, by making texts more similar before training and applying a PBSMT system.

I investigate how text harmonization can be used to improve PBSMT with a focus on four areas: compounding, definiteness, word order, and unknown words. For the first three areas, the focus is on linguistic differences between languages, which I address by applying transformation rules, using either rule-based or machine learning-based techniques, to the source or target data. For the last area, unknown words, I harmonize the translation input to the training data by replacing unknown words with known alternatives.

I show that translation into languages with closed compounds can be improved by splitting and merging compounds. I develop new merging algorithms that outperform previously suggested algorithms and show how part-of-speech tags can be used to improve the order of compound parts. Scandinavian definite noun phrases are identified as a problem for PBSMT in translation into Scandinavian languages, and I propose a preprocessing approach that addresses this problem and gives large improvements over a baseline. Several previous proposals for how to handle differences in reordering exist; I propose two types of extensions, iterating reordering and word alignment and using automatically induced word classes, which allow these methods to be used for less-resourced languages. Finally, I identify several ways of replacing unknown words in the translation input, most notably a spell checking-inspired algorithm, which can be trained using character-based PBSMT techniques.

Overall I present several approaches for extending PBSMT by the use of pre- and postprocessing techniques for text harmonization, and show experimentally that these methods work. Text harmonization methods are an efficient way to improve statistical machine translation within the phrase-based approach, without resorting to more complex models.
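To make the idea of replacing unknown words with known alternatives concrete, here is a minimal sketch that swaps each out-of-vocabulary token for its most similar in-vocabulary word. It is not the thesis's algorithm, which is described as trainable with character-based PBSMT techniques. It is a plain string-similarity stand-in built on Python's standard `difflib` module, and the function name `replace_unknown_words` and the `cutoff` value are assumptions chosen for illustration.

```python
import difflib
from typing import Iterable, List


def replace_unknown_words(tokens: List[str],
                          vocabulary: Iterable[str],
                          cutoff: float = 0.8) -> List[str]:
    """Replace each token missing from the vocabulary with its closest
    known word, if one is at least `cutoff` similar; otherwise keep it.

    Assumption: surface similarity is a usable proxy for equivalence.
    """
    vocab = set(vocabulary)
    candidates = sorted(vocab)  # sorted so the candidate order is stable
    result: List[str] = []
    for token in tokens:
        if token in vocab:
            result.append(token)
            continue
        matches = difflib.get_close_matches(token, candidates, n=1, cutoff=cutoff)
        result.append(matches[0] if matches else token)
    return result


known = {"the", "machine", "translation"}
harmonized = replace_unknown_words(["the", "machne", "translaton", "xyzzy"], known)
```

In this example the two misspelled tokens are mapped to "machine" and "translation", while "xyzzy" has no sufficiently similar known word and is passed through unchanged.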

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2012. 95 p.
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1451
Keyword
Statistical machine translation, text harmonization, compound words, definiteness, reordering, unknown words
National Category
Language Technology (Computational Linguistics); Computer Science; General Language Studies and Linguistics
Identifiers
URN: urn:nbn:se:liu:diva-76766
ISBN: 978-91-7519-887-3
Public defence
2012-06-11, Visionen, Hus B, Campus Valla, Linköpings universitet, Linköping, 13:15 (English)
Available from: 2012-05-14. Created: 2012-04-19. Last updated: 2012-05-15. Bibliographically approved.

Authors

Stymne, Sara; Holmqvist, Maria; Ahrenberg, Lars
