Compound Processing for Phrase-Based Statistical Machine Translation
Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
2009 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

In this thesis I explore how compound processing can be used to improve phrase-based statistical machine translation (PBSMT) between English and German/Swedish. Both German and Swedish generally use closed compounds, which are written as one word without spaces or other indicators of word boundaries. Compounding is both common and productive, which makes it problematic for PBSMT, mainly due to sparse data problems.

The adopted strategy for compound processing is to split compounds into their component parts before training and translation. For translation into Swedish and German, the parts are merged after translation. I investigate the effect of different splitting algorithms for translation between English and German, and of different merging algorithms for German. I also apply these methods to a different language pair, English-Swedish. Overall, the studies show that compound processing is useful, especially for translation from English into German or Swedish. There are also improvements for translation into English, such as a reduction in the number of unknown words.
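
The split-then-merge strategy described above can be illustrated with a minimal sketch. It is not the thesis implementation: the marker symbol `@@`, the toy split table, and the function names are all assumptions made for illustration, and a real system would use a splitting algorithm rather than a lookup table.

```python
# Hypothetical marker for non-final compound parts (an assumption,
# not a symbol taken from the thesis).
PART_MARKER = "@@"

# Toy split table standing in for a real splitting algorithm (assumption).
TOY_SPLITS = {
    "fruchtsaft": ["frucht", "saft"],
}


def split_compounds(tokens):
    """Preprocessing: replace each known compound with its marked parts."""
    out = []
    for tok in tokens:
        parts = TOY_SPLITS.get(tok.lower())
        if parts is None:
            out.append(tok)
        else:
            # Mark every part except the last, so parts can be told
            # apart from ordinary words later on.
            out.extend(p + PART_MARKER for p in parts[:-1])
            out.append(parts[-1])
    return out


def merge_compounds(tokens):
    """Postprocessing: glue each marked part onto the token that follows it."""
    out = []
    pending = ""
    for tok in tokens:
        if tok.endswith(PART_MARKER):
            pending += tok[: -len(PART_MARKER)]
        else:
            out.append(pending + tok)
            pending = ""
    if pending:
        # A marked part at the end of the sentence has nothing to merge with.
        out.append(pending)
    return out
```

Splitting would be applied to the training data and to German or Swedish input, and merging to German or Swedish output. In this sketch merging relies only on the marker, which is one of the simplest possible choices.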

I show that for translation between English and German different splitting algorithms work best for different translation directions. I also design and evaluate a novel merging algorithm based on part-of-speech matching, which outperforms previous methods for compound merging, showing the need for information that is carried through the translation process, rather than only external knowledge sources such as word lists. Most of the methods for compound processing were originally developed for German. I show that these methods can be applied to Swedish as well, with similar results.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2009. 64 p.
Series
Linköping Studies in Science and Technology. Thesis, ISSN 0280-7971 ; 1421
Keyword [en]
Machine translation, compounds, factored translation, statistical machine translation, computational linguistics
National Category
Language Technology (Computational Linguistics); Computer Science
Identifiers
URN: urn:nbn:se:liu:diva-51416; ISBN: 978-91-7393-501-2 (print); OAI: oai:DiVA.org:liu-51416; DiVA: diva2:279475
Presentation
2009-12-18, Alan Turing, Hus E, Campus Valla, Linköpings universitet, Linköping, 13:15 (English)
Available from: 2009-12-07. Created: 2009-10-30. Last updated: 2009-12-07. Bibliographically approved
List of papers
1. German Compounds in Factored Statistical Machine Translation
2008 (English). In: -, Berlin, Germany: Springer, 2008, p. 464-475. Conference paper, Published paper (Refereed)
Abstract [en]

An empirical method for splitting German compounds is explored by varying it in a number of ways to investigate the consequences for factored statistical machine translation between English and German in both directions. Compound splitting is incorporated into translation in a preprocessing step, performed on training data and on German translation input. For translation into German, compounds are merged based on part-of-speech in a postprocessing step. Compound parts are marked, to separate them from ordinary words. Translation quality is improved in both translation directions and the number of untranslated words in the English output is reduced. Different versions of the splitting algorithm perform best in the two different translation directions.
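
The abstract does not spell out the splitting method, so the following is only a sketch of one common corpus-based approach: choose the segmentation whose parts have the highest geometric mean of corpus frequencies. The scoring function, the minimum part length, and the toy frequency table are assumptions, not the paper's settings, and the sketch ignores linking elements such as the German "s".

```python
from math import prod


def segmentations(word, vocab, min_len=3):
    """Yield every way to cut `word` into parts that all occur in `vocab`."""
    if word in vocab:
        yield [word]
    # Try each cut point that leaves at least min_len characters on both sides.
    for i in range(min_len, len(word) - min_len + 1):
        head = word[:i]
        if head in vocab:
            for rest in segmentations(word[i:], vocab, min_len):
                yield [head] + rest


def best_split(word, freq, min_len=3):
    """Return the segmentation with the highest geometric mean of part frequencies."""

    def score(parts):
        return prod(freq[p] for p in parts) ** (1.0 / len(parts))

    candidates = list(segmentations(word, freq, min_len))
    if not candidates:
        # Unknown word with no known parts: leave it unsplit.
        return [word]
    return max(candidates, key=score)


# Toy corpus frequencies, invented for illustration only.
TOY_FREQ = {"frucht": 50, "saft": 80, "fruchtsaft": 2, "freitag": 100, "frei": 20, "tag": 30}
```

With these toy counts, `best_split("fruchtsaft", TOY_FREQ)` prefers the two frequent parts over the rare full form, while `best_split("freitag", TOY_FREQ)` keeps the frequent word whole. Varying the scoring or the length limit is the kind of change that could favour one translation direction over the other.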

Place, publisher, year, edition, pages
Berlin, Germany: Springer, 2008
Keyword
machine translation, compounds
National Category
Computer Science
Identifiers
URN: urn:nbn:se:liu:diva-44110; DOI: 10.1007/978-3-540-85287-2_44; Local ID: 75561; Archive number: 75561; OAI: 75561
Conference
6th International Conference on Natural Language Processing GoTAL, 2008
Available from: 2009-10-10. Created: 2009-10-10. Last updated: 2009-12-07. Bibliographically approved
2. Processing of Swedish Compounds for Phrase-Based Statistical Machine Translation
2008 (English). In: Proceedings of the 12th European Association for Machine Translation Conference, Hamburg, Germany: HITEC e.V, 2008, p. 182-191. Conference paper, Published paper (Refereed)
Abstract [en]

We investigated the effects of processing Swedish compounds for phrase-based SMT between Swedish and English. Compounds were split in a pre-processing step using an unsupervised empirical method. After translation into Swedish, compounds were merged, using a novel merging algorithm. We investigated two ways of handling compound parts, by marking them as compound parts or by normalizing them to a canonical form. We found that compound splitting did improve translation into Swedish, according to automatic metrics. For translation into English the results were not consistent across automatic metrics. However, error analysis of compound translation showed a small improvement in the systems that used splitting. The number of untranslated words in the English output was reduced by 50%.

Place, publisher, year, edition, pages
Hamburg, Germany: HITEC e.V, 2008
Keyword
computational linguistics, statistical machine translation
National Category
Computer Science
Identifiers
URN: urn:nbn:se:liu:diva-44126; Local ID: 75720; ISBN: 978-300025770-4; Archive number: 75720; OAI: 75720
Conference
12th European Machine Translation Conference, 22-23 September 2008, Hamburg, Germany
Available from: 2009-10-10. Created: 2009-10-10. Last updated: 2012-09-14. Bibliographically approved
3. A Comparison of Merging Strategies for Translation of German Compounds
2009 (English). In: Proceedings of the Student Research Workshop at the 12th Conference of the European Chapter of the ACL (EACL 2009), Association for Computational Linguistics, 2009, p. 61-69. Conference paper, Published paper (Refereed)
Abstract [en]

In this article, compound processing for translation into German in a factored statistical MT system is investigated. Compounds are handled by splitting them prior to training, and merging the parts after translation. I have explored eight merging strategies using different combinations of external knowledge sources, such as word lists, and internal sources that are carried through the translation process, such as symbols or parts-of-speech. I show that for merging to be successful, some internal knowledge source is needed. I also show that an extra sequence model for part-of-speech is useful in order to improve the order of compound parts in the output. The best merging results are achieved by a matching scheme for part-of-speech tags.
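
As a rough illustration of what a matching scheme for part-of-speech tags might look like, the sketch below merges a compound part with the next token only when their tags are compatible. The tag names, the `-PART` suffix, and the exact matching rule are assumptions made for this example; the abstract does not specify them.

```python
# Hypothetical suffix that marks the POS tag of a compound part (assumption).
PART_SUFFIX = "-PART"


def merge_by_pos(tagged):
    """Merge compound parts with the following token when the POS tags match.

    `tagged` is a list of (word, pos) pairs. A part tagged e.g. "N-PART" is
    merged with the next token if that token is tagged "N" or "N-PART".
    """
    out = []
    i = 0
    while i < len(tagged):
        word, pos = tagged[i]
        # Keep absorbing following tokens for as long as the tags match.
        while pos.endswith(PART_SUFFIX) and i + 1 < len(tagged):
            next_word, next_pos = tagged[i + 1]
            base = pos[: -len(PART_SUFFIX)]
            if next_pos not in (base, pos):
                break  # tags do not match: leave the part unmerged
            word, pos = word + next_word, next_pos
            i += 1
        out.append((word, pos))
        i += 1
    return out
```

Because the tags travel with the words through translation, this kind of check uses an internal knowledge source, in contrast to consulting an external word list to decide whether a merged form exists.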

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2009
Keyword
Natural language processing, machine translation, compounds
National Category
Computer Science; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:liu:diva-20318 (URN)
Available from: 2009-09-03. Created: 2009-09-03. Last updated: 2009-12-07. Bibliographically approved

Open Access in DiVA

Compound Processing for Phrase-Based Statistical Machine Translation (559 kB), 1365 downloads
File information
File name: FULLTEXT01.pdf. File size: 559 kB. Checksum (SHA-512):
26f96b3828594f0d42edde6961771c7050f64860844d00f97b98d971096b12d37b5683bf75164efad81fb3db03b52309b6f50f883bd9b0516315ce0a9c39c449
Type: fulltext. Mimetype: application/pdf
Cover (42 kB), 27 downloads
File information
File name: COVER01.pdf. File size: 42 kB. Checksum (SHA-512):
778882600feac43231dfb89d31b2e3a68fdb3d8faef9cb61d2694029e7abaa05ab51053b25757f71ae1d5530a80f765ddf20ccd281b0dc6483eec9ef4b93f9ad
Type: cover. Mimetype: application/pdf

Authority records BETA

Stymne, Sara
