Generation of Compound Words in Statistical Machine Translation into Compounding Languages
Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
Xerox Research Centre Europe. (Machine Learning for Document Access and Translation)
Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
2013 (English). In: Computational linguistics - Association for Computational Linguistics (Print), ISSN 0891-2017, E-ISSN 1530-9312, Vol. 39, no. 4, pp. 1067-1108. Journal article (Refereed). Published
Abstract [en]

In this article we investigate statistical machine translation (SMT) into Germanic languages, with a focus on compound processing. Our main goal is to enable the generation of novel compounds that have not been seen in the training data. We adopt a split-merge strategy, where compounds are split before training the SMT system, and merged after the translation step. This approach reduces sparsity in the training data, but runs the risk of placing translations of compound parts in non-consecutive positions. It also requires a postprocessing step of compound merging, where compounds are reconstructed in the translation output. We present a method for increasing the chances that components that should be merged are translated into contiguous positions and in the right order and show that it can lead to improvements both by direct inspection and in terms of standard translation evaluation metrics. We also propose several new methods for compound merging, based on heuristics and machine learning, which outperform previously suggested algorithms. These methods can produce novel compounds and a translation with at least the same overall quality as the baseline. For all subtasks we show that it is useful to include part-of-speech based information in the translation process, in order to handle compounds.
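
The split-merge strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vocabulary `VOCAB`, the marker `@@`, and the function names are hypothetical, the splitter only tries two-part splits against a word list, and the merger ignores linking morphemes and the part-of-speech information the article relies on.

```python
# Toy illustration of split-merge compound processing.
# VOCAB, MARK and all function names are assumptions made for this sketch.
VOCAB = {"haus", "tür", "garten"}
MARK = "@@"  # hypothetical symbol marking a non-final compound part


def split_compound(word, vocab, min_len=3):
    """Split a word into two known parts if possible, else return it unchanged."""
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in vocab and tail in vocab:
            return [head + MARK, tail]
    return [word]


def split_sentence(tokens, vocab):
    """Apply compound splitting to every token (done before training)."""
    result = []
    for token in tokens:
        result.extend(split_compound(token, vocab))
    return result


def merge_sentence(tokens):
    """Rejoin marked parts with the token that follows them (done after translation)."""
    merged = []
    pending = ""
    for token in tokens:
        if token.endswith(MARK):
            pending += token[: -len(MARK)]
        else:
            merged.append(pending + token)
            pending = ""
    if pending:  # a marked part at the end of the sentence has nothing to attach to
        merged.append(pending)
    return merged
```

Because merging operates on parts rather than on whole words, a sequence such as `garten@@ tür` is merged into `gartentür` even if that compound never occurred in the training data, which is the sense in which the approach can generate novel compounds. The sketch also shows why part order and contiguity matter: parts are only merged when the translation places them next to each other.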

Place, publisher, year, edition, pages
MIT Press, 2013. Vol. 39, no. 4, pp. 1067-1108
Keywords [en]
Machine translation, compound words, compounding languages
National subject category
Language Technology (Computational Linguistics); Comparative Linguistics and General Linguistics; Computer Science
Identifiers
URN: urn:nbn:se:liu:diva-76689; DOI: 10.1162/COLI_a_00162; ISI: 000327124700008; OAI: oai:DiVA.org:liu-76689; DiVA, id: diva2:515868
Available from: 2012-04-16. Created: 2012-04-16. Last updated: 2018-01-12.
Part of thesis
1. Text Harmonization Strategies for Phrase-Based Statistical Machine Translation
2012 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In this thesis I aim to improve phrase-based statistical machine translation (PBSMT) in a number of ways by the use of text harmonization strategies. PBSMT systems are built by training statistical models on large corpora of human translations. This architecture generally performs well for languages with similar structure. If the languages are different for example with respect to word order or morphological complexity, however, the standard methods do not tend to work well. I address this problem through text harmonization, by making texts more similar before training and applying a PBSMT system.

I investigate how text harmonization can be used to improve PBSMT with a focus on four areas: compounding, definiteness, word order, and unknown words. For the first three areas, the focus is on linguistic differences between languages, which I address by applying transformation rules, using either rule-based or machine learning-based techniques, to the source or target data. For the last area, unknown words, I harmonize the translation input to the training data by replacing unknown words with known alternatives.
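
The idea of replacing unknown input words with known alternatives can be sketched as follows. This is only an illustration under stated assumptions: it uses string similarity from Python's standard `difflib` module as a stand-in, not the thesis's own algorithm, and the function name `replace_unknown`, the `cutoff` value, and the example vocabulary are hypothetical.

```python
import difflib


def replace_unknown(tokens, vocab, cutoff=0.8):
    """Replace out-of-vocabulary tokens with the closest known word, if any.

    Similarity is difflib's ratio; this is an assumed stand-in for a
    trained model, chosen only to keep the sketch self-contained.
    """
    known = sorted(vocab)
    output = []
    for token in tokens:
        if token in vocab:
            output.append(token)
            continue
        candidates = difflib.get_close_matches(token, known, n=1, cutoff=cutoff)
        # Keep the original token when no sufficiently similar known word exists.
        output.append(candidates[0] if candidates else token)
    return output


# Hypothetical vocabulary standing in for the words seen in the training data.
TRAINING_VOCAB = {"the", "machine", "translation"}
```

Applied to input containing misspellings, the function maps each unknown token to a similar word from `TRAINING_VOCAB` and leaves tokens without a close match untouched, so the translation system sees input that is closer to its training data.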

I show that translation into languages with closed compounds can be improved by splitting and merging compounds. I develop new merging algorithms that outperform previously suggested algorithms and show how part-of-speech tags can be used to improve the order of compound parts. Scandinavian definite noun phrases are identified as a problem for PBSMT in translation into Scandinavian languages and I propose a preprocessing approach that addresses this problem and gives large improvements over a baseline. Several previous proposals for how to handle differences in reordering exist; I propose two types of extensions, iterating reordering and word alignment and using automatically induced word classes, which allow these methods to be used for less-resourced languages. Finally I identify several ways of replacing unknown words in the translation input, most notably a spell checking-inspired algorithm, which can be trained using character-based PBSMT techniques.

Overall I present several approaches for extending PBSMT by the use of pre- and postprocessing techniques for text harmonization, and show experimentally that these methods work. Text harmonization methods are an efficient way to improve statistical machine translation within the phrase-based approach, without resorting to more complex models.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2012. p. 95
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1451
Keywords
Statistical machine translation, text harmonization, compound words, definiteness, reordering, unknown words
National subject category
Language Technology (Computational Linguistics); Computer Science; Comparative Linguistics and General Linguistics
Identifiers
urn:nbn:se:liu:diva-76766 (URN); 978-91-7519-887-3 (ISBN)
Public defence
2012-06-11, Visionen, Building B, Campus Valla, Linköping University, Linköping, 13:15 (English)
Available from: 2012-05-14. Created: 2012-04-19. Last updated: 2018-01-12. Bibliographically approved.

Open Access in DiVA

fulltext (699 kB), 865 downloads
File information
File name: FULLTEXT01.pdf. File size: 699 kB. Checksum: SHA-512
2bea195bb298c7433eefeb583ad29072f5ea62a9eb282560f31124f85acd78cb070534cd6bc4ecff85f42ac5fbfe50c68ac7a5dd1a55d9f3908f00c13a08c4de
Type: fulltext. Mimetype: application/pdf

Person records

Stymne, Sara; Ahrenberg, Lars
