Creating a medical dictionary using word alignment: The influence of sources and resources
2007 (English). In: BMC Medical Informatics and Decision Making, ISSN 1472-6947, Vol. 7, no 37. Article in journal (Refereed). Published.
Background. Automatic word alignment of parallel texts with the same content in different languages is used, among other things, to generate dictionaries for new translations. The quality of the generated word alignment depends on the quality of the input resources. In this paper we report on automatic word alignment of the English and Swedish versions of the medical terminology systems ICD-10, ICF, NCSP, KSH97-P and parts of MeSH, and on how the terminology systems and the types of resources influence the quality.
Methods. We automatically word aligned the terminology systems using static resources, such as dictionaries; statistical resources, such as statistically derived dictionaries; and training resources, which were generated from manual word alignment. We varied which parts of the terminology systems we used to generate the resources, which parts we word aligned, and which types of resources we used in the alignment process, in order to explore the influence of the different terminology systems and resources on recall and precision. After the analysis, we used the best configuration of the automatic word alignment to generate candidate term pairs. We then manually verified the candidate term pairs and included the correct pairs in an English-Swedish dictionary.
Results. The results indicate that more resources and more resource types give better results, but the size of the parts used to generate the resources only partly affects the quality. The most generally useful resources were generated from ICD-10, and resources generated from MeSH were not as general as other resources. Systematic inter-language differences in the structure of the terminology system rubrics make the rubrics harder to align. Manually created training resources give nearly as good results as a union of static, statistical and training resources, and noticeably better results than a union of static and statistical resources. The verified English-Swedish dictionary contains 24,000 term pairs in base forms.
Conclusion. More resources give better results in the automatic word alignment, but some resources give only small improvements. The most important type of resource is training resources, and the most general resources were generated from ICD-10.
© 2007 Nyström et al, licensee BioMed Central Ltd.
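The abstract evaluates alignment configurations by recall and precision. As a rough illustration only, the sketch below computes both measures for a set of proposed word links against a manually aligned reference. It assumes a plain set-based definition, which may differ from the exact measure used in the paper; the function name `precision_recall` and the English-Swedish word pairs are invented for this example and are not taken from the terminology systems.

```python
def precision_recall(predicted_links, gold_links):
    """Set-based precision and recall for word alignment links.

    Each link is a (source_word, target_word) pair. This is an assumed,
    simplified definition, not necessarily the paper's evaluation measure.
    """
    predicted = set(predicted_links)
    gold = set(gold_links)
    correct = predicted & gold  # links proposed by the aligner that are in the reference
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical reference alignment for one rubric (illustrative data only).
gold = {("fracture", "fraktur"), ("of", "av"), ("femur", "lårben")}

# Hypothetical aligner output: two correct links and one wrong link.
predicted = {("fracture", "fraktur"), ("femur", "lårben"), ("of", "lårben")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Comparing configurations then amounts to running the same computation on the links produced with each combination of resources and comparing the resulting pairs of scores.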
Medical and Health Sciences
Identifiers
URN: urn:nbn:se:liu:diva-40825
DOI: 10.1186/1472-6947-7-37
Local ID: 54255
OAI: oai:DiVA.org:liu-40825
DiVA: diva2:261674
Mikael Nyström, Magnus Merkel, Håkan Petersson and Hans Åhlfeldt, Creating a medical dictionary using word alignment: The influence of sources and resources, 2007, BMC Medical Informatics and Decision Making, (7), 37.
Licensee: BioMed Central