Automatic Extraction and Manual Validation of Hierarchical Patent Terminology
2009 (English) In: NORDTERM 16. Ontologier og taksonomier: Rapport fra NORDTERM 2009 / [ed] B. Nistrup Madsen & H. Erdman Thomsen, Copenhagen, Denmark: Copenhagen Business School Press, 2009, 249-262 p. Conference paper (Refereed)
Several methods can be applied to create a set of validated terms from existing documents. In this paper we describe an automatic bilingual term candidate extraction method and the validation process used to create a hierarchical patent terminology. The process was used to extract terms from patent texts in work commissioned by the Swedish Patent Office, with the aim of using the terms for machine translation. Information on the correct linguistic inflection patterns and on the hierarchical partitioning of terms based on their use is of utmost importance.
The process comprises six phases: 1) Analysis of the source material and system configuration; 2) Term candidate extraction; 3) Term candidate filtering and initial linguistic validation; 4) Manual validation by domain experts; 5) Final linguistic validation; and 6) Publishing the validated terms.
Input to the extraction process consisted of more than 91 000 patent document pairs in English and Swedish, comprising 565 million words in English and 450 million words in Swedish. The English documents were supplied in EBD SGML format and the Swedish documents as OCR-processed scans of patent documents. After grammatical and statistical analysis, the documents were word-aligned. Using the word-aligned material, candidate terms were extracted based on linguistic patterns. In total, 750 000 term candidates were extracted and stored in a relational database. The term candidates were processed over 8 months, resulting in 181 000 unique validated term pairs that were exported into several hierarchically organized OLIF files.
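The pattern-based extraction step over word-aligned material can be illustrated with a minimal sketch. It is not the system described in the paper: the one-letter part-of-speech codes, the single pattern (adjectives followed by nouns), the function names, and the example sentence pair are all assumptions made for illustration. A source-side span matching the pattern is projected onto the target sentence through the word-alignment links, which yields a bilingual term candidate.

```python
import re

# Hypothetical one-letter POS codes: A = adjective, N = noun, O = other.
# Assumed pattern: zero or more adjectives followed by one or more nouns.
TERM_PATTERN = re.compile(r"A*N+")


def source_spans(tags):
    """Yield (start, end) token spans whose tag sequence matches TERM_PATTERN."""
    tag_string = "".join(tags)  # one character per token, so offsets are token indices
    for match in TERM_PATTERN.finditer(tag_string):
        yield match.start(), match.end()


def project(span, alignment):
    """Map a source span to the smallest target span covering its aligned tokens."""
    start, end = span
    targets = [t for s, t in alignment if start <= s < end]
    if not targets:
        return None  # no alignment links: no candidate can be formed
    return min(targets), max(targets) + 1


def extract_candidates(src_tokens, src_tags, tgt_tokens, alignment):
    """Return (source term, target term) candidate pairs for one sentence pair."""
    candidates = []
    for span in source_spans(src_tags):
        tgt_span = project(span, alignment)
        if tgt_span is None:
            continue
        src_term = " ".join(src_tokens[span[0]:span[1]])
        tgt_term = " ".join(tgt_tokens[tgt_span[0]:tgt_span[1]])
        candidates.append((src_term, tgt_term))
    return candidates


# Invented example sentence pair with (source index, target index) alignment links.
english = ["the", "hydraulic", "pump", "drives", "the", "piston"]
english_tags = ["O", "A", "N", "O", "O", "N"]
swedish = ["den", "hydrauliska", "pumpen", "driver", "kolven"]
links = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 4)]

pairs = extract_candidates(english, english_tags, swedish, links)
for pair in pairs:
    print(pair)
```

Candidates produced this way would still need the filtering and validation phases listed above; the sketch covers only the extraction step.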
Place, publisher, year, edition, pages
Copenhagen, Denmark: Copenhagen Business School Press, 2009. 249-262 p.
Keywords
automatic term extraction, computational terminology, patent terminology
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:liu:diva-75236, ISBN: 978-87-994577-0-0, OAI: oai:DiVA.org:liu-75236, DiVA: diva2:505322
NORDTERM 2009, København, Danmark, 9-12 juni 2009