Automatic Extraction and Manual Validation of Hierarchical Patent Terminology
Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
Fodina Language Technology AB.
Fodina Language Technology AB.
2009 (English). In: NORDTERM 16. Ontologier og taksonomier: Rapport fra NORDTERM 2009 / [ed] B. Nistrup Madsen & H. Erdman Thomsen, Copenhagen, Denmark: Copenhagen Business School Press, 2009, 249-262 p. Conference paper, Published paper (Refereed)
Abstract [en]

Several methods can be applied to create a set of validated terms from existing documents. In this paper we describe an automatic bilingual term candidate extraction method and the validation process used to create a hierarchical patent terminology. The process was used to extract terms from patent texts in a project commissioned by the Swedish Patent Office, with the purpose of using the terms for machine translation. Information on the correct linguistic inflection patterns and a hierarchical partitioning of terms based on their use are of utmost importance.

The process consists of six phases: 1) Analysis of the source material and system configuration; 2) Term candidate extraction; 3) Term candidate filtering and initial linguistic validation; 4) Manual validation by domain experts; 5) Final linguistic validation; and 6) Publishing the validated terms.

Input to the extraction process consisted of more than 91 000 patent document pairs in English and Swedish, comprising 565 million words in English and 450 million words in Swedish. The English documents were supplied in EBD SGML format and the Swedish documents as OCR-processed scans of patent documents. After grammatical and statistical analysis, the documents were word-aligned. Using the word-aligned material, candidate terms were extracted based on linguistic patterns. In total, 750 000 term candidates were extracted and stored in a relational database. The term candidates were processed over 8 months, resulting in 181 000 unique validated term pairs that were exported into several hierarchically organized OLIF files.
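The extraction step summarised in the abstract (word alignment followed by pattern-based selection) can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the method of the paper: the pattern `ADJ* NOUN+`, the tag names, the contiguity requirement on the target side, and the function names `source_spans` and `extract_pairs` are all hypothetical, and the example sentence pair is invented.

```python
from collections import Counter

def source_spans(tags):
    """Yield (start, end) spans matching the toy pattern ADJ* NOUN+.

    The pattern is an illustrative assumption, not the one used in the paper.
    """
    i, n = 0, len(tags)
    while i < n:
        j = i
        while j < n and tags[j] == "ADJ":
            j += 1
        k = j
        while k < n and tags[k] == "NOUN":
            k += 1
        if k > j:                 # at least one noun closed the span
            yield (i, k)
            i = k
        else:                     # no noun here; move past what was scanned
            i = max(j, i + 1)

def extract_pairs(src, tags, tgt, links):
    """Project pattern-matching source spans onto the target via alignment links.

    links is a set of (source_index, target_index) word alignments.
    A candidate is kept only if its aligned target words are contiguous
    (a simplifying assumption made for this sketch).
    """
    pairs = []
    for start, end in source_spans(tags):
        tgt_idx = sorted({t for s, t in links if start <= s < end})
        if not tgt_idx:
            continue
        if tgt_idx[-1] - tgt_idx[0] + 1 != len(tgt_idx):
            continue
        pairs.append((" ".join(src[start:end]),
                      " ".join(tgt[tgt_idx[0]:tgt_idx[-1] + 1])))
    return pairs

# Invented English-Swedish example with a one-to-one alignment.
src = ["the", "hydraulic", "pump", "rotates"]
tags = ["DET", "ADJ", "NOUN", "VERB"]
tgt = ["den", "hydrauliska", "pumpen", "roterar"]
links = {(0, 0), (1, 1), (2, 2), (3, 3)}

candidates = Counter(extract_pairs(src, tags, tgt, links))
print(candidates)
```

Counting candidate pairs across many aligned sentences, as the `Counter` does here for a single sentence, gives frequencies that a later filtering or validation phase could use.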

Place, publisher, year, edition, pages
Copenhagen, Denmark: Copenhagen Business School Press, 2009. 249-262 p.
Keyword [en]
automatic term extraction, computational terminology, patent terminology
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:liu:diva-75236
ISBN: 978-87-994577-0-0 (print)
OAI: oai:DiVA.org:liu-75236
DiVA: diva2:505322
Conference
NORDTERM 2009, Copenhagen, Denmark, 9-12 June 2009
Available from: 2012-02-23 Created: 2012-02-22 Last updated: 2012-03-07 Bibliographically approved
In thesis
1. Computational Terminology: Exploring Bilingual and Monolingual Term Extraction
2012 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

Terminologies are becoming more important to modern-day society as technology and science continue to grow at an accelerating rate in a globalized environment. Agreeing upon which terms should be used to represent which concepts, and how those terms should be translated into different languages, is important if we wish to communicate with as little confusion and as few misunderstandings as possible.

Since the 1990s, an increasing amount of terminology research has been devoted to facilitating and augmenting terminology-related tasks by using computers and computational methods. One focus for this research is Automatic Term Extraction (ATE).

In this compilation thesis, studies on both bilingual and monolingual ATE are presented. The first two publications report on how bilingual ATE using the align-extract approach can be used to extract patent terms. The result in this case was 181,000 manually validated English-Swedish patent terms, which were to be used in a machine translation system for patent documents. A critical component of the method is the Q-value metric, presented in the third paper, which can be used to rank extracted term candidates (TCs) in an order that correlates with TC precision. The use of Machine Learning (ML) in monolingual ATE is the topic of the two final contributions. The first ML-related publication shows that rule-induction-based ML can be used to generate linguistic term selection patterns. In the second ML-related publication, contrastive n-gram language models are used in conjunction with SVM ML to improve the precision of term candidates selected using linguistic patterns.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2012. 68 p.
Series
Linköping Studies in Science and Technology. Thesis, ISSN 0280-7971 ; 1523
Keyword
terminology, automatic term extraction, automatic term recognition, computational terminology, terminology management
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:liu:diva-75243 (URN)
978-91-7519-944-3 (ISBN)
Presentation
2012-04-04, Alan Turing, Hus E, Campus Valla, Linköpings universitet, Linköping, 13:15 (English)
Available from: 2012-03-07 Created: 2012-02-23 Last updated: 2012-03-07 Bibliographically approved

Authority records

Merkel, Magnus; Foo, Jody