liu.seSearch for publications in DiVA
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • oxford
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
A Probabilistic Tagging Module Based on Surface Pattern Matching
Stockholm University, Department of Computational Linguistics, Institute of Linguistics.ORCID iD: 0000-0003-3734-0757
1993 (English)Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
Abstract [en]

A problem with automatic tagging and lexical analysis is that it is never 100 % accurate. In order to arrive at better figures, one needs to study the character of what is left untagged by automatic taggers. In this paper untagged residue outputted by the automatic analyser SWETWOL (Karlsson 1992) at Helsinki is studied. SWETWOL assigns tags to words in Swedish texts mainly through dictionary lookup. The contents of the untagged residue files are described and discussed, and possible ways of solving different problems are proposed. One method of tagging residual output is proposed and implemented: the left-stripping method, through which untagged words are bereaved their left-most letters, searched in a dictionary, and if found, tagged according to the information found in the said dictionary. If the stripped word is not found in the dictionary, a match is searched in ending lexica containing statistical information about word classes associated with that particular word form (i.e., final letter cluster, be this a grammatical suffix or not), and the relative frequency of each word class. If a match is found, the word is given graduated tagging according to the statistical information in the ending lexicon. If a match is not found, the word is stripped of what is now its left-most letter and is recursively searched in a dictionary and ending lexica (in that order). The ending lexica employed in this paper are retrieved from a reversed version of Nusvensk Frekvensordbok (Allén 1970), and contain endings of between one and seven letters. The contents of the ending lexica are to a certain degree described and discussed. The programs working according to the principles described are run on files of untagged residual output. Appendices include, among other things, LISP source code, untagged and tagged files, the ending lexica containing one and two letter endings and excerpts from ending lexica containing three to seven letters.

Place, publisher, year, edition, pages
Stockholm University, 1993. , p. 96
Keywords [en]
Tagging, computational linguistics, word-class, probabilistic, morphology, swetwol, statistical, corpus linguistics, corpora, endings, suffixes, word class frequency, lexical analysis
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:liu:diva-135294OAI: oai:DiVA.org:liu-135294DiVA, id: diva2:1080416
Supervisors
Available from: 2017-03-10 Created: 2017-03-10 Last updated: 2018-01-13Bibliographically approved

Open Access in DiVA

A Probabilistic Tagging Module Based on Surface Pattern Matching(414 kB)17 downloads
File information
File name FULLTEXT01.pdfFile size 414 kBChecksum SHA-512
e59d92fc7691d53e00d1d5b522207af5825f708b79bcb0e1a0f9132dac809ff7aa980750d008bd55f99afcde1a25af7881c36fab43cf27303c4f80480144f787
Type fulltextMimetype application/pdf

Search in DiVA

By author/editor
Eklund, Robert
Language Technology (Computational Linguistics)

Search outside of DiVA

GoogleGoogle Scholar
Total: 17 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

urn-nbn

Altmetric score

urn-nbn
Total: 43 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • oxford
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf