A robust text processing technique applied to lexical error recovery
Linköping University, Department of Computer and Information Science. Linköping University, The Institute of Technology.
1997 (English). Licentiate thesis, monograph (Other academic)
Abstract [en]

This thesis addresses automatic lexical error recovery and tokenization of corrupt text input. We propose a technique that can automatically correct misspellings, segmentation errors and real-word errors in a unified framework that uses both a model of language production and a model of the typing behavior, and which makes tokenization part of the recovery process.

The typing process is modeled as a noisy channel where Hidden Markov Models are used to model the channel characteristics. Weak statistical language models are used to predict what sentences are likely to be transmitted through the channel. These components are held together in the Token Passing framework which provides the desired tight coupling between orthographic pattern matching and linguistic expectation.

The system, CTR (Connected Text Recognition), has been tested on two corpora derived from two different applications, a natural language dialogue system and a transcription typing scenario. Experiments show that CTR can automatically correct a considerable portion of the errors in the test sets without introducing too much noise. The segmentation error correction rate is virtually faultless.
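
The noisy-channel idea named in the abstract can be illustrated with a minimal sketch: choose the intended word that maximizes the product of a language-model probability and a channel probability for the observed typing. Everything below is an assumption made for illustration and is not taken from the thesis: the toy vocabulary and its probabilities are hypothetical, and a simple edit-distance penalty stands in for the Hidden Markov Model channel. The sketch corrects isolated words only; it does not cover segmentation errors or the Token Passing framework.

```python
import math

# Hypothetical toy vocabulary with unigram probabilities (not from the thesis).
UNIGRAM = {"the": 0.5, "then": 0.2, "than": 0.2, "hen": 0.1}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def channel_log_prob(typed: str, intended: str, error_rate: float = 0.1) -> float:
    """Crude stand-in for the channel model: each edit costs log(error_rate)."""
    return edit_distance(typed, intended) * math.log(error_rate)


def correct(typed: str) -> str:
    """Return the vocabulary word maximizing log P(word) + log P(typed | word)."""
    return max(
        UNIGRAM,
        key=lambda w: math.log(UNIGRAM[w]) + channel_log_prob(typed, w),
    )


if __name__ == "__main__":
    for observed in ("teh", "thann"):
        print(observed, "->", correct(observed))
```

The language model breaks ties between candidates that are equally close to the typed string, which is the coupling between pattern matching and linguistic expectation that the abstract describes, here reduced to single words.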

Place, publisher, year, edition, pages
Linköping: Linköpings universitet, 1997, p. 101
Series
Linköping Studies in Science and Technology. Thesis, ISSN 0280-7971 ; 599
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:liu:diva-145886
Libris ID: 7671819
Local ID: LiU-TEK-LIC-1996:64
ISBN: 9178718910 (print)
OAI: oai:DiVA.org:liu-145886
DiVA, id: diva2:1201212
Presentation
1997-01-31, Esatraden, hus E, Campus Valla, Linköping, Sweden, 13:15 (Swedish)
Note

This work has been supported by the Swedish National Board for Industrial and Technical Development (NUTEK) and the Swedish Council for Research in the Humanities and Social Sciences (HSFR).

Available from: 2018-04-25. Created: 2018-04-25. Last updated: 2023-03-16. Bibliographically approved.
