LiU Electronic Press
Full-text not available in DiVA
Arpteg, Anders (Linköping University, Department of Computer and Information Science) (Linköping University, The Institute of Technology)
Intelligent semi-structured information extraction: a user-driven approach to information extraction
Linköping University, Department of Computer and Information Science
Linköping University, The Institute of Technology
Publication type:
Doctoral thesis, monograph (Other academic)
Place of publ.: Linköping Publisher: Linköping University Electronic Press
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524; 946
Subject category:
Computer Science
SVEP category:
Computer science
Keywords:
Artificial intelligence, Information retrieval
Abstract (en):

Information extraction tools need to be applicable to more domains and tasks. It is difficult to design general extraction systems that require neither expert skills nor a large amount of work from the user, which limits the number of domains and tasks that can be covered. One way to address this is to design user-driven information extraction systems that a large number of non-expert users can adapt to new domains and tasks themselves. To accomplish this goal, the systems need to become more intelligent and able to learn to extract with as little given information as possible.

This thesis focuses on semi-structured information extraction. The term semi-structured refers to documents that contain not only natural language text but also additional structural information. The typical application is information extraction from World Wide Web hypertext documents. By making effective use of both the link structure between documents and the structural information within each document, user-driven extraction systems with high performance can be built.

This thesis presents two approaches to the user-driven extraction problem. The first takes a machine learning approach and uses a modified $Q(\lambda)$ reinforcement learning algorithm. A problem with this approach was that it was difficult to handle extraction from the hidden Web. Since the hidden Web is about 500 times larger than the visible Web, it would be very useful to be able to extract information from that part of the Web as well. The second approach, called the hidden observation approach, aims for a user-driven information extraction system that can also handle the hidden Web. It reuses a large part of the system developed for the first approach, but the additional information that is silently obtained from the user presents other problems and possibilities.

An agent-oriented system was designed to evaluate the approaches presented in this thesis. A set of experiments was conducted, and the results indicate that a user-driven information extraction system is possible and no longer just a concept. However, additional work and research are necessary before a fully fledged user-driven system can be designed.


This work has been supported by the University of Kalmar and the Knowledge Foundation.

Public defence:
2005-05-20, Key 1, Hus Key, Campus Valla, Linköpings universitet, Linköping, 10:15 (English)
Doctor of Philosophy (PhD)