Information extraction tools need to become usable in more domains and for more tasks than they are today. General extraction systems are difficult to design without requiring expert skills or a large amount of work from the user, and this limits how widely they can be applied. One way around the problem is to design user-driven information extraction systems, which a large number of non-expert users can adapt to new domains and tasks themselves. To make this possible, the systems need to become more intelligent and able to learn to extract from as little given information as possible.
This thesis focuses on semi-structured information extraction. The term semi-structured refers to documents that contain not only natural language text but also additional structural information. The typical application is information extraction from World Wide Web hypertext documents. By making effective use of both the link structure between documents and the structural information within each document, user-driven extraction systems with high performance can be built.
This thesis presents two approaches to the user-driven extraction problem. The first is a machine learning approach that uses a modified $Q(\lambda)$ reinforcement learning algorithm. A problem with this approach was that it had difficulty handling extraction from the hidden Web. Since the hidden Web is estimated to be about 500 times larger than the visible Web, it would be very useful to be able to extract information from that part of the Web as well. The second approach, called the hidden observation approach, therefore aims at a user-driven information extraction system that can also handle the hidden Web. It reuses a large part of the system developed for the first approach, but the additional information that is silently obtained from the user presents other problems and possibilities.
An agent-oriented system was designed to evaluate the approaches presented in this thesis. A set of experiments was conducted, and the results indicate that a user-driven information extraction system is feasible and no longer just a concept. However, additional work and research are necessary before a fully fledged user-driven system can be designed.