liu.seSearch for publications in DiVA
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • oxford
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Automatic Document Classification Applied to Swedish News
Linköping University, Department of Computer and Information Science.
2005 (English)Independent thesis Basic level (professional degree), 20 points / 30 hpStudent thesis
Abstract [en]

The first part of this paper presents briefly the ELIN[1] system, an electronic newspaper project. ELIN is a framework that stores news and displays them to the end-user. Such news are formatted using the xml[2] format. The project partner Corren[3] provided ELIN with xml articles, however the format used was not the same. My first task has been to develop a software that converts the news from one xml format (Corren) to another (ELIN).

The second and main part addresses the problem of automatic document classification and tries to find a solution for a specific issue. The goal is to automatically classify news articles from a Swedish newspaper company (Corren) into the IPTC[4] news categories.

This work has been carried out by implementing several classification algorithms, testing them and comparing their accuracy with existing software. The training and test documents were 3 weeks of the Corren newspaper that had to be classified into 2 categories.

The last tests were run with only one algorithm (Naïve Bayes) over a larger amount of data (7, then 10 weeks) and categories (12) to simulate a more real environment.

The results show that the Naïve Bayes algorithm, although the oldest, was the most accurate in this particular case. An issue raised by the results is that feature selection improves speed but can seldom reduce accuracy by removing too many features.

Place, publisher, year, edition, pages
Institutionen för datavetenskap , 2005. , 86 p.
Keyword [en]
ELIN, automatic text classification, Naïve Bayes network, Rocchio
Identifiers
URN: urn:nbn:se:liu:diva-3065ISRN: LITH-IDA-EX--05/038--SEOAI: oai:DiVA.org:liu-3065DiVA: diva2:20343
Supervisors
Examiners
Available from: 2005-08-15 Created: 2005-08-15

Open Access in DiVA

fulltext(650 kB)409 downloads
File information
File name FULLTEXT01.pdfFile size 650 kBChecksum SHA-1
b1c79d7b1efe0f067ce76f5e5e31ff025a07b812cffab60ba3ce38178ff4b03992be4db9
Type fulltextMimetype application/pdf

By organisation
Department of Computer and Information Science

Search outside of DiVA

GoogleGoogle Scholar
Total: 409 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

urn-nbn

Altmetric score

urn-nbn
Total: 378 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • oxford
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf