Towards a Model of General Text Complexity for Swedish
Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering. (NLPLAB) ORCID iD: 0000-0002-6357-4461
2018 (English) Licentiate thesis, monograph (Other academic)
Abstract [en]

In an increasingly networked world, where the amount of written information is growing at a rate never before seen, the ability to read and absorb written information is of utmost importance for anything but a superficial understanding of life's complexities. That is an example of a sentence which is not very easy to read. It can be said to have a relatively high degree of text complexity. Nevertheless, the sentence is also true. It is important to be able to read and understand written materials. While not everyone might have a job where they have to read a lot, access to written material is necessary in order to participate in modern society. Most information, from news reporting, to medical information, to governmental information, comes primarily in a written form.

But what makes the sentence at the start of this abstract so complex? We can probably all agree that the length is part of it. But then what? Researchers in the field of readability and text complexity analysis have been studying this question for almost 100 years. That research has over time come to include many computational and data-driven methods within the field of computational linguistics.

This thesis covers some of my contributions to this field of research, though with a main focus on Swedish rather than English text. It aims to explore two primary questions: (1) Which linguistic features are most important when assessing text complexity in Swedish? and (2) How can we deal with the problem of data sparsity with regard to complexity-annotated texts in Swedish?

The first issue is tackled by exploring the task of identifying easy-to-read ("lättläst") text using classification with Support Vector Machines. A large set of linguistic features is evaluated with regard to predictive performance and is shown to separate easy-to-read texts from regular texts with very high accuracy. Meanwhile, using a genetic algorithm for variable selection, we find that almost the same accuracy can be reached with only 8 features. This implies that this classification problem is not very hard and that results might not generalize to comparing less easy-to-read texts.
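As a rough illustration of what variable selection with a genetic algorithm can look like, the following minimal sketch evolves boolean feature masks. It is not the thesis's implementation: the function `select_features`, its parameters, and the stand-in `toy_fitness` are all hypothetical. In a setting like the one described above, the fitness would presumably be the classifier's accuracy on the selected features; here a toy score is used so the example runs with the standard library alone.

```python
import random

def select_features(fitness, n_features, pop_size=20, generations=30,
                    mutation_rate=0.05, seed=0):
    """Toy genetic algorithm over boolean feature masks (illustrative only)."""
    rng = random.Random(seed)
    # Each individual is a tuple of booleans: True means the feature is kept.
    population = [tuple(rng.random() < 0.5 for _ in range(n_features))
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]   # truncation selection
        children = [ranked[0]]              # elitism: best mask survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)   # single-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation: each bit is flipped with probability mutation_rate.
            child = tuple(bit != (rng.random() < mutation_rate) for bit in child)
            children.append(child)
        population = children
    return max(population, key=fitness)

# Hypothetical stand-in for classifier accuracy: features 0, 3 and 5 are
# "useful", and every selected feature carries a small cost.
USEFUL = {0, 3, 5}

def toy_fitness(mask):
    hits = sum(1 for i, keep in enumerate(mask) if keep and i in USEFUL)
    return hits - 0.1 * sum(mask)

best = select_features(toy_fitness, n_features=10)
print([i for i, keep in enumerate(best) if keep])
```

The per-feature cost in `toy_fitness` is what pushes the search towards small subsets; with a real classifier, a comparable pressure would have to come from the fitness definition or from a cap on mask size.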

This, in turn, brings us to the second question. Apart from texts labeled as easy-to-read, data with text complexity annotations is very sparse. It consists of multiple small corpora using different scales to label documents. To deal with this problem, we propose a novel statistical model. The model belongs to the larger family of Probit models, is implemented in a Bayesian fashion, and is estimated using a Gibbs sampler that extends a well-established Gibbs sampler for the Ordered Probit model. This model is evaluated using both simulated and real-world readability data with very promising results.
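For background on the Ordered Probit model mentioned above, the following minimal sketch computes label probabilities in the standard textbook formulation: a latent score with unit-variance normal noise is cut into ordinal labels by sorted cutpoints. The names `ordered_probit_probs`, `score`, and `cutpoints` are illustrative assumptions. The sketch covers neither the Gibbs sampler nor the thesis's extension to corpora with different label scales, which the abstract does not detail.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ordered_probit_probs(score, cutpoints):
    """Probability of each ordinal label for a latent mean `score`.

    With sorted cutpoints c_1 < ... < c_{K-1}, c_0 = -inf and c_K = +inf:
    P(y = k) = Phi(c_k - score) - Phi(c_{k-1} - score).
    """
    bounds = [-math.inf] + list(cutpoints) + [math.inf]
    cdf = [normal_cdf(b - score) for b in bounds]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]

# Three labels (for example easy / medium / hard) separated by two cutpoints.
probs = ordered_probit_probs(score=0.0, cutpoints=[-1.0, 1.0])
print(probs)
```

Raising `score` moves probability mass towards the higher labels, which is how a document's latent complexity would map onto an ordinal scale under this formulation.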

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2018, p. 103
Series
Linköping Studies in Science and Technology. Licentiate Thesis, ISSN 0280-7971 ; 1827
National Category
Computer Sciences; Language Technology (Computational Linguistics); Probability Theory and Statistics; Specific Languages
Identifiers
URN: urn:nbn:se:liu:diva-152495
ISBN: 9789176851555 (print)
OAI: oai:DiVA.org:liu-152495
DiVA, id: diva2:1269476
Presentation
2018-12-19, Alan Turing, E-huset, Campus Valla, Linköping, 10:15 (English)
Opponent
Supervisors
Funder
Marcus and Amalia Wallenberg Foundation; VINNOVA
Available from: 2018-12-10 Created: 2018-12-10 Last updated: 2018-12-10 Bibliographically approved

Open Access in DiVA

Cover (22 kB), 7 downloads
File information
File name: COVER01.pdf
File size: 22 kB
Checksum (SHA-512): 51c4aa9600e44ebbd66e8bd62a6187ed146e849d360f22c8e5ac264ac586470f41253e356bd4e076d334a9c624c9603d2d61bd4374a55821d401ce2447b29e73
Type: cover
Mimetype: application/pdf

Author
Falkenjack, Johan
