Similarity-Based Alignment of Monolingual Corpora for Text Simplification
Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering. SICS East Swedish ICT AB, Linköping, Sweden. (NLPLAB)
Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering. SICS East Swedish ICT AB, Linköping, Sweden. (NLPLAB) ORCID iD: 0000-0002-0932-7048
Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Arts and Sciences. SICS East Swedish ICT AB, Linköping, Sweden. ORCID iD: 0000-0003-4899-588X
2016 (English). In: CL4LC 2016 - Computational Linguistics for Linguistic Complexity: Proceedings of the Workshop, 2016, pp. 154-163. Conference paper, Published paper (Refereed)
Abstract [en]

Comparable or parallel corpora are beneficial for many NLP tasks. The automatic collection of corpora enables large-scale resources, even for less-resourced languages, which in turn can be useful for deducing rules and patterns for text rewriting algorithms, a subtask of automatic text simplification. We present two methods for the alignment of Swedish easy-to-read text segments to text segments from a reference corpus. The first method (M1) was originally developed for the task of text reuse detection, measuring sentence similarity by a modified version of a TF-IDF vector space model. A second method (M2), also accounting for part-of-speech tags, was developed, and the methods were compared. For evaluation, a crowdsourcing platform was built for human judgement data collection, and preliminary results showed that cosine similarity relates better to human ranks than the Dice coefficient. We also saw a tendency that adding syntactic context to the TF-IDF vector space model is beneficial for this kind of paraphrase alignment task.
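
To make the two similarity measures named in the abstract concrete, here is a minimal Python sketch of TF-IDF cosine similarity and the Dice coefficient between text segments. It is not the authors' M1 or M2: the abstract describes a modified TF-IDF model and part-of-speech features, neither of which is specified here. The whitespace tokenizer, the smoothed IDF formula, the function names, and the English toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def idf_table(segments):
    # Smoothed inverse document frequency over a list of text segments.
    # Assumption: this smoothing is a common textbook variant, not the paper's.
    n = len(segments)
    df = Counter()
    for seg in segments:
        df.update(set(tokenize(seg)))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf(text, idf):
    # Sparse TF-IDF vector as a dict: term -> raw count * idf weight.
    tf = Counter(tokenize(text))
    return {t: c * idf.get(t, 0.0) for t, c in tf.items()}


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def dice(a, b):
    # Dice coefficient on the sets of distinct tokens in two segments.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def best_match(easy, reference):
    # Index and score of the reference segment most similar to `easy`
    # under TF-IDF cosine similarity.
    idf = idf_table(reference + [easy])
    query = tfidf(easy, idf)
    scores = [cosine(query, tfidf(r, idf)) for r in reference]
    i = max(range(len(reference)), key=scores.__getitem__)
    return i, scores[i]
```

Calling `best_match(easy_segment, reference_segments)` on a small list of strings returns the position of the closest reference segment and its cosine score, while `dice` can be applied to the same pair for comparison. A real alignment over Swedish corpora would need proper tokenization and a similarity threshold, which this sketch leaves out.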

Place, publisher, year, edition, pages
2016. 154-163 p.
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:liu:diva-133782
ISBN: 9784879747099 (electronic)
OAI: oai:DiVA.org:liu-133782
DiVA: diva2:1063075
Conference
Coling 2016, Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), Osaka, Japan, Sunday, 11 December 2016
Available from: 2017-01-09. Created: 2017-01-09. Last updated: 2017-09-29. Bibliographically approved.

By author/editor
Albertsson, Sarah; Rennes, Evelina; Jönsson, Arne