A Web Corpus for eCare: Collection, Lay Annotation and Learning - First Results
RISE SICS East, Linköping, Sweden.
Linköping University, Department of Computer and Information Science, Human-Centered systems; Linköping University, Faculty of Arts and Sciences; RISE SICS East, Linköping, Sweden. ORCID iD: 0000-0003-4899-588X
Linköping University, Department of Biomedical Engineering; Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0001-6468-2432
Örebro University, Örebro, Sweden.
2017 (English). Conference paper, published paper (refereed).
Abstract [en]

In this position paper, we put forward two claims: 1) it is possible to design a dynamic and extensible corpus without running the risk of scalability problems; 2) it is possible to devise noise-resistant Language Technology applications without degrading performance. To support our claims, we describe the design, construction and limitations of a very specialized medical web corpus, called eCare_Sv_01, and we present two experiments on lay-specialized text classification. eCare_Sv_01 is a small corpus of web documents written in Swedish. The corpus contains documents about chronic diseases. The sublanguage used in each document has been labelled as “lay” or “specialized” by a lay annotator. The corpus is designed as a flexible text resource, where additional medical documents will be appended over time. Experiments show that the lay-specialized labels assigned by the lay annotator are reliably learned by standard classifiers. More specifically, Experiment 1 shows that scalability is not an issue when increasing the size of the datasets to be learned from 156 up to 801 documents. Experiment 2 shows that lay-specialized labels can be learned despite the large number of disturbing factors, such as machine-translated documents or low-quality texts, that are numerous in the corpus.
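The lay-specialized classification task described in the abstract can be sketched as a binary text classifier over bag-of-words features. The paper only says "standard classifiers" were used, so the sketch below assumes a multinomial Naive Bayes model with add-one smoothing; the example documents, labels, and function names are invented for illustration and are not drawn from eCare_Sv_01.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase whitespace tokenization; a real pipeline would do more
    # (Swedish-aware tokenization, normalization, etc.).
    return text.lower().split()

def train_nb(docs):
    # docs: list of (text, label) pairs; labels are "lay" / "specialized".
    word_counts = defaultdict(Counter)   # label -> per-word counts
    label_counts = Counter()             # label -> number of documents
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for w in tokenize(text):
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, label_counts, vocab

def classify(text, word_counts, label_counts, vocab):
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior + log likelihoods with add-one (Laplace) smoothing.
        score = math.log(label_counts[label] / total_docs)
        n_words = sum(word_counts[label].values())
        for w in tokenize(text):
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (invented; English stand-ins for Swedish documents).
docs = [
    ("my doctor says my blood sugar is too high", "lay"),
    ("i feel tired and thirsty all the time", "lay"),
    ("glycaemic control assessed via HbA1c levels", "specialized"),
    ("randomized trial of insulin titration protocols", "specialized"),
]
model = train_nb(docs)
print(classify("HbA1c levels and insulin titration", *model))  # → specialized
```

The same train-then-classify loop can be repeated on growing document sets (e.g., 156 up to 801 documents) to probe the scalability claim from Experiment 1.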

Place, publisher, year, edition, pages
2017.
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:liu:diva-141054
OAI: oai:DiVA.org:liu-141054
DiVA: diva2:1143424
Conference
2nd International Workshop on Language Technologies and Applications (LTA'17), Prague, Czech Republic, 3-6 September, 2017
Available from: 2017-09-21. Created: 2017-09-21. Last updated: 2017-09-27. Bibliographically approved.

Open Access in DiVA

No full text

Authority records

Jönsson, Arne; Nyström, Mikael
