Can We Quantify Domainhood?: Exploring Measures to Assess Domain-Specificity in Web Corpora
RISE SICS Linköping, Sweden.
Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering. RISE Research Institutes of Sweden AB.
Linköping University, Department of Biomedical Engineering, Division of Biomedical Engineering. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0001-6468-2432
Örebro University, Örebro, Sweden.
2018 (English). In: Communications in Computer and Information Science, vol 903. Springer, Cham / [ed] Elloumi M. et al., 2018, Vol. 903. Conference paper, Published paper (Refereed)
Abstract [en]

Web corpora are a cornerstone of modern Language Technology. Corpora built from the web are convenient because their creation is fast and inexpensive. Several studies have been carried out to assess the representativeness of general-purpose web corpora by comparing them to traditional corpora. Less attention has been paid to assessing the representativeness of specialized or domain-specific web corpora. In this paper, we focus on the assessment of domain representativeness of web corpora and we claim that it is possible to assess the degree of domain-specificity, or domainhood, of web corpora. We present a case study where we explore the effectiveness of different measures (namely the Mann-Whitney-Wilcoxon test, the Kendall correlation coefficient, Kullback–Leibler divergence, log-likelihood and burstiness) to gauge domainhood. Our findings indicate that burstiness is the most suitable measure to single out domain-specific words from a specialized corpus and to allow for the quantification of domainhood.
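
As a rough illustration of two of the measures named in the abstract, the sketch below computes a burstiness score and a Kullback–Leibler divergence over tokenized text. It is not taken from the paper: the function names, the add-alpha smoothing, and the burstiness definition used here (average count of a word over the documents that contain it, one common formulation) are assumptions, and the authors' exact formulas may differ.

```python
import math
from collections import Counter


def burstiness(word, documents):
    """Average count of `word` over the documents that contain it.

    Assumed definition (one common formulation); `documents` is a list
    of token lists. Returns 0.0 if the word occurs nowhere.
    """
    counts = [doc.count(word) for doc in documents]
    containing = [c for c in counts if c > 0]
    if not containing:
        return 0.0
    return sum(containing) / len(containing)


def kl_divergence(p_tokens, q_tokens, alpha=1.0):
    """KL(P || Q) in bits between two unigram distributions.

    Add-alpha smoothing (an assumption of this sketch) keeps every
    probability non-zero so the logarithm is always defined.
    """
    p_counts, q_counts = Counter(p_tokens), Counter(q_tokens)
    vocab = set(p_counts) | set(q_counts)
    p_total = len(p_tokens) + alpha * len(vocab)
    q_total = len(q_tokens) + alpha * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (p_counts[w] + alpha) / p_total
        q = (q_counts[w] + alpha) / q_total
        kl += p * math.log2(p / q)
    return kl


# Hypothetical toy data: three tiny "documents" from a specialized corpus.
docs = [["insulin", "dose", "insulin"], ["the", "dose"], ["insulin"]]
print(burstiness("insulin", docs))
print(kl_divergence(["insulin", "dose"], ["the", "of", "dose"]))
```

Under these assumptions, a word that clusters heavily in the few documents where it appears gets a high burstiness score, while the KL divergence grows as the word distribution of the specialized corpus moves away from that of a reference corpus.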

Place, publisher, year, edition, pages
2018. Vol. 903
Series
Communications in Computer and Information Science, ISSN 1865-0929, E-ISSN 1865-0937 ; 903
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:liu:diva-151423
DOI: 10.1007/978-3-319-99133-7_17
ISI: 000460552400017
ISBN: 978-3-319-99132-0 (print)
ISBN: 978-3-319-99133-7 (electronic)
OAI: oai:DiVA.org:liu-151423
DiVA, id: diva2:1249836
Conference
Database and Expert Systems Applications. DEXA 2018.
Note

Funding agencies: E-care@home, a "SIDUS - Strong Distributed Research Environment" project - Swedish Knowledge Foundation

Available from: 2018-09-20 Created: 2018-09-20 Last updated: 2025-02-07

Open Access in DiVA

fulltext (360 kB), 319 downloads
File information
File name: FULLTEXT01.pdf
File size: 360 kB
Checksum SHA-512:
0b032d9449364669e690964e17048cfe64489d8210db1df5113ee4e1954e97e5805105bd9851fe5aecd8db9e8639d1ccc05542965587c651e7afeb1785c59b2e
Type: fulltext
Mimetype: application/pdf

Authority records

Nyström, Mikael; Jönsson, Arne

Search in DiVA

By author/editor
Strandqvist, Wiktor; Nyström, Mikael; Jönsson, Arne
By organisation
Human-Centered systems; Faculty of Science & Engineering; Division of Biomedical Engineering
Natural Language Processing
