Profiling specialized web corpus qualities: A progress report on "Domainhood"
RISE Research Institutes of Sweden, Gothenburg, Sweden.
Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering. RISE Research Institutes of Sweden, Gothenburg, Sweden.
Linköping University, Department of Computer and Information Science, Human-Centered systems. Linköping University, Faculty of Science & Engineering. RISE Research Institutes of Sweden, Gothenburg, Sweden. ORCID iD: 0000-0003-4899-588X
2019 (English). In: Argentinian Journal of Applied Linguistics, ISSN 0136-006X, E-ISSN 1478-6362, Vol. 7, no 1, p. 8-26. Article in journal (Refereed). Published
Abstract [en]

In this article we describe ways to profile the domain specificity, a.k.a. domainhood, of specialized web corpora in English and in Swedish. Several studies have measured the "qualities" of general-purpose web corpora. In contrast, less attention has been paid to the evaluation of specialized or domain-specific web corpora. To fill this gap, we present case studies in which we explore the effectiveness of several statistical measures for assessing domainhood: rank correlation coefficients (Kendall and Spearman), Kullback–Leibler divergence, log-likelihood and burstiness. Our findings indicate that it is possible to profile the domainhood quality of a corpus. However, further research is needed to generalize the results.
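
The abstract names the measures but not the exact formulas used, so the following is only a minimal sketch of how three of them (Kullback–Leibler divergence, log-likelihood and Spearman rank correlation) are commonly computed over word-frequency lists. The function names, the add-alpha smoothing and the toy token lists are assumptions made for illustration, not the authors' implementation; burstiness and Kendall's tau are left out.

```python
import math
from collections import Counter


def kl_divergence(p_counts, q_counts, alpha=1.0):
    """KL(P || Q) in bits over the joint vocabulary.

    Add-alpha smoothing is an assumption here; it keeps every
    probability non-zero so the logarithm is always defined.
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (p_counts.get(w, 0) + alpha) / p_total
        q = (q_counts.get(w, 0) + alpha) / q_total
        kl += p * math.log2(p / q)
    return kl


def log_likelihood(a, b, total_a, total_b):
    """Two-corpus log-likelihood (G2) for a word occurring a times
    in corpus A (size total_a) and b times in corpus B (size total_b)."""
    e1 = total_a * (a + b) / (total_a + total_b)  # expected count in A
    e2 = total_b * (a + b) / (total_a + total_b)  # expected count in B
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def average_ranks(values):
    """Rank values (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy token lists, invented purely for illustration.
domain = Counter("insulin dose insulin glucose the of the patient".split())
general = Counter("the of the and a the house of dog".split())

vocab = sorted(set(domain) | set(general))
rho = spearman([domain.get(w, 0) for w in vocab],
               [general.get(w, 0) for w in vocab])
kl = kl_divergence(domain, general)
ll_insulin = log_likelihood(domain["insulin"], general["insulin"],
                            sum(domain.values()), sum(general.values()))
```

Under these assumptions, a specialized corpus would be expected to show a larger divergence from a general reference corpus, a weaker rank correlation with it, and high log-likelihood scores for its domain terms; how the article actually interprets and thresholds these values is not stated in the abstract.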

Abstract [es]

En este artículo describimos formas de trazar la especificidad del dominio ("domainhood") de los corpus de webs especializados en inglés y en sueco. Muchos estudios se han llevado a cabo para medir las "cualidades" de los corpus de webs de carácter general. Sin embargo, se ha prestado menos atención a la evaluación de corpus de webs especializados o de dominios específicos. Para llenar este vacío, en este artículo presentamos estudios de caso donde exploramos la efectividad de diferentes medidas estadísticas (a saber, coeficientes de correlación de rango (Kendall y Spearman), divergencia Kullback–Leibler, probabilidad de registro y burstiness) para evaluar la especificidad del dominio. Nuestros resultados indican que es posible perfilar la calidad de dominio de un corpus. Sin embargo, es necesaria una mayor investigación para generalizar los resultados.

Place, publisher, year, edition, pages
KMK Scientific Press Ltd., 2019. Vol. 7, no 1, p. 8-26
Keywords [en]
corpus evaluation; term extraction; log-likelihood; rank correlation; Kullback-Leibler divergence
Keywords [es]
evaluación de corpus; extracción de términos; probabilidad de registro; correlación de rango; divergencia Kullback-Leibler
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:liu:diva-158299
OAI: oai:DiVA.org:liu-158299
DiVA, id: diva2:1332611
Available from: 2019-06-28 Created: 2019-06-28 Last updated: 2019-08-08. Bibliographically approved

Authority records BETA

Strandqvist, Wiktor; Jönsson, Arne