Automatic Classification of Open-Ended Questions: Check-All-That-Apply Questions
Univ Waterloo, Canada.
Western Univ, Canada.
Linköping University, Faculty of Medicine and Health Sciences. Linköping University, Department of Health, Medicine and Caring Sciences, Division of Society and Health. Region Östergötland, Regionledningskontoret, Enheten för folkhälsa. ORCID iD: 0000-0002-6281-7783
2021 (English). In: Social science computer review, ISSN 0894-4393, E-ISSN 1552-8286, Vol. 36, no 4, p. 562-572. Article in journal (Refereed). Published.
Abstract [en]

Text data from open-ended questions in surveys are challenging to analyze and are often ignored. Open-ended questions are nevertheless important because they do not constrain respondents' answers. Where open-ended questions are necessary, human coders often code answers manually. When data sets are large, it is impractical or too costly to manually code all answer texts. Instead, text answers can be converted into numerical variables, and a statistical/machine learning algorithm can be trained on a subset of manually coded data. This statistical model is then used to predict the codes of the remainder. We consider open-ended questions where the answers are coded into multiple labels (all-that-apply questions). For example, in the open-ended question in our Happy example, respondents are explicitly told they may list multiple things that make them happy. Algorithms for multilabel data take into account the correlation among the answer codes and may therefore give better prediction results. For example, when giving examples of civil disobedience, respondents talking about "minor nonviolent offenses" were also likely to talk about "crimes." We compare the performance of two different multilabel algorithms (random k-labelsets [RAKEL], classifier chains [CC]) to the default method of binary relevance (BR), which applies single-label algorithms to each code separately. Performance is evaluated on data from three open-ended questions (Happy, Civil Disobedience, and Immigrant). We found weak bivariate label correlations in the Happy data (90th percentile: 7.6%), and stronger bivariate label correlations in the Civil Disobedience (90th percentile: 17.2%) and Immigrant (90th percentile: 19.2%) data. For the data with stronger correlations, we found that both multilabel methods performed substantially better than BR under 0/1 loss ("at least one label is incorrect") and had little effect under Hamming loss (average error). For data with weak label correlations, we found no difference in performance between multilabel methods and BR. We conclude that automatic classification of open-ended questions that allow multiple answers may benefit from using multilabel algorithms under 0/1 loss. The degree of correlation among the labels may be a useful prognostic tool.
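The comparison the abstract describes can be illustrated in code. The following is a minimal scikit-learn sketch, not the authors' implementation: it contrasts binary relevance (one independent classifier per label) with a classifier chain (each classifier also conditions on earlier labels, so label correlations can be exploited) on synthetic multilabel data, and evaluates both under Hamming loss and subset 0/1 loss. RAKEL is omitted because it is not part of scikit-learn (an implementation exists in the scikit-multilearn package).

```python
# Sketch: binary relevance (BR) vs. classifier chains (CC) on synthetic
# multilabel data. Illustrative only -- the paper's survey data, text
# features, and base learners are not reproduced here.
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import hamming_loss, zero_one_loss
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier

# Synthetic stand-in for coded open-ended answers: 5 correlated labels.
X, Y = make_multilabel_classification(
    n_samples=500, n_features=20, n_classes=5, random_state=0
)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

# Binary relevance: fits one logistic regression per label, independently.
br = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)

# Classifier chain: label k's classifier also sees labels 1..k-1,
# so correlations among answer codes can improve joint predictions.
cc = ClassifierChain(LogisticRegression(max_iter=1000), random_state=0).fit(
    X_tr, Y_tr
)

for name, model in [("BR", br), ("CC", cc)]:
    P = model.predict(X_te)
    # Hamming loss: average per-label error rate.
    # 0/1 (subset) loss: fraction of respondents with >= 1 wrong label.
    print(
        name,
        "Hamming:", round(hamming_loss(Y_te, P), 3),
        "0/1:", round(zero_one_loss(Y_te, P), 3),
    )
```

As in the paper's findings, differences between BR and CC tend to show up under subset 0/1 loss (which penalizes any error on a respondent) rather than Hamming loss, and only when the labels are correlated.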

Place, publisher, year, edition, pages
Sage Publications, 2021. Vol. 36, no 4, p. 562-572
Keywords [en]
open-ended questions; multilabel; check-all-that-apply; machine learning; statistical learning; text
National Category
Information Systems
Identifiers
URN: urn:nbn:se:liu:diva-160426
DOI: 10.1177/0894439319869210
ISI: 000483411100001
OAI: oai:DiVA.org:liu-160426
DiVA, id: diva2:1353428
Note

Funding Agencies|Social Sciences and Humanities Research Council of Canada (SSHRC) [435-2013-0128]

Available from: 2019-09-23 Created: 2019-09-23 Last updated: 2022-04-26

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text

Authority records

Wenemark, Marika
