Pulling Out the Stops: Rethinking Stopword Removal for Topic Models
Cornell University, Ithaca, NY, USA.
Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning; Linköping University, Faculty of Science & Engineering.
Cornell University, Ithaca, NY, USA.
2017 (English). In: 15th Conference of the European Chapter of the Association for Computational Linguistics: Proceedings of Conference, Volume 2: Short Papers, Stroudsburg: Association for Computational Linguistics (ACL), 2017, Vol. 2, p. 432-436. Conference paper, Published paper (Other academic).
Abstract [en]

It is often assumed that topic models benefit from the use of a manually curated stopword list. Constructing this list is time-consuming and often subject to user judgments about what kinds of words are important to the model and the application. Although stopword removal clearly affects which word types appear as most probable terms in topics, we argue that this improvement is superficial, and that topic inference benefits little from the practice of removing stopwords beyond very frequent terms. Removing corpus-specific stopwords after model inference is more transparent and produces similar results to removing those words prior to inference.
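The post-hoc removal the abstract advocates can be illustrated with a small sketch: fit the model on the full vocabulary, then drop stopwords from each topic's term ranking only at presentation time. The function name, toy vocabulary, and probabilities below are invented for illustration and are not from the paper.

```python
# Illustrative sketch: filtering stopwords from topic-term rankings
# *after* inference, instead of pruning the corpus beforehand.
import numpy as np

STOPWORDS = {"the", "of", "and"}

def top_words(phi, vocab, k=3, exclude=()):
    """Return the k most probable words per topic, skipping `exclude`."""
    keep = [i for i, w in enumerate(vocab) if w not in exclude]
    out = []
    for row in phi:
        ranked = sorted(keep, key=lambda i: row[i], reverse=True)
        out.append([vocab[i] for i in ranked[:k]])
    return out

vocab = ["the", "of", "model", "topic", "and", "inference"]
# Toy topic-word probabilities (rows sum to 1); frequent function
# words dominate the raw rankings, as the abstract notes.
phi = np.array([
    [0.30, 0.20, 0.25, 0.10, 0.10, 0.05],
    [0.25, 0.15, 0.05, 0.30, 0.10, 0.15],
])

print(top_words(phi, vocab))                     # raw: stopwords surface
print(top_words(phi, vocab, exclude=STOPWORDS))  # filtered after inference
```

The inferred distributions are untouched; only the displayed terms change, which is what makes the post-inference approach transparent.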

Place, publisher, year, edition, pages
Stroudsburg: Association for Computational Linguistics (ACL), 2017. Vol. 2, p. 432-436
National Category
Probability Theory and Statistics; General Language Studies and Linguistics; Specific Languages
Identifiers
URN: urn:nbn:se:liu:diva-147612. ISBN: 9781945626357 (print). OAI: oai:DiVA.org:liu-147612. DiVA, id: diva2:1201948
Conference
15th Conference of the European Chapter of the Association for Computational Linguistics, April 3-7, 2017, Valencia, Spain
Available from: 2018-04-27. Created: 2018-04-27. Last updated: 2018-04-27. Bibliographically approved.
In thesis
1. Scalable and Efficient Probabilistic Topic Model Inference for Textual Data
2018 (English). Doctoral thesis, comprehensive summary (Other academic).
Abstract [en]

Probabilistic topic models have proven to be an extremely versatile class of mixed-membership models for discovering the thematic structure of text collections. There are many possible applications, covering a broad range of areas of study: technology, natural science, social science and the humanities.

In this thesis, a new efficient parallel Markov chain Monte Carlo algorithm is proposed for Bayesian inference in large topic models. The proposed methods scale well with corpus size and can be applied to other probabilistic topic models and related natural language processing tasks. They are fast, efficient, scalable, and converge to the true posterior distribution.
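For background on the inference problem the thesis addresses, a minimal serial collapsed Gibbs sampler for LDA can be sketched as below. This is a generic textbook baseline, not the thesis's parallel partially collapsed sampler; the function name, hyperparameters, and toy corpus are all illustrative.

```python
# Generic collapsed Gibbs sampling for LDA (illustrative baseline only).
import numpy as np

rng = np.random.default_rng(0)

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=20):
    """docs: list of lists of word ids in [0, V). Returns topic-word counts."""
    nkw = np.zeros((K, V))           # topic-word counts
    ndk = np.zeros((len(docs), K))   # document-topic counts
    nk = np.zeros(K)                 # per-topic totals
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random init
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            nkw[k, w] += 1; ndk[d, k] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove the token's current assignment
                nkw[k, w] -= 1; ndk[d, k] -= 1; nk[k] -= 1
                # Collapsed conditional p(z_i = k | rest)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                nkw[k, w] += 1; ndk[d, k] += 1; nk[k] += 1
    return nkw

docs = [[0, 1, 2, 0], [2, 3, 3, 1]]  # two toy documents over a 4-word vocab
phi_counts = gibbs_lda(docs, V=4, K=2)
```

The sequential token-by-token updates shown here are exactly what makes naive collapsed Gibbs hard to parallelize, which motivates the partially collapsed approach the thesis proposes.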

In addition, a supervised topic model for high-dimensional text classification is proposed, with emphasis on interpretable document prediction using the horseshoe shrinkage prior.

Finally, we develop a model and inference algorithm that can model agenda and framing of political speeches over time with a priori defined topics. We apply the approach to analyze the evolution of immigration discourse in the Swedish parliament by combining theory from political science and communication science with a probabilistic topic model.

Abstract [sv]

Probabilistiska ämnesmodeller (topic models) är en mångsidig klass av modeller för att estimera ämnessammansättningar i större korpusar. Applikationer finns i ett flertal vetenskapsområden, som teknik, naturvetenskap, samhällsvetenskap och humaniora. I denna avhandling föreslås nya effektiva och parallella Markov chain Monte Carlo-algoritmer för Bayesianska ämnesmodeller. De föreslagna metoderna skalar väl med storleken på korpusen och kan användas för flera olika ämnesmodeller och liknande modeller inom språkteknologi. De föreslagna metoderna är snabba, effektiva, skalbara och konvergerar till den sanna posteriorfördelningen.

Dessutom föreslås en ämnesmodell för högdimensionell textklassificering, med tonvikt på tolkningsbar dokumentklassificering genom att använda en kraftigt regulariserande priorifördelning.

Slutligen utvecklas en ämnesmodell för att analysera "agenda" och "framing" för ett förutbestämt ämne. Med denna metod analyserar vi invandringsdiskursen i Sveriges Riksdag över tid, genom att kombinera teori från statsvetenskap, kommunikationsvetenskap och probabilistiska ämnesmodeller.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2018. p. 53
Series
Linköping Studies in Arts and Sciences, ISSN 0282-9800; 743. Linköping Studies in Statistics, ISSN 1651-1700; 14.
Keywords
Text analysis, Bayesian inference, Markov chain Monte Carlo, topic models, Textanalys, Bayesiansk inferens, Markov chain Monte Carlo, temamodeller
National Category
Probability Theory and Statistics; Language Technology (Computational Linguistics); Computer Sciences
Identifiers
URN: urn:nbn:se:liu:diva-146964. DOI: 10.3384/diss.diva-146964. ISBN: 9789176852880.
Public defence
2018-06-05, Ada Lovelace, hus B, Campus Valla, Linköping, 13:15 (English)
Available from: 2018-04-27. Created: 2018-04-27. Last updated: 2019-09-26. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Authority records

Magnusson, Måns

