Scalable and Efficient Probabilistic Topic Model Inference for Textual Data
Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Arts and Sciences.
2018 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Probabilistic topic models have proven to be an extremely versatile class of mixed-membership models for discovering the thematic structure of text collections. There are many possible applications, covering a broad range of areas of study: technology, natural science, social science and the humanities.

In this thesis, a new efficient parallel Markov Chain Monte Carlo inference algorithm is proposed for Bayesian inference in large topic models. The proposed methods scale well with the corpus size and can be used for other probabilistic topic models and other natural language processing applications. The proposed methods are fast, efficient, scalable, and will converge to the true posterior distribution.

In addition, a supervised topic model for high-dimensional text classification is proposed, with emphasis on interpretable document prediction using the horseshoe shrinkage prior.
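
For readers unfamiliar with the horseshoe shrinkage prior mentioned above, its standard textbook form for regression coefficients is sketched below. The symbols (coefficients \beta_j, local scales \lambda_j, global scale \tau) are assumptions introduced here for illustration; the abstract does not give the exact parameterization used in the thesis, which may differ.

```latex
% Standard horseshoe prior (illustrative notation, not taken from the thesis)
\begin{align}
  \beta_j \mid \lambda_j, \tau &\sim \mathcal{N}\!\left(0,\ \lambda_j^{2}\,\tau^{2}\right), \\
  \lambda_j &\sim \mathrm{C}^{+}(0, 1), \\
  \tau &\sim \mathrm{C}^{+}(0, 1),
\end{align}
% where C^+(0, 1) denotes the standard half-Cauchy distribution.
```

The heavy-tailed local scales let a few coefficients stay large while the global scale pulls the rest toward zero. This is what makes the prior attractive for interpretable prediction with many candidate features.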

Finally, we develop a model and inference algorithm that can model agenda and framing of political speeches over time with a priori defined topics. We apply the approach to analyze the evolution of immigration discourse in the Swedish parliament by combining theory from political science and communication science with a probabilistic topic model.

Abstract [sv] (translated to English)

Probabilistic topic models are a versatile class of models for estimating the topic composition of large corpora. Applications are found in several scientific fields, such as technology, natural science, social science and the humanities. In this thesis, new efficient and parallel Markov chain Monte Carlo algorithms are proposed for Bayesian topic models. The proposed methods scale well with the size of the corpus and can be used for several different topic models and for similar models in language technology. The proposed methods are fast, efficient and scalable, and they converge to the true posterior distribution.

In addition, a topic model for high-dimensional text classification is proposed, with emphasis on interpretable document classification through the use of a strongly regularizing prior distribution.

Finally, a topic model is developed for analyzing "agenda" and "framing" for a predefined topic. With this method we analyze the immigration discourse in the Swedish parliament (Riksdag) over time, by combining theory from political science and communication science with probabilistic topic models.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2018, p. 53
Series
Linköping Studies in Arts and Sciences, ISSN 0282-9800 ; 743
Linköping Studies in Statistics, ISSN 1651-1700 ; 14
Keywords [en]
Text analysis, Bayesian inference, Markov chain Monte Carlo, topic models
Keywords [sv]
Text analysis, Bayesian inference, Markov chain Monte Carlo, topic models (translated from Swedish)
National Category
Probability Theory and Statistics; Language Technology (Computational Linguistics); Computer Sciences
Identifiers
URN: urn:nbn:se:liu:diva-146964; DOI: 10.3384/diss.diva-146964; ISBN: 9789176852880 (print); OAI: oai:DiVA.org:liu-146964; DiVA, id: diva2:1201965
Public defence
2018-06-05, Ada Lovelace, hus B, Campus Valla, Linköping, 13:15 (English)
Opponent
Supervisors
Available from: 2018-04-27. Created: 2018-04-27. Last updated: 2019-09-26. Bibliographically approved
List of papers
1. Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models
2018 (English). In: Journal of Computational and Graphical Statistics, ISSN 1061-8600, E-ISSN 1537-2715, Vol. 27, no 2, p. 449-463. Article in journal (Refereed), Published
Abstract [en]

Topic models, and more specifically the class of Latent Dirichlet Allocation (LDA), are widely used for probabilistic modeling of text. MCMC sampling from the posterior distribution is typically performed using a collapsed Gibbs sampler. We propose a parallel sparse partially collapsed Gibbs sampler and compare its speed and efficiency to state-of-the-art samplers for topic models on five well-known text corpora of differing sizes and properties. In particular, we propose and compare two different strategies for sampling the parameter block with latent topic indicators. The experiments show that the increase in statistical inefficiency from only partial collapsing is smaller than commonly assumed, and can be more than compensated by the speedup from parallelization and sparsity on larger corpora. We also prove that the partially collapsed samplers scale well with the size of the corpus. The proposed algorithm is fast, efficient, exact, and can be used in more modeling situations than the ordinary collapsed sampler.
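
The idea of partial collapsing described in this abstract can be illustrated with a minimal, single-threaded sketch. It assumes the standard LDA setup: the document-topic proportions are integrated out, while the topic-word matrix (called `phi` here) is sampled explicitly. The function name, argument names and hyperparameter defaults are hypothetical, and the sparsity and parallelization that the paper relies on for speed are not implemented. This is not the authors' code.

```python
import numpy as np

def partially_collapsed_gibbs(docs, n_topics, vocab_size,
                              alpha=0.1, beta=0.1, n_iter=50, seed=0):
    """Toy partially collapsed Gibbs sampler for LDA (illustrative only).

    docs: list of documents, each a list of integer word ids.
    """
    rng = np.random.default_rng(seed)
    # Random initial topic indicator for every token.
    z = [rng.integers(n_topics, size=len(d)) for d in docs]
    doc_topic = np.zeros((len(docs), n_topics), dtype=np.int64)
    topic_word = np.zeros((n_topics, vocab_size), dtype=np.int64)
    for d, words in enumerate(docs):
        for w, k in zip(words, z[d]):
            doc_topic[d, k] += 1
            topic_word[k, w] += 1

    phi = None
    for _ in range(n_iter):
        # Step 1: sample phi | z. Each topic row is an independent Dirichlet draw.
        phi = np.vstack([rng.dirichlet(beta + topic_word[k])
                         for k in range(n_topics)])
        # Step 2: sample z | phi. Given phi, documents are conditionally
        # independent, so this outer loop could run in parallel.
        topic_word[:] = 0
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                doc_topic[d, z[d][i]] -= 1           # remove current token
                p = phi[:, w] * (doc_topic[d] + alpha)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                doc_topic[d, k] += 1
                topic_word[k, w] += 1                # rebuild counts for step 1
    return phi, doc_topic, topic_word
```

Because step 2 conditions on `phi` rather than on global word counts, no state is shared between documents during the sweep. That is the property that allows the indicator block to be sampled in parallel.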

Place, publisher, year, edition, pages
Taylor & Francis, 2018
Keywords
Bayesian inference, Gibbs sampling, Latent Dirichlet Allocation, Massive Data Sets, Parallel Computing, Computational complexity
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:liu:diva-140872 (URN); 10.1080/10618600.2017.1366913 (DOI); 000435688200018 ()
Funder
Swedish Foundation for Strategic Research, SSFRIT 15-0097
Available from: 2017-09-13. Created: 2017-09-13. Last updated: 2018-07-20. Bibliographically approved
2. Automatic Localization of Bugs to Faulty Components in Large Scale Software Systems using Bayesian Classification
2016 (English). In: 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS 2016), IEEE, 2016, p. 425-432. Conference paper, Published paper (Refereed)
Abstract [en]

We suggest a Bayesian approach to the problem of reducing bug turnaround time in large software development organizations. Our approach is to use classification to predict which component a bug is located in. This classification is a form of automatic fault localization (AFL) at the component level. The approach relies only on historical bug reports and does not require detailed analysis of source code or detailed test runs. Our approach addresses two problems identified in user studies of AFL tools. The first problem concerns the trust that the user can put in the results of the tool. The second problem concerns understanding how the results were computed. The proposed model quantifies the uncertainty in its predictions and in all estimated model parameters. Additionally, the output of the model explains why a result was suggested. We evaluate the approach on more than 50,000 bugs.

Place, publisher, year, edition, pages
IEEE, 2016
Keywords
Machine Learning; Fault Detection; Fault Location; Software Maintenance; Software Debugging; Software Engineering
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-132879 (URN); 10.1109/QRS.2016.54 (DOI); 000386751700044 (); 978-1-5090-4127-5 (ISBN)
Conference
IEEE International Conference on Software Quality, Reliability and Security (QRS)
Available from: 2016-12-06. Created: 2016-11-30. Last updated: 2018-05-17
3. Pulling Out the Stops: Rethinking Stopword Removal for Topic Models
2017 (English). In: 15th Conference of the European Chapter of the Association for Computational Linguistics: Proceedings of Conference, volume 2: Short Papers, Stroudsburg: Association for Computational Linguistics (ACL), 2017, Vol. 2, p. 432-436. Conference paper, Published paper (Other academic)
Abstract [en]

It is often assumed that topic models benefit from the use of a manually curated stopword list. Constructing this list is time-consuming and often subject to user judgments about what kinds of words are important to the model and the application. Although stopword removal clearly affects which word types appear as most probable terms in topics, we argue that this improvement is superficial, and that topic inference benefits little from the practice of removing stopwords beyond very frequent terms. Removing corpus-specific stopwords after model inference is more transparent and produces similar results to removing those words prior to inference.
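
The post-inference alternative described in this abstract amounts to filtering stopwords out of each topic's most probable terms after the model has been fitted. A minimal sketch follows, assuming a fitted topic-word matrix is already available. The function name `top_words` and its arguments are hypothetical and do not come from the paper.

```python
import numpy as np

def top_words(topic_word, vocab, n_top=5, stopwords=frozenset()):
    """Return the n_top most probable non-stopword terms for each topic.

    topic_word: array of shape (n_topics, vocab_size) with counts or probabilities.
    vocab: list mapping word id to word string.
    stopwords: words to hide from the display; the fitted model is left unchanged.
    """
    result = []
    for row in np.asarray(topic_word):
        order = np.argsort(-row, kind="stable")  # most probable first
        words = [vocab[i] for i in order if vocab[i] not in stopwords]
        result.append(words[:n_top])
    return result

# Toy example: two topics over a four-word vocabulary.
vocab = ["the", "model", "court", "of"]
topic_word = [[10, 8, 1, 5],
              [9, 1, 7, 6]]
print(top_words(topic_word, vocab, n_top=2, stopwords={"the", "of"}))
# prints [['model', 'court'], ['court', 'model']]
```

Since the filter only affects what is displayed, the stopword list can be revised without refitting the model.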

Place, publisher, year, edition, pages
Stroudsburg: Association for Computational Linguistics (ACL), 2017
National Category
Probability Theory and Statistics; General Language Studies and Linguistics; Specific Languages
Identifiers
urn:nbn:se:liu:diva-147612 (URN); 9781945626357 (ISBN)
Conference
15th Conference of the European Chapter of the Association for Computational Linguistics: Proceedings of Conference, volume 2: Short Papers, April 3-7, 2017, Valencia, Spain
Available from: 2018-04-27. Created: 2018-04-27. Last updated: 2018-04-27. Bibliographically approved

Open Access in DiVA

Scalable and Efficient Probabilistic Topic Model Inference for Textual Data (897 kB), 363 downloads
File information
File name: FULLTEXT01.pdf; File size: 897 kB; Checksum (SHA-512):
5fabba9272a35e44045216eb5714ea35c8b22c896ae76cbc47b06b03bc73da6b2ceb6056f82e2d4b207132e35999a60b14d4fc415fd5f4776c266a2e7108e410
Type: fulltext; Mimetype: application/pdf
Cover (89 kB), 17 downloads
File information
File name: COVER01.pdf; File size: 89 kB; Checksum (SHA-512):
5c34767d407eed5f73d12a0d9b5480b9e1e2cc83671b3f9df01fde3b2fc165418fd702bf61dd350ef0ffc26f550d68cf20da622d152d8efd5ea242c39364598f
Type: cover; Mimetype: application/pdf
Author
Magnusson, Måns
