Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models
Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering.
Linköping University, Department of Computer and Information Science. Linköping University, Faculty of Science & Engineering.
Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning. Linköping University, Faculty of Science & Engineering. (STIMA)
School of Information and Communication Technology, Royal Institute of Technology KTH, Stockholm, Sweden.
Number of authors: 4
2018 (English). In: Journal of Computational and Graphical Statistics, ISSN 1061-8600, E-ISSN 1537-2715, Vol. 27, no. 2, pp. 449-463. Journal article (Refereed). Published
Abstract [en]

Topic models, and more specifically the class of Latent Dirichlet Allocation (LDA), are widely used for probabilistic modeling of text. MCMC sampling from the posterior distribution is typically performed using a collapsed Gibbs sampler. We propose a parallel sparse partially collapsed Gibbs sampler and compare its speed and efficiency to state-of-the-art samplers for topic models on five well-known text corpora of differing sizes and properties. In particular, we propose and compare two different strategies for sampling the parameter block with latent topic indicators. The experiments show that the increase in statistical inefficiency from only partial collapsing is smaller than commonly assumed, and can be more than compensated by the speedup from parallelization and sparsity on larger corpora. We also prove that the partially collapsed samplers scale well with the size of the corpus. The proposed algorithm is fast, efficient, exact, and can be used in more modeling situations than the ordinary collapsed sampler.
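
The idea of partial collapsing described in the abstract can be illustrated with a minimal serial sketch. It assumes the standard LDA setup with symmetric Dirichlet hyperparameters `alpha` and `beta` (both values chosen here only for illustration), integrates out the document-topic proportions, and alternates between drawing the topic-word matrix `phi` given the topic indicators `z` and drawing `z` given `phi`. The function names are hypothetical, and the sketch deliberately omits the parallelization and sparsity that the paper's sampler relies on; it is not the authors' implementation.

```python
import random


def sample_dirichlet(params, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]


def partially_collapsed_gibbs(docs, n_topics, vocab_size,
                              alpha=0.5, beta=0.1, n_iter=50, seed=0):
    """Serial partially collapsed Gibbs sampler for LDA (illustrative only).

    docs is a list of documents, each a list of integer word ids.
    Returns the final topic indicators z and the last draw of phi.
    """
    rng = random.Random(seed)
    # Random initial topic indicator for every token.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    n_dk = [[0] * n_topics for _ in docs]                 # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(n_topics)]    # topic-word counts
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            n_dk[d][z[d][i]] += 1
            n_kw[z[d][i]][w] += 1

    phi = None
    for _ in range(n_iter):
        # Step 1: phi | z. Each topic's row is an independent Dirichlet draw.
        phi = [sample_dirichlet([c + beta for c in n_kw[k]], rng)
               for k in range(n_topics)]

        # Step 2: z | phi. Given phi, documents do not depend on each other,
        # which is what would allow this loop to run in parallel over documents.
        new_n_kw = [[0] * vocab_size for _ in range(n_topics)]
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                n_dk[d][z[d][i]] -= 1  # remove the current token from the counts
                weights = [phi[k][w] * (n_dk[d][k] + alpha)
                           for k in range(n_topics)]
                k_new = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k_new
                n_dk[d][k_new] += 1
                new_n_kw[k_new][w] += 1
        n_kw = new_n_kw
    return z, phi


# Tiny toy corpus: three documents over a six-word vocabulary.
toy_docs = [[0, 1, 2, 0], [3, 4, 5, 3], [0, 1, 0, 2]]
toy_z, toy_phi = partially_collapsed_gibbs(toy_docs, n_topics=2, vocab_size=6)
```

Because `phi` is held fixed while `z` is drawn, the topic-word counts only need to be rebuilt once per sweep, in contrast to a fully collapsed sampler where every token update changes a shared global count.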

Place, publisher, year, edition, pages
Taylor & Francis, 2018. Vol. 27, no. 2, pp. 449-463
Keywords [en]
Bayesian inference, Gibbs sampling, Latent Dirichlet Allocation, Massive Data Sets, Parallel Computing, Computational complexity
National Category
Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:liu:diva-140872; DOI: 10.1080/10618600.2017.1366913; ISI: 000435688200018; OAI: oai:DiVA.org:liu-140872; DiVA, id: diva2:1141079
Funder
Swedish Foundation for Strategic Research (SSF), SSFRIT 15-0097
Available from: 2017-09-13 Created: 2017-09-13 Last updated: 2018-07-20 Bibliographically approved
Part of thesis
1. Scalable and Efficient Probabilistic Topic Model Inference for Textual Data
2018 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Probabilistic topic models have proven to be an extremely versatile class of mixed-membership models for discovering the thematic structure of text collections. There are many possible applications, covering a broad range of areas of study: technology, natural science, social science and the humanities.

In this thesis, a new efficient parallel Markov Chain Monte Carlo inference algorithm is proposed for Bayesian inference in large topic models. The proposed methods scale well with the corpus size and can be used for other probabilistic topic models and other natural language processing applications. The proposed methods are fast, efficient, scalable, and will converge to the true posterior distribution.

In addition, a supervised topic model for high-dimensional text classification is proposed, with emphasis on interpretable document prediction using the horseshoe shrinkage prior.

Finally, we develop a model and inference algorithm that can model agenda and framing of political speeches over time with a priori defined topics. We apply the approach to analyze the evolution of immigration discourse in the Swedish parliament by combining theory from political science and communication science with a probabilistic topic model.

Abstract [sv] (translated)

Probabilistic topic models are a versatile class of models for estimating the topic composition of large corpora. Applications are found in several fields of science, such as technology, natural science, social science and the humanities. In this thesis, new efficient and parallel Markov Chain Monte Carlo algorithms are proposed for Bayesian topic models. The proposed methods scale well with the size of the corpus and can be used for several different topic models and similar models in language technology. The proposed methods are fast, efficient, scalable, and converge to the true posterior distribution.

In addition, a topic model for high-dimensional text classification is proposed, with emphasis on interpretable document classification through the use of a strongly regularizing prior distribution.

Finally, a topic model is developed for analyzing the "agenda" and "framing" of a predetermined topic. With this method we analyze the immigration discourse in the Swedish parliament (Riksdag) over time, by combining theory from political science, communication science and probabilistic topic models.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2018. p. 53
Series
Linköping Studies in Arts and Sciences, ISSN 0282-9800 ; 743
Linköping Studies in Statistics, ISSN 1651-1700 ; 14
Keywords
Text analysis, Bayesian inference, Markov chain Monte Carlo, topic models
National Category
Probability Theory and Statistics; Language Technology (Computational Linguistics); Computer Sciences
Identifiers
urn:nbn:se:liu:diva-146964 (URN); 10.3384/diss.diva-146964 (DOI); 9789176852880 (ISBN)
Public defence
2018-06-05, Ada Lovelace, Building B, Campus Valla, Linköping, 13:15 (English)
Opponent
Supervisors
Available from: 2018-04-27 Created: 2018-04-27 Last updated: 2019-09-26 Bibliographically approved
2. Machine Learning-Based Bug Handling in Large-Scale Software Development
2018 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This thesis investigates the possibilities of automating parts of the bug handling process in large-scale software development organizations. The bug handling process is a large part of the mostly manual, and very costly, maintenance of software systems. Automating parts of this time-consuming and very laborious process could save large amounts of time and effort wasted on dealing with bug reports. In this thesis we focus on two aspects of the bug handling process: bug assignment and fault localization. Bug assignment is the process of assigning a newly registered bug report to a design team or developer. Fault localization is the process of finding where in a software architecture the fault causing the bug report should be solved. The main reason these tasks are not automated is that they are considered hard to automate, requiring human expertise and creativity. This thesis examines the possibility of using machine learning techniques for automating at least parts of these processes. We call these automated techniques Automated Bug Assignment (ABA) and Automatic Fault Localization (AFL), respectively. We treat both of these problems as classification problems. In ABA, the classes are the design teams in the development organization. In AFL, the classes consist of the software components in the software architecture. We focus on a high-level fault localization that is suitable to integrate into the initial support flow of large software development organizations.

The thesis consists of six papers that investigate different aspects of the AFL and ABA problems. The first two papers are empirical and exploratory in nature, examining the ABA problem using existing machine learning techniques but introducing ensembles into the ABA context. In the first paper we show that, as in many other contexts, ensembles such as the stacked generalizer (or stacking) improve classification accuracy compared to individual classifiers when evaluated using cross fold validation. The second paper thoroughly explores many aspects of the ABA problem in the context of stacking, such as training set size, age of bug reports and different types of evaluation. The second paper also expands upon the first in the number of industry bug reports, roughly 50,000, drawn from two large-scale industry software development contexts. It is still, as far as we are aware, the largest study on real industry data on this topic to date. The third and sixth papers are theoretical, improving inference in a now classic machine learning technique for topic modeling called Latent Dirichlet Allocation (LDA). We show that, unlike the currently dominating approximate approaches, we can do parallel inference in the LDA model with a mathematically correct algorithm, without sacrificing efficiency or speed. The approaches are evaluated on standard research datasets, measuring various aspects such as sampling efficiency and execution time. Paper four, also theoretical, then builds upon the LDA model and introduces a novel supervised Bayesian classification model that we call DOLDA. The DOLDA model deals with textual content together with structured numeric and nominal inputs in the same model. The approach is evaluated on a new data set extracted from IMDb that contains both nominal and textual data. The model is evaluated using two approaches: first, by accuracy, using cross fold validation; second, by comparing the simplicity of the final model with that of other approaches. In paper five we empirically study the performance, in terms of prediction accuracy, of the DOLDA model applied to the AFL problem. The DOLDA model was designed with the AFL problem in mind, since that problem has exactly this structure: a mix of nominal and numeric inputs in combination with unstructured text. We show that our DOLDA model exhibits many nice properties, among them interpretability, that the research community has identified as missing in current models for AFL.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2018. p. 120
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 1936
Keywords
machine learning, bug reports, large scale software development
National Category
Engineering and Technology; Software Engineering
Identifiers
urn:nbn:se:liu:diva-147059 (URN); 10.3384/diss.diva-147059 (DOI); 9789176853061 (ISBN)
Public defence
2018-06-12, Ada Lovelace, Building B, Campus Valla, Linköping, 13:15 (English)
Opponent
Supervisors
Available from: 2018-05-17 Created: 2018-05-17 Last updated: 2019-09-30 Bibliographically approved

By author/editor
Magnusson, Måns; Jonsson, Leif; Villani, Mattias