Magnusson, Måns
Publications (5 of 5)
Magnusson, M., Jonsson, L. & Villani, M. (2019). DOLDA: a regularized supervised topic model for high-dimensional multi-class regression. Computational statistics (Zeitschrift)
2019 (English). In: Computational statistics (Zeitschrift), ISSN 0943-4062, E-ISSN 1613-9658. Article in journal (Refereed), Epub ahead of print
Abstract [en]

Generating user interpretable multi-class predictions in data-rich environments with many classes and explanatory covariates is a daunting task. We introduce Diagonal Orthant Latent Dirichlet Allocation (DOLDA), a supervised topic model for multi-class classification that can handle many classes as well as many covariates. To handle many classes we use the recently proposed Diagonal Orthant probit model (Johndrow et al., in: Proceedings of the sixteenth international conference on artificial intelligence and statistics, 2013) together with an efficient Horseshoe prior for variable selection/shrinkage (Carvalho et al. in Biometrika 97:465–480, 2010). We propose a computationally efficient parallel Gibbs sampler for the new model. An important advantage of DOLDA is that learned topics are directly connected to individual classes without the need for a reference class. We evaluate the model’s predictive accuracy and scalability, and demonstrate DOLDA’s advantage in interpreting the generated predictions.

Place, publisher, year, edition, pages
Springer, 2019
Keywords
Text classification, Latent Dirichlet Allocation, Horseshoe prior, Diagonal Orthant probit model, Interpretable models
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:liu:diva-159217 (URN); 10.1007/s00180-019-00891-1 (DOI); 2-s2.0-85067414496 (Scopus ID)
Available from: 2019-08-05 Created: 2019-08-05 Last updated: 2019-11-14. Bibliographically approved
Magnusson, M. (2018). Scalable and Efficient Probabilistic Topic Model Inference for Textual Data. (Doctoral dissertation). Linköping: Linköping University Electronic Press
2018 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Probabilistic topic models have proven to be an extremely versatile class of mixed-membership models for discovering the thematic structure of text collections. There are many possible applications, covering a broad range of areas of study: technology, natural science, social science and the humanities.

In this thesis, a new efficient parallel Markov Chain Monte Carlo inference algorithm is proposed for Bayesian inference in large topic models. The proposed methods scale well with the corpus size and can be used for other probabilistic topic models and other natural language processing applications. The proposed methods are fast, efficient, scalable, and will converge to the true posterior distribution.

In addition, in this thesis a supervised topic model for high-dimensional text classification is also proposed, with emphasis on interpretable document prediction using the horseshoe shrinkage prior in supervised topic models.

Finally, we develop a model and inference algorithm that can model agenda and framing of political speeches over time with a priori defined topics. We apply the approach to analyze the evolution of immigration discourse in the Swedish parliament by combining theory from political science and communication science with a probabilistic topic model.

Abstract [sv, translated to English]

Probabilistic topic models are a versatile class of models for estimating topic compositions in large corpora. Applications exist in several scientific fields, such as technology, natural science, social science and the humanities. This thesis proposes new efficient and parallel Markov Chain Monte Carlo algorithms for Bayesian topic models. The proposed methods scale well with the size of the corpus and can be used for several different topic models and similar models in language technology. The proposed methods are fast, efficient, scalable, and converge to the true posterior distribution.

In addition, a topic model for high-dimensional text classification is proposed, with emphasis on interpretable document classification through the use of strongly regularizing prior distributions.

Finally, a topic model is developed for analyzing "agenda" and "framing" for a predefined topic. With this method we analyze the immigration discourse in the Swedish parliament (Riksdag) over time, by combining theory from political science, communication science and probabilistic topic models.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2018. p. 53
Series
Linköping Studies in Arts and Sciences, ISSN 0282-9800 ; 743
Linköping Studies in Statistics, ISSN 1651-1700 ; 14
Keywords
Text analysis, Bayesian inference, Markov chain Monte Carlo, topic models
National Category
Probability Theory and Statistics; Language Technology (Computational Linguistics); Computer Sciences
Identifiers
urn:nbn:se:liu:diva-146964 (URN); 10.3384/diss.diva-146964 (DOI); 9789176852880 (ISBN)
Public defence
2018-06-05, Ada Lovelace, Building B, Campus Valla, Linköping, 13:15 (English)
Available from: 2018-04-27 Created: 2018-04-27 Last updated: 2019-09-26. Bibliographically approved
Magnusson, M., Jonsson, L., Villani, M. & Broman, D. (2018). Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models. Journal of Computational And Graphical Statistics, 27(2), 449-463
2018 (English). In: Journal of Computational And Graphical Statistics, ISSN 1061-8600, E-ISSN 1537-2715, Vol. 27, no 2, p. 449-463. Article in journal (Refereed), Published
Abstract [en]

Topic models, and more specifically the class of Latent Dirichlet Allocation (LDA), are widely used for probabilistic modeling of text. MCMC sampling from the posterior distribution is typically performed using a collapsed Gibbs sampler. We propose a parallel sparse partially collapsed Gibbs sampler and compare its speed and efficiency to state-of-the-art samplers for topic models on five well-known text corpora of differing sizes and properties. In particular, we propose and compare two different strategies for sampling the parameter block with latent topic indicators. The experiments show that the increase in statistical inefficiency from only partial collapsing is smaller than commonly assumed, and can be more than compensated by the speedup from parallelization and sparsity on larger corpora. We also prove that the partially collapsed samplers scale well with the size of the corpus. The proposed algorithm is fast, efficient, exact, and can be used in more modeling situations than the ordinary collapsed sampler.

Place, publisher, year, edition, pages
Taylor & Francis, 2018
Keywords
Bayesian inference, Gibbs sampling, Latent Dirichlet Allocation, Massive Data Sets, Parallel Computing, Computational complexity
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:liu:diva-140872 (URN); 10.1080/10618600.2017.1366913 (DOI); 000435688200018 ()
Funder
Swedish Foundation for Strategic Research, SSFRIT 15-0097
Available from: 2017-09-13 Created: 2017-09-13 Last updated: 2018-07-20. Bibliographically approved
Schofield, A., Magnusson, M. & Mimno, D. (2017). Pulling Out the Stops: Rethinking Stopword Removal for Topic Models. In: 15th Conference of the European Chapter of the Association for Computational Linguistics: Proceedings of Conference, volume 2: Short Papers. Paper presented at 15th Conference of the European Chapter of the Association for Computational Linguistics Proceedings of Conference, volume 2: Short Papers April 3-7, 2017, Valencia, Spain (pp. 432-436). Stroudsburg: Association for Computational Linguistics (ACL), 2
2017 (English). In: 15th Conference of the European Chapter of the Association for Computational Linguistics: Proceedings of Conference, volume 2: Short Papers, Stroudsburg: Association for Computational Linguistics (ACL), 2017, Vol. 2, p. 432-436. Conference paper, Published paper (Other academic)
Abstract [en]

It is often assumed that topic models benefit from the use of a manually curated stopword list. Constructing this list is time-consuming and often subject to user judgments about what kinds of words are important to the model and the application. Although stopword removal clearly affects which word types appear as most probable terms in topics, we argue that this improvement is superficial, and that topic inference benefits little from the practice of removing stopwords beyond very frequent terms. Removing corpus-specific stopwords after model inference is more transparent and produces similar results to removing those words prior to inference.

Place, publisher, year, edition, pages
Stroudsburg: Association for Computational Linguistics (ACL), 2017
National Category
Probability Theory and Statistics; General Language Studies and Linguistics; Specific Languages
Identifiers
urn:nbn:se:liu:diva-147612 (URN); 9781945626357 (ISBN)
Conference
15th Conference of the European Chapter of the Association for Computational Linguistics: Proceedings of Conference, volume 2: Short Papers, April 3-7, 2017, Valencia, Spain
Available from: 2018-04-27 Created: 2018-04-27 Last updated: 2018-04-27. Bibliographically approved
Hansdotter, F. I., Magnusson, M., Kuhlmann-Berenzon, S., Hulth, A., Sundstrom, K., Hedlund, K.-O. & Andersson, Y. (2015). The incidence of acute gastrointestinal illness in Sweden. Scandinavian Journal of Public Health, 43(5), 540-547
2015 (English). In: Scandinavian Journal of Public Health, ISSN 1403-4948, E-ISSN 1651-1905, Vol. 43, no 5, p. 540-547. Article in journal (Refereed), Published
Abstract [en]

Aims: The aim of this study was to estimate the self-reported domestic incidence of acute gastrointestinal illness in the Swedish population irrespective of route of transmission or type of pathogen causing the disease. Previous studies in Sweden have primarily focused on incidence of acute gastrointestinal illness related to consumption of contaminated food and drinking water. Methods: In May 2009, we sent a questionnaire to 4000 randomly selected persons aged 0-85 years, asking about the number of episodes of stomach disease during the last 12 months. To validate the data on symptoms, we compared the study results with anonymous queries submitted to a Swedish medical website. Results: The response rate was 64%. We estimated that a total of 2,744,778 acute gastrointestinal illness episodes (95% confidence interval 2,475,641-3,013,915) occurred between 1 May 2008 and 30 April 2009. Comparing the number of reported episodes with web queries indicated that the low number of episodes during the first 6 months was an effect of seasonality rather than recall bias. Further, the result of the recall bias analysis suggested that the survey captured approximately 65% of the true number of episodes among the respondents. Conclusions: The estimated number of Swedish acute gastrointestinal illness cases in this study is about five times higher than previous estimates. This study provides valuable information on the incidence of gastrointestinal symptoms in Sweden, irrespective of route of transmission, indicating a high burden of acute gastrointestinal illness, especially among children, and large societal costs, primarily due to production losses.

Place, publisher, year, edition, pages
SAGE Publications (UK and US), 2015
Keywords
Estimating disease incidence; acute gastrointestinal illness; diarrhoea; public health; recall bias; syndromic surveillance; web query-based surveillance
National Category
Mathematics
Identifiers
urn:nbn:se:liu:diva-120353 (URN); 10.1177/1403494815576787 (DOI); 000357581300014 (); 25969165 (PubMedID)
Available from: 2015-07-31 Created: 2015-07-31 Last updated: 2017-12-04