Magnusson, Måns
Publications (5 of 5)
Magnusson, M., Jonsson, L. & Villani, M. (2020). DOLDA: a regularized supervised topic model for high-dimensional multi-class regression. Computational statistics (Zeitschrift), 35(1), 175-201
2020 (English) In: Computational statistics (Zeitschrift), ISSN 0943-4062, E-ISSN 1613-9658, Vol. 35, no. 1, pp. 175-201. Journal article (Refereed). Published
Abstract [en]

Generating user interpretable multi-class predictions in data-rich environments with many classes and explanatory covariates is a daunting task. We introduce Diagonal Orthant Latent Dirichlet Allocation (DOLDA), a supervised topic model for multi-class classification that can handle many classes as well as many covariates. To handle many classes we use the recently proposed Diagonal Orthant probit model (Johndrow et al., in: Proceedings of the sixteenth international conference on artificial intelligence and statistics, 2013) together with an efficient Horseshoe prior for variable selection/shrinkage (Carvalho et al. in Biometrika 97:465–480, 2010). We propose a computationally efficient parallel Gibbs sampler for the new model. An important advantage of DOLDA is that learned topics are directly connected to individual classes without the need for a reference class. We evaluate the model’s predictive accuracy and scalability, and demonstrate DOLDA’s advantage in interpreting the generated predictions.

Place, publisher, year, edition, pages
Springer, 2020
Keywords
Text classification, Latent Dirichlet Allocation, Horseshoe prior, Diagonal Orthant probit model, Interpretable models
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:liu:diva-159217 (URN); 10.1007/s00180-019-00891-1 (DOI); 000516561400012 (); 2-s2.0-85067414496 (Scopus ID)
Note

Funding agencies: Aalto University

Available from: 2019-08-05; Created: 2019-08-05; Last updated: 2020-03-19; Bibliographically approved
Magnusson, M. (2018). Scalable and Efficient Probabilistic Topic Model Inference for Textual Data. (Doctoral dissertation). Linköping: Linköping University Electronic Press
2018 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Probabilistic topic models have proven to be an extremely versatile class of mixed-membership models for discovering the thematic structure of text collections. There are many possible applications, covering a broad range of areas of study: technology, natural science, social science and the humanities.

In this thesis, a new efficient parallel Markov Chain Monte Carlo inference algorithm is proposed for Bayesian inference in large topic models. The proposed methods scale well with the corpus size and can be used for other probabilistic topic models and other natural language processing applications. The proposed methods are fast, efficient, scalable, and will converge to the true posterior distribution.

In addition, in this thesis a supervised topic model for high-dimensional text classification is also proposed, with emphasis on interpretable document prediction using the horseshoe shrinkage prior in supervised topic models.

Finally, we develop a model and inference algorithm that can model agenda and framing of political speeches over time with a priori defined topics. We apply the approach to analyze the evolution of immigration discourse in the Swedish parliament by combining theory from political science and communication science with a probabilistic topic model.

Abstract [sv] (English translation)

Probabilistic topic models are a versatile class of models for estimating the topic composition of large corpora. Applications exist in several scientific fields, such as technology, natural science, social science, and the humanities. In this thesis, new efficient and parallel Markov Chain Monte Carlo algorithms are proposed for Bayesian topic models. The proposed methods scale well with the size of the corpus and can be used for several different topic models and similar models in language technology. The proposed methods are fast, efficient, scalable, and converge to the true posterior distribution.

In addition, a topic model for high-dimensional text classification is proposed, with emphasis on interpretable document classification through the use of strongly regularizing prior distributions.

Finally, a topic model is developed for analyzing "agenda" and "framing" for a predefined topic. With this method we analyze the immigration discourse in the Swedish parliament (Riksdag) over time, by combining theory from political science, communication science, and probabilistic topic models.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2018. p. 53
Series
Linköping Studies in Arts and Sciences, ISSN 0282-9800 ; 743
Linköping Studies in Statistics, ISSN 1651-1700 ; 14
Keywords
Text analysis, Bayesian inference, Markov chain Monte Carlo, topic models
National Category
Probability Theory and Statistics; Language Technology (Computational Linguistics); Computer Sciences
Identifiers
urn:nbn:se:liu:diva-146964 (URN); 10.3384/diss.diva-146964 (DOI); 9789176852880 (ISBN)
Public defence
2018-06-05, Ada Lovelace, Building B, Campus Valla, Linköping, 13:15 (English)
Opponent
Supervisors
Available from: 2018-04-27; Created: 2018-04-27; Last updated: 2019-09-26; Bibliographically approved
Magnusson, M., Jonsson, L., Villani, M. & Broman, D. (2018). Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models. Journal of Computational And Graphical Statistics, 27(2), 449-463
2018 (English) In: Journal of Computational And Graphical Statistics, ISSN 1061-8600, E-ISSN 1537-2715, Vol. 27, no. 2, pp. 449-463. Journal article (Refereed). Published
Abstract [en]

Topic models, and more specifically the class of Latent Dirichlet Allocation (LDA), are widely used for probabilistic modeling of text. MCMC sampling from the posterior distribution is typically performed using a collapsed Gibbs sampler. We propose a parallel sparse partially collapsed Gibbs sampler and compare its speed and efficiency to state-of-the-art samplers for topic models on five well-known text corpora of differing sizes and properties. In particular, we propose and compare two different strategies for sampling the parameter block with latent topic indicators. The experiments show that the increase in statistical inefficiency from only partial collapsing is smaller than commonly assumed, and can be more than compensated by the speedup from parallelization and sparsity on larger corpora. We also prove that the partially collapsed samplers scale well with the size of the corpus. The proposed algorithm is fast, efficient, exact, and can be used in more modeling situations than the ordinary collapsed sampler.

Place, publisher, year, edition, pages
Taylor & Francis, 2018
Keywords
Bayesian inference, Gibbs sampling, Latent Dirichlet Allocation, Massive Data Sets, Parallel Computing, Computational complexity
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:liu:diva-140872 (URN); 10.1080/10618600.2017.1366913 (DOI); 000435688200018 ()
Funder
Stiftelsen för strategisk forskning (SSF), SSFRIT 15-0097
Available from: 2017-09-13; Created: 2017-09-13; Last updated: 2022-04-11; Bibliographically approved
Schofield, A., Magnusson, M. & Mimno, D. (2017). Pulling Out the Stops: Rethinking Stopword Removal for Topic Models. In: 15th Conference of the European Chapter of the Association for Computational Linguistics: Proceedings of Conference, volume 2: Short Papers (pp. 432-436). Paper presented at the 15th Conference of the European Chapter of the Association for Computational Linguistics, April 3-7, 2017, Valencia, Spain. Stroudsburg: Association for Computational Linguistics (ACL)
2017 (English) In: 15th Conference of the European Chapter of the Association for Computational Linguistics: Proceedings of Conference, volume 2: Short Papers, Stroudsburg: Association for Computational Linguistics (ACL), 2017, Vol. 2, pp. 432-436. Conference paper, Published paper (Other academic)
Abstract [en]

It is often assumed that topic models benefit from the use of a manually curated stopword list. Constructing this list is time-consuming and often subject to user judgments about what kinds of words are important to the model and the application. Although stopword removal clearly affects which word types appear as most probable terms in topics, we argue that this improvement is superficial, and that topic inference benefits little from the practice of removing stopwords beyond very frequent terms. Removing corpus-specific stopwords after model inference is more transparent and produces similar results to removing those words prior to inference.

Place, publisher, year, edition, pages
Stroudsburg: Association for Computational Linguistics (ACL), 2017
National Category
Probability Theory and Statistics; General Language Studies and Linguistics; Specific Languages
Identifiers
urn:nbn:se:liu:diva-147612 (URN); 9781945626357 (ISBN)
Conference
15th Conference of the European Chapter of the Association for Computational Linguistics, April 3-7, 2017, Valencia, Spain
Available from: 2018-04-27; Created: 2018-04-27; Last updated: 2018-04-27; Bibliographically approved
Hansdotter, F. I., Magnusson, M., Kuhlmann-Berenzon, S., Hulth, A., Sundstrom, K., Hedlund, K.-O. & Andersson, Y. (2015). The incidence of acute gastrointestinal illness in Sweden. Scandinavian Journal of Public Health, 43(5), 540-547
2015 (English) In: Scandinavian Journal of Public Health, ISSN 1403-4948, E-ISSN 1651-1905, Vol. 43, no. 5, pp. 540-547. Journal article (Refereed). Published
Abstract [en]

Aims: The aim of this study was to estimate the self-reported domestic incidence of acute gastrointestinal illness in the Swedish population irrespective of route of transmission or type of pathogen causing the disease. Previous studies in Sweden have primarily focused on incidence of acute gastrointestinal illness related to consumption of contaminated food and drinking water. Methods: In May 2009, we sent a questionnaire to 4000 randomly selected persons aged 0-85 years, asking about the number of episodes of stomach disease during the last 12 months. To validate the data on symptoms, we compared the study results with anonymous queries submitted to a Swedish medical website. Results: The response rate was 64%. We estimated that a total number of 2,744,778 acute gastrointestinal illness episodes (95% confidence interval 2,475,641-3,013,915) occurred between 1 May 2008 and 30 April 2009. Comparing the number of reported episodes with web queries indicated that the low number of episodes during the first 6 months was an effect of seasonality rather than recall bias. Further, the result of the recall bias analysis suggested that the survey captured approximately 65% of the true number of episodes among the respondents. Conclusions: The estimated number of Swedish acute gastrointestinal illness cases in this study is about five times higher than previous estimates. This study provides valuable information on the incidence of gastrointestinal symptoms in Sweden, irrespective of route of transmission, indicating a high burden of acute gastrointestinal illness, especially among children, and large societal costs, primarily due to production losses.

Place, publisher, year, edition, pages
SAGE Publications (UK and US), 2015
Keywords
Estimating disease incidence; acute gastrointestinal illness; diarrhoea; public health; recall bias; syndromic surveillance; web query-based surveillance
National Category
Mathematics
Identifiers
urn:nbn:se:liu:diva-120353 (URN); 10.1177/1403494815576787 (DOI); 000357581300014 (); 25969165 (PubMedID)
Available from: 2015-07-31; Created: 2015-07-31; Last updated: 2017-12-04