DOLDA: a regularized supervised topic model for high-dimensional multi-class regression
2020 (English)
In: Computational statistics (Zeitschrift), ISSN 0943-4062, E-ISSN 1613-9658, Vol. 35, no 1, p. 175-201
Article in journal (Refereed) Published
Abstract [en]
Generating user-interpretable multi-class predictions in data-rich environments with many classes and explanatory covariates is a daunting task. We introduce Diagonal Orthant Latent Dirichlet Allocation (DOLDA), a supervised topic model for multi-class classification that can handle many classes as well as many covariates. To handle many classes we use the recently proposed Diagonal Orthant probit model (Johndrow et al. in Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, 2013) together with an efficient Horseshoe prior for variable selection/shrinkage (Carvalho et al. in Biometrika 97:465–480, 2010). We propose a computationally efficient parallel Gibbs sampler for the new model. An important advantage of DOLDA is that learned topics are directly connected to individual classes without the need for a reference class. We evaluate the model’s predictive accuracy and scalability, and demonstrate DOLDA’s advantage in interpreting the generated predictions.
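The claim that topics connect to individual classes without a reference class can be illustrated with a minimal sketch. It assumes the diagonal orthant probit formulation in which every class k has its own linear predictor eta_k and the class probability is proportional to Phi(eta_k) times the product of Phi(-eta_j) over the other classes. The function names below are hypothetical and are not taken from the authors' implementation.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard normal CDF, Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def do_probit_class_probs(eta):
    """Class probabilities under an assumed diagonal orthant probit form.

    eta: list of per-class linear predictors, one per class (no reference
    class is dropped). Illustrative only; not numerically robust for
    extreme values of eta.
    """
    cdf_pos = [std_normal_cdf(e) for e in eta]
    cdf_neg = [std_normal_cdf(-e) for e in eta]
    weights = []
    for k in range(len(eta)):
        # Unnormalized weight: class k's latent variable is positive,
        # all other classes' latent variables are negative.
        w = cdf_pos[k]
        for j in range(len(eta)):
            if j != k:
                w *= cdf_neg[j]
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Three classes, each with its own (made-up) linear predictor.
    print(do_probit_class_probs([1.0, -0.5, 0.2]))
```

Because each class keeps its own predictor, a coefficient attached to a topic can be read as that topic's effect on one specific class rather than as a contrast against a baseline class.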
Place, publisher, year, edition, pages
Springer, 2020. Vol. 35, no 1, p. 175-201
Keywords [en]
Text classification, Latent Dirichlet Allocation, Horseshoe prior, Diagonal Orthant probit model, Interpretable models
National Category
Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:liu:diva-159217
DOI: 10.1007/s00180-019-00891-1
ISI: 000516561400012
Scopus ID: 2-s2.0-85067414496
OAI: oai:DiVA.org:liu-159217
DiVA, id: diva2:1340533
Note
Funding agencies: Aalto University
2019-08-05; 2019-08-05; 2020-03-19
Bibliographically approved