The Use of Distributional Semantics in Text Classification Models: Comparative performance analysis of popular word embeddings
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
In the field of Natural Language Processing, supervised machine learning is commonly used to solve classification tasks such as sentiment analysis and text categorization. The classical way of representing text has been the well-known bag-of-words representation. Lately, however, low-dimensional dense word vectors have come to dominate the input to state-of-the-art models. Few studies have made a fair comparison of the models' sensitivity to the text representation, and this thesis tries to fill that gap. In particular, we seek insight into the impact that various unsupervised pre-trained vectors have on performance. In addition, we take a closer look at the Random Indexing representation and try to optimize it jointly with the classification task. The results show that although low-dimensional pre-trained representations often bring computational benefits and have been reported to achieve state-of-the-art performance, they do not necessarily outperform the classical representations in all cases.
Place, publisher, year, edition, pages
2016. 44 p.
distributional semantics, text classification, cnn
Identifiers
URN: urn:nbn:se:liu:diva-127991
ISRN: LiTH-ISY-EX--16/4926--SE
OAI: oai:DiVA.org:liu-127991
DiVA: diva2:928411
Subject / course
Computer Vision Laboratory