Towards a Standard Dataset of Swedish Word Vectors
2016 (English). In: Proceedings of the Sixth Swedish Language Technology Conference (SLTC), 2016. Conference paper (Refereed).
Word vectors, embeddings of words into a low-dimensional space, have been shown to be useful for a large number of natural language processing tasks. Our goal with this paper is to provide a useful dataset of such vectors for Swedish. To this end, we investigate three standard embedding methods: the continuous bag-of-words and the skip-gram model with negative sampling of Mikolov et al. (2013a), and the global vectors of Pennington et al. (2014). We compare these methods using QVEC-CCA (Tsvetkov et al., 2016), an intrinsic evaluation measure that quantifies the correlation of learned word vectors with external linguistic resources. For this purpose we use SALDO, the Swedish Association Lexicon (Borin et al., 2013). Our experiments show that the continuous bag-of-words model produces vectors that are most highly correlated with SALDO, with the skip-gram model very close behind. Our learned vectors will be provided for download at the paper’s website.
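The core of the evaluation described above can be sketched as a canonical correlation between a word-vector matrix and a matrix of linguistic features (in the paper, features derived from SALDO). The following is a minimal NumPy sketch of that idea, not the official QVEC-CCA implementation; the feature extraction from SALDO is omitted, and the function name and regularization constant are illustrative assumptions.

```python
import numpy as np

def qvec_cca(X, S, eps=1e-8):
    """First canonical correlation between word vectors X (n_words x dim)
    and linguistic feature vectors S (n_words x n_features).

    A sketch of the QVEC-CCA idea: whiten both views, then take the
    largest singular value of the whitened cross-covariance, which is
    the top canonical correlation (a value in [0, 1])."""
    # Center both matrices column-wise.
    X = X - X.mean(axis=0)
    S = S - S.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance matrices (eps keeps Cholesky well-defined).
    Cxx = X.T @ X / n + eps * np.eye(X.shape[1])
    Css = S.T @ S / n + eps * np.eye(S.shape[1])
    Cxs = X.T @ S / n
    # Whitening transforms via Cholesky factors: L^{-1} whitens each view.
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Ws = np.linalg.inv(np.linalg.cholesky(Css))
    # Singular values of the whitened cross-covariance are the
    # canonical correlations; return the largest.
    M = Wx @ Cxs @ Ws.T
    return np.linalg.svd(M, compute_uv=False)[0]
```

With a shared vocabulary aligned row-by-row between the two matrices, a higher returned value indicates word vectors that better encode the lexicon's linguistic distinctions, which is the basis on which the paper ranks CBOW, skip-gram, and GloVe.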
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:liu:diva-134901
OAI: oai:DiVA.org:liu-134901
DiVA: diva2:1077779
Sixth Swedish Language Technology Conference (SLTC)