Unsupervised Topic Modeling for Short Texts Using Distributed Representations of Words. Kumar, V. & Sridhar, R.
We present an unsupervised topic model for short texts that performs soft clustering over distributed representations of words. We model the low-dimensional semantic vector space represented by the dense distributed representations of words using Gaussian mixture models (GMMs) whose components capture the notion of latent topics. While conventional topic modeling schemes such as probabilistic latent semantic analysis (pLSA) and latent Dirichlet allocation (LDA) need aggregation of short messages to avoid data sparsity in short documents, our framework works on large amounts of raw short texts (billions of words). In contrast with other topic modeling frameworks that use word co-occurrence statistics, our framework uses a vector space model that overcomes the issue of sparse word co-occurrence patterns. We demonstrate that our framework outperforms LDA on short texts through both subjective and objective evaluation. We also show the utility of our framework in learning topics and classifying short texts on Twitter data for English, Spanish, French, Portuguese and Russian.
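A minimal sketch of the core idea described in the abstract, not the authors' implementation: fit a Gaussian mixture over word embeddings and read each mixture component as a latent topic, with the posterior responsibilities giving soft word-topic assignments. The corpus, embedding dimensionality, and topic count below are illustrative assumptions; it uses gensim and scikit-learn as stand-ins for whatever embedding and GMM tooling the paper actually used.

```python
# Sketch: GMM soft clustering over word embeddings as topics.
# Toy corpus and hyperparameters are illustrative, not from the paper.
import numpy as np
from gensim.models import Word2Vec
from sklearn.mixture import GaussianMixture

# Toy corpus of short texts; the paper works at billions-of-words scale.
docs = [
    "stock market prices fall",
    "investors trade shares on the market",
    "team wins the football match",
    "player scores goal in final match",
]
tokenized = [doc.split() for doc in docs]

# Learn dense distributed representations of words.
w2v = Word2Vec(tokenized, vector_size=50, min_count=1, seed=0)
vocab = list(w2v.wv.index_to_key)
X = np.array([w2v.wv[w] for w in vocab])

# Soft-cluster the embedding space; each Gaussian component is a topic.
n_topics = 2
gmm = GaussianMixture(
    n_components=n_topics, covariance_type="diag", random_state=0
).fit(X)

# Posterior P(topic | word) gives a soft word-topic assignment;
# the highest-responsibility words characterize each topic.
resp = gmm.predict_proba(X)
for k in range(n_topics):
    top = np.argsort(resp[:, k])[::-1][:5]
    print(f"topic {k}:", [vocab[i] for i in top])
```

Because the clustering is soft, a word can carry probability mass in several topics at once, which is what lets the model sidestep the sparse co-occurrence counts that pLSA and LDA rely on.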
