Improving LDA Topic Models for Microblogs via Tweet Pooling and Automatic Labeling

Improving LDA Topic Models for Microblogs via Tweet Pooling and Automatic Labeling. Mehrotra, R., Sanner, S., Buntine, W., & Xie, L.


Twitter, or the world of 140 characters poses serious challenges to the efficacy of topic models on short, messy text. While topic models such as Latent Dirichlet Allocation (LDA) have a long history of successful application to news articles and academic abstracts, they are often less coherent when applied to microblog content like Twitter. In this paper, we investigate methods to improve topics learned from Twitter content without modifying the basic machinery of LDA; we achieve this through various pooling schemes that aggregate tweets in a data preprocessing step for LDA. We empirically establish that a novel method of tweet pooling by hashtags leads to a vast improvement in a variety of measures for topic coherence across three diverse Twitter datasets in comparison to an unmodified LDA baseline and a variety of pooling schemes. An additional contribution of automatic hashtag labeling further improves on the hashtag pooling results for a subset of metrics. Overall, these two novel schemes lead to significantly improved LDA topic models on Twitter content.
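The hashtag pooling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `pool_by_hashtag`, the `#tag` and `tweet:<index>` keys, the lowercasing of hashtags, and the choice to keep each untagged tweet as its own document are all assumptions made for illustration. The idea shown is only that tweets sharing a hashtag are concatenated into one longer pseudo-document before being passed to an unmodified LDA implementation.

```python
import re
from collections import defaultdict

# Hypothetical pattern: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def pool_by_hashtag(tweets):
    """Aggregate tweets into one pseudo-document per hashtag.

    Assumptions (not taken from the paper):
    - hashtags are matched case-insensitively;
    - a tweet with several hashtags is added to every matching pool;
    - a tweet with no hashtag stays a document of its own.
    """
    pools = defaultdict(list)
    for index, text in enumerate(tweets):
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        if not tags:
            pools[f"tweet:{index}"].append(text)
            continue
        for tag in tags:
            pools[f"#{tag}"].append(text)
    # Join each pool into a single string that LDA treats as one document.
    return {key: " ".join(texts) for key, texts in pools.items()}


if __name__ == "__main__":
    sample = ["Go #LDA go", "Topic models #lda #nlp", "no tags here"]
    for key, document in pool_by_hashtag(sample).items():
        print(key, "->", document)
```

The pooled strings would then be tokenized and fed to any standard LDA implementation in place of the individual tweets; the model itself is left unchanged, which matches the preprocessing-only framing of the abstract.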

@article{
 title = {Improving LDA Topic Models for Microblogs via Tweet Pooling and Automatic Labeling},
 type = {article},
 keywords = {H33 [Information Storage And Retrieval],LDA,Microblogs,Natural Language Processing—Text analysis},
 websites = {http://users.cecs.anu.edu.au/~ssanner/Papers/sigir13.pdf},
 id = {5e3c782f-91d1-3880-8c4d-1bb5c474b07d},
 created = {2018-02-05T17:47:55.792Z},
 accessed = {2018-02-05},
 file_attached = {true},
 profile_id = {371589bb-c770-37ff-8193-93c6f25ffeb1},
 group_id = {f982cd63-7ceb-3aa2-ac7e-a953963d6716},
 last_modified = {2018-02-05T17:47:57.687Z},
 read = {false},
 starred = {false},
 authored = {false},
 confirmed = {false},
 hidden = {false},
 private_publication = {false},
 abstract = {Twitter, or the world of 140 characters poses serious challenges to the efficacy of topic models on short, messy text. While topic models such as Latent Dirichlet Allocation (LDA) have a long history of successful application to news articles and academic abstracts, they are often less coherent when applied to microblog content like Twitter. In this paper, we investigate methods to improve topics learned from Twitter content without modifying the basic machinery of LDA; we achieve this through various pooling schemes that aggregate tweets in a data preprocessing step for LDA. We empirically establish that a novel method of tweet pooling by hashtags leads to a vast improvement in a variety of measures for topic coherence across three diverse Twitter datasets in comparison to an unmodified LDA baseline and a variety of pooling schemes. An additional contribution of automatic hashtag labeling further improves on the hashtag pooling results for a subset of metrics. Overall, these two novel schemes lead to significantly improved LDA topic models on Twitter content.},
 bibtype = {article},
 author = {Mehrotra, Rishabh and Sanner, Scott and Buntine, Wray and Xie, Lexing}
}
