Probabilistic author-topic models for information discovery

Probabilistic author-topic models for information discovery. Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. L. In KDD '04: Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining, pages 306–315, New York, NY, USA, 2004. ACM Press.

We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words for that topic. The words in a multi-author paper are assumed to be the result of a mixture of each author's topic mixture. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to a large corpus of 160,000 abstracts and 85,000 authors from the well-known CiteSeer digital library, and learn a model with 300 topics. We discuss in detail the interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, significant trends in the computer science literature between 1990 and 2002, parsing of abstracts by topics and authors and detection of unusual papers by specific authors. An online query interface to the model is also discussed that allows interactive exploration of author-topic models for corpora such as CiteSeer.
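The two-stage generative process described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name `generate_document`, the toy author and topic names, and all probabilities are invented for illustration. The sketch also assumes that each word's author is picked uniformly from the paper's co-authors, which the abstract does not spell out. Inference by Markov chain Monte Carlo is not shown.

```python
import random

def generate_document(authors, author_topic, topic_word, n_words, rng):
    """Sample n_words tokens for a document written by `authors` (hypothetical helper)."""
    words = []
    for _ in range(n_words):
        # Assumption: each co-author is equally likely to be responsible for a word.
        author = rng.choice(authors)
        # Stage 1: draw a topic from that author's distribution over topics.
        topics, topic_probs = zip(*author_topic[author].items())
        topic = rng.choices(topics, weights=topic_probs, k=1)[0]
        # Stage 2: draw a word from that topic's distribution over words.
        vocab, word_probs = zip(*topic_word[topic].items())
        words.append(rng.choices(vocab, weights=word_probs, k=1)[0])
    return words

# Toy, made-up distributions (not taken from the paper).
author_topic = {
    "author_a": {"learning": 0.9, "databases": 0.1},
    "author_b": {"learning": 0.2, "databases": 0.8},
}
topic_word = {
    "learning": {"model": 0.5, "inference": 0.3, "data": 0.2},
    "databases": {"query": 0.6, "index": 0.3, "data": 0.1},
}

rng = random.Random(0)  # fixed seed so repeated runs give the same sample
doc = generate_document(["author_a", "author_b"], author_topic, topic_word, 10, rng)
print(doc)
```

Learning runs in the opposite direction: given only the observed words and author lists, the author-topic and topic-word distributions are estimated from data.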

@inproceedings{Steyvers/etal:04,
	address = {New York, NY, USA},
	title = {Probabilistic author-topic models for information discovery},
	isbn = {1-58113-888-9},
	doi = {10.1145/1014052.1014087},
	abstract = {We propose a new unsupervised learning technique for extracting information
from large text collections. We model documents as if they were generated by
a two-stage stochastic process. Each author is represented by a probability
distribution over topics, and each topic is represented as a probability
distribution over words for that topic. The words in a multi-author paper
are assumed to be the result of a mixture of each author's topic mixture.
The topic-word and author-topic distributions are learned from data in an
unsupervised manner using a Markov chain Monte Carlo algorithm. We apply
the methodology to a large corpus of 160,000 abstracts and 85,000 authors
from the well-known CiteSeer digital library, and learn a model with 300
topics. We discuss in detail the interpretation of the results discovered
by the system including specific topic and author models, ranking of authors
by topic and topics by author, significant trends in the computer science
literature between 1990 and 2002, parsing of abstracts by topics and authors
and detection of unusual papers by specific authors. An online query interface
to the model is also discussed that allows interactive exploration of
author-topic models for corpora such as CiteSeer.},
	booktitle = {{KDD} '04: {Proceedings} of the 2004 {ACM} {SIGKDD} international conference on {Knowledge} discovery and data mining},
	publisher = {ACM Press},
	author = {Steyvers, Mark and Smyth, Padhraic and Rosen-Zvi, Michal and Griffiths, Thomas L.},
	year = {2004},
	pages = {306--315},
}
