Name-ethnicity classification from open sources. Ambekar, A., Ward, C., Mohammed, J., Male, S., & Skiena, S. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '09, pages 49–58, New York, NY, USA, June 2009. Association for Computing Machinery.
The problem of ethnicity identification from names has a variety of important applications, including biomedical research, demographic studies, and marketing. Here we report on the development of an ethnicity classifier where all training data is extracted from public, non-confidential (and hence somewhat unreliable) sources. Our classifier uses hidden Markov models (HMMs) and decision trees to classify names into 13 cultural/ethnic groups with individual group accuracy comparable to earlier binary (e.g., Spanish/non-Spanish) classifiers. We have applied this classifier to over 20 million names from a large-scale news corpus, identifying interesting temporal and spatial trends on the representation of particular cultural/ethnic groups.
@inproceedings{ambekar_name-ethnicity_2009,
	address = {New York, NY, USA},
	series = {{KDD} '09},
	title = {Name-ethnicity classification from open sources},
	isbn = {978-1-60558-495-9},
	url = {https://doi.org/10.1145/1557019.1557032},
	doi = {10.1145/1557019.1557032},
	abstract = {The problem of ethnicity identification from names has a variety of important applications, including biomedical research, demographic studies, and marketing. Here we report on the development of an ethnicity classifier where all training data is extracted from public, non-confidential (and hence somewhat unreliable) sources. Our classifier uses hidden Markov models (HMMs) and decision trees to classify names into 13 cultural/ethnic groups with individual group accuracy comparable to earlier binary (e.g., Spanish/non-Spanish) classifiers. We have applied this classifier to over 20 million names from a large-scale news corpus, identifying interesting temporal and spatial trends on the representation of particular cultural/ethnic groups.},
	urldate = {2021-03-22},
	booktitle = {Proceedings of the 15th {ACM} {SIGKDD} international conference on {Knowledge} discovery and data mining},
	publisher = {Association for Computing Machinery},
	author = {Ambekar, Anurag and Ward, Charles and Mohammed, Jahangir and Male, Swapna and Skiena, Steven},
	month = jun,
	year = {2009},
	keywords = {ethnicity detection, name classification, news analysis, social science research},
	pages = {49--58},
}
