Chinese Word Segmentation and Named Entity Recognition by Character Tagging

Chinese Word Segmentation and Named Entity Recognition by Character Tagging. Yu, K. Computational Linguistics, 2006.

This paper presents a pragmatic approach to Chinese word segmentation. It differentiates from most of the previous approaches mainly in three respects. First of all, while theoretical linguists have defined Chinese words with various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in realistic computer applications. Secondly, we propose a pragmatic mathematical framework in which segmenting known words and detecting unknown words of different types (i.e. morphologically derived words, factoids, named entities, and other unlisted words) can be performed simultaneously in a unified way. These tasks are usually conducted separately in other systems. Finally, we do not assume the existence of a universal word segmentation standard which is application independent. Instead, we argue for the necessity of multiple segmentation standards due to the pragmatic fact that different NLP applications might require different granularities of Chinese words. These pragmatic approaches have been implemented in an adaptive Chinese word segmenter, called MSRSeg, which will be described in detail. It consists of two components: (1) a generic segmenter that is based on the framework of linear mixture models, and provides a unified approach to the five fundamental features of word-level Chinese language processing: lexicon word processing, morphological analysis, factoid detection, named entity recognition, and new word identification; and (2) a set of output adaptors for adapting the output of the former to different application-specific standards. Evaluation on five test sets with different standards shows that the adaptive system achieves state-of-the-art performance on all the test sets.

@article{Yu2006,
 title = {Chinese Word Segmentation and Named Entity Recognition by Character Tagging},
 type = {article},
 year = {2006},
 pages = {146--149},
 id = {2d6dfa98-6f8d-3da6-a8f5-43f8e48026ce},
 created = {2012-01-21T12:35:31.000Z},
 file_attached = {false},
 profile_id = {5284e6aa-156c-3ce5-bc0e-b80cf09f3ef6},
 group_id = {066b42c8-f712-3fc3-abb2-225c158d2704},
 last_modified = {2017-03-14T14:36:19.698Z},
 tags = {named entity recognition},
 read = {false},
 starred = {false},
 authored = {false},
 confirmed = {true},
 hidden = {false},
 citation_key = {Yu2006},
 private_publication = {false},
 abstract = {This paper presents a pragmatic approach to Chinese word segmentation. It differentiates from most of the previous approaches mainly in three respects. First of all, while theoretical linguists have defined Chinese words with various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in realistic computer applications. Secondly, we propose a pragmatic mathematical framework in which segmenting known words and detecting unknown words of different types (i.e. morphologically derived words, factoids, named entities, and other unlisted words) can be performed simultaneously in a unified way. These tasks are usually conducted separately in other systems. Finally, we do not assume the existence of a universal word segmentation standard which is application independent. Instead, we argue for the necessity of multiple segmentation standards due to the pragmatic fact that different NLP applications might require different granularities of Chinese words. These pragmatic approaches have been implemented in an adaptive Chinese word segmenter, called MSRSeg, which will be described in detail. It consists of two components: (1) a generic segmenter that is based on the framework of linear mixture models, and provides a unified approach to the five fundamental features of word-level Chinese language processing: lexicon word processing, morphological analysis, factoid detection, named entity recognition, and new word identification; and (2) a set of output adaptors for adapting the output of the former to different application-specific standards. Evaluation on five test sets with different standards shows that the adaptive system achieves state-of-the-art performance on all the test sets.},
 bibtype = {article},
 author = {Yu, Kun},
 journal = {Computational Linguistics},
 number = {July}
}
