A Grounded Unsupervised Universal Part-of-Speech Tagger for Low-Resource Languages. Cardenas, R., Lin, Y., Ji, H., & May, J. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2428–2439, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

Unsupervised part of speech (POS) tagging is often framed as a clustering problem, but practical taggers need to ground their clusters as well. Grounding generally requires reference labeled data, a luxury a low-resource language might not have. In this work, we describe an approach for low-resource unsupervised POS tagging that yields fully grounded output and requires no labeled training data. We find the classic method of Brown et al. (1992) clusters well in our use case and employ a decipherment-based approach to grounding. This approach presumes a sequence of cluster IDs is a `ciphertext' and seeks a POS tag-to-cluster ID mapping that will reveal the POS sequence. We show intrinsically that, despite the difficulty of the task, we obtain reasonable performance across a variety of languages. We also show extrinsically that incorporating our POS tagger into a name tagger leads to state-of-the-art tagging performance in Sinhalese and Kinyarwanda, two languages with nearly no labeled POS data available. We further demonstrate our tagger's utility by incorporating it into a true `zero-resource' variant of the MALOPA (Ammar et al., 2016) dependency parser model that removes the current reliance on multilingual resources and gold POS tags for new languages. Experiments show that including our tagger makes up much of the accuracy lost when gold POS tags are unavailable.

@InProceedings{cardenas-EtAl:2019:N19-1,
author = {Cardenas, Ronald and Lin, Ying and Ji, Heng and May, Jonathan},
title = {A Grounded Unsupervised Universal Part-of-Speech Tagger for Low-Resource Languages},
booktitle = {Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},
month = {June},
year = {2019},
address = {Minneapolis, Minnesota},
publisher = {Association for Computational Linguistics},
pages = {2428--2439},
abstract = {Unsupervised part of speech (POS) tagging is often framed as a clustering problem, but practical taggers need to ground their clusters as well. Grounding generally requires reference labeled data, a luxury a low-resource language might not have. In this work, we describe an approach for low-resource unsupervised POS tagging that yields fully grounded output and requires no labeled training data. We find the classic method of Brown et al. (1992) clusters well in our use case and employ a decipherment-based approach to grounding. This approach presumes a sequence of cluster IDs is a `ciphertext' and seeks a POS tag-to-cluster ID mapping that will reveal the POS sequence. We show intrinsically that, despite the difficulty of the task, we obtain reasonable performance across a variety of languages. We also show extrinsically that incorporating our POS tagger into a name tagger leads to state-of-the-art tagging performance in Sinhalese and Kinyarwanda, two languages with nearly no labeled POS data available. We further demonstrate our tagger's utility by incorporating it into a true `zero-resource' variant of the MALOPA (Ammar et al., 2016) dependency parser model that removes the current reliance on multilingual resources and gold POS tags for new languages. Experiments show that including our tagger makes up much of the accuracy lost when gold POS tags are unavailable.},
url = {http://www.aclweb.org/anthology/N19-1252}
}