Language identification based on n-gram frequency ranking. de Córdoba, R.; D'Haro, L. F.; Fernández Martínez, F.; Macías, J.; and Ferreiros, J. In Interspeech 2007. Proceedings of the 8th Annual Conference of the International Speech Communication Association, pages 354-357. Antwerp, Belgium, August 27-31, 2007.
We present a novel approach for language identification based on a text categorization technique, namely n-gram frequency ranking. We use a parallel phone recognizer, the same as in PPRLM, but instead of the language model, we create a ranking of the most frequent n-grams, keeping only a fraction of them. We then compute the distance between the input sentence ranking and each language ranking, based on the difference in relative positions for each n-gram. The objective of this ranking is to reliably model a longer span than PPRLM, namely 5-grams instead of trigrams, because the ranking needs less training data for a reliable estimation. We demonstrate that this approach outperforms PPRLM (6% relative improvement) due to the inclusion of 4-grams and 5-grams in the classifier. We present two alternatives: ranking with absolute values for the number of occurrences and ranking with discriminative values (11% relative improvement).
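The ranking-and-distance idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy phone sequences, the `top_k` cutoff, the fixed penalty for n-grams missing from a language ranking, and the averaging of absolute rank differences are all assumptions chosen for illustration. The abstract does not specify how relative positions are normalized, and only the absolute-count variant is sketched here, not the discriminative one.

```python
from collections import Counter


def ngrams(tokens, max_n=5):
    """Yield every n-gram of order 1..max_n as a tuple of phone labels."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def build_ranking(sequences, max_n=5, top_k=1000):
    """Map each of the top_k most frequent n-grams to its rank (0 = most frequent)."""
    counts = Counter()
    for seq in sequences:
        counts.update(ngrams(seq, max_n))
    # Sort by descending count; break ties on the n-gram itself for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ordered[:top_k])}


def ranking_distance(input_rank, lang_rank, penalty=1000):
    """Average absolute rank difference; unseen n-grams cost a fixed penalty (assumption)."""
    if not input_rank:
        return float(penalty)
    total = 0
    for gram, pos in input_rank.items():
        if gram in lang_rank:
            total += abs(pos - lang_rank[gram])
        else:
            total += penalty
    return total / len(input_rank)


def identify(tokens, lang_rankings, max_n=5, top_k=1000):
    """Return the language whose ranking is closest to the input's ranking."""
    input_rank = build_ranking([tokens], max_n, top_k)
    return min(
        lang_rankings,
        key=lambda lang: ranking_distance(input_rank, lang_rankings[lang], top_k),
    )


# Toy phone sequences (hypothetical, standing in for phone-recognizer output).
train = {
    "es": ["k a s a", "m e s a", "k o s a"],
    "en": ["h aw s", "th i ng", "s i ng"],
}
rankings = {
    lang: build_ranking([s.split() for s in sents])
    for lang, sents in train.items()
}
print(identify("k a s a".split(), rankings))
```

Using the same penalty for every language keeps the distances comparable when the language rankings differ in size.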
@inproceedings{de_cordoba_language_2007,
	Author = {de Córdoba, Ricardo and D'Haro, Luis Fernando and Fernández Martínez, Fernando and Macías, Javier and Ferreiros, Javier},
	Booktitle = {Interspeech 2007. Proceedings of the 8th Annual Conference of the International Speech Communication Association},
	Date = {2007},
	Date-Modified = {2016-09-24 18:56:01 +0000},
	Keywords = {language identification, Spanish, speech technology},
	Pages = {354--357},
	Address = {Antwerp, Belgium},
	Note = {Conference held August 27--31, 2007},
	Title = {Language identification based on n-gram frequency ranking},
	Url = {http://www-gth.die.upm.es/research/documentation/AG-056Lan-07.pdf},
	Abstract = {We present a novel approach for language identification based on a text categorization technique, namely n-gram frequency ranking. We use a parallel phone recognizer, the same as in PPRLM, but instead of the language model, we create a ranking of the most frequent n-grams, keeping only a fraction of them. We then compute the distance between the input sentence ranking and each language ranking, based on the difference in relative positions for each n-gram. The objective of this ranking is to reliably model a longer span than PPRLM, namely 5-grams instead of trigrams, because the ranking needs less training data for a reliable estimation. We demonstrate that this approach outperforms PPRLM (6\% relative improvement) due to the inclusion of 4-grams and 5-grams in the classifier. We present two alternatives: ranking with absolute values for the number of occurrences and ranking with discriminative values (11\% relative improvement).},
	Bdsk-Url-1 = {http://www-gth.die.upm.es/research/documentation/AG-056Lan-07.pdf}}