Dynamic N-Gram System Based on an Online Croatian Spellchecking Service

Dynamic N-Gram System Based on an Online Croatian Spellchecking Service. Gledec, G., Soic, R., & Dembitz, S. IEEE ACCESS, 7:149988–149995, 2019. Place: 445 HOES LANE, PISCATAWAY, NJ 08855-4141 USA Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC Type: Article

As an infrastructure able to accelerate the development of natural language processing applications, large-scale lexical n-gram databases are at present important data systems. However, deriving such systems for world minority languages as it was done in the Google n-gram project leads to many obstacles. This paper presents an innovative approach to large-scale n-gram system creation applied to the Croatian language. Instead of using the Web as the world's largest text repository, our process of n-gram collection relies on the Croatian online academic spellchecker Hascheck, a language service publicly available since 1993 and popular worldwide. Our n-gram filtering is based on dictionary criteria, contrary to the publicly available Google n-gram systems in which cutoff criteria were applied. After 12 years of collecting, the size of the Croatian n-gram system reached the size of the largest Google Version 1 n-gram systems. Due to reliance on a service in constant use, the Croatian n-gram system is a dynamic one. System dynamics allowed modeling of n-gram count behavior through Heaps' law, which led to interesting results. Like many minority languages, the Croatian language suffers from a lack of sophisticated language processing systems in many application areas. The importance of a rich lexical n-gram infrastructure for rapid breakthroughs in new application areas is also exemplified in the paper.

@article{gledec_dynamic_2019,
	title = {Dynamic {N}-{Gram} {System} {Based} on an {Online} {Croatian} {Spellchecking} {Service}},
	volume = {7},
	issn = {2169-3536},
	doi = {10.1109/ACCESS.2019.2947898},
	abstract = {As an infrastructure able to accelerate the development of natural language processing applications, large-scale lexical n-gram databases are at present important data systems. However, deriving such systems for world minority languages as it was done in the Google n-gram project leads to many obstacles. This paper presents an innovative approach to large-scale n-gram system creation applied to the Croatian language. Instead of using the Web as the world's largest text repository, our process of n-gram collection relies on the Croatian online academic spellchecker \textit{Hascheck}, a language service publicly available since 1993 and popular worldwide. Our n-gram filtering is based on dictionary criteria, contrary to the publicly available Google n-gram systems in which cutoff criteria were applied. After 12 years of collecting, the size of the Croatian n-gram system reached the size of the largest Google Version 1 n-gram systems. Due to reliance on a service in constant use, the Croatian n-gram system is a dynamic one. System dynamics allowed modeling of n-gram count behavior through Heaps' law, which led to interesting results. Like many minority languages, the Croatian language suffers from a lack of sophisticated language processing systems in many application areas. The importance of a rich lexical n-gram infrastructure for rapid breakthroughs in new application areas is also exemplified in the paper.},
	language = {English},
	journal = {IEEE ACCESS},
	author = {Gledec, Gordan and Soic, Renato and Dembitz, Sandor},
	year = {2019},
	note = {Place: 445 HOES LANE, PISCATAWAY, NJ 08855-4141 USA
Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Type: Article},
	keywords = {Biological system modeling, Croatian language, Dictionaries, Google, Heaps' law, Licenses, Linguistics, Natural language processing, Tools, language modeling, lexical n-gram, n-gram system comparison},
	pages = {149988--149995},
}
