The statistics of word cooccurrences: Word pairs and collocations

The statistics of word cooccurrences: Word pairs and collocations. Evert, S. Ph.D. Thesis, Universität Stuttgart, 2005.

"You shall know a word by the company it keeps!" With this slogan, J. R. Firth drew attention to a fact that language scholars had intuitively known for a long time: In natural language, words are not combined randomly into phrases and sentences, constrained only by the rules of syntax. They have a tendency to appear in certain recurrent combinations. As there are many possible reasons for words to go together, a broad range of linguistic and extra-linguistic phenomena can be found among the recurrent combinations, making them a goldmine of information for linguistics, natural language processing and related fields. There are compound nouns ("black box"), fixed and opaque idioms ("kick the bucket"), lexical selection ("a pride of lions", "heavy smoker") and formulaic expressions ("have a nice day"). They can often tell us something about the meaning of a word or even the concept behind the word (think of combinations like "dark night" and "bright day"), an idea that has inspired latent semantic analysis and similar vector space models of word meaning. With modern computers it is easy to extract evidence for recurrent word pairs from huge text corpora, often aided by linguistic pre-processing and annotation (so that specific combinations, e.g. noun+verb, can be targeted). However, the raw data, in the form of frequency counts for word pairs, are not always meaningful as a measure for the amount of "glue" between two words. Provided that both words are sufficiently frequent, their cooccurrences might be pure coincidence.
Therefore, a statistical interpretation of the frequency data is necessary, which determines the degree of statistical association between the words and whether there is enough evidence to rule out chance as a factor. For this purpose, association measures are applied, which assign a score to each word pair based on the observed frequency data. The higher this score is, the stronger and more certain the association between the two words. Even forty years ago, at the Symposium on Statistical Association Methods for Mechanized Documentation, there was a bewildering multitude of measures to choose from, but hardly any guidelines to help with the decision. This situation hasn't changed very much over the last forty years. We are still far away from a thorough understanding of association measures and there is not even a standard reference where one could look up precise definitions and related information. My thesis aims to fill this gap. The first, encyclopedic part of the thesis begins with a description of the formal and statistical prerequisites. Intended primarily as a reference for students and researchers, it also addresses the limits of the statistical models. The following chapter presents a comprehensive repository of association measures, which are organised into thematic groups. An explicit equation is given for each measure, using a consistent notation in terms of observed and expected frequencies. The second, methodological part suggests new approaches to the study of association measures, with an emphasis on empirical results and intuitive understanding. A cornerstone of this approach is a geometric interpretation of cooccurrence data and association measures. Measures are visualised as surfaces in a three-dimensional "coordinate space". The properties of each measure are determined by the geometric shapes of the respective surfaces.
Empirical results are obtained from evaluation studies, which test the performance of association measures in a collocation extraction task. In addition to its relevance for real-life applications, a carefully designed evaluation can reveal important properties of the association measures. Unfortunately, it is becoming clear that the evaluation results cannot easily be generalised. For this reason it is desirable to carry out more evaluation experiments under different conditions. In order to reduce the necessary amount of manual work, evaluation can be performed on random samples from a set of candidates. Appropriate significance tests correct for the higher degree of uncertainty. Finally, there is a third, computational aspect to the thesis. It is accompanied by an open-source software toolkit, which was used to perform experiments and produce graphs for the thesis. The unique feature of this software toolkit is that the current release includes all the data, scripts and explanations needed to replicate (almost) all the results found in the book.

@phdthesis{evert_statistics_2005,
type = {Dissertation},
title = {The statistics of word cooccurrences: {Word} pairs and collocations},
abstract = {``You shall know a word by the company it keeps!'' With this slogan, J. R. Firth drew attention to a fact that language scholars had intuitively known for a long time: In natural language, words are not combined randomly into phrases and sentences, constrained only by the rules of syntax. They have a tendency to appear in certain recurrent combinations. As there are many possible reasons for words to go together, a broad range of linguistic and extra-linguistic phenomena can be found among the recurrent combinations, making them a goldmine of information for linguistics, natural language processing and related fields. There are compound nouns (``black box''), fixed and opaque idioms (``kick the bucket''), lexical selection (``a pride of lions'', ``heavy smoker'') and formulaic expressions (``have a nice day''). They can often tell us something about the meaning of a word or even the concept behind the word (think of combinations like ``dark night'' and ``bright day''), an idea that has inspired latent semantic analysis and similar vector space models of word meaning. With modern computers it is easy to extract evidence for recurrent word pairs from huge text corpora, often aided by linguistic pre-processing and annotation (so that specific combinations, e.g. noun+verb, can be targeted). However, the raw data, in the form of frequency counts for word pairs, are not always meaningful as a measure for the amount of ``glue'' between two words.
Provided that both words are sufficiently frequent, their cooccurrences might be pure coincidence. Therefore, a statistical interpretation of the frequency data is necessary, which determines the degree of statistical association between the words and whether there is enough evidence to rule out chance as a factor. For this purpose, association measures are applied, which assign a score to each word pair based on the observed frequency data. The higher this score is, the stronger and more certain the association between the two words. Even forty years ago, at the Symposium on Statistical Association Methods for Mechanized Documentation, there was a bewildering multitude of measures to choose from, but hardly any guidelines to help with the decision. This situation hasn't changed very much over the last forty years. We are still far away from a thorough understanding of association measures and there is not even a standard reference where one could look up precise definitions and related information. My thesis aims to fill this gap. The first, encyclopedic part of the thesis begins with a description of the formal and statistical prerequisites. Intended primarily as a reference for students and researchers, it also addresses the limits of the statistical models. The following chapter presents a comprehensive repository of association measures, which are organised into thematic groups. An explicit equation is given for each measure, using a consistent notation in terms of observed and expected frequencies. The second, methodological part suggests new approaches to the study of association measures, with an emphasis on empirical results and intuitive understanding. A cornerstone of this approach is a geometric interpretation of cooccurrence data and association measures. Measures are visualised as surfaces in a three-dimensional ``coordinate space''.
The properties of each measure are determined by the geometric shapes of the respective surfaces. Empirical results are obtained from evaluation studies, which test the performance of association measures in a collocation extraction task. In addition to its relevance for real-life applications, a carefully designed evaluation can reveal important properties of the association measures. Unfortunately, it is becoming clear that the evaluation results cannot easily be generalised. For this reason it is desirable to carry out more evaluation experiments under different conditions. In order to reduce the necessary amount of manual work, evaluation can be performed on random samples from a set of candidates. Appropriate significance tests correct for the higher degree of uncertainty. Finally, there is a third, computational aspect to the thesis. It is accompanied by an open-source software toolkit, which was used to perform experiments and produce graphs for the thesis. The unique feature of this software toolkit is that the current release includes all the data, scripts and explanations needed to replicate (almost) all the results found in the book.},
urldate = {2018-01-25},
school = {Universität Stuttgart},
author = {Evert, S.},
year = {2005},
}

