Follow the leader(board) with confidence: Estimating p-values from a single test set with item and response variance.
Shira Wein; Christopher Homan; Lora Aroyo; and Chris Welty.
In Findings of the Association for Computational Linguistics: ACL 2023, pages 3138–3161, Toronto, Canada, July 2023. Association for Computational Linguistics.
@inproceedings{wein-etal-2023-follow,
    title = "Follow the leader(board) with confidence: Estimating p-values from a single test set with item and response variance",
    author = "Wein, Shira and Homan, Christopher and Aroyo, Lora and Welty, Chris",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.196",
    doi = "10.18653/v1/2023.findings-acl.196",
    pages = "3138--3161",
}

Among the problems with leaderboard culture in NLP has been the widespread lack of confidence estimation in reported results. In this work, we present a framework and simulator for estimating p-values for comparisons between the results of two systems, in order to understand the confidence that one is actually better (i.e. ranked higher) than the other. What has made this difficult in the past is that each system must itself be evaluated by comparison to a gold standard. We define a null hypothesis that each system's metric scores are drawn from the same distribution, using variance found naturally (though rarely reported) in test set items and individual labels on an item (responses) to produce the metric distributions. We create a test set that evenly mixes the responses of the two systems under the assumption the null hypothesis is true. Exploring how to best estimate the true p-value from a single test set under different metrics, tests, and sampling methods, we find that the presence of response variance (from multiple raters or multiple model versions) has a profound impact on p-value estimates for model comparison, and that choice of metric and sampling method is critical to providing statistical guarantees on model comparisons.
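
The mixing procedure described in the abstract lends itself to a permutation-style test. The following is a minimal illustrative sketch in Python, not the paper's actual framework or simulator: it assumes each system has already been reduced to one metric score per test item, it ignores the response variance the paper emphasizes, and all names and numbers are hypothetical.

    import numpy as np

    def mixed_test_set_pvalue(scores_a, scores_b, n_resamples=10000, seed=0):
        # Null hypothesis: both systems' per-item scores are drawn from the
        # same distribution, so A's and B's responses may be freely swapped.
        rng = np.random.default_rng(seed)
        a = np.asarray(scores_a, dtype=float)
        b = np.asarray(scores_b, dtype=float)
        observed = a.mean() - b.mean()
        hits = 0
        for _ in range(n_resamples):
            swap = rng.random(a.size) < 0.5        # evenly mix the two systems
            diff = np.where(swap, b, a).mean() - np.where(swap, a, b).mean()
            hits += diff >= observed
        return (hits + 1) / (n_resamples + 1)      # smoothed one-sided p-value

    # Hypothetical per-item accuracies for two systems on a 500-item test set:
    rng = np.random.default_rng(42)
    print(mixed_test_set_pvalue(rng.binomial(1, 0.82, 500),
                                rng.binomial(1, 0.79, 500)))

A fuller treatment would also resample over the multiple rater labels (responses) per item, which the abstract reports has a profound impact on the resulting estimates.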

Human Raters Cannot Distinguish English Translations from Original English Texts.
Shira Wein.
In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12266–12272, Singapore, December 2023. Association for Computational Linguistics.
@inproceedings{wein-2023-human,
    title = "Human Raters Cannot Distinguish {E}nglish Translations from Original {E}nglish Texts",
    author = "Wein, Shira",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.754",
    doi = "10.18653/v1/2023.emnlp-main.754",
    pages = "12266--12272",
}

The term translationese describes the set of linguistic features unique to translated texts, which appear regardless of translation quality. Though automatic classifiers designed to distinguish translated texts achieve high accuracy and prior work has identified common hallmarks of translationese, human accuracy of identifying translated text is understudied. In this work, we perform a human evaluation of English original/translated texts in order to explore raters' ability to classify texts as being original or translated English and the features that lead a rater to judge text as being translated. Ultimately, we find that, regardless of the annotators' native language or the source language of the text, annotators are unable to distinguish translations from original English texts and also have low agreement. Our results provide critical insight into work in translation studies and context for assessments of translationese classifiers.

Measuring Fine-Grained Semantic Equivalence with Abstract Meaning Representation.
Shira Wein; Zhuxin Wang; and Nathan Schneider.
In Proceedings of the 15th International Conference on Computational Semantics, pages 144–154, Nancy, France, June 2023. Association for Computational Linguistics.
@inproceedings{wein-etal-2023-measuring,
    title = "Measuring Fine-Grained Semantic Equivalence with {A}bstract {M}eaning {R}epresentation",
    author = "Wein, Shira and Wang, Zhuxin and Schneider, Nathan",
    booktitle = "Proceedings of the 15th International Conference on Computational Semantics",
    month = jun,
    year = "2023",
    address = "Nancy, France",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.iwcs-1.16",
    pages = "144--154",
}

Identifying semantically equivalent sentences is important for many NLP tasks. Current approaches to semantic equivalence take a loose, sentence-level approach to "equivalence," despite evidence that fine-grained differences and implicit content have an effect on human understanding and system performance. In this work, we introduce a novel, more sensitive method of characterizing cross-lingual semantic equivalence that leverages Abstract Meaning Representation graph structures. We find that parsing sentences into AMRs and comparing the AMR graphs enables finer-grained equivalence measurement than comparing the sentences themselves. We demonstrate that when using gold or even automatically parsed AMR annotations, our solution is finer-grained than existing corpus filtering methods and more accurate at predicting strictly equivalent sentences than existing semantic similarity metrics.
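
To make the graph-comparison idea concrete, here is a minimal Smatch-style sketch under the simplifying assumption that the node variables of the two AMRs are already aligned (the real Smatch metric searches over variable alignments); the toy graphs and sense labels are illustrative only, not taken from the paper.

    def triple_f1(pred, gold):
        # F1 over matching (source, relation, target) triples,
        # given pre-aligned variables.
        pred, gold = set(pred), set(gold)
        matched = len(pred & gold)
        if matched == 0:
            return 0.0
        p, r = matched / len(pred), matched / len(gold)
        return 2 * p * r / (p + r)

    # Toy AMRs for "The boy wants to go" vs. "The boy wants to leave":
    g1 = {("w", "instance", "want-01"), ("b", "instance", "boy"),
          ("g", "instance", "go-02"),
          ("w", "ARG0", "b"), ("w", "ARG1", "g"), ("g", "ARG0", "b")}
    g2 = {("w", "instance", "want-01"), ("b", "instance", "boy"),
          ("g", "instance", "leave-11"),
          ("w", "ARG0", "b"), ("w", "ARG1", "g"), ("g", "ARG0", "b")}
    print(triple_f1(g1, g2))  # ~0.83: exactly one concept triple differs

The unmatched triples pinpoint where two sentences diverge, which is what permits a finer-grained equivalence judgment than a single sentence-level similarity score.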

AMR4NLI: Interpretable and robust NLI measures from semantic graphs.
Juri Opitz; Shira Wein; Julius Steen; Anette Frank; and Nathan Schneider.
In Proceedings of the 15th International Conference on Computational Semantics, pages 275–283, Nancy, France, June 2023. Association for Computational Linguistics.
@inproceedings{opitz-etal-2023-amr4nli,
    title = "{AMR}4{NLI}: Interpretable and robust {NLI} measures from semantic graphs",
    author = "Opitz, Juri and Wein, Shira and Steen, Julius and Frank, Anette and Schneider, Nathan",
    booktitle = "Proceedings of the 15th International Conference on Computational Semantics",
    month = jun,
    year = "2023",
    address = "Nancy, France",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.iwcs-1.29",
    pages = "275--283",
}

The task of natural language inference (NLI) asks whether a given premise (expressed in NL) entails a given NL hypothesis. NLI benchmarks contain human ratings of entailment, but the meaning relationships driving these ratings are not formalized. Can the underlying sentence pair relationships be made more explicit in an interpretable yet robust fashion? We compare semantic structures to represent premise and hypothesis, including sets of *contextualized embeddings* and *semantic graphs* (Abstract Meaning Representations), and measure whether the hypothesis is a semantic substructure of the premise, utilizing interpretable metrics. Our evaluation on three English benchmarks finds value in both contextualized embeddings and semantic graphs; moreover, they provide complementary signals, and can be leveraged together in a hybrid model.
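
Building on the triple representation sketched under the previous entry, the substructure idea can be approximated with an asymmetric containment score: how much of the hypothesis graph appears in the premise graph? A minimal sketch, again assuming pre-aligned variables and using graph overlap only (the paper's hybrid model also leverages contextualized embeddings):

    def substructure_score(premise, hypothesis):
        # Fraction of hypothesis triples contained in the premise graph;
        # 1.0 means the hypothesis is, triple-wise, a substructure of the premise.
        hyp = set(hypothesis)
        if not hyp:
            return 1.0  # an empty hypothesis is trivially entailed
        return len(hyp & set(premise)) / len(hyp)

    # Premise: "A man sleeps in a park."  Hypothesis: "A man sleeps."
    premise = {("s", "instance", "sleep-01"), ("m", "instance", "man"),
               ("p", "instance", "park"),
               ("s", "ARG0", "m"), ("s", "location", "p")}
    hypothesis = {("s", "instance", "sleep-01"), ("m", "instance", "man"),
                  ("s", "ARG0", "m")}
    print(substructure_score(premise, hypothesis))  # 1.0: direction holds
    print(substructure_score(hypothesis, premise))  # 0.6: reverse does not

The asymmetry mirrors entailment itself: the premise can carry extra content, but every part of the hypothesis should be grounded in the premise.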

Comparing UMR and Cross-lingual Adaptations of AMR.
Shira Wein; and Julia Bonn.
In Proceedings of the Fourth International Workshop on Designing Meaning Representations, pages 23–33, Nancy, France, June 2023. Association for Computational Linguistics.
@inproceedings{wein-bonn-2023-comparing,
    title = "Comparing {UMR} and Cross-lingual Adaptations of {AMR}",
    author = "Wein, Shira and Bonn, Julia",
    booktitle = "Proceedings of the Fourth International Workshop on Designing Meaning Representations",
    month = jun,
    year = "2023",
    address = "Nancy, France",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.dmr-1.3",
    pages = "23--33",
}

Abstract Meaning Representation (AMR) is a popular semantic annotation schema that presents sentence meaning as a graph while abstracting away from syntax. It was originally designed for English, but has since been extended to a variety of non-English versions of AMR. These cross-lingual adaptations, to varying degrees, incorporate language-specific features necessary to effectively capture the semantics of the language being annotated. Uniform Meaning Representation (UMR), on the other hand, the multilingual extension of AMR, was designed specifically for cross-lingual applications. In this work, we discuss these two approaches to extending AMR beyond English. We describe both approaches, compare the information they capture for a case language (Spanish), and outline implications for future work.

UMR Annotation of Multiword Expressions.
Julia Bonn; Andrew Cowell; Jan Hajič; Alexis Palmer; Martha Palmer; James Pustejovsky; Haibo Sun; Zdenka Uresova; Shira Wein; Nianwen Xue; and Jin Zhao.
In Proceedings of the Fourth International Workshop on Designing Meaning Representations, pages 99–109, Nancy, France, June 2023. Association for Computational Linguistics.
@inproceedings{bonn-etal-2023-umr,
    title = "{UMR} Annotation of Multiword Expressions",
    author = "Bonn, Julia and Cowell, Andrew and Haji{\v{c}}, Jan and Palmer, Alexis and Palmer, Martha and Pustejovsky, James and Sun, Haibo and Uresova, Zdenka and Wein, Shira and Xue, Nianwen and Zhao, Jin",
    booktitle = "Proceedings of the Fourth International Workshop on Designing Meaning Representations",
    month = jun,
    year = "2023",
    address = "Nancy, France",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.dmr-1.10",
    pages = "99--109",
}

Rooted in AMR, Uniform Meaning Representation (UMR) is a graph-based formalism with nodes as concepts and edges as relations between them. When used to represent natural language semantics, UMR maps words in a sentence to concepts in the UMR graph. Multiword expressions (MWEs) pose a particular challenge to UMR annotation because they deviate from the default one-to-one mapping between words and concepts. There are different types of MWEs which require different kinds of annotation that must be specified in guidelines. This paper discusses the specific treatment for each type of MWE in UMR.

How Many Raters Do You Need? Power Analysis for Foundation Models.
Christopher M. Homan; Shira Wein; Lora M. Aroyo; and Chris Welty.
In Proceedings of I Can't Believe It's Not Better (ICBINB): Failure Modes in the Age of Foundation Models, 2023. NeurIPS.
@inproceedings{homanmany,
    title = "How Many Raters Do You Need? Power Analysis for Foundation Models",
    author = "Homan, Christopher M. and Wein, Shira and Aroyo, Lora M. and Welty, Chris",
    booktitle = "Proceedings of I Can't Believe It's Not Better ({ICBINB}): Failure Modes in the Age of Foundation Models",
    year = "2023",
    url = "https://neurips.cc/virtual/2023/76515#details",
    publisher = "NeurIPS",
}