Accounting for Language Effect in the Evaluation of Cross-lingual AMR Parsers. Wein, S. & Schneider, N. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3824–3834, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics.
Cross-lingual Abstract Meaning Representation (AMR) parsers are currently evaluated in comparison to gold English AMRs, despite parsing a language other than English, due to the lack of multilingual AMR evaluation metrics. This evaluation practice is problematic because of the established effect of source language on AMR structure. In this work, we present three multilingual adaptations of monolingual AMR evaluation metrics and compare the performance of these metrics to sentence-level human judgments. We then use our most highly correlated metric to evaluate the output of state-of-the-art cross-lingual AMR parsers, finding that Smatch may still be a useful metric in comparison to gold English AMRs, while our multilingual adaptation of S2match (XS2match) is best for comparison with gold in-language AMRs.
@inproceedings{wein-schneider-2022-accounting,
    title = "Accounting for Language Effect in the Evaluation of Cross-lingual {AMR} Parsers",
    author = "Wein, Shira  and
      Schneider, Nathan",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.336",
    pages = "3824--3834",
    abstract = "Cross-lingual Abstract Meaning Representation (AMR) parsers are currently evaluated in comparison to gold English AMRs, despite parsing a language other than English, due to the lack of multilingual AMR evaluation metrics. This evaluation practice is problematic because of the established effect of source language on AMR structure. In this work, we present three multilingual adaptations of monolingual AMR evaluation metrics and compare the performance of these metrics to sentence-level human judgments. We then use our most highly correlated metric to evaluate the output of state-of-the-art cross-lingual AMR parsers, finding that Smatch may still be a useful metric in comparison to gold English AMRs, while our multilingual adaptation of S2match (XS2match) is best for comparison with gold in-language AMRs.",
}
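
For readers unfamiliar with the baseline setup the paper adapts, the sketch below shows how a single parsed AMR might be scored against a gold AMR with the monolingual Smatch metric. It is a minimal illustration assuming the PyPI smatch package and its get_amr_match / compute_f helpers; the paper's XS2match additionally uses cross-lingual concept matching, which is not reproduced here.

# Minimal sketch: scoring one parsed AMR against a gold AMR with Smatch.
# Assumes the PyPI `smatch` package exposes get_amr_match / compute_f;
# the example AMR strings are hypothetical.
import smatch

parsed_amr = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))"
gold_amr = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))"

# Count triples shared under the best variable mapping Smatch finds,
# then turn the counts into precision, recall, and F-score.
match_num, test_num, gold_num = smatch.get_amr_match(parsed_amr, gold_amr)
precision, recall, f_score = smatch.compute_f(match_num, test_num, gold_num)
print(f"Smatch P={precision:.2f} R={recall:.2f} F={f_score:.2f}")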
