Context Variance Evaluation of Pretrained Language Models for Prompt-based Biomedical Knowledge Probing. Yao, Z., Cao, Y., Yang, Z., & Yu, H. AMIA Summits on Translational Science Proceedings, 2023:592–601, June 2023.
Pretrained language models (PLMs) have motivated research on what kinds of knowledge these models learn. The fill-in-the-blank problem (e.g., cloze tests) is a natural approach for gauging such knowledge. BioLAMA generates prompts for biomedical factual knowledge triples and uses the Top-k accuracy metric to evaluate different PLMs' knowledge. However, existing research has shown that such prompt-based knowledge probing methods can probe only a lower bound of a model's knowledge. Many factors, such as prompt-based probing biases, make the LAMA benchmark unreliable and unstable, and this problem is more prominent in BioLAMA: the severely long-tailed vocabulary distribution and large-N-M relations keep the performance gap between LAMA and BioLAMA notable. To address these issues, we introduced context variance into prompt generation and proposed a new rank-change-based evaluation metric. Unlike previous known-unknown evaluation criteria, we proposed the concept of "Misunderstand" in LAMA for the first time. Through experiments on 12 PLMs, we showed that our context variance prompts and the Understand-Confuse-Misunderstand (UCM) metric make BioLAMA friendlier to large-N-M relations and rare relations. We also conducted a set of control experiments to disentangle "understand" from mere "read and copy".
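To make the probing setup concrete, the following is a minimal sketch (not from the paper) of how cloze-style knowledge probing with a Top-k accuracy check is typically done against a masked PLM; the model name, example prompt, and top_k_hit helper are illustrative assumptions, not BioLAMA's actual implementation.

# Hedged sketch: cloze-style probing of a masked LM with a Top-k hit check.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-uncased"  # illustrative; BioLAMA probes biomedical PLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

def top_k_hit(prompt: str, gold: str, k: int = 5) -> bool:
    """True if `gold` is among the top-k single-token fillers for the mask slot."""
    inputs = tokenizer(prompt, return_tensors="pt")
    # Locate the [MASK] position in the tokenized prompt.
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits
    top_ids = logits[0, mask_pos].topk(k).indices.tolist()
    return gold in [tokenizer.decode([i]).strip() for i in top_ids]

# Probe one (subject, relation, object) triple with an illustrative prompt:
print(top_k_hit(f"Aspirin is used to treat {tokenizer.mask_token}.", "pain", k=10))

Scoring many such triples with top_k_hit yields the Top-k accuracy that the abstract describes as BioLAMA's evaluation metric; the paper's contribution replaces this single-prompt setup with context variance prompts and the rank-change-based UCM metric.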
@article{yao_context_2023,
	title = {Context {Variance} {Evaluation} of {Pretrained} {Language} {Models} for {Prompt}-based {Biomedical} {Knowledge} {Probing}},
	volume = {2023},
	issn = {2153-4063},
	url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10283095/},
	abstract = {Pretrained language models (PLMs) have motivated research on what kinds of knowledge these models learn. The fill-in-the-blank problem (e.g., cloze tests) is a natural approach for gauging such knowledge. BioLAMA generates prompts for biomedical factual knowledge triples and uses the Top-k accuracy metric to evaluate different PLMs' knowledge. However, existing research has shown that such prompt-based knowledge probing methods can probe only a lower bound of a model's knowledge. Many factors, such as prompt-based probing biases, make the LAMA benchmark unreliable and unstable, and this problem is more prominent in BioLAMA: the severely long-tailed vocabulary distribution and large-N-M relations keep the performance gap between LAMA and BioLAMA notable. To address these issues, we introduced context variance into prompt generation and proposed a new rank-change-based evaluation metric. Unlike previous known-unknown evaluation criteria, we proposed the concept of "Misunderstand" in LAMA for the first time. Through experiments on 12 PLMs, we showed that our context variance prompts and the Understand-Confuse-Misunderstand (UCM) metric make BioLAMA friendlier to large-N-M relations and rare relations. We also conducted a set of control experiments to disentangle "understand" from mere "read and copy".},
	urldate = {2023-11-14},
	journal = {AMIA Summits on Translational Science Proceedings},
	author = {Yao, Zonghai and Cao, Yi and Yang, Zhichao and Yu, Hong},
	month = jun,
	year = {2023},
	pmid = {37350903},
	pmcid = {PMC10283095},
	pages = {592--601},
}
