CALM: A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias. Gupta, V., Venkit, P. N., Laurencon, H., Wilson, S., & Passonneau, R. J. In Proceedings of the 1st Conference on Language Modeling (COLM), Philadelphia, PA, July 2024. OpenReview.
Paper: https://openreview.net/forum?id=RLFca3arx7
Abstract: As language models (LMs) become increasingly powerful and widely used, it is important to quantify their sociodemographic biases and potential for harm. Prior measures of bias are sensitive to perturbations in the templates designed to compare performance across social groups, due to factors such as low diversity or a limited number of templates. Moreover, most previous work considers only one NLP task. We introduce Comprehensive Assessment of Language Models (CALM) for robust measurement of social biases. We use sixteen datasets for question answering, sentiment analysis, and natural language inference, and filter them to produce 224 templates with high diversity (e.g., in length and vocabulary). From these templates we create a novel dataset of 78,400 prompts covering the three NLP tasks. Our empirical evaluation shows that CALM bias scores are more robust and far less sensitive than previous bias measurements to perturbations in the templates, such as synonym substitution, or to random selection of template subsets. We apply CALM to 20 large language models and find that, for two LM series, larger-parameter models tend to be more biased than smaller ones. Of the 20 LLMs investigated here, the T0 series is the least biased model family.
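The abstract leaves the template-to-prompt expansion implicit, but the counts pin down the ratio: 78,400 prompts / 224 templates = 350 instantiations per template, i.e., each template is filled in with terms associated with different social groups. The sketch below illustrates that general recipe; the template string, group-term lists, and the max-min accuracy gap used as a bias score are illustrative assumptions for exposition, not CALM's actual lexicons or metric.

```python
# Hypothetical sketch of template-based bias probing (not the CALM codebase).
# A template with a {person} slot is instantiated once per group-associated
# term; a bias score then compares model performance across the groups.

# CALM draws 224 templates from 16 QA, sentiment, and NLI datasets;
# this single template is a made-up stand-in.
TEMPLATES = [
    "{person} went to the doctor because {person} felt ill. Who felt ill?",
]

# Illustrative group-term lists; CALM's actual term sets differ.
GROUP_TERMS = {
    "group_a": ["Alice", "Maria"],
    "group_b": ["James", "Ahmed"],
}

def instantiate(template: str, term: str) -> str:
    """Fill every {person} slot in the template with one term."""
    return template.format(person=term)

def bias_score(accuracy_by_group: dict) -> float:
    """One simple disparity measure: max-min accuracy gap across groups."""
    vals = list(accuracy_by_group.values())
    return max(vals) - min(vals)

if __name__ == "__main__":
    prompts = [
        (group, instantiate(t, term))
        for t in TEMPLATES
        for group, terms in GROUP_TERMS.items()
        for term in terms
    ]
    for group, prompt in prompts:
        print(f"{group}: {prompt}")
    # After scoring a model's answers per group, e.g.:
    print(bias_score({"group_a": 0.91, "group_b": 0.84}))  # 0.07 gap
```

The robustness claim (bias scores that barely move under random subsets of templates) can likewise be made concrete with a generic resampling check. This is an assumed style of analysis, not the paper's exact protocol: compute a per-template bias score, then measure how much the aggregate varies when templates are subsampled.

```python
# Hypothetical robustness check (not CALM's published procedure): how much
# does a template-averaged bias score move under random template subsets?
import random
import statistics

def subset_sensitivity(per_template_scores, subset_frac=0.5,
                       n_trials=1000, seed=0):
    """Std. dev. of the mean score across random template subsets.

    A robust benchmark yields a small value: dropping half the templates
    barely changes the aggregate bias score.
    """
    rng = random.Random(seed)
    k = max(1, int(subset_frac * len(per_template_scores)))
    means = [statistics.mean(rng.sample(per_template_scores, k))
             for _ in range(n_trials)]
    return statistics.stdev(means)

# 224 synthetic per-template gap scores, matching CALM's template count.
scores = [random.Random(i).uniform(0.0, 0.2) for i in range(224)]
print(f"subset sensitivity: {subset_sensitivity(scores):.4f}")
```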
@inproceedings{gupta_calm_2024,
address = {Philadelphia, PA},
title = {{CALM}: {A} {Multi}-task {Benchmark} for {Comprehensive} {Assessment} of {Language} {Model} {Bias}},
url = {https://openreview.net/forum?id=RLFca3arx7},
abstract = {As language models (LMs) become increasingly powerful and widely used, it is important to quantify their sociodemographic biases and potential for harm. Prior measures of bias are sensitive to perturbations in the templates designed to compare performance across social groups, due to factors such as low diversity or a limited number of templates. Moreover, most previous work considers only one NLP task. We introduce Comprehensive Assessment of Language Models (CALM) for robust measurement of social biases. We use sixteen datasets for question answering, sentiment analysis, and natural language inference, and filter them to produce 224 templates with high diversity (e.g., in length and vocabulary). From these templates we create a novel dataset of 78,400 prompts covering the three NLP tasks. Our empirical evaluation shows that CALM bias scores are more robust and far less sensitive than previous bias measurements to perturbations in the templates, such as synonym substitution, or to random selection of template subsets. We apply CALM to 20 large language models and find that, for two LM series, larger-parameter models tend to be more biased than smaller ones. Of the 20 LLMs investigated here, the T0 series is the least biased model family.},
booktitle = {Proceedings of the 1st {Conference} on {Language} {Modeling} ({COLM})},
publisher = {OpenReview},
author = {Gupta, Vipul and Venkit, Pranav Narayanan and Laurencon, Hugo and Wilson, Shomir and Passonneau, Rebecca J.},
month = jul,
year = {2024},
}
{"_id":"oYhrgfrg9J586x8Ge","bibbaseid":"gupta-venkit-laurencon-wilson-passonneau-calmamultitaskbenchmarkforcomprehensiveassessmentoflanguagemodelbias-2024","author_short":["Gupta, V.","Venkit, P. N.","Laurencon, H.","Wilson, S.","Passonneau, R. J."],"bibdata":{"bibtype":"inproceedings","type":"inproceedings","address":"Philadelphia, PA","title":"CALM : A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias","url":"https://openreview.net/forum?id=RLFca3arx7","abstract":"As language models (LMs) become increasingly powerful and widely used, it is important to quantify them for sociodemographic bias with potential for harm. Prior measures of bias are sensitive to perturbations in the templates designed to compare performance across social groups, due to factors such as low diversity or limited number of templates. Also, most previous work considers only one NLP task. We introduce Comprehensive Assessment of Language Models (CALM) for robust measurement of social biases. We use sixteen datasets for question-answering, sentiment analysis and natural language inference and filter them to produce 224 templates with high diversity (e.g., length, vocabulary). This helps us create a novel dataset of 78,400 prompts covering the three NLP tasks. Our empirical evaluation shows that CALM bias scores are more robust and far less sensitive than previous bias measurements to perturbations in the templates, such as synonym substitution, or to random subset selection of templates. We apply CALM to 20 large language models, and find that for 2 LM series, larger parameter models tend to be more biased than smaller ones. The T0 series is the least biased model families, of the 20 LLMs investigated here.","booktitle":"Proceedings of the 1st Conference on Language Modeling (COLM)","publisher":"OpenReview","author":[{"propositions":[],"lastnames":["Gupta"],"firstnames":["Vipul"],"suffixes":[]},{"propositions":[],"lastnames":["Venkit"],"firstnames":["Pranav","Narayanan"],"suffixes":[]},{"propositions":[],"lastnames":["Laurencon"],"firstnames":["Hugo"],"suffixes":[]},{"propositions":[],"lastnames":["Wilson"],"firstnames":["Shomir"],"suffixes":[]},{"propositions":[],"lastnames":["Passonneau"],"firstnames":["Rebecca","J."],"suffixes":[]}],"month":"July","year":"2024","bibtex":"@inproceedings{gupta_calm_2024,\n\taddress = {Philadelphia, PA},\n\ttitle = {{CALM} : {A} {Multi}-task {Benchmark} for {Comprehensive} {Assessment} of {Language} {Model} {Bias}},\n\turl = {https://openreview.net/forum?id=RLFca3arx7},\n\tabstract = {As language models (LMs) become increasingly powerful and widely used, it is important to quantify them for sociodemographic bias with potential for harm. Prior measures of bias are sensitive to perturbations in the templates designed to compare performance across social groups, due to factors such as low diversity or limited number of templates. Also, most previous work considers only one NLP task. We introduce Comprehensive Assessment of Language Models (CALM) for robust measurement of social biases. We use sixteen datasets for question-answering, sentiment analysis and natural language inference and filter them to produce 224 templates with high diversity (e.g., length, vocabulary). This helps us create a novel dataset of 78,400 prompts covering the three NLP tasks. 
Our empirical evaluation shows that CALM bias scores are more robust and far less sensitive than previous bias measurements to perturbations in the templates, such as synonym substitution, or to random subset selection of templates. We apply CALM to 20 large language models, and find that for 2 LM series, larger parameter models tend to be more biased than smaller ones. The T0 series is the least biased model families, of the 20 LLMs investigated here.},\n\tbooktitle = {Proceedings of the 1st {Conference} on {Language} {Modeling} ({COLM})},\n\tpublisher = {OpenReview},\n\tauthor = {Gupta, Vipul and Venkit, Pranav Narayanan and Laurencon, Hugo and Wilson, Shomir and Passonneau, Rebecca J.},\n\tmonth = jul,\n\tyear = {2024},\n}\n\n","author_short":["Gupta, V.","Venkit, P. N.","Laurencon, H.","Wilson, S.","Passonneau, R. J."],"key":"gupta_calm_2024","id":"gupta_calm_2024","bibbaseid":"gupta-venkit-laurencon-wilson-passonneau-calmamultitaskbenchmarkforcomprehensiveassessmentoflanguagemodelbias-2024","role":"author","urls":{"Paper":"https://openreview.net/forum?id=RLFca3arx7"},"metadata":{"authorlinks":{}}},"bibtype":"inproceedings","biburl":"https://api.zotero.org/users/5414994/collections/86KRVVHK/items?key=R4LGq2FsV4FTfuDPOljwMZOi&format=bibtex&limit=100","dataSources":["hP49mbAp85woEpWam","v7pkcTjCXS5Y8BD24","7MAbqkLtLFkexFCbL"],"keywords":[],"search_terms":["calm","multi","task","benchmark","comprehensive","assessment","language","model","bias","gupta","venkit","laurencon","wilson","passonneau"],"title":"CALM : A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias","year":2024}