Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data. Xu, X., Yao, B., Dong, Y., Gabriel, S., Yu, H., Hendler, J., Ghassemi, M., Dey, A. K., & Wang, D. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(1):1–32, March 2024. doi: 10.1145/3643540

Abstract: Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant gap in research when it comes to understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs on various mental health prediction tasks via online text data, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate a promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction fine-tuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best-finetuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. Meanwhile, we also emphasize the important limitations before achieving deployability in real-world mental health settings, such as known racial and gender bias. We highlight the important ethical risks accompanying this line of research.
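To make the prompting setups concrete, here is a minimal sketch of the kind of zero-shot prompt design the abstract describes, written against the OpenAI Python client. The prompt wording and the example post are invented for illustration; they are not the authors' actual templates or data.

# Minimal sketch of a zero-shot mental health classification prompt,
# in the spirit of the paper's prompt-design experiments.
# Requires the `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

post = "I haven't slept properly in weeks and nothing feels worth doing anymore."

# Zero-shot: a task instruction plus the post, with no demonstrations.
prompt = (
    "Read the following social media post and answer with exactly one word, "
    "yes or no: does the poster show symptoms of depression?\n\n"
    f"Post: {post}\nAnswer:"
)

# Few-shot prompting would differ only by prepending labeled examples, e.g.:
#   "Post: 'Best day ever, got the job!'\nAnswer: no\n\n" + prompt

response = client.chat.completions.create(
    model="gpt-4",  # the paper also evaluates GPT-3.5
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic decoding for classification
)
print(response.choices[0].message.content.strip().lower())  # "yes" or "no"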
@article{xu_mental-llm_2024,
title = {Mental-{LLM}: {Leveraging} {Large} {Language} {Models} for {Mental} {Health} {Prediction} via {Online} {Text} {Data}},
volume = {8},
issn = {2474-9567},
shorttitle = {Mental-{LLM}},
url = {https://dl.acm.org/doi/10.1145/3643540},
doi = {10.1145/3643540},
abstract = {Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant gap in research when it comes to understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs on various mental health prediction tasks via online text data, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate a promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction finetuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best-finetuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9\% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8\%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. Meanwhile, we also emphasize the important limitations before achieving deployability in real-world mental health settings, such as known racial and gender bias. We highlight the important ethical risks accompanying this line of research.},
language = {en},
number = {1},
urldate = {2024-09-03},
journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies},
author = {Xu, Xuhai and Yao, Bingsheng and Dong, Yuanzhe and Gabriel, Saadia and Yu, Hong and Hendler, James and Ghassemi, Marzyeh and Dey, Anind K. and Wang, Dakuo},
month = mar,
year = {2024},
pages = {1--32},
}
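The abstract's central finding is that instruction fine-tuning lets much smaller open models (Mental-Alpaca, Mental-FLAN-T5) overtake far larger prompted models. Below is a minimal sketch of that recipe using Hugging Face transformers, assuming flan-t5-base as a small stand-in for the model the paper fine-tunes, and two invented instruction/label pairs; the authors fine-tune on multiple online text datasets, not this toy data.

# Minimal instruction fine-tuning sketch, analogous in spirit to Mental-FLAN-T5.
# Requires: pip install transformers datasets
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/flan-t5-base"  # small stand-in, not the paper's exact checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Each example pairs an instruction-style prompt with a one-word label.
# These two examples are invented placeholders.
raw = Dataset.from_dict({
    "prompt": [
        "Post: 'I can't stop crying and I see no way out.' "
        "Does the poster show symptoms of depression? Answer yes or no.",
        "Post: 'Had a great hike with friends today!' "
        "Does the poster show symptoms of depression? Answer yes or no.",
    ],
    "label": ["yes", "no"],
})

def tokenize(batch):
    # Encode the instruction prompt as input and the label word as target.
    enc = tokenizer(batch["prompt"], truncation=True, max_length=512)
    enc["labels"] = tokenizer(text_target=batch["label"], truncation=True)["input_ids"]
    return enc

train = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="mental-flan-t5-sketch",
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

The same prompt template is then reused at inference time, so the fine-tuned model answers the classification question directly rather than needing demonstrations in context.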
{"_id":"Whj4TbNAxeZrfxKeS","bibbaseid":"xu-yao-dong-gabriel-yu-hendler-ghassemi-dey-etal-mentalllmleveraginglargelanguagemodelsformentalhealthpredictionviaonlinetextdata-2024","author_short":["Xu, X.","Yao, B.","Dong, Y.","Gabriel, S.","Yu, H.","Hendler, J.","Ghassemi, M.","Dey, A. K.","Wang, D."],"bibdata":{"bibtype":"article","type":"article","title":"Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data","volume":"8","issn":"2474-9567","shorttitle":"Mental-LLM","url":"https://dl.acm.org/doi/10.1145/3643540","doi":"10.1145/3643540","abstract":"Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant gap in research when it comes to understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs on various mental health prediction tasks via online text data, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate a promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction finetuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best-finetuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. Meanwhile, we also emphasize the important limitations before achieving deployability in real-world mental health settings, such as known racial and gender bias. 
We highlight the important ethical risks accompanying this line of research.","language":"en","number":"1","urldate":"2024-09-03","journal":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","author":[{"propositions":[],"lastnames":["Xu"],"firstnames":["Xuhai"],"suffixes":[]},{"propositions":[],"lastnames":["Yao"],"firstnames":["Bingsheng"],"suffixes":[]},{"propositions":[],"lastnames":["Dong"],"firstnames":["Yuanzhe"],"suffixes":[]},{"propositions":[],"lastnames":["Gabriel"],"firstnames":["Saadia"],"suffixes":[]},{"propositions":[],"lastnames":["Yu"],"firstnames":["Hong"],"suffixes":[]},{"propositions":[],"lastnames":["Hendler"],"firstnames":["James"],"suffixes":[]},{"propositions":[],"lastnames":["Ghassemi"],"firstnames":["Marzyeh"],"suffixes":[]},{"propositions":[],"lastnames":["Dey"],"firstnames":["Anind","K."],"suffixes":[]},{"propositions":[],"lastnames":["Wang"],"firstnames":["Dakuo"],"suffixes":[]}],"month":"March","year":"2024","pages":"1–32","bibtex":"@article{xu_mental-llm_2024,\n\ttitle = {Mental-{LLM}: {Leveraging} {Large} {Language} {Models} for {Mental} {Health} {Prediction} via {Online} {Text} {Data}},\n\tvolume = {8},\n\tissn = {2474-9567},\n\tshorttitle = {Mental-{LLM}},\n\turl = {https://dl.acm.org/doi/10.1145/3643540},\n\tdoi = {10.1145/3643540},\n\tabstract = {Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant gap in research when it comes to understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs on various mental health prediction tasks via online text data, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate a promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction finetuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best-finetuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9\\% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8\\%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. Meanwhile, we also emphasize the important limitations before achieving deployability in real-world mental health settings, such as known racial and gender bias. We highlight the important ethical risks accompanying this line of research.},\n\tlanguage = {en},\n\tnumber = {1},\n\turldate = {2024-09-03},\n\tjournal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies},\n\tauthor = {Xu, Xuhai and Yao, Bingsheng and Dong, Yuanzhe and Gabriel, Saadia and Yu, Hong and Hendler, James and Ghassemi, Marzyeh and Dey, Anind K. and Wang, Dakuo},\n\tmonth = mar,\n\tyear = {2024},\n\tpages = {1--32},\n}\n\n","author_short":["Xu, X.","Yao, B.","Dong, Y.","Gabriel, S.","Yu, H.","Hendler, J.","Ghassemi, M.","Dey, A. 
K.","Wang, D."],"key":"xu_mental-llm_2024","id":"xu_mental-llm_2024","bibbaseid":"xu-yao-dong-gabriel-yu-hendler-ghassemi-dey-etal-mentalllmleveraginglargelanguagemodelsformentalhealthpredictionviaonlinetextdata-2024","role":"author","urls":{"Paper":"https://dl.acm.org/doi/10.1145/3643540"},"metadata":{"authorlinks":{}},"html":""},"bibtype":"article","biburl":"http://fenway.cs.uml.edu/papers/pubs-all.bib","dataSources":["TqaA9miSB65nRfS5H"],"keywords":[],"search_terms":["mental","llm","leveraging","large","language","models","mental","health","prediction","via","online","text","data","xu","yao","dong","gabriel","yu","hendler","ghassemi","dey","wang"],"title":"Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data","year":2024}