Performance of Multimodal GPT-4V on USMLE with Image: Potential for Imaging Diagnostic Support with Explanations. Yang, Z., Yao, Z., Tasmin, M., Vashisht, P., Jang, W. S., Ouyang, F., Wang, B., Berlowitz, D., & Yu, H. November 2023. medRxiv preprint 2023.10.26.23297629.
Background: Using artificial intelligence (AI) to aid clinical diagnosis has been an active research topic for more than six decades. Past research, however, has lacked the scale and accuracy needed for clinical decision making. The power of large language models (LLMs) may be changing this. In this study, we evaluated the performance and interpretability of Generative Pre-trained Transformer 4 Vision (GPT-4V), a multimodal LLM, on medical licensing examination questions with images.

Methods: We used three sets of multiple-choice questions with images, drawn from the United States Medical Licensing Examination (USMLE), a USMLE question bank for medical students (AMBOSS), and the Diagnostic Radiology Qualifying Core Exam (DRQCE), to test GPT-4V's accuracy and explanation quality. We compared GPT-4V with two other large language models, GPT-4 and ChatGPT, which cannot process images. We also assessed healthcare professionals' preferences for, and feedback on, GPT-4V's explanations.

Results: GPT-4V achieved high accuracies on the USMLE (86.2%), AMBOSS (62.0%), and DRQCE (73.1%) question sets, outperforming ChatGPT and GPT-4 by relative increases of 131.8% and 64.5% on average, respectively. GPT-4V placed in the 70th to 80th percentile among AMBOSS users preparing for the exam, and it passed the full USMLE exam with an accuracy of 90.7%. Healthcare professionals preferred GPT-4V's explanations when it answered correctly, but the explanations revealed several issues, such as image misunderstanding, text hallucination, and reasoning errors, when it answered incorrectly.

Conclusion: GPT-4V showed promising results on medical licensing examination questions with images, suggesting its potential for clinical decision support. However, GPT-4V needs to improve its explanation quality and reliability before clinical use.

One- to two-sentence description: AI models offer potential as imaging diagnostic support tools, but their performance and interpretability are often unclear. Here, the authors show that GPT-4V, a large multimodal language model, can achieve high accuracy on medical licensing exams with images, but they also reveal several issues in its explanation quality.
@misc{yang_performance_2023,
	title = {Performance of {Multimodal} {GPT}-{4V} on {USMLE} with {Image}: {Potential} for {Imaging} {Diagnostic} {Support} with {Explanations}},
	copyright = {© 2023, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at http://creativecommons.org/licenses/by/4.0/},
	shorttitle = {Performance of {Multimodal} {GPT}-{4V} on {USMLE} with {Image}},
	url = {https://www.medrxiv.org/content/10.1101/2023.10.26.23297629v2},
	doi = {10.1101/2023.10.26.23297629},
	abstract = {Background Using artificial intelligence (AI) to help clinical diagnoses has been an active research topic for more than six decades. Past research, however, has not had the scale and accuracy for use in clinical decision making. The power of large language models (LLMs) may be changing this. In this study, we evaluated the performance and interpretability of Generative Pre-trained Transformer 4 Vision (GPT-4V), a multimodal LLM, on medical licensing examination questions with images.
Methods We used three sets of multiple-choice questions with images from United States Medical Licensing Examination (USMLE), USMLE question bank for medical students (AMBOSS), and Diagnostic Radiology Qualifying Core Exam (DRQCE) to test GPT-4V’s accuracy and explanation quality. We compared GPT-4V with two other large language models, GPT-4 and ChatGPT, which cannot process images. We also assessed the preference and feedback of healthcare professionals on GPT-4V’s explanations.
Results GPT-4V achieved high accuracies on USMLE (86.2\%), AMBOSS (62.0\%), and DRQCE (73.1\%), outperforming ChatGPT and GPT-4 by relative increase of 131.8\% and 64.5\% on average. GPT-4V was in the 70th - 80th percentile with AMBOSS users preparing for the exam. GPT-4V also passed the full USMLE exam with an accuracy of 90.7\%. GPT-4V’s explanations were preferred by healthcare professionals when it answered correctly, but they revealed several issues such as image misunderstanding, text hallucination, and reasoning error when it answered incorrectly.
Conclusion GPT-4V showed promising results for medical licensing examination questions with images, suggesting its potential for clinical decision support. However, GPT-4V needs to improve its explanation quality and reliability for clinical use.
1-2 sentence description AI models offer potential for imaging diagnostic support tool, but their performance and interpretability are often unclear. Here, the authors show that GPT-4V, a large multimodal language model, can achieve high accuracy on medical licensing exams with images, but also reveal several issues in its explanation quality.},
	language = {en},
	urldate = {2023-11-14},
	publisher = {medRxiv},
	author = {Yang, Zhichao and Yao, Zonghai and Tasmin, Mahbuba and Vashisht, Parth and Jang, Won Seok and Ouyang, Feiyun and Wang, Beining and Berlowitz, Dan and Yu, Hong},
	month = nov,
	year = {2023},
	note = {medRxiv preprint: 2023.10.26.23297629}
}
