MIMIC-SBDH: A Dataset for Social and Behavioral Determinants of Health. Ahsan, H., Ohnuki, E., Mitra, A., & Yu, H. Proceedings of Machine Learning Research, 149:1–20, 2021.
Social and Behavioral Determinants of Health (SBDHs) are environmental and behavioral factors that have a profound impact on health and related outcomes. Given their importance, physicians document SBDHs of their patients in Electronic Health Records (EHRs). However, SBDHs are mostly documented in unstructured EHR notes. Determining the status of the SBDHs requires manually reviewing the notes, which can be a tedious process. Therefore, there is a need to automate identifying the patients’ SBDH status in EHR notes. In this work, we created MIMIC-SBDH, the first publicly available dataset of EHR notes annotated for patients’ SBDH status. Specifically, we annotated 7,025 discharge summary notes for the status of 7 SBDHs as well as marked SBDH-related keywords. Using this annotated data for training and evaluation, we evaluated the performance of three machine learning models (Random Forest, XGBoost, and Bio-ClinicalBERT) on the task of identifying SBDH status in EHR notes. The performance ranged from the lowest 0.69 F1 score for Drug Use to the highest 0.96 F1 score for Community-Present. In addition to standard evaluation metrics such as the F1 score, we evaluated four capabilities that a model must possess to perform well on the task using the CheckList tool (Ribeiro et al., 2020). The results revealed several shortcomings of the models. Our results highlighted the need to perform more capability-centric evaluations in addition to standard metric comparisons.
@article{ahsan_mimic-sbdh_2021,
	title = {{MIMIC}-{SBDH}: {A} {Dataset} for {Social} and {Behavioral} {Determinants} of {Health}},
	abstract = {Social and Behavioral Determinants of Health (SBDHs) are environmental and behavioral factors that have a profound impact on health and related outcomes. Given their importance, physicians document SBDHs of their patients in Electronic Health Records (EHRs). However, SBDHs are mostly documented in unstructured EHR notes. Determining the status of the SBDHs requires manually reviewing the notes, which can be a tedious process. Therefore, there is a need to automate identifying the patients’ SBDH status in EHR notes. In this work, we created MIMIC-SBDH, the first publicly available dataset of EHR notes annotated for patients’ SBDH status. Specifically, we annotated 7,025 discharge summary notes for the status of 7 SBDHs as well as marked SBDH-related keywords. Using this annotated data for training and evaluation, we evaluated the performance of three machine learning models (Random Forest, XGBoost, and Bio-ClinicalBERT) on the task of identifying SBDH status in EHR notes. The performance ranged from the lowest 0.69 F1 score for Drug Use to the highest 0.96 F1 score for Community-Present. In addition to standard evaluation metrics such as the F1 score, we evaluated four capabilities that a model must possess to perform well on the task using the CheckList tool (Ribeiro et al., 2020). The results revealed several shortcomings of the models. Our results highlighted the need to perform more capability-centric evaluations in addition to standard metric comparisons.},
	language = {en},
	journal = {Proceedings of Machine Learning Research},
	volume = {149},
	pages = {1--20},
	author = {Ahsan, Hiba and Ohnuki, Emmie and Mitra, Avijit and Yu, Hong},
	year = {2021},
}