ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks

Understanding the efficacy of LLMs compared to established methods in medicine using structured EHR and unstructured clinical notes.

Abstract

The rapidly expanding deployment of Large Language Models (LLMs) in medicine necessitates a clearer understanding of their efficacy in conventional non-generative clinical prediction tasks, where their performance relative to established methods has been uncertain. This study comprehensively benchmarks 9 GPT-based LLMs, 5 BERT-based models, and 7 conventional clinical prediction methods on supervised prediction tasks using structured Electronic Health Records (EHR) and unstructured clinical notes. Our findings significantly revise previous assumptions: for clinical note-based predictions, state-of-the-art LLMs (e.g., DeepSeek R1/V3, GPT o3-mini-high) in zero-shot settings now considerably outperform fine-tuned BERT-based models. With structured EHR data, specialized conventional models achieve superior performance when fully trained with ample data, but leading LLMs (e.g., GPT-4o, DeepSeek R1/V3) demonstrate strong zero-shot capabilities, often surpassing conventional models trained on limited datasets, and the performance gap is narrowing. Notably, open-source LLMs are competitive with, or even superior to, proprietary LLMs. These results indicate that modern LLMs are increasingly potent for non-generative clinical tasks, excelling with unstructured text and offering data-efficient solutions for structured data, thus warranting a nuanced model selection strategy.
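
To make the zero-shot setting concrete, the sketch below shows how a single structured EHR record might be serialized into a prompt and scored by a chat LLM. This is a minimal illustration, not the paper's exact protocol: the feature dictionary, prompt wording, and model name are assumptions; only the standard OpenAI chat-completions call is used.

```python
# Minimal zero-shot sketch: serialize one patient's structured EHR
# features into natural language and ask a chat LLM for a risk estimate.
# The features, prompt, and model name below are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

visit = {"age": 67, "lactate dehydrogenase (U/L)": 410,
         "lymphocyte (%)": 8.2, "hs-CRP (mg/L)": 96.0}
features = "; ".join(f"{k}: {v}" for k, v in visit.items())

prompt = (
    "You are an experienced clinician. Based on the following features "
    "from a patient's EHR record, estimate the probability of "
    f"in-hospital mortality.\n\nFeatures: {features}\n\n"
    "Respond with a single number between 0 and 1."
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
# Assumes the model complies with the numeric-only instruction.
risk = float(resp.choices[0].message.content.strip())
print(f"Predicted mortality risk: {risk:.2f}")
```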

Key Findings

  • State-of-the-art LLMs in zero-shot settings now considerably outperform fine-tuned BERT-based models for clinical note-based predictions.
  • With structured EHR, leading LLMs demonstrate strong zero-shot capabilities, often surpassing conventional models trained on limited datasets.
  • Open-source LLMs are competitive with, or even superior to, proprietary LLMs in these tasks.
  • Modern LLMs are increasingly potent for non-generative clinical tasks, especially with unstructured text and as data-efficient solutions for structured data.

About This Study

Authors

Yinghao Zhu5,7,+, Junyi Gao2,3,+, Zixiang Wang5,+, Weibin Liao4,5,+, Xiaochen Zheng6, Lifang Liang5, Miguel O. Bernabeu2, Yasha Wang5, Lequan Yu7, Chengwei Pan1,*, Ewen M. Harrison2,*, Liantao Ma5,*

Affiliations

1 School of Artificial Intelligence, Beihang University, Beijing, China, 100191

2 Centre for Medical Informatics, The University of Edinburgh, Edinburgh, UK, EH8 9YL

3 Health Data Research UK, UK

4 School of Computer Science, Peking University, Beijing, China, 100871

5 National Engineering Research Center for Software Engineering, Peking University, Beijing, China, 100871

6 ETH Zurich, Zurich, Switzerland, 8092

7 School of Computing and Data Science, The University of Hong Kong, Hong Kong SAR, China, 999077

* Correspondence to: pancw@buaa.edu.cn, ewen.harrison@ed.ac.uk, malt@pku.edu.cn

+ These authors contributed equally to this work

Citation

To cite this work, please use the following:

Zhu, Y., Gao, J., Wang, Z., Liao, W., Zheng, X., Liang, L., Wang, Y., Pan, C., Harrison, E. M., & Ma, L. (2024). Is larger always better? Evaluating and prompting large language models for non-generative medical tasks. arXiv preprint arXiv:2407.18525.

A pre-print of the paper is available on arXiv:

https://arxiv.org/abs/2407.18525

Data Privacy and Code Availability

This research did not involve the collection of new patient EHR data. The TJH EHR dataset utilized in this study is publicly available on GitHub (https://github.com/HAIRLAB/Pre_Surv_COVID_19). The MIMIC datasets are available to credentialed researchers upon request, including structured EHR data (https://physionet.org/content/mimiciv/3.1/) and clinical notes data (https://physionet.org/content/mimic-iv-note/2.2/). We use these datasets under their respective licenses.

Throughout the experiments, we strictly adhere to the data use agreement, reaffirming our commitment to responsible data handling and usage. GPT-4o and GPT o3-mini-high are evaluated on all datasets through the secure Azure OpenAI API, and DeepSeek-V3 and DeepSeek-R1 are accessed via DeepSeek's official API, with human review of the data waived. All other models, including conventional ML and DL models and the remaining LLMs, are deployed locally.
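
For reference, a minimal sketch of these two API routes using the openai Python SDK is shown below. The endpoint variables, deployment/model names, and API version are placeholders rather than the study's exact configuration; the sketch relies only on DeepSeek's official API being OpenAI-compatible.

```python
# Sketch of the two API routes described above, via the openai Python SDK.
# Deployment names, the API version, and environment variables are placeholders.
import os
from openai import AzureOpenAI, OpenAI

# GPT-4o / GPT o3-mini-high via the secure Azure OpenAI API
azure_client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",  # placeholder version
)

# DeepSeek-V3 / DeepSeek-R1 via DeepSeek's OpenAI-compatible official API
deepseek_client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key=os.environ["DEEPSEEK_API_KEY"],
)

def ask(client, model, prompt):
    """Send one zero-shot prompt and return the raw text reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

# e.g. ask(azure_client, "gpt-4o", prompt)              # Azure deployment name
#      ask(deepseek_client, "deepseek-chat", prompt)     # DeepSeek-V3
#      ask(deepseek_client, "deepseek-reasoner", prompt) # DeepSeek-R1
```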

Acknowledgements

This work was supported by the National Natural Science Foundation of China (U23A20468), and Xuzhou Scientific Technological Projects (KC23143).

Junyi Gao acknowledges the receipt of studentship awards from the Health Data Research UK-The Alan Turing Institute Wellcome PhD Programme in Health Data Science (grant 218529/Z/19/Z).

We extend our gratitude to Jingkun An, Yuning Tong, Enshen Zhou, Bowen Jiang, and Yifan He for their preliminary discussions and experiments. We thank Enshen Zhou for providing computing resources for part of our research. We also thank Ahmed Allam for his suggestions on experimental settings.

Code Repository

The code and resources for this study are available on GitHub:

https://github.com/yhzhu99/ehr-llm-benchmark

This repository contains the implementation of our comprehensive benchmark of various models, including GPT-based LLMs, BERT-based language models, and conventional clinical prediction models for non-generative medical tasks.
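
As one illustration of what the benchmark covers, the sketch below fine-tunes a BERT-based classifier on clinical notes for a binary outcome using Hugging Face Transformers. It is a schematic stand-in, assuming a public clinical BERT checkpoint and toy data, not the repository's actual training script or hyperparameters.

```python
# Illustrative fine-tuned BERT baseline on clinical notes (binary outcome).
# The checkpoint, hyperparameters, and toy dataset are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "emilyalsentzer/Bio_ClinicalBERT"  # one common clinical BERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=2)

# Toy stand-ins for (de-identified) clinical notes and outcome labels
data = Dataset.from_dict({
    "text": ["Pt admitted with sepsis ...", "Routine post-op course ..."],
    "label": [1, 0],
})
data = data.map(
    lambda x: tokenizer(x["text"], truncation=True,
                        padding="max_length", max_length=512),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-mortality",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=data,
)
trainer.train()
```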