Benchmark Results: Unstructured Clinical Notes

This section presents a performance comparison of BERT-based and GPT-based models on MIMIC mortality and readmission prediction tasks using unstructured clinical notes, reporting AUROC and AUPRC for each model and adaptation setting (freeze, finetune, prompt).

Outcome Prediction Performance (MIMIC Clinical Notes)

Table 2a: Performance comparison of BERT-based and GPT-based models on the MIMIC mortality prediction task using unstructured clinical notes. We report AUROC and AUPRC; bold denotes the best performance. Results are mean±std over 100 bootstrap resamples of the full test set. All metrics are multiplied by 100 for readability.
| Method | Model | Setting | AUROC | AUPRC |
|---|---|---|---|---|
| BERT-based LM | BERT | freeze | 69.96±6.78 | 24.99±7.33 |
| | | finetune | 81.04±5.14 | 40.90±9.21 |
| | Clinical-Longformer | freeze | 75.68±6.15 | 36.81±10.80 |
| | | finetune | 88.06±4.45 | 62.92±9.40 |
| | BioBERT | freeze | 75.30±4.95 | 29.49±8.17 |
| | | finetune | 75.85±6.63 | 36.93±10.89 |
| | GatorTron | freeze | 89.47±4.45 | 54.53±9.90 |
| | | finetune | 91.47±3.44 | 71.43±8.79 |
| | ClinicalBERT | freeze | 73.55±5.69 | 26.98±8.32 |
| | | finetune | 84.95±4.72 | 53.01±10.67 |
| GPT-based Base LLM | GPT-2-117M | freeze | 65.60±6.36 | 17.08±5.85 |
| | | finetune | 82.30±5.50 | 46.68±10.38 |
| | BioGPT-347M | freeze | 77.41±3.78 | 24.94±7.17 |
| | | finetune | 88.39±4.51 | 62.62±10.42 |
| | Meditron-7B | freeze | 71.98±6.15 | 26.77±8.52 |
| | | finetune | 79.05±5.05 | 39.10±9.84 |
| | BioMistral-7B | freeze | 70.78±7.30 | 33.64±9.81 |
| | | finetune | 74.42±5.88 | 40.31±9.32 |
| | OpenBioLLM-8B | freeze | 61.24±6.51 | 12.85±3.55 |
| | | finetune | 80.85±5.71 | 55.56±11.92 |
| | | prompt | 65.70±5.43 | 13.81±3.35 |
| | Qwen2.5-7B | freeze | 64.52±6.46 | 14.31±4.61 |
| | | finetune | 89.04±3.93 | 60.89±10.18 |
| | | prompt | 88.39±4.11 | 44.33±8.33 |
| | Gemma-3-4B | freeze | 66.24±6.26 | 28.78±9.54 |
| | | prompt | 96.57±1.32 | 68.97±9.29 |
| | DeepSeek-V3 | prompt | 97.13±1.07 | 65.16±10.39 |
| | GPT-4o | prompt | 93.99±2.00 | 58.04±10.01 |
| GPT-based Reasoning LLM | HuatuoGPT-o1-7B | freeze | 60.69±5.52 | 11.67±2.88 |
| | | finetune | 88.16±3.95 | 62.39±9.52 |
| | | prompt | 83.84±3.55 | 29.72±6.79 |
| | DeepSeek-R1-7B | freeze | 56.76±6.73 | 16.72±7.24 |
| | | finetune | 87.32±5.98 | 55.75±12.11 |
| | | prompt | 83.17±4.63 | 36.25±8.75 |
| | DeepSeek-R1 | prompt | **97.64±1.08** | 69.61±10.13 |
| | o3-mini-high | prompt | 97.45±1.01 | **71.72±10.31** |
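The captions describe the evaluation protocol: resample the full test set with replacement 100 times and report each metric as mean±std, scaled by 100. A minimal numpy sketch of that protocol (function names are ours, not from the benchmark code; metric implementations are standard rank-based AUROC and step-wise average precision) could look like:

```python
import numpy as np

def _auroc(y, p):
    """AUROC via the Mann-Whitney U statistic (average ranks for ties)."""
    order = np.argsort(p, kind="mergesort")
    ranks = np.empty(len(p))
    ranks[order] = np.arange(1, len(p) + 1)
    for v in np.unique(p):            # average ranks over tied scores
        tied = p == v
        ranks[tied] = ranks[tied].mean()
    n_pos = (y == 1).sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def _auprc(y, p):
    """Average precision (step-wise area under the PR curve)."""
    y_sorted = y[np.argsort(-p, kind="mergesort")]
    precision = np.cumsum(y_sorted) / np.arange(1, len(y) + 1)
    return (precision * y_sorted).sum() / y_sorted.sum()

def bootstrap_report(y_true, y_prob, n_boot=100, seed=0):
    """Resample the test set with replacement n_boot times and format
    AUROC/AUPRC as 'mean±std', scaled by 100 as in the tables."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aurocs, auprcs = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        yt, yp = y_true[idx], y_prob[idx]
        if yt.min() == yt.max():      # skip one-class resamples
            continue
        aurocs.append(_auroc(yt, yp))
        auprcs.append(_auprc(yt, yp))
    fmt = lambda v: f"{100 * np.mean(v):.2f}±{100 * np.std(v):.2f}"
    return fmt(aurocs), fmt(auprcs)
```

Skipping one-class resamples is one reasonable convention; with a reasonably balanced test set such resamples are vanishingly rare at n in the hundreds.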

Readmission Prediction Performance (MIMIC Clinical Notes)

Table 2b: Performance comparison of BERT-based and GPT-based models on the MIMIC readmission prediction task using unstructured clinical notes. We report AUROC and AUPRC; bold denotes the best performance. Results are mean±std over 100 bootstrap resamples of the full test set. All metrics are multiplied by 100 for readability.
| Method | Model | Setting | AUROC | AUPRC |
|---|---|---|---|---|
| BERT-based LM | BERT | freeze | 62.93±4.15 | 38.77±6.13 |
| | | finetune | 67.99±4.61 | 47.20±7.06 |
| | Clinical-Longformer | freeze | 65.04±4.98 | 48.70±6.04 |
| | | finetune | 75.56±4.42 | 59.19±6.65 |
| | BioBERT | freeze | 62.86±4.43 | 37.10±5.40 |
| | | finetune | 75.85±6.63 | 36.93±10.89 |
| | GatorTron | freeze | 70.75±4.24 | 47.74±7.09 |
| | | finetune | 75.96±4.34 | 59.01±6.85 |
| | ClinicalBERT | freeze | 64.94±4.40 | 40.90±6.10 |
| | | finetune | 71.93±4.35 | 53.82±6.85 |
| GPT-based Base LLM | GPT-2-117M | freeze | 60.43±5.00 | 37.05±6.14 |
| | | finetune | 71.81±4.58 | 50.14±6.67 |
| | BioGPT-347M | freeze | 66.54±4.04 | 37.62±5.37 |
| | | finetune | 71.66±4.10 | 52.23±6.63 |
| | Meditron-7B | freeze | 59.70±4.32 | 34.97±5.50 |
| | | finetune | 64.04±4.40 | 40.90±6.21 |
| | BioMistral-7B | freeze | 67.41±4.09 | 42.95±5.55 |
| | | finetune | 75.66±4.37 | 61.60±6.28 |
| | OpenBioLLM-8B | freeze | 57.09±3.50 | 32.06±5.08 |
| | | finetune | 77.22±3.73 | 63.91±5.82 |
| | | prompt | 58.47±4.21 | 32.99±4.66 |
| | Qwen2.5-7B | freeze | 61.48±4.46 | 40.03±5.98 |
| | | finetune | 75.80±4.00 | 58.42±6.72 |
| | | prompt | 77.32±4.75 | 66.64±6.05 |
| | Gemma-3-4B | freeze | 63.83±4.27 | 35.15±4.70 |
| | | prompt | 85.21±3.92 | 71.51±5.77 |
| | DeepSeek-V3 | prompt | 87.33±3.44 | **78.53±5.06** |
| | GPT-4o | prompt | 75.89±5.51 | 66.38±6.65 |
| GPT-based Reasoning LLM | HuatuoGPT-o1-7B | freeze | 59.14±4.26 | 37.67±6.05 |
| | | finetune | 77.82±3.84 | 62.21±6.25 |
| | | prompt | 67.33±5.35 | 49.50±7.09 |
| | DeepSeek-R1-7B | freeze | 53.21±3.81 | 33.12±5.35 |
| | | finetune | 74.54±3.94 | 51.13±6.80 |
| | | prompt | 61.69±4.65 | 34.50±5.75 |
| | DeepSeek-R1 | prompt | 58.18±6.23 | 56.25±6.53 |
| | o3-mini-high | prompt | **87.59±3.34** | 75.48±5.74 |

Key Observations (Unstructured Clinical Notes):

  • State-of-the-art LLMs (e.g., DeepSeek-R1, DeepSeek-V3, o3-mini-high) under zero-shot prompting considerably outperform fine-tuned BERT-based models.
  • GPT-based models perform strongly, especially with prompting: on mortality prediction, DeepSeek-R1 achieves the best AUROC and o3-mini-high the best AUPRC.
  • On readmission prediction, o3-mini-high leads in AUROC and DeepSeek-V3 in AUPRC.
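The "bold denotes the best performance" convention amounts to taking, per metric, the row with the highest mean. A small helper can make that check mechanical when results arrive as `mean±std` strings; the function names and the three illustrative rows (copied from Table 2a) are ours:

```python
def parse_cell(cell):
    """Split a 'mean±std' table cell into (mean, std) floats."""
    mean, std = cell.split("±")
    return float(mean), float(std)

def best_entry(rows, metric_idx):
    """Return the row whose metric mean is highest.

    Each row is (model, setting, auroc_cell, auprc_cell);
    metric_idx 0 selects AUROC, 1 selects AUPRC."""
    return max(rows, key=lambda r: parse_cell(r[2 + metric_idx])[0])

# Illustrative subset of Table 2a (mortality prediction):
rows = [
    ("GatorTron", "finetune", "91.47±3.44", "71.43±8.79"),
    ("DeepSeek-R1", "prompt", "97.64±1.08", "69.61±10.13"),
    ("o3-mini-high", "prompt", "97.45±1.01", "71.72±10.31"),
]
print(best_entry(rows, 0)[:2])  # AUROC leader
print(best_entry(rows, 1)[:2])  # AUPRC leader
```

Note that ranking by mean alone ignores the bootstrap std; with overlapping intervals (e.g., 97.64±1.08 vs. 97.45±1.01), the "best" label should not be read as a statistically significant difference.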