Benchmark Results: Structured EHR Data

This section details the performance of mortality, length-of-stay (LOS), and readmission prediction on the TJH and MIMIC-IV datasets using structured Electronic Health Records (EHR).

Outcome Prediction Performance

Performance of mortality prediction on the TJH and MIMIC-IV datasets using structured EHR. Bold indicates the best performance excluding full-shot results. Italic indicates that the optimized prompt outperforms the base prompt. We bootstrap all test-set samples 100 times to report standard deviations. All metrics are multiplied by 100 for readability.
Methods Setting TJH Outcome (AUROC / AUPRC) MIMIC-IV Outcome (AUROC / AUPRC)
ML CatBoost 10 shot 99.43±0.31 99.31±0.39 62.18±7.41 19.48±5.81
full shot 99.16±0.47 98.99±0.59 71.18±6.72 28.27±8.53
DT 10 shot 79.64±2.78 69.20±4.42 59.48±5.84 11.98±3.23
full shot 92.20±1.83 87.79±3.04 51.81±3.69 10.48±2.64
RF 10 shot 99.16±0.44 98.99±0.54 59.92±8.79 22.51±7.55
full shot 99.18±0.46 99.05±0.56 67.06±5.54 15.89±3.84
XGBoost 10 shot 62.66±3.38 53.15±4.37 55.77±5.43 10.76±2.55
full shot 98.05±0.94 95.58±2.18 64.62±4.97 17.66±5.12
DL GRU 10 shot 87.79±2.26 83.42±3.72 74.96±6.67 25.74±7.04
full shot 93.57±1.71 90.40±3.19 92.49±3.03 72.05±7.58
LSTM 10 shot 92.46±1.97 85.71±4.29 56.42±7.93 16.52±5.56
full shot 92.98±1.91 86.97±4.12 93.12±3.24 76.18±7.90
RNN 10 shot 95.53±1.28 94.27±1.78 62.03±7.97 26.07±8.62
full shot 96.42±1.10 95.18±1.62 91.76±3.12 72.03±8.15
AdaCare 10 shot 77.11±3.97 76.24±4.70 80.02±6.32 54.93±9.85
full shot 99.02±0.46 98.86±0.53 94.28±3.52 81.93±6.97
AICare 10 shot 86.79±2.41 82.98±3.96 60.87±6.25 23.69±8.14
full shot 95.97±1.31 94.56±2.02 92.89±3.66 77.84±7.10
ConCare 10 shot 90.98±1.97 91.43±2.12 72.58±6.42 30.73±8.44
full shot 91.00±2.14 91.72±2.30 94.08±3.70 80.65±6.98
GRASP 10 shot 87.25±2.53 84.32±3.56 69.89±8.80 45.96±10.33
full shot 94.25±1.58 92.03±2.54 93.14±3.03 72.55±8.36
Base LLM OpenBioLLM-8B base prompt 49.37±2.70 46.17±4.40 52.35±4.65 10.31±2.28
optimized prompt 56.75±3.92 49.76±4.67 58.69±6.06 12.85±3.77
Qwen2.5-7B base prompt 72.96±4.00 64.49±5.06 70.68±5.15 17.55±4.54
optimized prompt 79.83±2.68 70.87±4.61 61.57±7.12 13.58±3.17
Gemma-3-4B base prompt 65.64±3.53 64.83±4.68 63.46±7.68 17.76±5.71
optimized prompt 76.01±3.46 71.62±4.97 57.78±7.40 15.16±5.11
DeepSeek-V3 base prompt 89.59±1.93 85.06±3.01 78.07±6.13 43.76±10.43
optimized prompt 89.67±1.90 82.93±3.58 76.86±4.71 33.47±9.58
GPT-4o base prompt 96.76±1.30 93.90±2.55 76.96±6.59 33.90±9.17
optimized prompt 95.72±1.21 93.04±2.08 85.99±3.85 42.20±9.92
Reasoning LLM HuatuoGPT-o1-7B base prompt 77.74±3.34 71.89±4.83 73.20±6.12 22.32±6.87
optimized prompt 85.34±2.61 77.31±4.26 70.39±7.60 20.33±5.51
DeepSeek-R1-7B base prompt 53.59±3.88 49.06±4.05 53.59±5.26 10.53±2.75
optimized prompt 52.70±1.95 47.89±4.09 40.94±3.97 9.43±2.27
DeepSeek-R1 base prompt 90.63±1.99 83.59±3.85 73.68±7.52 33.27±9.51
optimized prompt 85.59±1.97 76.87±3.56 83.95±4.60 42.10±9.95
o3-mini-high base prompt 85.43±2.15 76.52±3.91 71.13±5.41 18.35±5.02
optimized prompt 84.42±2.52 75.65±4.48 71.23±7.19 28.99±7.88
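The bootstrapped standard deviations reported in the table above can be reproduced with a sketch like the following. The AUROC implementation is written from scratch, and the names (`auroc`, `bootstrap_metric`) and the fixed seed are illustrative assumptions, not the benchmark's actual code.

```python
import random
import statistics

def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    random positive is scored above a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_metric(labels, scores, metric, n_boot=100, seed=0):
    """Resample the test set with replacement n_boot times and report the
    mean and standard deviation of the metric, scaled by 100."""
    rng = random.Random(seed)
    n = len(labels)
    vals = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        ss = [scores[i] for i in idx]
        if len(set(ys)) < 2:  # skip degenerate resamples with one class
            continue
        vals.append(100 * metric(ys, ss))
    return statistics.mean(vals), statistics.stdev(vals)
```

With a real model's test-set labels and predicted probabilities, `bootstrap_metric(labels, scores, auroc)` yields a mean and standard deviation on the 0–100 scale used throughout the tables.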

Length of Stay (LOS) Prediction Performance

Performance of LOS prediction on the TJH dataset using structured EHR. Bold indicates the best performance excluding full-shot results. Italic indicates that the optimized prompt outperforms the base prompt. We bootstrap all test-set samples 100 times to report standard deviations. All metrics are multiplied by 100 for readability.
Methods Setting MAE MSE RMSE
ML CatBoost 10 shot 4.14±0.18 24.04±3.63 4.89±0.37
full shot 3.09±0.21 18.61±4.24 4.29±0.49
DT 10 shot 6.25±0.72 118.50±17.13 10.86±0.78
full shot 3.53±0.42 50.40±9.04 7.07±0.65
RF 10 shot 4.60±0.22 31.21±3.70 5.58±0.33
full shot 3.09±0.24 23.04±4.74 4.77±0.49
XGBoost 10 shot 4.22±0.18 24.73±3.59 4.96±0.36
full shot 3.06±0.21 18.70±4.29 4.30±0.49
DL GRU 10 shot 4.43±0.23 31.35±3.69 5.59±0.33
full shot 2.50±0.21 17.35±4.27 4.13±0.51
LSTM 10 shot 3.83±0.23 25.94±3.91 5.08±0.38
full shot 2.64±0.21 17.76±4.21 4.18±0.50
RNN 10 shot 4.32±0.19 25.97±3.58 5.08±0.35
full shot 2.07±0.24 16.64±5.00 4.03±0.61
AdaCare 10 shot 3.79±0.20 23.38±3.70 4.82±0.38
full shot 2.49±0.24 18.12±4.63 4.22±0.54
AICare 10 shot 3.24±0.22 20.35±4.43 4.48±0.49
full shot 2.17±0.23 16.28±4.84 3.99±0.60
ConCare 10 shot 4.01±0.29 30.03±6.41 5.45±0.58
full shot 2.32±0.24 17.94±4.99 4.19±0.59
GRASP 10 shot 5.81±0.18 41.25±2.96 6.42±0.23
full shot 3.84±0.20 23.89±3.07 4.88±0.31
Base LLM OpenBioLLM-8B base prompt 6.50±0.32 61.41±6.84 7.82±0.44
optimized prompt 8.97±0.36 105.23±6.52 10.25±0.32
Qwen2.5-7B base prompt 12.44±0.43 184.73±12.89 13.58±0.47
optimized prompt 15.34±0.71 313.70±27.84 17.69±0.79
Gemma-3-4B base prompt 13.25±0.35 199.67±9.20 14.13±0.33
optimized prompt 14.87±0.39 250.80±10.36 15.83±0.33
DeepSeek-V3 base prompt 13.83±0.41 220.52±11.64 14.84±0.39
optimized prompt 18.04±0.42 359.87±14.22 18.97±0.38
GPT-4o base prompt 14.56±0.47 258.23±14.89 16.06±0.46
optimized prompt 17.93±0.58 378.94±21.17 19.46±0.55
Reasoning LLM HuatuoGPT-o1-7B base prompt 8.87±0.26 98.61±5.83 9.93±0.29
optimized prompt 11.12±0.42 157.38±11.66 12.54±0.47
DeepSeek-R1-7B base prompt 5.24±0.22 37.66±4.89 6.12±0.39
optimized prompt 5.59±0.24 44.07±4.66 6.63±0.35
DeepSeek-R1 base prompt 12.89±0.41 194.28±10.52 13.93±0.38
optimized prompt 15.20±0.39 263.05±11.49 16.21±0.35
o3-mini-high base prompt 13.44±0.46 217.77±15.54 14.75±0.52
optimized prompt 15.63±0.40 276.67±13.52 16.63±0.41
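The three LOS columns are consistent with MAE, MSE, and RMSE (e.g., sqrt(24.04) ≈ 4.90 matches the third column for 10-shot CatBoost). A minimal sketch of these metrics follows; `regression_metrics` is an assumed helper name, not the benchmark's actual function.

```python
def regression_metrics(y_true, y_pred):
    """MAE, MSE and RMSE for LOS prediction. RMSE is the square root of
    MSE, which is why the table satisfies col3 ~= sqrt(col2)."""
    errs = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errs) / len(errs)
    mse = sum(e * e for e in errs) / len(errs)
    return mae, mse, mse ** 0.5
```

Lower is better for all three; unlike the classification tables, the best 10-shot LOS numbers here come from conventional models rather than LLMs.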

Readmission Prediction Performance

Performance of readmission prediction on the MIMIC-IV dataset using structured EHR. Bold indicates the best performance excluding full-shot results. Italic indicates that the optimized prompt outperforms the base prompt. We bootstrap all test-set samples 100 times to report standard deviations. All metrics are multiplied by 100 for readability.
Methods Setting AUROC AUPRC
ML CatBoost 10 shot 52.29±5.39 26.96±5.10
full shot 61.78±4.82 31.76±5.71
DT 10 shot 55.68±3.38 24.91±3.75
full shot 51.55±2.72 23.65±3.72
RF 10 shot 58.33±5.27 31.21±5.46
full shot 57.43±4.32 29.67±5.34
XGBoost 10 shot 56.40±4.53 29.10±4.63
full shot 64.23±4.34 34.31±6.65
DL GRU 10 shot 58.19±5.42 37.76±7.17
full shot 81.30±3.81 63.70±6.56
LSTM 10 shot 50.98±5.70 35.39±5.89
full shot 82.52±3.78 66.32±6.58
RNN 10 shot 60.80±5.22 44.51±6.86
full shot 80.90±3.64 64.07±6.56
AdaCare 10 shot 64.39±4.76 42.47±6.43
full shot 82.26±3.80 68.82±6.76
AICare 10 shot 66.32±4.55 43.76±7.02
full shot 80.20±4.23 65.50±6.72
ConCare 10 shot 70.30±4.62 48.68±7.41
full shot 79.17±4.42 64.27±6.97
GRASP 10 shot 53.76±5.54 36.35±6.24
full shot 77.76±4.17 62.42±7.08
Base LLM OpenBioLLM-8B base prompt 54.95±4.45 28.49±5.68
optimized prompt 50.21±4.97 24.23±4.02
Qwen2.5-7B base prompt 61.63±4.43 28.96±4.35
optimized prompt 55.86±3.98 25.32±4.02
Gemma-3-4B base prompt 59.85±4.12 28.94±4.64
optimized prompt 60.02±4.23 29.05±5.37
DeepSeek-V3 base prompt 66.70±4.76 34.02±5.89
optimized prompt 62.68±4.49 30.91±5.30
GPT-4o base prompt 59.16±5.45 28.70±5.08
optimized prompt 62.72±4.87 34.43±5.73
Reasoning LLM HuatuoGPT-o1-7B base prompt 56.39±4.88 29.35±5.27
optimized prompt 50.54±4.88 24.30±4.22
DeepSeek-R1-7B base prompt 52.62±3.76 23.55±4.01
optimized prompt 53.19±4.13 24.53±3.66
DeepSeek-R1 base prompt 65.31±4.98 38.68±6.88
optimized prompt 73.92±3.78 43.59±6.42
o3-mini-high base prompt 62.89±4.72 33.69±5.82
optimized prompt 63.30±4.85 36.13±6.18

Key Observations (Structured EHR):

  • Conventional models dominate when ample training data is available: full-shot DL models lead on MIMIC-IV, while CatBoost reaches near-ceiling TJH outcome performance even in the 10-shot setting.
  • Leading LLMs (e.g., GPT-4o, DeepSeek-V3/R1) show strong zero-shot capability, often surpassing conventional models trained on only 10 shots.
  • The performance gap between LLMs and conventional models on structured data is narrowing for classification tasks; LOS regression remains an exception, where every LLM trails the conventional baselines by a wide margin.