This section details the performance of outcome (mortality), length-of-stay (LOS), and readmission prediction using structured Electronic Health Records (EHR): outcome prediction on both TJH and MIMIC-IV, LOS prediction on TJH, and readmission prediction on MIMIC-IV.
Outcome Prediction Performance
Performance of mortality prediction on the TJH and MIMIC-IV datasets using structured EHR. Bold indicates the best performance excluding results obtained with the full dataset; italic indicates that the proposed prompting framework outperforms basic prompts. Standard deviations come from bootstrapping all test-set samples 100 times. All metrics are multiplied by 100 for readability.
| Methods | Setting | TJH AUROC | TJH AUPRC | MIMIC-IV AUROC | MIMIC-IV AUPRC |
|---|---|---|---|---|---|
| **ML** | | | | | |
| CatBoost | 10 shot | 99.43±0.31 | 99.31±0.39 | 62.18±7.41 | 19.48±5.81 |
| | full shot | 99.16±0.47 | 98.99±0.59 | 71.18±6.72 | 28.27±8.53 |
| DT | 10 shot | 79.64±2.78 | 69.20±4.42 | 59.48±5.84 | 11.98±3.23 |
| | full shot | 92.20±1.83 | 87.79±3.04 | 51.81±3.69 | 10.48±2.64 |
| RF | 10 shot | 99.16±0.44 | 98.99±0.54 | 59.92±8.79 | 22.51±7.55 |
| | full shot | 99.18±0.46 | 99.05±0.56 | 67.06±5.54 | 15.89±3.84 |
| XGBoost | 10 shot | 62.66±3.38 | 53.15±4.37 | 55.77±5.43 | 10.76±2.55 |
| | full shot | 98.05±0.94 | 95.58±2.18 | 64.62±4.97 | 17.66±5.12 |
| **DL** | | | | | |
| GRU | 10 shot | 87.79±2.26 | 83.42±3.72 | 74.96±6.67 | 25.74±7.04 |
| | full shot | 93.57±1.71 | 90.40±3.19 | 92.49±3.03 | 72.05±7.58 |
| LSTM | 10 shot | 92.46±1.97 | 85.71±4.29 | 56.42±7.93 | 16.52±5.56 |
| | full shot | 92.98±1.91 | 86.97±4.12 | 93.12±3.24 | 76.18±7.90 |
| RNN | 10 shot | 95.53±1.28 | 94.27±1.78 | 62.03±7.97 | 26.07±8.62 |
| | full shot | 96.42±1.10 | 95.18±1.62 | 91.76±3.12 | 72.03±8.15 |
| AdaCare | 10 shot | 77.11±3.97 | 76.24±4.70 | 80.02±6.32 | 54.93±9.85 |
| | full shot | 99.02±0.46 | 98.86±0.53 | 94.28±3.52 | 81.93±6.97 |
| AICare | 10 shot | 86.79±2.41 | 82.98±3.96 | 60.87±6.25 | 23.69±8.14 |
| | full shot | 95.97±1.31 | 94.56±2.02 | 92.89±3.66 | 77.84±7.10 |
| ConCare | 10 shot | 90.98±1.97 | 91.43±2.12 | 72.58±6.42 | 30.73±8.44 |
| | full shot | 91.00±2.14 | 91.72±2.30 | 94.08±3.70 | 80.65±6.98 |
| GRASP | 10 shot | 87.25±2.53 | 84.32±3.56 | 69.89±8.80 | 45.96±10.33 |
| | full shot | 94.25±1.58 | 92.03±2.54 | 93.14±3.03 | 72.55±8.36 |
| **Base LLM** | | | | | |
| OpenBioLLM-8B | base prompt | 49.37±2.70 | 46.17±4.40 | 52.35±4.65 | 10.31±2.28 |
| | optimized prompt | 56.75±3.92 | 49.76±4.67 | 58.69±6.06 | 12.85±3.77 |
| Qwen2.5-7B | base prompt | 72.96±4.00 | 64.49±5.06 | 70.68±5.15 | 17.55±4.54 |
| | optimized prompt | 79.83±2.68 | 70.87±4.61 | 61.57±7.12 | 13.58±3.17 |
| Gemma-3-4B | base prompt | 65.64±3.53 | 64.83±4.68 | 63.46±7.68 | 17.76±5.71 |
| | optimized prompt | 76.01±3.46 | 71.62±4.97 | 57.78±7.40 | 15.16±5.11 |
| DeepSeek-V3 | base prompt | 89.59±1.93 | 85.06±3.01 | 78.07±6.13 | 43.76±10.43 |
| | optimized prompt | 89.67±1.90 | 82.93±3.58 | 76.86±4.71 | 33.47±9.58 |
| GPT-4o | base prompt | 96.76±1.30 | 93.90±2.55 | 76.96±6.59 | 33.90±9.17 |
| | optimized prompt | 95.72±1.21 | 93.04±2.08 | 85.99±3.85 | 42.20±9.92 |
| **Reasoning LLM** | | | | | |
| HuatuoGPT-o1-7B | base prompt | 77.74±3.34 | 71.89±4.83 | 73.20±6.12 | 22.32±6.87 |
| | optimized prompt | 85.34±2.61 | 77.31±4.26 | 70.39±7.60 | 20.33±5.51 |
| DeepSeek-R1-7B | base prompt | 53.59±3.88 | 49.06±4.05 | 53.59±5.26 | 10.53±2.75 |
| | optimized prompt | 52.70±1.95 | 47.89±4.09 | 40.94±3.97 | 9.43±2.27 |
| DeepSeek-R1 | base prompt | 90.63±1.99 | 83.59±3.85 | 73.68±7.52 | 33.27±9.51 |
| | optimized prompt | 85.59±1.97 | 76.87±3.56 | 83.95±4.60 | 42.10±9.95 |
| o3-mini-high | base prompt | 85.43±2.15 | 76.52±3.91 | 71.13±5.41 | 18.35±5.02 |
| | optimized prompt | 84.42±2.52 | 75.65±4.48 | 71.23±7.19 | 28.99±7.88 |
Length of Stay (LOS) Prediction Performance
Performance of LOS prediction on the TJH dataset using structured EHR. Bold indicates the best performance excluding results obtained with the full dataset; italic indicates that the proposed prompting framework outperforms basic prompts. Standard deviations come from bootstrapping all test-set samples 100 times. All metrics are multiplied by 100 for readability.
| Methods | Setting | MAE | MSE | RMSE |
|---|---|---|---|---|
| **ML** | | | | |
| CatBoost | 10 shot | 4.14±0.18 | 24.04±3.63 | 4.89±0.37 |
| | full shot | 3.09±0.21 | 18.61±4.24 | 4.29±0.49 |
| DT | 10 shot | 6.25±0.72 | 118.50±17.13 | 10.86±0.78 |
| | full shot | 3.53±0.42 | 50.40±9.04 | 7.07±0.65 |
| RF | 10 shot | 4.60±0.22 | 31.21±3.70 | 5.58±0.33 |
| | full shot | 3.09±0.24 | 23.04±4.74 | 4.77±0.49 |
| XGBoost | 10 shot | 4.22±0.18 | 24.73±3.59 | 4.96±0.36 |
| | full shot | 3.06±0.21 | 18.70±4.29 | 4.30±0.49 |
| **DL** | | | | |
| GRU | 10 shot | 4.43±0.23 | 31.35±3.69 | 5.59±0.33 |
| | full shot | 2.50±0.21 | 17.35±4.27 | 4.13±0.51 |
| LSTM | 10 shot | 3.83±0.23 | 25.94±3.91 | 5.08±0.38 |
| | full shot | 2.64±0.21 | 17.76±4.21 | 4.18±0.50 |
| RNN | 10 shot | 4.32±0.19 | 25.97±3.58 | 5.08±0.35 |
| | full shot | 2.07±0.24 | 16.64±5.00 | 4.03±0.61 |
| AdaCare | 10 shot | 3.79±0.20 | 23.38±3.70 | 4.82±0.38 |
| | full shot | 2.49±0.24 | 18.12±4.63 | 4.22±0.54 |
| AICare | 10 shot | 3.24±0.22 | 20.35±4.43 | 4.48±0.49 |
| | full shot | 2.17±0.23 | 16.28±4.84 | 3.99±0.60 |
| ConCare | 10 shot | 4.01±0.29 | 30.03±6.41 | 5.45±0.58 |
| | full shot | 2.32±0.24 | 17.94±4.99 | 4.19±0.59 |
| GRASP | 10 shot | 5.81±0.18 | 41.25±2.96 | 6.42±0.23 |
| | full shot | 3.84±0.20 | 23.89±3.07 | 4.88±0.31 |
| **Base LLM** | | | | |
| OpenBioLLM-8B | base prompt | 6.50±0.32 | 61.41±6.84 | 7.82±0.44 |
| | optimized prompt | 8.97±0.36 | 105.23±6.52 | 10.25±0.32 |
| Qwen2.5-7B | base prompt | 12.44±0.43 | 184.73±12.89 | 13.58±0.47 |
| | optimized prompt | 15.34±0.71 | 313.70±27.84 | 17.69±0.79 |
| Gemma-3-4B | base prompt | 13.25±0.35 | 199.67±9.20 | 14.13±0.33 |
| | optimized prompt | 14.87±0.39 | 250.80±10.36 | 15.83±0.33 |
| DeepSeek-V3 | base prompt | 13.83±0.41 | 220.52±11.64 | 14.84±0.39 |
| | optimized prompt | 18.04±0.42 | 359.87±14.22 | 18.97±0.38 |
| GPT-4o | base prompt | 14.56±0.47 | 258.23±14.89 | 16.06±0.46 |
| | optimized prompt | 17.93±0.58 | 378.94±21.17 | 19.46±0.55 |
| **Reasoning LLM** | | | | |
| HuatuoGPT-o1-7B | base prompt | 8.87±0.26 | 98.61±5.83 | 9.93±0.29 |
| | optimized prompt | 11.12±0.42 | 157.38±11.66 | 12.54±0.47 |
| DeepSeek-R1-7B | base prompt | 5.24±0.22 | 37.66±4.89 | 6.12±0.39 |
| | optimized prompt | 5.59±0.24 | 44.07±4.66 | 6.63±0.35 |
| DeepSeek-R1 | base prompt | 12.89±0.41 | 194.28±10.52 | 13.93±0.38 |
| | optimized prompt | 15.20±0.39 | 263.05±11.49 | 16.21±0.35 |
| o3-mini-high | base prompt | 13.44±0.46 | 217.77±15.54 | 14.75±0.52 |
| | optimized prompt | 15.63±0.40 | 276.67±13.52 | 16.63±0.41 |
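The three LOS columns behave as MAE, MSE, and RMSE: in every row the third value equals the square root of the second (e.g., √24.04 ≈ 4.90 against the reported 4.89), though the metric names themselves are inferred rather than stated in this excerpt. A minimal sketch of the three regression metrics:

```python
import numpy as np

def los_metrics(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for length-of-stay regression."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    mae = float(np.abs(err).mean())     # mean absolute error
    mse = float((err ** 2).mean())      # mean squared error
    return mae, mse, float(np.sqrt(mse))  # RMSE is always sqrt(MSE)
```

The sqrt relationship gives a quick consistency check on any row of the table: for the 10-shot CatBoost entry, √24.04 ≈ 4.90 matches the reported RMSE of 4.89 up to rounding.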
Readmission Prediction Performance
Performance of readmission prediction on the MIMIC-IV dataset using structured EHR. Bold indicates the best performance excluding results obtained with the full dataset; italic indicates that the proposed prompting framework outperforms basic prompts. Standard deviations come from bootstrapping all test-set samples 100 times. All metrics are multiplied by 100 for readability.
| Methods | Setting | AUROC | AUPRC |
|---|---|---|---|
| **ML** | | | |
| CatBoost | 10 shot | 52.29±5.39 | 26.96±5.10 |
| | full shot | 61.78±4.82 | 31.76±5.71 |
| DT | 10 shot | 55.68±3.38 | 24.91±3.75 |
| | full shot | 51.55±2.72 | 23.65±3.72 |
| RF | 10 shot | 58.33±5.27 | 31.21±5.46 |
| | full shot | 57.43±4.32 | 29.67±5.34 |
| XGBoost | 10 shot | 56.40±4.53 | 29.10±4.63 |
| | full shot | 64.23±4.34 | 34.31±6.65 |
| **DL** | | | |
| GRU | 10 shot | 58.19±5.42 | 37.76±7.17 |
| | full shot | 81.30±3.81 | 63.70±6.56 |
| LSTM | 10 shot | 50.98±5.70 | 35.39±5.89 |
| | full shot | 82.52±3.78 | 66.32±6.58 |
| RNN | 10 shot | 60.80±5.22 | 44.51±6.86 |
| | full shot | 80.90±3.64 | 64.07±6.56 |
| AdaCare | 10 shot | 64.39±4.76 | 42.47±6.43 |
| | full shot | 82.26±3.80 | 68.82±6.76 |
| AICare | 10 shot | 66.32±4.55 | 43.76±7.02 |
| | full shot | 80.20±4.23 | 65.50±6.72 |
| ConCare | 10 shot | 70.30±4.62 | 48.68±7.41 |
| | full shot | 79.17±4.42 | 64.27±6.97 |
| GRASP | 10 shot | 53.76±5.54 | 36.35±6.24 |
| | full shot | 77.76±4.17 | 62.42±7.08 |
| **Base LLM** | | | |
| OpenBioLLM-8B | base prompt | 54.95±4.45 | 28.49±5.68 |
| | optimized prompt | 50.21±4.97 | 24.23±4.02 |
| Qwen2.5-7B | base prompt | 61.63±4.43 | 28.96±4.35 |
| | optimized prompt | 55.86±3.98 | 25.32±4.02 |
| Gemma-3-4B | base prompt | 59.85±4.12 | 28.94±4.64 |
| | optimized prompt | 60.02±4.23 | 29.05±5.37 |
| DeepSeek-V3 | base prompt | 66.70±4.76 | 34.02±5.89 |
| | optimized prompt | 62.68±4.49 | 30.91±5.30 |
| GPT-4o | base prompt | 59.16±5.45 | 28.70±5.08 |
| | optimized prompt | 62.72±4.87 | 34.43±5.73 |
| **Reasoning LLM** | | | |
| HuatuoGPT-o1-7B | base prompt | 56.39±4.88 | 29.35±5.27 |
| | optimized prompt | 50.54±4.88 | 24.30±4.22 |
| DeepSeek-R1-7B | base prompt | 52.62±3.76 | 23.55±4.01 |
| | optimized prompt | 53.19±4.13 | 24.53±3.66 |
| DeepSeek-R1 | base prompt | 65.31±4.98 | 38.68±6.88 |
| | optimized prompt | 73.92±3.78 | 43.59±6.42 |
| o3-mini-high | base prompt | 62.89±4.72 | 33.69±5.82 |
| | optimized prompt | 63.30±4.85 | 36.13±6.18 |
Key Observations (Structured EHR):
Conventional models remain strongest when training data is ample: specialized deep models (e.g., AdaCare, ConCare) lead in the full-shot setting, and CatBoost reaches near-ceiling TJH outcome scores with as few as 10 training examples.
Leading LLMs (e.g., GPT-4o, DeepSeek-V3/R1) demonstrate strong zero-shot capabilities, often surpassing conventional models trained on only 10 examples.
The performance gap between prompted LLMs and trained conventional models on structured data is narrowing, though fully trained specialized models still hold the lead.