ClinicRealm

Benchmark Results: Unstructured Clinical Notes

This section compares BERT-based and GPT-based models on the MIMIC mortality and readmission prediction tasks using unstructured clinical notes.

Outcome Prediction Performance (MIMIC Clinical Notes)

Table 2a: *Performance comparison of BERT-based and GPT-based models on the MIMIC mortality prediction task using unstructured clinical notes.* **Bold** denotes the best performance. Results are reported as mean±std over 100 bootstrap resamples of the test set. All metrics are multiplied by 100 for readability.

| Method | Model | Setting | Outcome AUROC | Outcome AUPRC |
|---|---|---|---|---|
| BERT-based LM | BERT | freeze | 69.96±6.78 | 24.99±7.33 |
| | BERT | finetune | 81.04±5.14 | 40.90±9.21 |
| | Clinical-Longformer | freeze | 75.68±6.15 | 36.81±10.80 |
| | Clinical-Longformer | finetune | 88.06±4.45 | 62.92±9.40 |
| | BioBERT | freeze | 75.30±4.95 | 29.49±8.17 |
| | BioBERT | finetune | 75.85±6.63 | 36.93±10.89 |
| | GatorTron | freeze | 89.47±4.45 | 54.53±9.90 |
| | GatorTron | finetune | 91.47±3.44 | 71.43±8.79 |
| | ClinicalBERT | freeze | 73.55±5.69 | 26.98±8.32 |
| | ClinicalBERT | finetune | 84.95±4.72 | 53.01±10.67 |
| GPT-based Base LLM | GPT-2-117M | freeze | 65.60±6.36 | 17.08±5.85 |
| | GPT-2-117M | finetune | 82.30±5.50 | 46.68±10.38 |
| | BioGPT-347M | freeze | 77.41±3.78 | 24.94±7.17 |
| | BioGPT-347M | finetune | 88.39±4.51 | 62.62±10.42 |
| | Meditron-7B | freeze | 71.98±6.15 | 26.77±8.52 |
| | Meditron-7B | finetune | 79.05±5.05 | 39.10±9.84 |
| | BioMistral-7B | freeze | 70.78±7.30 | 33.64±9.81 |
| | BioMistral-7B | finetune | 74.42±5.88 | 40.31±9.32 |
| | OpenBioLLM-8B | freeze | 61.24±6.51 | 12.85±3.55 |
| | OpenBioLLM-8B | finetune | 80.85±5.71 | 55.56±11.92 |
| | OpenBioLLM-8B | prompt | 65.70±5.43 | 13.81±3.35 |
| | Qwen2.5-7B | freeze | 64.52±6.46 | 14.31±4.61 |
| | Qwen2.5-7B | finetune | 89.04±3.93 | 60.89±10.18 |
| | Qwen2.5-7B | prompt | 88.39±4.11 | 44.33±8.33 |
| | Gemma-3-4B | freeze | 66.24±6.26 | 28.78±9.54 |
| | Gemma-3-4B | prompt | 96.57±1.32 | 68.97±9.29 |
| | DeepSeek-V3 | prompt | 97.13±1.07 | 65.16±10.39 |
| | GPT-4o | prompt | 93.99±2.00 | 58.04±10.01 |
| Reasoning LLM | HuatuoGPT-o1-7B | freeze | 60.69±5.52 | 11.67±2.88 |
| | HuatuoGPT-o1-7B | finetune | 88.16±3.95 | 62.39±9.52 |
| | HuatuoGPT-o1-7B | prompt | 83.84±3.55 | 29.72±6.79 |
| | DeepSeek-R1-7B | freeze | 56.76±6.73 | 16.72±7.24 |
| | DeepSeek-R1-7B | finetune | 87.32±5.98 | 55.75±12.11 |
| | DeepSeek-R1-7B | prompt | 83.17±4.63 | 36.25±8.75 |
| | DeepSeek-R1 | prompt | **97.64±1.08** | 69.61±10.13 |
| | o3-mini-high | prompt | 97.45±1.01 | **71.72±10.31** |
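
The mean±std figures in the caption come from bootstrapping the test set 100 times. A minimal sketch of how such numbers could be computed is given below. It is not the authors' evaluation code: the function names (`auroc`, `auprc`, `bootstrap_metric`) are hypothetical, and several details are assumptions, namely resampling with replacement at the full test-set size, redrawing any resample that contains only one class, using average precision as the AUPRC estimate, and reporting the sample standard deviation.

```python
import random
import statistics


def auroc(labels, scores):
    """AUROC as the chance a positive scores above a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auprc(labels, scores):
    """AUPRC estimated as average precision (assumption: ties are not grouped)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this positive
    return total / hits


def bootstrap_metric(labels, scores, metric, n_rounds=100, seed=0):
    """Return (mean, std) of `metric` x 100 over `n_rounds` bootstrap resamples."""
    rng = random.Random(seed)
    n = len(labels)
    values = []
    while len(values) < n_rounds:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # assumption: redraw when a resample lacks one class
        ss = [scores[i] for i in idx]
        values.append(100 * metric(ys, ss))
    return statistics.mean(values), statistics.stdev(values)


# Toy data for illustration only; these are not values from the benchmark.
toy_labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
toy_scores = [0.1, 0.3, 0.8, 0.2, 0.6, 0.4, 0.05, 0.35, 0.5, 0.15]
for name, fn in (("AUROC", auroc), ("AUPRC", auprc)):
    mean, std = bootstrap_metric(toy_labels, toy_scores, fn)
    print(f"{name}: {mean:.2f}±{std:.2f}")
```

The pairwise AUROC here is quadratic in the number of samples, which is fine for a sketch; a library routine such as scikit-learn's `roc_auc_score` and `average_precision_score` would be the more usual choice on a real test set.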

Readmission Prediction Performance (MIMIC Clinical Notes)

Table 2b: *Performance comparison of BERT-based and GPT-based models on the MIMIC readmission prediction task using unstructured clinical notes.* **Bold** denotes the best performance. Results are reported as mean±std over 100 bootstrap resamples of the test set. All metrics are multiplied by 100 for readability.

| Method | Model | Setting | Readmission AUROC | Readmission AUPRC |
|---|---|---|---|---|
| BERT-based LM | BERT | freeze | 62.93±4.15 | 38.77±6.13 |
| | BERT | finetune | 67.99±4.61 | 47.20±7.06 |
| | Clinical-Longformer | freeze | 65.04±4.98 | 48.70±6.04 |
| | Clinical-Longformer | finetune | 75.56±4.42 | 59.19±6.65 |
| | BioBERT | freeze | 62.86±4.43 | 37.10±5.40 |
| | BioBERT | finetune | 75.85±6.63 | 36.93±10.89 |
| | GatorTron | freeze | 70.75±4.24 | 47.74±7.09 |
| | GatorTron | finetune | 75.96±4.34 | 59.01±6.85 |
| | ClinicalBERT | freeze | 64.94±4.40 | 40.90±6.10 |
| | ClinicalBERT | finetune | 71.93±4.35 | 53.82±6.85 |
| GPT-based Base LLM | GPT-2-117M | freeze | 60.43±5.00 | 37.05±6.14 |
| | GPT-2-117M | finetune | 71.81±4.58 | 50.14±6.67 |
| | BioGPT-347M | freeze | 66.54±4.04 | 37.62±5.37 |
| | BioGPT-347M | finetune | 71.66±4.10 | 52.23±6.63 |
| | Meditron-7B | freeze | 59.70±4.32 | 34.97±5.50 |
| | Meditron-7B | finetune | 64.04±4.40 | 40.90±6.21 |
| | BioMistral-7B | freeze | 67.41±4.09 | 42.95±5.55 |
| | BioMistral-7B | finetune | 75.66±4.37 | 61.60±6.28 |
| | OpenBioLLM-8B | freeze | 57.09±3.50 | 32.06±5.08 |
| | OpenBioLLM-8B | finetune | 77.22±3.73 | 63.91±5.82 |
| | OpenBioLLM-8B | prompt | 58.47±4.21 | 32.99±4.66 |
| | Qwen2.5-7B | freeze | 61.48±4.46 | 40.03±5.98 |
| | Qwen2.5-7B | finetune | 75.80±4.00 | 58.42±6.72 |
| | Qwen2.5-7B | prompt | 77.32±4.75 | 66.64±6.05 |
| | Gemma-3-4B | freeze | 63.83±4.27 | 35.15±4.70 |
| | Gemma-3-4B | prompt | 85.21±3.92 | 71.51±5.77 |
| | DeepSeek-V3 | prompt | 87.33±3.44 | **78.53±5.06** |
| | GPT-4o | prompt | 75.89±5.51 | 66.38±6.65 |
| Reasoning LLM | HuatuoGPT-o1-7B | freeze | 59.14±4.26 | 37.67±6.05 |
| | HuatuoGPT-o1-7B | finetune | 77.82±3.84 | 62.21±6.25 |
| | HuatuoGPT-o1-7B | prompt | 67.33±5.35 | 49.50±7.09 |
| | DeepSeek-R1-7B | freeze | 53.21±3.81 | 33.12±5.35 |
| | DeepSeek-R1-7B | finetune | 74.54±3.94 | 51.13±6.80 |
| | DeepSeek-R1-7B | prompt | 61.69±4.65 | 34.50±5.75 |
| | DeepSeek-R1 | prompt | 58.18±6.23 | 56.25±6.53 |
| | o3-mini-high | prompt | **87.59±3.34** | 75.48±5.74 |

Key Observations (Unstructured Clinical Notes):

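State-of-the-art LLMs in zero-shot (prompt) settings, e.g., DeepSeek-V3, DeepSeek-R1, and o3-mini-high, considerably outperform fine-tuned BERT-based models on outcome prediction, most clearly in AUROC (above 97 versus 91.47±3.44 for fine-tuned GatorTron, the strongest BERT-based model). On readmission prediction the same holds for DeepSeek-V3 and o3-mini-high (AUROC above 87 versus 75.96±4.34), but not for DeepSeek-R1 (58.18±6.23).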
GPT-based models, especially with prompting, show strong performance: for outcome prediction, DeepSeek-R1 achieves the highest AUROC (97.64±1.08) and o3-mini-high the highest AUPRC (71.72±10.31).
DeepSeek-V3 and o3-mini-high also lead on readmission prediction: o3-mini-high achieves the highest AUROC (87.59±3.34) and DeepSeek-V3 the highest AUPRC (78.53±5.06).