HLIP Ablation

2025-11-17 · Chenhui Zhao · MLiNS @ University of Michigan

In HLIP, we present a language–image pre-training framework designed for uncurated 3D medical data that incorporates a hierarchical attention mechanism. HLIP achieves state-of-the-art results on both curated and uncurated 3D medical datasets spanning brain MRI, head CT, and chest CT. We attribute these gains to effective modeling, careful implementation, and scalability. In this blog, building on HLIP’s conclusions and implementation, we push scalability one step further for uncurated 3D medical data. To this end, we conduct five ablation studies on changes that do not appear to improve performance on their own, yet are crucial for scalability and for advancing vision–language modeling, including visual instruction tuning. This yields a new HLIP model trained on the combined BrainMRI220K and HeadCT240K datasets. We further introduce a simple yet effective adjustment to the language supervision, resulting in an updated HLIP model.

The code (GitHub repo) and model weights (Hugging Face) presented in this blog have been released.

Experimental setup

While HLIP uses the external Pub-Brain-5 dataset to ablate different model designs, this dataset contains only five classes (normal, stroke, glioma, meningioma, metastasis), which is not sufficiently comprehensive to assess model capacity. The same limitation applies to the external RSNA dataset. In the following experiments, we instead evaluate on our prospective dataset, which contains 23K studies covering 74 diagnoses for brain MRI and approximately 15K studies covering 83 diagnoses for head CT. Moreover, the linear-probe protocol can introduce additional bias during evaluation. Therefore, we instead use a zero-shot evaluation protocol, averaging over multiple prompts for stability (similar to the implementation in open-clip). Although the evaluation set is not publicly available, we hope that the conclusions drawn from these comprehensive evaluations can facilitate future work for the community.
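As a reference for this protocol, here is a minimal sketch of zero-shot classification with prompt ensembling in the open_clip style; the model/tokenizer interfaces, class names, and prompt templates below are placeholders, not the exact prompts we use.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def build_class_embeddings(model, tokenizer, class_prompts, device="cuda"):
    """Average the text embeddings of several prompts per class (prompt ensembling)."""
    class_embs = []
    for prompts in class_prompts:                               # list of strings for one class
        tokens = tokenizer(prompts).to(device)
        emb = F.normalize(model.encode_text(tokens), dim=-1)    # (num_prompts, dim)
        emb = F.normalize(emb.mean(dim=0), dim=-1)
        class_embs.append(emb)
    return torch.stack(class_embs)                              # (num_classes, dim)


@torch.no_grad()
def zero_shot_scores(model, images, class_embs):
    """Cosine similarity between each study embedding and each class embedding."""
    img_embs = F.normalize(model.encode_image(images), dim=-1)  # (batch, dim)
    return img_embs @ class_embs.T                              # (batch, num_classes)


# Hypothetical example: two diagnoses, two prompt templates each.
class_prompts = [
    ["mri brain: no acute intracranial abnormality.", "impression: unremarkable brain mri."],
    ["mri brain: acute ischemic stroke.", "impression: acute infarct."],
]
```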

Figures: reimplementation on HeadCT240K; reimplementation on BrainMRI220K.

We first reimplement the HLIP model separately on the HeadCT240K and BrainMRI220K datasets.

Pooling strategy, patch size, sequence position embedding

Figures: cls token → dino.txt (solid); patch size [8, 16, 16] → [6, 16, 16] (solid); w/ sequence position emb → w/o (solid).

All three experiments are conducted on the BrainMRI220K dataset.
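For clarity, a minimal sketch of how we read the first two changes: the dino.txt-style pooling shown here (concatenating the cls token with mean-pooled patch tokens before projection) and the Conv3d patch embedding are our assumptions for illustration, not a verbatim copy of the released code.

```python
import torch
import torch.nn as nn


class DinoTxtStylePool(nn.Module):
    """Hypothetical head: concatenate the cls token with mean-pooled patch tokens,
    then project to the joint embedding space (our reading of dino.txt-style pooling)."""

    def __init__(self, width: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * width, embed_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, 1 + num_patches, width), cls token first
        cls_tok = tokens[:, 0]
        patch_mean = tokens[:, 1:].mean(dim=1)
        return self.proj(torch.cat([cls_tok, patch_mean], dim=-1))


# Non-overlapping 3D patch embedding with the smaller depth-wise patch,
# i.e. (depth, height, width) kernel and stride of (6, 16, 16).
patch_embed = nn.Conv3d(in_channels=1, out_channels=768,
                        kernel_size=(6, 16, 16), stride=(6, 16, 16))
```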

Patch dropout, number of scans per study

Figures: patch dropout 0.25 → 0.50 (solid); patch dropout 0.50 → 0.75 (solid); 10 scans → 8 scans (solid).

All three experiments are conducted on the BrainMRI220K dataset.
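As a reference for the patch-dropout ratios above, here is a minimal sketch of patch dropout in the FLIP/open_clip style (randomly keeping a subset of patch tokens during pre-training); the tensor layout is an assumption.

```python
import torch


def patch_dropout(tokens: torch.Tensor, drop_ratio: float) -> torch.Tensor:
    """Randomly drop a fraction of patch tokens at training time; the cls token is kept.
    Assumed layout: tokens is (batch, 1 + num_patches, width) with the cls token first."""
    if drop_ratio <= 0.0:
        return tokens
    cls_tok, patches = tokens[:, :1], tokens[:, 1:]
    batch, num_patches, width = patches.shape
    num_keep = max(1, int(num_patches * (1.0 - drop_ratio)))
    # Sample an independent random subset of patches for each study in the batch.
    keep_idx = torch.rand(batch, num_patches, device=tokens.device).argsort(dim=1)[:, :num_keep]
    kept = patches.gather(dim=1, index=keep_idx.unsqueeze(-1).expand(-1, -1, width))
    return torch.cat([cls_tok, kept], dim=1)
```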

Pushing scalability one step further

Figures: ct&mri (green) vs ct only; ct&mri (green) vs mri only.

Keeping all five subtle but meaningful changes, we train HLIP on the combined BrainMRI220K and HeadCT240K datasets. With a batch size of 768, achieved with 2 gradient-accumulation steps, training takes approximately two days on eight L40 GPUs. Combining the two datasets yields a significant advantage for head CT.
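A minimal sketch of the gradient-accumulation loop implied by these numbers; the per-GPU batch size of 48 (8 GPUs × 48 × 2 accumulation steps = 768) and the open_clip-style model/loss interfaces are assumptions.

```python
# Assumed setup: 8 GPUs x per-GPU batch of 48 x 2 accumulation steps = 768.
ACCUM_STEPS = 2


def train_one_epoch(model, loader, optimizer, clip_loss):
    model.train()
    optimizer.zero_grad()
    for step, (images, texts) in enumerate(loader):
        image_emb, text_emb, logit_scale = model(images, texts)   # open_clip-style forward
        loss = clip_loss(image_emb, text_emb, logit_scale) / ACCUM_STEPS
        loss.backward()
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```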

Sentence dropout

Figures: sentence dropout on head CT (solid); sentence dropout on brain MRI (solid).

Image captions used in the original CLIP are very short, whereas radiology reports are much longer, even when using an LLM-generated summary or the impression section. Motivated by this gap, we randomly select a single sentence from the report at each training step during language–image pre-training. We find that this simple adjustment yields a significant improvement. We hypothesize that the improvement arises from the limited representational capacity of the language model and from the distribution shift between training (long text) and zero-shot evaluation (short prompt).
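A minimal sketch of this sentence dropout; the regex-based sentence splitter is a deliberately naive stand-in for whatever report parsing the released code uses.

```python
import random
import re


def sentence_dropout(report: str) -> str:
    """Keep one randomly chosen sentence of the report at each training step."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]
    return random.choice(sentences) if sentences else report


# Example
report = ("No acute intracranial hemorrhage. "
          "Chronic small vessel ischemic changes. "
          "No mass effect or midline shift.")
print(sentence_dropout(report))  # e.g. "Chronic small vessel ischemic changes."
```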

Unmasked fine-tuning

Figures: unmasked fine-tuning on head CT (deep green); unmasked fine-tuning on brain MRI (deep green).

We further perform unmasked fine-tuning, maintaining the same batch size of 768 by increasing the gradient-accumulation steps to 6. Unmasked fine-tuning further improves performance. This yields our updated HLIP model.
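Illustratively, the fine-tuning setup can be summarized as below; only the effective batch size of 768 and the 6 accumulation steps come from the text, the per-GPU batch of 16 is the implied arithmetic, and the field names are hypothetical.

```python
# Hypothetical config dict; numbers other than 768 and 6 are implied/assumed.
unmasked_finetune_cfg = dict(
    patch_drop_ratio=0.0,   # "unmasked": patch dropout disabled
    num_gpus=8,
    per_gpu_batch=16,       # 8 GPUs x 16 x 6 accumulation steps = 768
    accum_steps=6,
    effective_batch=768,
)
```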

External evaluation

Pub-Brain-5 (Anomaly Detection)

| Model | Stroke | Glioma | Meningioma | Metastasis | Mean |
| --- | --- | --- | --- | --- | --- |
| HLIP | 91.5 | 89.2 | 79.2 | 78.1 | 84.5 |
| HLIP-2025-10-08 | 94.8 | 94.8 | 86.0 | 86.2 | 90.5 |

RSNA (Full Set)

| Model | Intraparenchymal | Intraventricular | Subarachnoid | Subdural | Any | Mean |
| --- | --- | --- | --- | --- | --- | --- |
| HLIP | 88.2 | 91.4 | 84.1 | 83.4 | 81.5 | 85.7 |
| HLIP-2025-10-08 | 93.5 | 96.4 | 90.2 | 89.1 | 90.8 | 92.0 |

We evaluate this new model on Pub-Brain-5’s anomaly detection task and on the full RSNA dataset, demonstrating superior performance compared with the HLIP model in the original paper. Note that these experiments are conducted under the zero-shot setting.

Supervision with LLM-summarized reports

Figures: head CT, gpt3.5turbo vs gpt4omini (solid); head CT, gpt3.5turbo vs gpt4.1mini (solid); brain MRI, gpt3.5turbo vs gpt4omini (solid); brain MRI, gpt3.5turbo vs gpt4.1mini (solid).

Here, we report a phenomenon observed when supervising with LLM-summarized reports. Although GPT-4o mini and GPT-4.1 mini are more advanced models than GPT-3.5 Turbo, we find that supervising on reports summarized by these two models can lead to a significant decrease in zero-shot performance.

Figures: head CT, gpt4omini w/ sentence dropout (solid); brain MRI, gpt4omini w/ sentence dropout (solid).

We find that with sentence dropout, this issue can be largely alleviated.

Unsuccessful attempts

Figures: patch embedding initialization average → central (solid); patch size [8, 16, 16] → [8, 14, 14] (solid); rotary position embedding on head CT (solid); rotary position embedding on brain MRI (solid).

At the end of this blog, we list four designs that we find do not provide benefits in our setting.