HLIP Ablation

2025-11-17 (updated 2026-02-22) · Chenhui Zhao · MLiNS @ University of Michigan

In HLIP, we present a language–image pre-training framework designed for uncurated 3D medical data, built around a hierarchical attention mechanism. HLIP achieves state-of-the-art results on both curated and uncurated 3D medical datasets spanning brain MRI, head CT, and chest CT. We attribute these gains to effective modeling, careful implementation, and scalability. In this blog, building on HLIP’s conclusions and implementation, we push scalability one step further for uncurated 3D medical data. To this end, we conduct five ablation studies on changes that do not improve performance by themselves yet are crucial for scalability and for advancing vision–language modeling, including visual instruction tuning. This yields new HLIP models trained on the combined BrainMRI220K and HeadCT240K datasets. We further introduce a simple yet effective adjustment to the language supervision, resulting in updated HLIP models.

The code (GitHub repo) and model weights (Hugging Face) presented in this blog have been published.

Experiments

While HLIP uses the external Pub-Brain-5 dataset to ablate different model designs, this dataset contains only five classes (normal, stroke, glioma, meningioma, metastasis), which is not comprehensive enough to assess model capacity. The same limitation applies to the external RSNA dataset. In the following experiments, we instead evaluate on our prospective dataset, which contains 23K studies covering 74 diagnoses for brain MRI and approximately 15K studies covering 83 diagnoses for head CT. Moreover, the linear-probe protocol can introduce additional bias during evaluation. We therefore use a zero-shot evaluation protocol, averaging over multiple prompts for stability (similar to the implementation in open_clip). Given the scale of this evaluation, even a 0.5 or 1.0 AUC gain is significant. Although the evaluation set is not publicly available, we hope the conclusions drawn from these comprehensive evaluations can facilitate future work for the community.
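A minimal sketch of this prompt-ensembled zero-shot protocol, in the spirit of open_clip's zero-shot classifier. The templates and function names here are illustrative assumptions, not HLIP's actual prompts:

```python
import torch
import torch.nn.functional as F

# Hypothetical prompt templates; the prompts actually used by HLIP may differ.
TEMPLATES = [
    "an mri of a brain with {}.",
    "brain mri showing {}.",
    "this study is consistent with {}.",
]

def build_zeroshot_classifier(encode_text, class_names):
    """Average text embeddings over several prompts per class,
    then re-normalize, as in the open_clip zero-shot recipe."""
    weights = []
    for name in class_names:
        prompts = [t.format(name) for t in TEMPLATES]
        emb = F.normalize(encode_text(prompts), dim=-1)      # (n_prompts, dim)
        weights.append(F.normalize(emb.mean(dim=0), dim=-1))  # ensemble per class
    return torch.stack(weights, dim=1)                        # (dim, n_classes)

def zeroshot_logits(image_features, classifier, logit_scale=100.0):
    """Cosine similarity between image embeddings and class prototypes."""
    return logit_scale * F.normalize(image_features, dim=-1) @ classifier
```

Averaging in embedding space before re-normalizing is what makes the ensemble cheap: the classifier is built once and reused for every image.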

[Figure: reimplementation on HeadCT240K]
[Figure: reimplementation on BrainMRI220K]

We reimplement HLIP on the HeadCT240K and BrainMRI220K datasets, achieving 75.9 AUC on head CT and 81.1 AUC on brain MRI.

Pooling strategy, Patch size, Sequence position embedding

[Figure: cls token → dino.txt (solid)]
[Figure: patch size [8, 16, 16] → [6, 16, 16] (solid)]
[Figure: w/ sequence position emb → w/o (solid)]

All three experiments are conducted on the BrainMRI220K dataset.

Patch dropout, Number of scans per study

[Figure: patch dropout 0.25 → 0.50 (solid)]
[Figure: patch dropout 0.50 → 0.75 (solid)]
[Figure: 10 scans → 8 scans (solid)]

All three experiments are conducted on the BrainMRI220K dataset.
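Patch dropout here refers to randomly discarding a fraction of visual tokens at each training step, in the spirit of FLIP-style masking. A minimal sketch under that assumption (HLIP's implementation may, for example, handle the CLS token separately):

```python
import torch

def patch_dropout(tokens: torch.Tensor, dropout: float) -> torch.Tensor:
    """Randomly keep a subset of patch tokens per sample.

    tokens: (B, N, D) -> (B, round(N * (1 - dropout)), D)
    A dropout of 0.50 discards half the tokens, roughly halving
    attention cost during pre-training.
    """
    B, N, D = tokens.shape
    n_keep = max(1, int(round(N * (1.0 - dropout))))
    # Random permutation per sample; keep the first n_keep indices.
    idx = torch.argsort(torch.rand(B, N, device=tokens.device), dim=1)[:, :n_keep]
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(B, n_keep, D))
```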

Pushing scalability one step further

[Figure: ct&mri (green) vs ct only (yellow)]
[Figure: ct&mri (green) vs mri only (blue)]

Keeping all five subtle but meaningful changes, we train HLIP on the combined BrainMRI220K and HeadCT240K datasets. Using a batch size of 768, achieved with 2 gradient-accumulation steps, training takes approximately two days on eight L40 GPUs. With these changes, HLIP achieves 79.2 AUC on head CT and 80.6 AUC on brain MRI.
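With 2 accumulation steps, the effective batch of 768 corresponds to micro-batches of 384. A minimal sketch of the accumulation loop (illustrative names and a toy regression loss, not HLIP's training code):

```python
import torch

def train_with_accumulation(model, optimizer, batches, accum_steps=2):
    """One optimizer step per `accum_steps` micro-batches, so the
    effective batch size is micro_batch * accum_steps (768 = 384 * 2)."""
    optimizer.zero_grad()
    for i, (x, y) in enumerate(batches):
        # Scale the loss so accumulated gradients match a single large batch.
        loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps
        loss.backward()  # gradients accumulate across micro-batches
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```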

Sentence dropout

[Figure (CT): full report vs sentence dropout (solid)]
[Figure (MRI): full report vs sentence dropout (solid)]

Image captions used in the original CLIP are short, often fewer than 60 words, whereas radiology reports are substantially longer, even when using an LLM-generated summary or the impression section. Motivated by this mismatch, we randomly select a single sentence from each report at every training step during language–image pre-training. This simple change yields a significant improvement: HLIP achieves 88.9 AUC on both head CT and brain MRI. We hypothesize that the gain stems from the limited representational capacity of the language model and the distribution shift between training (long text) and zero-shot evaluation (short prompts).
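A minimal sketch of sentence dropout as described. The period-based sentence splitter is a naive assumption; HLIP's implementation may segment reports differently:

```python
import random
import re

def sentence_dropout(report: str, rng: random.Random = random) -> str:
    """Replace the full report with one randomly chosen sentence.

    Uses a naive regex split on sentence-final punctuation; empty
    reports are returned unchanged.
    """
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", report.strip())
                 if s.strip()]
    if not sentences:
        return report
    return rng.choice(sentences)
```

At evaluation time the text encoder then sees short, prompt-like inputs, which is exactly the distribution the zero-shot protocol feeds it.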

Dual contrastive loss

[Figure (CT): sentence dropout vs dual contrastive loss (orange)]
[Figure (MRI): sentence dropout vs dual contrastive loss (orange)]

One limitation of sentence dropout is that it can destabilize training. In practice, ensuring stable optimization requires a smaller learning rate (reduced from 6e-4 to 4e-4). This instability increases the risk when scaling to a ViT-Large model, which is typically more difficult to train than a ViT-Base model. To improve training stability and avoid unnecessary hyperparameter tuning for ViT-Large, we introduce a dual contrastive loss, following the strategy proposed in TIPS. Specifically, we add two CLS tokens to the ViT architecture and, at each step, contrast one with a randomly selected sentence and the other with the full report. We observe faster convergence for the sentence contrastive loss. Intuitively, the model learns global features from sentence supervision and dense features from full-report supervision.
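A sketch of the dual contrastive objective under these assumptions; the equal weighting of the two terms and the fixed logit scale are illustrative choices, not necessarily HLIP's:

```python
import torch
import torch.nn.functional as F

def clip_loss(img, txt, logit_scale=100.0):
    """Standard symmetric InfoNCE over a batch of paired embeddings."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = logit_scale * img @ txt.t()
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

def dual_contrastive_loss(cls_sent, cls_full, sent_emb, full_emb):
    """Two CLS tokens from the ViT: one contrasted with a randomly
    selected sentence, the other with the full report (equal weights
    assumed here)."""
    return clip_loss(cls_sent, sent_emb) + clip_loss(cls_full, full_emb)
```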

Models

Building on the incremental designs introduced so far, we train four HLIP variants:

1. ViT-Base with scan attention (blocks 0, 1, 3, 4, 6, 7, 9, 10) and study attention (blocks 2, 5, 8, 11);
2. ViT-Base with slice attention (blocks 0, 3, 6, 9), scan attention (blocks 1, 4, 7, 10), and study attention (blocks 2, 5, 8, 11);
3. ViT-Large with scan attention (blocks 0–4, 6–10, 12–16, 18–22) and study attention (blocks 5, 11, 17, 23);
4. ViT-Large with slice attention (blocks 0–3, 6–9, 12–15, 18–21), scan attention (blocks 4, 10, 16, 22), and study attention (blocks 5, 11, 17, 23).
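These schedules can be written down as explicit configs and sanity-checked so that every transformer block receives exactly one attention level. The variant names below are hypothetical; the block indices are taken verbatim from the text:

```python
# (depth, {attention level: block indices}) per variant.
VARIANTS = {
    "vit_base_scan_study": (12, {
        "scan":  [0, 1, 3, 4, 6, 7, 9, 10],
        "study": [2, 5, 8, 11],
    }),
    "vit_base_slice_scan_study": (12, {
        "slice": [0, 3, 6, 9],
        "scan":  [1, 4, 7, 10],
        "study": [2, 5, 8, 11],
    }),
    "vit_large_scan_study": (24, {
        "scan":  [0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 12, 13, 14,
                  15, 16, 18, 19, 20, 21, 22],
        "study": [5, 11, 17, 23],
    }),
    "vit_large_slice_scan_study": (24, {
        "slice": [0, 1, 2, 3, 6, 7, 8, 9, 12, 13, 14, 15, 18, 19, 20, 21],
        "scan":  [4, 10, 16, 22],
        "study": [5, 11, 17, 23],
    }),
}

def check_schedule(depth, schedule):
    """Verify the attention levels partition the blocks: each block
    index 0..depth-1 appears exactly once across all levels."""
    covered = sorted(i for idxs in schedule.values() for i in idxs)
    assert covered == list(range(depth)), f"blocks not partitioned: {covered}"

for name, (depth, schedule) in VARIANTS.items():
    check_schedule(depth, schedule)
```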

All models are trained for 20 epochs on the combined BrainMRI220K and HeadCT240K dataset with an initial learning rate of 5e-4 and a batch size of 768, followed by an additional 5 epochs of unmasked fine-tuning.

Columns: Prospective (CT, MRI) · Pub-Brain-5 anomaly detection (STR, GLI, MEN, MET, Mean) · RSNA full set (IPH, IVH, SAH, SDH, Any, Mean). All values are AUC.

| Model | CT | MRI | STR | GLI | MEN | MET | Mean | IPH | IVH | SAH | SDH | Any | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| in the paper | 75.9 | 81.1 | 91.5 | 89.2 | 79.2 | 78.1 | 84.5 | 88.2 | 91.4 | 84.1 | 83.4 | 81.5 | 85.7 |
| 2025-10-08 | 89.1 | 89.1 | 94.8 | 94.8 | 86.0 | 86.2 | 90.5 | 93.5 | 96.4 | 90.2 | 89.1 | 90.8 | 92.0 |
| ViT-Base (scan + study) | 89.2 | 88.9 | 93.3 | 98.5 | 87.6 | 87.9 | 91.8 | 93.4 | 96.7 | 90.1 | 89.2 | 93.4 | 92.7 |
| ViT-Base (slice + scan + study) | 89.0 | 89.1 | 93.2 | 99.2 | 84.9 | 89.5 | 91.6 | 93.8 | 96.9 | 90.8 | 89.8 | 94.2 | 93.1 |
| ViT-Large (scan + study) | 89.0 | 89.7 | 92.5 | 99.6 | 83.7 | 84.8 | 90.2 | 94.1 | 96.8 | 91.1 | 87.9 | 94.6 | 92.9 |
| ViT-Large (slice + scan + study) | 89.6 | 89.6 | 94.9 | 99.1 | 86.5 | 85.7 | 91.5 | 94.2 | 96.9 | 91.2 | 89.0 | 94.1 | 93.1 |

“in the paper” denotes HLIP models described in the original paper, while “2025-10-08” denotes an HLIP model aligned with the first version of this blog, trained using only sentence dropout. We evaluate all models on Pub-Brain-5’s anomaly detection task and on the full RSNA dataset, demonstrating superior performance compared with the HLIP model in the original paper. Note that these experiments are conducted under the zero-shot setting.

Findings

At the end of this blog post, we share several interesting findings and unsuccessful attempts from our experiments. We hope these observations provide new insights for researchers interested in this line of research.

Supervised by different LLM-summarized reports

[Figure (CT): gpt-3.5-turbo vs gpt-4o-mini (solid)]
[Figure (CT): gpt-3.5-turbo vs gpt-4.1-mini (solid)]
[Figure (MRI): gpt-3.5-turbo vs gpt-4o-mini (solid)]
[Figure (MRI): gpt-3.5-turbo vs gpt-4.1-mini (solid)]

Although GPT-4o mini and GPT-4.1 mini are more advanced models than GPT-3.5 Turbo, we find that supervising with reports summarized by these two models can significantly decrease zero-shot performance.

[Figure (CT): gpt-3.5-turbo w/ sentence dropout vs gpt-4o-mini w/ sentence dropout (solid)]
[Figure (MRI): gpt-3.5-turbo w/ sentence dropout vs gpt-4o-mini w/ sentence dropout (solid)]
[Figure (CT): gpt-3.5-turbo w/ dual contrastive loss vs gpt-4o-mini w/ dual contrastive loss (solid)]
[Figure (MRI): gpt-3.5-turbo w/ dual contrastive loss vs gpt-4o-mini w/ dual contrastive loss (solid)]

We find that either sentence dropout or dual contrastive loss can largely alleviate this issue.

Unsuccessful attempts

[Figure: patch embedding initialization average → central (solid)]
[Figure: patch size [8, 16, 16] → [8, 14, 14] (solid)]
[Figure (CT): rotary position embedding (solid)]
[Figure (MRI): rotary position embedding (solid)]

We evaluate four designs that we find provide no benefit in our current setting.

Citation

If this blog or the HLIP work is useful in your research, please consider citing:

@article{zhao2026towards,
  title={Towards Scalable Language-Image Pre-training for 3D Medical Imaging},
  author={Chenhui Zhao and Yiwei Lyu and Asadur Zaman Chowdury and Edward S Harake and Akhil Kondepudi and Akshay T Rao and Xinhai Hou and Honglak Lee and Todd C Hollon},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2026},
  url={https://openreview.net/forum?id=WxHf4EcBWA}
}
@misc{zhao2026hlipablationblog,
  author = {Chenhui Zhao},
  title = {HLIP Ablation},
  year = {2026},
  url = {https://zch0414.github.io/hlip-ablation/},
  note = {Accessed: 2026-02-23}
}