2025-11-17 (updated 2026-02-22) · Chenhui Zhao · MLiNS @ University of Michigan
In HLIP, we presented a language–image pre-training framework designed for uncurated 3D medical data, built around a hierarchical attention mechanism. HLIP achieves state-of-the-art results on both curated and uncurated 3D medical datasets spanning brain MRI, head CT, and chest CT. We attribute these gains to effective modeling, careful implementation, and scalability. In this blog, building on HLIP's conclusions and implementation, we push scalability one step further for uncurated 3D medical data. To this end, we conduct five ablation studies on changes that do not by themselves improve performance yet are crucial for scalability and for advancing vision–language modeling, including visual instruction tuning. This yields new HLIP models trained on the combined BrainMRI220K and HeadCT240K datasets. We further introduce a simple yet effective adjustment to the language supervision, resulting in updated HLIP models.
The code and models presented in this blog have been published.
While HLIP uses the external Pub-Brain-5 dataset to ablate different model designs, this dataset contains only five classes (normal, stroke, glioma, meningioma, metastasis), which is not sufficiently comprehensive to assess model capacity. The same limitation applies to the external RSNA dataset. In the following experiments, we instead evaluate on our prospective dataset, which contains 23K studies covering 74 diagnoses for brain MRI and approximately 15K studies covering 83 diagnoses for head CT. Moreover, the linear-probe protocol can introduce additional bias during evaluation. Therefore, we instead use a zero-shot evaluation protocol, averaging over multiple prompts for stability (similar to the implementation in open-clip). Given the scale of our evaluation, even a 0.5 or 1.0 AUC gain would be considered significant. Although the evaluation set is not publicly available, we hope that the conclusions drawn from these comprehensive evaluations can facilitate future work for the community.
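The prompt-averaged zero-shot protocol described above can be sketched as follows. This is a minimal sketch in the style of open_clip's prompt ensembling; the function and variable names are ours, not from the HLIP codebase:

```python
import torch
import torch.nn.functional as F

def zero_shot_scores(image_feats, prompt_feats_per_class):
    """Score images against classes by averaging multiple prompt embeddings
    per class (open_clip-style prompt ensembling), then taking cosine similarity."""
    class_embs = []
    for prompt_feats in prompt_feats_per_class:      # (n_prompts, d) per class
        emb = F.normalize(prompt_feats, dim=-1).mean(dim=0)  # average prompts
        emb = F.normalize(emb, dim=0)                # renormalize the mean
        class_embs.append(emb)
    classifier = torch.stack(class_embs, dim=1)      # (d, n_classes)
    image_feats = F.normalize(image_feats, dim=-1)
    return image_feats @ classifier                  # (n_images, n_classes)
```

Per-class AUC is then computed from these similarity scores against the diagnosis labels.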
We reimplement HLIP on the HeadCT240K and BrainMRI220K datasets, achieving 75.9 AUC on head CT and 81.1 AUC on brain MRI.
All three experiments are conducted on the BrainMRI220K dataset.
Keeping all five subtle but meaningful changes, we train HLIP on the combined BrainMRI220K and HeadCT240K datasets. Using an effective batch size of 768, achieved with 2 gradient-accumulation steps, the training process takes approximately two days on eight L40 GPUs. With these changes, HLIP achieves 79.2 AUC on head CT and 80.6 AUC on brain MRI.
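The gradient-accumulation setup can be sketched as below, assuming a per-step micro-batch of 384 accumulated over 2 steps to reach the effective batch of 768. The tiny linear model and random tensors here are placeholders, not HLIP components:

```python
import torch
import torch.nn.functional as F

# Effective batch = micro-batch × accum_steps (e.g. 768 = 384 × 2).
accum_steps = 2
model = torch.nn.Linear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=5e-4)

micro_batches = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(4)]
for step, (x, y) in enumerate(micro_batches):
    loss = F.mse_loss(model(x), y)
    (loss / accum_steps).backward()       # scale so summed grads equal the mean
    if (step + 1) % accum_steps == 0:     # update once per effective batch
        opt.step()
        opt.zero_grad()
```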
Image captions used in the original CLIP are short, often fewer than 60 words, whereas radiology reports are substantially longer, even when using an LLM-generated summary or the impression section. Motivated by this mismatch, we randomly select a single sentence from the report at each training step during language–image pre-training (sentence dropout). We find that this simple change yields a significant improvement: with it, HLIP achieves 88.9 AUC on both head CT and brain MRI. We hypothesize that the gain stems from the limited representational capacity of the language model and the distribution shift between training (long text) and zero-shot evaluation (short prompts).
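Sentence dropout reduces to a one-line sampling step in the data loader. A minimal sketch, with a naive regex-based sentence splitter standing in for whatever tokenization the real pipeline uses:

```python
import random
import re

def sample_sentence(report: str, rng: random.Random) -> str:
    """Sentence dropout: supervise each training step with one
    randomly chosen sentence from the full report."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip()
                 for s in re.split(r'(?<=[.!?])\s+', report.strip())
                 if s.strip()]
    return rng.choice(sentences)
```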
One limitation of sentence dropout is that it can destabilize training. In practice, to ensure stable optimization, we must use a smaller learning rate (reduced from 6e-4 to 4e-4). This instability increases the risk when scaling to ViT-Large, which is typically more difficult to train than ViT-Base. To improve training stability and avoid unnecessary hyperparameter tuning for ViT-Large, we introduce a dual contrastive loss, following the strategy proposed in TIPS. Specifically, we add two CLS tokens to the ViT and, at each step, contrast them with a randomly selected sentence and the full report, respectively. We observe faster convergence for the sentence contrastive loss. Intuitively, the model learns global features from sentence supervision and dense features from full-report supervision.
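The dual contrastive loss can be sketched as two standard CLIP losses, one per CLS token. This is a hedged sketch under our reading of the TIPS-style setup; the symmetric InfoNCE form and the fixed temperature are our choices, not confirmed HLIP hyperparameters:

```python
import torch
import torch.nn.functional as F

def clip_loss(img, txt, t=0.07):
    """Symmetric InfoNCE over an in-batch similarity matrix."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / t
    labels = torch.arange(img.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

def dual_contrastive_loss(cls_sent, cls_report, sent_emb, report_emb):
    # CLS token 1 <-> randomly selected sentence; CLS token 2 <-> full report.
    return clip_loss(cls_sent, sent_emb) + clip_loss(cls_report, report_emb)
```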
Building on the incremental designs introduced so far, we train four HLIP variants:

1. ViT-Base with scan attention (blocks 0, 1, 3, 4, 6, 7, 9, 10) and study attention (blocks 2, 5, 8, 11);
2. ViT-Base with slice attention (blocks 0, 3, 6, 9), scan attention (blocks 1, 4, 7, 10), and study attention (blocks 2, 5, 8, 11);
3. ViT-Large with scan attention (blocks 0–4, 6–10, 12–16, 18–22) and study attention (blocks 5, 11, 17, 23);
4. ViT-Large with slice attention (blocks 0–3, 6–9, 12–15, 18–21), scan attention (blocks 4, 10, 16, 22), and study attention (blocks 5, 11, 17, 23).
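The four attention schedules can be written as plain configuration, where each transformer block is assigned exactly one attention level. The dict layout and the `check_partition` helper are ours, purely illustrative:

```python
# Each variant: (depth, {attention level -> block indices}).
variants = {
    "vit_base_scan_study": (12, {
        "scan":  [0, 1, 3, 4, 6, 7, 9, 10],
        "study": [2, 5, 8, 11],
    }),
    "vit_base_slice_scan_study": (12, {
        "slice": [0, 3, 6, 9],
        "scan":  [1, 4, 7, 10],
        "study": [2, 5, 8, 11],
    }),
    "vit_large_scan_study": (24, {
        "scan":  [0, 1, 2, 3, 4, 6, 7, 8, 9, 10,
                  12, 13, 14, 15, 16, 18, 19, 20, 21, 22],
        "study": [5, 11, 17, 23],
    }),
    "vit_large_slice_scan_study": (24, {
        "slice": [0, 1, 2, 3, 6, 7, 8, 9, 12, 13, 14, 15, 18, 19, 20, 21],
        "scan":  [4, 10, 16, 22],
        "study": [5, 11, 17, 23],
    }),
}

def check_partition(depth, schedule):
    """Every block index appears exactly once across attention levels."""
    assigned = sorted(b for idxs in schedule.values() for b in idxs)
    return assigned == list(range(depth))
```

A sanity check like `check_partition` is useful when editing these schedules by hand, since a duplicated or missing block index fails silently otherwise.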
All models are trained for 20 epochs on the combined BrainMRI220K and HeadCT240K dataset with an initial learning rate of 5e-4 and a batch size of 768, followed by an additional 5 epochs of unmasked fine-tuning.
Columns: Prospective (CT, MRI) · Pub-Brain-5 anomaly detection (STR, GLI, MEN, MET, mean) · RSNA full set (IPH, IVH, SAH, SDH, Any, mean).

| Model | CT | MRI | STR | GLI | MEN | MET | mean | IPH | IVH | SAH | SDH | Any | mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| in the paper | 75.9 | 81.1 | 91.5 | 89.2 | 79.2 | 78.1 | 84.5 | 88.2 | 91.4 | 84.1 | 83.4 | 81.5 | 85.7 |
| 2025-10-08 | 89.1 | 89.1 | 94.8 | 94.8 | 86.0 | 86.2 | 90.5 | 93.5 | 96.4 | 90.2 | 89.1 | 90.8 | 92.0 |
| ViT-Base (scan + study) | 89.2 | 88.9 | 93.3 | 98.5 | 87.6 | 87.9 | 91.8 | 93.4 | 96.7 | 90.1 | 89.2 | 93.4 | 92.7 |
| ViT-Base (slice + scan + study) | 89.0 | 89.1 | 93.2 | 99.2 | 84.9 | 89.5 | 91.6 | 93.8 | 96.9 | 90.8 | 89.8 | 94.2 | 93.1 |
| ViT-Large (scan + study) | 89.0 | 89.7 | 92.5 | 99.6 | 83.7 | 84.8 | 90.2 | 94.1 | 96.8 | 91.1 | 87.9 | 94.6 | 92.9 |
| ViT-Large (slice + scan + study) | 89.6 | 89.6 | 94.9 | 99.1 | 86.5 | 85.7 | 91.5 | 94.2 | 96.9 | 91.2 | 89.0 | 94.1 | 93.1 |
“in the paper” denotes HLIP models described in the original paper, while “2025-10-08” denotes an HLIP model aligned with the first version of this blog, trained using only sentence dropout. We evaluate all models on Pub-Brain-5’s anomaly detection task and on the full RSNA dataset, demonstrating superior performance compared with the HLIP model in the original paper. Note that these experiments are conducted under the zero-shot setting.
At the end of this blog post, we share several interesting findings and unsuccessful attempts from our experiments. We hope these observations provide new insights for researchers interested in this line of research.
Although GPT-4o mini and GPT-4.1 mini are more advanced models than GPT-3.5, we find that supervising on reports summarized by these two models can lead to a significant decrease in zero-shot performance.
We find that either sentence dropout or dual contrastive loss can largely alleviate this issue.
We describe four designs that, in our experiments, do not provide benefits in the current setting.
If this blog or the HLIP work is useful in your research, please consider citing:
@article{zhao2026towards,
  title   = {Towards Scalable Language-Image Pre-training for 3D Medical Imaging},
  author  = {Chenhui Zhao and Yiwei Lyu and Asadur Zaman Chowdury and Edward S Harake and Akhil Kondepudi and Akshay T Rao and Xinhai Hou and Honglak Lee and Todd C Hollon},
  journal = {Transactions on Machine Learning Research},
  issn    = {2835-8856},
  year    = {2026},
  url     = {https://openreview.net/forum?id=WxHf4EcBWA}
}

@misc{zhao2026hlipablationblog,
  author = {Chenhui Zhao},
  title  = {HLIP Ablation},
  year   = {2026},
  url    = {https://zch0414.github.io/hlip-ablation/},
  note   = {Accessed: 2026-02-23}
}