2025-11-17 · Chenhui Zhao · MLiNS @ University of Michigan
In HLIP, we present a language–image pre-training framework designed for uncurated 3D medical data that incorporates a hierarchical attention mechanism. HLIP achieves state-of-the-art results on both curated and uncurated 3D medical datasets spanning brain MRI, head CT, and chest CT. We attribute these gains to effective modeling, careful implementation, and scalability. In this blog, building on HLIP’s conclusions and implementation, we push scalability one step further for uncurated 3D medical data. To this end, we conduct five ablation studies on changes that appear not to improve performance yet are crucial for scalability and for advancing vision–language modeling, including visual instruction tuning. This yields a new HLIP model trained on the combined BrainMRI220K and HeadCT240K datasets. We further introduce a simple yet effective adjustment to the language supervision, resulting in an updated HLIP model.
The code and model presented in this blog have been published.
While HLIP uses the external Pub-Brain-5 dataset to ablate different model designs, this dataset contains only five classes (normal, stroke, glioma, meningioma, metastasis), which is not comprehensive enough to assess model capacity. The same limitation applies to the external RSNA dataset. In the following experiments, we instead evaluate on our prospective dataset, which contains 23K brain MRI studies covering 74 diagnoses and approximately 15K head CT studies covering 83 diagnoses. Moreover, the linear-probe protocol can introduce additional bias during evaluation, so we adopt a zero-shot evaluation protocol, averaging over multiple prompts for stability (similar to the implementation in open_clip). Although the evaluation set is not publicly available, we hope the conclusions drawn from these comprehensive evaluations will facilitate future work for the community.
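For reference, below is a minimal sketch of this prompt-ensembled zero-shot protocol in the style of open_clip's zero-shot classifier. The `encode_text`/`encode_image` interfaces, the tokenizer call, and the prompt templates are illustrative assumptions, not the exact HLIP evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_zero_shot_classifier(model, tokenizer, class_names, templates, device):
    """Build one ensembled text embedding per class by averaging prompt embeddings."""
    weights = []
    for name in class_names:
        prompts = [t.format(name) for t in templates]
        tokens = tokenizer(prompts).to(device)
        class_embeds = F.normalize(model.encode_text(tokens), dim=-1)
        weights.append(F.normalize(class_embeds.mean(dim=0), dim=-1))
    return torch.stack(weights, dim=1)   # (embed_dim, num_classes)

@torch.no_grad()
def zero_shot_scores(model, classifier, volumes):
    """Cosine similarity between volume embeddings and the ensembled class embeddings."""
    feats = F.normalize(model.encode_image(volumes), dim=-1)
    return feats @ classifier            # (batch, num_classes)

# Illustrative prompt templates; the exact set used in our evaluation may differ.
TEMPLATES = ["an mri of the brain showing {}.", "brain mri, findings consistent with {}."]
```

Averaging the normalized embeddings of several prompts per class makes the zero-shot scores less sensitive to the exact wording of any single prompt.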
We first reimplement the HLIP model separately on the HeadCT240K and BrainMRI220K datasets.
All three experiments are conducted on the BrainMRI220K dataset.
Keeping all five subtle but meaningful changes, we train HLIP on the combined BrainMRI220K and HeadCT240K datasets. With a batch size of 768, reached through two gradient-accumulation steps, training takes approximately two days on eight L40 GPUs. Combining the two datasets yields a significant advantage for head CT.
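For readers reproducing this setup, here is a minimal sketch of a gradient-accumulation loop. The per-GPU batch split, the `clip_loss` signature, and the model returning `(image_features, text_features, logit_scale)` follow common open_clip-style training code and are assumptions rather than the exact HLIP implementation.

```python
def train_one_epoch(model, clip_loss, loader, optimizer, accum_steps=2):
    """Gradient accumulation: effective batch = per-GPU batch x num GPUs x accum_steps."""
    optimizer.zero_grad()
    for step, (volumes, texts) in enumerate(loader):
        image_features, text_features, logit_scale = model(volumes, texts)
        # Divide so the accumulated gradient matches a single large-batch update.
        loss = clip_loss(image_features, text_features, logit_scale) / accum_steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```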
Image captions used in the original CLIP are very short, whereas radiology reports are much longer, even when using an LLM-generated summary or the impression section. Motivated by this gap, we randomly select a single sentence from each report at every training step during language–image pre-training. We find that this simple adjustment yields a significant improvement. We hypothesize that the improvement arises from the limited representational capacity of the language model and from the distribution shift between training (long text) and zero-shot evaluation (short prompts).
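Below is a minimal sketch of this sentence sampling (referred to as sentence dropout later in this blog); the regex-based sentence splitter and the function name are illustrative assumptions.

```python
import random
import re

def sentence_dropout(report: str) -> str:
    """Keep a single randomly chosen sentence of the report for this training step."""
    # Naive splitter on sentence-ending punctuation; a real pipeline may use a proper tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]
    return random.choice(sentences) if sentences else report

# Applied on the fly (e.g., in the dataset's __getitem__), so each epoch sees different sentences.
report = "No acute intracranial hemorrhage. Chronic small-vessel ischemic changes. No mass effect."
print(sentence_dropout(report))
```

Because the sampling happens on the fly, the same study is paired with different sentences across epochs, which can also be viewed as a mild form of text augmentation.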
We further perform unmasked fine-tuning, maintaining the same batch size of 768 by increasing the gradient-accumulation steps to 6. Unmasked fine-tuning further improves performance. This yields our updated HLIP model.
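Presumably because unmasked volumes consume more memory per sample, keeping the effective batch at 768 requires more accumulation; the arithmetic below is a hypothetical illustration of that bookkeeping, and the per-GPU batch size shown is an assumption.

```python
# Illustrative arithmetic only; the per-GPU batch split is an assumption.
world_size = 8      # eight L40 GPUs
accum_steps = 6     # up from 2 in the masked pre-training stage
per_gpu_batch = 768 // (world_size * accum_steps)   # -> 16 volumes per GPU per forward pass
assert per_gpu_batch * world_size * accum_steps == 768
```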
Pub-Brain-5 (Anomaly Detection)
| | Stroke | Glioma | Meningioma | Metastasis | Mean |
|---|---|---|---|---|---|
| HLIP | 91.5 | 89.2 | 79.2 | 78.1 | 84.5 |
| HLIP-2025-10-08 | 94.8 | 94.8 | 86.0 | 86.2 | 90.5 |
RSNA (Full Set)
| | Intraparenchymal | Intraventricular | Subarachnoid | Subdural | Any | Mean |
|---|---|---|---|---|---|---|
| HLIP | 88.2 | 91.4 | 84.1 | 83.4 | 81.5 | 85.7 |
| HLIP-2025-10-08 | 93.5 | 96.4 | 90.2 | 89.1 | 90.8 | 92.0 |
We evaluate this new model on Pub-Brain-5’s anomaly detection task and on the full RSNA dataset, demonstrating superior performance compared with the HLIP model in the original paper. Note that these experiments are conducted under the zero-shot setting.
Here, we report a phenomenon observed when supervising with LLM-summarized reports. Although GPT-4o mini and GPT-4.1 mini are more advanced models than GPT-3.5, we find that supervising on reports summarized by these two models can lead to a significant decrease in zero-shot performance.
We find that sentence dropout largely alleviates this issue.
To close this blog, we list four designs that do not provide benefits in our setting.