2025-11-17 (updated 2026-02-22) · Chenhui Zhao · MLiNS @ University of Michigan
In HLIP, we presented a language–image pre-training framework designed for uncurated 3D medical data, built around a hierarchical attention mechanism. HLIP achieves state-of-the-art results on both curated and uncurated 3D medical datasets spanning brain MRI, head CT, and chest CT. We attribute these gains to effective modeling, careful implementation, and scalability. In this blog, building on HLIP's conclusions and implementation, we push scalability one step further for uncurated 3D medical data. To this end, we conduct five ablation studies on changes that do not by themselves improve performance yet are crucial for scalability and for advancing vision–language modeling, including visual instruction tuning. This yields new HLIP models trained on the combined BrainMRI220K and HeadCT240K datasets. We further introduce a simple yet effective adjustment to the language supervision, resulting in updated HLIP models.
The code and models presented in this blog have been published.
While HLIP uses the external Pub-Brain-5 dataset to ablate different model designs, this dataset contains only five classes (normal, stroke, glioma, meningioma, metastasis), which is not sufficiently comprehensive to assess model capacity. The same limitation applies to the external RSNA dataset. In the following experiments, we instead evaluate on our prospective dataset, which contains 23K studies covering 74 diagnoses for brain MRI and approximately 15K studies covering 83 diagnoses for head CT. Moreover, the linear-probe protocol can introduce additional bias during evaluation. Therefore, we instead use a zero-shot evaluation protocol, averaging over multiple prompts for stability (similar to the implementation in open-clip). Given the scale of our evaluation, even a 0.5 or 1.0 AUC gain would be considered significant. Although the evaluation set is not publicly available, we hope that the conclusions drawn from these comprehensive evaluations can facilitate future work for the community.
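The prompt-averaged zero-shot protocol described above can be sketched as follows. This is a minimal sketch in the style of open_clip's prompt ensembling; the function and variable names are ours, not from the HLIP codebase:

```python
import torch
import torch.nn.functional as F

def zero_shot_scores(image_feats, prompt_feats_per_class):
    """Score images against classes by averaging multiple prompt embeddings
    per class (open_clip-style prompt ensembling), then taking cosine similarity."""
    class_embs = []
    for prompt_feats in prompt_feats_per_class:      # (n_prompts, d) per class
        emb = F.normalize(prompt_feats, dim=-1).mean(dim=0)  # average prompts
        emb = F.normalize(emb, dim=0)                # renormalize the mean
        class_embs.append(emb)
    classifier = torch.stack(class_embs, dim=1)      # (d, n_classes)
    image_feats = F.normalize(image_feats, dim=-1)
    return image_feats @ classifier                  # (n_images, n_classes)
```

Per-class AUC is then computed from these similarity scores against the diagnosis labels.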
We reimplement HLIP on the HeadCT240K and BrainMRI220K datasets, achieving 75.9 AUC on head CT and 81.1 AUC on brain MRI.
All three experiments are conducted on the BrainMRI220K dataset.
Keeping all five subtle but meaningful changes, we train HLIP on the combined BrainMRI220K and HeadCT240K datasets. Using an effective batch size of 768, achieved with 2 gradient-accumulation steps, the training process takes approximately two days on eight L40 GPUs. With these changes, HLIP achieves 79.2 AUC on head CT and 80.6 AUC on brain MRI.
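The gradient-accumulation setup can be sketched as below, assuming a per-step micro-batch of 384 accumulated over 2 steps to reach the effective batch of 768. The tiny linear model and random tensors here are placeholders, not HLIP components:

```python
import torch
import torch.nn.functional as F

# Effective batch = micro-batch × accum_steps (e.g. 768 = 384 × 2).
accum_steps = 2
model = torch.nn.Linear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=5e-4)

micro_batches = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(4)]
for step, (x, y) in enumerate(micro_batches):
    loss = F.mse_loss(model(x), y)
    (loss / accum_steps).backward()       # scale so summed grads equal the mean
    if (step + 1) % accum_steps == 0:     # update once per effective batch
        opt.step()
        opt.zero_grad()
```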
Image captions used in the original CLIP are short, often fewer than 60 words, whereas radiology reports are substantially longer, even when using an LLM-generated summary or the impression section. Motivated by this mismatch, we randomly select a single sentence from the report at each training step during language–image pre-training (sentence dropout). We find that this simple change yields a significant improvement: with it, HLIP achieves 88.9 AUC on both head CT and brain MRI. We hypothesize that the gain stems from the limited representational capacity of the language model and the distribution shift between training (long text) and zero-shot evaluation (short prompts).
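Sentence dropout reduces to a one-line sampling step in the data loader. A minimal sketch, with a naive regex-based sentence splitter standing in for whatever tokenization the real pipeline uses:

```python
import random
import re

def sample_sentence(report: str, rng: random.Random) -> str:
    """Sentence dropout: supervise each training step with one
    randomly chosen sentence from the full report."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip()
                 for s in re.split(r'(?<=[.!?])\s+', report.strip())
                 if s.strip()]
    return rng.choice(sentences)
```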
One limitation of sentence dropout is that it can destabilize training. In practice, to ensure stable optimization, we must use a smaller learning rate (reduced from 6e-4 to 4e-4). This instability increases the risk when scaling to ViT-Large, which is typically more difficult to train than ViT-Base. To improve training stability and avoid unnecessary hyperparameter tuning for ViT-Large, we introduce a dual contrastive loss, following the strategy proposed in TIPS. Specifically, we add two CLS tokens to the ViT and, at each step, contrast them with a randomly selected sentence and the full report, respectively. We observe faster convergence for the sentence contrastive loss. Intuitively, the model learns global features from sentence supervision and dense features from full-report supervision.
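The dual contrastive loss can be sketched as two standard CLIP losses, one per CLS token. This is a hedged sketch under our reading of the TIPS-style setup; the symmetric InfoNCE form and the fixed temperature are our choices, not confirmed HLIP hyperparameters:

```python
import torch
import torch.nn.functional as F

def clip_loss(img, txt, t=0.07):
    """Symmetric InfoNCE over an in-batch similarity matrix."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / t
    labels = torch.arange(img.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

def dual_contrastive_loss(cls_sent, cls_report, sent_emb, report_emb):
    # CLS token 1 <-> randomly selected sentence; CLS token 2 <-> full report.
    return clip_loss(cls_sent, sent_emb) + clip_loss(cls_report, report_emb)
```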
Building on the incremental designs introduced so far, we train four HLIP variants:

1. ViT-Base with scan attention (blocks 0, 1, 3, 4, 6, 7, 9, 10) and study attention (blocks 2, 5, 8, 11);
2. ViT-Base with slice attention (blocks 0, 3, 6, 9), scan attention (blocks 1, 4, 7, 10), and study attention (blocks 2, 5, 8, 11);
3. ViT-Large with scan attention (blocks 0–4, 6–10, 12–16, 18–22) and study attention (blocks 5, 11, 17, 23);
4. ViT-Large with slice attention (blocks 0–3, 6–9, 12–15, 18–21), scan attention (blocks 4, 10, 16, 22), and study attention (blocks 5, 11, 17, 23).
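The four attention schedules can be written as plain configuration, where each transformer block is assigned exactly one attention level. The dict layout and the `check_partition` helper are ours, purely illustrative:

```python
# Each variant: (depth, {attention level -> block indices}).
variants = {
    "vit_base_scan_study": (12, {
        "scan":  [0, 1, 3, 4, 6, 7, 9, 10],
        "study": [2, 5, 8, 11],
    }),
    "vit_base_slice_scan_study": (12, {
        "slice": [0, 3, 6, 9],
        "scan":  [1, 4, 7, 10],
        "study": [2, 5, 8, 11],
    }),
    "vit_large_scan_study": (24, {
        "scan":  [0, 1, 2, 3, 4, 6, 7, 8, 9, 10,
                  12, 13, 14, 15, 16, 18, 19, 20, 21, 22],
        "study": [5, 11, 17, 23],
    }),
    "vit_large_slice_scan_study": (24, {
        "slice": [0, 1, 2, 3, 6, 7, 8, 9, 12, 13, 14, 15, 18, 19, 20, 21],
        "scan":  [4, 10, 16, 22],
        "study": [5, 11, 17, 23],
    }),
}

def check_partition(depth, schedule):
    """Every block index appears exactly once across attention levels."""
    assigned = sorted(b for idxs in schedule.values() for b in idxs)
    return assigned == list(range(depth))
```

A sanity check like `check_partition` is useful when editing these schedules by hand, since a duplicated or missing block index fails silently otherwise.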
All models are trained for 20 epochs on the combined BrainMRI220K and HeadCT240K dataset with an initial learning rate of 5e-4 and a batch size of 768, followed by an additional 5 epochs of unmasked fine-tuning.
Columns: Prospective (CT, MRI) · Pub-Brain-5 anomaly detection (STR, GLI, MEN, MET, mean) · RSNA full set (IPH, IVH, SAH, SDH, Any, mean).

| Model | CT | MRI | STR | GLI | MEN | MET | mean | IPH | IVH | SAH | SDH | Any | mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| in the paper | 75.9 | 81.1 | 91.5 | 89.2 | 79.2 | 78.1 | 84.5 | 88.2 | 91.4 | 84.1 | 83.4 | 81.5 | 85.7 |
| 2025-10-08 | 89.1 | 89.1 | 94.8 | 94.8 | 86.0 | 86.2 | 90.5 | 93.5 | 96.4 | 90.2 | 89.1 | 90.8 | 92.0 |
| ViT-Base (scan + study) | 89.2 | 88.9 | 93.3 | 98.5 | 87.6 | 87.9 | 91.8 | 93.4 | 96.7 | 90.1 | 89.2 | 93.4 | 92.7 |
| ViT-Base (slice + scan + study) | 89.0 | 89.1 | 93.2 | 99.2 | 84.9 | 89.5 | 91.6 | 93.8 | 96.9 | 90.8 | 89.8 | 94.2 | 93.1 |
| ViT-Large (scan + study) | 89.0 | 89.7 | 92.5 | 99.6 | 83.7 | 84.8 | 90.2 | 94.1 | 96.8 | 91.1 | 87.9 | 94.6 | 92.9 |
| ViT-Large (slice + scan + study) | 89.6 | 89.6 | 94.9 | 99.1 | 86.5 | 85.7 | 91.5 | 94.2 | 96.9 | 91.2 | 89.0 | 94.1 | 93.1 |
“in the paper” denotes HLIP models described in the original paper, while “2025-10-08” denotes an HLIP model aligned with the first version of this blog, trained using only sentence dropout. We evaluate all models on Pub-Brain-5’s anomaly detection task and on the full RSNA dataset, demonstrating superior performance compared with the HLIP model in the original paper. Note that these experiments are conducted under the zero-shot setting.
At the end of this blog post, we share several interesting findings and unsuccessful attempts from our experiments. We hope these observations provide new insights for researchers interested in this line of research.
Although GPT-4o mini and GPT-4.1 mini are more advanced models than GPT-3.5, we find that supervising on reports summarized by these two models can lead to a significant decrease in zero-shot performance.
We find that either sentence dropout or dual contrastive loss can largely alleviate this issue.
We describe four designs that, in our experiments, do not provide benefits in the current setting.
If this blog or the HLIP work is useful in your research, please consider citing:
@article{zhao2026towards,
  title   = {Towards Scalable Language-Image Pre-training for 3D Medical Imaging},
  author  = {Chenhui Zhao and Yiwei Lyu and Asadur Zaman Chowdury and Edward S Harake and Akhil Kondepudi and Akshay T Rao and Xinhai Hou and Honglak Lee and Todd C Hollon},
  journal = {Transactions on Machine Learning Research},
  issn    = {2835-8856},
  year    = {2026},
  url     = {https://openreview.net/forum?id=WxHf4EcBWA}
}

@misc{zhao2026hlipablationblog,
  author = {Chenhui Zhao},
  title  = {HLIP Ablation},
  year   = {2026},
  url    = {https://zch0414.github.io/hlip-ablation/},
  note   = {Accessed: 2026-02-23}
}