Hasan Abed Al Kader Hammoud and Bernard Ghanem
King Abdullah University of Science and Technology
We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. Differential attention was originally developed for large language models to amplify relevant context while canceling out noisy information. In this work, we integrate this mechanism into CLIP's dual encoder (image and text) framework. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks. Across zero-shot classification, retrieval, and robustness benchmarks, DiffCLIP consistently outperforms baseline CLIP models. Notably, these gains come with negligible computational overhead, demonstrating that differential attention can significantly enhance multi-modal representations without sacrificing efficiency.
Differential attention, proposed in Differential Transformer, computes the difference between two attention maps:
DiffAttn(X) = (softmax(Q₁K₁ᵀ/√d) − λ · softmax(Q₂K₂ᵀ/√d)) · V
where the query and key projections are split as [Q₁; Q₂] = X·W^Q and [K₁; K₂] = X·W^K, and λ is a learnable parameter. This mechanism allows the model to capture complementary information by explicitly modeling the differences between attention patterns, leading to richer multimodal representations.
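The formula above can be sketched in a few lines of PyTorch. This is a minimal single-head illustration of the mechanism only; the function and weight names here are simplified assumptions, not the repository's actual implementation:

```python
import torch
import torch.nn.functional as F

def diff_attn(x, Wq, Wk, Wv, lam):
    """Single-head differential attention over x of shape (B, N, 2d)."""
    d = x.shape[-1] // 2
    q = x @ Wq                      # (B, N, 2d), split below into Q1, Q2
    k = x @ Wk                      # (B, N, 2d), split below into K1, K2
    v = x @ Wv                      # values stay full-width
    q1, q2 = q.split(d, dim=-1)
    k1, k2 = k.split(d, dim=-1)
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1)
    # Difference of the two attention maps, weighted by learnable lambda
    return (a1 - lam * a2) @ v

B, N, d = 2, 16, 32
x = torch.randn(B, N, 2 * d)
Wq, Wk, Wv = (torch.randn(2 * d, 2 * d) * 0.02 for _ in range(3))
out = diff_attn(x, Wq, Wk, Wv, lam=torch.tensor(0.8))
print(out.shape)  # torch.Size([2, 16, 64])
```

Note that subtracting the second map can push some attention weights negative, which is intentional: it is how the mechanism cancels attention assigned to noisy context.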
The repository contains two main components:

- `DifferentialVisionTransformer` (in `diff_attention.py`): A Vision Transformer modified to use differential attention.
- `DiffCLIP` (in `diff_clip.py`): A CLIP model that uses differential attention in both its vision and text encoders.
```bash
# Clone the repository
git clone https://github.com/yourusername/DiffCLIP.git
cd DiffCLIP

# Install dependencies
pip install -r requirements.txt
```

```python
import torch
from diff_clip import DiffCLIP_VITB16

# Create model
model = DiffCLIP_VITB16()

# Process image and text
image = torch.randn(1, 3, 224, 224)
text = torch.randint(0, 49408, (1, 77))  # Tokenized text

# Get embeddings
with torch.no_grad():
    outputs = model(image, text)

print(outputs["image_embed"].shape)  # Should be [1, 512]
print(outputs["text_embed"].shape)   # Should be [1, 512]
```

You can use the provided test_models.py script to perform zero-shot classification:
```bash
# Download the model from Hugging Face and test on a COCO image
python test_models.py
```

This will:
- Download the DiffCLIP_ViTB16_CC12M model from Hugging Face
- Load a sample image from COCO
- Perform zero-shot classification
- Print the top-5 predicted classes
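The zero-shot classification step reduces to comparing L2-normalized image and text embeddings. Below is a minimal sketch of that scoring logic using random placeholder tensors in place of DiffCLIP's actual outputs; the fixed temperature of 100.0 follows the convention popularized by CLIP and is an assumption here, not the repository's exact value:

```python
import torch

# Placeholder embeddings standing in for model outputs: 1 image, 10 class prompts
image_embed = torch.randn(1, 512)
text_embeds = torch.randn(10, 512)

# Cosine similarity = dot product of L2-normalized vectors
image_embed = image_embed / image_embed.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
logits = 100.0 * image_embed @ text_embeds.t()  # temperature-scaled similarities
probs = logits.softmax(dim=-1)                  # probability over the 10 classes

# Top-5 predicted classes for the image
top5 = probs.topk(5, dim=-1)
print(top5.indices)
```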
```bibtex
@misc{hammoud2025diffclipdifferentialattentionmeets,
  title={DiffCLIP: Differential Attention Meets CLIP},
  author={Hasan Abed Al Kader Hammoud and Bernard Ghanem},
  year={2025},
  eprint={2503.06626},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.06626},
}
```
