Hasan Abed Al Kader Hammoud and Bernard Ghanem
King Abdullah University of Science and Technology
We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. Differential attention was originally developed for large language models to amplify relevant context while canceling out noisy information. In this work, we integrate this mechanism into CLIP's dual encoder (image and text) framework. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks. Across zero-shot classification, retrieval, and robustness benchmarks, DiffCLIP consistently outperforms baseline CLIP models. Notably, these gains come with negligible computational overhead, demonstrating that differential attention can significantly enhance multi-modal representations without sacrificing efficiency.
Differential attention, proposed in Differential Transformer, computes the difference between two attention maps:
DiffAttn(X) = (softmax(Q₁K₁ᵀ/√d) − λ · softmax(Q₂K₂ᵀ/√d)) · V
where the query and key projections are split as [Q₁; Q₂] = X·W^Q and [K₁; K₂] = X·W^K, and λ is a learnable parameter. This mechanism allows the model to capture complementary information by explicitly modeling the differences between attention patterns, leading to richer multimodal representations.
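The formula above can be sketched in a few lines of PyTorch. This is a minimal single-head illustration of the mechanism only; the function and weight names here are simplified assumptions, not the repository's actual implementation:

```python
import torch
import torch.nn.functional as F

def diff_attn(x, Wq, Wk, Wv, lam):
    """Single-head differential attention over x of shape (B, N, 2d)."""
    d = x.shape[-1] // 2
    q = x @ Wq                      # (B, N, 2d), split below into Q1, Q2
    k = x @ Wk                      # (B, N, 2d), split below into K1, K2
    v = x @ Wv                      # values stay full-width
    q1, q2 = q.split(d, dim=-1)
    k1, k2 = k.split(d, dim=-1)
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d**0.5, dim=-1)
    # Difference of the two attention maps, weighted by learnable lambda
    return (a1 - lam * a2) @ v

B, N, d = 2, 16, 32
x = torch.randn(B, N, 2 * d)
Wq, Wk, Wv = (torch.randn(2 * d, 2 * d) * 0.02 for _ in range(3))
out = diff_attn(x, Wq, Wk, Wv, lam=torch.tensor(0.8))
print(out.shape)  # torch.Size([2, 16, 64])
```

Note that subtracting the second map can push some attention weights negative, which is intentional: it is how the mechanism cancels attention assigned to noisy context.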
The repository contains two main components:

- `DifferentialVisionTransformer` (in `diff_attention.py`): A Vision Transformer modified to use differential attention.
- `DiffCLIP` (in `diff_clip.py`): A CLIP model that uses differential attention in both its vision and text encoders.
```bash
# Clone the repository
git clone https://github.com/yourusername/DiffCLIP.git
cd DiffCLIP

# Install dependencies
pip install -r requirements.txt
```

```python
import torch
from diff_clip import DiffCLIP_VITB16

# Create model
model = DiffCLIP_VITB16()

# Process image and text
image = torch.randn(1, 3, 224, 224)
text = torch.randint(0, 49408, (1, 77))  # Tokenized text

# Get embeddings
with torch.no_grad():
    outputs = model(image, text)

print(outputs["image_embed"].shape)  # Should be [1, 512]
print(outputs["text_embed"].shape)   # Should be [1, 512]
```

You can use the provided test_models.py script to perform zero-shot classification:
```bash
# Download the model from Hugging Face and test on a COCO image
python test_models.py
```

This will:
- Download the DiffCLIP_ViTB16_CC12M model from Hugging Face
- Load a sample image from COCO
- Perform zero-shot classification
- Print the top-5 predicted classes
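The zero-shot classification step reduces to comparing L2-normalized image and text embeddings. Below is a minimal sketch of that scoring logic using random placeholder tensors in place of DiffCLIP's actual outputs; the fixed temperature of 100.0 follows the convention popularized by CLIP and is an assumption here, not the repository's exact value:

```python
import torch

# Placeholder embeddings standing in for model outputs: 1 image, 10 class prompts
image_embed = torch.randn(1, 512)
text_embeds = torch.randn(10, 512)

# Cosine similarity = dot product of L2-normalized vectors
image_embed = image_embed / image_embed.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
logits = 100.0 * image_embed @ text_embeds.t()  # temperature-scaled similarities
probs = logits.softmax(dim=-1)                  # probability over the 10 classes

# Top-5 predicted classes for the image
top5 = probs.topk(5, dim=-1)
print(top5.indices)
```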
```bibtex
@misc{hammoud2025diffclipdifferentialattentionmeets,
  title={DiffCLIP: Differential Attention Meets CLIP},
  author={Hasan Abed Al Kader Hammoud and Bernard Ghanem},
  year={2025},
  eprint={2503.06626},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.06626},
}
```
