
AdaptCLIP

HuggingFace Space

Official PyTorch Implementation of AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection, 2025.

News 🎉

  • [2025-11-01] AdaptCLIP achieves 81.4% I-AUROC, 92.2% P-AUROC, and 49.7% P-AUPR on the large-scale Real-IAD Variety using only 2 normal samples and no training, matching the I-AUROC of the state-of-the-art multi-class AD model Dinomaly (81.4% I-AUROC, 91.5% P-AUROC, and 37.6% P-AUPR) and surpassing it on both pixel-level metrics, even though Dinomaly trains on the full set of normal images.

Introduction

Universal visual anomaly detection aims to identify anomalies from novel or unseen vision domains without additional fine-tuning, which is critical in open scenarios. To this end, we present AdaptCLIP, a simple yet effective method based on two key insights:

  • Adaptive visual and textual representations should be learned alternately rather than jointly (see the training sketch below).
  • Comparative learning should incorporate contextual and aligned residual features rather than relying solely on residual features (see the comparison sketch after the framework figure).
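
As a hypothetical illustration of the first insight, the sketch below alternates gradient updates between a small visual adapter and a small textual adapter on top of frozen CLIP features. All names, dimensions, and the loss are assumptions made for illustration; CLIP features are replaced with random tensors, and this is not the repository's training code.

```python
# Hypothetical sketch of alternating (not joint) adapter training.
# Class names and shapes are illustrative assumptions, not this repo's API.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 768  # assumed CLIP feature width

class VisualAdapter(nn.Module):
    """Residual MLP on top of frozen CLIP patch features."""
    def __init__(self, dim=DIM, hidden=192):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):  # x: (B, N, dim) patch tokens
        return x + self.mlp(x)

class TextualAdapter(nn.Module):
    """Learnable offsets on frozen normal/abnormal text embeddings."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.offset = nn.Parameter(torch.zeros(2, dim))

    def forward(self, t):  # t: (2, dim), rows = [normal, abnormal]
        return t + self.offset

va, ta = VisualAdapter(), TextualAdapter()
opt_v = torch.optim.AdamW(va.parameters(), lr=1e-4)
opt_t = torch.optim.AdamW(ta.parameters(), lr=1e-4)

for step in range(100):
    patches = torch.randn(4, 196, DIM)      # stand-in for frozen CLIP patch features
    text = torch.randn(2, DIM)              # stand-in for frozen text embeddings
    labels = torch.randint(0, 2, (4, 196))  # per-patch normal(0)/abnormal(1) targets

    v = F.normalize(va(patches), dim=-1)
    t = F.normalize(ta(text), dim=-1)
    logits = 100.0 * v @ t.T                # (4, 196, 2) patch-text similarities
    loss = F.cross_entropy(logits.reshape(-1, 2), labels.reshape(-1))

    opt_v.zero_grad()
    opt_t.zero_grad()
    loss.backward()
    # Alternate: even steps update only the visual adapter, odd steps only the
    # textual adapter, so the two representations are learned alternately.
    (opt_v if step % 2 == 0 else opt_t).step()
```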

AdaptCLIP Framework

(Figure: overview of the AdaptCLIP framework, two panels.)
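
The second insight, comparing a query image against a normal prompt image using contextual features plus aligned residual features rather than residuals alone, can be sketched as follows. The `PromptQueryAdapter` name, shapes, and the assumption that the two feature maps are already spatially aligned are hypothetical, not this repository's API.

```python
# Hypothetical sketch of a prompt-query comparison head that keeps contextual
# features (the query and prompt tokens themselves) alongside the aligned
# residual features (their difference), instead of using residuals alone.
import torch
import torch.nn as nn

class PromptQueryAdapter(nn.Module):
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        # 3*dim input: [query context | prompt context | aligned residual]
        self.head = nn.Sequential(nn.Linear(3 * dim, hidden), nn.GELU(),
                                  nn.Linear(hidden, 1))

    def forward(self, query, prompt):
        # query, prompt: (B, N, dim) patch features from the test image and a
        # normal prompt image, assumed already spatially aligned.
        residual = query - prompt                             # residual features
        feats = torch.cat([query, prompt, residual], dim=-1)  # add context
        return self.head(feats).squeeze(-1)                   # (B, N) patch anomaly logits

pqa = PromptQueryAdapter()
q = torch.randn(2, 196, 768)   # 14x14 patch grid, stand-in for CLIP features
p = torch.randn(2, 196, 768)
print(pqa(q, p).shape)         # torch.Size([2, 196])
```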

Ablation Studies

| No. | Methods | Shots | TA | VA | PQA | MVTec | VisA |
|----:|:------------|------:|:--:|:--:|:---:|:-----------:|:-----------:|
| 0 | baselines | 0 | ✗ | ✗ | ✗ | 91.1 / 33.0 | 82.1 / 18.0 |
| 1 | baselines | 0 | ✓ | ✗ | ✗ | 92.2 / 31.4 | 82.9 / 19.7 |
| 2 | baselines | 0 | ✗ | ✓ | ✗ | 90.5 / 39.4 | 81.0 / 22.1 |
| 3 | joint | 0 | ✓ | ✓ | ✗ | 89.3 / 36.2 | 81.6 / 21.5 |
| 4 | alternating | 0 | ✓ | ✓ | ✗ | 93.5 / 38.3 | 84.8 / 26.1 |
| 5 | w/o context | 1 | ✗ | ✗ | ✓ | 62.6 / 7.0 | 85.3 / 28.7 |
| 6 | w/ context | 1 | ✗ | ✗ | ✓ | 88.1 / 50.2 | 88.9 / 38.1 |
| 7 | AdaptCLIP | 1 | ✓ | ✓ | ✓ | 94.2 / 52.5 | 92.0 / 38.8 |

TA, VA, and PQA denote the textual, visual, and prompt-query adapters; each MVTec/VisA cell reports image-level AUROC / pixel-level AUPR (%).

Note: Following previous works, we use AUROC for image-level anomaly classification and AUPR for pixel-level anomaly segmentation in our main paper. We emphasize that AUPR is the better metric for anomaly segmentation, where the class imbalance between normal and anomalous pixels is extreme, as pointed out in the VisA paper (ECCV 2022). In the Appendix, we also provide detailed comparisons using all metrics, including AUROC, AUPR, and F1max.
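
As a concrete illustration of the two metrics, here is a minimal, hypothetical example using scikit-learn on synthetic scores (not code from this repository); the pixel-level labels are made deliberately imbalanced (~1% anomalous) to mimic the segmentation setting described above.

```python
# Minimal, hypothetical example of the two metrics discussed above, computed
# with scikit-learn on synthetic data; not code from this repository.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Image-level anomaly classification -> AUROC (one score per image).
img_labels = rng.integers(0, 2, size=200)
img_scores = img_labels + 0.5 * rng.standard_normal(200)
i_auroc = roc_auc_score(img_labels, img_scores)

# Pixel-level anomaly segmentation -> AUPR. Anomalous pixels are made rare
# (~1%) to mimic the extreme imbalance that inflates pixel-level AUROC.
pix_labels = (rng.random(16 * 256 * 256) < 0.01).astype(int)
pix_scores = pix_labels + 0.5 * rng.standard_normal(pix_labels.shape)
p_aupr = average_precision_score(pix_labels, pix_scores)

print(f"I-AUROC={i_auroc:.3f}  P-AUPR={p_aupr:.3f}")
```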

Complexity and Efficiency Comparisons

| Shots | Methods | CLIP Models | Input Size | # F + L Params (M) | Inf. Time (ms) |
|------:|:------------------|:---------------|:----------:|:------------------:|---------------:|
| 0 | WinCLIP [16] | ViT-B-16+240 | 240×240 | 208.4 + 0.0 | 201.3 |
| 0 | WinCLIP [16] | ViT-B-16+240 | 512×512 | 208.4 + 0.0 | 3912.6 |
| 0 | AdaCLIP [6] | ViT-L/14@336px | 518×518 | 428.8 + 10.7 | 212.0 |
| 0 | AnomalyCLIP [53] | ViT-L/14@336px | 518×518 | 427.9 + 5.6 | 154.9 |
| 0 | AdaptCLIP-Zero | ViT-B-16+240 | 512×512 | 208.4 + 0.4 | 49.9 |
| 0 | AdaptCLIP-Zero | ViT-L/14@336px | 518×518 | 427.9 + 0.6 | 162.2 |
| 1 | WinCLIP+ [16] | ViT-B-16+240 | 240×240 | 208.4 + 0.0 | 339.5 |
| 1 | WinCLIP+ [16] | ViT-B-16+240 | 512×512 | 208.4 + 0.0 | 7434.9 |
| 1 | InCtrl [54] | ViT-B-16+240 | 240×240 | 208.4 + 0.3 | 337.0 |
| 1 | AnomalyCLIP+ [53] | ViT-L/14@336px | 518×518 | 427.9 + 5.6 | 158.6 |
| 1 | AdaptCLIP | ViT-B-16+240 | 512×512 | 208.4 + 1.4 | 54.0 |
| 1 | AdaptCLIP | ViT-L/14@336px | 518×518 | 427.9 + 1.8 | 168.2 |

Note: In the # F + L Params column, F denotes frozen parameters and L denotes learnable parameters, both in millions (M).
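
For reference, per-image inference time is typically measured along the lines of the sketch below. This is a generic, hypothetical benchmarking harness (`model` is a placeholder for any network), not this repository's timing code.

```python
# Hypothetical harness for measuring per-image inference time in ms.
import time
import torch

@torch.no_grad()
def time_model(model, size=(1, 3, 518, 518), warmup=10, iters=50, device="cuda"):
    model = model.to(device).eval()
    x = torch.randn(*size, device=device)
    for _ in range(warmup):          # warm up kernels and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()     # GPU work is async; sync before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3  # ms per forward pass

# Example (my_model is a placeholder): print(time_model(my_model))
```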

ToDo List

  • Release pre-trained AdaptCLIP models
  • Deploy an online AdaptCLIP demo on HuggingFace Space
  • Release testing code
  • Release training code

Star History

(Star history chart.)
