Official PyTorch Implementation of AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection, 2025.
- [2025-11-01] Using only 2 training-free normal samples on the large-scale Real-IAD Variety benchmark, AdaptCLIP achieves 81.4% I-AUROC, 92.2% P-AUROC, and 49.7% P-AUPR, matching the I-AUROC of the state-of-the-art multi-class AD model Dinomaly (81.4% I-AUROC, 91.5% P-AUROC, 37.6% P-AUPR), which uses the full set of normal training images, and surpassing it on P-AUROC and P-AUPR.
Universal visual anomaly detection aims to identify anomalies in novel or unseen vision domains without additional fine-tuning, which is critical in open scenarios. To this end, we present AdaptCLIP, a simple yet effective method built on two key insights:
- Adaptive visual and textual representations should be learned alternately rather than jointly.
- Comparative learning should incorporate contextual features together with aligned residual features, rather than relying solely on residual features (a minimal sketch of both insights follows this list).
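The following minimal PyTorch sketch illustrates both insights. All module names, feature shapes, the bottleneck design, and the dummy training loop are assumptions made for illustration only; they are not the implementation or API of this repository.

```python
# Minimal sketch of the two design insights; names/shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAdapter(nn.Module):
    """Residual bottleneck on top of frozen CLIP patch features."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                  nn.Linear(dim // 4, dim))

    def forward(self, x):                     # x: (B, N, D) patch tokens
        return x + self.proj(x)


class TextualAdapter(nn.Module):
    """Learnable offsets added to frozen normal/abnormal text embeddings."""
    def __init__(self, dim):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(2, dim))

    def forward(self, t):                     # t: (2, D) text embeddings
        return F.normalize(t + self.delta, dim=-1)


class PromptQueryAdapter(nn.Module):
    """Comparative head that fuses the aligned residual with the contextual
    query/prompt features themselves (insight 2), not the residual alone."""
    def __init__(self, dim):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(3 * dim, dim), nn.GELU(),
                                  nn.Linear(dim, 1))

    def forward(self, q, p):                  # q, p: (B, N, D)
        residual = (q - p).abs()              # aligned residual features
        fused = torch.cat([residual, q, p], dim=-1)  # keep the context
        return self.head(fused).squeeze(-1)   # (B, N) per-patch anomaly logits


# Insight 1: update the visual and textual adapters alternately, not jointly.
B, N, D = 2, 196, 512
va, ta, pqa = VisualAdapter(D), TextualAdapter(D), PromptQueryAdapter(D)
opt_v = torch.optim.AdamW(va.parameters(), lr=1e-4)
opt_t = torch.optim.AdamW(ta.parameters(), lr=1e-4)

for step in range(4):                         # dummy loop on random "features"
    patches = torch.randn(B, N, D)            # stand-in for frozen CLIP patch tokens
    text = F.normalize(torch.randn(2, D), dim=-1)
    labels = torch.randint(0, 2, (B, N))      # per-patch normal/anomaly labels
    logits = F.normalize(va(patches), dim=-1) @ ta(text).t()  # (B, N, 2)
    loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    opt_v.zero_grad(); opt_t.zero_grad()
    loss.backward()
    (opt_v if step % 2 == 0 else opt_t).step()  # alternate which adapter learns

# Insight 2: score a query against one normal prompt using context + residual.
anomaly_map = pqa(torch.randn(B, N, D), torch.randn(B, N, D))  # (B, N)
```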
| No. | Methods | Shots | TA | VA | PQA | MVTec (I-AUROC / P-AUPR) | VisA (I-AUROC / P-AUPR) |
|---|---|---|---|---|---|---|---|
| 0 | baselines | 0 | β | β | β | 91.1 / 33.0 | 82.1 / 18.0 |
| 1 | baselines | 0 | β | β | β | 92.2 / 31.4 | 82.9 / 19.7 |
| 2 | baselines | 0 | β | β | β | 90.5 / 39.4 | 81.0 / 22.1 |
| 3 | joint | 0 | β | β | β | 89.3 / 36.2 | 81.6 / 21.5 |
| 4 | alternating | 0 | β | β | β | 93.5 / 38.3 | 84.8 / 26.1 |
| 5 | w/o context | 1 | β | β | β | 62.6 / 7.0 | 85.3 / 28.7 |
| 6 | w context | 1 | β | β | β | 88.1 / 50.2 | 88.9 / 38.1 |
| 7 | AdaptCLIP | 1 | β | β | β | 94.2 / 52.5 | 92.0 / 38.8 |
Note: Following previous works, we use AUROC for image-level anomaly classification and AUPR for pixel-level anomaly segmentation in our main paper. We emphasize that AUPR is better suited to anomaly segmentation, where the imbalance between normal and anomalous pixels is extreme, as pointed out in the VisA paper (ECCV 2022). In the Appendix, we also provide detailed comparisons using all metrics, including AUROC, AUPR, and F1-max.
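As a quick, self-contained illustration of this point (not code from this repository), the snippet below scores a synthetic pixel map with an extreme normal/anomaly imbalance: AUROC stays optimistically high, while AUPR exposes the remaining false positives.

```python
# Why AUPR matters under extreme pixel imbalance (synthetic example).
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_pixels = 100_000
labels = np.zeros(n_pixels, dtype=int)
labels[:100] = 1                      # ~0.1% anomalous pixels: extreme imbalance

scores = rng.normal(0.0, 1.0, n_pixels)
scores[:100] += 2.0                   # anomalies score higher, but imperfectly

print("P-AUROC:", roc_auc_score(labels, scores))            # looks very high
print("P-AUPR :", average_precision_score(labels, scores))  # much lower, reflecting false positives
```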
| Shots | Methods | CLIP Models | Input Size | # F+L Params (M) | Inf. Time (ms) |
|---|---|---|---|---|---|
| 0 | WinCLIP [16] | ViT-B-16+240 | 240Γ240 | 208.4 + 0.0 | 201.3 |
| 0 | WinCLIP [16] | ViT-B-16+240 | 512Γ512 | 208.4 + 0.0 | 3912.6 |
| 0 | AdaCLIP [6] | ViT-L/14@336px | 518Γ518 | 428.8 + 10.7 | 212.0 |
| 0 | AnomalyCLIP [53] | ViT-L/14@336px | 518Γ518 | 427.9 + 5.6 | 154.9 |
| 0 | AdaptCLIP-Zero | ViT-B-16+240 | 512Γ512 | 208.4 + 0.4 | 49.9 |
| 0 | AdaptCLIP-Zero | ViT-L/14@336px | 518Γ518 | 427.9 + 0.6 | 162.2 |
| 1 | WinCLIP+ [16] | ViT-B-16+240 | 240Γ240 | 208.4 + 0.0 | 339.5 |
| 1 | WinCLIP+ [16] | ViT-B-16+240 | 512Γ512 | 208.4 + 0.0 | 7434.9 |
| 1 | InCtrl [54] | ViT-B-16+240 | 240Γ240 | 208.4 + 0.3 | 337.0 |
| 1 | AnomalyCLIP+ [53] | ViT-L/14@336px | 518Γ518 | 427.9 + 5.6 | 158.6 |
| 1 | AdaptCLIP | ViT-B-16+240 | 512Γ512 | 208.4 + 1.4 | 54.0 |
| 1 | AdaptCLIP | ViT-L/14@336px | 518Γ518 | 427.9 + 1.8 | 168.2 |
Note: F denotes frozen parameters and L denotes learnable parameters, both in millions (M).
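The frozen/learnable split reported above can be reproduced for any PyTorch model by counting parameters according to `requires_grad`. The helper below is a generic sketch, not the repository's benchmarking script; the stand-in backbone/adapter modules are hypothetical.

```python
# Count frozen vs. learnable parameters (in millions) for a PyTorch model.
import torch.nn as nn

def frozen_learnable_params_m(model: nn.Module):
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    learnable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return frozen / 1e6, learnable / 1e6

# Stand-in example: freeze a "backbone", keep a small "adapter" trainable.
backbone = nn.Linear(768, 768)
for p in backbone.parameters():
    p.requires_grad_(False)
adapter = nn.Linear(768, 2)
model = nn.Sequential(backbone, adapter)

f, l = frozen_learnable_params_m(model)
print(f"Frozen: {f:.2f}M, Learnable: {l:.2f}M")
```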
- Release pre-trained AdaptCLIP models
- Deploy an online AdaptCLIP demo on HuggingFace Space
- Release the testing code
- Release the training code


