# Patchify (WACV 2026)

Patchify is a training-free framework for instance-level image retrieval. Gallery images are decomposed into a spatial pyramid of patches, and query embeddings are matched against local patch features from a pretrained vision encoder (SigLIP, DINOv2, or CLIP). The approach achieves high retrieval and localization performance without any fine-tuning.
## Table of Contents

- [Installation](#installation)
- [Dataset Preparation](#dataset-preparation)
- [Quick Start](#quick-start)
- [Reproducing Paper Results](#reproducing-paper-results)
- [CLI Reference](#cli-reference)
- [Project Structure](#project-structure)
- [Citation](#citation)
## Installation

We recommend Python 3.10 with CUDA 11.8.

```bash
# 1. Clone the repository
git clone https://github.com/kaist-ami/Patchwise-Retrieval.git
cd Patchwise-Retrieval

# 2. Create and activate a conda environment
conda create -n ssr python=3.10 -y
conda activate ssr

# 3. Install dependencies
pip install torch==2.1.0+cu118 torchvision==0.16.0+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```

Key packages in `requirements.txt`:

```
timm>=0.9.0       # pretrained vision models (SigLIP, DINOv2, CLIP)
faiss-gpu>=1.7.0  # approximate nearest-neighbour search
numpy, Pillow, tqdm, omegaconf, hydra-core, loguru, matplotlib, pyyaml
```
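Before running anything, it can help to confirm that the key dependencies are importable. A minimal, stdlib-only helper (hypothetical, not part of the repository) that checks for them without importing heavy packages:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that is not importable in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Top-level module names for the core dependencies listed above.
required = ["torch", "torchvision", "timm", "faiss", "omegaconf", "hydra", "loguru"]
print("missing:", missing_packages(required) or "none")
```

`importlib.util.find_spec` only inspects the import machinery, so this runs instantly even when the packages are large.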
## Dataset Preparation

We evaluate on ILIAS (Instance-Level Image retrieval At Scale). Download the ILIAS core set from the official ILIAS website and place it under `ilias/ilias_core/`:

```
ilias/
├── ilias_core/     # images (4,885 gallery + 1,232 query) ← download here
├── query.json      # query annotations (included)
└── gallery.json    # gallery annotations (included)
```

`query.json` and `gallery.json` are already included in this repository with relative paths pre-configured.
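A quick way to verify the layout before launching an evaluation is to check for the expected entries. A small sketch (hypothetical helper, not shipped with the repository):

```python
from pathlib import Path

def check_layout(root):
    """Return the expected entries missing under `root` (tree as shown above)."""
    expected = ("ilias_core", "query.json", "gallery.json")
    return [name for name in expected if not (Path(root) / name).exists()]

print(check_layout("ilias"))  # [] once the core set is downloaded
```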
## Quick Start

All scripts are in the `scripts/` directory. On the first run, patch features are automatically extracted and cached under `./features/`; trained FAISS indices are cached under `./faiss_indices/`; logs are written to `./logs/`.

```bash
bash scripts/run_eval.sh      # exact search
bash scripts/run_eval_pq.sh   # IVF-PQ search
```

`run_eval.sh` expands to:

```bash
python main.py \
    --config src/eval_retrieval.py \
    --model_name siglip \
    --multi-scale pyramid --crop True \
    --pq False \
    --tile_size_res 3 \
    --compute_metrics "mAP, locscore" --iou_threshold default
```

Change `--tile_size_res` to evaluate at a different patchify level (0 = global image, 3 = 4×4 patch grid).
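The patchify levels form a simple spatial pyramid. Assuming level `L` corresponds to a uniform `(L+1) × (L+1)` grid (an inference from 0 = global image and 3 = 4×4 above, not the repository's exact implementation), the patch boxes at one level can be sketched as:

```python
def patch_boxes(width, height, level):
    """Uniform (level+1) x (level+1) grid of (left, top, right, bottom) boxes.

    Assumption: level L -> (L+1)x(L+1) grid, matching 0 = global, 3 = 4x4.
    """
    n = level + 1
    xs = [round(i * width / n) for i in range(n + 1)]
    ys = [round(j * height / n) for j in range(n + 1)]
    return [(xs[i], ys[j], xs[i + 1], ys[j + 1])
            for j in range(n) for i in range(n)]

print(len(patch_boxes(512, 512, 3)))  # 16 patches at L3
print(patch_boxes(512, 512, 0))       # [(0, 0, 512, 512)] at L0
```

Each box is then cropped (`--crop True`) and embedded independently, and the pyramid concatenates the boxes from all levels up to `--tile_size_res`.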
| Model | Level | mAP | LocScore |
|---|---|---|---|
| SigLIP | L3 (4×4) | 64.37% | 19.24% |
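For reference, mAP averages a per-query average precision over the ranked gallery. A minimal sketch of the standard definition (not the repository's `src/metrics.py` implementation):

```python
def average_precision(relevance):
    """AP for one ranked list; relevance[i] is 1 if the rank-(i+1) result is relevant."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(per_query_relevance):
    """mAP: mean of per-query AP values."""
    return sum(average_precision(r) for r in per_query_relevance) / len(per_query_relevance)

print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 ≈ 0.8333
```

LocScore additionally checks whether the best-matching patch overlaps the annotated bounding box above the IoU threshold, which is why it is reported separately.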
## Reproducing Paper Results

IVF-PQ compresses the gallery index by roughly 13× with a small accuracy trade-off:
```bash
python main.py \
    --config src/eval_retrieval.py \
    --model_name siglip \
    --multi-scale pyramid --crop True \
    --pq ivfpq \
    --pq_m 64 --pq_nlist 8192 --pq_nbits 8 --pq_nprobe 8192 \
    --pq_dist_type IP --pq_train_res 1 \
    --tile_size_res 3 \
    --compute_metrics "mAP, locscore" --iou_threshold default
```

| Method | mAP | LocScore |
|---|---|---|
| Exact search | 64.37% | 19.24% |
| IVF-PQ | 59.30% | 17.44% |

The paper reports 59.96% mAP for L3 with PQ. The 0.66 pp gap is consistent with the small difference in the base no-PQ result (our 64.37% vs. the paper's 65.16%).
**Key insight: why `--pq_train_res 1`.** Training the IVF quantizer on coarser L1 features (rather than self-training on L3 features) produces more globally representative centroids. These centroids yield smaller residuals at L3 search time, and smaller residuals are quantized more accurately by the PQ codes.
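The residual step can be illustrated with a toy numpy-only sketch (not the FAISS implementation): each vector is assigned to its nearest coarse centroid, and PQ then encodes the residual, so better-placed centroids leave less energy to quantize.

```python
import numpy as np

def ivf_residual(x, centroids):
    """IVF step: assign x to its nearest coarse centroid, return (list_id, residual).

    The PQ codes quantize the residual rather than x itself, so more
    representative centroids (e.g. trained on coarser L1 features) mean
    smaller residuals and lower quantization error.
    """
    dists = ((centroids - x) ** 2).sum(axis=1)
    list_id = int(np.argmin(dists))
    return list_id, x - centroids[list_id]

# Toy example: two coarse centroids, one query-side vector.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
list_id, res = ivf_residual(np.array([9.0, 11.0]), centroids)
print(list_id, res)  # 1 [-1.  1.]
```

At search time, `--pq_nprobe` controls how many of these coarse lists are visited; setting it equal to `--pq_nlist` (as above) makes the IVF scan exhaustive, isolating the PQ quantization error.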
## CLI Reference

```bash
python main.py [OPTIONS]
```

| Argument | Default | Description |
|---|---|---|
| `--config` | `src/eval_retrieval.py` | Evaluator config file |
| `--model_name` | `siglip` | Backbone: `siglip`, `dinov2`, `clip` |
| `--tile_size_res` | `3` | Patchify level: 0=L0 (global) … 3=L3 (4×4 grid) |
| `--multi-scale` | `pyramid` | Multi-scale strategy (`pyramid`) |
| `--crop` | `True` | Use crop-based patch extraction |
| `--image_size` | `0` | Override model's native input resolution (0 = native) |
| `--split` | `test` | Dataset split |
| `--compute_metrics` | `"mAP, locscore"` | Metrics: `mAP`, `top-k`, `locscore` |
| `--iou_threshold` | `average` | IoU mode: `default` (0.4) or `average` (0.2–0.6) |
| `--visualize` | `False` | Save top-k retrieval visualizations |
| `--batch_size` | `16` | Batch size for gallery encoding |
| **PQ options** | | |
| `--pq` | `False` | PQ mode: `False`, `pq`, or `ivfpq` |
| `--pq_m` | `16` | Number of PQ sub-quantizers (must divide feature dim) |
| `--pq_nlist` | `256` | Number of IVF clusters |
| `--pq_nprobe` | `256` | Clusters to probe at search time (set = `nlist` for exhaustive) |
| `--pq_nbits` | `8` | Bits per sub-code (typically 8) |
| `--pq_dist_type` | `IP` | Distance metric: `IP` (inner product) or `L2` |
| `--pq_train_res` | `-1` | Resolution level for quantizer training: -1 = same as `tile_size_res`, -2 = all levels combined |
## Project Structure

```
Patchwise-Retrieval/
├── main.py                    # Evaluation entry point
├── requirements.txt
│
├── ilias/                     # ILIAS benchmark data
│   ├── ilias_core/            # Images (download separately)
│   ├── query.json             # Query annotations (bounding boxes, image IDs)
│   └── gallery.json           # Gallery annotations
│
├── src/
│   ├── config.py              # Model loading (timm), LazyConfig
│   ├── eval_retrieval.py      # Evaluator config (dataset path, metrics)
│   ├── dataset.py             # Ilias dataset class + DatasetCatalog
│   ├── pyramid_embedding.py   # Multi-scale patch extraction
│   ├── retrieval.py           # RegionalImageRetrievalEvaluator, FAISS index
│   ├── metrics.py             # mAP, LocScore computation
│   ├── util.py                # Helpers (str2bool, set_seed, save/load)
│   └── visualization.py       # Top-k result rendering
│
├── scripts/
│   ├── run_eval.sh            # SigLIP L3 exact search
│   ├── run_eval_pq.sh         # SigLIP L3 IVF-PQ
│   └── run_table1.sh          # DINOv2 / CLIP reproduction
│
├── features/                  # Auto-generated patch feature cache
├── faiss_indices/             # Auto-generated trained FAISS indices
└── logs/                      # Evaluation logs
```
## Citation

```bibtex
@inproceedings{choi2026PatchwiseRetrieval,
  title     = {Patch-wise Retrieval: A Bag of Practical Techniques for Instance-level Matching},
  author    = {Wonseok Choi and Sohwi Lim and Nam Hyeon-Woo and Moon Ye-Bin and Dong-Ju Jeong and Jinyoung Hwang and Tae-Hyun Oh},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2026}
}
```

Reference: ILIAS benchmark, https://github.com/ilias-vrg/ilias