# Patchify (WACV 2026)

Patchify is a training-free framework for instance-level image retrieval. Gallery images are decomposed into a spatial pyramid of patches, and query embeddings are matched against local patch features from a pretrained vision encoder (SigLIP, DINOv2, or CLIP). The approach achieves high retrieval and localization performance without any fine-tuning.
## Table of Contents

- [Installation](#installation)
- [Dataset Preparation](#dataset-preparation)
- [Quick Start](#quick-start)
- [Reproducing Paper Results](#reproducing-paper-results)
- [CLI Reference](#cli-reference)
- [Project Structure](#project-structure)
- [Citation](#citation)
## Installation

We recommend Python 3.10 with CUDA 11.8.

```bash
# 1. Clone the repository
git clone https://github.com/kaist-ami/Patchwise-Retrieval.git
cd Patchwise-Retrieval

# 2. Create and activate a conda environment
conda create -n ssr python=3.10 -y
conda activate ssr

# 3. Install dependencies
pip install torch==2.1.0+cu118 torchvision==0.16.0+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```

Key packages in `requirements.txt`:

```
timm>=0.9.0       # pretrained vision models (SigLIP, DINOv2, CLIP)
faiss-gpu>=1.7.0  # approximate nearest-neighbour search
numpy, Pillow, tqdm, omegaconf, hydra-core, loguru, matplotlib, pyyaml
```
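Before running anything, it can help to confirm that the key dependencies are importable. A minimal, stdlib-only helper (hypothetical, not part of the repository) that checks for them without importing heavy packages:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that is not importable in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Top-level module names for the core dependencies listed above.
required = ["torch", "torchvision", "timm", "faiss", "omegaconf", "hydra", "loguru"]
print("missing:", missing_packages(required) or "none")
```

`importlib.util.find_spec` only inspects the import machinery, so this runs instantly even when the packages are large.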
## Dataset Preparation

We evaluate on ILIAS (Instance-Level Image retrieval At Scale). Download the ILIAS core set from the official ILIAS website and place it under `ilias/ilias_core/`:

```
ilias/
├── ilias_core/     # images (4,885 gallery + 1,232 query) ← download here
├── query.json      # query annotations (included)
└── gallery.json    # gallery annotations (included)
```

`query.json` and `gallery.json` are already included in this repository with relative paths pre-configured.
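A quick way to verify the layout before launching an evaluation is to check for the expected entries. A small sketch (hypothetical helper, not shipped with the repository):

```python
from pathlib import Path

def check_layout(root):
    """Return the expected entries missing under `root` (tree as shown above)."""
    expected = ("ilias_core", "query.json", "gallery.json")
    return [name for name in expected if not (Path(root) / name).exists()]

print(check_layout("ilias"))  # [] once the core set is downloaded
```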
## Quick Start

All scripts are in the `scripts/` directory. On the first run, patch features are automatically extracted and cached under `./features/`; trained FAISS indices are cached under `./faiss_indices/`; logs are written to `./logs/`.

```bash
bash scripts/run_eval.sh      # exact search
bash scripts/run_eval_pq.sh   # IVF-PQ search
```

`run_eval.sh` expands to:

```bash
python main.py \
    --config src/eval_retrieval.py \
    --model_name siglip \
    --multi-scale pyramid --crop True \
    --pq False \
    --tile_size_res 3 \
    --compute_metrics "mAP, locscore" --iou_threshold default
```

Change `--tile_size_res` to evaluate at a different patchify level (0 = global image, 3 = 4×4 patch grid).
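The patchify levels form a simple spatial pyramid. Assuming level `L` corresponds to a uniform `(L+1) × (L+1)` grid (an inference from 0 = global image and 3 = 4×4 above, not the repository's exact implementation), the patch boxes at one level can be sketched as:

```python
def patch_boxes(width, height, level):
    """Uniform (level+1) x (level+1) grid of (left, top, right, bottom) boxes.

    Assumption: level L -> (L+1)x(L+1) grid, matching 0 = global, 3 = 4x4.
    """
    n = level + 1
    xs = [round(i * width / n) for i in range(n + 1)]
    ys = [round(j * height / n) for j in range(n + 1)]
    return [(xs[i], ys[j], xs[i + 1], ys[j + 1])
            for j in range(n) for i in range(n)]

print(len(patch_boxes(512, 512, 3)))  # 16 patches at L3
print(patch_boxes(512, 512, 0))       # [(0, 0, 512, 512)] at L0
```

Each box is then cropped (`--crop True`) and embedded independently, and the pyramid concatenates the boxes from all levels up to `--tile_size_res`.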
| Model | Level | mAP | LocScore |
|---|---|---|---|
| SigLIP | L3 (4×4) | 64.37% | 19.24% |
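For reference, mAP averages a per-query average precision over the ranked gallery. A minimal sketch of the standard definition (not the repository's `src/metrics.py` implementation):

```python
def average_precision(relevance):
    """AP for one ranked list; relevance[i] is 1 if the rank-(i+1) result is relevant."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(per_query_relevance):
    """mAP: mean of per-query AP values."""
    return sum(average_precision(r) for r in per_query_relevance) / len(per_query_relevance)

print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 ≈ 0.8333
```

LocScore additionally checks whether the best-matching patch overlaps the annotated bounding box above the IoU threshold, which is why it is reported separately.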
## Reproducing Paper Results

IVF-PQ compresses the gallery index by roughly 13× with a small accuracy trade-off:
```bash
python main.py \
    --config src/eval_retrieval.py \
    --model_name siglip \
    --multi-scale pyramid --crop True \
    --pq ivfpq \
    --pq_m 64 --pq_nlist 8192 --pq_nbits 8 --pq_nprobe 8192 \
    --pq_dist_type IP --pq_train_res 1 \
    --tile_size_res 3 \
    --compute_metrics "mAP, locscore" --iou_threshold default
```

| Method | mAP | LocScore |
|---|---|---|
| Exact search | 64.37% | 19.24% |
| IVF-PQ | 59.30% | 17.44% |

The paper reports 59.96% mAP for L3 with PQ. The 0.66 pp gap is consistent with the small difference in the base no-PQ result (our 64.37% vs. the paper's 65.16%).
**Key insight: why `--pq_train_res 1`.** Training the IVF quantizer on coarser L1 features (rather than self-training on L3 features) produces more globally representative centroids. These centroids yield smaller residuals at L3 search time, and smaller residuals are quantized more accurately by the PQ codes.
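The residual step can be illustrated with a toy numpy-only sketch (not the FAISS implementation): each vector is assigned to its nearest coarse centroid, and PQ then encodes the residual, so better-placed centroids leave less energy to quantize.

```python
import numpy as np

def ivf_residual(x, centroids):
    """IVF step: assign x to its nearest coarse centroid, return (list_id, residual).

    The PQ codes quantize the residual rather than x itself, so more
    representative centroids (e.g. trained on coarser L1 features) mean
    smaller residuals and lower quantization error.
    """
    dists = ((centroids - x) ** 2).sum(axis=1)
    list_id = int(np.argmin(dists))
    return list_id, x - centroids[list_id]

# Toy example: two coarse centroids, one query-side vector.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
list_id, res = ivf_residual(np.array([9.0, 11.0]), centroids)
print(list_id, res)  # 1 [-1.  1.]
```

At search time, `--pq_nprobe` controls how many of these coarse lists are visited; setting it equal to `--pq_nlist` (as above) makes the IVF scan exhaustive, isolating the PQ quantization error.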
## CLI Reference

```bash
python main.py [OPTIONS]
```

| Argument | Default | Description |
|---|---|---|
| `--config` | `src/eval_retrieval.py` | Evaluator config file |
| `--model_name` | `siglip` | Backbone: `siglip`, `dinov2`, `clip` |
| `--tile_size_res` | `3` | Patchify level: 0=L0 (global) … 3=L3 (4×4 grid) |
| `--multi-scale` | `pyramid` | Multi-scale strategy (`pyramid`) |
| `--crop` | `True` | Use crop-based patch extraction |
| `--image_size` | `0` | Override model's native input resolution (0 = native) |
| `--split` | `test` | Dataset split |
| `--compute_metrics` | `"mAP, locscore"` | Metrics: `mAP`, `top-k`, `locscore` |
| `--iou_threshold` | `average` | IoU mode: `default` (0.4) or `average` (0.2–0.6) |
| `--visualize` | `False` | Save top-k retrieval visualizations |
| `--batch_size` | `16` | Batch size for gallery encoding |
| **PQ options** | | |
| `--pq` | `False` | PQ mode: `False`, `pq`, or `ivfpq` |
| `--pq_m` | `16` | Number of PQ sub-quantizers (must divide feature dim) |
| `--pq_nlist` | `256` | Number of IVF clusters |
| `--pq_nprobe` | `256` | Clusters to probe at search time (set = `nlist` for exhaustive) |
| `--pq_nbits` | `8` | Bits per sub-code (typically 8) |
| `--pq_dist_type` | `IP` | Distance metric: `IP` (inner product) or `L2` |
| `--pq_train_res` | `-1` | Resolution level for quantizer training: -1 = same as `tile_size_res`, -2 = all levels combined |
## Project Structure

```
Patchwise-Retrieval/
├── main.py                    # Evaluation entry point
├── requirements.txt
│
├── ilias/                     # ILIAS benchmark data
│   ├── ilias_core/            # Images (download separately)
│   ├── query.json             # Query annotations (bounding boxes, image IDs)
│   └── gallery.json           # Gallery annotations
│
├── src/
│   ├── config.py              # Model loading (timm), LazyConfig
│   ├── eval_retrieval.py      # Evaluator config (dataset path, metrics)
│   ├── dataset.py             # Ilias dataset class + DatasetCatalog
│   ├── pyramid_embedding.py   # Multi-scale patch extraction
│   ├── retrieval.py           # RegionalImageRetrievalEvaluator, FAISS index
│   ├── metrics.py             # mAP, LocScore computation
│   ├── util.py                # Helpers (str2bool, set_seed, save/load)
│   └── visualization.py       # Top-k result rendering
│
├── scripts/
│   ├── run_eval.sh            # SigLIP L3 exact search
│   ├── run_eval_pq.sh         # SigLIP L3 IVF-PQ
│   └── run_table1.sh          # DINOv2 / CLIP reproduction
│
├── features/                  # Auto-generated patch feature cache
├── faiss_indices/             # Auto-generated trained FAISS indices
└── logs/                      # Evaluation logs
```
## Citation

```bibtex
@inproceedings{choi2026PatchwiseRetrieval,
  title     = {Patch-wise Retrieval: A Bag of Practical Techniques for Instance-level Matching},
  author    = {Wonseok Choi and Sohwi Lim and Nam Hyeon-Woo and Moon Ye-Bin and Dong-Ju Jeong and Jinyoung Hwang and Tae-Hyun Oh},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2026}
}
```

Reference: ILIAS benchmark, https://github.com/ilias-vrg/ilias