# UNCHA: Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models

Project Page | Paper | Code

PyTorch implementation of uncertainty-guided part-to-whole alignment in hyperbolic vision-language models.
Hayeon Kim*1,
Ji Ha Jang*1,
Junghun James Kim2,
Se Young Chun1,2
1Dept. of Electrical and Computer Engineering, 2INMC & IPAI
Seoul National University, Republic of Korea
*denotes equal contribution
Hyperbolic Vision-Language Models (VLMs) embed images and text in hyperbolic space to capture hierarchical part-whole relationships (e.g., a "street" scene contains parts like "cars", "people", "traffic signs"). However, not all parts represent the whole scene equally — a part showing the main street is far more representative than a tiny traffic sign in the corner.
UNCHA models this part-to-whole semantic representativeness as hyperbolic uncertainty:
- Low uncertainty → the part is highly representative of the whole scene (closer to the whole in meaning)
- High uncertainty → the part is less representative (e.g., a blurry crop or a minor background object)
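As background, hyperbolic VLMs in the MERU/HyCoCLIP line often relate an embedding's position in the Lorentz model to its generality: embeddings near the origin act as generic (uncertain) concepts, embeddings far from it as specific ones. The sketch below illustrates that idea only; the function names, the curvature constant `CURV`, and the exact uncertainty mapping are assumptions for illustration, not UNCHA's definition.

```python
import math

# Illustrative sketch: "hyperbolic uncertainty" as closeness to the origin
# of the Lorentz (hyperboloid) model. All names here are assumptions.

CURV = 1.0  # curvature magnitude c (assumed)

def lift_to_hyperboloid(v):
    """Lift a Euclidean 'space' vector v onto the hyperboloid x0^2 - |v|^2 = 1/c."""
    x0 = math.sqrt(1.0 / CURV + sum(t * t for t in v))
    return [x0] + list(v)

def dist_to_origin(v):
    """Geodesic distance from the hyperboloid origin (1/sqrt(c), 0, ..., 0)."""
    x = lift_to_hyperboloid(v)
    # The origin has zero space part, so the Lorentzian inner product
    # reduces to the (negated) product of time components.
    inner = -x[0] * (1.0 / math.sqrt(CURV))
    return (1.0 / math.sqrt(CURV)) * math.acosh(-CURV * inner)

def uncertainty(v):
    """Monotone map: embeddings near the origin get high uncertainty."""
    return math.exp(-dist_to_origin(v))

# A near-origin embedding (generic part) scores as more uncertain
# than a far-from-origin (specific) one.
generic = uncertainty([0.1, 0.0])
specific = uncertainty([3.0, 0.0])
```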
This uncertainty is incorporated into:
- Contrastive loss — adaptive temperature scaling so representative parts contribute more to alignment (Eq. 10-11)
- Entailment loss — uncertainty calibration with entropy regularization for well-structured hyperbolic embeddings (Eq. 14-17)
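The adaptive-temperature idea behind the contrastive term can be illustrated with a toy InfoNCE-style loss in which each anchor gets its own temperature derived from its uncertainty, so representative (low-uncertainty) parts produce sharper logits. This is a minimal sketch, not the paper's Eq. 10-11; `adaptive_nce_loss`, `tau_min`, and `tau_max` are assumed names and values.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def adaptive_nce_loss(sim, uncert, tau_min=0.01, tau_max=0.2):
    """sim: NxN similarity matrix (row i = anchor i vs. all candidates,
    diagonal = positive pair); uncert: per-anchor uncertainty in [0, 1]."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        # Low uncertainty -> low temperature -> sharper softmax,
        # so representative parts contribute stronger alignment signal.
        tau = tau_min + (tau_max - tau_min) * uncert[i]
        probs = softmax([s / tau for s in sim[i]])
        total += -math.log(probs[i])
    return total / n

sim = [[0.9, 0.2, 0.1],
       [0.3, 0.8, 0.2],
       [0.1, 0.2, 0.7]]
low_u = adaptive_nce_loss(sim, [0.0, 0.0, 0.0])   # representative parts
high_u = adaptive_nce_loss(sim, [1.0, 1.0, 1.0])  # unrepresentative parts
```

With correct positives on the diagonal, the low-uncertainty (sharp) setting yields a smaller loss than the high-uncertainty (soft) one, which is the weighting effect described above.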
The result: more accurate part-whole ordering in hyperbolic space, better compositional understanding, and state-of-the-art performance on zero-shot classification, retrieval, and multi-label benchmarks.
Requirements:

- Python >= 3.9
- PyTorch >= 2.0
- torchvision
- transformers
Create the environment by running:

```shell
conda create -n uncha python=3.9
conda activate uncha
python -m pip install --pre timm
python -m pip install -r requirements.txt
python setup.py develop
```

First, download the raw GRIT dataset (in webdataset format) following the instructions at huggingface/zzliang/GRIT. The dataset contains 20.5M grounded vision-language pairs and 35.9M part-level annotations. For faster training, we pre-process the dataset by extracting the box information of each sample with the following command:

```shell
python utils/prepare_GRIT_webdataset.py --raw_webdataset_path datasets/train/GRIT/raw \
    --processed_webdataset_path datasets/train/GRIT/processed \
    --max_num_processes 12
```

To train UNCHA with the ViT-S/16 backbone:

```shell
./scripts/train.sh --config configs/train_uncha_vit_s.py --num-gpus 4 --output-dir ./train_results/test --checkpoint-period 10000
```

To train with the ViT-B/16 backbone:

```shell
./scripts/train.sh --config configs/train_uncha_vit_b.py --num-gpus 4 --output-dir ./train_results/test --checkpoint-period 10000
```

Training uses 4 GPUs with a total batch size of 768 and runs for 500K iterations.
To run zero-shot classification:

```shell
python scripts/evaluate.py --config configs/eval_zero_shot_classification.py \
    --checkpoint-path /path/to/your/ckpt \
    --train-config configs/train_uncha_vit_b.py
```

We evaluate on 16 benchmark datasets: ImageNet, CIFAR-10/100, SUN397, Caltech-101, STL-10, Food-101, CUB, Cars, Aircraft, Pets, Flowers, DTD, EuroSAT, RESISC45, and Country211.

For zero-shot retrieval:

```shell
python scripts/evaluate.py --config configs/eval_zero_shot_retrieval.py \
    --checkpoint-path /path/to/your/ckpt \
    --train-config configs/train_uncha_vit_b.py
```

For hierarchical metrics:

```shell
python scripts/evaluate.py --config configs/eval_hierarchical_metrics.py \
    --checkpoint-path /path/to/your/ckpt \
    --train-config configs/train_uncha_vit_b.py
```

| Model | Backbone | ImageNet Acc. | COCO R@1 (Text) | COCO R@1 (Image) |
|---|---|---|---|---|
| UNCHA | ViT-S/16 | 43.9 | 69.9 | 56.2 |
| UNCHA | ViT-B/16 | 48.8 | 72.7 | 60.0 |
UNCHA introduces uncertainty to explicitly quantify how well each part represents the whole scene. By assigning lower uncertainty to more representative parts and higher uncertainty to less informative ones, it enables adaptive weighting in the contrastive objective, leading to improved part–whole alignment. Furthermore, uncertainty is calibrated through the entailment loss and regularized by entropy, ensuring stable and balanced use of the hyperbolic embedding space. Together, these components allow UNCHA to achieve more effective compositional understanding and alignment. See our paper for full derivations.
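The entailment side of this picture builds on hyperbolic entailment cones (Ganea et al.): a general embedding defines a cone, and a more specific embedding should fall inside it. Below is a minimal sketch of that violation penalty in the Poincaré ball. The aperture constant `K`, the function names, and the use of the Poincaré model are illustrative assumptions; the paper's loss (Eq. 14-17) additionally calibrates uncertainty and regularizes it with entropy, which this sketch omits.

```python
import math

def half_aperture(x, K=0.1):
    """Half-aperture of the entailment cone at Poincare-ball point x (x != 0)."""
    nx = math.sqrt(sum(t * t for t in x))
    return math.asin(min(1.0, K * (1 - nx * nx) / nx))

def exterior_angle(x, y):
    """Angle at x between the geodesic to y and the ray away from the origin
    (Ganea et al.'s closed form); requires x != 0 and x != y."""
    dot = sum(a * b for a, b in zip(x, y))
    nx2 = sum(a * a for a in x)
    ny2 = sum(b * b for b in y)
    dxy = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    num = dot * (1 + nx2) - nx2 * (1 + ny2)
    den = math.sqrt(nx2) * dxy * math.sqrt(1 + nx2 * ny2 - 2 * dot)
    return math.acos(max(-1.0, min(1.0, num / den)))

def entailment_loss(general, specific, K=0.1):
    """Penalize the specific embedding for leaving the general embedding's cone."""
    return max(0.0, exterior_angle(general, specific) - half_aperture(general, K))

# A point deeper along the same direction stays inside the cone (zero loss);
# a point in an unrelated direction violates it (positive loss).
inside = entailment_loss([0.5, 0.0], [0.8, 0.0])
outside = entailment_loss([0.5, 0.0], [0.0, 0.8])
```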
| Method | ImageNet | CIFAR-10 | CIFAR-100 | SUN397 | Caltech-101 | STL-10 |
|---|---|---|---|---|---|---|
| CLIP | 40.6 | 78.9 | 48.3 | 43.0 | 70.7 | 92.4 |
| MERU | 40.1 | 78.6 | 49.3 | 43.0 | 73.0 | 92.8 |
| HyCoCLIP | 45.8 | 88.8 | 60.1 | 57.2 | 81.3 | 95.0 |
| UNCHA (Ours) | 48.8 | 90.4 | 63.2 | 57.7 | 83.9 | 95.7 |
| Method | ComCo 2obj | ComCo 5obj | SimCo 2obj | SimCo 5obj | VOC | COCO |
|---|---|---|---|---|---|---|
| CLIP | 77.55 | 80.22 | 77.15 | 88.48 | 78.56 | 53.94 |
| HyCoCLIP | 72.90 | 72.90 | 75.71 | 82.85 | 80.43 | 58.12 |
| UNCHA (Ours) | 77.92 | 81.18 | 79.72 | 90.65 | 82.14 | 59.43 |
If you find this work useful, please cite:
```bibtex
@inproceedings{kim2026uncha,
  author    = {Kim, Hayeon and Jang, Ji Ha and Kim, Junghun James and Chun, Se Young},
  title     = {UNCHA: Uncertainty-guided Compositional Hyperbolic Alignment with Part-to-Whole Semantic Representativeness},
  booktitle = {CVPR},
  year      = {2026},
}
```
This work was supported by IITP grants funded by the Korea government (MSIT), NRF grants funded by the Korea government (MSIT), the AI Computing Infrastructure Enhancement (GPU Rental Support) Program funded by MSIT, the BK21 FOUR Program at Seoul National University, and the AI-Bio Research Grant through Seoul National University. We also thank the authors of MERU, HyCoCLIP, and ATMG for their open-source implementations.

