Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment (NeurIPS 2025 Spotlight)
This repository is the official implementation of Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment, led by
Bryan Sangwoo Kim, Jeongsol Kim, Jong Chul Ye
Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but show notable drawbacks:
- Blur and artifacts when pushed to magnify beyond their training regime
- High computational cost and inefficiency of retraining whenever further magnification is needed
This brings us to the fundamental question:
How can we effectively utilize super-resolution models to explore much higher resolutions than they were originally trained for?
We address this via Chain-of-Zoom 🔎, a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a prompt extractor VLM. This prompt extractor can be fine-tuned through GRPO with a critic VLM to further align text guidance towards human preference.
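The scale-state recursion can be sketched in a few lines. This is an illustrative toy, not the repository's API: the function names and the crop/upscale factors are assumptions, and images are abstracted as `(height, width)` sizes.

```python
# Toy sketch of the Chain-of-Zoom scale-state recursion.
# `sr_step` stands in for one pass of the backbone SR model.

def center_crop(size, factor):
    """Keep the central 1/factor x 1/factor region."""
    h, w = size
    return (h // factor, w // factor)

def sr_step(size, scale):
    """Stand-in for one backbone super-resolution pass."""
    h, w = size
    return (h * scale, w * scale)

def chain_of_zoom(size, steps, crop_factor=4, scale=4):
    """Autoregressively crop the center and super-resolve it back to the
    working resolution. Effective magnification grows multiplicatively
    per step, while the backbone only ever runs at its trained factor."""
    states = [size]
    for _ in range(steps):
        size = sr_step(center_crop(size, crop_factor), scale)
        states.append(size)
    return states

print(chain_of_zoom((512, 512), steps=3))
```

Note that with `crop_factor == scale`, every intermediate scale-state has the same working resolution, which is what lets the same backbone be reused at each step.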
- [Aug 2025] Additional code released.
- [Jun 2025] Check out the 🤗 Hugging Face Space by @alexnasa! Thanks for the awesome work!
- [May 2025] Code and paper released.
First, create your environment. We recommend using the following commands.
```shell
git clone https://github.com/bryanswkim/Chain-of-Zoom.git
cd Chain-of-Zoom

conda create -n coz python=3.10
conda activate coz

pip install -r requirements.txt
```
| Models | Checkpoints |
|---|---|
| Stable Diffusion v3 | Hugging Face |
| Qwen2.5-VL-3B-Instruct | Hugging Face |
| RAM | Hugging Face |
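The inference commands below reference these checkpoints through a local `ckpt/` directory. Judging from the paths used in those commands, the expected layout is roughly as follows (the exact structure is an assumption):

```
ckpt/
├── SR_LoRA/model_20001.pkl
├── SR_VAE/vae_encoder_20001.pt
├── VLM_LoRA/checkpoint-10000/
├── DAPE/DAPE.pth
└── RAM/ram_swin_large_14m.pth
```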
You can quickly check the results of using CoZ with the following example:
```shell
python inference_coz.py \
    -i samples \
    -o inference_results/coz_vlmprompt \
    --rec_type recursive_multiscale \
    --prompt_type vlm \
    --lora_path ckpt/SR_LoRA/model_20001.pkl \
    --vae_path ckpt/SR_VAE/vae_encoder_20001.pt \
    --vlm_lora_path ckpt/VLM_LoRA/checkpoint-10000 \
    --pretrained_model_name_or_path 'stabilityai/stable-diffusion-3-medium-diffusers' \
    --ram_ft_path ckpt/DAPE/DAPE.pth \
    --ram_path ckpt/RAM/ram_swin_large_14m.pth \
    --save_prompts
```
This will give a result like the one below:
Using `--efficient_memory` allows CoZ to run on a single GPU with 24 GB of VRAM, but significantly increases inference time due to model offloading.
We recommend using two GPUs.
Although our main focus is zooming into local areas, CoZ can be easily applied to super-resolution of full images. Try out the code below!
```shell
python inference_coz_full.py \
    -i samples \
    -o inference_results/coz_full \
    --rec_type recursive_multiscale \
    --prompt_type vlm \
    --lora_path ckpt/SR_LoRA/model_20001.pkl \
    --vae_path ckpt/SR_VAE/vae_encoder_20001.pt \
    --vlm_lora_path ckpt/VLM_LoRA/checkpoint-10000 \
    --pretrained_model_name_or_path 'stabilityai/stable-diffusion-3-medium-diffusers' \
    --ram_ft_path ckpt/DAPE/DAPE.pth \
    --ram_path ckpt/RAM/ram_swin_large_14m.pth
```
Chain-of-Zoom is model-agnostic and can be used with any pretrained text-aware SR model. In this repository, we use OSEDiff trained with Stable Diffusion 3 Medium as the backbone, which requires some additional installations:
```shell
pip install wandb opencv-python basicsr==1.4.2
pip install --no-deps --extra-index-url https://download.pytorch.org/whl/cu121 xformers==0.0.28.post1
```
Please refer to the OSEDiff repository for training configurations (e.g., preparing training data). Then train the SR backbone model:
```shell
bash scripts/train/train_osediff_sd3.sh
```
If you find our method useful, please cite us as below, or give this repository a star.
```bibtex
@article{kim2025chain,
  title={Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment},
  author={Kim, Bryan Sangwoo and Kim, Jeongsol and Ye, Jong Chul},
  journal={arXiv preprint arXiv:2505.18600},
  year={2025}
}
```
We thank the authors of OSEDiff for sharing their awesome work!

