Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment (NeurIPS 2025 Spotlight)
This repository is the official implementation of Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment, led by
Bryan Sangwoo Kim, Jeongsol Kim, Jong Chul Ye
Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but show notable drawbacks:
- Blur and artifacts when pushed to magnify beyond their training regime
- High computational cost and inefficiency of retraining whenever further magnification is needed
This brings us to the fundamental question:
How can we effectively utilize super-resolution models to explore much higher resolutions than they were originally trained for?
We address this via Chain-of-Zoom 🔎, a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a prompt extractor VLM. This prompt extractor can be fine-tuned through GRPO with a critic VLM to further align text guidance towards human preference.
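The scale-state recursion can be sketched in a few lines. This is an illustrative toy, not the repository's API: the function names and the crop/upscale factors are assumptions, and images are abstracted as `(height, width)` sizes.

```python
# Toy sketch of the Chain-of-Zoom scale-state recursion.
# `sr_step` stands in for one pass of the backbone SR model.

def center_crop(size, factor):
    """Keep the central 1/factor x 1/factor region."""
    h, w = size
    return (h // factor, w // factor)

def sr_step(size, scale):
    """Stand-in for one backbone super-resolution pass."""
    h, w = size
    return (h * scale, w * scale)

def chain_of_zoom(size, steps, crop_factor=4, scale=4):
    """Autoregressively crop the center and super-resolve it back to the
    working resolution. Effective magnification grows multiplicatively
    per step, while the backbone only ever runs at its trained factor."""
    states = [size]
    for _ in range(steps):
        size = sr_step(center_crop(size, crop_factor), scale)
        states.append(size)
    return states

print(chain_of_zoom((512, 512), steps=3))
```

Note that with `crop_factor == scale`, every intermediate scale-state has the same working resolution, which is what lets the same backbone be reused at each step.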
- [Aug 2025] Additional code released.
- [Jun 2025] Check out the 🤗 Hugging Face Space by @alexnasa! Thanks for the awesome work!
- [May 2025] Code and paper released.
First, create your environment. We recommend using the following commands.
```shell
git clone https://github.com/bryanswkim/Chain-of-Zoom.git
cd Chain-of-Zoom

conda create -n coz python=3.10
conda activate coz

pip install -r requirements.txt
```
| Models | Checkpoints |
|---|---|
| Stable Diffusion v3 | Hugging Face |
| Qwen2.5-VL-3B-Instruct | Hugging Face |
| RAM | Hugging Face |
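The inference commands below reference these checkpoints through a local `ckpt/` directory. Judging from the paths used in those commands, the expected layout is roughly as follows (the exact structure is an assumption):

```
ckpt/
├── SR_LoRA/model_20001.pkl
├── SR_VAE/vae_encoder_20001.pt
├── VLM_LoRA/checkpoint-10000/
├── DAPE/DAPE.pth
└── RAM/ram_swin_large_14m.pth
```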
You can quickly check the results of using CoZ with the following example:
```shell
python inference_coz.py \
    -i samples \
    -o inference_results/coz_vlmprompt \
    --rec_type recursive_multiscale \
    --prompt_type vlm \
    --lora_path ckpt/SR_LoRA/model_20001.pkl \
    --vae_path ckpt/SR_VAE/vae_encoder_20001.pt \
    --vlm_lora_path ckpt/VLM_LoRA/checkpoint-10000 \
    --pretrained_model_name_or_path 'stabilityai/stable-diffusion-3-medium-diffusers' \
    --ram_ft_path ckpt/DAPE/DAPE.pth \
    --ram_path ckpt/RAM/ram_swin_large_14m.pth \
    --save_prompts
```
This will give a result like the one below:
Using `--efficient_memory` allows CoZ to run on a single GPU with 24 GB of VRAM, but significantly increases inference time due to model offloading.
We recommend using two GPUs.
Although our main focus is zooming into local areas, CoZ can be easily applied to super-resolution of full images. Try out the code below!
```shell
python inference_coz_full.py \
    -i samples \
    -o inference_results/coz_full \
    --rec_type recursive_multiscale \
    --prompt_type vlm \
    --lora_path ckpt/SR_LoRA/model_20001.pkl \
    --vae_path ckpt/SR_VAE/vae_encoder_20001.pt \
    --vlm_lora_path ckpt/VLM_LoRA/checkpoint-10000 \
    --pretrained_model_name_or_path 'stabilityai/stable-diffusion-3-medium-diffusers' \
    --ram_ft_path ckpt/DAPE/DAPE.pth \
    --ram_path ckpt/RAM/ram_swin_large_14m.pth
```
Chain-of-Zoom is model-agnostic and can be used with any pretrained text-aware SR model. In this repository, we use OSEDiff trained with Stable Diffusion 3 Medium as the backbone, which requires some additional installations:
```shell
pip install wandb opencv-python basicsr==1.4.2
pip install --no-deps --extra-index-url https://download.pytorch.org/whl/cu121 xformers==0.0.28.post1
```
Please refer to the OSEDiff repository for training configurations (e.g., preparing training data). Then train the SR backbone model:
```shell
bash scripts/train/train_osediff_sd3.sh
```
If you find our method useful, please cite us as below, or give this repository a star.
```bibtex
@article{kim2025chain,
  title={Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment},
  author={Kim, Bryan Sangwoo and Kim, Jeongsol and Ye, Jong Chul},
  journal={arXiv preprint arXiv:2505.18600},
  year={2025}
}
```
We thank the authors of OSEDiff for sharing their awesome work!

