
SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language

Zehan Wang¹*, Sashuai Zhou¹*, Shaoxuan He¹, Haifeng Huang¹, Lihe Yang², Ziang Zhang¹, Xize Cheng¹, Shengpeng Ji¹, Tao Jin¹, Hengshuang Zhao², Zhou Zhao¹

¹Zhejiang University    ²The University of Hong Kong
* equal contribution

Paper PDF | Project Page

This repository introduces Spatial-CLIP, a research-oriented, open-source framework that injects spatial understanding into vision-language modeling by building on top of Depth-Anything-V2. It enables more precise associations between textual queries and spatial regions in images, unlocking enhanced describe-and-reason capabilities for scenes with depth and geometry.

(Figure: comparison)

Quick Start

Follow these steps to set up Spatial-CLIP locally and prepare the required checkpoints.

  1. Clone the repository and enter the project directory
git clone https://github.com/master-chou/spatial-clip.git
cd spatial-clip
  2. Create and activate a Python environment
conda create -n spclip python=3.10 -y
conda activate spclip
  3. Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
  4. Download and prepare checkpoints
  • Download the Depth-Anything-V2 weights from Depth-Anything-V2 and place them at:
    /Depth-Anything-V2/checkpoints/depth_anything_v2_vitl.pth

  • If other weights are required (e.g., for specific experiments), place them in their designated paths.

Note: The links above point to the necessary Depth-Anything-V2 checkpoints. Please follow the original repository’s guidance to download and place them in the specified paths to ensure proper model initialization. A short sanity-check sketch for verifying the checkpoint follows this list.

  5. Run and validate
  • Evaluation code is provided in the evaluate module.
    For a quick start, run the provided script:
bash evaluation.sh
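
Before launching the evaluation, it can help to confirm that the Depth-Anything-V2 checkpoint from step 4 is in place and that PyTorch sees your GPU. The snippet below is a minimal sanity-check sketch (not part of the repository); the checkpoint path matches the one used in this README, so adjust it to wherever you stored the weights.

import os
import torch

# Hypothetical sanity check; edit the path to match your checkpoint location.
ckpt_path = "../Depth-Anything-V2/checkpoints/depth_anything_v2_vitl.pth"
assert os.path.isfile(ckpt_path), f"Checkpoint not found: {ckpt_path}"

# Loading on CPU only verifies the file is a readable PyTorch state dict.
state_dict = torch.load(ckpt_path, map_location="cpu")
print(f"Loaded {len(state_dict)} tensors from {ckpt_path}")
print("CUDA available:", torch.cuda.is_available())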

Quick Demo

The following minimal example shows how to quickly verify Spatial-CLIP with one image and one caption.
It loads the pretrained Depth-Anything-V2 and Spatial-CLIP models, then computes the similarity score.

import torch
from PIL import Image
from torchvision import transforms
from huggingface_hub import hf_hub_download

import spat_clip as sclip
from depth_anything_v2.dpt import DepthAnythingV2

# ----- Load Models -----
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load DepthAnythingV2
depth_model = DepthAnythingV2(encoder="vitl", features=256,
                              out_channels=[256, 512, 1024, 1024])
depth_model.load_state_dict(torch.load(
    "../Depth-Anything-V2/checkpoints/depth_anything_v2_vitl.pth",
    map_location="cpu"
))
depth_model = depth_model.to(device).eval()

# Load Spatial-CLIP
clip_model, _ = sclip.load("ViT-L/14@336px", device="cpu", lora_adapt=False, rank=-1)
ckpt_path = hf_hub_download(repo_id="demo911/spatialclip", repo_type="dataset", filename="iter_5000.pth")
ckpt = torch.load(ckpt_path, map_location="cpu")
clip_model.load_state_dict({k.replace("module.", ""): v for k, v in ckpt.items()}, strict=True)
clip_model = clip_model.to(device).eval()

# ----- Preprocessing -----
preprocess = transforms.Compose([
    transforms.Resize(336, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(336),
    transforms.Lambda(lambda image: image.convert("RGB")),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])

# ----- Demo Input -----
image = Image.open("demo.jpg")
caption = "A dog is running on the grass."

image_tensor = preprocess(image).unsqueeze(0).to(device)

# Depth inference needs no gradients, so run it under torch.no_grad().
with torch.no_grad():
    depth_map = depth_model(image_tensor)

text_tokens = sclip.tokenize([caption]).to(device)

# ----- Feature Extraction -----
with torch.no_grad():
    image_feat = clip_model.visual(image_tensor, depth_map, pos_embed="")
    text_feat = clip_model.encode_text(text_tokens)

# Normalize
image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# ----- Similarity -----
similarity = (image_feat @ text_feat.T).item()
print(f"Similarity between image and caption: {similarity:.4f}")

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.

Acknowledgments

We thank the contributors and the broader open-source community whose work enabled Spatial-CLIP, in particular the authors of CLIP and Depth-Anything-V2.


Citations

If you find Spatial-CLIP useful, please cite the following paper in work that builds on it:

@inproceedings{wang2025spatialclip,
  title={SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language},
  author={Wang, Zehan and Zhou, Sashuai and He, Shaoxuan and Huang, Haifeng and Yang, Lihe and Zhang, Ziang and Cheng, Xize and Ji, Shengpeng and Jin, Tao and Zhao, Hengshuang and others},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={29656--29666},
  year={2025}
}

About

Code for SpatialCLIP, accepted at CVPR 2025.
