
SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language

Zehan Wang¹*, Sashuai Zhou¹*, Shaoxuan He¹, Haifeng Huang¹, Lihe Yang², Ziang Zhang¹, Xize Cheng¹, Shengpeng Ji¹, Tao Jin¹, Hengshuang Zhao², Zhou Zhao¹

¹Zhejiang University    ²The University of Hong Kong
* equal contribution

Paper PDF | Project Page

This repository introduces Spatial-CLIP, a research-oriented, open-source framework that injects spatial understanding into vision-language modeling by building on top of Depth-Anything-V2. It enables more precise associations between textual queries and spatial regions in images, unlocking enhanced describe-and-reason capabilities for scenes with depth and geometry.

(Figure: comparison)

Quick Start

Follow these steps to set up Spatial-CLIP locally and prepare the required checkpoints.

  1. Clone the repository and enter the project directory
git clone https://github.com/master-chou/spatial-clip.git
cd spatial-clip
  2. Create and activate a Python environment
conda create -n spclip python=3.10 -y
conda activate spclip
  3. Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
  4. Download and prepare checkpoints
  • Download the Depth-Anything-V2 weights from Depth-Anything-V2 and place them at:
    /Depth-Anything-V2/checkpoints/depth_anything_v2_vitl.pth

  • If other weights are required (e.g., for specific experiments), place them in their designated paths.

Note: The links above point to the necessary Depth-Anything-V2 checkpoints. Please follow the original repository’s guidance to download and place them in the specified paths to ensure proper model initialization. A short sanity-check sketch for verifying the checkpoint follows this list.

  5. Run and validate
  • Evaluation code is provided in the evaluate module.
    For a quick start, run the provided script:
bash evaluation.sh
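
Before launching the evaluation, it can help to confirm that the Depth-Anything-V2 checkpoint from step 4 is in place and that PyTorch sees your GPU. The snippet below is a minimal sanity-check sketch (not part of the repository); the checkpoint path matches the one used in this README, so adjust it to wherever you stored the weights.

import os
import torch

# Hypothetical sanity check; edit the path to match your checkpoint location.
ckpt_path = "../Depth-Anything-V2/checkpoints/depth_anything_v2_vitl.pth"
assert os.path.isfile(ckpt_path), f"Checkpoint not found: {ckpt_path}"

# Loading on CPU only verifies the file is a readable PyTorch state dict.
state_dict = torch.load(ckpt_path, map_location="cpu")
print(f"Loaded {len(state_dict)} tensors from {ckpt_path}")
print("CUDA available:", torch.cuda.is_available())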

Quick Demo

The following minimal example shows how to quickly verify Spatial-CLIP with one image and one caption.
It loads the pretrained Depth-Anything-V2 and Spatial-CLIP models, then computes the similarity score.

import torch
from PIL import Image
from torchvision import transforms
from huggingface_hub import hf_hub_download

import spat_clip as sclip
from depth_anything_v2.dpt import DepthAnythingV2

# ----- Load Models -----
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load DepthAnythingV2
depth_model = DepthAnythingV2(encoder="vitl", features=256,
                              out_channels=[256, 512, 1024, 1024])
depth_model.load_state_dict(torch.load(
    "../Depth-Anything-V2/checkpoints/depth_anything_v2_vitl.pth",
    map_location="cpu"
))
depth_model = depth_model.to(device).eval()

# Load Spatial-CLIP
clip_model, _ = sclip.load("ViT-L/14@336px", device="cpu", lora_adapt=False, rank=-1)
ckpt_path = hf_hub_download(repo_id="demo911/spatialclip", repo_type="dataset", filename="iter_5000.pth")
ckpt = torch.load(ckpt_path, map_location="cpu")
clip_model.load_state_dict({k.replace("module.", ""): v for k, v in ckpt.items()}, strict=True)
clip_model = clip_model.to(device).eval()

# ----- Preprocessing -----
preprocess = transforms.Compose([
    transforms.Resize(336, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(336),
    transforms.Lambda(lambda image: image.convert("RGB")),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])

# ----- Demo Input -----
image = Image.open("demo.jpg")
caption = "A dog is running on the grass."

image_tensor = preprocess(image).unsqueeze(0).to(device)

# Depth inference needs no gradients, so run it under torch.no_grad().
with torch.no_grad():
    depth_map = depth_model(image_tensor)

text_tokens = sclip.tokenize([caption]).to(device)

# ----- Feature Extraction -----
with torch.no_grad():
    image_feat = clip_model.visual(image_tensor, depth_map, pos_embed="")
    text_feat = clip_model.encode_text(text_tokens)

# Normalize
image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# ----- Similarity -----
similarity = (image_feat @ text_feat.T).item()
print(f"Similarity between image and caption: {similarity:.4f}")

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.

Acknowledgments

We thank the contributors and the broader open-source community whose work enabled Spatial-CLIP, in particular the authors of CLIP and Depth-Anything-V2.


Citations

If you find Spatial-CLIP useful, please cite the following paper in work that builds on it:

@inproceedings{wang2025spatialclip,
  title={SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language},
  author={Wang, Zehan and Zhou, Sashuai and He, Shaoxuan and Huang, Haifeng and Yang, Lihe and Zhang, Ziang and Cheng, Xize and Ji, Shengpeng and Jin, Tao and Zhao, Hengshuang and others},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={29656--29666},
  year={2025}
}

About

Code for SpatialCLIP, accepted at CVPR 2025.
