EntitySAM: Segment Everything in Video
CVPR 2025
Adobe Research, EPFL, CMU
We propose EntitySAM, a novel framework that extends SAM 2 to the task of Video Entity Segmentation: segmenting every entity in a video without requiring category annotations. Our method achieves generalizable and exhaustive segmentation using only image-level training data, and demonstrates strong zero-shot performance across multiple benchmarks. Refer to our paper for more details.
🔥 2025/07/12: We release the training code. See the Training section.
2025/06/26: We release the evaluation code and checkpoints. See the Evaluation section.
2025/06/02: Our paper EntitySAM is online.
Automatically tracking and segmenting every video entity remains a significant challenge. Despite rapid advancements in video segmentation, even state-of-the-art models like SAM 2 struggle to consistently track all entities across a video—a task we refer to as Video Entity Segmentation. We propose EntitySAM, a framework for zero-shot video entity segmentation. EntitySAM extends SAM 2 by removing the need for explicit prompts, allowing automatic discovery and tracking of all entities, including those appearing in later frames. We incorporate query-based entity discovery and association into SAM 2, inspired by transformer-based object detectors. Specifically, we introduce an entity decoder to facilitate inter-object communication and an automatic prompt generator using learnable object queries. Additionally, we add a semantic encoder to enhance SAM 2’s semantic awareness, improving segmentation quality. Trained on image-level mask annotations without category information from the COCO dataset, EntitySAM demonstrates strong generalization on four zero-shot video segmentation tasks: Video Entity, Panoptic, Instance, and Semantic Segmentation. Results on six popular benchmarks show that EntitySAM outperforms previous unified video segmentation methods and strong baselines, setting new standards for zero-shot video segmentation.
(a) Overview of the EntitySAM framework: EntitySAM utilizes the frozen encoder and memory parameters from SAM 2, incorporating a dual encoder design for enhanced semantic features. The PromptGenerator automatically generates prompts from Prompt Queries. The enhanced features and distinct query groups are processed by the EntityDecoder to produce video mask outputs.
(b) EntityDecoder: self-attention and cross-attention mechanisms in the EntityDecoder layers.
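As a rough illustration of the query-based entity discovery described above, the sketch below shows learnable object queries attending to frame features through cross-attention, communicating through self-attention, and producing per-query mask logits via a dot product with the feature map. This is a minimal toy example under assumed dimensions and class names, not the EntitySAM implementation.

```python
# Toy sketch of query-based entity discovery (illustrative, not the official code).
import torch
import torch.nn as nn

class ToyEntityDecoderLayer(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, queries, features):
        # queries: (B, N, C) object queries; features: (B, HW, C) flattened frame features
        q = self.norm1(queries + self.cross_attn(queries, features, features)[0])
        q = self.norm2(q + self.self_attn(q, q, q)[0])  # inter-object communication
        return self.norm3(q + self.ffn(q))

class ToyEntityDecoder(nn.Module):
    def __init__(self, num_queries: int = 100, dim: int = 256, num_layers: int = 3):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)  # learnable object queries
        self.layers = nn.ModuleList(ToyEntityDecoderLayer(dim) for _ in range(num_layers))

    def forward(self, features):
        # features: (B, C, H, W) from a (frozen) image encoder
        B, C, H, W = features.shape
        feats = features.flatten(2).transpose(1, 2)          # (B, HW, C)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        for layer in self.layers:
            q = layer(q, feats)
        masks = torch.einsum("bnc,bchw->bnhw", q, features)  # per-query mask logits
        return masks

if __name__ == "__main__":
    decoder = ToyEntityDecoder()
    print(decoder(torch.randn(1, 256, 64, 64)).shape)  # torch.Size([1, 100, 64, 64])
```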
EntitySAM segments every entity in a video. Results are compared with SAM 2 and DEVA. EntitySAM achieves zero-shot video entity segmentation without requiring video-level training data or category annotations.
example1.mp4
example2.mp4
example3.mp4
example4.mp4
The EntitySAM environment is based on SAM 2. The code requires `python>=3.10`, as well as `torch>=2.5.1` and `torchvision>=0.20.1`. Please follow the instructions here to install the PyTorch and TorchVision dependencies. You can then install EntitySAM on a GPU machine using:
```bash
git clone https://github.com/ymq2017/entitysam && cd entitysam
pip install -e .
pip install git+https://github.com/cocodataset/panopticapi.git
pip install git+https://github.com/facebookresearch/detectron2.git
```
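After installation, you can optionally verify the environment with a short Python snippet (illustrative, not part of the repository):

```python
# Confirm the installed versions meet the requirements above and a CUDA GPU is visible.
import torch
import torchvision

print("torch:", torch.__version__)              # expect >= 2.5.1
print("torchvision:", torchvision.__version__)  # expect >= 0.20.1
print("CUDA available:", torch.cuda.is_available())
```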
For training EntitySAM, you can download the COCO dataset from the official website.
For evaluation, you can download the VIPSeg dataset from the official website or use our processed version for convenience. The datasets should be organized as follows:
datasets/
├── VIPSeg_720P/
│ ├── images/
│ ├── panomasks/
│ ├── panomasksRGB/
│ ├── panoVIPSeg_categories.json
│ ├── panoptic_gt_VIPSeg_val.json
│ ├── train.txt
│ ├── val.txt
│ └── test.txt
└── coco/
├── train2017/
├── val2017/
├── annotations/
└── panoptic_train2017/
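Before running training or evaluation, you may want to confirm this layout is in place. A minimal check, assuming the paths shown in the tree above (illustrative helper, not part of the repository):

```python
# Verify the expected dataset folders and files exist under datasets/.
from pathlib import Path

expected = [
    "datasets/VIPSeg_720P/images",
    "datasets/VIPSeg_720P/panomasksRGB",
    "datasets/VIPSeg_720P/panoptic_gt_VIPSeg_val.json",
    "datasets/coco/train2017",
    "datasets/coco/annotations",
    "datasets/coco/panoptic_train2017",
]
for p in expected:
    status = "ok" if Path(p).exists() else "MISSING"
    print(f"{status:7s} {p}")
```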
For detailed evaluation instructions, please see EVAL.md.
We provide pre-trained EntitySAM checkpoints on Hugging Face. You can download the checkpoints using the following table:
| Model | Checkpoint Path | Download Link |
|---|---|---|
| ViT-L | `./checkpoints/vit-l/model_0009999.pth` | Download |
| ViT-S | `./checkpoints/vit-s/model_0009999.pth` | Download |
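As a quick sanity check after downloading, a checkpoint file can be inspected with plain PyTorch. This is illustrative only; the `model` key is an assumption about the checkpoint layout, and the actual loading entry point is described in EVAL.md.

```python
# Open a downloaded checkpoint and list a few parameter names (illustrative only).
import torch

ckpt_path = "./checkpoints/vit-l/model_0009999.pth"
state = torch.load(ckpt_path, map_location="cpu", weights_only=False)
weights = state.get("model", state) if isinstance(state, dict) else state  # assumed layout

print(f"{len(weights)} entries in {ckpt_path}")
for name, value in list(weights.items())[:5]:
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(name, shape)
```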
For training instructions, please see TRAIN.md.
If you find EntitySAM useful in your research or refer to the provided baseline results, please star ⭐ this repository and consider citing 📝:
@inproceedings{entitysam,
title={EntitySAM: Segment Everything in Video},
author={Ye, Mingqiao and Oh, Seoung Wug and Ke, Lei and Lee, Joon-Young},
booktitle={CVPR},
year={2025}
}