EntitySAM: Segment Everything in Video
CVPR 2025
Adobe Research, EPFL, CMU
We propose EntitySAM, a novel framework that extends SAM 2 to the task of Video Entity Segmentation: segmenting every entity in a video without requiring category annotations. Our method achieves generalizable and exhaustive segmentation using only image-level training data, and demonstrates strong zero-shot performance across multiple benchmarks. Refer to our paper for more details.
🔥 2025/07/12: We release the training code. See the Training section.
2025/06/26: We release the evaluation code and checkpoints. See the Evaluation section.
2025/06/02: Our paper EntitySAM is online.
Automatically tracking and segmenting every video entity remains a significant challenge. Despite rapid advancements in video segmentation, even state-of-the-art models like SAM 2 struggle to consistently track all entities across a video—a task we refer to as Video Entity Segmentation. We propose EntitySAM, a framework for zero-shot video entity segmentation. EntitySAM extends SAM 2 by removing the need for explicit prompts, allowing automatic discovery and tracking of all entities, including those appearing in later frames. We incorporate query-based entity discovery and association into SAM 2, inspired by transformer-based object detectors. Specifically, we introduce an entity decoder to facilitate inter-object communication and an automatic prompt generator using learnable object queries. Additionally, we add a semantic encoder to enhance SAM 2’s semantic awareness, improving segmentation quality. Trained on image-level mask annotations without category information from the COCO dataset, EntitySAM demonstrates strong generalization on four zero-shot video segmentation tasks: Video Entity, Panoptic, Instance, and Semantic Segmentation. Results on six popular benchmarks show that EntitySAM outperforms previous unified video segmentation methods and strong baselines, setting new standards for zero-shot video segmentation.
(a) Overview of the EntitySAM framework: EntitySAM utilizes the frozen encoder and memory parameters from SAM 2, incorporating a dual encoder design for enhanced semantic features. The PromptGenerator automatically generates prompts from Prompt Queries. The enhanced features and distinct query groups are processed by the EntityDecoder to produce video mask outputs.
(b) EntityDecoder: self-attention and cross-attention mechanisms in the EntityDecoder layers.
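As a rough illustration of the query-based entity discovery described above, the sketch below shows learnable object queries attending to frame features through cross-attention, communicating through self-attention, and producing per-query mask logits via a dot product with the feature map. This is a minimal toy example under assumed dimensions and class names, not the EntitySAM implementation.

```python
# Toy sketch of query-based entity discovery (illustrative, not the official code).
import torch
import torch.nn as nn

class ToyEntityDecoderLayer(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, queries, features):
        # queries: (B, N, C) object queries; features: (B, HW, C) flattened frame features
        q = self.norm1(queries + self.cross_attn(queries, features, features)[0])
        q = self.norm2(q + self.self_attn(q, q, q)[0])  # inter-object communication
        return self.norm3(q + self.ffn(q))

class ToyEntityDecoder(nn.Module):
    def __init__(self, num_queries: int = 100, dim: int = 256, num_layers: int = 3):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)  # learnable object queries
        self.layers = nn.ModuleList(ToyEntityDecoderLayer(dim) for _ in range(num_layers))

    def forward(self, features):
        # features: (B, C, H, W) from a (frozen) image encoder
        B, C, H, W = features.shape
        feats = features.flatten(2).transpose(1, 2)          # (B, HW, C)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        for layer in self.layers:
            q = layer(q, feats)
        masks = torch.einsum("bnc,bchw->bnhw", q, features)  # per-query mask logits
        return masks

if __name__ == "__main__":
    decoder = ToyEntityDecoder()
    print(decoder(torch.randn(1, 256, 64, 64)).shape)  # torch.Size([1, 100, 64, 64])
```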
EntitySAM segments every entity in a video. Results are compared with SAM 2 and DEVA. EntitySAM achieves zero-shot video entity segmentation without requiring video-level training data or category annotations.
example1.mp4
example2.mp4
example3.mp4
example4.mp4
The EntitySAM environment is based on SAM 2. The code requires `python>=3.10`, as well as `torch>=2.5.1` and `torchvision>=0.20.1`. Please follow the instructions here to install the PyTorch and TorchVision dependencies. You can then install EntitySAM on a GPU machine using:
```bash
git clone https://github.com/ymq2017/entitysam && cd entitysam
pip install -e .
pip install git+https://github.com/cocodataset/panopticapi.git
pip install git+https://github.com/facebookresearch/detectron2.git
```
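After installation, you can optionally verify the environment with a short Python snippet (illustrative, not part of the repository):

```python
# Confirm the installed versions meet the requirements above and a CUDA GPU is visible.
import torch
import torchvision

print("torch:", torch.__version__)              # expect >= 2.5.1
print("torchvision:", torchvision.__version__)  # expect >= 0.20.1
print("CUDA available:", torch.cuda.is_available())
```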
For training EntitySAM, you can download the COCO dataset from the official website.
For evaluation, you can download the VIPSeg dataset from the official website or use our processed version for convenience. The datasets should be organized as follows:
datasets/
├── VIPSeg_720P/
│ ├── images/
│ ├── panomasks/
│ ├── panomasksRGB/
│ ├── panoVIPSeg_categories.json
│ ├── panoptic_gt_VIPSeg_val.json
│ ├── train.txt
│ ├── val.txt
│ └── test.txt
└── coco/
├── train2017/
├── val2017/
├── annotations/
└── panoptic_train2017/
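Before running training or evaluation, you may want to confirm this layout is in place. A minimal check, assuming the paths shown in the tree above (illustrative helper, not part of the repository):

```python
# Verify the expected dataset folders and files exist under datasets/.
from pathlib import Path

expected = [
    "datasets/VIPSeg_720P/images",
    "datasets/VIPSeg_720P/panomasksRGB",
    "datasets/VIPSeg_720P/panoptic_gt_VIPSeg_val.json",
    "datasets/coco/train2017",
    "datasets/coco/annotations",
    "datasets/coco/panoptic_train2017",
]
for p in expected:
    status = "ok" if Path(p).exists() else "MISSING"
    print(f"{status:7s} {p}")
```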
For detailed evaluation instructions, please see EVAL.md.
We provide pre-trained EntitySAM checkpoints on Hugging Face. You can download the checkpoints using the following table:
| Model | Checkpoint Path | Download Link |
|---|---|---|
| ViT-L | `./checkpoints/vit-l/model_0009999.pth` | Download |
| ViT-S | `./checkpoints/vit-s/model_0009999.pth` | Download |
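As a quick sanity check after downloading, a checkpoint file can be inspected with plain PyTorch. This is illustrative only; the `model` key is an assumption about the checkpoint layout, and the actual loading entry point is described in EVAL.md.

```python
# Open a downloaded checkpoint and list a few parameter names (illustrative only).
import torch

ckpt_path = "./checkpoints/vit-l/model_0009999.pth"
state = torch.load(ckpt_path, map_location="cpu", weights_only=False)
weights = state.get("model", state) if isinstance(state, dict) else state  # assumed layout

print(f"{len(weights)} entries in {ckpt_path}")
for name, value in list(weights.items())[:5]:
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(name, shape)
```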
For training instructions, please see TRAIN.md.
If you find EntitySAM useful in your research or refer to the provided baseline results, please star ⭐ this repository and consider citing 📝:
@inproceedings{entitysam,
title={EntitySAM: Segment Everything in Video},
author={Ye, Mingqiao and Oh, Seoung Wug and Ke, Lei and Lee, Joon-Young},
booktitle={CVPR},
year={2025}
}