Video Instance Segmentation (VIS) aims to simultaneously perform object detection, segmentation, and tracking in videos. It plays an important role in applications such as video understanding, surveillance, video editing, and autonomous driving.
However, existing VIS methods heavily rely on dense pixel-level annotations across consecutive frames, which leads to:
- High annotation cost
- Low scalability
- Difficulty in real-world deployment
To address this challenge, we propose a lightweight semi-supervised VIS framework that enables training with frame-only supervision, significantly reducing annotation requirements.
The proposed framework adopts a semi-supervised learning paradigm, leveraging both labeled and unlabeled video frames to train the model.
Key idea:
- Use limited labeled frames for supervised learning
- Generate pseudo labels for unlabeled frames
- Jointly train the model using both labeled and pseudo-labeled data
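The joint objective above can be sketched as a weighted sum of a supervised loss on labeled frames and a pseudo-label loss on unlabeled frames. This is a minimal sketch: the `weight` hyperparameter and the plain cross-entropy form are assumptions for illustration, not the paper's exact loss.

```python
import math

def cross_entropy(probs, labels):
    # Mean negative log-likelihood: probs[i] is a class distribution
    # for pixel i, labels[i] its (pseudo-)label index.
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def semi_supervised_loss(labeled_probs, gt_labels,
                         unlabeled_probs, pseudo_labels, weight=0.5):
    # Supervised term on labeled frames plus a down-weighted term on
    # pseudo-labeled frames; `weight` is a hypothetical hyperparameter.
    return (cross_entropy(labeled_probs, gt_labels)
            + weight * cross_entropy(unlabeled_probs, pseudo_labels))
```

Down-weighting the pseudo-label term is a common way to limit the influence of noisy pseudo labels early in training.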
The overall training process consists of:
- Labeled data
  - Sparse annotations (e.g., 1 frame per video)
  - Provide ground-truth supervision
- Unlabeled data
  - Remaining frames without annotations
  - Used for pseudo-label generation
- Pseudo-label generation
  - Model predicts segmentation results for unlabeled frames
  - These predictions are treated as pseudo labels
- Joint training
  - Combine labeled data and pseudo-labeled data
  - Optimize the model using both sources
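The pseudo-label generation and joint-training steps above can be sketched as follows. The `predict` callable, the confidence threshold, and the simple mixing of the two data sources are placeholder assumptions; confidence filtering in particular is a common pseudo-labeling heuristic rather than a detail confirmed by the source.

```python
def generate_pseudo_labels(predict, unlabeled_frames, threshold=0.7):
    # `predict` returns (mask, confidence) for a frame; only confident
    # predictions are kept as pseudo labels.  Both `predict` and
    # `threshold` are hypothetical, not the framework's exact API.
    pseudo = []
    for frame in unlabeled_frames:
        mask, confidence = predict(frame)
        if confidence >= threshold:
            pseudo.append((frame, mask))
    return pseudo

def joint_training_set(labeled, pseudo_labeled):
    # Joint training simply mixes both supervision sources.
    return labeled + pseudo_labeled
```

In practice the pseudo labels would be regenerated periodically as the model improves, so that later training rounds use higher-quality targets.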
Method configuration:
- Base model: MinVIS
- Learning paradigm: semi-supervised learning
- Supervision: frame-only labeled data + pseudo labels
Experimental setup:
- Dataset: YouTube-VIS 2019
- Only 1 labeled frame per video
- Remaining frames are treated as unlabeled
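The 1-labeled-frame-per-video protocol can be sketched as a split over each video's frames. Choosing the labeled frame by random sampling is an assumption here; the experiments may instead fix a particular frame (e.g., the first annotated one).

```python
import random

def split_supervision(videos, labeled_per_video=1, seed=0):
    # `videos` maps a video id to its list of frames; one frame per
    # video is marked labeled and the rest unlabeled.
    rng = random.Random(seed)
    labeled, unlabeled = [], []
    for vid, frames in videos.items():
        chosen = set(rng.sample(range(len(frames)), labeled_per_video))
        for idx in range(len(frames)):
            (labeled if idx in chosen else unlabeled).append((vid, idx))
    return labeled, unlabeled
```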
Results:
- The model maintains competitive instance segmentation performance under extremely sparse supervision
- Demonstrates that pseudo-label-based training can effectively exploit unlabeled video data
Contributions:
- Propose a semi-supervised VIS framework with frame-only supervision
- Introduce a pseudo-labeling mechanism for video instance segmentation
- Significantly reduce annotation cost while maintaining performance
- Validate effectiveness on the YouTube-VIS 2019 dataset