Video Instance Segmentation (VIS) aims to simultaneously perform object detection, segmentation, and tracking in videos. It plays an important role in applications such as video understanding, surveillance, video editing, and autonomous driving.
However, existing VIS methods heavily rely on dense pixel-level annotations across consecutive frames, which leads to:
- High annotation cost
- Low scalability
- Difficulty in real-world deployment
To address this challenge, we propose a lightweight semi-supervised VIS framework that enables training with frame-only supervision, significantly reducing annotation requirements.
The proposed framework adopts a semi-supervised learning paradigm, leveraging both labeled and unlabeled video frames to train the model.
Key idea:
- Use limited labeled frames for supervised learning
- Generate pseudo labels for unlabeled frames
- Jointly train the model using both labeled and pseudo-labeled data
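The joint objective above can be sketched as a weighted sum of a supervised loss on labeled frames and a pseudo-label loss on unlabeled frames. This is a minimal sketch: the `weight` hyperparameter and the plain cross-entropy form are assumptions for illustration, not the paper's exact loss.

```python
import math

def cross_entropy(probs, labels):
    # Mean negative log-likelihood: probs[i] is a class distribution
    # for pixel i, labels[i] its (pseudo-)label index.
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def semi_supervised_loss(labeled_probs, gt_labels,
                         unlabeled_probs, pseudo_labels, weight=0.5):
    # Supervised term on labeled frames plus a down-weighted term on
    # pseudo-labeled frames; `weight` is a hypothetical hyperparameter.
    return (cross_entropy(labeled_probs, gt_labels)
            + weight * cross_entropy(unlabeled_probs, pseudo_labels))
```

Down-weighting the pseudo-label term is a common way to limit the influence of noisy pseudo labels early in training.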
The overall training process consists of:
- Labeled data
  - Sparse annotations (e.g., 1 frame per video)
  - Provide ground-truth supervision
- Unlabeled data
  - Remaining frames without annotations
  - Used for pseudo-label generation
- Pseudo-label generation
  - Model predicts segmentation results for unlabeled frames
  - These predictions are treated as pseudo labels
- Joint training
  - Combine labeled data and pseudo-labeled data
  - Optimize the model using both sources
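The pseudo-label generation and joint-training steps above can be sketched as follows. The `predict` callable, the confidence threshold, and the simple mixing of the two data sources are placeholder assumptions; confidence filtering in particular is a common pseudo-labeling heuristic rather than a detail confirmed by the source.

```python
def generate_pseudo_labels(predict, unlabeled_frames, threshold=0.7):
    # `predict` returns (mask, confidence) for a frame; only confident
    # predictions are kept as pseudo labels.  Both `predict` and
    # `threshold` are hypothetical, not the framework's exact API.
    pseudo = []
    for frame in unlabeled_frames:
        mask, confidence = predict(frame)
        if confidence >= threshold:
            pseudo.append((frame, mask))
    return pseudo

def joint_training_set(labeled, pseudo_labeled):
    # Joint training simply mixes both supervision sources.
    return labeled + pseudo_labeled
```

In practice the pseudo labels would be regenerated periodically as the model improves, so that later training rounds use higher-quality targets.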
Method configuration:
- Base model: MinVIS
- Learning paradigm: semi-supervised learning
- Supervision: frame-only labeled data + pseudo labels
Experimental setup:
- Dataset: YouTube-VIS 2019
- Only 1 labeled frame per video
- Remaining frames are treated as unlabeled
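The 1-labeled-frame-per-video protocol can be sketched as a split over each video's frames. Choosing the labeled frame by random sampling is an assumption here; the experiments may instead fix a particular frame (e.g., the first annotated one).

```python
import random

def split_supervision(videos, labeled_per_video=1, seed=0):
    # `videos` maps a video id to its list of frames; one frame per
    # video is marked labeled and the rest unlabeled.
    rng = random.Random(seed)
    labeled, unlabeled = [], []
    for vid, frames in videos.items():
        chosen = set(rng.sample(range(len(frames)), labeled_per_video))
        for idx in range(len(frames)):
            (labeled if idx in chosen else unlabeled).append((vid, idx))
    return labeled, unlabeled
```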
Results:
- The model maintains competitive instance segmentation performance under extremely sparse supervision
- Demonstrates that pseudo-label-based training can effectively exploit unlabeled video data
Contributions:
- Propose a semi-supervised VIS framework with frame-only supervision
- Introduce a pseudo-labeling mechanism for video instance segmentation
- Significantly reduce annotation cost while maintaining performance
- Validate effectiveness on the YouTube-VIS 2019 dataset