Forwain/Sparse-MinVIS

A Lightweight Semi-Supervised Video Instance Segmentation Framework with Frame-Only Training

1. Introduction

Video Instance Segmentation (VIS) aims to simultaneously perform object detection, segmentation, and tracking in videos. It plays an important role in applications such as video understanding, surveillance, video editing, and autonomous driving.

However, existing VIS methods heavily rely on dense pixel-level annotations across consecutive frames, which leads to:

  • High annotation cost
  • Low scalability
  • Difficulty in real-world deployment

To address this challenge, we propose a lightweight semi-supervised VIS framework that enables training with frame-only supervision, significantly reducing annotation requirements.


2. Method

2.1 Overview

The proposed framework adopts a semi-supervised learning paradigm, leveraging both labeled and unlabeled video frames to train the model.

Key idea:

  • Use limited labeled frames for supervised learning
  • Generate pseudo labels for unlabeled frames
  • Jointly train the model using both labeled and pseudo-labeled data
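The pseudo-label step above typically keeps only confident predictions. The sketch below is a minimal, hypothetical illustration of such confidence filtering; the threshold value and the `(instance, score)` prediction format are illustrative assumptions, not this repository's actual API.

```python
# Hypothetical sketch: retain only high-confidence predictions as pseudo labels.
# The 0.7 threshold and the (instance_id, score) tuple format are assumptions
# made for illustration.

def filter_pseudo_labels(predictions, threshold=0.7):
    """Keep predicted instances whose confidence meets the threshold."""
    return [(inst, score) for inst, score in predictions if score >= threshold]

preds = [("person_0", 0.92), ("car_1", 0.41), ("dog_2", 0.78)]
pseudo = filter_pseudo_labels(preds)
# The low-confidence "car_1" prediction is dropped; the rest serve as pseudo labels.
```

Filtering out low-confidence predictions is a common safeguard in semi-supervised learning: it limits the noise that incorrect pseudo labels inject into training.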

2.2 Training Pipeline

The overall training process consists of:

  1. Labeled data

    • Sparse annotations (e.g., 1 frame per video)
    • Provide ground truth supervision
  2. Unlabeled data

    • Remaining frames without annotations
    • Used for pseudo-label generation
  3. Pseudo-label generation

    • Model predicts segmentation results for unlabeled frames
    • These predictions are treated as pseudo labels
  4. Joint training

    • Combine labeled data and pseudo-labeled data
    • Optimize the model using both sources
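Step 4 can be summarized as a single objective combining both supervision sources. The sketch below shows one common form of this combination; the weighting factor `lambda_u` and the averaging scheme are illustrative assumptions, not the repository's exact loss.

```python
# Hypothetical sketch of the joint objective: supervised loss on labeled
# frames plus a weighted pseudo-label loss on unlabeled frames.
# `lambda_u` and the mean reduction are assumptions for illustration.

def joint_loss(sup_losses, pseudo_losses, lambda_u=0.5):
    """Combine per-frame supervised and pseudo-label losses into one scalar."""
    l_sup = sum(sup_losses) / max(len(sup_losses), 1)
    l_pseudo = sum(pseudo_losses) / max(len(pseudo_losses), 1)
    return l_sup + lambda_u * l_pseudo

total = joint_loss(sup_losses=[1.0], pseudo_losses=[2.0], lambda_u=0.5)
# total == 2.0: the supervised term (1.0) plus half the pseudo-label term (1.0)
```

A down-weighted pseudo-label term (`lambda_u < 1`) reflects the lower reliability of pseudo labels relative to ground-truth annotations.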

2.3 Framework

  • Base model: MinVIS
  • Learning paradigm: Semi-supervised learning
  • Supervision: Frame-only labeled data + pseudo labels

3. Experiments

Dataset

  • YouTube-VIS 2019

Experimental Setting

  • Only 1 labeled frame per video
  • Remaining frames are treated as unlabeled
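The setting above can be expressed as a per-video split: one frame is marked labeled and the rest unlabeled. The helper below is a minimal sketch of that split; choosing the first frame as the labeled one is an illustrative assumption (the actual selection strategy is not specified here).

```python
# Hypothetical sketch of the 1-labeled-frame-per-video split.
# Taking frames[0] as the labeled frame is an assumption for illustration.

def split_frames(video_frames):
    """Return ({video: labeled_frame}, {video: [unlabeled_frames]})."""
    labeled, unlabeled = {}, {}
    for video, frames in video_frames.items():
        labeled[video] = frames[0]      # one labeled frame per video
        unlabeled[video] = frames[1:]   # all remaining frames are unlabeled
    return labeled, unlabeled

videos = {"vid_a": [0, 1, 2, 3], "vid_b": [0, 1]}
lab, unlab = split_frames(videos)
# lab == {"vid_a": 0, "vid_b": 0}
# unlab == {"vid_a": [1, 2, 3], "vid_b": [1]}
```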

Results

  • The model maintains competitive instance segmentation performance despite this extremely sparse supervision
  • The results demonstrate that pseudo-label-based training can effectively exploit unlabeled video data

4. Key Contributions

  • Propose a semi-supervised VIS framework trained with frame-only supervision
  • Introduce a pseudo-labeling mechanism for video instance segmentation
  • Significantly reduce annotation cost while maintaining performance
  • Validate the framework's effectiveness on the YouTube-VIS 2019 dataset
