
InternVideo [Paper]

README in Chinese

Benchmarks (Papers With Code)


This repo provides the official implementation of InternVideo: General Video Foundation Models via Generative and Discriminative Learning.

Highlights:

  • 91.1% Top-1 accuracy on Kinetics-400, surpassing the 90% milestone for the first time
  • 77.2% Top-1 accuracy on Something-Something V2
  • SOTA on 39 video benchmarks (action recognition, temporal localization, retrieval, etc.) at release (2022)

Project Structure

InternVideo1/
├── Pretrain/                          # Pretraining models
│   ├── VideoMAE/                      # Video masked autoencoder pretraining
│   ├── Multi-Modalities-Pretraining/  # Video-language contrastive learning (demo/inference only)
│   ├── ViCLIP/                        # Video CLIP for transferable video-text representation
│   └── UniFormerV2/                   # Spatiotemporal learning with image ViTs (git submodule)
│
├── Downstream/                        # Downstream task implementations
│   ├── Video-Text-Retrieval/          # Video-text retrieval on 6 benchmarks
│   ├── Open-Set-Action-Recognition/   # Open-set action recognition (MMAction-based)
│   ├── Spatial-Temporal-Action-Localization/  # Spatio-temporal action localization (AVA)
│   ├── Temporal-Action-Localization/  # Temporal action localization (ActivityNet, THUMOS14)
│   ├── Visual-Language-Navigation/    # Vision-language navigation (VLN-CE)
│   └── multi-modalities-downstream/   # VQA, zero-shot action recognition, zero-shot multiple choice
│
└── Media/                             # Images for documentation

Each sub-project is self-contained with its own README, dependencies, and training scripts. See individual READMEs for setup and usage instructions.

Introduction

Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models focus on image-level pretraining and adaptation, which limits them on dynamic and complex video-level understanding tasks. To fill this gap, we present InternVideo, a family of general video foundation models that takes advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo explores masked video modeling and video-language contrastive learning as pretraining objectives, and selectively coordinates the video representations of these two complementary frameworks in a learnable manner to boost a range of video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets spanning tasks including video action recognition/detection, video-language alignment, and open-world video applications.
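To make the discriminative objective above concrete, here is a minimal NumPy sketch of a symmetric InfoNCE-style loss over a batch of paired video/text embeddings, the general form of the contrastive objective used in video-language pretraining. This is illustrative only, not the repo's training code; the function name and temperature value are assumptions.

```python
import numpy as np

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired (video, text) embeddings.

    video_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix

    def cross_entropy_diag(logits):
        # matched pairs sit on the diagonal; maximize their log-probability
        logits = logits - logits.max(axis=1, keepdims=True)  # stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # average of the video-to-text and text-to-video directions
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

In actual pretraining this loss is minimized jointly with the masked-reconstruction objective; here it only shows the shape of the computation.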

Pretraining

| Component | Description | Link |
| --- | --- | --- |
| VideoMAE | Video masked autoencoder pretraining & supervised finetuning | Pretrain/VideoMAE |
| Multi-Modalities | Video-language contrastive learning (inference demo) | Pretrain/Multi-Modalities-Pretraining |
| ViCLIP | Video CLIP trained on InternVid-10M | Pretrain/ViCLIP |
| UniFormerV2 | Spatiotemporal learning by arming image ViTs with video UniFormer | Pretrain/UniFormerV2 |
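To make the generative side concrete: VideoMAE-style pretraining hides a very high fraction of spatio-temporal patches with "tube" masking (the same spatial patches are hidden in every frame) and reconstructs them. A minimal sketch of such a mask generator follows; the patch counts and ratio are illustrative defaults, not the repo's actual implementation.

```python
import numpy as np

def tube_mask(num_frames, h_patches, w_patches, mask_ratio=0.9, seed=None):
    """Random tube mask: the same spatial patches are masked in every frame.

    Returns a boolean (T, H, W) array; True marks a masked patch.
    """
    rng = np.random.default_rng(seed)
    num_spatial = h_patches * w_patches
    num_masked = int(round(num_spatial * mask_ratio))
    spatial = np.zeros(num_spatial, dtype=bool)
    spatial[rng.choice(num_spatial, size=num_masked, replace=False)] = True
    # broadcast the single spatial mask along the time axis
    return np.broadcast_to(spatial.reshape(h_patches, w_patches),
                           (num_frames, h_patches, w_patches))
```

Masking identical patches across time prevents the model from trivially copying a patch from a neighboring frame, which is why tube masking works at such high ratios.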

Downstream Tasks

| Task | Description | Link |
| --- | --- | --- |
| Video-Text Retrieval | Finetuned retrieval on MSR-VTT, DiDeMo, LSMDC, MSVD, VATEX, ActivityNet | Downstream/Video-Text-Retrieval |
| Open-Set Action Recognition | Open-set recognition on UCF101, HMDB51 | Downstream/Open-Set-Action-Recognition |
| Spatio-Temporal Action Localization | Action localization on AVA, AVA-Kinetics | Downstream/Spatial-Temporal-Action-Localization |
| Temporal Action Localization | Temporal localization on ActivityNet, THUMOS14, HACS, FineAction | Downstream/Temporal-Action-Localization |
| Visual-Language Navigation | VLN-CE navigation task | Downstream/Visual-Language-Navigation |
| VQA & Zero-Shot Tasks | Video QA, zero-shot action recognition, zero-shot multiple choice | Downstream/multi-modalities-downstream |
| Ego4D Tasks | Egocentric video understanding | External repo |

Model Zoo

Pretrained Models

| Model | Training Data | Download |
| --- | --- | --- |
| InternVideo-MM-L-14 | WebVid10M + self-collected (14M) | ckpt |
| VideoMAE-B | UnlabeledHybrid (1M) | ckpt |
| VideoMAE-L | UnlabeledHybrid (1M) | ckpt |
| VideoMAE-H | UnlabeledHybrid (1M) | ckpt |

Downstream Task Models

Classification

| Model | Finetuning Data | Download |
| --- | --- | --- |
| VideoMAE-B | K400 | ckpt |
| VideoMAE-B | K710 | ckpt |
| VideoMAE-B | SSv2 | ckpt |
| VideoMAE-L | K400 | ckpt |
| VideoMAE-L | K700 | ckpt |
| VideoMAE-L | SSv2 | ckpt |
| VideoMAE-H | K400 | ckpt |
| VideoMAE-H | SSv1 | ckpt |

Retrieval

| Model | Training Data | Download |
| --- | --- | --- |
| InternVideo-MM-L-14 | ActivityNet | ckpt |
| InternVideo-MM-L-14 | DiDeMo | ckpt |
| InternVideo-MM-L-14 | LSMDC | ckpt |
| InternVideo-MM-L-14 | MSR-VTT | ckpt |
| InternVideo-MM-L-14 | MSVD | ckpt |
| InternVideo-MM-L-14 | VATEX | ckpt |
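For context on how retrieval checkpoints like these are typically evaluated: the text query and each candidate video are embedded, then candidates are ranked by cosine similarity. A hypothetical minimal sketch (function names are illustrative, not the repo's API):

```python
import numpy as np

def retrieve(query_emb, gallery_embs, top_k=5):
    """Rank gallery items by cosine similarity to a single query embedding.

    query_emb: (dim,) array; gallery_embs: (n, dim) array.
    Returns (indices, similarities) of the top_k best matches, best first.
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q
    order = np.argsort(-sims)[:top_k]   # descending similarity
    return order, sims[order]
```

Retrieval metrics such as Recall@K then simply check whether the ground-truth item appears among the first K returned indices.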

Video QA

| Model | Finetuning Data | Download |
| --- | --- | --- |
| InternVideo-MM-L-14 | MSR-VTT | ckpt |
| InternVideo-MM-L-14 | MSVD | ckpt |
| InternVideo-MM-L-14 | TGIFQA | ckpt |

Spatio-Temporal Action Localization

| Model | Finetuning Data | Download |
| --- | --- | --- |
| VideoMAE-H | AVA-Kinetics | ckpt |

Updates

  • Jan 16, 2024: InternVid accepted for spotlight at ICLR 2024.
  • Sep 7, 2023: ViCLIP released on Hugging Face with strong zero-shot action recognition.
  • Jul 16, 2023: InternVid video-text dataset (10M clip subset) released.
  • May 11, 2023: Video instruction data released for tuning video dialogue systems like VideoChat.
  • Mar 8, 2023: All pretrained foundation model weights released.
  • Dec 6, 2022: Technical report released.

Citation

If this work is helpful for your research, please consider citing InternVideo:

@article{wang2022internvideo,
  title={InternVideo: General Video Foundation Models via Generative and Discriminative Learning},
  author={Wang, Yi and Li, Kunchang and Li, Yizhuo and He, Yinan and Huang, Bingkun and Zhao, Zhiyu and Zhang, Hongjie and Xu, Jilan and Liu, Yi and Wang, Zun and Xing, Sen and Chen, Guo and Pan, Junting and Yu, Jiashuo and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2212.03191},
  year={2022}
}

@article{wang2023videomae,
  title={VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
  author={Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu},
  journal={arXiv preprint arXiv:2303.16727},
  year={2023}
}

@article{li2022uniformerv2,
  title={UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer},
  author={Li, Kunchang and Wang, Yali and He, Yinan and Li, Yizhuo and Wang, Yi and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2211.09552},
  year={2022}
}

@article{li2023unmasked,
  title={Unmasked Teacher: Towards Training-Efficient Video Foundation Models},
  author={Li, Kunchang and Wang, Yali and Li, Yizhuo and Wang, Yi and He, Yinan and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2303.16058},
  year={2023}
}

@article{wang2023internvid,
  title={InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation},
  author={Wang, Yi and He, Yinan and Li, Yizhuo and Li, Kunchang and Yu, Jiashuo and Ma, Xin and Chen, Xinyuan and Wang, Yaohui and Luo, Ping and Liu, Ziwei and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2307.06942},
  year={2023}
}