InternVideo [Paper]
This repo provides the official implementation of InternVideo: General Video Foundation Models via Generative and Discriminative Learning.
Highlights:
- 91.1% Top-1 accuracy on Kinetics-400, surpassing the 90% milestone for the first time
- 77.2% Top-1 accuracy on Something-Something V2
- State-of-the-art results on 39 video benchmarks (action recognition, temporal localization, retrieval, etc.) at the time of release (2022)
```
InternVideo1/
├── Pretrain/                                  # Pretraining models
│   ├── VideoMAE/                              # Video masked autoencoder pretraining
│   ├── Multi-Modalities-Pretraining/          # Video-language contrastive learning (demo/inference only)
│   ├── ViCLIP/                                # Video CLIP for transferable video-text representation
│   └── UniFormerV2/                           # Spatiotemporal learning with image ViTs (git submodule)
│
├── Downstream/                                # Downstream task implementations
│   ├── Video-Text-Retrieval/                  # Video-text retrieval on 6 benchmarks
│   ├── Open-Set-Action-Recognition/           # Open-set action recognition (MMAction-based)
│   ├── Spatial-Temporal-Action-Localization/  # Spatio-temporal action localization (AVA)
│   ├── Temporal-Action-Localization/          # Temporal action localization (ActivityNet, THUMOS14)
│   ├── Visual-Language-Navigation/            # Vision-language navigation (VLN-CE)
│   └── multi-modalities-downstream/           # VQA, zero-shot action recognition, zero-shot multiple choice
│
└── Media/                                     # Images for documentation
```
Each sub-project is self-contained with its own README, dependencies, and training scripts. See individual READMEs for setup and usage instructions.
Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models focus on image-level pretraining and adaptation, which limits them on dynamic, complex video-level understanding tasks. To fill this gap, we present InternVideo, a family of general video foundation models that takes advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as its pretraining objectives, and selectively coordinates the video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets across extensive tasks, including video action recognition/detection, video-language alignment, and open-world video applications.
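The learnable coordination is the distinctive step: features from the generative (masked-modeling) branch and the discriminative (contrastive) branch are combined with learned weights rather than a fixed fusion rule. The sketch below is a minimal, hypothetical PyTorch illustration of such learnable fusion; the paper coordinates the two backbones with cross-model attention, and `LearnableFusion` is not a class from this codebase.

```python
import torch
import torch.nn as nn

class LearnableFusion(nn.Module):
    """Hypothetical sketch: mix a generative (masked-modeling) feature and
    a discriminative (contrastive) feature with a learnable weight. The
    paper's actual coordination uses cross-model attention; this only
    illustrates learnable, rather than fixed, fusion."""

    def __init__(self, dim: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable mixing logit
        self.proj = nn.Linear(2 * dim, dim)        # joint projection head

    def forward(self, f_gen: torch.Tensor, f_dis: torch.Tensor) -> torch.Tensor:
        w = torch.sigmoid(self.alpha)              # keep the weight in (0, 1)
        mixed = w * f_gen + (1 - w) * f_dis        # convex combination
        return self.proj(torch.cat([mixed, f_dis], dim=-1))

# Toy usage: two batches of 768-d clip embeddings from the two branches.
f_gen, f_dis = torch.randn(4, 768), torch.randn(4, 768)
print(LearnableFusion(768)(f_gen, f_dis).shape)    # torch.Size([4, 768])
```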
Pretraining
| Component | Description | Link |
|---|---|---|
| VideoMAE | Video masked autoencoder pretraining & supervised finetuning | Pretrain/VideoMAE |
| Multi-Modalities | Video-language contrastive learning (inference demo) | Pretrain/Multi-Modalities-Pretraining |
| ViCLIP | Video CLIP trained on InternVid-10M | Pretrain/ViCLIP |
| UniFormerV2 | Spatiotemporal learning by arming image ViTs with video UniFormer | Pretrain/UniFormerV2 |
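Of these, VideoMAE is the generative branch: it reconstructs pixels from a heavily masked clip, using tube masking (the same spatial patches are hidden in every frame) at a high ratio of around 90%. The function below is a hypothetical sketch of that masking strategy, assuming ViT-style patchification; it is not code from the sub-project.

```python
import torch

def tube_mask(batch: int, t: int, h: int, w: int, ratio: float = 0.9) -> torch.Tensor:
    """Hypothetical sketch of VideoMAE-style tube masking: mask the same
    spatial patches in every frame so motion cannot leak through time."""
    num_patches = h * w
    num_masked = int(num_patches * ratio)
    noise = torch.rand(batch, num_patches)           # one score per spatial patch
    ids = noise.argsort(dim=1)                       # random permutation of patches
    mask = torch.zeros(batch, num_patches, dtype=torch.bool)
    mask.scatter_(1, ids[:, :num_masked], True)      # True = masked (dropped)
    return mask.unsqueeze(1).expand(batch, t, num_patches)  # repeat across time

m = tube_mask(2, 8, 14, 14)        # e.g. ViT-B/16 on 224x224: 14x14 patches, 8 steps
print(m.shape, m.float().mean())   # torch.Size([2, 8, 196]), ~0.9 masked
```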
Downstream Tasks
| Task | Description | Link |
|---|---|---|
| Video-Text Retrieval | Finetuned retrieval on MSR-VTT, DiDeMo, LSMDC, MSVD, VATEX, ActivityNet | Downstream/Video-Text-Retrieval |
| Open-Set Action Recognition | Open-set recognition on UCF101, HMDB51 | Downstream/Open-Set-Action-Recognition |
| Spatio-Temporal Action Localization | Action localization on AVA, AVA-Kinetics | Downstream/Spatial-Temporal-Action-Localization |
| Temporal Action Localization | Temporal localization on ActivityNet, THUMOS14, HACS, FineAction | Downstream/Temporal-Action-Localization |
| Visual-Language Navigation | VLN-CE navigation task | Downstream/Visual-Language-Navigation |
| VQA & Zero-Shot Tasks | Video QA, zero-shot action recognition, zero-shot multiple choice | Downstream/multi-modalities-downstream |
| Ego4D Tasks | Egocentric video understanding | External repo |
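The zero-shot tasks in the table rely on nothing more than the shared video-text embedding space: encode one text prompt per class, then pick the class closest to the video embedding. A minimal sketch, with random tensors standing in for the (hypothetical) video and text encoders:

```python
import torch
import torch.nn.functional as F

def classify_zero_shot(video_emb: torch.Tensor, class_embs: torch.Tensor) -> int:
    """Zero-shot action recognition: return the index of the class whose
    prompt embedding (e.g. "a video of a person {label}") has the highest
    cosine similarity to the video embedding in the shared space."""
    v = F.normalize(video_emb, dim=-1)
    c = F.normalize(class_embs, dim=-1)
    return int((v @ c.T).argmax(dim=-1))

video_emb = torch.randn(1, 768)     # one clip embedding (stand-in)
class_embs = torch.randn(400, 768)  # e.g. 400 Kinetics-400 class prompts
print(classify_zero_shot(video_emb, class_embs))
```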
Pretrained Models
| Model | Training Data | Download |
|---|---|---|
| InternVideo-MM-L-14 | WebVid10M+Self-collected (14M) | ckpt |
| VideoMAE-B | UnlabeledHybrid (1M) | ckpt |
| VideoMAE-L | UnlabeledHybrid (1M) | ckpt |
| VideoMAE-H | UnlabeledHybrid (1M) | ckpt |
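The released weights are standard PyTorch checkpoints; exact loading code lives in each sub-project's README. As a quick sanity check, you can inspect a downloaded file as below; the filename and the "model" key are assumptions, so verify the structure before use.

```python
import torch

# Hypothetical local filename; the actual download links are in the table above.
ckpt = torch.load("vit_h_hybrid_pt.pth", map_location="cpu")

# Released checkpoints often nest the weights under a key such as "model"
# or "module"; this is an assumption, so inspect the top-level keys first.
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
for name, tensor in list(state.items())[:5]:
    print(name, tuple(tensor.shape))
```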
Downstream Task Models
Classification
| Model | Finetuning Data | Download |
|---|---|---|
| VideoMAE-B | K400 | ckpt |
| VideoMAE-B | K710 | ckpt |
| VideoMAE-B | SSv2 | ckpt |
| VideoMAE-L | K400 | ckpt |
| VideoMAE-L | K700 | ckpt |
| VideoMAE-L | SSv2 | ckpt |
| VideoMAE-H | K400 | ckpt |
| VideoMAE-H | SSv1 | ckpt |
Retrieval
| Model | Training Data | Download |
|---|---|---|
| InternVideo-MM-L-14 | ActivityNet | ckpt |
| InternVideo-MM-L-14 | DiDeMo | ckpt |
| InternVideo-MM-L-14 | LSMDC | ckpt |
| InternVideo-MM-L-14 | MSR-VTT | ckpt |
| InternVideo-MM-L-14 | MSVD | ckpt |
| InternVideo-MM-L-14 | VATEX | ckpt |
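These retrieval checkpoints are typically reported with Recall@K: the fraction of text queries whose ground-truth video appears in the top K results. A minimal sketch of the metric, assuming the test set is index-aligned so that text i matches video i:

```python
import torch

def recall_at_k(sim: torch.Tensor, k: int = 1) -> float:
    """Text-to-video Recall@K for a similarity matrix where
    sim[i, j] = score(text_i, video_j) and pair (i, i) is ground truth."""
    topk = sim.topk(k, dim=1).indices                  # (N, k) retrieved video ids
    gt = torch.arange(sim.size(0)).unsqueeze(1)        # (N, 1) ground-truth ids
    return (topk == gt).any(dim=1).float().mean().item()

sim = torch.randn(1000, 1000)  # e.g. 1,000 MSR-VTT test pairs (random stand-in)
print(recall_at_k(sim, 1), recall_at_k(sim, 5))
```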
Video QA
| Model | Finetuning Data | Download |
|---|---|---|
| InternVideo-MM-L-14 | MSR-VTT | ckpt |
| InternVideo-MM-L-14 | MSVD | ckpt |
| InternVideo-MM-L-14 | TGIF-QA | ckpt |
Spatio-Temporal Action Localization
| Model | Finetuning Data | Download |
|---|---|---|
| VideoMAE-H | AVA-Kinetics | ckpt |
Updates
- Jan 16, 2024: InternVid was accepted as a spotlight at ICLR 2024.
- Sep 7, 2023: ViCLIP released on Hugging Face, with strong zero-shot action recognition performance.
- Jul 16, 2023: The InternVid video-text dataset (10M clip subset) released.
- May 11, 2023: Video instruction data released for tuning video dialogue systems such as VideoChat.
- Mar 8, 2023: All pretrained foundation model weights released.
- Dec 6, 2022: Technical report released.
Citation
If you find this work helpful for your research, please consider citing InternVideo and the related papers below.
```bibtex
@article{wang2022internvideo,
  title={InternVideo: General Video Foundation Models via Generative and Discriminative Learning},
  author={Wang, Yi and Li, Kunchang and Li, Yizhuo and He, Yinan and Huang, Bingkun and Zhao, Zhiyu and Zhang, Hongjie and Xu, Jilan and Liu, Yi and Wang, Zun and Xing, Sen and Chen, Guo and Pan, Junting and Yu, Jiashuo and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2212.03191},
  year={2022}
}

@article{wang2023videomae,
  title={VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
  author={Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu},
  journal={arXiv preprint arXiv:2303.16727},
  year={2023}
}

@article{li2022uniformerv2,
  title={UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer},
  author={Li, Kunchang and Wang, Yali and He, Yinan and Li, Yizhuo and Wang, Yi and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2211.09552},
  year={2022}
}

@article{li2023unmasked,
  title={Unmasked Teacher: Towards Training-Efficient Video Foundation Models},
  author={Li, Kunchang and Wang, Yali and Li, Yizhuo and Wang, Yi and He, Yinan and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2303.16058},
  year={2023}
}

@article{wang2023internvid,
  title={InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation},
  author={Wang, Yi and He, Yinan and Li, Yizhuo and Li, Kunchang and Yu, Jiashuo and Ma, Xin and Chen, Xinyuan and Wang, Yaohui and Luo, Ping and Liu, Ziwei and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2307.06942},
  year={2023}
}
```