diff --git a/README.md b/README.md
index 362cc85e..dc8978c9 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,9 @@
- [English Version]
+ [中文版本]
+There are currently a lot of open issues; our team will review and resolve them one by one. Please be patient.
-# 书生2.5 - 多模态多任务通用大模型
+# INTERN-2.5: Multimodal Multitask General Large Model
[](https://paperswithcode.com/sota/object-detection-on-coco?p=internimage-exploring-large-scale-vision)
[](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=internimage-exploring-large-scale-vision)
@@ -24,46 +25,48 @@
[](https://paperswithcode.com/sota/image-classification-on-places205?p=internimage-exploring-large-scale-vision)
[](https://paperswithcode.com/sota/image-classification-on-imagenet?p=internimage-exploring-large-scale-vision)
-这个代码仓库是[InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions](https://arxiv.org/abs/2211.05778)的官方实现。
+This repository is an official implementation of the paper [InternImage: Exploring Large-Scale Vision Foundation Models with
+Deformable Convolutions](https://arxiv.org/abs/2211.05778).
-[论文](https://arxiv.org/abs/2211.05778) \| [知乎专栏](https://zhuanlan.zhihu.com/p/610772005) | [文档](./docs/)
-## 简介
-商汤科技与上海人工智能实验室在2023年3月14日联合发布多模态多任务通用大模型“书生2.5”。“书生2.5”在多模态多任务处理能力中斩获多项全新突破,其卓越的图文跨模态任务处理能力可为自动驾驶等通用场景任务提供高效精准的感知和理解能力支持。“书生2.5”致力于多模态多任务通用模型的构建,旨在接收处理各种不同模态的输入,并采用统一的模型架构和参数处理各种不同的任务,促进不同模态和任务之间在表示学习方面的协作,逐步实现通用人工智能领域的融会贯通。
+[Paper](https://arxiv.org/abs/2211.05778) \| [Blog in Chinese](https://zhuanlan.zhihu.com/p/610772005) \| [Documents](./docs/)
-## 概览图
-
-

-
+## Introduction
+SenseTime and Shanghai AI Laboratory jointly released "INTERN-2.5", a multimodal multitask general large model, on March 14, 2023. "INTERN-2.5" achieves multiple new breakthroughs in multimodal multitask processing, and its excellent image-text cross-modal capabilities provide efficient and accurate perception and understanding for general-scenario tasks such as autonomous driving.
+## Overview
-## 亮点
-- :thumbsup: **高达30亿参数的最强视觉通用主干模型**
-- 🏆 **图像分类标杆数据集ImageNet `90.1% Top1`准确率,开源模型中准确度最高**
-- 🏆 **物体检测标杆数据集COCO `65.5 mAP`,唯一超过`65 mAP`的模型**
+
+

+
-## 最新进展
-- 2023年3月14日: 🚀 “书生2.5”发布!
-- 2023年2月28日: 🚀 InternImage 被CVPR 2023接收!
-- 2022年11月18日: 🚀 基于 InternImage-XL 主干网络,[BEVFormer v2](https://arxiv.org/abs/2211.10439) 在nuScenes的纯视觉3D检测任务上取得了最佳性能 `63.4 NDS` !
-- 2022年11月10日: 🚀 InternImage-H 在COCO目标检测任务上以 `65.4 mAP` 斩获冠军,是唯一突破 `65.0 mAP` 的超强物体检测模型!
-- 2022年11月10日: 🚀 InternImage-H 在ADE20K语义分割数据集上取得 `62.9 mIoU` 的SOTA性能!
+## Highlights
+- :thumbsup: **The strongest general-purpose vision backbone, with up to 3 billion parameters**
+- 🏆 **Achieved `90.1%` Top-1 accuracy on ImageNet, the highest among open-source models**
+- 🏆 **Achieved `65.5 mAP` on the COCO object detection benchmark, the only model to exceed `65.0 mAP`**
+## News
+- `Mar 14, 2023`: 🚀 "INTERN-2.5" is released!
+- `Feb 28, 2023`: 🚀 InternImage is accepted to CVPR 2023!
+- `Nov 18, 2022`: 🚀 Built on the InternImage-XL backbone, [BEVFormer v2](https://arxiv.org/abs/2211.10439) achieves state-of-the-art performance of `63.4 NDS` on the nuScenes camera-only 3D detection benchmark.
+- `Nov 10, 2022`: 🚀 InternImage-H achieves a new record `65.4 mAP` on COCO detection test-dev and `62.9 mIoU` on
+ADE20K, outperforming previous models by a large margin.
-## “书生2.5”的应用
+## Applications
-### 1. 图像模态任务性能
-- 在图像分类标杆数据集ImageNet上,“书生2.5”仅基于公开数据便达到了 90.1% 的Top-1准确率。这是除谷歌与微软两个未公开模型及额外数据集外,唯一准确率超过90.0%的模型,同时也是世界上开源模型中ImageNet准确度最高,规模最大的模型;
-- 在物体检测标杆数据集COCO上,“书生2.5” 取得了 65.5 的 mAP,是世界上唯一超过65 mAP的模型;
-- 在另外16个重要的视觉基础数据集(覆盖分类、检测和分割任务)上取得世界最好性能。
+### 1. Performance on Image Modality Tasks
+- On the ImageNet image classification benchmark, "INTERN-2.5" achieved a Top-1 accuracy of 90.1% using only publicly available data. Apart from two undisclosed models from Google and Microsoft that rely on additional data, it is the only model to exceed 90.0% Top-1 accuracy, and it is also the most accurate and largest open-source model on ImageNet.
+- On the COCO object detection benchmark, "INTERN-2.5" achieved an mAP of 65.5, making it the only model in the world to surpass 65 mAP.
+- "INTERN-2.5" also achieved the world's best performance on 16 other important visual benchmark datasets, covering classification, detection, and segmentation tasks.
-**分类任务**
+**Classification Task**
- 图像分类 | 场景分类 | 长尾分类 |
+ Image Classification | Scene Classification | Long-Tail Classification |
ImageNet | Places365 | Places 205 | iNaturalist 2018 |
@@ -75,10 +78,10 @@
-**检测任务**
+**Detection Task**
- 常规物体检测 | 长尾物体检测 | 自动驾驶物体检测 | 密集物体检测 |
+ Conventional Object Detection | Long-Tail Object Detection | Autonomous Driving Object Detection | Dense Object Detection |
COCO | VOC 2007 | VOC 2012 | OpenImage | LVIS minival | LVIS val | BDD100K | nuScenes | CrowdHuman |
@@ -89,10 +92,10 @@
-**分割任务**
+**Segmentation Task**
- 语义分割 | 街景分割 | RGBD分割 |
+ Semantic Segmentation | Street Segmentation | RGBD Segmentation |
ADE20K | COCO Stuff-10K | Pascal Context | CityScapes | NYU Depth V2 |
@@ -105,26 +108,26 @@
-### 2. 图文跨模态任务性能
+### 2. Performance on Image-Text Cross-Modal Tasks
-- 图文检索
+- Image-Text Retrieval
-“书生2.5”可根据文本内容需求快速定位检索出语义最相关的图像。这一能力既可应用于视频和图像集合,也可进一步结合物体检测框,具有丰富的应用模式,帮助用户更便捷、快速地找到所需图像资源, 例如可在相册中返回文本所指定的相关图像。
+"INTERN-2.5" can quickly locate and retrieve the most semantically relevant images based on textual content requirements. This capability can be applied to both videos and image collections and can be further combined with object detection boxes to enable a variety of applications, helping users quickly and easily find the required image resources. For example, it can return the relevant images specified by the text in the album.
-- 以图生文
+- Image-to-Text
-“书生2.5”的“以图生文”在图像描述、视觉问答、视觉推理和文字识别等多个方面均拥有强大的理解能力。例如在自动驾驶场景下,可以提升场景感知理解能力,辅助车辆判断交通信号灯状态、道路标志牌等信息,为车辆的决策规划提供有效的感知信息支持。
+"INTERN-2.5" has a strong understanding capability in various aspects of visual-to-text tasks such as image captioning, visual question answering, visual reasoning, and optical character recognition. For example, in the context of autonomous driving, it can enhance the scene perception and understanding capabilities, assist the vehicle in judging traffic signal status, road signs, and other information, and provide effective perception information support for vehicle decision-making and planning.
-**图文多模态任务**
+**Multimodal Tasks**
- 图像描述 | 微调图文检索 | 零样本图文检索 |
+ Image Captioning | Fine-tuned Image-Text Retrieval | Zero-shot Image-Text Retrieval |
COCO Caption | COCO Caption | Flickr30k | Flickr30k |
@@ -137,9 +140,10 @@
+## Core Technologies
+The outstanding performance of "INTERN-2.5" in cross-modal learning stems from several innovations in its core multimodal multitask general-model technology: InternImage serves as the large-scale vision backbone for visual perception, a large language model (LLM) serves as the large-scale pre-trained text network, and Uni-Perceiver serves as the unified decoder for multitask modeling.
-## 核心技术
-“书生2.5”在图文跨模态领域卓越的性能表现,源自于在多模态多任务通用模型技术核心方面的多项创新,实现了视觉核心视觉感知大模型主干网络(InternImage)、用于文本核心的超大规模文本预训练网络(LLM)和用于多任务的兼容解码建模(Uni-Perceiver)的创新组合。 视觉主干网络InternImage参数量高达30亿,能够基于动态稀疏卷积算子自适应地调整卷积的位置和组合方式,从而为多功能视觉感知提供强大的表示。Uni-Perceiver通才任务解码建模通过将不同模态的数据编码到统一的表示空间,并将不同任务统一为相同的任务范式,从而能够以相同的任务架构和共享的模型参数同时处理各种模态和任务。
+InternImage, the vision backbone of "INTERN-2.5", has up to 3 billion parameters and can adaptively adjust the positions and combinations of its convolution sampling points through dynamic sparse convolution operators, providing powerful representations for versatile visual perception. Uni-Perceiver, a generalist task decoder, encodes data from different modalities into a unified representation space and unifies different tasks under the same task paradigm, so that various modalities and tasks can be processed simultaneously with a single architecture and shared model parameters.
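+
+The following is a minimal, self-contained PyTorch sketch of the dynamic/deformable sampling idea described above. It uses `torchvision.ops.deform_conv2d` (a DCNv2-style operator) purely as an illustration; it is not the DCNv3 operator shipped in this repository, and the random offsets and masks stand in for the small prediction branches a real model would learn.
+
+```python
+import torch
+import torchvision.ops as ops
+
+# Toy input: batch of 1, 16 channels, 32x32 feature map.
+x = torch.randn(1, 16, 32, 32)
+
+# A 3x3 deformable convolution mapping 16 -> 32 channels.
+weight = torch.randn(32, 16, 3, 3)
+bias = torch.zeros(32)
+
+# Offsets shift each of the 3x3 = 9 sampling points by (dy, dx), so the
+# effective receptive field adapts to the content. In a real model these
+# are predicted from the input; here they are random just to show shapes.
+offset = 0.5 * torch.randn(1, 2 * 3 * 3, 32, 32)
+
+# A modulation mask re-weights each sampling point (DCNv2-style).
+mask = torch.sigmoid(torch.randn(1, 3 * 3, 32, 32))
+
+out = ops.deform_conv2d(x, offset, weight, bias, stride=1, padding=1, mask=mask)
+print(out.shape)  # torch.Size([1, 32, 32, 32])
+```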
@@ -147,95 +151,120 @@
-## 项目功能
-- [ ] 各类下游任务
+## Project Release
+- [ ] Models for other downstream tasks
- [x] InternImage-H(1B)/G(3B)
-- [x] TensorRT 推理
-- [x] InternImage 系列分类代码
-- [x] InternImage-T/S/B/L/XL ImageNet-1K 预训练模型
-- [x] InternImage-L/XL ImageNet-22K 预训练模型
-- [x] InternImage-T/S/B/L/XL 检测和实例分割模型
-- [x] InternImage-T/S/B/L/XL 语义分割模型
-
-
-## 相关开源项目
-- 目标检测和实例分割: [COCO](detection/configs/coco/)
-- 语义分割: [ADE20K](segmentation/configs/ade20k/), [Cityscapes](segmentation/configs/cityscapes/)
-- 图文检索、图像描述和视觉问答: [Uni-Perceiver](https://github.com/fundamentalvision/Uni-Perceiver)
-- 3D感知: [BEVFormer](https://github.com/fundamentalvision/BEVFormer)
-
-## 开源视觉预训练模型
-| name | pretrain | pre-training resolution | #param | download |
-| :------------: | :--------: | :--------: | :-----: | :-----------------: |
-| InternImage-L | ImageNet-22K | 384x384 | 223M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22k_192to384.pth) |
-| InternImage-XL | ImageNet-22K | 384x384 | 335M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22k_192to384.pth) |
-| InternImage-H | Joint 427M | 384x384 | 1.08B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_jointto22k_384.pth) |
-| InternImage-G | - | 384x384 | 3B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_pretrainto22k_384.pth) |
-
-
-
-## ImageNet-1K图像分类
-| name | pretrain | resolution | acc@1 | #param | FLOPs | download |
-| :------------: | :----------: | :--------: | :---: | :-----: | :---: | :-----------------: |
-| InternImage-T | ImageNet-1K | 224x224 | 83.5 | 30M | 5G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_t_1k_224.pth) \| [cfg](classification/configs/internimage_t_1k_224.yaml) |
-| InternImage-S | ImageNet-1K | 224x224 | 84.2 | 50M | 8G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_s_1k_224.pth) \| [cfg](classification/configs/internimage_s_1k_224.yaml) |
-| InternImage-B | ImageNet-1K | 224x224 | 84.9 | 97M | 16G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_b_1k_224.pth) \| [cfg](classification/configs/internimage_b_1k_224.yaml) |
-| InternImage-L | ImageNet-22K | 384x384 | 87.7 | 223M | 108G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22kto1k_384.pth) \| [cfg](classification/configs/internimage_l_22kto1k_384.yaml) |
-| InternImage-XL | ImageNet-22K | 384x384 | 88.0 | 335M | 163G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22kto1k_384.pth) \| [cfg](classification/configs/internimage_xl_22kto1k_384.yaml) |
-| InternImage-H | Joint 427M | 640x640 | 89.6 | 1.08B | 1478G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_22kto1k_640.pth) \| [cfg](classification/configs/internimage_h_22kto1k_640.yaml) |
-| InternImage-G | - | 512x512 | 90.1 | 3B | 2700G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_22kto1k_512.pth) \| [cfg](classification/configs/internimage_g_22kto1k_512.yaml) |
-
-
-
-## COCO目标检测和实例分割
-
-| backbone | method | schd | box mAP | mask mAP | #param | FLOPs | download |
-| :------------: | :----------------: | :---------: | :-----: | :------: | :-----: | :---: | :---: |
-| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py) |
-| InternImage-T | Mask R-CNN | 3x | 49.1 | 43.7 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_3x_coco.py) |
-| InternImage-S | Mask R-CNN | 1x | 47.8 | 43.3 | 69M | 340G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_s_fpn_1x_coco.py) |
-| InternImage-S | Mask R-CNN | 3x | 49.7 | 44.5 | 69M | 340G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_s_fpn_3x_coco.py) |
-| InternImage-B | Mask R-CNN | 1x | 48.8 | 44.0 | 115M | 501G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_b_fpn_1x_coco.py) |
-| InternImage-B | Mask R-CNN | 3x | 50.3 | 44.8 | 115M | 501G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_b_fpn_3x_coco.py) |
-| InternImage-L | Cascade | 1x | 54.9 | 47.7 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_1x_coco.py) |
-| InternImage-L | Cascade | 3x | 56.1 | 48.5 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_3x_coco.py) |
-| InternImage-XL | Cascade | 1x | 55.3 | 48.1 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_1x_coco.py) |
-| InternImage-XL | Cascade | 3x | 56.2 | 48.8 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_3x_coco.py) |
-
-| backbone | method | box mAP (val/test) | #param | FLOPs | download |
-| :------------: | :----------------: | :---------: | :------: | :-----: | :-----: |
-| InternImage-H | DINO (TTA) | 65.0 / 65.4 | 2.18B | TODO | TODO |
-| InternImage-G | DINO (TTA) | 65.3 / 65.5 | 3B | TODO | TODO |
-
-## ADE20K语义分割
-
-| backbone | method | resolution | mIoU (ss/ms) | #param | FLOPs | download |
-| :------------: | :--------: | :--------: | :----------: | :-----: | :---: | :---: |
-| InternImage-T | UperNet | 512x512 | 47.9 / 48.1 | 59M | 944G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_t_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_t_512_160k_ade20k.py) |
-| InternImage-S | UperNet | 512x512 | 50.1 / 50.9 | 80M | 1017G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_s_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_s_512_160k_ade20k.py) |
-| InternImage-B | UperNet | 512x512 | 50.8 / 51.3 | 128M | 1185G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_b_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_b_512_160k_ade20k.py) |
-| InternImage-L | UperNet | 640x640 | 53.9 / 54.1 | 256M | 2526G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_l_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_l_640_160k_ade20k.py) |
-| InternImage-XL | UperNet | 640x640 | 55.0 / 55.3 | 368M | 3142G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_xl_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_xl_640_160k_ade20k.py) |
-| InternImage-H | UperNet | 896x896 | 59.9 / 60.3 | 1.12B | 3566G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_h_896_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_h_896_160k_ade20k.py) |
-| InternImage-H | Mask2Former | 896x896 | 62.5 / 62.9 | 1.31B | 4635G | TODO |
-
-
-## 模型推理速度
-
-[[TensorRT]](classification/export.py)
+- [x] TensorRT inference
+- [x] Classification code of the InternImage series
+- [x] InternImage-T/S/B/L/XL ImageNet-1K pretrained model
+- [x] InternImage-L/XL ImageNet-22K pretrained model
+- [x] InternImage-T/S/B/L/XL detection and instance segmentation model
+- [x] InternImage-T/S/B/L/XL semantic segmentation model
+
+
+## Related Projects
+- Object Detection and Instance Segmentation: [COCO](detection/configs/coco/)
+- Semantic Segmentation: [ADE20K](segmentation/configs/ade20k/), [Cityscapes](segmentation/configs/cityscapes/)
+- Image-Text Retrieval, Image Captioning, and Visual Question Answering: [Uni-Perceiver](https://github.com/fundamentalvision/Uni-Perceiver)
+- 3D Perception: [BEVFormer](https://github.com/fundamentalvision/BEVFormer)
+
+
+## Open-source Visual Pretrained Models
+| name | pretrain | pre-training resolution | #param | download |
+| :------------: | :----------: | :----------------------: | :----: | :---------------------------------------------------------------------------------------------------: |
+| InternImage-L | ImageNet-22K | 384x384 | 223M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22k_192to384.pth) |
+| InternImage-XL | ImageNet-22K | 384x384 | 335M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22k_192to384.pth) |
+| InternImage-H | Joint 427M | 384x384 | 1.08B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_jointto22k_384.pth) |
+| InternImage-G | - | 384x384 | 3B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_pretrainto22k_384.pth) |
+
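+The snippet below is a small sketch (not an official loader) showing how one of the checkpoints above can be fetched from the Hugging Face Hub and inspected; the assumption that the weights sit under a "model" key is hedged in the code.
+
+```python
+# Requires: pip install torch huggingface_hub
+import torch
+from huggingface_hub import hf_hub_download
+
+ckpt_path = hf_hub_download(
+    repo_id="OpenGVLab/InternImage",
+    filename="internimage_l_22k_192to384.pth",  # any filename from the table above
+)
+
+state = torch.load(ckpt_path, map_location="cpu")
+# Checkpoints of this kind often wrap the weights in a "model" key;
+# fall back to the raw dict otherwise.
+weights = state["model"] if isinstance(state, dict) and "model" in state else state
+print(len(weights), "tensors, e.g.", next(iter(weights)))
+```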
+
+
+## ImageNet-1K Image Classification
+| name | pretrain | resolution | acc@1 | #param | FLOPs | download |
+| :------------: | :----------: | :--------: | :---: | :----: | :---: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| InternImage-T | ImageNet-1K | 224x224 | 83.5 | 30M | 5G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_t_1k_224.pth) \| [cfg](classification/configs/internimage_t_1k_224.yaml) |
+| InternImage-S | ImageNet-1K | 224x224 | 84.2 | 50M | 8G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_s_1k_224.pth) \| [cfg](classification/configs/internimage_s_1k_224.yaml) |
+| InternImage-B | ImageNet-1K | 224x224 | 84.9 | 97M | 16G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_b_1k_224.pth) \| [cfg](classification/configs/internimage_b_1k_224.yaml) |
+| InternImage-L | ImageNet-22K | 384x384 | 87.7 | 223M | 108G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22kto1k_384.pth) \| [cfg](classification/configs/internimage_l_22kto1k_384.yaml) |
+| InternImage-XL | ImageNet-22K | 384x384 | 88.0 | 335M | 163G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22kto1k_384.pth) \| [cfg](classification/configs/internimage_xl_22kto1k_384.yaml) |
+| InternImage-H | Joint 427M | 640x640 | 89.6 | 1.08B | 1478G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_22kto1k_640.pth) \| [cfg](classification/configs/internimage_h_22kto1k_640.yaml) |
+| InternImage-G | - | 512x512 | 90.1 | 3B | 2700G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_22kto1k_512.pth) \| [cfg](classification/configs/internimage_g_22kto1k_512.yaml) |
+
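+The "resolution" column is the evaluation input size. As a rough sketch, standard ImageNet-style preprocessing at those sizes looks like the following; the 87.5% center-crop convention and bicubic interpolation are assumptions, and the exact settings in `classification/configs/` may differ.
+
+```python
+from torchvision import transforms
+
+def build_eval_transform(img_size: int = 224):
+    resize = int(img_size / 0.875)  # common center-crop convention (assumption)
+    return transforms.Compose([
+        transforms.Resize(resize, interpolation=transforms.InterpolationMode.BICUBIC),
+        transforms.CenterCrop(img_size),
+        transforms.ToTensor(),
+        transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
+    ])
+```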
+
+## COCO Object Detection and Instance Segmentation
+
+| backbone | method | schd | box mAP | mask mAP | #param | FLOPs | download |
+| :------------: | :--------: | :---: | :-----: | :------: | :----: | :---: | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py) |
+| InternImage-T | Mask R-CNN | 3x | 49.1 | 43.7 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_3x_coco.py) |
+| InternImage-S | Mask R-CNN | 1x | 47.8 | 43.3 | 69M | 340G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_s_fpn_1x_coco.py) |
+| InternImage-S | Mask R-CNN | 3x | 49.7 | 44.5 | 69M | 340G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_s_fpn_3x_coco.py) |
+| InternImage-B | Mask R-CNN | 1x | 48.8 | 44.0 | 115M | 501G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_b_fpn_1x_coco.py) |
+| InternImage-B | Mask R-CNN | 3x | 50.3 | 44.8 | 115M | 501G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_b_fpn_3x_coco.py) |
+| InternImage-L | Cascade | 1x | 54.9 | 47.7 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_1x_coco.py) |
+| InternImage-L | Cascade | 3x | 56.1 | 48.5 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_3x_coco.py) |
+| InternImage-XL | Cascade | 1x | 55.3 | 48.1 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_1x_coco.py) |
+| InternImage-XL | Cascade | 3x | 56.2 | 48.8 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_3x_coco.py) |
+
+| backbone | method | box mAP (val/test) | #param | FLOPs | download |
+| :-----------: | :--------: | :----------------: | :----: | :---: | :------: |
+| InternImage-H | DINO (TTA) | 65.0 / 65.4 | 2.18B | TODO | TODO |
+| InternImage-G | DINO (TTA) | 65.3 / 65.5 | 3B | TODO | TODO |
+
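+As a hypothetical usage sketch (not an official demo script), the detection configs and checkpoints listed above can be driven through the standard mmdetection 2.x API that the `detection/` code is built on; `demo.jpg` and the local checkpoint path are placeholders.
+
+```python
+from mmdet.apis import init_detector, inference_detector
+
+config = "detection/configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py"
+checkpoint = "mask_rcnn_internimage_t_fpn_1x_coco.pth"  # downloaded from the table above
+
+# Requires the repository's DCNv3 ops to be importable in addition to mmdet/mmcv.
+model = init_detector(config, checkpoint, device="cuda:0")
+result = inference_detector(model, "demo.jpg")  # per-class boxes (and masks)
+```
+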
+## ADE20K Semantic Segmentation
+
+
+| backbone | method | resolution | mIoU (ss/ms) | #param | FLOPs | download |
+| :------------: | :---------: | :--------: | :----------: | :----: | :---: | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| InternImage-T | UperNet | 512x512 | 47.9 / 48.1 | 59M | 944G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_t_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_t_512_160k_ade20k.py) |
+| InternImage-S | UperNet | 512x512 | 50.1 / 50.9 | 80M | 1017G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_s_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_s_512_160k_ade20k.py) |
+| InternImage-B | UperNet | 512x512 | 50.8 / 51.3 | 128M | 1185G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_b_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_b_512_160k_ade20k.py) |
+| InternImage-L | UperNet | 640x640 | 53.9 / 54.1 | 256M | 2526G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_l_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_l_640_160k_ade20k.py) |
+| InternImage-XL | UperNet | 640x640 | 55.0 / 55.3 | 368M | 3142G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_xl_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_xl_640_160k_ade20k.py) |
+| InternImage-H | UperNet | 896x896 | 59.9 / 60.3 | 1.12B | 3566G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_h_896_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_h_896_160k_ade20k.py) |
+| InternImage-H | Mask2Former | 896x896 | 62.5 / 62.9 | 1.31B | 4635G | TODO |
+
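+Similarly, a hypothetical sketch for the segmentation models above via the mmsegmentation 0.x API used by `segmentation/`; the image and checkpoint paths are placeholders.
+
+```python
+from mmseg.apis import init_segmentor, inference_segmentor
+
+config = "segmentation/configs/ade20k/upernet_internimage_t_512_160k_ade20k.py"
+checkpoint = "upernet_internimage_t_512_160k_ade20k.pth"  # downloaded from the table above
+
+model = init_segmentor(config, checkpoint, device="cuda:0")
+result = inference_segmentor(model, "demo.jpg")  # list containing the per-pixel class map
+```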
+
+## Model Inference Speed
+
+[Export the classification model from PyTorch to TensorRT](classification/README.md#export)
+
+[Export the detection model from PyTorch to TensorRT](detection/README.md#export)
+
+[Export the segmentation model from PyTorch to TensorRT](segmentation/README.md#export)
| name | resolution | #param | FLOPs | batch 1 FPS (TensorRT) |
-| :------------: | :--------: | :-----: | :---: | :-------------------: |
-| InternImage-T | 224x224 | 30M | 5G | 156 |
-| InternImage-S | 224x224 | 50M | 8G | 129 |
-| InternImage-B | 224x224 | 97M | 16G | 116 |
-| InternImage-L | 384x384 | 223M | 108G | 56 |
-| InternImage-XL | 384x384 | 335M | 163G | 47 |
+| :------------: | :--------: | :----: | :---: | :--------------------: |
+| InternImage-T | 224x224 | 30M | 5G | 156 |
+| InternImage-S | 224x224 | 50M | 8G | 129 |
+| InternImage-B | 224x224 | 97M | 16G | 116 |
+| InternImage-L | 384x384 | 223M | 108G | 56 |
+| InternImage-XL | 384x384 | 335M | 163G | 47 |
+
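+The batch-1 FPS numbers above can be reproduced in spirit with a simple timing loop like the sketch below, wrapped around whatever backend executes one forward pass (a TensorRT execution context, a `torch.nn.Module` in eval mode, etc.); for asynchronous GPU backends, synchronize the device before reading the timer.
+
+```python
+import time
+
+def measure_fps(run_once, warmup: int = 20, iters: int = 100) -> float:
+    """Rough batch-1 FPS of `run_once`, a zero-argument callable."""
+    for _ in range(warmup):          # let caches and clocks stabilize
+        run_once()
+    start = time.perf_counter()
+    for _ in range(iters):
+        run_once()                   # add a device sync here for async backends
+    return iters / (time.perf_counter() - start)
+```
+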
+Before using `mmdeploy` to convert our PyTorch models to TensorRT, please make sure the DCNv3 custom operator has been built correctly. You can build it with the following commands:
+```shell
+export MMDEPLOY_DIR=/the/root/path/of/MMDeploy
+
+# prepare our custom ops, you can find it at InternImage/tensorrt/modulated_deform_conv_v3
+cp -r modulated_deform_conv_v3 ${MMDEPLOY_DIR}/csrc/mmdeploy/backend_ops/tensorrt
+
+# build custom ops
+cd ${MMDEPLOY_DIR}
+mkdir -p build && cd build
+cmake -DCMAKE_CXX_COMPILER=g++-7 -DMMDEPLOY_TARGET_BACKENDS=trt -DTENSORRT_DIR=${TENSORRT_DIR} -DCUDNN_DIR=${CUDNN_DIR} ..
+make -j$(nproc) && make install
+
+# install the mmdeploy after building custom ops
+cd ${MMDEPLOY_DIR}
+pip install -e .
+```
+For more details on building custom ops, please refer to [this document](https://github.com/open-mmlab/mmdeploy/blob/master/docs/en/01-how-to-build/linux-x86_64.md).
+
-## 引用
+## Citation
-若“书生2.5”对您的研究工作有帮助,请参考如下bibtex对我们的工作进行引用。
+If this work is helpful for your research, please consider citing the following BibTeX entries.
```
@article{wang2022internimage,
@@ -291,5 +320,6 @@
```
-

+
+[//]: # (

)
diff --git a/README_CN.md b/README_CN.md
new file mode 100644
index 00000000..b07df50a
--- /dev/null
+++ b/README_CN.md
@@ -0,0 +1,322 @@
+
+ [English Version]
+
+现在issue有点多,我们团队会逐一查阅并解决,请耐心等待。
+
+# 书生2.5 - 多模态多任务通用大模型
+
+[](https://paperswithcode.com/sota/object-detection-on-coco?p=internimage-exploring-large-scale-vision)
+[](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=internimage-exploring-large-scale-vision)
+[](https://paperswithcode.com/sota/object-detection-on-lvis-v1-0-minival?p=internimage-exploring-large-scale-vision)
+[](https://paperswithcode.com/sota/object-detection-on-lvis-v1-0-val?p=internimage-exploring-large-scale-vision)
+[](https://paperswithcode.com/sota/object-detection-on-pascal-voc-2007?p=internimage-exploring-large-scale-vision)
+[](https://paperswithcode.com/sota/object-detection-on-pascal-voc-2012?p=internimage-exploring-large-scale-vision)
+[](https://paperswithcode.com/sota/object-detection-on-openimages-v6?p=internimage-exploring-large-scale-vision)
+[](https://paperswithcode.com/sota/object-detection-on-crowdhuman-full-body?p=internimage-exploring-large-scale-vision)
+[](https://paperswithcode.com/sota/2d-object-detection-on-bdd100k-val?p=internimage-exploring-large-scale-vision)
+[](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k?p=internimage-exploring-large-scale-vision)
+[](https://paperswithcode.com/sota/semantic-segmentation-on-cityscapes?p=internimage-exploring-large-scale-vision)
+[](https://paperswithcode.com/sota/semantic-segmentation-on-cityscapes-val?p=internimage-exploring-large-scale-vision)
+[](https://paperswithcode.com/sota/semantic-segmentation-on-pascal-context?p=internimage-exploring-large-scale-vision)
+[](https://paperswithcode.com/sota/semantic-segmentation-on-coco-stuff-test?p=internimage-exploring-large-scale-vision)
+[](https://paperswithcode.com/sota/3d-object-detection-on-nuscenes-camera-only?p=bevformer-v2-adapting-modern-image-backbones)
+[](https://paperswithcode.com/sota/image-classification-on-inaturalist-2018?p=internimage-exploring-large-scale-vision)
+[](https://paperswithcode.com/sota/image-classification-on-places365?p=internimage-exploring-large-scale-vision)
+[](https://paperswithcode.com/sota/image-classification-on-places205?p=internimage-exploring-large-scale-vision)
+[](https://paperswithcode.com/sota/image-classification-on-imagenet?p=internimage-exploring-large-scale-vision)
+
+这个代码仓库是[InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions](https://arxiv.org/abs/2211.05778)的官方实现。
+
+[论文](https://arxiv.org/abs/2211.05778) \| [知乎专栏](https://zhuanlan.zhihu.com/p/610772005) | [文档](./docs/)
+## 简介
+商汤科技与上海人工智能实验室在2023年3月14日联合发布多模态多任务通用大模型“书生2.5”。“书生2.5”在多模态多任务处理能力中斩获多项全新突破,其卓越的图文跨模态任务处理能力可为自动驾驶等通用场景任务提供高效精准的感知和理解能力支持。“书生2.5”致力于多模态多任务通用模型的构建,旨在接收处理各种不同模态的输入,并采用统一的模型架构和参数处理各种不同的任务,促进不同模态和任务之间在表示学习方面的协作,逐步实现通用人工智能领域的融会贯通。
+
+## 概览图
+
+
+

+
+
+
+## 亮点
+- :thumbsup: **高达30亿参数的最强视觉通用主干模型**
+- 🏆 **图像分类标杆数据集ImageNet `90.1% Top1`准确率,开源模型中准确度最高**
+- 🏆 **物体检测标杆数据集COCO `65.5 mAP`,唯一超过`65 mAP`的模型**
+
+## 最新进展
+- 2023年3月14日: 🚀 “书生2.5”发布!
+- 2023年2月28日: 🚀 InternImage 被CVPR 2023接收!
+- 2022年11月18日: 🚀 基于 InternImage-XL 主干网络,[BEVFormer v2](https://arxiv.org/abs/2211.10439) 在nuScenes的纯视觉3D检测任务上取得了最佳性能 `63.4 NDS` !
+- 2022年11月10日: 🚀 InternImage-H 在COCO目标检测任务上以 `65.4 mAP` 斩获冠军,是唯一突破 `65.0 mAP` 的超强物体检测模型!
+- 2022年11月10日: 🚀 InternImage-H 在ADE20K语义分割数据集上取得 `62.9 mIoU` 的SOTA性能!
+
+
+## “书生2.5”的应用
+
+### 1. 图像模态任务性能
+- 在图像分类标杆数据集ImageNet上,“书生2.5”仅基于公开数据便达到了 90.1% 的Top-1准确率。这是除谷歌与微软两个未公开模型及额外数据集外,唯一准确率超过90.0%的模型,同时也是世界上开源模型中ImageNet准确度最高,规模最大的模型;
+- 在物体检测标杆数据集COCO上,“书生2.5” 取得了 65.5 的 mAP,是世界上唯一超过65 mAP的模型;
+- 在另外16个重要的视觉基础数据集(覆盖分类、检测和分割任务)上取得世界最好性能。
+
+
+
+
+**分类任务**
+
+
+| 图像分类<br>ImageNet | 场景分类<br>Places365 | 场景分类<br>Places 205 | 长尾分类<br>iNaturalist 2018 |
+| :---: | :---: | :---: | :---: |
+| 90.1 | 61.2 | 71.7 | 92.3 |
+
+
+
+
+
+**检测任务**
+
+
+| 常规物体检测<br>COCO | 常规物体检测<br>VOC 2007 | 常规物体检测<br>VOC 2012 | 常规物体检测<br>OpenImage | 长尾物体检测<br>LVIS minival | 长尾物体检测<br>LVIS val | 自动驾驶物体检测<br>BDD100K | 自动驾驶物体检测<br>nuScenes | 密集物体检测<br>CrowdHuman |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| 65.5 | 94.0 | 97.2 | 74.1 | 65.8 | 63.2 | 38.8 | 64.8 | 97.2 |
+
+
+
+
+**分割任务**
+
+
+| 语义分割<br>ADE20K | 语义分割<br>COCO Stuff-10K | 语义分割<br>Pascal Context | 街景分割<br>CityScapes | RGBD分割<br>NYU Depth V2 |
+| :---: | :---: | :---: | :---: | :---: |
+| 62.9 | 59.6 | 70.3 | 86.1 | 69.7 |
+
+
+
+
+
+
+### 2. 图文跨模态任务性能
+
+- 图文检索
+
+“书生2.5”可根据文本内容需求快速定位检索出语义最相关的图像。这一能力既可应用于视频和图像集合,也可进一步结合物体检测框,具有丰富的应用模式,帮助用户更便捷、快速地找到所需图像资源, 例如可在相册中返回文本所指定的相关图像。
+
+
+- 以图生文
+
+“书生2.5”的“以图生文”在图像描述、视觉问答、视觉推理和文字识别等多个方面均拥有强大的理解能力。例如在自动驾驶场景下,可以提升场景感知理解能力,辅助车辆判断交通信号灯状态、道路标志牌等信息,为车辆的决策规划提供有效的感知信息支持。
+
+
+
+
+
+
+**图文多模态任务**
+
+
+| 图像描述<br>COCO Caption | 微调图文检索<br>COCO Caption | 微调图文检索<br>Flickr30k | 零样本图文检索<br>Flickr30k |
+| :---: | :---: | :---: | :---: |
+| 148.2 | 76.4 | 94.8 | 89.1 |
+
+
+
+
+
+
+
+## 核心技术
+“书生2.5”在图文跨模态领域卓越的性能表现,源自于在多模态多任务通用模型技术核心方面的多项创新,实现了视觉核心视觉感知大模型主干网络(InternImage)、用于文本核心的超大规模文本预训练网络(LLM)和用于多任务的兼容解码建模(Uni-Perceiver)的创新组合。 视觉主干网络InternImage参数量高达30亿,能够基于动态稀疏卷积算子自适应地调整卷积的位置和组合方式,从而为多功能视觉感知提供强大的表示。Uni-Perceiver通才任务解码建模通过将不同模态的数据编码到统一的表示空间,并将不同任务统一为相同的任务范式,从而能够以相同的任务架构和共享的模型参数同时处理各种模态和任务。
+
+
+
+

+
+
+
+## 项目功能
+- [ ] 各类下游任务
+- [x] DCNv3 预编译的whl包
+- [x] InternImage-H(1B)/G(3B)
+- [x] TensorRT 推理
+- [x] InternImage 系列分类代码
+- [x] InternImage-T/S/B/L/XL ImageNet-1K 预训练模型
+- [x] InternImage-L/XL ImageNet-22K 预训练模型
+- [x] InternImage-T/S/B/L/XL 检测和实例分割模型
+- [x] InternImage-T/S/B/L/XL 语义分割模型
+
+
+## 相关开源项目
+- 目标检测和实例分割: [COCO](detection/configs/coco/)
+- 语义分割: [ADE20K](segmentation/configs/ade20k/), [Cityscapes](segmentation/configs/cityscapes/)
+- 图文检索、图像描述和视觉问答: [Uni-Perceiver](https://github.com/fundamentalvision/Uni-Perceiver)
+- 3D感知: [BEVFormer](https://github.com/fundamentalvision/BEVFormer)
+
+## 开源视觉预训练模型
+| name | pretrain | pre-training resolution | #param | download |
+| :------------: | :----------: | :----------------------: | :----: | :---------------------------------------------------------------------------------------------------: |
+| InternImage-L | ImageNet-22K | 384x384 | 223M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22k_192to384.pth) |
+| InternImage-XL | ImageNet-22K | 384x384 | 335M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22k_192to384.pth) |
+| InternImage-H | Joint 427M | 384x384 | 1.08B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_jointto22k_384.pth) |
+| InternImage-G | - | 384x384 | 3B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_pretrainto22k_384.pth) |
+
+
+
+## ImageNet-1K图像分类
+| name | pretrain | resolution | acc@1 | #param | FLOPs | download |
+| :------------: | :----------: | :--------: | :---: | :----: | :---: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| InternImage-T | ImageNet-1K | 224x224 | 83.5 | 30M | 5G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_t_1k_224.pth) \| [cfg](classification/configs/internimage_t_1k_224.yaml) |
+| InternImage-S | ImageNet-1K | 224x224 | 84.2 | 50M | 8G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_s_1k_224.pth) \| [cfg](classification/configs/internimage_s_1k_224.yaml) |
+| InternImage-B | ImageNet-1K | 224x224 | 84.9 | 97M | 16G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_b_1k_224.pth) \| [cfg](classification/configs/internimage_b_1k_224.yaml) |
+| InternImage-L | ImageNet-22K | 384x384 | 87.7 | 223M | 108G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22kto1k_384.pth) \| [cfg](classification/configs/internimage_l_22kto1k_384.yaml) |
+| InternImage-XL | ImageNet-22K | 384x384 | 88.0 | 335M | 163G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22kto1k_384.pth) \| [cfg](classification/configs/internimage_xl_22kto1k_384.yaml) |
+| InternImage-H | Joint 427M | 640x640 | 89.6 | 1.08B | 1478G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_22kto1k_640.pth) \| [cfg](classification/configs/internimage_h_22kto1k_640.yaml) |
+| InternImage-G | - | 512x512 | 90.1 | 3B | 2700G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_22kto1k_512.pth) \| [cfg](classification/configs/internimage_g_22kto1k_512.yaml) |
+
+
+
+## COCO目标检测和实例分割
+
+| backbone | method | schd | box mAP | mask mAP | #param | FLOPs | download |
+| :------------: | :--------: | :---: | :-----: | :------: | :----: | :---: | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py) |
+| InternImage-T | Mask R-CNN | 3x | 49.1 | 43.7 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_3x_coco.py) |
+| InternImage-S | Mask R-CNN | 1x | 47.8 | 43.3 | 69M | 340G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_s_fpn_1x_coco.py) |
+| InternImage-S | Mask R-CNN | 3x | 49.7 | 44.5 | 69M | 340G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_s_fpn_3x_coco.py) |
+| InternImage-B | Mask R-CNN | 1x | 48.8 | 44.0 | 115M | 501G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_b_fpn_1x_coco.py) |
+| InternImage-B | Mask R-CNN | 3x | 50.3 | 44.8 | 115M | 501G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_b_fpn_3x_coco.py) |
+| InternImage-L | Cascade | 1x | 54.9 | 47.7 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_1x_coco.py) |
+| InternImage-L | Cascade | 3x | 56.1 | 48.5 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_3x_coco.py) |
+| InternImage-XL | Cascade | 1x | 55.3 | 48.1 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_1x_coco.py) |
+| InternImage-XL | Cascade | 3x | 56.2 | 48.8 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_3x_coco.py) |
+
+| backbone | method | box mAP (val/test) | #param | FLOPs | download |
+| :-----------: | :--------: | :----------------: | :----: | :---: | :------: |
+| InternImage-H | DINO (TTA) | 65.0 / 65.4 | 2.18B | TODO | TODO |
+| InternImage-G | DINO (TTA) | 65.3 / 65.5 | 3B | TODO | TODO |
+
+## ADE20K语义分割
+
+| backbone | method | resolution | mIoU (ss/ms) | #param | FLOPs | download |
+| :------------: | :---------: | :--------: | :----------: | :----: | :---: | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
+| InternImage-T | UperNet | 512x512 | 47.9 / 48.1 | 59M | 944G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_t_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_t_512_160k_ade20k.py) |
+| InternImage-S | UperNet | 512x512 | 50.1 / 50.9 | 80M | 1017G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_s_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_s_512_160k_ade20k.py) |
+| InternImage-B | UperNet | 512x512 | 50.8 / 51.3 | 128M | 1185G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_b_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_b_512_160k_ade20k.py) |
+| InternImage-L | UperNet | 640x640 | 53.9 / 54.1 | 256M | 2526G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_l_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_l_640_160k_ade20k.py) |
+| InternImage-XL | UperNet | 640x640 | 55.0 / 55.3 | 368M | 3142G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_xl_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_xl_640_160k_ade20k.py) |
+| InternImage-H | UperNet | 896x896 | 59.9 / 60.3 | 1.12B | 3566G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_h_896_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_h_896_160k_ade20k.py) |
+| InternImage-H | Mask2Former | 896x896 | 62.5 / 62.9 | 1.31B | 4635G | TODO |
+
+
+## 模型推理速度
+
+[export classification model from pytorch to tensorrt](classification/README.md#export)
+
+[export detection model from pytorch to tensorrt](detection/README.md#export)
+
+[export segmentation model from pytorch to tensorrt](segmentation/README.md#export)
+
+| name | resolution | #param | FLOPs | batch 1 FPS (TensorRT) |
+| :------------: | :--------: | :----: | :---: | :--------------------: |
+| InternImage-T | 224x224 | 30M | 5G | 156 |
+| InternImage-S | 224x224 | 50M | 8G | 129 |
+| InternImage-B | 224x224 | 97M | 16G | 116 |
+| InternImage-L | 384x384 | 223M | 108G | 56 |
+| InternImage-XL | 384x384 | 335M | 163G | 47 |
+
+在使用`mmdeploy`将PyTorch模型转为TensorRT之前,请确保您已正确编译DCNv3的自定义算子,其安装方式如下:
+```shell
+export MMDEPLOY_DIR=/the/root/path/of/MMDeploy
+
+# prepare our custom ops, you can find it at InternImage/tensorrt/modulated_deform_conv_v3
+cp -r modulated_deform_conv_v3 ${MMDEPLOY_DIR}/csrc/mmdeploy/backend_ops/tensorrt
+
+# build custom ops
+cd ${MMDEPLOY_DIR}
+mkdir -p build && cd build
+cmake -DCMAKE_CXX_COMPILER=g++-7 -DMMDEPLOY_TARGET_BACKENDS=trt -DTENSORRT_DIR=${TENSORRT_DIR} -DCUDNN_DIR=${CUDNN_DIR} ..
+make -j$(nproc) && make install
+
+# install the mmdeploy after building custom ops
+cd ${MMDEPLOY_DIR}
+pip install -e .
+```
+关于`mmdeploy`编译自定义算子的更多细节,请参考这份[文档](https://github.com/open-mmlab/mmdeploy/blob/master/docs/en/01-how-to-build/linux-x86_64.md)。
+
+
+
+## 引用
+
+若“书生2.5”对您的研究工作有帮助,请参考如下bibtex对我们的工作进行引用。
+
+```
+@article{wang2022internimage,
+ title={InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions},
+ author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
+ journal={arXiv preprint arXiv:2211.05778},
+ year={2022}
+}
+
+@inproceedings{zhu2022uni,
+ title={Uni-perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks},
+ author={Zhu, Xizhou and Zhu, Jinguo and Li, Hao and Wu, Xiaoshi and Li, Hongsheng and Wang, Xiaohua and Dai, Jifeng},
+ booktitle={CVPR},
+ pages={16804--16815},
+ year={2022}
+}
+
+@article{zhu2022uni,
+ title={Uni-perceiver-moe: Learning sparse generalist models with conditional moes},
+ author={Zhu, Jinguo and Zhu, Xizhou and Wang, Wenhai and Wang, Xiaohua and Li, Hongsheng and Wang, Xiaogang and Dai, Jifeng},
+ journal={arXiv preprint arXiv:2206.04674},
+ year={2022}
+}
+
+@article{li2022uni,
+ title={Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks},
+ author={Li, Hao and Zhu, Jinguo and Jiang, Xiaohu and Zhu, Xizhou and Li, Hongsheng and Yuan, Chun and Wang, Xiaohua and Qiao, Yu and Wang, Xiaogang and Wang, Wenhai and others},
+ journal={arXiv preprint arXiv:2211.09808},
+ year={2022}
+}
+
+@article{yang2022bevformer,
+ title={BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision},
+ author={Yang, Chenyu and Chen, Yuntao and Tian, Hao and Tao, Chenxin and Zhu, Xizhou and Zhang, Zhaoxiang and Huang, Gao and Li, Hongyang and Qiao, Yu and Lu, Lewei and others},
+ journal={arXiv preprint arXiv:2211.10439},
+ year={2022}
+}
+
+@article{su2022towards,
+ title={Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information},
+ author={Su, Weijie and Zhu, Xizhou and Tao, Chenxin and Lu, Lewei and Li, Bin and Huang, Gao and Qiao, Yu and Wang, Xiaogang and Zhou, Jie and Dai, Jifeng},
+ journal={arXiv preprint arXiv:2211.09807},
+ year={2022}
+}
+
+@inproceedings{li2022bevformer,
+ title={Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers},
+ author={Li, Zhiqi and Wang, Wenhai and Li, Hongyang and Xie, Enze and Sima, Chonghao and Lu, Tong and Qiao, Yu and Dai, Jifeng},
+ booktitle={ECCV},
+ pages={1--18},
+ year={2022},
+}
+```
+
+
+
+[//]: # (

)
+
diff --git a/README_EN.md b/README_EN.md
deleted file mode 100644
index e92d7c59..00000000
--- a/README_EN.md
+++ /dev/null
@@ -1,299 +0,0 @@
-
- [中文版本]
-
-
-# INTERN-2.5: Multimodal Multitask General Large Model
-
-[](https://paperswithcode.com/sota/object-detection-on-coco?p=internimage-exploring-large-scale-vision)
-[](https://paperswithcode.com/sota/object-detection-on-coco-minival?p=internimage-exploring-large-scale-vision)
-[](https://paperswithcode.com/sota/object-detection-on-lvis-v1-0-minival?p=internimage-exploring-large-scale-vision)
-[](https://paperswithcode.com/sota/object-detection-on-lvis-v1-0-val?p=internimage-exploring-large-scale-vision)
-[](https://paperswithcode.com/sota/object-detection-on-pascal-voc-2007?p=internimage-exploring-large-scale-vision)
-[](https://paperswithcode.com/sota/object-detection-on-pascal-voc-2012?p=internimage-exploring-large-scale-vision)
-[](https://paperswithcode.com/sota/object-detection-on-openimages-v6?p=internimage-exploring-large-scale-vision)
-[](https://paperswithcode.com/sota/object-detection-on-crowdhuman-full-body?p=internimage-exploring-large-scale-vision)
-[](https://paperswithcode.com/sota/2d-object-detection-on-bdd100k-val?p=internimage-exploring-large-scale-vision)
-[](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k?p=internimage-exploring-large-scale-vision)
-[](https://paperswithcode.com/sota/semantic-segmentation-on-cityscapes?p=internimage-exploring-large-scale-vision)
-[](https://paperswithcode.com/sota/semantic-segmentation-on-cityscapes-val?p=internimage-exploring-large-scale-vision)
-[](https://paperswithcode.com/sota/semantic-segmentation-on-pascal-context?p=internimage-exploring-large-scale-vision)
-[](https://paperswithcode.com/sota/semantic-segmentation-on-coco-stuff-test?p=internimage-exploring-large-scale-vision)
-[](https://paperswithcode.com/sota/3d-object-detection-on-nuscenes-camera-only?p=bevformer-v2-adapting-modern-image-backbones)
-[](https://paperswithcode.com/sota/image-classification-on-inaturalist-2018?p=internimage-exploring-large-scale-vision)
-[](https://paperswithcode.com/sota/image-classification-on-places365?p=internimage-exploring-large-scale-vision)
-[](https://paperswithcode.com/sota/image-classification-on-places205?p=internimage-exploring-large-scale-vision)
-[](https://paperswithcode.com/sota/image-classification-on-imagenet?p=internimage-exploring-large-scale-vision)
-
-This repository is an official implementation of the [InternImage: Exploring Large-Scale Vision Foundation Models with
-Deformable Convolutions](https://arxiv.org/abs/2211.05778).
-
-[Paper](https://arxiv.org/abs/2211.05778) \| [Blog in Chinese](https://zhuanlan.zhihu.com/p/610772005) | [Documents](./docs/)
-
-
-## Introduction
-SenseTime and Shanghai AI Laboratory jointly released the multimodal multitask general model "INTERN-2.5" on March 14, 2023. "INTERN-2.5" achieved multiple breakthroughs in multimodal multitask processing, and its excellent cross-modal task processing ability in text and image can provide efficient and accurate perception and understanding capabilities for general scenarios such as autonomous driving.
-
-## Overview
-
-
-

-
-
-## Highlights
-- :thumbsup: **The strongest visual universal backbone model with up to 3 billion parameters**
-- 🏆 **Achieved `90.1% Top1` accuracy in ImageNet, the most accurate among open-source models**
-- 🏆 **Achieved `65.5 mAP` on the COCO benchmark dataset for object detection, the only model that exceeded `65.0 mAP`**
-
-## News
-- `Mar 14, 2023`: 🚀 "INTERN-2.5" is released!
-- `Feb 28, 2023`: 🚀 InternImage is accepted to CVPR 2023!
-- `Nov 18, 2022`: 🚀 InternImage-XL merged into [BEVFormer v2](https://arxiv.org/abs/2211.10439) achieves state-of-the-art performance of `63.4 NDS` on nuScenes Camera Only.
-- `Nov 10, 2022`: 🚀 InternImage-H achieves a new record `65.4 mAP` on COCO detection test-dev and `62.9 mIoU` on
-ADE20K, outperforming previous models by a large margin.
-
-## Applications
-
-### 1. Performance on Image Modality Tasks
-- On the ImageNet benchmark dataset,
-"INTERN-2.5" achieved a Top-1 accuracy of 90.1% using only publicly available data for image classification. This is the only model, besides two undisclosed models from Google and Microsoft and additional datasets, to achieve a Top-1 accuracy of over 90.0%. It is also the highest-accuracy open-source model on ImageNet and the largest model in scale in the world.
-- On the COCO object detection benchmark dataset, "INTERN-2.5" achieved a mAP of 65.5, making it the only model in the world to surpass 65 mAP.
-- "INTERN-2.5" achieved the world's best performance on 16 other important visual benchmark datasets, covering classification, detection, and segmentation tasks.
-
-
-
-
-**Classification Task**
-
-
- Image Classification | Scene Classification | Long-Tail Classification |
-
-
- ImageNet | Places365 | Places 205 | iNaturalist 2018 |
-
-
- 90.1 | 61.2 | 71.7 | 92.3 |
-
-
-
-
-
-**Detection Task**
-
-
- Conventional Object Detection | Long-Tail Object Detection | Autonomous Driving Object Detection | Dense Object Detection |
-
-
- COCO | VOC 2007 | VOC 2012 | OpenImage | LVIS minival | LVIS val | BDD100K | nuScenes | CrowdHuman |
-
-
- 65.5 | 94.0 | 97.2 | 74.1 | 65.8 | 63.2 | 38.8 | 64.8 | 97.2 |
-
-
-
-
-**Segmentation Task**
-
-
- Semantic Segmentation | Street Segmentation | RGBD Segmentation |
-
-
- ADE20K | COCO Stuff-10K | Pascal Context | CityScapes | NYU Depth V2 |
-
-
- 62.9 | 59.6 | 70.3 | 86.1 | 69.7 |
-
-
-
-
-
-
-### 2. Cross-Modal Performance for Image and Text Tasks
-
-- Image-Text Retrieval
-
-"INTERN-2.5" can quickly locate and retrieve the most semantically relevant images based on textual content requirements. This capability can be applied to both videos and image collections and can be further combined with object detection boxes to enable a variety of applications, helping users quickly and easily find the required image resources. For example, it can return the relevant images specified by the text in the album.
-
-
-- Image-To-Text
-
-"INTERN-2.5" has a strong understanding capability in various aspects of visual-to-text tasks such as image captioning, visual question answering, visual reasoning, and optical character recognition. For example, in the context of autonomous driving, it can enhance the scene perception and understanding capabilities, assist the vehicle in judging traffic signal status, road signs, and other information, and provide effective perception information support for vehicle decision-making and planning.
-
-
-
-
-
-
-**Multimodal Tasks**
-
-
- Image Captioning | Fine-tuning Image-Text Retrieval | Zero-shot Image-Text Retrieval |
-
-
- COCO Caption | COCO Caption | Flickr30k | Flickr30k |
-
-
- 148.2 | 76.4 | 94.8 | 89.1 |
-
-
-
-
-
-
-## Core Technologies
-The outstanding performance of "INTERN-2.5" in the field of cross-modal learning is due to several innovations in the core technology of multi-modal multi-task general model, including the development of InternImage as the backbone network for visual perception, LLM as the large-scale text pre-training network for text processing, and Uni-Perceiver as the compatible decoding modeling for multi-task learning.
-
-InternImage, the visual backbone network of "INTERN-2.5", has a parameter size of up to 3 billion and can adaptively adjust the position and combination of convolutions based on dynamic sparse convolution operators, providing powerful representations for multi-functional visual perception. Uni-Perceiver, a versatile task decoding model, encodes data from different modalities into a unified representation space and unifies different tasks into the same task paradigm, enabling simultaneous processing of various modalities and tasks with the same task architecture and shared model parameters.
-
-
-
-

-
-
-
-## Project Release
-- [ ] Model for other downstream tasks
-- [x] InternImage-H(1B)/G(3B)
-- [x] TensorRT inference
-- [x] Classification code of the InternImage series
-- [x] InternImage-T/S/B/L/XL ImageNet-1K pretrained model
-- [x] InternImage-L/XL ImageNet-22K pretrained model
-- [x] InternImage-T/S/B/L/XL detection and instance segmentation model
-- [x] InternImage-T/S/B/L/XL semantic segmentation model
-
-
-## Related Projects
-- Object Detection and Instance Segmentation: [COCO](detection/configs/coco/)
-- Semantic Segmentation: [ADE20K](segmentation/configs/ade20k/), [Cityscapes](segmentation/configs/cityscapes/)
-- Image-Text Retrieval, Image Captioning, and Visual Question Answering: [Uni-Perceiver](https://github.com/fundamentalvision/Uni-Perceiver)
-- 3D Perception: [BEVFormer](https://github.com/fundamentalvision/BEVFormer)
-
-
-## Open-source Visual Pretrained Models
-| name | pretrain | pre-training resolution | #param | download |
-| :------------: | :--------: | :--------: | :-----: | :-----------------: |
-| InternImage-L | ImageNet-22K | 384x384 | 223M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22k_192to384.pth) |
-| InternImage-XL | ImageNet-22K | 384x384 | 335M | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22k_192to384.pth) |
-| InternImage-H | Joint 427M | 384x384 | 1.08B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_jointto22k_384.pth) |
-| InternImage-G | - | 384x384 | 3B | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_pretrainto22k_384.pth) |
-
-
-
-## ImageNet-1K Image Classification
-| name | pretrain | resolution | acc@1 | #param | FLOPs | download |
-| :------------: | :----------: | :--------: | :---: | :-----: | :---: | :-----------------: |
-| InternImage-T | ImageNet-1K | 224x224 | 83.5 | 30M | 5G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_t_1k_224.pth) \| [cfg](classification/configs/internimage_t_1k_224.yaml) |
-| InternImage-S | ImageNet-1K | 224x224 | 84.2 | 50M | 8G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_s_1k_224.pth) \| [cfg](classification/configs/internimage_s_1k_224.yaml) |
-| InternImage-B | ImageNet-1K | 224x224 | 84.9 | 97M | 16G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_b_1k_224.pth) \| [cfg](classification/configs/internimage_b_1k_224.yaml) |
-| InternImage-L | ImageNet-22K | 384x384 | 87.7 | 223M | 108G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_l_22kto1k_384.pth) \| [cfg](classification/configs/internimage_l_22kto1k_384.yaml) |
-| InternImage-XL | ImageNet-22K | 384x384 | 88.0 | 335M | 163G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_xl_22kto1k_384.pth) \| [cfg](classification/configs/internimage_xl_22kto1k_384.yaml) |
-| InternImage-H | Joint 427M | 640x640 | 89.6 | 1.08B | 1478G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_h_22kto1k_640.pth) \| [cfg](classification/configs/internimage_h_22kto1k_640.yaml) |
-| InternImage-G | - | 512x512 | 90.1 | 3B | 2700G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/internimage_g_22kto1k_512.pth) \| [cfg](classification/configs/internimage_g_22kto1k_512.yaml) |
-
-
-## COCO Object Detection and Instance Segmentation
-
-| backbone | method | schd | box mAP | mask mAP | #param | FLOPs | download |
-| :------------: | :----------------: | :---------: | :-----: | :------: | :-----: | :---: | :---: |
-| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py) |
-| InternImage-T | Mask R-CNN | 3x | 49.1 | 43.7 | 49M | 270G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_t_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_t_fpn_3x_coco.py) |
-| InternImage-S | Mask R-CNN | 1x | 47.8 | 43.3 | 69M | 340G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_s_fpn_1x_coco.py) |
-| InternImage-S | Mask R-CNN | 3x | 49.7 | 44.5 | 69M | 340G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_s_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_s_fpn_3x_coco.py) |
-| InternImage-B | Mask R-CNN | 1x | 48.8 | 44.0 | 115M | 501G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_b_fpn_1x_coco.py) |
-| InternImage-B | Mask R-CNN | 3x | 50.3 | 44.8 | 115M | 501G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/mask_rcnn_internimage_b_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/mask_rcnn_internimage_b_fpn_3x_coco.py) |
-| InternImage-L | Cascade | 1x | 54.9 | 47.7 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_1x_coco.py) |
-| InternImage-L | Cascade | 3x | 56.1 | 48.5 | 277M | 1399G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_l_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_l_fpn_3x_coco.py) |
-| InternImage-XL | Cascade | 1x | 55.3 | 48.1 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_1x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_1x_coco.py) |
-| InternImage-XL | Cascade | 3x | 56.2 | 48.8 | 387M | 1782G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/cascade_internimage_xl_fpn_3x_coco.pth) \| [cfg](detection/configs/coco/cascade_internimage_xl_fpn_3x_coco.py) |
-
-| backbone | method | box mAP (val/test) | #param | FLOPs | download |
-| :------------: | :----------------: | :---------: | :------: | :-----: | :---: |
-| InternImage-H | DINO (TTA) | 65.0 / 65.4 | 2.18B | TODO | TODO |
-| InternImage-G | DINO (TTA) | 65.3 / 65.5 | 3B | TODO | TODO |
-
-## ADE20K Semantic Segmentation
-
-
-| backbone | method | resolution | mIoU (ss/ms) | #param | FLOPs | download |
-| :------------: | :--------: | :--------: | :----------: | :-----: | :---: | :---: |
-| InternImage-T | UperNet | 512x512 | 47.9 / 48.1 | 59M | 944G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_t_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_t_512_160k_ade20k.py) |
-| InternImage-S | UperNet | 512x512 | 50.1 / 50.9 | 80M | 1017G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_s_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_s_512_160k_ade20k.py) |
-| InternImage-B | UperNet | 512x512 | 50.8 / 51.3 | 128M | 1185G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_b_512_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_b_512_160k_ade20k.py) |
-| InternImage-L | UperNet | 640x640 | 53.9 / 54.1 | 256M | 2526G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_l_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_l_640_160k_ade20k.py) |
-| InternImage-XL | UperNet | 640x640 | 55.0 / 55.3 | 368M | 3142G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_xl_640_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_xl_640_160k_ade20k.py) |
-| InternImage-H | UperNet | 896x896 | 59.9 / 60.3 | 1.12B | 3566G | [ckpt](https://huggingface.co/OpenGVLab/InternImage/resolve/main/upernet_internimage_h_896_160k_ade20k.pth) \| [cfg](segmentation/configs/ade20k/upernet_internimage_h_896_160k_ade20k.py) |
-| InternImage-H | Mask2Former | 896x896 | 62.5 / 62.9 | 1.31B | 4635G | TODO |
-
-
-## Main Results of FPS
-
-[[TensorRT]](classification/export.py)
-
-| name | resolution | #param | FLOPs | batch 1 FPS (TensorRT) |
-| :------------: | :--------: | :-----: | :---: | :-------------------: |
-| InternImage-T | 224x224 | 30M | 5G | 156 |
-| InternImage-S | 224x224 | 50M | 8G | 129 |
-| InternImage-B | 224x224 | 97M | 16G | 116 |
-| InternImage-L | 384x384 | 223M | 108G | 56 |
-| InternImage-XL | 384x384 | 335M | 163G | 47 |
-
-
-## Citation
-
-If this work is helpful for your research, please consider citing the following BibTeX entry.
-
-```
-@article{wang2022internimage,
- title={InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions},
- author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
- journal={arXiv preprint arXiv:2211.05778},
- year={2022}
-}
-
-@inproceedings{zhu2022uni,
- title={Uni-perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks},
- author={Zhu, Xizhou and Zhu, Jinguo and Li, Hao and Wu, Xiaoshi and Li, Hongsheng and Wang, Xiaohua and Dai, Jifeng},
- booktitle={CVPR},
- pages={16804--16815},
- year={2022}
-}
-
-@article{zhu2022uni,
- title={Uni-perceiver-moe: Learning sparse generalist models with conditional moes},
- author={Zhu, Jinguo and Zhu, Xizhou and Wang, Wenhai and Wang, Xiaohua and Li, Hongsheng and Wang, Xiaogang and Dai, Jifeng},
- journal={arXiv preprint arXiv:2206.04674},
- year={2022}
-}
-
-@article{li2022uni,
- title={Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks},
- author={Li, Hao and Zhu, Jinguo and Jiang, Xiaohu and Zhu, Xizhou and Li, Hongsheng and Yuan, Chun and Wang, Xiaohua and Qiao, Yu and Wang, Xiaogang and Wang, Wenhai and others},
- journal={arXiv preprint arXiv:2211.09808},
- year={2022}
-}
-
-@article{yang2022bevformer,
- title={BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision},
- author={Yang, Chenyu and Chen, Yuntao and Tian, Hao and Tao, Chenxin and Zhu, Xizhou and Zhang, Zhaoxiang and Huang, Gao and Li, Hongyang and Qiao, Yu and Lu, Lewei and others},
- journal={arXiv preprint arXiv:2211.10439},
- year={2022}
-}
-
-@article{su2022towards,
- title={Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information},
- author={Su, Weijie and Zhu, Xizhou and Tao, Chenxin and Lu, Lewei and Li, Bin and Huang, Gao and Qiao, Yu and Wang, Xiaogang and Zhou, Jie and Dai, Jifeng},
- journal={arXiv preprint arXiv:2211.09807},
- year={2022}
-}
-
-@inproceedings{li2022bevformer,
- title={Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers},
- author={Li, Zhiqi and Wang, Wenhai and Li, Hongyang and Xie, Enze and Sima, Chonghao and Lu, Tong and Qiao, Yu and Dai, Jifeng},
- booktitle={ECCV},
- pages={1--18},
- year={2022},
-}
-```
-
-
-

-
diff --git a/autonomous_driving/README.md b/autonomous_driving/README.md
new file mode 100644
index 00000000..2988a727
--- /dev/null
+++ b/autonomous_driving/README.md
@@ -0,0 +1,63 @@
+# End-to-end Autonomous Driving Challenge
+
+InternImage serves as the backbone of the baseline models for the CVPR 2023 [End-to-end Autonomous Driving Challenge](https://opendrivelab.com/AD23Challenge.html).
+
+Three tracks use InternImage as the backbone of their baseline models:
+
+1. [OpenLane-V2](https://github.com/OpenDriveLab/OpenLane-V2)
+
+   The primary task of this dataset is scene structure perception and reasoning, which requires the model to recognize the dynamic drivable states of lanes in the surrounding environment. The challenge includes not only detecting lane centerlines and traffic elements, but also recognizing the attributes of traffic elements and the topology relationships among the detected objects.
+
+2. [Online HD Map Construction](https://github.com/Tsinghua-MARS-Lab/Online-HD-Map-Construction-CVPR2023)
+
+   Constructing HD maps is a central component of autonomous driving. However, traditional mapping pipelines require a vast amount of human effort to annotate and maintain the map, which limits their scalability. The online HD map construction task aims to dynamically construct the local semantic map from onboard sensor observations. Compared with lane detection, the constructed HD map provides richer semantic information across multiple categories. A vectorized polyline representation is adopted to handle complicated and even irregular road structures.
+
+3. [3D Occupancy Prediction](https://github.com/CVPR2023-3D-Occupancy-Prediction/CVPR2023-3D-Occupancy-Prediction)
+
+   Understanding the 3D surroundings, including both background stuff and foreground objects, is important for autonomous driving. In the traditional 3D object detection task, a foreground object is represented by a 3D bounding box. However, the geometric shape of an object can be too complex to be captured by a simple 3D box, and the background is not perceived at all. The goal of this task is to predict the 3D occupancy of the scene. The track provides a large-scale occupancy benchmark based on the nuScenes dataset: the 3D space is voxelized, and the occupancy state and semantics of each voxel are estimated jointly (see the sketch after this list). The difficulty of the task lies in making dense predictions over the 3D space given only surround-view images.
+
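+Below is a minimal, illustrative sketch of the voxelized occupancy representation used in track 3. The grid size, the class count, and the `occupancy_iou` helper are assumptions made only for illustration; they are not part of the official benchmark code.
+
+```python
+import numpy as np
+
+GRID = (200, 200, 16)   # assumed (x, y, z) voxel resolution, for illustration
+NUM_CLASSES = 17        # assumed: 16 semantic classes + label 0 for free space
+
+def occupancy_iou(pred: np.ndarray, gt: np.ndarray) -> dict:
+    """Per-class IoU between a predicted and a ground-truth semantic voxel grid."""
+    ious = {}
+    for c in range(1, NUM_CLASSES):          # skip label 0 (free space)
+        inter = np.logical_and(pred == c, gt == c).sum()
+        union = np.logical_or(pred == c, gt == c).sum()
+        ious[c] = inter / union if union > 0 else float('nan')
+    return ious
+
+# toy example with random grids
+rng = np.random.default_rng(0)
+pred = rng.integers(0, NUM_CLASSES, GRID)
+gt = rng.integers(0, NUM_CLASSES, GRID)
+print(occupancy_iou(pred, gt))
+```
+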
+This folder contains the implementations for the autonomous driving challenge tracks, all of which use InternImage as a powerful backbone.
+
+## Usage
+
+### Install
+
+```
+TODO
+```
+
+### Data Preparation
+
+```
+TODO
+```
+
+### Evaluation
+
+```
+TODO
+```
+
+### Training from Scratch on XXXX-Dataset
+
+```
+TODO
+```
+
+### Manage Jobs with Slurm
+
+```
+TODO
+```
+
+### Test pretrained model on ImageNet-22K
+
+```
+TODO
+```
+
+### Export
+
+```
+TODO
+```
diff --git a/classification/README.md b/classification/README.md
index a42b3576..10c57e2e 100644
--- a/classification/README.md
+++ b/classification/README.md
@@ -50,7 +50,8 @@ sh ./make.sh
# unit test (should see all checking is True)
python test.py
```
-
+- You can also install the operator using the pre-built .whl files: [DCNv3-1.0-whl](https://github.com/OpenGVLab/InternImage/releases/tag/whl_files)
+
### Data Preparation
We use standard ImageNet dataset, you can download it from http://image-net.org/. We provide the following two ways to
diff --git a/detection/README.md b/detection/README.md
index c11bddc9..bb64498d 100644
--- a/detection/README.md
+++ b/detection/README.md
@@ -53,6 +53,9 @@ sh ./make.sh
# unit test (should see all checking is True)
python test.py
```
+- You can also install the operator using the pre-built .whl files: [DCNv3-1.0-whl](https://github.com/OpenGVLab/InternImage/releases/tag/whl_files)
+
### Data Preparation
@@ -100,3 +103,35 @@ For example, to train `InternImage-L` with 32 GPU on 4 node, run:
```bash
GPUS=32 sh slurm_train.sh configs/coco/cascade_internimage_xl_fpn_3x_coco.py work_dirs/cascade_internimage_xl_fpn_3x_coco
```
+
+### Export
+
+To export a detection model from PyTorch to TensorRT, run:
+```shell
+MODEL="model_name"
+CKPT_PATH="/path/to/model/ckpt.pth"
+
+python deploy.py \
+ "./deploy/configs/mmdet/instance-seg/instance-seg_tensorrt_dynamic-320x320-1344x1344.py" \
+ "./configs/coco/${MODEL}.py" \
+ "${CKPT_PATH}" \
+ "./deploy/demo.jpg" \
+ --work-dir "./work_dirs/mmdet/instance-seg/${MODEL}" \
+ --device cuda \
+ --dump-info
+```
+
+For example, to export `mask_rcnn_internimage_t_fpn_1x_coco` from PyTorch to TensorRT, run:
+```shell
+MODEL="mask_rcnn_internimage_t_fpn_1x_coco"
+CKPT_PATH="/path/to/model/ckpt/mask_rcnn_internimage_t_fpn_1x_coco.pth"
+
+python deploy.py \
+ "./deploy/configs/mmdet/instance-seg/instance-seg_tensorrt_dynamic-320x320-1344x1344.py" \
+ "./configs/coco/${MODEL}.py" \
+ "${CKPT_PATH}" \
+ "./deploy/demo.jpg" \
+ --work-dir "./work_dirs/mmdet/instance-seg/${MODEL}" \
+ --device cuda \
+ --dump-info
+```
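+
+After the export finishes, the TensorRT engine written to the work directory can be loaded back for a quick sanity check through mmdeploy's `inference_model` API. The snippet below is a minimal sketch; the `end2end.engine` file name follows mmdeploy's default conventions, and the paths assume the example command above, so adjust them to your actual setup.
+
+```python
+from mmdeploy.apis import inference_model
+
+# Paths below assume the example export command above; adjust as needed.
+result = inference_model(
+    model_cfg='./configs/coco/mask_rcnn_internimage_t_fpn_1x_coco.py',
+    deploy_cfg='./deploy/configs/mmdet/instance-seg/'
+               'instance-seg_tensorrt_dynamic-320x320-1344x1344.py',
+    backend_files=['./work_dirs/mmdet/instance-seg/'
+                   'mask_rcnn_internimage_t_fpn_1x_coco/end2end.engine'],
+    img='./deploy/demo.jpg',
+    device='cuda:0')
+```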
diff --git a/detection/deploy.py b/detection/deploy.py
new file mode 100644
index 00000000..6c527c4e
--- /dev/null
+++ b/detection/deploy.py
@@ -0,0 +1,310 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import argparse
+import logging
+import os
+import os.path as osp
+from functools import partial
+
+import mmcv
+import torch.multiprocessing as mp
+from torch.multiprocessing import Process, set_start_method
+
+from mmdeploy.apis import (create_calib_input_data, extract_model,
+ get_predefined_partition_cfg, torch2onnx,
+ torch2torchscript, visualize_model)
+from mmdeploy.apis.core import PIPELINE_MANAGER
+from mmdeploy.apis.utils import to_backend
+from mmdeploy.backend.sdk.export_info import export2SDK
+from mmdeploy.utils import (IR, Backend, get_backend, get_calib_filename,
+ get_ir_config, get_partition_config,
+ get_root_logger, load_config, target_wrapper)
+
+import mmcv_custom
+import mmdet_custom
+
+def parse_args():
+ parser = argparse.ArgumentParser(description='Export model to backends.')
+ parser.add_argument('deploy_cfg', help='deploy config path')
+ parser.add_argument('model_cfg', help='model config path')
+ parser.add_argument('checkpoint', help='model checkpoint path')
+    parser.add_argument('img', help='image used to convert the model')
+ parser.add_argument(
+ '--test-img',
+ default=None,
+ type=str,
+ nargs='+',
+ help='image used to test model')
+ parser.add_argument(
+ '--work-dir',
+ default=os.getcwd(),
+ help='the dir to save logs and models')
+ parser.add_argument(
+ '--calib-dataset-cfg',
+ help=('dataset config path used to calibrate in int8 mode. If not '
+ 'specified, it will use "val" dataset in model config instead.'),
+ default=None)
+ parser.add_argument(
+ '--device', help='device used for conversion', default='cpu')
+ parser.add_argument(
+ '--log-level',
+ help='set log level',
+ default='INFO',
+ choices=list(logging._nameToLevel.keys()))
+ parser.add_argument(
+ '--show', action='store_true', help='Show detection outputs')
+ parser.add_argument(
+ '--dump-info', action='store_true', help='Output information for SDK')
+ parser.add_argument(
+ '--quant-image-dir',
+ default=None,
+ help='Image directory for quantize model.')
+ parser.add_argument(
+ '--quant', action='store_true', help='Quantize model to low bit.')
+ parser.add_argument(
+ '--uri',
+ default='192.168.1.1:60000',
+ help='Remote ipv4:port or ipv6:port for inference on edge device.')
+ args = parser.parse_args()
+ return args
+
+
+def create_process(name, target, args, kwargs, ret_value=None):
+ logger = get_root_logger()
+ logger.info(f'{name} start.')
+ log_level = logger.level
+
+ wrap_func = partial(target_wrapper, target, log_level, ret_value)
+
+ process = Process(target=wrap_func, args=args, kwargs=kwargs)
+ process.start()
+ process.join()
+
+ if ret_value is not None:
+ if ret_value.value != 0:
+ logger.error(f'{name} failed.')
+ exit(1)
+ else:
+ logger.info(f'{name} success.')
+
+
+def torch2ir(ir_type: IR):
+ """Return the conversion function from torch to the intermediate
+ representation.
+
+ Args:
+ ir_type (IR): The type of the intermediate representation.
+ """
+ if ir_type == IR.ONNX:
+ return torch2onnx
+ elif ir_type == IR.TORCHSCRIPT:
+ return torch2torchscript
+ else:
+ raise KeyError(f'Unexpected IR type {ir_type}')
+
+
+def main():
+ args = parse_args()
+ set_start_method('spawn', force=True)
+ logger = get_root_logger()
+ log_level = logging.getLevelName(args.log_level)
+ logger.setLevel(log_level)
+
+ pipeline_funcs = [
+ torch2onnx, torch2torchscript, extract_model, create_calib_input_data
+ ]
+ PIPELINE_MANAGER.enable_multiprocess(True, pipeline_funcs)
+ PIPELINE_MANAGER.set_log_level(log_level, pipeline_funcs)
+
+ deploy_cfg_path = args.deploy_cfg
+ model_cfg_path = args.model_cfg
+ checkpoint_path = args.checkpoint
+ quant = args.quant
+ quant_image_dir = args.quant_image_dir
+
+ # load deploy_cfg
+ deploy_cfg, model_cfg = load_config(deploy_cfg_path, model_cfg_path)
+
+    # create work_dir if it does not exist
+ mmcv.mkdir_or_exist(osp.abspath(args.work_dir))
+
+ if args.dump_info:
+ export2SDK(
+ deploy_cfg,
+ model_cfg,
+ args.work_dir,
+ pth=checkpoint_path,
+ device=args.device)
+
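+    # shared status value used to report failures from the child processes
+    # launched via create_process() below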
+ ret_value = mp.Value('d', 0, lock=False)
+
+ # convert to IR
+ ir_config = get_ir_config(deploy_cfg)
+ ir_save_file = ir_config['save_file']
+ ir_type = IR.get(ir_config['type'])
+ torch2ir(ir_type)(
+ args.img,
+ args.work_dir,
+ ir_save_file,
+ deploy_cfg_path,
+ model_cfg_path,
+ checkpoint_path,
+ device=args.device)
+
+ # convert backend
+ ir_files = [osp.join(args.work_dir, ir_save_file)]
+
+ # partition model
+ partition_cfgs = get_partition_config(deploy_cfg)
+
+ if partition_cfgs is not None:
+
+ if 'partition_cfg' in partition_cfgs:
+ partition_cfgs = partition_cfgs.get('partition_cfg', None)
+ else:
+ assert 'type' in partition_cfgs
+ partition_cfgs = get_predefined_partition_cfg(
+ deploy_cfg, partition_cfgs['type'])
+
+ origin_ir_file = ir_files[0]
+ ir_files = []
+ for partition_cfg in partition_cfgs:
+ save_file = partition_cfg['save_file']
+ save_path = osp.join(args.work_dir, save_file)
+ start = partition_cfg['start']
+ end = partition_cfg['end']
+ dynamic_axes = partition_cfg.get('dynamic_axes', None)
+
+ extract_model(
+ origin_ir_file,
+ start,
+ end,
+ dynamic_axes=dynamic_axes,
+ save_file=save_path)
+
+ ir_files.append(save_path)
+
+ # calib data
+ calib_filename = get_calib_filename(deploy_cfg)
+ if calib_filename is not None:
+ calib_path = osp.join(args.work_dir, calib_filename)
+ create_calib_input_data(
+ calib_path,
+ deploy_cfg_path,
+ model_cfg_path,
+ checkpoint_path,
+ dataset_cfg=args.calib_dataset_cfg,
+ dataset_type='val',
+ device=args.device)
+
+ backend_files = ir_files
+ # convert backend
+ backend = get_backend(deploy_cfg)
+
+ # preprocess deploy_cfg
+ if backend == Backend.RKNN:
+ # TODO: Add this to task_processor in the future
+ import tempfile
+
+ from mmdeploy.utils import (get_common_config, get_normalization,
+ get_quantization_config,
+ get_rknn_quantization)
+ quantization_cfg = get_quantization_config(deploy_cfg)
+ common_params = get_common_config(deploy_cfg)
+ if get_rknn_quantization(deploy_cfg) is True:
+ transform = get_normalization(model_cfg)
+ common_params.update(
+ dict(
+ mean_values=[transform['mean']],
+ std_values=[transform['std']]))
+
+ dataset_file = tempfile.NamedTemporaryFile(suffix='.txt').name
+ with open(dataset_file, 'w') as f:
+ f.writelines([osp.abspath(args.img)])
+ quantization_cfg.setdefault('dataset', dataset_file)
+ if backend == Backend.ASCEND:
+ # TODO: Add this to backend manager in the future
+ if args.dump_info:
+ from mmdeploy.backend.ascend import update_sdk_pipeline
+ update_sdk_pipeline(args.work_dir)
+
+ # convert to backend
+ PIPELINE_MANAGER.set_log_level(log_level, [to_backend])
+ if backend == Backend.TENSORRT:
+ PIPELINE_MANAGER.enable_multiprocess(True, [to_backend])
+ backend_files = to_backend(
+ backend,
+ ir_files,
+ work_dir=args.work_dir,
+ deploy_cfg=deploy_cfg,
+ log_level=log_level,
+ device=args.device,
+ uri=args.uri)
+
+ # ncnn quantization
+ if backend == Backend.NCNN and quant:
+ from onnx2ncnn_quant_table import get_table
+
+ from mmdeploy.apis.ncnn import get_quant_model_file, ncnn2int8
+ model_param_paths = backend_files[::2]
+ model_bin_paths = backend_files[1::2]
+ backend_files = []
+ for onnx_path, model_param_path, model_bin_path in zip(
+ ir_files, model_param_paths, model_bin_paths):
+
+ deploy_cfg, model_cfg = load_config(deploy_cfg_path,
+ model_cfg_path)
+ quant_onnx, quant_table, quant_param, quant_bin = get_quant_model_file( # noqa: E501
+ onnx_path, args.work_dir)
+
+ create_process(
+ 'ncnn quant table',
+ target=get_table,
+ args=(onnx_path, deploy_cfg, model_cfg, quant_onnx,
+ quant_table, quant_image_dir, args.device),
+ kwargs=dict(),
+ ret_value=ret_value)
+
+ create_process(
+ 'ncnn_int8',
+ target=ncnn2int8,
+ args=(model_param_path, model_bin_path, quant_table,
+ quant_param, quant_bin),
+ kwargs=dict(),
+ ret_value=ret_value)
+ backend_files += [quant_param, quant_bin]
+
+ if args.test_img is None:
+ args.test_img = args.img
+
+ extra = dict(
+ backend=backend,
+ output_file=osp.join(args.work_dir, f'output_{backend.value}.jpg'),
+ show_result=args.show)
+ if backend == Backend.SNPE:
+ extra['uri'] = args.uri
+
+ # get backend inference result, try render
+ create_process(
+ f'visualize {backend.value} model',
+ target=visualize_model,
+ args=(model_cfg_path, deploy_cfg_path, backend_files, args.test_img,
+ args.device),
+ kwargs=extra,
+ ret_value=ret_value)
+
+ # get pytorch model inference result, try visualize if possible
+ create_process(
+ 'visualize pytorch model',
+ target=visualize_model,
+ args=(model_cfg_path, deploy_cfg_path, [checkpoint_path],
+ args.test_img, args.device),
+ kwargs=dict(
+ backend=Backend.PYTORCH,
+ output_file=osp.join(args.work_dir, 'output_pytorch.jpg'),
+ show_result=args.show),
+ ret_value=ret_value)
+ logger.info('All process success.')
+
+
+if __name__ == '__main__':
+ main()
diff --git a/detection/deploy/configs/_base_/backends/tensorrt-fp16.py b/detection/deploy/configs/_base_/backends/tensorrt-fp16.py
new file mode 100644
index 00000000..347cc2a7
--- /dev/null
+++ b/detection/deploy/configs/_base_/backends/tensorrt-fp16.py
@@ -0,0 +1,2 @@
+backend_config = dict(
+ type='tensorrt', common_config=dict(fp16_mode=True, max_workspace_size=0))
diff --git a/detection/deploy/configs/_base_/backends/tensorrt.py b/detection/deploy/configs/_base_/backends/tensorrt.py
new file mode 100644
index 00000000..a7f4c569
--- /dev/null
+++ b/detection/deploy/configs/_base_/backends/tensorrt.py
@@ -0,0 +1,2 @@
+backend_config = dict(
+ type='tensorrt', common_config=dict(fp16_mode=False, max_workspace_size=0))
diff --git a/detection/deploy/configs/_base_/onnx_config.py b/detection/deploy/configs/_base_/onnx_config.py
new file mode 100644
index 00000000..43621b12
--- /dev/null
+++ b/detection/deploy/configs/_base_/onnx_config.py
@@ -0,0 +1,10 @@
+onnx_config = dict(
+ type='onnx',
+ export_params=True,
+ keep_initializers_as_inputs=False,
+ opset_version=11,
+ save_file='end2end.onnx',
+ input_names=['input'],
+ output_names=['output'],
+ input_shape=None,
+ optimize=True)
diff --git a/detection/deploy/configs/mmdet/_base_/base_dynamic.py b/detection/deploy/configs/mmdet/_base_/base_dynamic.py
new file mode 100644
index 00000000..497db262
--- /dev/null
+++ b/detection/deploy/configs/mmdet/_base_/base_dynamic.py
@@ -0,0 +1,17 @@
+_base_ = ['./base_static.py']
+onnx_config = dict(
+ dynamic_axes={
+ 'input': {
+ 0: 'batch',
+ 2: 'height',
+ 3: 'width'
+ },
+ 'dets': {
+ 0: 'batch',
+ 1: 'num_dets',
+ },
+ 'labels': {
+ 0: 'batch',
+ 1: 'num_dets',
+ },
+ }, )
diff --git a/detection/deploy/configs/mmdet/_base_/base_instance-seg_dynamic.py b/detection/deploy/configs/mmdet/_base_/base_instance-seg_dynamic.py
new file mode 100644
index 00000000..69a30f7f
--- /dev/null
+++ b/detection/deploy/configs/mmdet/_base_/base_instance-seg_dynamic.py
@@ -0,0 +1,23 @@
+_base_ = ['./base_instance-seg_static.py']
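+# mark the batch size, spatial sizes, and number of detections as dynamic ONNX axes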
+onnx_config = dict(
+ dynamic_axes={
+ 'input': {
+ 0: 'batch',
+ 2: 'height',
+ 3: 'width'
+ },
+ 'dets': {
+ 0: 'batch',
+ 1: 'num_dets',
+ },
+ 'labels': {
+ 0: 'batch',
+ 1: 'num_dets',
+ },
+ 'masks': {
+ 0: 'batch',
+ 1: 'num_dets',
+ 2: 'height',
+ 3: 'width'
+ },
+ })
diff --git a/detection/deploy/configs/mmdet/_base_/base_instance-seg_static.py b/detection/deploy/configs/mmdet/_base_/base_instance-seg_static.py
new file mode 100644
index 00000000..db33f0e5
--- /dev/null
+++ b/detection/deploy/configs/mmdet/_base_/base_instance-seg_static.py
@@ -0,0 +1,4 @@
+_base_ = ['./base_static.py']
+
+onnx_config = dict(output_names=['dets', 'labels', 'masks'])
+codebase_config = dict(post_processing=dict(export_postprocess_mask=False))
diff --git a/detection/deploy/configs/mmdet/_base_/base_static.py b/detection/deploy/configs/mmdet/_base_/base_static.py
new file mode 100644
index 00000000..9fe0d343
--- /dev/null
+++ b/detection/deploy/configs/mmdet/_base_/base_static.py
@@ -0,0 +1,16 @@
+_base_ = ['../../_base_/onnx_config.py']
+
+onnx_config = dict(output_names=['dets', 'labels'], input_shape=None)
+codebase_config = dict(
+ type='mmdet',
+ task='ObjectDetection',
+ model_type='end2end',
+ post_processing=dict(
+ score_threshold=0.05,
+ confidence_threshold=0.005, # for YOLOv3
+ iou_threshold=0.5,
+ max_output_boxes_per_class=200,
+ pre_top_k=5000,
+ keep_top_k=100,
+ background_label_id=-1,
+ ))
diff --git a/detection/deploy/configs/mmdet/instance-seg/instance-seg_tensorrt_dynamic-320x320-1344x1344.py b/detection/deploy/configs/mmdet/instance-seg/instance-seg_tensorrt_dynamic-320x320-1344x1344.py
new file mode 100644
index 00000000..9d4d0559
--- /dev/null
+++ b/detection/deploy/configs/mmdet/instance-seg/instance-seg_tensorrt_dynamic-320x320-1344x1344.py
@@ -0,0 +1,15 @@
+_base_ = [
+ '../_base_/base_instance-seg_dynamic.py',
+ '../../_base_/backends/tensorrt.py'
+]
+
+backend_config = dict(
+ common_config=dict(max_workspace_size=1 << 30),
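+    # TensorRT optimization profile: the engine accepts inputs between
+    # min_shape and max_shape and is tuned for opt_shape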
+ model_inputs=[
+ dict(
+ input_shapes=dict(
+ input=dict(
+ min_shape=[1, 3, 320, 320],
+ opt_shape=[1, 3, 800, 1344],
+ max_shape=[1, 3, 1344, 1344])))
+ ])
diff --git a/detection/deploy/demo.jpg b/detection/deploy/demo.jpg
new file mode 100644
index 00000000..dd613cee
Binary files /dev/null and b/detection/deploy/demo.jpg differ
diff --git a/detection/ops_dcnv3/functions/dcnv3_func.py b/detection/ops_dcnv3/functions/dcnv3_func.py
index 433bd0ff..4dac8fbd 100644
--- a/detection/ops_dcnv3/functions/dcnv3_func.py
+++ b/detection/ops_dcnv3/functions/dcnv3_func.py
@@ -60,6 +60,33 @@ def backward(ctx, grad_output):
return grad_input, grad_offset, grad_mask, \
None, None, None, None, None, None, None, None, None, None, None, None
+ @staticmethod
+ def symbolic(g, input, offset, mask, kernel_h, kernel_w, stride_h,
+ stride_w, pad_h, pad_w, dilation_h, dilation_w, group,
+ group_channels, offset_scale, im2col_step):
+ """Symbolic function for mmdeploy::DCNv3.
+
+ Returns:
+ DCNv3 op for onnx.
+ """
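+        # Following the torch.onnx symbolic convention, the `_i` suffix marks
+        # an integer attribute and `_f` a float attribute on the exported node.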
+ return g.op(
+ 'mmdeploy::TRTDCNv3',
+ input,
+ offset,
+ mask,
+ kernel_h_i=int(kernel_h),
+ kernel_w_i=int(kernel_w),
+ stride_h_i=int(stride_h),
+ stride_w_i=int(stride_w),
+ pad_h_i=int(pad_h),
+ pad_w_i=int(pad_w),
+ dilation_h_i=int(dilation_h),
+ dilation_w_i=int(dilation_w),
+ group_i=int(group),
+ group_channels_i=int(group_channels),
+ offset_scale_f=float(offset_scale),
+ im2col_step_i=int(im2col_step),
+ )
def _get_reference_points(spatial_shapes, device, kernel_h, kernel_w, dilation_h, dilation_w, pad_h=0, pad_w=0, stride_h=1, stride_w=1):
_, H_, W_, _ = spatial_shapes
diff --git a/segmentation/README.md b/segmentation/README.md
index d7cdecf2..68ca5e9c 100644
--- a/segmentation/README.md
+++ b/segmentation/README.md
@@ -52,6 +52,8 @@ sh ./make.sh
# unit test (should see all checking is True)
python test.py
```
+- You can also install the operator using the pre-built .whl files: [DCNv3-1.0-whl](https://github.com/OpenGVLab/InternImage/releases/tag/whl_files)
+
### Data Preparation
@@ -109,3 +111,35 @@ CUDA_VISIBLE_DEVICES=0 python image_demo.py \
checkpoint_dir/seg/upernet_internimage_t_512_160k_ade20k.pth \
--palette ade20k
```
+
+### Export
+
+To export a segmentation model from PyTorch to TensorRT, run:
+```shell
+MODEL="model_name"
+CKPT_PATH="/path/to/model/ckpt.pth"
+
+python deploy.py \
+ "./deploy/configs/mmseg/segmentation_tensorrt_static-512x512.py" \
+ "./configs/ade20k/${MODEL}.py" \
+ "${CKPT_PATH}" \
+ "./deploy/demo.png" \
+ --work-dir "./work_dirs/mmseg/${MODEL}" \
+ --device cuda \
+ --dump-info
+```
+
+For example, to export `upernet_internimage_t_512_160k_ade20k` from PyTorch to TensorRT, run:
+```shell
+MODEL="upernet_internimage_t_512_160k_ade20k"
+CKPT_PATH="/path/to/model/ckpt/upernet_internimage_t_512_160k_ade20k.pth"
+
+python deploy.py \
+ "./deploy/configs/mmseg/segmentation_tensorrt_static-512x512.py" \
+ "./configs/ade20k/${MODEL}.py" \
+ "${CKPT_PATH}" \
+ "./deploy/demo.png" \
+ --work-dir "./work_dirs/mmseg/${MODEL}" \
+ --device cuda \
+ --dump-info
+```
diff --git a/segmentation/deploy.py b/segmentation/deploy.py
new file mode 100644
index 00000000..448eb2d1
--- /dev/null
+++ b/segmentation/deploy.py
@@ -0,0 +1,310 @@
+# Copyright (c) OpenMMLab. All rights reserved.
+import argparse
+import logging
+import os
+import os.path as osp
+from functools import partial
+
+import mmcv
+import torch.multiprocessing as mp
+from torch.multiprocessing import Process, set_start_method
+
+from mmdeploy.apis import (create_calib_input_data, extract_model,
+ get_predefined_partition_cfg, torch2onnx,
+ torch2torchscript, visualize_model)
+from mmdeploy.apis.core import PIPELINE_MANAGER
+from mmdeploy.apis.utils import to_backend
+from mmdeploy.backend.sdk.export_info import export2SDK
+from mmdeploy.utils import (IR, Backend, get_backend, get_calib_filename,
+ get_ir_config, get_partition_config,
+ get_root_logger, load_config, target_wrapper)
+
+import mmcv_custom
+import mmseg_custom
+
+def parse_args():
+ parser = argparse.ArgumentParser(description='Export model to backends.')
+ parser.add_argument('deploy_cfg', help='deploy config path')
+ parser.add_argument('model_cfg', help='model config path')
+ parser.add_argument('checkpoint', help='model checkpoint path')
+    parser.add_argument('img', help='image used to convert the model')
+ parser.add_argument(
+ '--test-img',
+ default=None,
+ type=str,
+ nargs='+',
+ help='image used to test model')
+ parser.add_argument(
+ '--work-dir',
+ default=os.getcwd(),
+ help='the dir to save logs and models')
+ parser.add_argument(
+ '--calib-dataset-cfg',
+ help=('dataset config path used to calibrate in int8 mode. If not '
+ 'specified, it will use "val" dataset in model config instead.'),
+ default=None)
+ parser.add_argument(
+ '--device', help='device used for conversion', default='cpu')
+ parser.add_argument(
+ '--log-level',
+ help='set log level',
+ default='INFO',
+ choices=list(logging._nameToLevel.keys()))
+ parser.add_argument(
+        '--show', action='store_true', help='Show segmentation outputs')
+ parser.add_argument(
+ '--dump-info', action='store_true', help='Output information for SDK')
+ parser.add_argument(
+ '--quant-image-dir',
+ default=None,
+ help='Image directory for quantize model.')
+ parser.add_argument(
+ '--quant', action='store_true', help='Quantize model to low bit.')
+ parser.add_argument(
+ '--uri',
+ default='192.168.1.1:60000',
+ help='Remote ipv4:port or ipv6:port for inference on edge device.')
+ args = parser.parse_args()
+ return args
+
+
+def create_process(name, target, args, kwargs, ret_value=None):
+ logger = get_root_logger()
+ logger.info(f'{name} start.')
+ log_level = logger.level
+
+ wrap_func = partial(target_wrapper, target, log_level, ret_value)
+
+ process = Process(target=wrap_func, args=args, kwargs=kwargs)
+ process.start()
+ process.join()
+
+ if ret_value is not None:
+ if ret_value.value != 0:
+ logger.error(f'{name} failed.')
+ exit(1)
+ else:
+ logger.info(f'{name} success.')
+
+
+def torch2ir(ir_type: IR):
+ """Return the conversion function from torch to the intermediate
+ representation.
+
+ Args:
+ ir_type (IR): The type of the intermediate representation.
+ """
+ if ir_type == IR.ONNX:
+ return torch2onnx
+ elif ir_type == IR.TORCHSCRIPT:
+ return torch2torchscript
+ else:
+ raise KeyError(f'Unexpected IR type {ir_type}')
+
+
+def main():
+ args = parse_args()
+ set_start_method('spawn', force=True)
+ logger = get_root_logger()
+ log_level = logging.getLevelName(args.log_level)
+ logger.setLevel(log_level)
+
+ pipeline_funcs = [
+ torch2onnx, torch2torchscript, extract_model, create_calib_input_data
+ ]
+ PIPELINE_MANAGER.enable_multiprocess(True, pipeline_funcs)
+ PIPELINE_MANAGER.set_log_level(log_level, pipeline_funcs)
+
+ deploy_cfg_path = args.deploy_cfg
+ model_cfg_path = args.model_cfg
+ checkpoint_path = args.checkpoint
+ quant = args.quant
+ quant_image_dir = args.quant_image_dir
+
+ # load deploy_cfg
+ deploy_cfg, model_cfg = load_config(deploy_cfg_path, model_cfg_path)
+
+    # create work_dir if it does not exist
+ mmcv.mkdir_or_exist(osp.abspath(args.work_dir))
+
+ if args.dump_info:
+ export2SDK(
+ deploy_cfg,
+ model_cfg,
+ args.work_dir,
+ pth=checkpoint_path,
+ device=args.device)
+
+ ret_value = mp.Value('d', 0, lock=False)
+
+ # convert to IR
+ ir_config = get_ir_config(deploy_cfg)
+ ir_save_file = ir_config['save_file']
+ ir_type = IR.get(ir_config['type'])
+ torch2ir(ir_type)(
+ args.img,
+ args.work_dir,
+ ir_save_file,
+ deploy_cfg_path,
+ model_cfg_path,
+ checkpoint_path,
+ device=args.device)
+
+ # convert backend
+ ir_files = [osp.join(args.work_dir, ir_save_file)]
+
+ # partition model
+ partition_cfgs = get_partition_config(deploy_cfg)
+
+ if partition_cfgs is not None:
+
+ if 'partition_cfg' in partition_cfgs:
+ partition_cfgs = partition_cfgs.get('partition_cfg', None)
+ else:
+ assert 'type' in partition_cfgs
+ partition_cfgs = get_predefined_partition_cfg(
+ deploy_cfg, partition_cfgs['type'])
+
+ origin_ir_file = ir_files[0]
+ ir_files = []
+ for partition_cfg in partition_cfgs:
+ save_file = partition_cfg['save_file']
+ save_path = osp.join(args.work_dir, save_file)
+ start = partition_cfg['start']
+ end = partition_cfg['end']
+ dynamic_axes = partition_cfg.get('dynamic_axes', None)
+
+ extract_model(
+ origin_ir_file,
+ start,
+ end,
+ dynamic_axes=dynamic_axes,
+ save_file=save_path)
+
+ ir_files.append(save_path)
+
+ # calib data
+ calib_filename = get_calib_filename(deploy_cfg)
+ if calib_filename is not None:
+ calib_path = osp.join(args.work_dir, calib_filename)
+ create_calib_input_data(
+ calib_path,
+ deploy_cfg_path,
+ model_cfg_path,
+ checkpoint_path,
+ dataset_cfg=args.calib_dataset_cfg,
+ dataset_type='val',
+ device=args.device)
+
+ backend_files = ir_files
+ # convert backend
+ backend = get_backend(deploy_cfg)
+
+ # preprocess deploy_cfg
+ if backend == Backend.RKNN:
+ # TODO: Add this to task_processor in the future
+ import tempfile
+
+ from mmdeploy.utils import (get_common_config, get_normalization,
+ get_quantization_config,
+ get_rknn_quantization)
+ quantization_cfg = get_quantization_config(deploy_cfg)
+ common_params = get_common_config(deploy_cfg)
+ if get_rknn_quantization(deploy_cfg) is True:
+ transform = get_normalization(model_cfg)
+ common_params.update(
+ dict(
+ mean_values=[transform['mean']],
+ std_values=[transform['std']]))
+
+ dataset_file = tempfile.NamedTemporaryFile(suffix='.txt').name
+ with open(dataset_file, 'w') as f:
+ f.writelines([osp.abspath(args.img)])
+ quantization_cfg.setdefault('dataset', dataset_file)
+ if backend == Backend.ASCEND:
+ # TODO: Add this to backend manager in the future
+ if args.dump_info:
+ from mmdeploy.backend.ascend import update_sdk_pipeline
+ update_sdk_pipeline(args.work_dir)
+
+ # convert to backend
+ PIPELINE_MANAGER.set_log_level(log_level, [to_backend])
+ if backend == Backend.TENSORRT:
+ PIPELINE_MANAGER.enable_multiprocess(True, [to_backend])
+ backend_files = to_backend(
+ backend,
+ ir_files,
+ work_dir=args.work_dir,
+ deploy_cfg=deploy_cfg,
+ log_level=log_level,
+ device=args.device,
+ uri=args.uri)
+
+ # ncnn quantization
+ if backend == Backend.NCNN and quant:
+ from onnx2ncnn_quant_table import get_table
+
+ from mmdeploy.apis.ncnn import get_quant_model_file, ncnn2int8
+ model_param_paths = backend_files[::2]
+ model_bin_paths = backend_files[1::2]
+ backend_files = []
+ for onnx_path, model_param_path, model_bin_path in zip(
+ ir_files, model_param_paths, model_bin_paths):
+
+ deploy_cfg, model_cfg = load_config(deploy_cfg_path,
+ model_cfg_path)
+ quant_onnx, quant_table, quant_param, quant_bin = get_quant_model_file( # noqa: E501
+ onnx_path, args.work_dir)
+
+ create_process(
+ 'ncnn quant table',
+ target=get_table,
+ args=(onnx_path, deploy_cfg, model_cfg, quant_onnx,
+ quant_table, quant_image_dir, args.device),
+ kwargs=dict(),
+ ret_value=ret_value)
+
+ create_process(
+ 'ncnn_int8',
+ target=ncnn2int8,
+ args=(model_param_path, model_bin_path, quant_table,
+ quant_param, quant_bin),
+ kwargs=dict(),
+ ret_value=ret_value)
+ backend_files += [quant_param, quant_bin]
+
+ if args.test_img is None:
+ args.test_img = args.img
+
+ extra = dict(
+ backend=backend,
+ output_file=osp.join(args.work_dir, f'output_{backend.value}.jpg'),
+ show_result=args.show)
+ if backend == Backend.SNPE:
+ extra['uri'] = args.uri
+
+ # get backend inference result, try render
+ create_process(
+ f'visualize {backend.value} model',
+ target=visualize_model,
+ args=(model_cfg_path, deploy_cfg_path, backend_files, args.test_img,
+ args.device),
+ kwargs=extra,
+ ret_value=ret_value)
+
+ # get pytorch model inference result, try visualize if possible
+ create_process(
+ 'visualize pytorch model',
+ target=visualize_model,
+ args=(model_cfg_path, deploy_cfg_path, [checkpoint_path],
+ args.test_img, args.device),
+ kwargs=dict(
+ backend=Backend.PYTORCH,
+ output_file=osp.join(args.work_dir, 'output_pytorch.jpg'),
+ show_result=args.show),
+ ret_value=ret_value)
+ logger.info('All process success.')
+
+
+if __name__ == '__main__':
+ main()
diff --git a/segmentation/deploy/configs/_base_/backends/tensorrt.py b/segmentation/deploy/configs/_base_/backends/tensorrt.py
new file mode 100644
index 00000000..a7f4c569
--- /dev/null
+++ b/segmentation/deploy/configs/_base_/backends/tensorrt.py
@@ -0,0 +1,2 @@
+backend_config = dict(
+ type='tensorrt', common_config=dict(fp16_mode=False, max_workspace_size=0))
diff --git a/segmentation/deploy/configs/_base_/onnx_config.py b/segmentation/deploy/configs/_base_/onnx_config.py
new file mode 100644
index 00000000..43621b12
--- /dev/null
+++ b/segmentation/deploy/configs/_base_/onnx_config.py
@@ -0,0 +1,10 @@
+onnx_config = dict(
+ type='onnx',
+ export_params=True,
+ keep_initializers_as_inputs=False,
+ opset_version=11,
+ save_file='end2end.onnx',
+ input_names=['input'],
+ output_names=['output'],
+ input_shape=None,
+ optimize=True)
diff --git a/segmentation/deploy/configs/mmseg/segmentation_static.py b/segmentation/deploy/configs/mmseg/segmentation_static.py
new file mode 100644
index 00000000..416b781a
--- /dev/null
+++ b/segmentation/deploy/configs/mmseg/segmentation_static.py
@@ -0,0 +1,2 @@
+_base_ = ['../_base_/onnx_config.py']
+codebase_config = dict(type='mmseg', task='Segmentation', with_argmax=True)
diff --git a/segmentation/deploy/configs/mmseg/segmentation_tensorrt_static-512x512.py b/segmentation/deploy/configs/mmseg/segmentation_tensorrt_static-512x512.py
new file mode 100644
index 00000000..1fa5ef66
--- /dev/null
+++ b/segmentation/deploy/configs/mmseg/segmentation_tensorrt_static-512x512.py
@@ -0,0 +1,13 @@
+_base_ = ['./segmentation_static.py', '../_base_/backends/tensorrt.py']
+
+onnx_config = dict(input_shape=[512, 512])
+backend_config = dict(
+ common_config=dict(max_workspace_size=1 << 30),
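+    # static 512x512 input: min, opt, and max shapes are identical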
+ model_inputs=[
+ dict(
+ input_shapes=dict(
+ input=dict(
+ min_shape=[1, 3, 512, 512],
+ opt_shape=[1, 3, 512, 512],
+ max_shape=[1, 3, 512, 512])))
+ ])
diff --git a/segmentation/deploy/demo.png b/segmentation/deploy/demo.png
new file mode 100644
index 00000000..1e82d7a0
Binary files /dev/null and b/segmentation/deploy/demo.png differ
diff --git a/segmentation/ops_dcnv3/functions/dcnv3_func.py b/segmentation/ops_dcnv3/functions/dcnv3_func.py
index 433bd0ff..4dac8fbd 100644
--- a/segmentation/ops_dcnv3/functions/dcnv3_func.py
+++ b/segmentation/ops_dcnv3/functions/dcnv3_func.py
@@ -60,6 +60,33 @@ def backward(ctx, grad_output):
return grad_input, grad_offset, grad_mask, \
None, None, None, None, None, None, None, None, None, None, None, None
+ @staticmethod
+ def symbolic(g, input, offset, mask, kernel_h, kernel_w, stride_h,
+ stride_w, pad_h, pad_w, dilation_h, dilation_w, group,
+ group_channels, offset_scale, im2col_step):
+ """Symbolic function for mmdeploy::DCNv3.
+
+ Returns:
+ DCNv3 op for onnx.
+ """
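+        # Following the torch.onnx symbolic convention, the `_i` suffix marks
+        # an integer attribute and `_f` a float attribute on the exported node.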
+ return g.op(
+ 'mmdeploy::TRTDCNv3',
+ input,
+ offset,
+ mask,
+ kernel_h_i=int(kernel_h),
+ kernel_w_i=int(kernel_w),
+ stride_h_i=int(stride_h),
+ stride_w_i=int(stride_w),
+ pad_h_i=int(pad_h),
+ pad_w_i=int(pad_w),
+ dilation_h_i=int(dilation_h),
+ dilation_w_i=int(dilation_w),
+ group_i=int(group),
+ group_channels_i=int(group_channels),
+ offset_scale_f=float(offset_scale),
+ im2col_step_i=int(im2col_step),
+ )
def _get_reference_points(spatial_shapes, device, kernel_h, kernel_w, dilation_h, dilation_w, pad_h=0, pad_w=0, stride_h=1, stride_w=1):
_, H_, W_, _ = spatial_shapes