Lists (20)
⭐ ASR
⭐ Captioning
⭐ Labels
⭐ Latest MLLM
⭐ LLM
⭐ Project & Design
⭐ Region Understanding
⭐ Spatial Annotations
⭐ Spatio-Temporal Grounding
⭐ TAL
⭐ Temporal Annotations
⭐ Tool
⭐ Tracking
⭐ VLLM Experiment
⭐ VLLM Survey
⭐ VLM
⭐ VLoom Spatial Only
⭐ VLoom Spatial & Temporal
⭐ VLoom Temporal Only
⭐ VTG
Stars
Self-hinting RL increases the usage rate of hard prompts and improves LLM performance.
[ICCV 2025] Code for "SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning"
From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems
[CVPR 2026] FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3.5, DeepSeek-R1, GLM-5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, Phi4, ...)…
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model
[CVPR-26] Official repository of "CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization"
Official Implementation of "Learning Accurate Segmentation Purely from Self-Supervision"
[CVPR 2026] FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding
ASID-Caption: Attribute-Structured and Quality-Verified Audiovisual Instruction Dataset and Training Pipeline for Fine-Grained Video Understanding.
Efficient Triton Kernels for LLM Training
D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning
Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions
[CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
ChronusOmni: Improving Time Awareness of Omni Large Language Models
arXiv LaTeX Cleaner: Easily clean the LaTeX code of your paper to submit to arXiv
VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
[AAAI 2026] OwlCap: A motion-detail balanced video captioning MLLM.
The Source Code for OmniVideoBench @ICLR 2026
EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning [🔥The Exploration of R1 for General Audio-Visual Reasoning with Qwen2.5-Omni]
SGLang is a high-performance serving framework for large language models and multimodal models.
Official repo and evaluation implementation of VSI-Bench
Structured Video Comprehension of Real-World Shorts
Official repo for paper "HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies"