Lists (20)
⭐ ASR
⭐ Captioning
⭐ Labels
⭐ Latest MLLM
⭐ LLM
⭐ Project & Design
⭐ Region Understanding
⭐ Spatial Annotations
⭐ Spatio-Temporal Grounding
⭐ TAL
⭐ Temporal Annotations
⭐ Tool
⭐ Tracking
⭐ VLLM Experiment
⭐ VLLM Survey
⭐ VLM
⭐ VLoom Spatial Only
⭐ VLoom Spatial & Temporal
⭐ VLoom Temporal Only
⭐ VTG
Stars
Self-hinting RL increases the usage rate of hard prompts and improves LLM performance.
[ICCV 2025] Code for "SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning"
From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems
[CVPR 2026] FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3.5, DeepSeek-R1, GLM-5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, Phi4, ...)…
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model
[CVPR-26] Official repository of "CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization"
Official Implementation of "Learning Accurate Segmentation Purely from Self-Supervision"
[CVPR 2026] FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding
ASID-Caption: Attribute-Structured and Quality-Verified Audiovisual Instruction Dataset and Training Pipeline for Fine-Grained Video Understanding.
Efficient Triton Kernels for LLM Training
D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning
Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions
[CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
ChronusOmni: Improving Time Awareness of Omni Large Language Models
arXiv LaTeX Cleaner: Easily clean the LaTeX code of your paper to submit to arXiv
VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
[AAAI 2026] OwlCap: A motion-detail balanced video captioning MLLM.
The Source Code for OmniVideoBench @ICLR 2026
EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning [🔥The Exploration of R1 for General Audio-Visual Reasoning with Qwen2.5-Omni]
SGLang is a high-performance serving framework for large language models and multimodal models.
Official repo and evaluation implementation of VSI-Bench
Structured Video Comprehension of Real-World Shorts
Official repo for paper "HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies"