# Video-MMLU

Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark.

Video-MMLU specifically targets videos focused on theorem demonstrations and problem-solving, covering mathematics, physics, and chemistry. These videos deliver dense information through numbers and formulas, posing significant challenges for video LMMs in dynamic OCR recognition and comprehension.

Imagine a classroom where a large multimodal model is the student and Video-MMLU acts as the teacher. Video-MMLU evaluates whether the student can perceive and comprehend multi-discipline lectures, much like a student taking notes and being tested later. For each video, we generate a detailed caption as the standard "notes" to assess the model's visual perception. Additionally, we create 15 questions as a "quiz" to evaluate content reasoning, challenging the model's ability to apply learned knowledge.
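To make the two-part design concrete, one entry of the benchmark could be organized as below. This is a minimal illustrative sketch; the field names (`video_id`, `caption`, `quiz`, etc.) are hypothetical, not the benchmark's actual schema.

```python
# Illustrative sketch of one Video-MMLU entry (hypothetical field names,
# not the benchmark's actual data schema).
entry = {
    "video_id": "physics_0001",             # hypothetical identifier
    "discipline": "physics",                # mathematics / physics / chemistry
    "caption": "The lecturer derives ...",  # detailed "notes" for perception
    "quiz": [                               # 15 QA pairs for content reasoning
        {"question": "What theorem is applied in the derivation?",
         "answer": "The work-energy theorem."},
        # ... 14 more QA pairs
    ],
}

def split_tasks(entry):
    """Separate the perception task (captioning against the reference
    'notes') from the reasoning task (answering the 'quiz')."""
    perception_reference = entry["caption"]
    reasoning_items = entry["quiz"]
    return perception_reference, reasoning_items
```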

News

  • [2025/3/27] Released the Video-MMLU benchmark, as well as the evaluation code on lmms-eval and VLMEvalKit.

Evaluation Pipeline

We evaluate the Video-MMLU benchmark on two open-source multimodal large model evaluation frameworks, lmms-eval and VLMEvalKit, using Qwen2.5-72B-Instruct as the judge model. Because loading the judge model locally consumes substantial GPU memory, we provide two evaluation options: calling the SiliconFlow API (supported only by VLMEvalKit) or loading Qwen2.5-72B locally for post-processing.

For more details, please refer to the Eval Docs.
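The judge-based post-processing described above can be sketched as follows. This is a hypothetical illustration of the prompt construction and score parsing, not the actual lmms-eval / VLMEvalKit implementation (the prompt wording and score scale here are assumptions; see the Eval Docs for the real pipeline).

```python
import re

# Sketch of LLM-as-judge post-processing (hypothetical prompt and parsing,
# not the benchmark's actual implementation).

def build_judge_prompt(question, reference, prediction):
    """Format a grading request for the judge model
    (e.g. Qwen2.5-72B-Instruct)."""
    return (
        "You are grading a student's answer to a lecture quiz question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Student answer: {prediction}\n"
        "Reply with 'Score: <0-5>' only."
    )

def parse_judge_score(reply):
    """Extract the numeric score from the judge's reply; None if malformed."""
    match = re.search(r"Score:\s*(\d+)", reply)
    return int(match.group(1)) if match else None
```

Whether the judge runs behind SiliconFlow's hosted API or as a locally loaded Qwen2.5-72B, this prompt/parse logic stays the same; only the transport for the judge call differs, which is why both options are offered.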

Leaderboard

We evaluate a total of 96 models across four categories:

  • 3 Proprietary Models, including Gemini-1.5-Flash, GPT-4o, and Claude-3.5-Sonnet,
  • 78 Open-Source LMMs, encompassing state-of-the-art video-specific LMMs and image-based LMMs capable of processing multiple images, with model sizes ranging from 256M to 40B,
  • 9 Token Compression Models, designed especially for visual token reduction,
  • 6 Vision-Blind Baselines.

To submit your model results, please send an email with your logs to enxin.23@intl.zju.edu.cn or open an issue in our repository.

To-Do List

  • Release arXiv version
  • Upload source video, detailed captions and QA pairs
  • Upload lmms-eval code
  • Upload VLMEvalkit code
  • Upload figures_in_paper
  • Upload the frame captions, video captions and the transcripts
  • Upload keyframes
