FunBench: Benchmarking Fundus Reading Skills of MLLMs

View on arXiv

News

  • [2025-05-13] FunBench has been early accepted by MICCAI 2025! 🎉🎉🎉
  • [2025-03-28] FunBench is publicly available on Hugging Face

Introduction

Multimodal Large Language Models (MLLMs) have shown significant potential in medical image analysis. However, their capabilities in interpreting fundus images, a critical skill for ophthalmology, remain under-evaluated. Existing benchmarks lack fine-grained task divisions and fail to provide modular analysis of an MLLM's two key modules, i.e., the large language model (LLM) and the vision encoder (VE). This paper introduces FunBench, a novel visual question answering (VQA) benchmark designed to comprehensively evaluate MLLMs' fundus reading skills. FunBench features a hierarchical task organization across four levels (modality perception, anatomy perception, lesion analysis, and disease diagnosis). It also offers three targeted evaluation modes: linear-probe-based VE evaluation, knowledge-prompted LLM evaluation, and holistic evaluation. Experiments on ten open-source MLLMs plus GPT-4o reveal significant deficiencies in fundus reading skills, particularly in basic tasks such as laterality recognition. The results highlight the limitations of current MLLMs and emphasize the need for domain-specific training and improved LLMs and VEs.

Hierarchical Task Organization

FunBench consists of 10 tasks organized into four levels.

  • Level 1 (L1): Modality perception
  • Level 2 (L2): Anatomy perception
  • Level 3 (L3): Lesion analysis
  • Level 4 (L4): Disease diagnosis

Targeted Evaluation Modes

Three targeted evaluation modes (E-Modes) are provided.

  • E-Mode I: Linear-probe-based VE evaluation
  • E-Mode II: Knowledge-prompted LLM evaluation
  • E-Mode III: Holistic evaluation
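To illustrate the idea behind E-Mode I: a linear probe freezes the vision encoder, extracts image features once, and trains only a linear classifier on top, so classification accuracy reflects what the VE alone has learned. The sketch below is a minimal, self-contained illustration of this technique (not the benchmark's actual code); the function names and the plain-NumPy softmax probe are assumptions for demonstration.

```python
import numpy as np

def train_linear_probe(features, labels, n_classes, lr=0.1, epochs=200):
    """Fit a softmax linear probe on frozen encoder features via gradient descent.

    features : (n, d) array of VE embeddings (the encoder stays frozen).
    labels   : (n,) integer class labels for the probing task.
    """
    n, d = features.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # cross-entropy gradient
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_accuracy(features, labels, W, b):
    """Accuracy of the trained probe; a proxy for VE feature quality."""
    preds = (features @ W + b).argmax(axis=1)
    return float((preds == labels).mean())
```

In practice the features would come from the frozen vision encoder of each MLLM; only the probe's `W` and `b` are trained, so differences in accuracy can be attributed to the VE rather than the LLM.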

Results

Results on TDIUC (a general-domain VQA benchmark) vs. results on FunBench.

Preparation

1. Download FunBench

FunBench is available at https://huggingface.co/datasets/AIMClab-RUC/FunBench

2. Download images

We adopt 14 public datasets in FunBench. Please download the images from the provided links and place them all in a single directory.

3. Image preprocess

We preprocess CFP images with preprocess.py to crop out the retina area and make the images square. Specifically, some images in Retinal-Lesions are rotated by 180 degrees to ensure consistency between their laterality labels and the image contents.

The preprocessing may take 1-2 hours.
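The crop-and-square step can be sketched as follows. This is a minimal illustration of the idea, not the repository's preprocess.py: it assumes the retina can be separated from the dark background by a simple brightness threshold, and the function name and `threshold` parameter are hypothetical.

```python
import numpy as np

def crop_and_square(img, rotate_180=False, threshold=10):
    """Crop a fundus photo to the bounding box of the retina (pixels brighter
    than `threshold` on any channel), then zero-pad to a square canvas."""
    if rotate_180:
        img = img[::-1, ::-1]              # flip both axes = 180-degree rotation
    mask = img.max(axis=2) > threshold     # rough retina mask on dark background
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    img = img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    h, w = img.shape[:2]
    side = max(h, w)
    canvas = np.zeros((side, side, img.shape[2]), dtype=img.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    canvas[top:top + h, left:left + w] = img   # center the crop on the canvas
    return canvas
```

Padding (rather than stretching) keeps the retina's aspect ratio, which matters for anatomy-perception tasks.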

Evaluation

Run predict.py to obtain predictions from the MLLMs, then evaluation.py to compute metrics. The Predictor class in predict.py is customized for each MLLM.
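The per-model customization can be captured by a common interface that each MLLM backend implements. The sketch below is an assumed shape for such an interface, not the repository's actual Predictor class; the method names and the `(image_path, question, answer)` sample format are illustrative.

```python
from abc import ABC, abstractmethod

class Predictor(ABC):
    """Minimal interface each MLLM backend would implement for FunBench VQA."""

    @abstractmethod
    def predict(self, image_path: str, question: str) -> str:
        """Return the model's answer (e.g. an option letter) for one question."""

def evaluate(predictor: Predictor, samples) -> float:
    """Accuracy over (image_path, question, answer) triplets."""
    correct = sum(
        predictor.predict(img, q).strip().upper() == ans.strip().upper()
        for img, q, ans in samples
    )
    return correct / len(samples)
```

Each concrete subclass wraps one model's image-loading and prompting conventions, so the evaluation loop stays identical across all MLLMs.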

Citation

If you find our work useful, please consider citing:

@inproceedings{miccai25-funbench,
  title     = {FunBench: Benchmarking Fundus Reading Skills of MLLMs},
  author    = {Qijie Wei and Kaiheng Qian and Xirong Li},
  booktitle = {MICCAI},
  year      = {2025}
}

Contact

If you encounter any issues, please feel free to reach us either by creating a new issue on GitHub or by email.
