- [2025-05-13] FunBench has been early accepted by MICCAI 2025! 🎉🎉🎉
- [2025-03-28] FunBench is publicly available on Hugging Face
Multimodal Large Language Models (MLLMs) have shown significant potential in medical image analysis. However, their capabilities in interpreting fundus images, a critical skill for ophthalmology, remain under-evaluated. Existing benchmarks lack fine-grained task divisions and fail to provide modular analysis of the two key modules of an MLLM, i.e., the large language model (LLM) and the vision encoder (VE). This paper introduces FunBench, a novel visual question answering (VQA) benchmark designed to comprehensively evaluate MLLMs’ fundus reading skills. FunBench features a hierarchical task organization across four levels (modality perception, anatomy perception, lesion analysis, and disease diagnosis). It also offers three targeted evaluation modes: linear-probe based VE evaluation, knowledge-prompted LLM evaluation, and holistic evaluation. Experiments on ten open-source MLLMs plus GPT-4o reveal significant deficiencies in fundus reading skills, particularly in basic tasks such as laterality recognition. The results highlight the limitations of current MLLMs and emphasize the need for domain-specific training and improved LLMs and VEs.
FunBench consists of a total of 10 tasks, divided into 4 levels.
- Level 1 (L1): Modality perception
- Level 2 (L2): Anatomy perception
- Level 3 (L3): Lesion analysis
- Level 4 (L4): Disease diagnosis
Three targeted evaluation modes (E-mode) are presented.
- E-Mode I: Linear-probe based VE evaluation
- E-Mode II: Knowledge-prompted LLM evaluation
- E-Mode III: Holistic evaluation
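The idea behind E-Mode I can be sketched as follows: the vision encoder is frozen, image features are extracted once, and a simple linear classifier is fit on top, so the score reflects the VE alone. This is an illustrative sketch, not the repository's actual code; the `extract_features` stand-in (random features with labels tied to one dimension) is hypothetical and would be replaced by real frozen-VE features.

```python
import numpy as np

rng = np.random.default_rng(0)


def extract_features(n: int, dim: int = 64) -> np.ndarray:
    """Stand-in for frozen vision-encoder features (hypothetical)."""
    return rng.normal(size=(n, dim))


# Toy labels correlated with the first feature dimension, so the
# probe has a signal to pick up.
X_train = extract_features(200)
y_train = (X_train[:, 0] > 0).astype(float)
X_test = extract_features(50)
y_test = (X_test[:, 0] > 0).astype(float)

# Linear probe: a least-squares fit on the frozen features; the
# encoder itself is never updated.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
acc = ((X_test @ w > 0.5) == y_test).mean()
print(f"linear-probe accuracy: {acc:.2f}")
```

A higher probe accuracy indicates that the VE's features already encode the task-relevant information, independent of the LLM.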
Results on TDIUC (general domain) vs. results on FunBench.
FunBench is available at https://huggingface.co/datasets/AIMClab-RUC/FunBench
We adopt 14 public datasets in FunBench. Please download the images from the provided links and place them in the same directory.
- Six CFP datasets: IDRiD, DDR, JSIEC, RFMiD, OIA-ODIR and Retinal-Lesions
- Five OCT datasets: OCTDL, NEH, OCTID, UCSD and RETOUCH
- One UWF dataset: TOP
- Two multimodal datasets: MMC-AMD and DeepDRiD
We run preprocess.py on CFP images to crop out the retina area and make the images square.
Specifically, some images in Retinal-Lesions are rotated by 180 degrees to ensure consistency between their laterality labels and the image contents.
The preprocessing may take 1-2 hours.
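The cropping step can be sketched roughly as follows: threshold away the dark background around the circular retina, crop to its bounding box, and pad to a square. This is a simplified illustration of what such a preprocessing script typically does, not the actual preprocess.py; the function name and threshold are assumptions.

```python
import numpy as np


def crop_retina_square(img: np.ndarray, thresh: int = 10) -> np.ndarray:
    """Crop the dark background around the retina and pad to a square
    (a simplified sketch; the real preprocess.py may differ)."""
    mask = img.mean(axis=2) > thresh          # foreground = retina pixels
    ys, xs = np.where(mask)
    img = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = img.shape[:2]
    side = max(h, w)
    out = np.zeros((side, side, 3), dtype=img.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    out[top:top + h, left:left + w] = img     # center the crop
    return out


# Toy image: a bright 40x60 "retina" on a black 100x100 background.
img = np.zeros((100, 100, 3), dtype=np.uint8)
img[30:70, 20:80] = 200
sq = crop_retina_square(img)
print(sq.shape)  # (60, 60, 3)
```

The 180-degree rotation mentioned above would be a simple `np.rot90(img, 2)` applied to the affected Retinal-Lesions images.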
Run predict.py to obtain predictions from the MLLMs and evaluation.py to compute the metrics.
The Predictor class in predict.py is customized for each MLLM.
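A per-model Predictor interface like the one described above might look as follows. This is a hypothetical sketch of the pattern, not the actual predict.py; the method name `answer` and the `EchoPredictor` stand-in are assumptions for illustration.

```python
from abc import ABC, abstractmethod


class Predictor(ABC):
    """Base class: each MLLM gets its own subclass implementing answer()."""

    @abstractmethod
    def answer(self, image_path: str, question: str) -> str:
        """Return the model's answer for one VQA item."""


class EchoPredictor(Predictor):
    """Toy stand-in that always returns option 'A', for pipeline testing."""

    def answer(self, image_path: str, question: str) -> str:
        return "A"


preds = [EchoPredictor().answer("img.jpg", q)
         for q in ["Which eye is shown?", "Is the optic disc visible?"]]
print(preds)  # ['A', 'A']
```

Keeping the evaluation loop model-agnostic this way means adding a new MLLM only requires one new subclass.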
If you find our work useful, please consider citing:
@inproceedings{miccai25-funbench,
  title = {FunBench: Benchmarking Fundus Reading Skills of MLLMs},
  author = {Qijie Wei and Kaiheng Qian and Xirong Li},
  booktitle = {MICCAI},
  year = {2025}
}
If you encounter any issues, please feel free to reach us either by creating a new issue on GitHub or by emailing:
- Qijie Wei (qijie.wei@ruc.edu.cn)



