- [2025-05-13] FunBench has been early accepted by MICCAI 2025! 🎉🎉🎉
- [2025-03-28] FunBench is publicly available on Hugging Face
Multimodal Large Language Models (MLLMs) have shown significant potential in medical image analysis. However, their capabilities in interpreting fundus images, a critical skill for ophthalmology, remain under-evaluated. Existing benchmarks lack fine-grained task divisions and fail to provide modular analysis of the two key modules of an MLLM, i.e., the large language model (LLM) and the vision encoder (VE). This paper introduces FunBench, a novel visual question answering (VQA) benchmark designed to comprehensively evaluate MLLMs’ fundus reading skills. FunBench features a hierarchical task organization across four levels (modality perception, anatomy perception, lesion analysis, and disease diagnosis). It also offers three targeted evaluation modes: linear-probe based VE evaluation, knowledge-prompted LLM evaluation, and holistic evaluation. Experiments on ten open-source MLLMs plus GPT-4o reveal significant deficiencies in fundus reading skills, particularly in basic tasks such as laterality recognition. The results highlight the limitations of current MLLMs and emphasize the need for domain-specific training and improved LLMs and VEs.
FunBench consists of a total of 10 tasks, divided into 4 levels.
- Level 1 (L1): Modality perception
- Level 2 (L2): Anatomy perception
- Level 3 (L3): Lesion analysis
- Level 4 (L4): Disease diagnosis
Three targeted evaluation modes (E-mode) are presented.
- E-Mode I: Linear-probe based VE evaluation
- E-Mode II: Knowledge-prompted LLM evaluation
- E-Mode III: Holistic evaluation
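The idea behind E-Mode I can be sketched as follows: the vision encoder is frozen, image features are extracted once, and a simple linear classifier is fit on top, so the score reflects the VE alone. This is an illustrative sketch, not the repository's actual code; the `extract_features` stand-in (random features with labels tied to one dimension) is hypothetical and would be replaced by real frozen-VE features.

```python
import numpy as np

rng = np.random.default_rng(0)


def extract_features(n: int, dim: int = 64) -> np.ndarray:
    """Stand-in for frozen vision-encoder features (hypothetical)."""
    return rng.normal(size=(n, dim))


# Toy labels correlated with the first feature dimension, so the
# probe has a signal to pick up.
X_train = extract_features(200)
y_train = (X_train[:, 0] > 0).astype(float)
X_test = extract_features(50)
y_test = (X_test[:, 0] > 0).astype(float)

# Linear probe: a least-squares fit on the frozen features; the
# encoder itself is never updated.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
acc = ((X_test @ w > 0.5) == y_test).mean()
print(f"linear-probe accuracy: {acc:.2f}")
```

A higher probe accuracy indicates that the VE's features already encode the task-relevant information, independent of the LLM.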
Results on TDIUC (general domain) vs. results on FunBench.
FunBench is available at https://huggingface.co/datasets/AIMClab-RUC/FunBench
We adopt 14 public datasets in FunBench. Please download the images from the provided links and place them in the same directory.
- Six CFP datasets: IDRiD, DDR, JSIEC, RFMiD, OIA-ODIR and Retinal-Lesions
- Five OCT datasets: OCTDL, NEH, OCTID, UCSD and RETOUCH
- One UWF dataset: TOP
- Two multimodal datasets: MMC-AMD and DeepDRiD
We run preprocess.py on CFP images to crop out the retina area and make the images square.
Specifically, some images in Retinal-Lesions are rotated by 180 degrees to ensure consistency between their laterality labels and the image contents.
The preprocessing may take 1-2 hours.
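The cropping step can be sketched roughly as follows: threshold away the dark background around the circular retina, crop to its bounding box, and pad to a square. This is a simplified illustration of what such a preprocessing script typically does, not the actual preprocess.py; the function name and threshold are assumptions.

```python
import numpy as np


def crop_retina_square(img: np.ndarray, thresh: int = 10) -> np.ndarray:
    """Crop the dark background around the retina and pad to a square
    (a simplified sketch; the real preprocess.py may differ)."""
    mask = img.mean(axis=2) > thresh          # foreground = retina pixels
    ys, xs = np.where(mask)
    img = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = img.shape[:2]
    side = max(h, w)
    out = np.zeros((side, side, 3), dtype=img.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    out[top:top + h, left:left + w] = img     # center the crop
    return out


# Toy image: a bright 40x60 "retina" on a black 100x100 background.
img = np.zeros((100, 100, 3), dtype=np.uint8)
img[30:70, 20:80] = 200
sq = crop_retina_square(img)
print(sq.shape)  # (60, 60, 3)
```

The 180-degree rotation mentioned above would be a simple `np.rot90(img, 2)` applied to the affected Retinal-Lesions images.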
Run predict.py to obtain predictions from the MLLMs and evaluation.py to compute the metrics.
The Predictor class in predict.py is customized for each MLLM.
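A per-model Predictor interface like the one described above might look as follows. This is a hypothetical sketch of the pattern, not the actual predict.py; the method name `answer` and the `EchoPredictor` stand-in are assumptions for illustration.

```python
from abc import ABC, abstractmethod


class Predictor(ABC):
    """Base class: each MLLM gets its own subclass implementing answer()."""

    @abstractmethod
    def answer(self, image_path: str, question: str) -> str:
        """Return the model's answer for one VQA item."""


class EchoPredictor(Predictor):
    """Toy stand-in that always returns option 'A', for pipeline testing."""

    def answer(self, image_path: str, question: str) -> str:
        return "A"


preds = [EchoPredictor().answer("img.jpg", q)
         for q in ["Which eye is shown?", "Is the optic disc visible?"]]
print(preds)  # ['A', 'A']
```

Keeping the evaluation loop model-agnostic this way means adding a new MLLM only requires one new subclass.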
If you find our work useful, please consider citing:
@inproceedings{miccai25-funbench,
  title = {FunBench: Benchmarking Fundus Reading Skills of MLLMs},
  author = {Qijie Wei and Kaiheng Qian and Xirong Li},
  booktitle = {MICCAI},
  year = {2025}
}
If you encounter any issues, please feel free to reach us either by creating a new issue on GitHub or by emailing:
- Qijie Wei (qijie.wei@ruc.edu.cn)



