ChildBench is a high-quality benchmark dataset built through manual image construction and manual annotation, specifically designed to evaluate the core capabilities of multimodal large language models (MLLMs) in early-education scenarios. The dataset includes 4,890 images and 5,346 annotated samples, covering 10 task types across 5 capability dimensions, including spatial reasoning, counting, and visual tracking. To ensure data quality, ChildBench follows detailed construction and annotation procedures, with a team of professional annotators and auditors conducting three rounds of rigorous quality control.
The following figures show some representative examples from our dataset. You can click on Examples to view part of the dataset in detail.
The following table Splits lists detailed statistics of the split dataset.
For more details, you can find our dataset under the path (Dataset/dataset).
Because anonymous links can only redirect to a specified file, not to a specified directory, we use bold italic font to mark all referenced directories, making it easier for reviewers to locate them. Thank you!
All images in the dataset can be viewed after downloading and extracting the compressed package in the (Dataset/images) directory.
As reported in the following table, ChildBench contains 5,346 samples, divided into training, validation, and test sets in a 7:1:2 ratio.
All the split datasets are in the directory (Dataset/dataset).
Each json file is of the following format:
[
{
"question_id": "1",
"question": "Determine rows 1-4 separately. Which shape in each row is different from the other shapes?",
"category": "Visual_Discrimination_odd_one_out",
"input_image": [
"Visual_Discrimination_odd_one_out_0012.png"
],
"answer_type": "choice",
"options": [
"A.A,B,A,D",
"B.B,B,A,C",
"C.A,B,A,C",
"D.C,B,A,C"
],
"answer": "A"
},
{
"question_id": "2",
"question": "What are the corresponding letters of the overlapping shapes in the picture from top to bottom?",
"category": "Spatial_Reasoning_image_overlapping_reasoning",
"input_image": [
"Spatial_Reasoning_image_overlapping_reasoning_0252.png"
],
"answer_type": "order",
"answer": "A, E, C, D, B"
},
{
"question_id": "3",
"question": "What numbers correspond to the different shapes from top to bottom in the picture? ",
"category": "Visual_Tracking",
"input_image": [
"Visual_Tracking_0009.png"
],
"answer_type": "choice",
"options": [
"A.2,3,5,4,1",
"B.2,4,5,3,1",
"C.3,1,2,5,4",
"D.4,2,5,3,1"
],
"answer": "B"
},
{
"question_id": "4",
"question": "How many cubes are there in the picture?",
"category": "Counting_Skill_spatial_counting",
"input_image": [
"Counting_Skill_spatial_counting_0084.png"
],
"answer_type": "blank",
"answer": "7"
}
]
Each object in the array is an individual data point.
question is the manually annotated question, input_image gives the file name(s) of the associated image(s), and options lists the candidate choices for choice-type questions.
The dataset encompasses 5 capability evaluations across 10 task types, precisely aligned with early-education assessment scenarios.
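As a quick sanity check, the snippet below loads one split and prints basic statistics; it is a minimal sketch, and the split file name test.json is an assumption based on the directory layout described above.

import json
from collections import Counter

# Load one split (the file name test.json is an assumption; see Dataset/dataset).
with open("Dataset/dataset/test.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

# Each entry carries the fields shown above: question_id, question, category,
# input_image, answer_type, options (for choice questions), and answer.
print(len(samples), "samples")
print(Counter(sample["category"] for sample in samples).most_common())
print(Counter(sample["answer_type"] for sample in samples))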
We have released the inference code for the models in the directory (Code/experiment), as well as the fine-tuning code in the directory (Code/finetune).
Note: Before using any code or scripts in this project, you need to manually fill in the necessary path information in the relevant files, including but not limited to the model path, training file path, and output path.
- For all 8 open-source MLLMs, you can directly execute the Python files in the directory (Code/experiment) to perform inference with the models before and after fine-tuning:
nohup python DeepSeek-VL.py > log/DeepSeek-VL.log 2>&1 &
nohup python InternVL3.py > log/InternVL3.log 2>&1 &
nohup python Llama-3.2-Vision.py > log/Llama-3.2-Vision.log 2>&1 &
nohup python LLaVA-v1.5.py > log/LLaVA-v1.5.log 2>&1 &
nohup python MiniCPM-V-2.6.py > log/MiniCPM-V-2.6.log 2>&1 &
nohup python mPLUG-Owl3.py > log/mPLUG-Owl3.log 2>&1 &
nohup python Phi-3.5-vision.py > log/Phi-3.5-vision.log 2>&1 &
nohup python Qwen2.5-VL.py > log/Qwen2.5-VL.log 2>&1 &
Because the open-source model weights are large, you need to download them yourself or load them directly from platforms such as Hugging Face. A hedged sketch of the per-sample inference loop that these scripts follow is shown below.
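The snippet below sketches the common structure of the inference scripts; load_model and generate are hypothetical placeholders for each model's own loading and generation code, and the prediction file name is an assumption.

import json

def load_model():
    # Hypothetical placeholder: load the MLLM weights from a local path or Hugging Face.
    raise NotImplementedError

def generate(model, image_names, prompt):
    # Hypothetical placeholder: run the model on the image(s) and prompt, return its text answer.
    raise NotImplementedError

model = load_model()
with open("Dataset/dataset/test.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

results = []
for sample in samples:
    prompt = sample["question"]
    if sample["answer_type"] == "choice":
        # Append the candidate options for multiple-choice questions.
        prompt += "\n" + "\n".join(sample["options"])
    prediction = generate(model, sample["input_image"], prompt)
    results.append({"question_id": sample["question_id"], "prediction": prediction})

with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)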
- For open-source models, you can execute the Bash files using LLaMA-Factory and ms-swift in the directory (Code/finetune) to perform fine-tuning. For models without provided Bash scripts, you can directly use LLaMA-Factory's webui for fine-tuning with default parameters (a hedged sketch of converting a split into a LLaMA-Factory-style training file follows the commands below):
nohup bash DeepSeek-VL.sh > log/DeepSeek-VL_train.log 2>&1 &
nohup bash InternVL3.sh > log/InternVL3_train.log 2>&1 &
nohup bash Llama-3.2-Vision.sh > log/Llama-3.2-Vision_train.log 2>&1 &
nohup bash LLaVA-v1.5.sh > log/LLaVA-v1.5_train.log 2>&1 &
nohup bash mPLUG-Owl3.sh > log/mPLUG-Owl3_train.log 2>&1 &
nohup bash Phi-3.5-vision.sh > log/Phi-3.5-vision_train.log 2>&1 &
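For models fine-tuned through LLaMA-Factory, the training data usually has to be converted into its multimodal ShareGPT-style layout first. The snippet below is a hedged sketch of such a conversion; the exact field names expected may differ between LLaMA-Factory versions (compare with its mllm_demo.json example), and the input/output file names are assumptions.

import json
import os

IMAGE_DIR = "Dataset/images"

# Assumed split file name; adjust to the actual file in Dataset/dataset.
with open("Dataset/dataset/train.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

converted = []
for sample in samples:
    prompt = "<image>" + sample["question"]
    if sample["answer_type"] == "choice":
        prompt += "\n" + "\n".join(sample["options"])
    converted.append({
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": sample["answer"]},
        ],
        "images": [os.path.join(IMAGE_DIR, name) for name in sample["input_image"]],
    })

# Hypothetical output file; register it in LLaMA-Factory's dataset_info.json before training.
with open("childbench_train_sharegpt.json", "w", encoding="utf-8") as f:
    json.dump(converted, f, ensure_ascii=False, indent=2)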
- For gemini-2.5-pro and gpt-4o, you can directly execute our Python files in the directory (Code/close_models) to perform zero-shot and one-shot inference, provided that you have prepared an API key:
python gemini_2-5_zero_shot.py
python gemini_2-5_one_shot.py
python gpt4o_zero_shot.py
python gpt4o_one_shot.py
A Gemini API key can be applied for on the official website, and GPT-4o access must be purchased on the official website. A minimal sketch of a single GPT-4o call is shown below.
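The snippet below is a minimal sketch of one zero-shot GPT-4o request on a single benchmark image via the OpenAI Python SDK; the provided scripts in (Code/close_models) wrap calls like this in a loop over a split, and the OPENAI_API_KEY environment variable is assumed to be set.

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4o(image_path, prompt):
    # Encode the image as base64 and send it together with the question.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(ask_gpt4o("Dataset/images/Counting_Skill_spatial_counting_0084.png",
                "How many cubes are there in the picture?"))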
You can process the model inference results with the code we provide to calculate overall accuracy and overall precision, recall, and F1 scores, and to compute the accuracy of each subtask through a graphical interface. The calculation is integrated into the Python files in the directory (Code/eval); a hedged sketch of the underlying metric computation follows the commands below:
python calculate_prf1.py
python calculate_acc.py
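The snippet below is a hedged sketch of the core metric computation these scripts wrap; the prediction file name and its fields are assumptions carried over from the inference sketch above, and the averaging scheme used by the released scripts may differ.

import json
from sklearn.metrics import precision_recall_fscore_support

with open("Dataset/dataset/test.json", "r", encoding="utf-8") as f:
    gold = {s["question_id"]: s["answer"] for s in json.load(f)}
with open("predictions.json", "r", encoding="utf-8") as f:
    pred = {r["question_id"]: r["prediction"] for r in json.load(f)}

y_true = [gold[qid] for qid in gold]
y_pred = [pred.get(qid, "") for qid in gold]

# Overall accuracy plus macro-averaged precision, recall, and F1.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"Acc={accuracy:.4f}  P={precision:.4f}  R={recall:.4f}  F1={f1:.4f}")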
The environment configuration required to run the code is placed in the directory (Code/requirement).
This project is licensed under the Apache-2.0 License.