This is the evaluation module for our work **Perception-Aware Policy Optimization for Multimodal Reasoning**.
- This module is also embedded into PAPO for convenient inference and evaluation.
- Feel free to directly use PAPO for the complete training-evaluation workflow!
## Setup

We follow the environment setup instructions from LLaMA-Factory:

```bash
cd PAPO-Eval
conda env create -f env.yml
conda activate papo_eval
pip install -e ".[all]"
```

## Data Preparation

All evaluation data can be downloaded from: https://huggingface.co/datasets/PAPO-Galaxy/PAPO_eval
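For instance, the dataset repo can be fetched with the Hugging Face CLI (the local directory below is just an illustrative choice; any download method from the dataset page works):

```shell
# Download the PAPO evaluation datasets (requires the huggingface_hub package).
# The --local-dir target is illustrative.
huggingface-cli download PAPO-Galaxy/PAPO_eval \
  --repo-type dataset \
  --local-dir ./PAPO_eval_data
```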
Prepare the evaluation dataset for PAPO evaluation:

- Set the specific dataset(s) you would like to use for evaluation:
  - `AUTO_UNZIP` (bool): Whether to automatically unzip images.
    - If set to `true`, the downloaded image ZIP file will be automatically unzipped, and the ZIP file will be removed.
    - If set to `false`, the ZIP file is kept and you will need to unzip the images manually.
  - `SPLIT_NAME` (str): Which dataset to use for evaluation. Currently available datasets:
    - hiyouga/geometry3k: `SPLIT_NAME="hiyouga_geometry3k"`
    - AI4Math/MathVerse: `SPLIT_NAME="AI4Math_MathVerse"`
    - AI4Math/MathVista: `SPLIT_NAME="AI4Math_MathVista"`
    - We_Math/We_Math: `SPLIT_NAME="We_Math"`
    - FanqingM/MMK12: `SPLIT_NAME="PAPO_MMK12"`
    - Vision-dependent subset of AI4Math/MathVerse: `SPLIT_NAME="AI4Math_MathVerse_vision_dependent"`
    - BUAADreamer/clevr_count_70k: `SPLIT_NAME="BUAADreamer_clevr_count_70k"`
    - lscpku/LogicVista: `SPLIT_NAME="lscpku_LogicVista"`
    - MMMU/MMMU_Pro: `SPLIT_NAME="MMMU_MMMU_Pro"`
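As a concrete sketch, the two options above might be set like this (the values are illustrative examples, not defaults):

```shell
# Illustrative preprocessing settings (example values, not defaults):
AUTO_UNZIP=true                  # auto-unzip downloaded image ZIPs, then remove them
SPLIT_NAME="AI4Math_MathVista"   # evaluate on AI4Math/MathVista
```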
- Run data preprocessing:

  ```bash
  cd PAPO-Eval
  bash papo_eval/preprocess/preprocess.sh
  ```
## Inference

- Please set the dataset and other eval parameters in `PAPO-Eval/papo_eval/run_infer.sh`:
  - `DATASET` (str): The dataset you would like to run inference on:
    - hiyouga/geometry3k: `DATASET="hiyouga_geometry3k"`
    - AI4Math/MathVerse: `DATASET="AI4Math_MathVerse"`
    - AI4Math/MathVista: `DATASET="AI4Math_MathVista"`
    - We_Math/We_Math: `DATASET="We-Math_We-Math"`
    - FanqingM/MMK12: `DATASET="PAPO_MMK12"`
    - Vision-dependent subset of AI4Math/MathVerse: `DATASET="AI4Math_MathVerse_vision_dependent"`
    - BUAADreamer/clevr_count_70k: `DATASET="BUAADreamer_clevr_count_70k"`
    - lscpku/LogicVista: `DATASET="lscpku_LogicVista"`
    - MMMU/MMMU_Pro: `DATASET="MMMU_MMMU_Pro"`
  - `MODEL` (str): The PAPO model you would like to run inference with.
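Putting the two parameters together, the relevant lines in the inference script might look like the following (both values are illustrative placeholders, and the exact spelling of the model variable may differ in your copy of the script):

```shell
# Illustrative inference settings (placeholder values):
DATASET="AI4Math_MathVerse"        # which preprocessed dataset to run inference on
MODEL="/path/to/your/papo_model"   # placeholder: local path or HF id of the PAPO model
```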
- Run inference:

  ```bash
  cd PAPO-Eval
  bash papo_eval/run_infer.sh
  ```

- Inference outputs will be saved under `PAPO-Eval/infer_outputs`. The first and last lines of the output will also show the exact save path.
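The exact layout under `infer_outputs` depends on your run; as a small convenience sketch (assuming GNU `find`), this helper prints the most recently written JSONL file under a directory:

```shell
# Sketch: print the newest *.jsonl under a directory (GNU find assumed).
latest_jsonl() {
  find "$1" -type f -name '*.jsonl' -printf '%T@ %p\n' \
    | sort -rn | head -n 1 | cut -d' ' -f2-
}

# Usage (path is illustrative):
# latest_jsonl PAPO-Eval/infer_outputs
```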
## Evaluation

- Please set the dataset and other eval parameters in `PAPO-Eval/papo_eval/run_eval.sh`:
  - `JSONL_PATH` (str): Path to the to-be-evaluated inference results.
    - JSONL path: Give the JSONL file path directly to evaluate the accuracy on a specific dataset's inference results.
    - Model dir: Give only the model directory (without a JSONL path) to evaluate vision-dependent accuracy.
  - `N_ROLLOUT` (int): Number of rollouts. We set `N_ROLLOUT=8` in our paper.
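The two `JSONL_PATH` modes above boil down to a file-vs-directory distinction; a minimal sketch of that dispatch (not the actual `run_eval.sh` logic) looks like:

```shell
# Sketch only, not the actual run_eval.sh logic: pick the eval mode
# from whether JSONL_PATH is a model directory or a single JSONL file.
eval_mode() {
  if [ -d "$1" ]; then
    echo "vision_dependent"   # model dir -> vision-dependent accuracy
  else
    echo "single_dataset"     # JSONL file -> accuracy on that dataset
  fi
}
```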
- Run evaluation:

  ```bash
  cd PAPO-Eval
  bash papo_eval/run_eval.sh
  ```

- Detailed results will be saved to `./eval_results/<eval_output_name>.json`. Results will also be printed in the final section of the output, together with the exact save path of the evaluation results.
## Acknowledgement

Huge thanks for providing this awesome codebase!
- We thank the LLaMA-Factory team for providing the foundational codebase that we adapted to implement model inference and evaluation for PAPO.
## Citation

```bibtex
@article{wang2025perception,
  title={Perception-Aware Policy Optimization for Multimodal Reasoning},
  author={Wang, Zhenhailong and Guo, Xuehang and Stoica, Sofia and Xu, Haiyang and Wang, Hongru and Ha, Hyeonjeong and Chen, Xiusi and Chen, Yangyi and Yan, Ming and Huang, Fei and others},
  journal={arXiv preprint arXiv:2507.06448},
  year={2025}
}
```