This is the evaluation module for our work **Perception-Aware Policy Optimization for Multimodal Reasoning**.
- This module is also embedded into PAPO for convenient inference and evaluation.
- Feel free to directly use PAPO for the complete training-evaluation workflow!
## Setup

We follow the environment setup instructions from LLaMA-Factory:

```bash
cd PAPO-Eval
conda env create -f env.yml
conda activate papo_eval
pip install -e ".[all]"
```

## Data Preparation

All evaluation data can be downloaded from: https://huggingface.co/datasets/PAPO-Galaxy/PAPO_eval
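For instance, the dataset repo can be fetched with the Hugging Face CLI (the local directory below is just an illustrative choice; any download method from the dataset page works):

```shell
# Download the PAPO evaluation datasets (requires the huggingface_hub package).
# The --local-dir target is illustrative.
huggingface-cli download PAPO-Galaxy/PAPO_eval \
  --repo-type dataset \
  --local-dir ./PAPO_eval_data
```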
Prepare the evaluation dataset for PAPO evaluation:

- Set the specific dataset(s) you would like to use for evaluation:
  - `AUTO_UNZIP` (bool): Whether to automatically unzip images.
    - If set to `true`, the downloaded image ZIP file will be automatically unzipped, and the ZIP file will be removed.
    - If set to `false`, the ZIP file is kept and you will need to unzip the images manually.
  - `SPLIT_NAME` (str): Which dataset to use for evaluation. Currently available datasets:
    - hiyouga/geometry3k: `SPLIT_NAME="hiyouga_geometry3k"`
    - AI4Math/MathVerse: `SPLIT_NAME="AI4Math_MathVerse"`
    - AI4Math/MathVista: `SPLIT_NAME="AI4Math_MathVista"`
    - We_Math/We_Math: `SPLIT_NAME="We_Math"`
    - FanqingM/MMK12: `SPLIT_NAME="PAPO_MMK12"`
    - Vision-dependent subset of AI4Math/MathVerse: `SPLIT_NAME="AI4Math_MathVerse_vision_dependent"`
    - BUAADreamer/clevr_count_70k: `SPLIT_NAME="BUAADreamer_clevr_count_70k"`
    - lscpku/LogicVista: `SPLIT_NAME="lscpku_LogicVista"`
    - MMMU/MMMU_Pro: `SPLIT_NAME="MMMU_MMMU_Pro"`
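As a concrete sketch, the two options above might be set like this (the values are illustrative examples, not defaults):

```shell
# Illustrative preprocessing settings (example values, not defaults):
AUTO_UNZIP=true                  # auto-unzip downloaded image ZIPs, then remove them
SPLIT_NAME="AI4Math_MathVista"   # evaluate on AI4Math/MathVista
```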
- Run data preprocessing:

  ```bash
  cd PAPO-Eval
  bash papo_eval/preprocess/preprocess.sh
  ```
## Inference

- Please set the dataset and other eval parameters in `PAPO-Eval/papo_eval/run_infer.sh`:
  - `DATASET` (str): The dataset you would like to run inference on:
    - hiyouga/geometry3k: `DATASET="hiyouga_geometry3k"`
    - AI4Math/MathVerse: `DATASET="AI4Math_MathVerse"`
    - AI4Math/MathVista: `DATASET="AI4Math_MathVista"`
    - We_Math/We_Math: `DATASET="We-Math_We-Math"`
    - FanqingM/MMK12: `DATASET="PAPO_MMK12"`
    - Vision-dependent subset of AI4Math/MathVerse: `DATASET="AI4Math_MathVerse_vision_dependent"`
    - BUAADreamer/clevr_count_70k: `DATASET="BUAADreamer_clevr_count_70k"`
    - lscpku/LogicVista: `DATASET="lscpku_LogicVista"`
    - MMMU/MMMU_Pro: `DATASET="MMMU_MMMU_Pro"`
  - `MODEL` (str): The PAPO model you would like to run inference with.
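Putting the two parameters together, the relevant lines in the inference script might look like the following (both values are illustrative placeholders, and the exact spelling of the model variable may differ in your copy of the script):

```shell
# Illustrative inference settings (placeholder values):
DATASET="AI4Math_MathVerse"        # which preprocessed dataset to run inference on
MODEL="/path/to/your/papo_model"   # placeholder: local path or HF id of the PAPO model
```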
- Run inference:

  ```bash
  cd PAPO-Eval
  bash papo_eval/run_infer.sh
  ```

- Inference outputs will be saved under `PAPO-Eval/infer_outputs`. The first and last lines of the output will also show the exact save path.
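The exact layout under `infer_outputs` depends on your run; as a small convenience sketch (assuming GNU `find`), this helper prints the most recently written JSONL file under a directory:

```shell
# Sketch: print the newest *.jsonl under a directory (GNU find assumed).
latest_jsonl() {
  find "$1" -type f -name '*.jsonl' -printf '%T@ %p\n' \
    | sort -rn | head -n 1 | cut -d' ' -f2-
}

# Usage (path is illustrative):
# latest_jsonl PAPO-Eval/infer_outputs
```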
## Evaluation

- Please set the dataset and other eval parameters in `PAPO-Eval/papo_eval/run_eval.sh`:
  - `JSONL_PATH` (str): Path to the to-be-evaluated inference results.
    - JSONL path: Give the JSONL file path directly to evaluate the accuracy on a specific dataset's inference results.
    - Model dir: Give only the model directory (without a JSONL path) to evaluate vision-dependent accuracy.
  - `N_ROLLOUT` (int): Number of rollouts. We set `N_ROLLOUT=8` in our paper.
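The two `JSONL_PATH` modes above boil down to a file-vs-directory distinction; a minimal sketch of that dispatch (not the actual `run_eval.sh` logic) looks like:

```shell
# Sketch only, not the actual run_eval.sh logic: pick the eval mode
# from whether JSONL_PATH is a model directory or a single JSONL file.
eval_mode() {
  if [ -d "$1" ]; then
    echo "vision_dependent"   # model dir -> vision-dependent accuracy
  else
    echo "single_dataset"     # JSONL file -> accuracy on that dataset
  fi
}
```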
- Run evaluation:

  ```bash
  cd PAPO-Eval
  bash papo_eval/run_eval.sh
  ```

- Detailed results will be saved to `./eval_results/<eval_output_name>.json`. Results will also be printed in the final section of the output, together with the exact save path of the evaluation results.
## Acknowledgement

Huge thanks for providing this awesome codebase!
- We thank the LLaMA-Factory team for providing the foundational codebase that we adapted to implement model inference and evaluation for PAPO.
## Citation

```bibtex
@article{wang2025perception,
  title={Perception-Aware Policy Optimization for Multimodal Reasoning},
  author={Wang, Zhenhailong and Guo, Xuehang and Stoica, Sofia and Xu, Haiyang and Wang, Hongru and Ha, Hyeonjeong and Chen, Xiusi and Chen, Yangyi and Yan, Ming and Huang, Fei and others},
  journal={arXiv preprint arXiv:2507.06448},
  year={2025}
}
```