Homepage | Paper | Code | Dataset | Evaluation | Leaderboard
MMComposition aims to provide a comprehensive assessment of compositionality in Vision-Language Models (VLMs), that is, their ability to understand and produce novel combinations of known visual and textual components. The benchmark is designed to help researchers and practitioners better understand the capabilities and limitations of VLMs and identify critical areas for model improvement. MMComposition comprises 13 complex vision-language composition tasks:
- Attribute Perception
- Object Perception
- Counting Perception
- Relation Perception
- Difference Spotting
- Text Rendering
- Visual Similarity
- Attribute Reasoning
- Object Reasoning
- Counting Reasoning
- Relation Reasoning
- Object Interaction
- Compositional Probing
@article{hua2024mmcomposition,
title={MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models},
author={Hua, Hang and Tang, Yunlong and Zeng, Ziyun and Cao, Liangliang and Yang, Zhengyuan and He, Hangfeng and Xu, Chenliang and Luo, Jiebo},
journal={arXiv preprint arXiv:2410.09733},
year={2024}
}