AI-TestBot/DeepTest-Resources

DeepTest Resources

📒Introduction

Large Language Model (LLM) Testing Resources: a curated list of awesome LLM testing papers with code. See 📖Contents for details. This repo is updated frequently. 👨‍💻 You are welcome to star ⭐️ it or submit a PR; I will review and merge it.

📖Contents

📖Leaderboard

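| Date | Title | Paper | HomePage | Github | DataSets | Organization |
|---|---|---|---|---|---|---|
| 2023 | Open LLM Leaderboard. | - | [homepage] | - | - | Huggingface |
| 2023 | Chatbot arena: An open platform for evaluating llms by human preference. | [arXiv] | [homepage] | - | - | UC Berkeley |
| 2024 | AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. | [NeurIPS] | [homepage] | - | - | Stanford University |
| 2023 | OpenCompass (Sinan): a large model evaluation platform. | - | [homepage] | [Github] | - | Shanghai AI Laboratory |
| 2023 | FlagEval (Libra): a large model evaluation platform. | - | [homepage] | - | - | Beijing Academy of Artificial Intelligence (BAAI) |
| 2023 | Superclue: A comprehensive chinese large language model benchmark. | [arXiv] | [homepage] | - | - | SUPERCLUE |
| 2023 | SuperBench: a comprehensive capability evaluation framework for large models. | - | - | - | - | Tsinghua University, Foundation Model Research Center |
| 2023 | LLMEval: A Preliminary Study on How to Evaluate Large Language Models. | [AAAI] | [homepage] | [Github] | - | Fudan University |
| 2023 | CLiB-chinese-llm-benchmark. | - | - | [Github] | - | - |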

📖Review

Evaluating large language models: A comprehensive survey.
Z Guo, R Jin, C Liu, Y Huang, D Shi, L Yu, Y Liu, J Li, B Xiong, D Xiong.
ArXiv, 2023. [ArXiv] [Github]

A Survey on Evaluation of Large Language Models.
Y Chang, X Wang, J Wang, Y Wu, L Yang, K Zhu, H Chen, X Yi, C Wang, Y Wang, W Ye, et al.
ACM Transactions on Intelligent Systems and Technology, 2024. [Paper] [ArXiv] [Github]

Through the lens of core competency: Survey on evaluation of large language models.
Z Ziyu, C Qiguang, M Longxuan, L Mingda, et al.
CCL, 2024. [Paper]

A survey of large language model evaluation.
W Luo, H Wang.
Journal of Chinese Information Processing, 2024.

A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity.
Y Bang, S Cahyawijaya, N Lee, W Dai, D Su, et al.
arXiv, 2023. [Paper]

A systematic study and comprehensive evaluation of ChatGPT on benchmark datasets.
MTR Laskar, MS Bari, M Rahman, MAH Bhuiyan, S Joty, JX Huang.
arXiv:2305.18486, 2023. [Paper]

📖General

📖G-Comprehensive

Holistic evaluation of language models.
R Bommasani, P Liang, T Lee, et al.
ArXiv, 2023. [Homepage] [ArXiv] [Github]

Alignbench: Benchmarking chinese alignment of large language models.
X Liu, X Lei, S Wang, Y Huang, Z Feng, B Wen, J Cheng, P Ke, Y Xu, WL Tam, X Zhang, et al.
arXiv:2311.18743, 2023. [ArXiv] [Github]

TencentLLMEval: a hierarchical evaluation of Real-World capabilities for human-aligned LLMs.
S Xie, W Yao, Y Dai, S Wang, D Zhou, L Jin, X Feng, P Wei, Y Lin, Z Hu, D Yu, Z Zhang, et al.
arXiv:2311.05374, 2023. [ArXiv]

Evaluation of openai o1: Opportunities and challenges of agi.
T Zhong, Z Liu, Y Pan, Y Zhang, Y Zhou, S Liang, Z Wu, Y Lyu, P Shu, X Yu, C Cao, H Jiang, et al.
arXiv:2409.18486, 2024. [ArXiv]

📖Understanding

| Date | Task | Title | Paper | HomePage | Github | DataSets |
|---|---|---|---|---|---|---|
| 2018 | Comprehensive | GLUE: A multi-task benchmark and analysis platform for natural language understanding. | [ArXiv] | [Homepage] | - | - |
| 2019 | Comprehensive | Superglue: A stickier benchmark for general-purpose language understanding systems. | [NeurIPS] | [Homepage] | - | - |
| 2020 | Comprehensive | CLUE: A Chinese language understanding evaluation benchmark. | [ArXiv] | [Homepage] | - | - |
| 2021 | Comprehensive | Fewclue: A chinese few-shot learning evaluation benchmark. | [ArXiv] | [Homepage] | - | - |
| 2017 | Reading | Race: Large-scale reading comprehension dataset from examinations. | [ArXiv] | - | [Github] | [Datasets] |
| 2018 | Reading | Know what you don't know: Unanswerable questions for SQuAD. | [ArXiv] | - | - | - |
| 2017 | Reading | Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. | [ArXiv] | [Homepage] | - | - |
| 2019 | Reading | DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. | [ArXiv] | [Homepage] | - | - |
| 2019 | Reading | BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. | [ArXiv] | [Homepage] | - | - |
| 2023 | Reading | The belebele benchmark: a parallel reading comprehension dataset in 122 language variants. | [ArXiv] | - | - | - |
| 2024 | Reading | AC-EVAL: Evaluating Ancient Chinese Language Understanding in Large Language Models. | [ArXiv] | - | [Github] | - |
| 2023 | Semantic | The two word test: A semantic benchmark for large language models. | [ArXiv] | - | - | - |
| 2023 | Semantic | This is not a dataset: A large negation benchmark to challenge large language models. | [ArXiv] | - | [Github] | - |
| 2023 | Graph | Gpt4graph: Can large language models understand graph structured data? an empirical evaluation and benchmarking. | [ArXiv] | - | - | - |
| 2017 | Knowledge | Crowdsourcing multiple choice science questions. | [ArXiv] | - | - | [DataSets] |
| 2018 | Knowledge | Can a suit of armor conduct electricity? a new dataset for open book question answering. | [ArXiv] | - | [Github] | - |
| 2021 | Knowledge | Measuring massive multitask language understanding. | [ICLR] | - | [Github] | [Huggingface] |
| 2023 | Knowledge | C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. | [ArXiv] | - | [Github] | - |
| 2023 | Knowledge | Cmmlu: Measuring massive multitask language understanding in chinese. | [ArXiv] | - | [Github] | - |
| 2023 | Knowledge | Measuring massive multitask chinese understanding. | [ArXiv] | - | - | - |
| 2024 | Knowledge | Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. | [ArXiv] | - | [Github] | [DataSets] |
| 2023 | Metrics | Rethinking the Evaluating Framework for Natural Language Understanding in AI Systems: Language Acquisition as a Core for Future Metrics. | [ArXiv] | - | - | - |

📖Generation

| Date | Task | Title | Paper | HomePage | Github | DataSets |
|---|---|---|---|---|---|---|
| 2015 | Summarization | Lcsts: A large scale chinese short text summarization dataset. | [EMNLP] | [Homepage] | - | - |
| 2018 | Summarization | Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. | [ArXiv] | - | [Github] | - |
| 2019 | Summarization | SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. | [ArXiv] | - | - | - |
| 2021 | Summarization | DialogSum: A real-life scenario dialogue summarization dataset. | [ArXiv] | - | [Github] | - |
| 2023 | Summarization | Clinical text summarization: adapting large language models can outperform human experts. | [ArXiv] | - | - | - |
| 2023 | Summarization | Embrace divergence for richer insights: A multi-document summarization benchmark and a case study on summarizing diverse information from news articles. | [ArXiv] | - | [Github] | - |
| 2024 | Summarization | Benchmarking large language models for news summarization. | [TACL] | - | - | - |
| 2013 | QA | Semantic parsing on freebase from question-answer pairs. | [EMNLP] | - | - | - |
| 2018 | QA | The web as a knowledge-base for answering complex questions. | [ArXiv] | - | - | [Datasets] |
| 2019 | QA | Natural Questions: A Benchmark for Question Answering Research. | [ACL] | [Homepage] | [Github] | - |
| 2022 | QA | MiQA: A benchmark for inference on metaphorical questions. | [ArXiv] | - | [Github] | - |
| 2023 | QA | Emotionally numb or empathetic? evaluating how llms feel using emotionbench. | [ArXiv] | - | [Github] | - |
| 2023 | QA | Evaluating open-domain question answering in the era of large language models. | [ArXiv] | - | [Github] | - |
| 2023 | QA | Scigraphqa: A large-scale synthetic multi-turn question-answering dataset for scientific graphs. | [ArXiv] | - | - | - |
| 2023 | QA | Can ChatGPT replace traditional KBQA models? An in-depth analysis of the question answering performance of the GPT LLM family. | [ISWC] | - | [Github] | - |
| 2024 | QA | Compmix: A benchmark for heterogeneous question answering. | [ACMWC] | - | - | - |
| 2024 | QA | MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues. | [ArXiv] | - | [Github] | - |
| 2024 | QA | Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. | [NeurIPS] | [Homepage] | - | - |
| 2024 | Content | Benchmarking large language models on controllable generation under diversified instructions. | [AAAI] | - | [Github] | - |
| 2023 | Graph | Evaluating generative models for graph-to-text generation. | [ArXiv] | - | [Github] | - |
| 2023 | Graph | Text2kgbench: A benchmark for ontology-driven knowledge graph generation from text. | [ArXiv] | - | [Github] | - |

📖Reasoning

| Date | Task | Title | Paper | HomePage | Github | DataSets |
|---|---|---|---|---|---|---|
| 2022 | Comprehensive | Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. | [ArXiv] | - | - | - |
| 2023 | Comprehensive | Arb: Advanced reasoning benchmark for large language models. | [ArXiv] | - | - | - |
| 2023 | Comprehensive | Nphardeval: Dynamic benchmark on reasoning ability of large language models via complexity classes. | [ArXiv] | - | [Github] | - |
| 2024 | Comprehensive | Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study. | [ArXiv] | - | - | - |
| 2012 | Commonsense | The winograd schema challenge. | [AAAI] | - | - | - |
| 2018 | Commonsense | Commonsenseqa: A question answering challenge targeting commonsense knowledge. | [ArXiv] | - | - | - |
| 2019 | Commonsense | Hellaswag: Can a machine really finish your sentence? | [ArXiv] | [Homepage] | - | - |
| 2019 | Commonsense | Socialiqa: Commonsense reasoning about social interactions. | [ArXiv] | [Homepage] | - | - |
| 2020 | Commonsense | Piqa: Reasoning about physical commonsense in natural language. | [AAAI] | - | - | - |
| 2021 | Commonsense | Winogrande: An adversarial winograd schema challenge at scale. | [CACM] | - | - | - |
| 2023 | Commonsense | Worldsense: A synthetic benchmark for grounded reasoning in large language models. | [ArXiv] | - | [Github] | - |
| 2024 | Commonsense | Corecode: A common sense annotated dialogue dataset with benchmark tasks for chinese large language models. | [AAAI] | - | [Github] | - |
| 2024 | Commonsense | Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations. | [ArXiv] | - | [Github] | - |
| 2017 | Math | Deep Neural Solver for Math Word Problems. | [EMNLP] | - | - | [DataSets] |
| 2021 | Math | Measuring Mathematical Problem Solving With the MATH Dataset. | [NeurIPS] | - | - | [DataSets] |
| 2021 | Math | Training verifiers to solve math word problems. | [NeurIPS] | - | [Github] | [DataSets] |
| 2023 | Math | Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs. | [ArXiv] | - | - | [DataSets] |
| 2023 | Math | CMATH: Can Your Language Model Pass Chinese Elementary School Math Test? | [ArXiv] | - | - | [DataSets] |
| 2023 | Math | MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. | [ArXiv] | - | [Github] | [DataSets] |
| 2023 | Math | TheoremQA: A Theorem-driven Question Answering Dataset. | [ArXiv] | - | - | - |
| 2024 | Math | GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. | [ArXiv] | - | - | - |
| 2024 | Math | MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark. | [ArXiv] | - | [Github] | - |
| 2024 | Math | Mustard: Mastering uniform synthesis of theorem and proof data. | [ArXiv] | - | [Github] | - |
| 2024 | Math | Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models. | [ArXiv] | - | [Github] | - |
| 2016 | Logic | Story cloze evaluator: Vector space representation evaluation by predicting what happens next. | [ArXiv] | - | - | - |
| 2016 | Logic | The LAMBADA dataset: Word prediction requiring a broad discourse context. | [ArXiv] | - | - | - |
| 2023 | Logic | RoCar: A Relationship Network-based Evaluation Method to Large Language Models. | [ArXiv] | - | [Github] | - |
| 2023 | Logic | Towards benchmarking and improving the temporal reasoning capability of large language models. | [ArXiv] | - | [Github] | - |
| 2023 | Logic | Towards logiglue: A brief survey and a benchmark for analyzing logical reasoning capabilities of language models. | [ArXiv] | - | - | - |
| 2022 | Causal | Wikiwhy: Answering and explaining cause-and-effect questions. | [ArXiv] | - | - | - |
| 2024 | Causal | CausalBench: A Comprehensive Benchmark for Causal Learning Capability of Large Language Models. | [ArXiv] | - | - | - |
| 2024 | Causal | Cladder: A benchmark to assess causal reasoning capabilities of language models. | [NeurIPS] | - | [Github] | [Huggingface] |
| 2023 | Step | Art: Automatic multi-step reasoning and tool-use for large language models. | [ArXiv] | - | [Github] | - |
| 2023 | Step | STEPS: A Benchmark for Order Reasoning in Sequential Tasks. | [ArXiv] | - | [Github] | - |
| 2023 | Complex | Have llms advanced enough? a challenging problem solving benchmark for large language models. | [ArXiv] | - | [Github] | - |
| 2024 | Complex | MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures. | [ArXiv] | - | [Github] | [Huggingface] |
| 2024 | Complex | Livebench: A challenging, contamination-free llm benchmark. | [ArXiv] | [Homepage] | [Github] | [Huggingface] |
| 2024 | Complex | OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI. | [ArXiv] | - | [Github] | - |
| 2024 | Complex | Evaluation of OpenAI o1: Opportunities and Challenges of AGI. | [ArXiv] | - | [Github] | - |

📖Knowledge

Think you have solved question answering? try arc, the ai2 reasoning challenge.
P Clark, I Cowhey, O Etzioni, T Khot, A Sabharwal, C Schoenick, O Tafjord.
arXiv:1803.05457, 2018. [ArXiv]

Agieval: A human-centric benchmark for evaluating foundation models.
W Zhong, R Cui, Y Guo, Y Liang, S Lu, Y Wang, et al.
arXiv, 2023. [ArXiv] [Github]

Do llms understand social knowledge? evaluating the sociability of large language models with socket benchmark.
M Choi, J Pei, S Kumar, C Shu, D Jurgens.
arXiv:2305.14938, 2023. [ArXiv] [Github]

Eva-kellm: A new benchmark for evaluating knowledge editing of llms.
S Wu, M Peng, Y Chen, J Su, M Sun.
arXiv:2308.09954, 2023. [ArXiv]

KoLA: Carefully Benchmarking World Knowledge of Large Language Models.
J Yu, X Wang, S Tu, S Cao, D Zhang-Li, X Lv, H Peng, Z Yao, X Zhang, H Li, C Li, Z Zhang, et al.
ArXiv, 2023. [ArXiv] [Homepage]

ZhuJiu: A Multi-dimensional, Multi-faceted Chinese Benchmark for Large Language Models.
B Zhang, H Xie, P Du, J Chen, P Cao, Y Chen, S Liu, K Liu, J Zhao.
arXiv:2308.14353, 2023. [ArXiv] [Homepage]

Xiezhi: An ever-updating benchmark for holistic domain knowledge evaluation.
Z Gu, X Zhu, H Ye, L Zhang, J Wang, Y Zhu, et al.
AAAI, 2024. [AAAI] [Github]

📖Discipline

Evaluating the performance of large language models on gaokao benchmark.
X Zhang, C Li, Y Zong, Z Ying, L He, X Qiu.
ArXiv, 2023. [ArXiv] [Github]

M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models.
W Zhang, M Aljunied, C Gao, YK Chia, L Bing.
Advances in Neural Information Processing Systems, 2023. [NeurIPS] [Github]

M3ke: A massive multi-level multi-subject knowledge evaluation benchmark for chinese large language models.
C Liu, R Jin, Y Ren, L Yu, T Dong, X Peng, et al.
arXiv, 2024. [ArXiv] [Github]

OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI.
Z Huang, Z Wang, S Xia, X Li, H Zou, R Xu, RZ Fan, L Ye, E Chern, Y Ye, Y Zhang, Y Yang, et al.
arXiv:2406.12753, 2024. [ArXiv] [Github]

📖Multilingual

XNLI: Evaluating cross-lingual sentence representations.
A Conneau, G Lample, R Rinott, A Williams, SR Bowman, H Schwenk, V Stoyanov.
arXiv:1809.05053, 2018. [ArXiv]

Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization.
A Siddhant, J Hu, M Johnson, O Firat, et al.
ICML, 2020. [ArXiv]

TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages.
JH Clark, E Choi, M Collins, D Garrette, T Kwiatkowski, V Nikolaev, J Palomaki.
Transactions of the Association for Computational Linguistics, 2020. [ArXiv]

The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation.
N Goyal, C Gao, V Chaudhary, PJ Chen, G Wenzek, D Ju, S Krishnan, MA Ranzato, et al.
Transactions of the Association for Computational Linguistics, 2022. [ArXiv]

Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning.
VD Lai, NT Ngo, APB Veyseh, H Man, et al.
arXiv, 2023. [ArXiv]

Mega: Multilingual evaluation of generative ai.
K Ahuja, H Diddee, R Hada, M Ochieng, K Ramesh, P Jain, A Nambi, T Ganu, S Segal, et al.
arXiv:2303.12528, 2023. [ArXiv]

Megaverse: Benchmarking large language models across languages, modalities, models and tasks.
S Ahuja, D Aggarwal, V Gumma, I Watts, A Sathe, M Ochieng, R Hada, P Jain, M Axmed, et al.
arXiv:2311.07463, 2023. [ArXiv]

MELA: Multilingual Evaluation of Linguistic Acceptability.
Z Zhang, Y Liu, W Huang, J Mao, R Wang, H Hu.
arXiv:2311.09033, 2023. [ArXiv] [Github]

mSCAN: A Dataset for Multilingual Compositional Generalisation Evaluation.
A Reymond, S Steinert-Threlkeld.
Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP, 2023. [Paper]

SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning.
B Wang, Z Liu, X Huang, F Jiao, Y Ding, AT Aw, NF Chen.
ArXiv, 2023. [ArXiv] [Github]

Evaluating the elementary multilingual capabilities of large language models with MultiQ.
C Holtermann, P Röttger, T Dill, A Lauscher.
arXiv:2403.03814, 2024. [ArXiv] [Github]

M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models.
H Wang, J Xu, S Xie, R Wang, J Li, Z Xie, B Zhang, C Xiong, X Chen.
arXiv:2405.15638, 2024. [ArXiv] [Github]

📖Long-Context

ANALOGICAL--A Novel Benchmark for Long Text Analogy Evaluation in Large Language Models.
T Wijesiriwardene, R Wickramarachchi, BG Gajera, SM Gowaikar, C Gupta, A Chadha, et al.
arXiv:2305.05050, 2023. [ArXiv]

Bamboo: A comprehensive benchmark for evaluating long text modeling capacities of large language models.
Z Dong, T Tang, J Li, WX Zhao, JR Wen.
arXiv:2309.13345, 2023. [ArXiv] [Github]

L-eval: Instituting standardized evaluation for long context language models.
C An, S Gong, M Zhong, X Zhao, M Li, J Zhang, L Kong, X Qiu.
arXiv:2307.11088, 2023. [ArXiv] [Github]

Longbench: A bilingual, multitask benchmark for long context understanding.
Y Bai, X Lv, J Zhang, H Lyu, J Tang, Z Huang, Z Du, X Liu, A Zeng, L Hou, Y Dong, J Tang, et al.
arXiv:2308.14508, 2023. [ArXiv] [Github]

M4le: A multi-ability multi-range multi-task multi-domain long-context evaluation benchmark for large language models.
WC Kwan, X Zeng, Y Wang, Y Sun, L Li, L Shang, Q Liu, KF Wong.
arXiv:2310.19240, 2023. [ArXiv] [Github]

Zeroscrolls: A zero-shot benchmark for long text understanding.
U Shaham, M Ivgi, A Efrat, J Berant, O Levy.
arXiv:2305.14196, 2023. [ArXiv] [Homepage]

CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models.
Z Huang, J Li, S Huang, W Zhong, I King.
ArXiv, 2024. [ArXiv] [Github]

LooGLE: Can Long-Context Language Models Understand Long Contexts?
J Li, M Wang, Z Zheng, M Zhang.
arXiv:2311.04939, 2023. [ArXiv] [Github] [DataSets]

Lv-eval: A balanced long-context benchmark with 5 length levels up to 256k.
T Yuan, X Ning, D Zhou, Z Yang, S Li, M Zhuang, Z Tan, Z Yao, D Lin, B Li, G Dai, S Yan, et al.
arXiv:2402.05136, 2024. [ArXiv] [Github]

📖Chain-of-Thought

Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance.
Y Fu, L Ou, M Chen, Y Wan, H Peng, T Khot.
ArXiv, 2023. [ArXiv] [Github]

Cue-CoT: Chain-of-thought prompting for responding to in-depth dialogue questions with LLMs.
H Wang, R Wang, F Mi, Y Deng, Z Wang, B Liang, R Xu, KF Wong.
arXiv:2305.11792, 2023. [ArXiv] [Github]

📖Role-Playing

Charactereval: A chinese benchmark for role-playing conversational agent evaluation.
Q Tu, S Fan, Z Tian, R Yan.
ArXiv, 2024. [ArXiv] [Github]

Roleeval: A bilingual role evaluation benchmark for large language models.
T Shen, S Li, D Xiong.
arXiv:2312.16132, 2023. [ArXiv] [Github]

Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models.
ZM Wang, Z Peng, H Que, J Liu, W Zhou, Y Wu, et al.
ArXiv, 2023. [ArXiv] [Github]

📖Tools

Api-bank: A comprehensive benchmark for tool-augmented llms.
M Li, Y Zhao, B Yu, F Song, H Li, H Yu, Z Li, et al.
arXiv, 2023. [ArXiv] [Github]

Metatool benchmark for large language models: Deciding whether to use tools and which to use.
Y Huang, J Shi, Y Li, C Fan, S Wu, Q Zhang, Y Liu, P Zhou, Y Wan, NZ Gong, L Sun.
arXiv:2310.03128, 2023. [ArXiv] [Github]

Mint: Evaluating llms in multi-turn interaction with tools and language feedback.
X Wang, Z Wang, J Liu, Y Chen, L Yuan, H Peng, H Ji.
ArXiv, 2023. [ArXiv] [Github]

On the tool manipulation capability of open-source large language models.
Q Xu, F Hong, B Li, C Hu, Z Chen, J Zhang.
arXiv:2305.16504, 2023. [ArXiv] [Github]

T-eval: Evaluating the tool utilization capability step by step.
Z Chen, W Du, W Zhang, K Liu, J Liu, M Zheng, J Zhuo, S Zhang, D Lin, K Chen, F Zhao.
arXiv:2312.14033, 2023. [ArXiv] [Github]

Toolqa: A dataset for llm question answering with external tools.
Y Zhuang, Y Yu, K Wang, H Sun, C Zhang.
Advances in Neural Information Processing Systems, 2024. [NeurIPS] [Github]

📖Instruction-Following

Followbench: A multi-level fine-grained constraints following benchmark for large language models.
Y Jiang, Y Wang, X Zeng, W Zhong, L Li, F Mi, L Shang, X Jiang, Q Liu, W Wang.
ArXiv, 2023. [ArXiv] [Github]

Instructeval: Towards holistic evaluation of instruction-tuned large language models.
YK Chia, P Hong, L Bing, S Poria.
arXiv:2306.04757, 2023. [ArXiv] [Github] [DataSets]

Instruction-following evaluation for large language models.
J Zhou, T Lu, S Mishra, S Brahma, S Basu, Y Luan, D Zhou, L Hou.
arXiv:2311.07911, 2023. [ArXiv] [Github] [DataSets]

Benchmarking complex instruction-following with multiple constraints composition.
B Wen, P Ke, X Gu, L Wu, H Huang, J Zhou, W Li, B Hu, W Gao, J Xu, Y Liu, J Tang, H Wang, et al.
arXiv:2407.03978, 2024. [ArXiv] [Github]

Cfbench: A comprehensive constraints-following benchmark for llms.
T Zhang, Y Shen, W Luo, Y Zhang, H Liang, F Yang, M Lin, Y Qiao, W Chen, B Cui, W Zhang, et al.
arXiv:2408.01122, 2024. [ArXiv] [Github]

Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models.
H Sun, L Liu, J Li, F Wang, B Dong, R Lin, R Huang.
arXiv:2404.02823, 2024. [ArXiv] [Github]

Evaluation of Instruction-Following Ability for Large Language Models on Story-Ending Generation.
R Hida, J Ohmura, T Sekiya.
arXiv:2406.16356, 2024. [ArXiv]

From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models.
Q He, J Zeng, Q He, J Liang, Y Xiao.
arXiv:2404.15846, 2024. [ArXiv] [Github]

InFoBench: Evaluating Instruction Following Ability in Large Language Models.
Y Qin, K Song, Y Hu, W Yao, S Cho, X Wang, X Wu, F Liu, P Liu, D Yu.
arXiv:2401.03601, 2024. [ArXiv] [Github]

INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models.
H Oh, H Lee, S Ye, H Shin, H Jang, C Jun, M Seo.
arXiv:2402.14334, 2024. [ArXiv] [Github]

SysBench: Can Large Language Models Follow System Messages?
Y Qin, T Zhang, Y Shen, W Luo, H Sun, Y Zhang, Y Qiao, W Chen, Z Zhou, W Zhang, B Cui.
arXiv:2408.10943, 2024. [ArXiv] [Github]

📖Reliable

| Date | Task | Title | Paper | HomePage | Github | DataSets |
|---|---|---|---|---|---|---|
| 2022 | Hallucination | Truthfulqa: Measuring how models mimic human falsehoods. | [ArXiv] | - | [Github] | - |
| 2023 | Hallucination | Autohall: Automated hallucination dataset generation for large language models. | [ArXiv] | - | - | - |
| 2023 | Hallucination | Evaluating hallucinations in chinese large language models. | [ArXiv] | - | [Github] | - |
| 2023 | Hallucination | HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models. | [ArXiv] | - | - | [DataSets] |
| 2023 | Hallucination | Halo: Estimation and reduction of hallucinations in open-source weak large language models. | [ArXiv] | - | [Github] | - |
| 2023 | Hallucination | Halueval: A large-scale hallucination evaluation benchmark for large language models. | [ArXiv] | - | [Github] | - |
| 2023 | Hallucination | Med-halt: Medical domain hallucination test for large language models. | [ArXiv] | - | [Github] | - |
| 2023 | Hallucination | Uhgeval: Benchmarking the hallucination of chinese large language models via unconstrained generation. | [ArXiv] | - | [Github] | - |
| 2024 | Hallucination | DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models. | [ArXiv] | - | [Github] | - |
| 2024 | Hallucination | Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models. | [ArXiv] | - | - | - |
| 2024 | Hallucination | HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation. | [ArXiv] | - | [Github] | - |
| 2024 | Hallucination | HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild. | [ArXiv] | - | [Github] | - |
| 2024 | Factuality | [Simple-evals] Measuring short-form factuality in large language models. | [Paper] | - | [Github] | - |

📖Robust

RobustQA: Benchmarking the robustness of domain adaptation for open-domain question answering.
R Han, P Qi, Y Zhang, L Liu, J Burger, WY Wang, Z Huang, B Xiang, D Roth.
Findings of the Association for Computational Linguistics: ACL 2023, 2023. [ArXiv] [Github]

Are Large Language Models Really Robust to Word-Level Perturbations?
H Wang, G Ma, C Yu, N Gui, L Zhang, Z Huang, S Ma, Y Chang, S Zhang, L Shen, X Wang, et al.
arXiv:2309.11166, 2023. [ArXiv] [Github]

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility.
W Ye, M Ou, T Li, X Ma, Y Yanggong, S Wu, J Fu, G Chen, H Wang, J Zhao.
arXiv:2305.10235, 2023. [ArXiv] [Github]

Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection.
Zekun Li, et al.
arXiv:2308.10819v2, 2023. [ArXiv]

Intuitive or Dependent? Investigating LLMs' Robustness to Conflicting Prompts.
J Ying, Y Cao, K Xiong, Y He, L Cui, Y Liu.
arXiv:2309.17415, 2023. [ArXiv]

Promptbench: Towards evaluating the robustness of large language models on adversarial prompts.
K Zhu, J Wang, J Zhou, Z Wang, H Chen, Y Wang, L Yang, W Ye, Y Zhang, NZ Gong, X Xie.
ArXiv, 2023. [ArXiv] [Github]

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting.
M Sclar, Y Choi, Y Tsvetkov, A Suhr.
arXiv:2310.11324, 2023. [ArXiv] [Github]

Robustness Over Time: Understanding Adversarial Examples' Effectiveness on Longitudinal Versions of Large Language Models.
Y Liu, T Cong, Z Zhao, M Backes, Y Shen, Y Zhang.
arXiv:2308.07847, 2023. [ArXiv]

Robut: A systematic study of table qa robustness against human-annotated adversarial perturbations.
Y Zhao, C Zhao, L Nan, Z Qi, W Zhang, X Tang, B Mi, D Radev.
arXiv:2306.14321, 2023. [ArXiv] [Github]

Revisit input perturbation problems for llms: A unified robustness evaluation framework for noisy slot filling task.
G Dong, J Zhao, T Hui, D Guo, W Wang, B Feng, Y Qiu, Z Gongque, K He, Z Wang, W Xu.
CCF International Conference on Natural Language Processing and Chinese Computing, 2023. [ArXiv] [Github]

📖Application

📖A-Comprehensive

GAIA: a benchmark for General AI Assistants.
G Mialon, C Fourrier, C Swift, T Wolf, Y LeCun, T Scialom.
ArXiv, 2023. [ArXiv] [Datasets]

An empirical study on large language models in accuracy and robustness under chinese industrial scenarios.
Z Li, W Qiu, P Ma, Y Li, Y Li, S He, B Jiang, S Wang, W Gu.
arXiv:2402.01723, 2024. [ArXiv]

What is the best model? Application-driven Evaluation for Large Language Models.
S Lian, K Zhao, X Liu, X Lei, B Yang, W Zhang, K Wang, Z Liu.
arXiv:2406.10307, 2024. [ArXiv] [Datasets]

📖Chatbot

Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems.
SE Finch, JD Finch, JD Choi.
arXiv:2212.09180, 2022. [ArXiv] [Github]

Benchmarking LLM powered chatbots: methods and metrics.
D Banerjee, P Singh, A Avadhanam, S Srivastava.
arXiv:2308.04624, 2023. [ArXiv]

Benchmarking, ethical alignment, and evaluation framework for conversational AI: Advancing responsible development of ChatGPT.
PP Ray.
BenchCouncil Transactions on Benchmarks, Standards and Evaluations, 2023. [Paper]

BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues.
H Duan, J Wei, C Wang, H Liu, Y Fang, S Zhang, D Lin, K Chen.
ArXiv, 2023. [ArXiv] [Github]

DialogBench: Evaluating LLMs as Human-like Dialogue Systems.
J Ou, J Lu, C Liu, Y Tang, F Zhang, D Zhang, Z Wang, K Gai.
ArXiv, 2023. [ArXiv]

Lmsys-chat-1m: A large-scale real-world llm conversation dataset.
Lianmin Zheng, et al.
ArXiv, 2023. [ArXiv] [DataSets]

ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark.
H Wakaki, Y Mitsufuji, Y Maeda, Y Nishimura, S Gao, M Zhao, K Yamada, A Bosselut.
arXiv:2406.11228, 2024. [ArXiv]

DialSim: A Real-Time Simulator for Evaluating Long-Term Dialogue Understanding of Conversational Agents.
J Kim, W Chay, H Hwang, D Kyung, H Chung, E Cho, Y Jo, E Choi.
arXiv:2406.13144, 2024. [ArXiv] [Github]

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words.
J Ao, Y Wang, X Tian, D Chen, J Zhang, L Lu, Y Wang, H Li, Z Wu.
arXiv:2406.13340, 2024. [ArXiv] [Github]

📖Knowledge-Assistant

Docmath-eval: Evaluating numerical reasoning capabilities of llms in understanding long documents with tabular data.
Y Zhao, Y Long, H Liu, L Nan, L Chen, R Kamoi, Y Liu, X Tang, R Zhang, A Cohan.
arXiv:2311.09805, 2023. [ArXiv] [Github]

Evaluating LLMs on document-based QA: Exact answer selection and numerical extraction using CogTale dataset.
Z Rasool, S Kurniawan, S Balugo, S Barnett, et al.
Natural Language Processing Journal, 2024. [Paper]

Kitab: Evaluating llms on constraint satisfaction for information retrieval.
MI Abdin, S Gunasekar, V Chandrasekaran, J Li, M Yuksekgonul, RG Peshawaria, R Naik, et al.
arXiv:2310.15511, 2023. [ArXiv] [Huggingface]

📖RAG

Ragas: Automated evaluation of retrieval augmented generation.
S Es, J James, L Espinosa-Anke, S Schockaert.
arXiv:2309.15217, 2023. [Paper] [Github]

Benchmarking Large Language Models in Retrieval-Augmented Generation.
J Chen, H Lin, X Han, L Sun.
AAAI, 2024. [Paper] [Github]

Ares: An automated evaluation framework for retrieval-augmented generation systems.
J Saad-Falcon, O Khattab, C Potts, M Zaharia.
arXiv:2311.09476, 2023. [Paper] [Github]

CRAG--Comprehensive RAG Benchmark.
X Yang, K Sun, H Xin, Y Sun, N Bhalla, X Chen.
arXiv, 2024. [Paper] [Github]

Crud-rag: A comprehensive chinese benchmark for retrieval-augmented generation of large language models.
Y Lyu, Z Li, S Niu, F Xiong, B Tang, W Wang, H Wu, H Liu, T Xu, E Chen.
arXiv:2401.17043, 2024. [Paper]

📖Data-Analysis

Chartqa: A benchmark for question answering about charts with visual and logical reasoning.
A Masry, DX Long, JQ Tan, S Joty, E Hoque.
arXiv:2203.10244, 2022. [ArXiv] [Github]

QTSumm: Query-focused summarization over tabular data.
Y Zhao, Z Qi, L Nan, B Mi, Y Liu, W Zou, S Han, R Chen, X Tang, Y Xu, D Radev, A Cohan.
arXiv:2305.14303, 2023. [ArXiv] [Github]

TableQAKit: A Comprehensive and Practical Toolkit for Table-based Question Answering.
F Lei, T Luo, P Yang, W Liu, H Liu, J Lei, Y Huang, Y Wei, S He, J Zhao, K Liu.
arXiv:2310.15075, 2023. [ArXiv] [Github]

Datatales: Investigating the use of large language models for authoring data-driven articles.
N Sultanum, A Srinivasan.
IEEE Visualization and Visual Analytics (VIS), 2023. [ArXiv] [Github]

Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation.
Z Kasner, O Dušek.
ACL, 2024. [ACL] [Github]

Are llms capable of data-based statistical and causal reasoning? benchmarking advanced quantitative reasoning with data.
X Liu, Z Wu, X Wu, P Lu, KW Chang, Y Feng.
arXiv:2402.17644, 2024. [ArXiv] [Github]

BIBench: Benchmarking Data Analysis Knowledge of Large Language Models.
S Liu, S Zhao, C Jia, X Zhuang, Z Long, M Lan.
arXiv:2401.02982, 2024. [ArXiv] [Github]

Chartbench: A benchmark for complex visual reasoning in charts.
Z Xu, S Du, Y Qi, C Xu, C Yuan, J Guo.
arXiv:2312.15915, 2023. [ArXiv] [Github]

Infiagent-dabench: Evaluating agents on data analysis tasks.
X Hu, Z Zhao, S Wei, Z Chai, G Wang, X Wang, J Su, J Xu, M Zhu, Y Cheng, J Yuan, et al.
arXiv:2401.05507, 2024. [ArXiv]

Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents.
J Li, N Huo, Y Gao, J Shi, Y Zhao, G Qu, Y Wu, C Ma, JG Lou, R Cheng.
ArXiv, 2024. [ArXiv] [Github]

Viseval: A benchmark for data visualization in the era of large language models.
N Chen, Y Zhang, J Xu, K Ren, Y Yang.
IEEE Transactions on Visualization and Computer Graphics, 2024. [ArXiv]

Table meets llm: Can large language models understand structured table data? a benchmark and empirical study.
Y Sui, M Zhou, M Zhou, S Han, D Zhang.
WSDM, 2024. [ArXiv]

📖Code-Assistant

Date Task Title Paper HomePage Github DataSets
2021 Software [Codexglue] Codexglue: A machine learning benchmark dataset for code understanding and generation. [ArXiv] - [Github] -
2021 Software [HumanEval] Evaluating large language models trained on code. [ArXiv] - [Github] -
2021 Software [APPS] Measuring coding challenge competence with apps. [ArXiv] - [Github] -
2021 Software [MBPP] Program synthesis with large language models. [ArXiv] - [Github] -
2023 Software [ClassEval] Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. [ArXiv] - [Github] -
2023 Software [Codescope] Codescope: An execution-based multilingual multitask multidimensional benchmark for evaluating llms on code understanding and generation. [ArXiv] - [Github] -
2023 Software [StudentEval] StudentEval: a benchmark of student-written prompts for large language models of code. [ArXiv] - - -
2023 Software Testing LLMs on Code Generation with Varying Levels of Prompt Specificity. [ArXiv] - [Github] -
2023 Software Text-to-sql empowered by large language models: A benchmark evaluation. [ArXiv] - [Github] -
2024 Software Competition-Level Problems are Effective LLM Evaluators. [ACL] [Homepage] - -
2024 Software Benchmarking the text-to-sql capability of large language models: A comprehensive evaluation. [ArXiv] - - -
2024 Software Livecodebench: Holistic and contamination free evaluation of large language models for code. [ArXiv] - [Github] -
2024 Software Codereval: A benchmark of pragmatic code generation with generative pre-trained models. [ICSE] - - -

📖Office-Assistant

Pptc benchmark: Evaluating large language models for powerpoint task completion.
Y Guo, Z Zhang, Y Liang, D Zhao, D Nan.
ArXiv, 2023. [ArXiv] [Github]

PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion.
Z Zhang, Y Guo, Y Liang, D Zhao, N Duan.
ArXiv, 2024. [ArXiv] [Github]

📖Content-Generation

KIWI: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions.
F Xu, K Lo, L Soldaini, B Kuehl, E Choi, D Wadden.
ArXiv, 2024. [ArXiv] [Homepage]

📖TaskPlanning

Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning about Change).
K Valmeekam, A Olmo, S Sreedharan, S Kambhampati.
arXiv:2206.10498, 2022.

On the planning abilities of large language models (a critical investigation with a proposed benchmark).
K Valmeekam, S Sreedharan, M Marquez, A Olmo, S Kambhampati.
arXiv:2302.06706, 2023.

On the Planning Abilities of Large Language Models--A Critical Investigation.
K Valmeekam, M Marquez, S Sreedharan, S Kambhampati.
Thirty-seventh Conference on Neural Information Processing Systems, 2023.

PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change.
K Valmeekam, M Marquez, A Olmo, S Sreedharan, S Kambhampati.
Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.

LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench.
K Valmeekam, K Stechly, S Kambhampati.
arXiv:2409.13373, 2024.

📖Agent

Agentsims: An open-source sandbox for large language model evaluation.
J Lin, H Zhao, A Zhang, Y Wu, H Ping, Q Chen.
arXiv:2308.04026, 2023. [ArXiv] [Homepage]

Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents.
Z Liu, W Yao, J Zhang, L Xue, S Heinecke, R Murthy, Y Feng, Z Chen, JC Niebles, D Arpit, et al.
arXiv:2308.05960, 2023. [ArXiv] [Homepage]

Smartplay: A benchmark for llms as intelligent agents.
Y Wu, X Tang, TM Mitchell, Y Li.
arXiv:2310.01557, 2023. [ArXiv] [Homepage]

Agentbench: Evaluating llms as agents.
X Liu, H Yu, H Zhang, Y Xu, X Lei, H Lai, Y Gu, H Ding, K Men, K Yang, S Zhang, X Deng, et al.
ICLR, 2024. [ICLR] [Homepage]

Webarena: A realistic web environment for building autonomous agents.
S Zhou, FF Xu, H Zhu, X Zhou, R Lo, A Sridhar, et al.
arXiv, 2023. [ArXiv] [Homepage]

📖EmbodiedAI

Artificial-General-Intelligence-Testing-Resources.
Resources for AGI & Embodied AI (EAI) Testing.
[Github]

📖Security

📖S-Comprehensive

Fft: Towards harmlessness evaluation and analysis for llms with factuality, fairness, toxicity.
S Cui, Z Zhang, Y Chen, W Zhang, T Liu, S Wang, T Liu.
arXiv:2311.18580, 2023. [ArXiv] [Github]

Safety assessment of chinese large language models.
H Sun, Z Zhang, J Deng, J Cheng, M Huang.
arXiv:2304.10436, 2023. [ArXiv] [Github]

Safetybench: Evaluating the safety of large language models with multiple choice questions.
Z Zhang, L Lei, L Wu, R Sun, Y Huang, C Long, et al.
arXiv, 2023. [ArXiv] [Github]

Sc-safety: A multi-round open-ended question adversarial safety benchmark for large language models in chinese.
L Xu, K Zhao, L Zhu, H Xue.
arXiv:2310.05818, 2023. [ArXiv] [Github]

Trustgpt: A benchmark for trustworthy and responsible large language models.
Y Huang, Q Zhang, L Sun.
arXiv:2306.11507, 2023. [ArXiv] [Github]

Trustworthy llms: a survey and guideline for evaluating large language models' alignment.
Y Liu, Y Yao, JF Ton, X Zhang, R Guo, H Cheng, Y Klochkov, MF Taufiq, H Li.
arXiv:2308.05374, 2023. [ArXiv] [Github]

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models.
B Wang, W Chen, H Pei, C Xie, M Kang, C Zhang, C Xu, Z Xiong, R Dutta, R Schaeffer, et al.
NeurIPS, 2023. [ArXiv] [Github]

Towards ai safety: A taxonomy for ai system evaluation.
B Xia, Q Lu, L Zhu, Z Xing.
arXiv:2404.05388, 2024. [ArXiv]

📖Content-Security

Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection.
T Hartvigsen, S Gabriel, H Palangi, M Sap, D Ray, E Kamar.
arXiv:2203.09509, 2022. [ArXiv] [Github]

A chinese prompt attack dataset for llms with evil content.
C Liu, F Zhao, L Qing, Y Kang, C Sun, K Kuang, F Wu.
arXiv:2309.11830, 2023. [ArXiv] [Github]

Control risk for potential misuse of artificial intelligence in science.
J He, W Feng, Y Min, J Yi, K Tang, S Li, J Zhang, K Chen, W Zhou, X Xie, W Zhang, N Yu, et al.
arXiv:2312.06632, 2023. [ArXiv] [Github]

Do-not-answer: A dataset for evaluating safeguards in llms.
Y Wang, H Li, X Han, P Nakov, T Baldwin.
arXiv:2308.13387, 2023. [ArXiv] [Github]

Examining user-friendly and open-sourced large gpt models: A survey on language, multimodal, and scientific gpt models.
K Gao, S He, Z He, J Lin, QZ Pei, J Shao, W Zhang.
arXiv:2308.14149, 2023. [ArXiv] [Github]

Xstest: A test suite for identifying exaggerated safety behaviours in large language models.
P Röttger, HR Kirk, B Vidgen, G Attanasio, F Bianchi, D Hovy.
arXiv:2308.01263, 2023. [ArXiv] [Github]

JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models.
M Zhang, X Pan, M Yang.
ArXiv, 2023. [ArXiv] [Github]

CARE-MI: chinese benchmark for misinformation evaluation in maternity and infant care.
T Xiang, L Li, W Li, M Bai, L Wei, B Wang, N Garcia.
Advances in Neural Information Processing Systems, 2023. [ArXiv] [Github]

CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain.
X Tong, B Jin, Z Lin, B Wang, T Yu.
arXiv:2402.07234, 2024. [ArXiv]

📖Dialogue

A benchmark for understanding dialogue safety in mental health support.
H Qiu, T Zhao, A Li, S Zhang, H He, Z Lan.
CCF International Conference on Natural Language Processing and Chinese Computing (NLPCC), 2023. [ArXiv] [Github]

Cosafe: Evaluating large language model safety in multi-turn dialogue coreference.
E Yu, J Li, M Liao, S Wang, Z Gao, F Mi, L Hong.
arXiv:2406.17626, 2024. [ArXiv] [Github]

Jailbreak

Latent jailbreak: A benchmark for evaluating text safety and output robustness of large language models.
H Qiu, S Zhang, A Li, H He, Z Lan.
arXiv:2307.08487, 2023. [ArXiv] [Github]

Multilingual jailbreak challenges in large language models.
Y Deng, W Zhang, SJ Pan, L Bing.
arXiv:2310.06474, 2023. [ArXiv] [Github]

Red teaming chatgpt via jailbreaking: Bias, robustness, reliability and toxicity.
TY Zhuo, Y Huang, C Chen, Z Xing.
arXiv:2301.12867, 2023. [ArXiv]

Jailbreakbench: An open robustness benchmark for jailbreaking large language models.
P Chao, E Debenedetti, A Robey, M Andriushchenko, F Croce, V Sehwag, E Dobriban, et al.
arXiv:2404.01318, 2024. [ArXiv] [Github]

📖Value-Alignment

Value

Cvalues: Measuring the values of chinese large language models from safety to responsibility.
G Xu, J Liu, M Yan, H Xu, J Si, Z Zhou, P Yi, X Gao, J Sang, R Zhang, J Zhang, C Peng, et al.
ArXiv, 2023. [ArXiv] [Github]

Flames: Benchmarking value alignment of chinese large language models.
K Huang, X Liu, Q Guo, T Sun, J Sun, Y Wang, Z Zhou, Y Wang, Y Teng, X Qiu, Y Wang, et al.
arXiv:2311.06899, 2023. [ArXiv] [Github]

CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models.
L Yu, Y Leng, Y Huang, S Wu, H Liu, X Ji, et al.
ArXiv, 2024. [ArXiv] [Github]

Localvaluebench: A collaboratively built and extensible benchmark for evaluating localized value alignment and ethical safety in large language models.
GI Meadows, NWL Lau, EA Susanto, CL Yu, et al.
ArXiv, 2024. [ArXiv]

Fairness

CrowS-pairs: A challenge dataset for measuring social biases in masked language models.
N Nangia, C Vania, R Bhalerao, SR Bowman.
arXiv, 2020. [ArXiv] [Github]

Bold: Dataset and metrics for measuring biases in open-ended language generation.
J Dhamala, T Sun, V Kumar, S Krishna, Y Pruksachatkun, KW Chang, R Gupta.
FAccT, 2021. [ArXiv] [Github]

BBQ: A hand-built bias benchmark for question answering.
A Parrish, A Chen, N Nangia, V Padmakumar, J Phang, J Thompson, PM Htut, SR Bowman.
ACL, 2022. [ArXiv] [Github]

CBBQ: A chinese bias benchmark dataset curated with human-ai collaboration for large language models.
Y Huang, D Xiong.
arXiv:2306.16244, 2023. [ArXiv] [Github]

Evaluating and mitigating discrimination in language model decisions.
A Tamkin, A Askell, L Lovitt, E Durmus, N Joseph, S Kravec, K Nguyen, J Kaplan, D Ganguli.
arXiv:2312.03689, 2023. [ArXiv] [Github]

Winoqueer: A community-in-the-loop benchmark for anti-lgbtq+ bias in large language models.
VK Felkner, HCH Chang, E Jang, J May.
arXiv:2306.15087, 2023. [ArXiv] [Github]

A comparative analysis to evaluate bias and fairness across large language models with benchmarks.
MY Chan, SM Wong.
arXiv, 2024. [ArXiv]

📖Model-Security

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents.
T Yuan, Z He, L Dong, Y Wang, R Zhao, T Xia, L Xu, B Zhou, F Li, Z Zhang, R Wang, G Liu.
ArXiv, 2024. [ArXiv] [Github]

I Think, Therefore I am: Awareness in Large Language Models.
Y Li, Y Huang, Y Lin, S Wu, Y Wan, L Sun.
ArXiv, 2024. [ArXiv] [Github]

📖Privacy-Security

Can llms keep a secret? testing privacy implications of language models via contextual integrity theory.
N Mireshghallah, H Kim, X Zhou, Y Tsvetkov, M Sap, R Shokri, Y Choi.
ArXiv, 2023. [ArXiv] [Github]

Llm-pbe: Assessing data privacy in large language models.
Q Li, J Hong, C Xie, J Tan, R Xin, J Hou, X Yin, et al.
ArXiv, 2024. [ArXiv] [Github]

📖Industry

📖Finance

BBT-Fin: Comprehensive Construction of Chinese Financial Domain Pre-trained Language Model, Corpus and Benchmark.
Dakuan Lu, Hengkui Wu, Jiaqing Liang, Yipei Xu, Qianyu He, Yipeng Geng, Mengkun Han, Yingsi Xin, Yanghua Xiao.
ArXiv, 2023. [ArXiv] [Github]

CFBenchmark: Chinese financial assistant benchmark for large language model.
Y Lei, J Li, M Jiang, J Hu, D Cheng, Z Ding, C Jiang.
arXiv:2311.05812, 2023. [ArXiv] [Github]

FinanceBench: A New Benchmark for Financial Question Answering.
Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, Bertie Vidgen.
ArXiv, 2023. [ArXiv] [Github]

FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models.
Liwen Zhang, Weige Cai, Zhaowei Liu, Zhi Yang, Wei Dai, Yujie Liao, Qianru Qin, Yifei Li, Xingyu Liu, Zhiqiang Liu, Zhoufan Zhu, Anbo Wu, Xin Guo, Yun Chen.
ArXiv, 2023. [ArXiv] [Github] [Datasets]

FinGPT: Open-Source Financial Large Language Models.
Hongyang Yang, Xiao-Yang Liu, Christina Dan Wang.
ArXiv, 2023. [ArXiv] [Github] [Datasets]

PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance.
Q Xie, W Han, X Zhang, Y Lai, M Peng, A Lopez-Lira, J Huang.
ArXiv, 2023. [ArXiv] [Github] [Datasets]

WHEN FLUE MEETS FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain.
Raj Sanjay Shah, Kunal Chawla, Dheeraj Eidnani, Agam Shah, Wendi Du, Sudheer Chava, Natraj Raman, Charese Smiley, Jiaao Chen, Diyi Yang.
ArXiv, 2022. [ArXiv] [Github] [Datasets]

📖Medical

PubMedQA: A Dataset for Biomedical Research Question Answering.
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, Xinghua Lu.
EMNLP, 2019. [ArXiv] [Github]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams.
Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, Peter Szolovits.
Applied Sciences, 2021. [ArXiv] [Github] [Datasets]

MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering.
Ankit Pal, Logesh Kumar Umapathi, Malaikannan Sankarasubbu.
PMLR, 2022. [ArXiv] [Github] [Datasets]

Benchmarking Large Language Models on CMExam - A comprehensive Chinese Medical Exam Dataset.
Junling Liu, Peilin Zhou, Yining Hua, Dading Chong, Zhongyu Tian, Andrew Liu, Helin Wang, Chenyu You, Zhenhua Guo, Lei Zhu, Michael Lingzhi Li.
NeurIPS, 2023. [Paper] [Github]

ExplainCPE: A Free-text Explanation Benchmark of Chinese Pharmacist Examination.
Dongfang Li, Jindi Yu, Baotian Hu, Zhenran Xu, Min Zhang.
EMNLP, 2023. [ArXiv] [Github]

CMB: A Comprehensive Medical Benchmark in Chinese.
Xidong Wang, Guiming Hardy Chen, Dingjie Song, Zhiyi Zhang, Zhihong Chen, Qingying Xiao, Feng Jiang, Jianquan Li, Xiang Wan, Benyou Wang, Haizhou Li.
ArXiv, 2023. [ArXiv] [Datasets]

MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large Language Models in Medicine.
Jie Xu, Lu Lu, Sen Yang, Bilin Liang, Xinwei Peng, Jiali Pang, Jinru Ding, Xiaoming Shi, Lingrui Yang, Huan Song, Kang Li, Xin Sun, Shaoting Zhang.
ArXiv, 2023. [ArXiv]

PromptCBLUE: A Chinese Prompt Tuning Benchmark for the Medical Domain.
Wei Zhu, Xiaoling Wang, Huanran Zheng, Mosha Chen, Buzhou Tang.
ArXiv, 2023. [ArXiv] [Datasets]

Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench.
Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu.
ArXiv, 2023. [ArXiv] [Github]

Large Language Models Encode Clinical Knowledge.
Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, et al.
Nature, 2023. [HomePage]

MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models.
Y Cai, L Wang, Y Wang, G de Melo, Y Zhang, Y Wang, L He.
AAAI, 2024. [Paper] [Github]

Evaluation of ChatGPT-generated medical responses: a systematic review and meta-analysis.
Q Wei, Z Yao, Y Cui, B Wei, Z Jin, X Xu.
Journal of Biomedical Informatics, 2024. [Paper]

Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI.
M Abbasian, E Khatibi, I Azimi, D Oniani, Z Shakeri Hossein Abad, A Thieme, R Sriram, et al.
NPJ Digital Medicine, 2024. [Paper]

📖Law

JEC-QA: A Legal-Domain Question Answering Dataset.
Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, Maosong Sun.
ArXiv, 2019. [Paper]

CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review.
Dan Hendrycks, Collin Burns, Anya Chen, Spencer Ball.
ArXiv, 2021. [ArXiv] [Github] [Datasets]

LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning.
Neel Guha, Daniel E. Ho, Julian Nyarko, Christopher Ré.
ArXiv, 2022. [ArXiv] [Github]

LAiW: A Chinese Legal Large Language Models Benchmark (A Technical Report).
Yongfu Dai, Duanyu Feng, Jimin Huang, Haochen Jia, Qianqian Xie, Yifang Zhang, Weiguang Han, Wei Tian, Hao Wang.
ArXiv, 2023. [ArXiv] [Github]

LawBench: Benchmarking Legal Knowledge of Large Language Models.
Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai Chen, Zongwen Shen, Jidong Ge.
ArXiv, 2023. [ArXiv] [Github]

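Roadmap Analysis of Evaluation Frameworks for Judicial Large Language Models (司法大语言模型评估框架路线分析).
H Li, Q Ai, Y Wu, Y Liu.
CAAI, 2023.

Evaluation Metrics and Assessment Methods for Legal Large Models (法律大模型评估指标和测评方法).
J Xu, C Liu, K Kuang, H He, C Sun, B Li, B Wei, L Yang, Y Jin, F Wu.
Chinese Association for Artificial Intelligence (CAAI), 2024. [Paper]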

📖Engineering

Date Task Title Paper HomePage Github DataSets
2023 Software Empower large language model to perform better on industrial domain-specific question answering. [ArXiv] - [Github] -
2023 Software Exploring the effectiveness of llms in automated logging generation: An empirical study. [ArXiv] - [Github] -
2023 Software OpsEval: A Comprehensive Task-Oriented AIOps Benchmark for Large Language Models. [ArXiv] - [Github] -
2024 Software CloudEval-YAML: A Practical Benchmark for Cloud Native YAML Configuration Generation. [MLSys] - [Github] -
2024 Software CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery. [ArXiv] - [Github] -

📖Education

Curriculum-Driven Edubot: A Framework for Developing Language Learning Chatbots Through Synthesizing Conversational Data.
Y Li, S Qu, J Shen, S Min, Z Yu.
arXiv:2309.16804, 2023. [Paper]

CK12: A Rounded K12 Knowledge Graph Based Benchmark for Chinese Holistic Cognition Evaluation.
W You, P Wang, C Li, Z Ji, J Bai.
Proceedings of the AAAI Conference on Artificial Intelligence, 2024. [Paper] [Github]

Adapting large language models for education: Foundational capabilities, potentials, and challenges.
Q Li, L Fu, W Zhang, X Chen, J Yu, W Xia, W Zhang, R Tang, Y Yu.
arXiv:2401.08664, 2023. [Paper]

E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models.
J Hou, C Ao, H Wu, X Kong, Z Zheng, D Tang, C Li, X Hu, R Xu, S Ni, M Yang.
arXiv:2401.15927, 2024. [ArXiv] [Github]

Large language models for education: A survey and outlook.
S Wang, T Xu, H Li, C Zhang, J Liang, J Tang, PS Yu, Q Wen.
arXiv:2403.18105, 2024. [ArXiv] [Github]

📖Research

Date Task Title Paper HomePage Github DataSets
2023 Comprehensive Benchmarking large language models as ai research agents. [ArXiv] - [Github] -
2023 Comprehensive GPT vs Human for Scientific Reviews: A Dual Source Review on Applications of ChatGPT in Science. [ArXiv] - - -
2023 Comprehensive LLMs for science: Usage for code generation and data analysis. [JSEP] - - -
2023 Comprehensive MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation. [ArXiv] - [Github] -
2023 Comprehensive Scibench: Evaluating college-level scientific problem-solving abilities of large language models. [ArXiv] - [Github] -
2023 Comprehensive Scieval: A multi-level large language model evaluation benchmark for scientific research. [AAAI] - [Github] -
2023 Comprehensive The sciqa scientific question answering benchmark for scholarly knowledge. [SR] - [Github] [DataSets]
2024 Biomedical Bioinfo-Bench: A Simple Benchmark Framework for LLM Bioinformatics Skills Evaluation. [bioRxiv] - [Github] -
2023 Chemistry [ChemLLMBench] Do large language models understand chemistry? a conversation with chatgpt. [JCIM] - [Github] -
2024 Chemistry [ChemLLMBench] What can large language models do in chemistry? a comprehensive benchmark on eight tasks. [NeurIPS] - [Github] -
2024 Geoscience [GeoBench] K2: A foundation language model for geoscience knowledge understanding and utilization. [WSDM] - [Github] -
2023 Materials [MaScQA] MaScQA: A Question Answering Dataset for Investigating Materials Science Knowledge of Large Language Models. [ArXiv] - [Github] -

📖Government-Affairs

To be refreshed...

📖Communication

TeleQnA: A Benchmark Dataset to Assess Large Language Models Telecommunications Knowledge.
Ali Maatouk, Fadhel Ayed, Nicola Piovesan, Antonio De Domenico, Merouane Debbah, Zhi-Quan Luo.
ArXiv, 2023. [ArXiv] [Github]

An Empirical Study of NetOps Capability of Pre-Trained Large Language Models.
Yukai Miao, Yu Bai, Li Chen, Dan Li, Haifeng Sun, Xizheng Wang, Ziqiu Luo, Yanyu Ren, Dapeng Sun, Xiuting Xu, Qi Zhang, Chao Xiang, Xinchi Li.
ArXiv, 2023. [ArXiv] [Datasets]

NetConfEval: Can LLMs Facilitate Network Configuration?
C Wang, M Scazzariello, A Farshin, S Ferlin, D Kostić, M Chiesa.
Proceedings of the ACM on Networking, 2024. [ArXiv]

📖Power

NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear Domain.
A Acharya, S Munikoti, A Hellinger, S Smith, S Wagle, S Horawalavithana.
ArXiv, 2023. [ArXiv] [Github]

📖Transportation

Open-transmind: A new baseline and benchmark for 1st foundation model challenge of intelligent transportation.
Y Shi, F Lv, X Wang, C Xia, S Li, S Yang, T Xi, G Zhang.
CVPR, 2023. [Paper] [Github]

📖Industry (Manufacturing)

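Industrial Large Models: Architecture, Key Technologies, and Typical Applications (工业大模型:体系架构、关键技术与典型应用).
L Ren, H Wang, J Dong, et al.
Scientia Sinica Informationis, 2024 (under review).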

📖Media

Evaluating the Effectiveness of GPT Large Language Model for News Classification in the IPTC News Ontology.
B Fatemi, F Rabbi, AL Opdahl.
ArXiv, 2023. [Paper]

📖Design

How Good is ChatGPT in Giving Advice on Your Visualization Design.
NW Kim, G Myers, B Bach.
arXiv:2310.09617, 2023. [Paper]

📖Internet

Llmrec: Benchmarking large language models on recommendation task.
J Liu, C Liu, P Zhou, Q Ye, D Chong, K Zhou, Y Xie, Y Cao, S Wang, C You, PS Yu.
arXiv:2308.12241, 2023. [Paper]

📖Game

Gameeval: Evaluating llms on conversational games.
D Qiao, C Wu, Y Liang, J Li, N Duan.
arXiv:2308.10032, 2023. [Paper] [Github]

AvalonBench: Evaluating LLMs Playing the Game of Avalon.
J Light, M Cai, S Shen, Z Hu.
NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023. [Paper] [Github]

📖Robot

Artificial-General-Intelligence-Testing-Resources.
Resources for AGI & Embodied AI (EAI) Testing.
[Github]

📖Human-Machine-Interaction

📖User-Experience

A User-Centric Benchmark for Evaluating Large Language Models.
J Wang, F Mo, W Ma, P Sun, M Zhang, et al.
ArXiv, 2024. [ArXiv] [Github]

Understanding User Experience in Large Language Model Interactions.
J Wang, W Ma, P Sun, M Zhang, JY Nie.
ArXiv, 2024. [ArXiv]

📖Social-Intelligence

Hi-tom: A benchmark for evaluating higher-order theory of mind reasoning in large language models.
Y He, Y Wu, Y Jia, R Mihalcea, Y Chen, N Deng.
arXiv:2310.16755, 2023. [ArXiv] [Github]

Sotopia: Interactive evaluation for social intelligence in language agents.
X Zhou, H Zhu, L Mathur, R Zhang, H Yu, Z Qi, LP Morency, Y Bisk, D Fried, G Neubig, et al.
arXiv:2310.11667, 2023. [ArXiv] [Homepage]

Academically intelligent LLMs are not necessarily socially intelligent.
R Xu, H Lin, X Han, L Sun, Y Sun.
arXiv:2403.06591, 2024. [ArXiv] [Homepage]

InterIntent: Investigating Social Intelligence of LLMs via Intention Understanding in an Interactive Game Context.
Z Liu, A Anand, P Zhou, J Huang, J Zhao.
arXiv, 2024. [ArXiv]

Evaluating and Modeling Social Intelligence: A Comparative Study of Human and AI Capabilities.
J Wang, C Zhang, J Li, Y Ma, L Niu, J Han, Y Peng, Y Zhu, L Fan.
arXiv:2405.11841, 2024. [ArXiv] [Github]

ToMBench: Benchmarking Theory of Mind in Large Language Models.
Z Chen, J Wu, J Zhou, B Wen, G Bi, G Jiang, Y Cao, M Hu, Y Lai, Z Xiong, M Huang.
arXiv:2402.15052, 2024. [ArXiv] [Github]

Testing theory of mind in large language models and humans.
JWA Strachan, D Albergo, G Borghini, O Pansardi, E Scaliti, S Gupta, K Saxena, A Rufo, et al.
Nature Human Behaviour, 2024. [ArXiv]

📖Emotional-Intelligence

Emotionally numb or empathetic? evaluating how llms feel using emotionbench.
J Huang, MH Lam, EJ Li, S Ren, W Wang.
arXiv, 2023. [ArXiv] [Github]

Can Generative Agents Predict Emotion?
C Regan, N Iwahashi, S Tanaka, M Oka.
arXiv:2402.04232, 2024. [ArXiv] [Github]

EmoBench: Evaluating the Emotional Intelligence of Large Language Models.
S Sabour, S Liu, Z Zhang, JM Liu, J Zhou, AS Sunaryo, J Li, T Lee, R Mihalcea, M Huang.
arXiv:2402.12071, 2024. [ArXiv] [Github]

GIEBench: Towards Holistic Evaluation of Group Identity-based Empathy for Large Language Models.
L Wang, Y Jin, T Shen, T Zheng, X Du, C Zhang, W Huang, J Liu, S Wang, G Zhang, L Xiang, et al.
arXiv:2406.14903, 2024. [ArXiv] [Github]

📖Performance-Cost

📖Model-Compression

A Comprehensive Evaluation of Quantization Strategies for Large Language Models.
M Zhang, X Pan, M Yang.
ACL, 2024. [ArXiv]

Evaluating the Generalization Ability of Quantized LLMs: Benchmark, Analysis, and Toolbox.
Y Liu, Y Meng, F Wu, S Peng, H Yao, C Guan, C Tang, X Ma, Z Wang, W Zhu.
arXiv:2406.12928, 2024. [ArXiv]

📖Edge-Model

MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases.
R Murthy, L Yang, J Tan, TM Awalgaonkar, Y Zhou, S Heinecke, S Desai, J Wu, R Xu, S Tan, et al.
arXiv:2406.10290, 2024. [ArXiv]

📖Carbon-Emission

OpenCarbonEval: A Unified Carbon Emission Estimation Framework in Large-Scale AI Models.
Z Yu, Y Wu, Z Deng, Y Tang, XP Zhang.
arXiv:2405.12843, 2024. [ArXiv]

📖Testing-DataSets

📖Datasets-Quality

Multimodal-Data-Optimization-Resources.
Test DataSets Evaluation
[Github]

📖Datasets-Generation

Multimodal-Data-Generation-Resources.
Test DataSets Generation
[Github]

📖Testing-Methods

📖NLG-Evaluation

Are large language model-based evaluators the solution to scaling up multilingual evaluation?
R Hada, V Gumma, A de Wynter, H Diddee, M Ahmed, M Choudhury, K Bali, S Sitaram.
arXiv:2309.07462, 2023. [ArXiv]

Automated evaluation of personalized text generation using large language models.
Y Wang, J Jiang, M Zhang, C Li, Y Liang, Q Mei, M Bendersky.
arXiv:2310.11593, 2023. [ArXiv]

Calibrating LLM-Based Evaluator.
Y Liu, T Yang, S Huang, Z Zhang, H Huang, et al.
arXiv, 2023. [ArXiv]

Can large language models be an alternative to human evaluations?
CH Chiang, H Lee.
arXiv:2305.01937, 2023. [ArXiv]

Chateval: Towards better llm-based evaluators through multi-agent debate.
CM Chan, W Chen, Y Su, J Yu, W Xue, S Zhang, J Fu, Z Liu.
arXiv:2308.07201, 2023. [ArXiv]

CRITIQUELLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation.
P Ke, B Wen, Z Feng, X Liu, X Lei, J Cheng, S Wang, A Zeng, Y Dong, H Wang, J Tang, et al.
ArXiv, 2023. [ArXiv] [Github]

Generative judge for evaluating alignment.
J Li, S Sun, W Yuan, RZ Fan, H Zhao, P Liu.
arXiv:2310.05470, 2023. [ArXiv]

G-eval: Nlg evaluation using gpt-4 with better human alignment.
Y Liu, D Iter, Y Xu, S Wang, R Xu, C Zhu.
arXiv:2303.16634, 2023. [ArXiv]

JudgeLM: Fine-tuned Large Language Models are Scalable Judges.
L Zhu, X Wang, X Wang.
ArXiv, 2023. [ArXiv] [Github]

Prd: Peer rank and discussion improve large language model based evaluations.
R Li, T Patel, X Du.
arXiv:2307.02762, 2023. [ArXiv] [Github]

Split and merge: Aligning position biases in large language model based evaluators.
Z Li, C Wang, P Ma, D Wu, S Wang, C Gao, et al.
arXiv, 2023. [ArXiv]

Wider and deeper llm networks are fairer llm evaluators.
X Zhang, B Yu, H Yu, Y Lv, T Liu, F Huang, H Xu, Y Li.
arXiv:2308.01862, 2023. [ArXiv]

Large Language Models are not Fair Evaluators.
P Wang, L Li, L Chen, Z Cai, D Zhu, B Lin, Y Cao, Q Liu, T Liu, Z Sui.
ACL, 2024. [ArXiv]

Aligning with human judgement: The role of pairwise preference in large language model evaluators.
Y Liu, H Zhou, Z Guo, E Shareghi, I Vulic, A Korhonen, N Collier.
arXiv:2403.16950, 2024. [ArXiv]

Agent-as-a-Judge: Evaluate Agents with Agents.
M Zhuge, C Zhao, D Ashley, W Wang, D Khizbullin, Y Xiong, Z Liu, E Chang, et al.
arXiv:2410.10934, 2024. [ArXiv]

An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers.
H Huang, Y Qu, J Liu, M Yang, T Zhao.
arXiv:2403.02839, 2024. [ArXiv]

CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution.
M Cao, A Lam, H Duan, H Liu, S Zhang, K Chen.
arXiv:2410.16256, 2024. [ArXiv] [Github]

Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework.
M Li, Z Liu, S Deng, S Joty, NF Chen, MY Kan.
arXiv:2405.15329, 2024. [ArXiv]

Length-controlled alpacaeval: A simple way to debias automatic evaluators.
Y Dubois, B Galambosi, P Liang, TB Hashimoto.
arXiv:2404.04475, 2024. [ArXiv]

Leveraging large language models for nlg evaluation: A survey.
Z Li, X Xu, T Shen, C Xu, JC Gu, C Tao.
arXiv:2401.07103, 2024. [ArXiv]

Peer-review-in-LLMs: Automatic Evaluation Method for LLMs in Open-environment.
KP Ning, S Yang, YY Liu, JY Yao, ZH Liu, Y Wang, M Pang, L Yuan.
arXiv:2402.01830, 2024. [ArXiv]

Pre: A peer review based large language model evaluator.
Z Chu, Q Ai, Y Tu, H Li, Y Liu.
arXiv:2401.15641, 2024. [ArXiv]

Prometheus 2: An open source language model specialized in evaluating other language models.
S Kim, J Suk, S Longpre, BY Lin, J Shin, S Welleck, G Neubig, M Lee, K Lee, M Seo.
arXiv:2405.01535, 2024. [ArXiv]

Self-taught evaluators.
T Wang, I Kulikov, O Golovneva, P Yu, W Yuan, et al.
arXiv, 2024. [ArXiv]

The Critique of Critique.
S Sun, et al.
arXiv:2401.04518v1, 2024.

Evaluating large language models at evaluating instruction following.
Z Zeng, J Yu, T Gao, Y Meng, T Goyal, D Chen.
ICLR, 2024. [ArXiv]

Flask: Fine-grained language model evaluation based on alignment skill sets.
S Ye, D Kim, S Kim, H Hwang, S Kim, Y Jo, J Thorne, J Kim, M Seo.
ICLR, 2024. [ArXiv]

Benchmarking foundation models with language-model-as-an-examiner.
Y Bai, J Ying, Y Cao, X Lv, Y He, X Wang, J Yu, K Zeng, Y Xiao, H Lyu, J Zhang, J Li, L Hou.
Advances in Neural Information Processing Systems, 2024. [ArXiv]

📖Accurate-Testing

Efficiently measuring the cognitive ability of llms: An adaptive testing perspective.
Y Zhuang, Q Liu, Y Ning, W Huang, R Lv, Z Huang, G Zhao, Z Zhang, Q Mao, S Wang, et al.
arXiv:2306.10512, 2023. [ArXiv]

Large language model routing with benchmark datasets.
T Shnitzer, A Ou, M Silva, K Soule, Y Sun, et al.
arXiv, 2023. [ArXiv]

AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models.
J Cheng, Y Lu, X Gu, P Ke, X Liu, Y Dong, H Wang, J Tang, M Huang.
arXiv:2406.16714, 2024. [ArXiv] [Github]

Efficient benchmarking (of language models).
Y Perlitz, E Bandel, A Gera, O Arviv, L Ein-Dor, E Shnarch, N Slonim, M Shmueli-Scheuer, et al.
arXiv:2308.11696, 2023. [ArXiv]

MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures.
J Ni, F Xue, X Yue, Y Deng, M Shah, K Jain, et al.
arXiv, 2024. [ArXiv] [Github]

tinyBenchmarks: evaluating LLMs with fewer examples.
FM Polo, L Weber, L Choshen, Y Sun, G Xu, et al.
arXiv, 2024. [ArXiv] [Github]

📖Dynamic-Testing

Dynabench: Rethinking Benchmarking in NLP.
Douwe Kiela, et al.
arXiv, 2021.

Beyond static datasets: A deep interaction approach to llm evaluation.
J Li, R Li, Q Liu, et al.
arXiv, 2023. [ArXiv]

LLMEval: A Preliminary Study on How to Evaluate Large Language Models.
Y Zhang, M Zhang, H Yuan, S Liu, Y Shi, T Gui, Q Zhang, X Huang.
Proceedings of the AAAI Conference on Artificial Intelligence, 2024. [HomePage] [Paper] [Github]

Have Seen Me Before? Automating Dataset Updates Towards Reliable and Timely Evaluation.
Jiahao Ying, et al.
arXiv:2402.11894v2, 2024.

Livebench: A challenging, contamination-free llm benchmark.
C White, S Dooley, M Roberts, A Pal, B Feuer, et al.
arXiv, 2024. [HomePage] [Paper]

📖Human-Interaction-Testing

Beyond static datasets: A deep interaction approach to llm evaluation.
J Li, R Li, Q Liu.
arXiv, 2023. [ArXiv]

Beyond static AI evaluations: advancing human interaction evaluations for LLM harms and risks.
L Ibrahim, S Huang, L Ahmad, M Anderljung.
arXiv:2405.10632, 2024. [ArXiv]

📖Others

Branch-solve-merge improves large language model evaluation and generation.
Swarnadeep Saha, et al.
arXiv:2310.15123v1, 2023.

Evaluating general-purpose ai with psychometrics.
X Wang, L Jiang, J Hernandez-Orallo, D Stillwell, L Sun, F Luo, X Xie.
arXiv:2310.16379, 2023. [ArXiv]

State of what art? a call for multi-prompt llm evaluation.
M Mizrahi, G Kaplan, D Malkin, R Dror, D Shahaf, G Stanovsky.
Transactions of the Association for Computational Linguistics, 2024. [TACL]

📖Testing-Tools

Evals
OpenAI
[Github]

Language Model Evaluation Harness.
EleutherAI
[Github]

DeepEval.
Confident AI
[Github]

OpenCompass
Sinan large model evaluation platform
Shanghai AI Laboratory
[HomePage] [Github]

FlagEval
Libra large model evaluation platform
Beijing Academy of Artificial Intelligence (BAAI)
[HomePage] [Github]

Cleva: Chinese language models evaluation platform.
Y Li, J Zhao, D Zheng, ZY Hu, Z Chen, X Su, Y Huang, S Huang, D Lin, MR Lyu, L Wang.
ArXiv, 2023. [ArXiv]

GPT-Fathom.
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond.
S Zheng, Y Zhang, Y Zhu, C Xi, P Gao, X Zhou, KCC Chang.
ArXiv, 2023. [ArXiv] [Github]

Catwalk.
Catwalk: A Unified Language Model Evaluation Framework for Many Datasets.
D Groeneveld, A Awadalla, I Beltagy, A Bhagia, I Magnusson, H Peng, O Tafjord, P Walsh, et al.
ArXiv, 2023. [ArXiv] [Github]

LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking.
F Dalvi, M Hasanain, S Boughorbel, B Mousi, S Abdaljalil, N Nazar, A Abdelali, et al.
ArXiv, 2023. [ArXiv] [Github]

HumanELY: Human evaluation of LLM yield, using a novel web-based evaluation tool.
R Awasthi, S Mishra, D Mahapatra, A Khanna, K Maheshwari, J Cywinski, F Papay, P Mathur.
medRxiv, 2023. [ArXiv]

UltraEval.
UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs.
C He, R Luo, X Han, Z Liu, M Sun, et al.
ArXiv, 2024. [ArXiv] [Github]

FreeEval.
FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models.
Z Yu, C Gao, W Yao, Y Wang, Z Zeng, W Ye, J Wang, Y Zhang, S Zhang.
ArXiv, 2024. [ArXiv] [Github]

OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety.
C Liu, L Yu, J Li, R Jin, Y Huang, L Shi, J Zhang, et al.
ArXiv, 2024. [HomePage]

📖Challenges

📖Contamination

Clean-eval: Clean evaluation on contaminated large language models.
W Zhu, H Hao, Z He, Y Song, Y Zhang, H Hu, Y Wei, R Wang, H Lu.
arXiv:2311.09154, 2023. [Paper]

Data contamination through the lens of time.
M Roberts, H Thakur, C Herlihy, C White, S Dooley.
arXiv:2310.10628, 2023. [Paper]

Investigating data contamination in modern benchmarks for large language models.
C Deng, Y Zhao, X Tang, M Gerstein, A Cohan.
arXiv:2311.09783, 2023. [Paper]

Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark.
O Sainz, JA Campos, I García-Ferrero, J Etxaniz, OL de Lacalle, E Agirre.
arXiv:2310.18018, 2023. [Paper] [Github]

Rethinking benchmark and contamination for language models with rephrased samples.
S Yang, WL Chiang, L Zheng, JE Gonzalez, I Stoica.
arXiv:2311.04850, 2023. [Paper] [Github]

Task contamination: Language models may not be few-shot anymore.
C Li, J Flanigan.
Proceedings of the AAAI Conference on Artificial Intelligence, 2024. [Paper]

Investigating data contamination for pre-training language models.
M Jiang, KZ Liu, M Zhong, R Schaeffer, S Ouyang, J Han, S Koyejo.
arXiv:2401.06059, 2024. [Paper]

KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models.
Z Yu, C Gao, W Yao, Y Wang, W Ye, J Wang, X Xie, Y Zhang, S Zhang.
ArXiv, 2024. [ArXiv] [Github]

📖Other

Large language models sensitivity to the order of options in multiple-choice questions.
P Pezeshkpour, E Hruschka.
arXiv:2308.11483, 2023. [Paper]

Don't make your llm an evaluation benchmark cheater.
K Zhou, Y Zhu, Z Chen, W Chen, WX Zhao, X Chen, Y Lin, JR Wen, J Han.
arXiv:2311.01964, 2023. [Paper] [Github]

Inadequacies of large language model benchmarks in the era of generative artificial intelligence.
TR McIntosh, T Susnjak, N Arachchilage, T Liu, P Watters, MN Halgamuge.
arXiv:2402.09880, 2024. [Paper]

📖Supported-Elements

📖Organization

LMSYS Org
UC Berkeley.
[Homepage]

📖Group

| Name | Organization | HomePage | Github | Scholar | Benchmark |
| --- | --- | --- | --- | --- | --- |
| Sun Maosong | Tsinghua University | [homepage] | - | [scholar] | - |
| Tang Jie | Tsinghua University | [homepage] | - | [scholar] | - |
| Huang Minlie | Tsinghua University | [homepage] | - | [scholar] | - |
| Zheng Haitao | Tsinghua University | [homepage] | - | [scholar] | - |
| Ye Wei | Peking University | [homepage] | - | [scholar] | - |
| Qiu Xipeng | Fudan University | [homepage] | - | [scholar] | - |
| Xiao Yanghua | Fudan University | [homepage] | - | [scholar] | - |
| Xiong Deyi | Tianjin University | [homepage] | [github] | [scholar] | - |
| Chen Kai | Shanghai AI Lab | - | - | [scholar] | - |
| Zhang Songyang | Shanghai AI Lab | - | - | [scholar] | - |

📖Conference

NeurIPS (Datasets and Benchmarks Track).
[Homepage]

📖Company

Patronus AI.
[Homepage]
