Large Language Models (LLMs) Testing Resources: a curated list of awesome LLM testing papers with code. See 📖Contents for details. This repo is updated frequently. 👨💻 Welcome to star ⭐️ this repo or submit a PR! I will review and merge it.
- 📖Leaderboard
- 📖Review
- 📖General
- 📖Application
- 📖Security
- 📖Industry
- 📖Human-Machine-Interaction
- 📖Performance-Cost
- 📖Testing-DataSets
- 📖Testing-Methods
- 📖Testing-Tools
- 📖Challenges
- 📖Supported-Elements
| Date | Title | Paper | HomePage | Github | DataSets | Organization |
|---|---|---|---|---|---|---|
| 2023 | Open LLM Leaderboard. | - | [homepage] | - | - | Huggingface |
| 2023 | Chatbot arena: An open platform for evaluating llms by human preference. | [arXiv] | [homepage] | - | - | UC Berkeley |
| 2024 | AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. | [NeurIPS] | [homepage] | - | - | Stanford University |
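| 2023 | OpenCompass (Sinan): a large model evaluation platform. | - | [homepage] | [Github] | - | Shanghai AI Laboratory |
| 2023 | FlagEval (Libra): a large model evaluation platform. | - | [homepage] | - | - | Beijing Academy of Artificial Intelligence (BAAI) |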
| 2023 | Superclue: A comprehensive chinese large language model benchmark. | [arXiv] | [homepage] | - | - | SUPERCLUE |
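| 2023 | SuperBench: a comprehensive capability evaluation framework for large models. | - | - | - | - | Tsinghua University, Foundation Model Research Center |
| 2023 | LLMEval: A Preliminary Study on How to Evaluate Large Language Models. | [AAAI] | [homepage] | [Github] | - | Fudan University |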
| 2023 | CLiB-chinese-llm-benchmark. | - | - | [Github] | - | - |
Evaluating large language models: A comprehensive survey.
Z Guo, R Jin, C Liu, Y Huang, D Shi, L Yu, Y Liu, J Li, B Xiong, D Xiong.
ArXiv, 2023.
[ArXiv]
[Github]
A Survey on Evaluation of Large Language Models.
Y Chang, X Wang, J Wang, Y Wu, L Yang, K Zhu, H Chen, X Yi, C Wang, Y Wang, W Ye, et al.
ACM Transactions on Intelligent Systems and Technology, 2024.
[Paper]
[ArXiv]
[Github]
Through the lens of core competency: Survey on evaluation of large language models.
Z Ziyu, C Qiguang, M Longxuan, L Mingda, et al.
CCL, 2024.
[Paper]
A survey of large language model evaluation.
W Luo, H Wang.
Journal of Chinese Information Processing, 2024.
A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity.
Y Bang, S Cahyawijaya, N Lee, W Dai, D Su, et al.
arXiv, 2023.
[Paper]
A systematic study and comprehensive evaluation of ChatGPT on benchmark datasets.
MTR Laskar, MS Bari, M Rahman, MAH Bhuiyan, S Joty, JX Huang.
arXiv:2305.18486, 2023.
[Paper]
Holistic evaluation of language models.
R Bommasani, P Liang, T Lee, et al.
ArXiv, 2023.
[Homepage]
[ArXiv]
[Github]
Alignbench: Benchmarking chinese alignment of large language models.
X Liu, X Lei, S Wang, Y Huang, Z Feng, B Wen, J Cheng, P Ke, Y Xu, WL Tam, X Zhang, et al.
arXiv:2311.18743, 2023.
[ArXiv]
[Github]
TencentLLMEval: a hierarchical evaluation of Real-World capabilities for human-aligned LLMs.
S Xie, W Yao, Y Dai, S Wang, D Zhou, L Jin, X Feng, P Wei, Y Lin, Z Hu, D Yu, Z Zhang, et al.
arXiv:2311.05374, 2023.
[ArXiv]
Evaluation of openai o1: Opportunities and challenges of agi.
T Zhong, Z Liu, Y Pan, Y Zhang, Y Zhou, S Liang, Z Wu, Y Lyu, P Shu, X Yu, C Cao, H Jiang, et al.
arXiv:2409.18486, 2024.
[ArXiv]
| Date | Task | Title | Paper | HomePage | Github | DataSets |
|---|---|---|---|---|---|---|
| 2018 | Comprehensive | GLUE: A multi-task benchmark and analysis platform for natural language understanding. | [ArXiv] | [Homepage] | - | - |
| 2019 | Comprehensive | Superglue: A stickier benchmark for general-purpose language understanding systems. | [NeurIPS] | [Homepage] | - | - |
| 2020 | Comprehensive | CLUE: A Chinese language understanding evaluation benchmark. | [ArXiv] | [Homepage] | - | - |
| 2021 | Comprehensive | Fewclue: A chinese few-shot learning evaluation benchmark. | [ArXiv] | [Homepage] | - | - |
| 2017 | Reading | Race: Large-scale reading comprehension dataset from examinations. | [ArXiv] | - | [Github] | [Datasets] |
| 2018 | Reading | Know what you don't know: Unanswerable questions for SQuAD. | [ArXiv] | - | - | - |
| 2017 | Reading | Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. | [ArXiv] | [Homepage] | - | - |
| 2019 | Reading | DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. | [ArXiv] | [Homepage] | - | - |
| 2019 | Reading | BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. | [ArXiv] | [Homepage] | - | - |
| 2023 | Reading | The belebele benchmark: a parallel reading comprehension dataset in 122 language variants. | [ArXiv] | - | - | - |
| 2024 | Reading | AC-EVAL: Evaluating Ancient Chinese Language Understanding in Large Language Models. | [ArXiv] | - | [Github] | - |
| 2023 | Semantic | The two word test: A semantic benchmark for large language models. | [ArXiv] | - | - | - |
| 2023 | Semantic | This is not a dataset: A large negation benchmark to challenge large language models. | [ArXiv] | - | [Github] | - |
| 2023 | Graph | Gpt4graph: Can large language models understand graph structured data? an empirical evaluation and benchmarking. | [ArXiv] | - | - | - |
| 2017 | Knowledge | Crowdsourcing multiple choice science questions. | [ArXiv] | - | - | [DataSets] |
| 2018 | Knowledge | Can a suit of armor conduct electricity? a new dataset for open book question answering. | [ArXiv] | - | [Github] | - |
| 2021 | Knowledge | Measuring massive multitask language understanding. | [ICLR] | - | [Github] | [Huggingface] |
| 2023 | Knowledge | C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. | [ArXiv] | - | [Github] | - |
| 2023 | Knowledge | Cmmlu: Measuring massive multitask language understanding in chinese. | [ArXiv] | - | [Github] | - |
| 2023 | Knowledge | Measuring massive multitask chinese understanding. | [ArXiv] | - | - | - |
| 2024 | Knowledge | Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. | [ArXiv] | - | [Github] | [DataSets] |
| 2023 | Metrics | Rethinking the Evaluating Framework for Natural Language Understanding in AI Systems: Language Acquisition as a Core for Future Metrics. | [ArXiv] | - | - | - |
| Date | Task | Title | Paper | HomePage | Github | DataSets |
|---|---|---|---|---|---|---|
| 2015 | Summarization | Lcsts: A large scale chinese short text summarization dataset. | [EMNLP] | [Homepage] | - | - |
| 2019 | Summarization | Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. | [ArXiv] | - | [Github] | - |
| 2019 | Summarization | SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. | [ArXiv] | - | - | - |
| 2021 | Summarization | DialogSum: A real-life scenario dialogue summarization dataset. | [ArXiv] | - | [Github] | - |
| 2023 | Summarization | Clinical text summarization: adapting large language models can outperform human experts. | [ArXiv] | - | - | - |
| 2023 | Summarization | Embrace divergence for richer insights: A multi-document summarization benchmark and a case study on summarizing diverse information from news articles. | [ArXiv] | - | [Github] | - |
| 2024 | Summarization | Benchmarking large language models for news summarization. | [TACL] | - | - | - |
| 2013 | QA | Semantic parsing on freebase from question-answer pairs. | [EMNLP] | - | - | - |
| 2018 | QA | The web as a knowledge-base for answering complex questions. | [ArXiv] | - | - | [Datasets] |
| 2019 | QA | Natural Questions: A Benchmark for Question Answering Research. | [ACL] | [Homepage] | [Github] | - |
| 2022 | QA | MiQA: A benchmark for inference on metaphorical questions. | [ArXiv] | - | [Github] | - |
| 2023 | QA | Emotionally numb or empathetic? evaluating how llms feel using emotionbench. | [ArXiv] | - | [Github] | - |
| 2023 | QA | Evaluating open-domain question answering in the era of large language models. | [ArXiv] | - | [Github] | - |
| 2023 | QA | Scigraphqa: A large-scale synthetic multi-turn question-answering dataset for scientific graphs. | [ArXiv] | - | - | - |
| 2023 | QA | Can ChatGPT replace traditional KBQA models? An in-depth analysis of the question answering performance of the GPT LLM family. | [ISWC] | - | [Github] | - |
| 2024 | QA | Compmix: A benchmark for heterogeneous question answering. | [ACMWC] | - | - | - |
| 2024 | QA | MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues. | [ArXiv] | - | [Github] | - |
| 2024 | QA | Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. | [NeurIPS] | [Homepage] | - | - |
| 2024 | Content | Benchmarking large language models on controllable generation under diversified instructions. | [AAAI] | - | [Github] | - |
| 2023 | Graph | Evaluating generative models for graph-to-text generation. | [ArXiv] | - | [Github] | - |
| 2023 | Graph | Text2kgbench: A benchmark for ontology-driven knowledge graph generation from text. | [ArXiv] | - | [Github] | - |
| Date | Task | Title | Paper | HomePage | Github | DataSets |
|---|---|---|---|---|---|---|
| 2022 | Comprehensive | Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. | [ArXiv] | - | - | - |
| 2023 | Comprehensive | Arb: Advanced reasoning benchmark for large language models. | [ArXiv] | - | - | - |
| 2023 | Comprehensive | Nphardeval: Dynamic benchmark on reasoning ability of large language models via complexity classes. | [ArXiv] | - | [Github] | - |
| 2024 | Comprehensive | Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study. | [ArXiv] | - | - | - |
| 2012 | Commonsense | The winograd schema challenge. | [AAAI] | - | - | - |
| 2018 | Commonsense | Commonsenseqa: A question answering challenge targeting commonsense knowledge. | [ArXiv] | - | - | - |
| 2019 | Commonsense | Hellaswag: Can a machine really finish your sentence? | [ArXiv] | [Homepage] | - | - |
| 2019 | Commonsense | Socialiqa: Commonsense reasoning about social interactions. | [ArXiv] | [Homepage] | - | - |
| 2020 | Commonsense | Piqa: Reasoning about physical commonsense in natural language. | [AAAI] | - | - | - |
| 2021 | Commonsense | Winogrande: An adversarial winograd schema challenge at scale. | [CACM] | - | - | - |
| 2023 | Commonsense | Worldsense: A synthetic benchmark for grounded reasoning in large language models. | [ArXiv] | - | [Github] | - |
| 2024 | Commonsense | Corecode: A common sense annotated dialogue dataset with benchmark tasks for chinese large language models. | [AAAI] | - | [Github] | - |
| 2024 | Commonsense | Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations. | [ArXiv] | - | [Github] | - |
| 2017 | Math | Deep Neural Solver for Math Word Problems. | [EMNLP] | - | - | [DataSets] |
| 2021 | Math | Measuring Mathematical Problem Solving With the MATH Dataset. | [NeurIPS] | - | - | [DataSets] |
| 2021 | Math | Training verifiers to solve math word problems. | [NeurIPS] | - | [Github] | [DataSets] |
| 2023 | Math | Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs. | [ArXiv] | - | - | [DataSets] |
| 2023 | Math | CMATH: Can Your Language Model Pass Chinese Elementary School Math Test? | [ArXiv] | - | - | [DataSets] |
| 2023 | Math | MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. | [ArXiv] | - | [Github] | [DataSets] |
| 2023 | Math | TheoremQA: A Theorem-driven Question Answering Dataset. | [ArXiv] | - | - | - |
| 2024 | Math | GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. | [ArXiv] | - | - | - |
| 2024 | Math | MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark. | [ArXiv] | - | [Github] | - |
| 2024 | Math | Mustard: Mastering uniform synthesis of theorem and proof data. | [ArXiv] | - | [Github] | - |
| 2024 | Math | Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models. | [ArXiv] | - | [Github] | - |
| 2016 | Logic | Story cloze evaluator: Vector space representation evaluation by predicting what happens next. | [ArXiv] | - | - | - |
| 2016 | Logic | The LAMBADA dataset: Word prediction requiring a broad discourse context. | [ArXiv] | - | - | - |
| 2023 | Logic | RoCar: A Relationship Network-based Evaluation Method to Large Language Models. | [ArXiv] | - | [Github] | - |
| 2023 | Logic | Towards benchmarking and improving the temporal reasoning capability of large language models. | [ArXiv] | - | [Github] | - |
| 2023 | Logic | Towards logiglue: A brief survey and a benchmark for analyzing logical reasoning capabilities of language models. | [ArXiv] | - | - | - |
| 2022 | Causal | Wikiwhy: Answering and explaining cause-and-effect questions. | [ArXiv] | - | - | - |
| 2024 | Causal | CausalBench: A Comprehensive Benchmark for Causal Learning Capability of Large Language Models. | [ArXiv] | - | - | - |
| 2024 | Causal | Cladder: A benchmark to assess causal reasoning capabilities of language models. | [NeurIPS] | - | [Github] | [Huggingface] |
| 2023 | Step | Art: Automatic multi-step reasoning and tool-use for large language models. | [ArXiv] | - | [Github] | - |
| 2023 | Step | STEPS: A Benchmark for Order Reasoning in Sequential Tasks. | [ArXiv] | - | [Github] | - |
| 2023 | Complex | Have llms advanced enough? a challenging problem solving benchmark for large language models. | [ArXiv] | - | [Github] | - |
| 2024 | Complex | MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures. | [ArXiv] | - | [Github] | [Huggingface] |
| 2024 | Complex | Livebench: A challenging, contamination-free llm benchmark. | [ArXiv] | [Homepage] | [Github] | [Huggingface] |
| 2024 | Complex | OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI. | [ArXiv] | - | [Github] | - |
| 2024 | Complex | Evaluation of OpenAI o1: Opportunities and Challenges of AGI. | [ArXiv] | - | [Github] | - |
Think you have solved question answering? try arc, the ai2 reasoning challenge.
P Clark, I Cowhey, O Etzioni, T Khot, A Sabharwal, C Schoenick, O Tafjord.
arXiv:1803.05457, 2018.
[ArXiv]
Agieval: A human-centric benchmark for evaluating foundation models.
W Zhong, R Cui, Y Guo, Y Liang, S Lu, Y Wang, et al.
arXiv, 2023.
[ArXiv]
[Github]
Do llms understand social knowledge? evaluating the sociability of large language models with socket benchmark.
M Choi, J Pei, S Kumar, C Shu, D Jurgens.
arXiv:2305.14938, 2023.
[ArXiv]
[Github]
Eva-kellm: A new benchmark for evaluating knowledge editing of llms.
S Wu, M Peng, Y Chen, J Su, M Sun.
arXiv:2308.09954, 2023.
[ArXiv]
KoLA: Carefully Benchmarking World Knowledge of Large Language Models.
J Yu, X Wang, S Tu, S Cao, D Zhang-Li, X Lv, H Peng, Z Yao, X Zhang, H Li, C Li, Z Zhang, et al.
ArXiv, 2023.
[ArXiv]
[Homepage]
ZhuJiu: A Multi-dimensional, Multi-faceted Chinese Benchmark for Large Language Models.
B Zhang, H Xie, P Du, J Chen, P Cao, Y Chen, S Liu, K Liu, J Zhao.
arXiv:2308.14353, 2023.
[ArXiv]
[Homepage]
Xiezhi: An ever-updating benchmark for holistic domain knowledge evaluation.
Z Gu, X Zhu, H Ye, L Zhang, J Wang, Y Zhu, et al.
AAAI, 2024.
[AAAI]
[Github]
Evaluating the performance of large language models on gaokao benchmark.
X Zhang, C Li, Y Zong, Z Ying, L He, X Qiu.
ArXiv, 2023.
[ArXiv]
[Github]
M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models.
W Zhang, M Aljunied, C Gao, YK Chia, L Bing.
Advances in Neural Information Processing Systems, 2023.
[NeurIPS]
[Github]
M3ke: A massive multi-level multi-subject knowledge evaluation benchmark for chinese large language models.
C Liu, R Jin, Y Ren, L Yu, T Dong, X Peng, et al.
arXiv, 2024.
[ArXiv]
[Github]
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI.
Z Huang, Z Wang, S Xia, X Li, H Zou, R Xu, RZ Fan, L Ye, E Chern, Y Ye, Y Zhang, Y Yang, et al.
arXiv:2406.12753, 2024.
[ArXiv]
[Github]
XNLI: Evaluating cross-lingual sentence representations.
A Conneau, G Lample, R Rinott, A Williams, SR Bowman, H Schwenk, V Stoyanov.
arXiv:1809.05053, 2018.
[ArXiv]
Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization.
A Siddhant, J Hu, M Johnson, O Firat, et al.
ICML, 2020.
[ArXiv]
TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages.
JH Clark, E Choi, M Collins, D Garrette, T Kwiatkowski, V Nikolaev, J Palomaki.
Transactions of the Association for Computational Linguistics, 2020.
[ArXiv]
The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation.
N Goyal, C Gao, V Chaudhary, PJ Chen, G Wenzek, D Ju, S Krishnan, MA Ranzato, et al.
Transactions of the Association for Computational Linguistics, 2022.
[ArXiv]
Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning.
VD Lai, NT Ngo, APB Veyseh, H Man, et al.
arXiv, 2023.
[ArXiv]
Mega: Multilingual evaluation of generative ai.
K Ahuja, H Diddee, R Hada, M Ochieng, K Ramesh, P Jain, A Nambi, T Ganu, S Segal, et al.
arXiv:2303.12528, 2023.
[ArXiv]
Megaverse: Benchmarking large language models across languages, modalities, models and tasks.
S Ahuja, D Aggarwal, V Gumma, I Watts, A Sathe, M Ochieng, R Hada, P Jain, M Axmed, et al.
arXiv:2311.07463, 2023.
[ArXiv]
MELA: Multilingual Evaluation of Linguistic Acceptability.
Z Zhang, Y Liu, W Huang, J Mao, R Wang, H Hu.
arXiv:2311.09033, 2023.
[ArXiv]
[Github]
mSCAN: A Dataset for Multilingual Compositional Generalisation Evaluation.
A Reymond, S Steinert-Threlkeld.
Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP, 2023.
[Paper]
SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning.
B Wang, Z Liu, X Huang, F Jiao, Y Ding, AT Aw, NF Chen.
ArXiv, 2023.
[ArXiv]
[Github]
Evaluating the elementary multilingual capabilities of large language models with MultiQ.
C Holtermann, P Röttger, T Dill, A Lauscher.
arXiv:2403.03814, 2024.
[ArXiv]
[Github]
M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models.
H Wang, J Xu, S Xie, R Wang, J Li, Z Xie, B Zhang, C Xiong, X Chen.
arXiv:2405.15638, 2024.
[ArXiv]
[Github]
ANALOGICAL--A Novel Benchmark for Long Text Analogy Evaluation in Large Language Models.
T Wijesiriwardene, R Wickramarachchi, BG Gajera, SM Gowaikar, C Gupta, A Chadha, et al.
arXiv:2305.05050, 2023.
[ArXiv]
Bamboo: A comprehensive benchmark for evaluating long text modeling capacities of large language models.
Z Dong, T Tang, J Li, WX Zhao, JR Wen.
arXiv:2309.13345, 2023.
[ArXiv]
[Github]
L-eval: Instituting standardized evaluation for long context language models.
C An, S Gong, M Zhong, X Zhao, M Li, J Zhang, L Kong, X Qiu.
arXiv:2307.11088, 2023.
[ArXiv]
[Github]
Longbench: A bilingual, multitask benchmark for long context understandings.
Y Bai, X Lv, J Zhang, H Lyu, J Tang, Z Huang, Z Du, X Liu, A Zeng, L Hou, Y Dong, J Tang, et al.
arXiv:2308.14508, 2023.
[ArXiv]
[Github]
M4le: A multi-ability multi-range multi-task multi-domain long-context evaluation benchmark for large language models.
WC Kwan, X Zeng, Y Wang, Y Sun, L Li, L Shang, Q Liu, KF Wong.
arXiv:2310.19240, 2023.
[ArXiv]
[Github]
Zeroscrolls: A zero-shot benchmark for long text understanding.
U Shaham, M Ivgi, A Efrat, J Berant, O Levy.
arXiv:2305.14196, 2023.
[ArXiv]
[Homepage]
CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models.
Z Huang, J Li, S Huang, W Zhong, I King.
ArXiv, 2024.
[ArXiv]
[Github]
LooGLE: Can Long-Context Language Models Understand Long Contexts?
J Li, M Wang, Z Zheng, M Zhang.
arXiv:2311.04939, 2023.
[ArXiv]
[Github]
[DataSets]
Lv-eval: A balanced long-context benchmark with 5 length levels up to 256k.
T Yuan, X Ning, D Zhou, Z Yang, S Li, M Zhuang, Z Tan, Z Yao, D Lin, B Li, G Dai, S Yan, et al.
arXiv:2402.05136, 2024.
[ArXiv]
[Github]
Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance.
Y Fu, L Ou, M Chen, Y Wan, H Peng, T Khot.
ArXiv, 2023.
[ArXiv]
[Github]
Cue-CoT: Chain-of-thought prompting for responding to in-depth dialogue questions with LLMs.
H Wang, R Wang, F Mi, Y Deng, Z Wang, B Liang, R Xu, KF Wong.
arXiv:2305.11792, 2023.
[ArXiv]
[Github]
Charactereval: A chinese benchmark for role-playing conversational agent evaluation.
Q Tu, S Fan, Z Tian, R Yan.
ArXiv, 2024.
[ArXiv]
[Github]
Roleeval: A bilingual role evaluation benchmark for large language models.
T Shen, S Li, D Xiong.
arXiv:2312.16132, 2023.
[ArXiv]
[Github]
Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models.
ZM Wang, Z Peng, H Que, J Liu, W Zhou, Y Wu, et al.
ArXiv, 2023.
[ArXiv]
[Github]
Api-bank: A comprehensive benchmark for tool-augmented llms.
M Li, Y Zhao, B Yu, F Song, H Li, H Yu, Z Li, et al.
arXiv, 2023.
[ArXiv]
[Github]
Metatool benchmark for large language models: Deciding whether to use tools and which to use.
Y Huang, J Shi, Y Li, C Fan, S Wu, Q Zhang, Y Liu, P Zhou, Y Wan, NZ Gong, L Sun.
arXiv:2310.03128, 2023.
[ArXiv]
[Github]
Mint: Evaluating llms in multi-turn interaction with tools and language feedback.
X Wang, Z Wang, J Liu, Y Chen, L Yuan, H Peng, H Ji.
ArXiv, 2023.
[ArXiv]
[Github]
On the tool manipulation capability of open-source large language models.
Q Xu, F Hong, B Li, C Hu, Z Chen, J Zhang.
arXiv:2305.16504, 2023.
[ArXiv]
[Github]
T-eval: Evaluating the tool utilization capability step by step.
Z Chen, W Du, W Zhang, K Liu, J Liu, M Zheng, J Zhuo, S Zhang, D Lin, K Chen, F Zhao.
arXiv:2312.14033, 2023.
[ArXiv]
[Github]
Toolqa: A dataset for llm question answering with external tools.
Y Zhuang, Y Yu, K Wang, H Sun, C Zhang.
Advances in Neural Information Processing Systems, 2024.
[NeurIPS]
[Github]
Followbench: A multi-level fine-grained constraints following benchmark for large language models.
Y Jiang, Y Wang, X Zeng, W Zhong, L Li, F Mi, L Shang, X Jiang, Q Liu, W Wang.
ArXiv, 2023.
[ArXiv]
[Github]
Instructeval: Towards holistic evaluation of instruction-tuned large language models.
YK Chia, P Hong, L Bing, S Poria.
arXiv:2306.04757, 2023.
[ArXiv]
[Github]
[DataSets]
Instruction-following evaluation for large language models.
J Zhou, T Lu, S Mishra, S Brahma, S Basu, Y Luan, D Zhou, L Hou.
arXiv:2311.07911, 2023.
[ArXiv]
[Github]
[DataSets]
Benchmarking complex instruction-following with multiple constraints composition.
B Wen, P Ke, X Gu, L Wu, H Huang, J Zhou, W Li, B Hu, W Gao, J Xu, Y Liu, J Tang, H Wang, et al.
arXiv:2407.03978, 2024.
[ArXiv]
[Github]
Cfbench: A comprehensive constraints-following benchmark for llms.
T Zhang, Y Shen, W Luo, Y Zhang, H Liang, F Yang, M Lin, Y Qiao, W Chen, B Cui, W Zhang, et al.
arXiv:2408.01122, 2024.
[ArXiv]
[Github]
Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models.
H Sun, L Liu, J Li, F Wang, B Dong, R Lin, R Huang.
arXiv:2404.02823, 2024.
[ArXiv]
[Github]
Evaluation of Instruction-Following Ability for Large Language Models on Story-Ending Generation.
R Hida, J Ohmura, T Sekiya.
arXiv:2406.16356, 2024.
[ArXiv]
From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models.
Q He, J Zeng, Q He, J Liang, Y Xiao.
arXiv:2404.15846, 2024.
[ArXiv]
[Github]
InFoBench: Evaluating Instruction Following Ability in Large Language Models.
Y Qin, K Song, Y Hu, W Yao, S Cho, X Wang, X Wu, F Liu, P Liu, D Yu.
arXiv:2401.03601, 2024.
[ArXiv]
[Github]
INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models.
H Oh, H Lee, S Ye, H Shin, H Jang, C Jun, M Seo.
arXiv:2402.14334, 2024.
[ArXiv]
[Github]
SysBench: Can Large Language Models Follow System Messages?
Y Qin, T Zhang, Y Shen, W Luo, H Sun, Y Zhang, Y Qiao, W Chen, Z Zhou, W Zhang, B Cui.
arXiv:2408.10943, 2024.
[ArXiv]
[Github]
| Date | Task | Title | Paper | HomePage | Github | DataSets |
|---|---|---|---|---|---|---|
| 2022 | Hallucination | Truthfulqa: Measuring how models mimic human falsehoods. | [ArXiv] | - | [Github] | - |
| 2023 | Hallucination | Autohall: Automated hallucination dataset generation for large language models. | [ArXiv] | - | - | - |
| 2023 | Hallucination | Evaluating hallucinations in chinese large language models. | [ArXiv] | - | [Github] | - |
| 2023 | Hallucination | HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models. | [ArXiv] | - | - | [DataSets] |
| 2023 | Hallucination | Halo: Estimation and reduction of hallucinations in open-source weak large language models. | [ArXiv] | - | [Github] | - |
| 2023 | Hallucination | Halueval: A large-scale hallucination evaluation benchmark for large language models. | [ArXiv] | - | [Github] | - |
| 2023 | Hallucination | Med-halt: Medical domain hallucination test for large language models. | [ArXiv] | - | [Github] | - |
| 2023 | Hallucination | Uhgeval: Benchmarking the hallucination of chinese large language models via unconstrained generation. | [ArXiv] | - | [Github] | - |
| 2024 | Hallucination | DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models. | [ArXiv] | - | [Github] | - |
| 2024 | Hallucination | Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models. | [ArXiv] | - | - | - |
| 2024 | Hallucination | HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation. | [ArXiv] | - | [Github] | - |
| 2024 | Hallucination | HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild. | [ArXiv] | - | [Github] | - |
| 2024 | Factuality | [Simple-evals] Measuring short-form factuality in large language models. | [Paper] | - | [Github] | - |
RobustQA: Benchmarking the robustness of domain adaptation for open-domain question answering.
R Han, P Qi, Y Zhang, L Liu, J Burger, WY Wang, Z Huang, B Xiang, D Roth.
Findings of the Association for Computational Linguistics: ACL 2023, 2023.
[ArXiv]
[Github]
Are Large Language Models Really Robust to Word-Level Perturbations?
H Wang, G Ma, C Yu, N Gui, L Zhang, Z Huang, S Ma, Y Chang, S Zhang, L Shen, X Wang, et al.
arXiv:2309.11166, 2023.
[ArXiv]
[Github]
Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility.
W Ye, M Ou, T Li, X Ma, Y Yanggong, S Wu, J Fu, G Chen, H Wang, J Zhao.
arXiv:2305.10235, 2023.
[ArXiv]
[Github]
Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection.
Z Li, et al.
arXiv:2308.10819v2, 2023.
[ArXiv]
Intuitive or Dependent? Investigating LLMs' Robustness to Conflicting Prompts.
J Ying, Y Cao, K Xiong, Y He, L Cui, Y Liu.
arXiv:2309.17415, 2023.
[ArXiv]
Promptbench: Towards evaluating the robustness of large language models on adversarial prompts.
K Zhu, J Wang, J Zhou, Z Wang, H Chen, Y Wang, L Yang, W Ye, Y Zhang, NZ Gong, X Xie.
ArXiv, 2023.
[ArXiv]
[Github]
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting.
M Sclar, Y Choi, Y Tsvetkov, A Suhr.
arXiv:2310.11324, 2023.
[ArXiv]
[Github]
Robustness Over Time: Understanding Adversarial Examples' Effectiveness on Longitudinal Versions of Large Language Models.
Y Liu, T Cong, Z Zhao, M Backes, Y Shen, Y Zhang.
arXiv:2308.07847, 2023.
[ArXiv]
Robut: A systematic study of table qa robustness against human-annotated adversarial perturbations.
Y Zhao, C Zhao, L Nan, Z Qi, W Zhang, X Tang, B Mi, D Radev.
arXiv:2306.14321, 2023.
[ArXiv]
[Github]
Revisit input perturbation problems for llms: A unified robustness evaluation framework for noisy slot filling task.
G Dong, J Zhao, T Hui, D Guo, W Wang, B Feng, Y Qiu, Z Gongque, K He, Z Wang, W Xu.
CCF International Conference on Natural Language Processing and Chinese Computing, 2023.
[ArXiv]
[Github]
GAIA: a benchmark for General AI Assistants.
G Mialon, C Fourrier, C Swift, T Wolf, Y LeCun, T Scialom.
ArXiv, 2023.
[ArXiv]
[Datasets]
An empirical study on large language models in accuracy and robustness under chinese industrial scenarios.
Z Li, W Qiu, P Ma, Y Li, Y Li, S He, B Jiang, S Wang, W Gu.
arXiv:2402.01723, 2024.
[ArXiv]
What is the best model? Application-driven Evaluation for Large Language Models.
S Lian, K Zhao, X Liu, X Lei, B Yang, W Zhang, K Wang, Z Liu.
arXiv:2406.10307, 2024.
[ArXiv]
[Datasets]
Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems.
SE Finch, JD Finch, JD Choi.
arXiv:2212.09180, 2022.
[ArXiv]
[Github]
Benchmarking LLM powered chatbots: methods and metrics.
D Banerjee, P Singh, A Avadhanam, S Srivastava.
arXiv:2308.04624, 2023.
[ArXiv]
Benchmarking, ethical alignment, and evaluation framework for conversational AI: Advancing responsible development of ChatGPT.
PP Ray.
BenchCouncil Transactions on Benchmarks, Standards and Evaluations, 2023.
[Paper]
BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues.
H Duan, J Wei, C Wang, H Liu, Y Fang, S Zhang, D Lin, K Chen.
ArXiv, 2023.
[ArXiv]
[Github]
DialogBench: Evaluating LLMs as Human-like Dialogue Systems.
J Ou, J Lu, C Liu, Y Tang, F Zhang, D Zhang, Z Wang, K Gai.
ArXiv, 2023.
[ArXiv]
Lmsys-chat-1m: A large-scale real-world llm conversation dataset.
L Zheng, et al.
ArXiv, 2023.
[ArXiv]
[DataSets]
ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark.
H Wakaki, Y Mitsufuji, Y Maeda, Y Nishimura, S Gao, M Zhao, K Yamada, A Bosselut.
arXiv:2406.11228, 2024.
[ArXiv]
DialSim: A Real-Time Simulator for Evaluating Long-Term Dialogue Understanding of Conversational Agents.
J Kim, W Chay, H Hwang, D Kyung, H Chung, E Cho, Y Jo, E Choi.
arXiv:2406.13144, 2024.
[ArXiv]
[Github]
SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words.
J Ao, Y Wang, X Tian, D Chen, J Zhang, L Lu, Y Wang, H Li, Z Wu.
arXiv:2406.13340, 2024.
[ArXiv]
[Github]
Docmath-eval: Evaluating numerical reasoning capabilities of llms in understanding long documents with tabular data.
Y Zhao, Y Long, H Liu, L Nan, L Chen, R Kamoi, Y Liu, X Tang, R Zhang, A Cohan.
arXiv:2311.09805, 2023.
[ArXiv]
[Github]
Evaluating LLMs on document-based QA: Exact answer selection and numerical extraction using CogTale dataset.
Z Rasool, S Kurniawan, S Balugo, S Barnett, et al.
Natural Language Processing Journal, 2024.
[Paper]
Kitab: Evaluating llms on constraint satisfaction for information retrieval.
MI Abdin, S Gunasekar, V Chandrasekaran, J Li, M Yuksekgonul, RG Peshawaria, R Naik, et al.
arXiv:2310.15511, 2023.
[ArXiv]
[Huggingface]
Ragas: Automated evaluation of retrieval augmented generation.
S Es, J James, L Espinosa-Anke, S Schockaert.
arXiv:2309.15217, 2023.
[Paper]
[Github]
Benchmarking Large Language Models in Retrieval-Augmented Generation.
J Chen, H Lin, X Han, L Sun.
AAAI, 2024.
[Paper]
[Github]
Ares: An automated evaluation framework for retrieval-augmented generation systems.
J Saad-Falcon, O Khattab, C Potts, M Zaharia.
arXiv:2311.09476, 2023.
[Paper]
[Github]
CRAG--Comprehensive RAG Benchmark.
X Yang, K Sun, H Xin, Y Sun, N Bhalla, X Chen.
arXiv, 2024.
[Paper]
[Github]
Crud-rag: A comprehensive chinese benchmark for retrieval-augmented generation of large language models.
Y Lyu, Z Li, S Niu, F Xiong, B Tang, W Wang, H Wu, H Liu, T Xu, E Chen.
arXiv:2401.17043, 2024.
[Paper]
Chartqa: A benchmark for question answering about charts with visual and logical reasoning.
A Masry, DX Long, JQ Tan, S Joty, E Hoque.
arXiv:2203.10244, 2022.
[ArXiv]
[Github]
QTSumm: Query-focused summarization over tabular data.
Y Zhao, Z Qi, L Nan, B Mi, Y Liu, W Zou, S Han, R Chen, X Tang, Y Xu, D Radev, A Cohan.
arXiv:2305.14303, 2023.
[ArXiv]
[Github]
TableQAKit: A Comprehensive and Practical Toolkit for Table-based Question Answering.
F Lei, T Luo, P Yang, W Liu, H Liu, J Lei, Y Huang, Y Wei, S He, J Zhao, K Liu.
arXiv:2310.15075, 2023.
[ArXiv]
[Github]
Datatales: Investigating the use of large language models for authoring data-driven articles.
N Sultanum, A Srinivasan.
IEEE Visualization and Visual Analytics (VIS), 2023.
[ArXiv]
[Github]
Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation.
Z Kasner, O Dušek.
ACL, 2024.
[ACL]
[Github]
Are llms capable of data-based statistical and causal reasoning? benchmarking advanced quantitative reasoning with data.
X Liu, Z Wu, X Wu, P Lu, KW Chang, Y Feng.
arXiv:2402.17644, 2024.
[ArXiv]
[Github]
BIBench: Benchmarking Data Analysis Knowledge of Large Language Models.
S Liu, S Zhao, C Jia, X Zhuang, Z Long, M Lan.
arXiv:2401.02982, 2024.
[ArXiv]
[Github]
Chartbench: A benchmark for complex visual reasoning in charts.
Z Xu, S Du, Y Qi, C Xu, C Yuan, J Guo.
arXiv:2312.15915, 2023.
[ArXiv]
[Github]
Infiagent-dabench: Evaluating agents on data analysis tasks.
X Hu, Z Zhao, S Wei, Z Chai, G Wang, X Wang, J Su, J Xu, M Zhu, Y Cheng, J Yuan, et al.
arXiv:2401.05507, 2024.
[ArXiv]
Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents.
J Li, N Huo, Y Gao, J Shi, Y Zhao, G Qu, Y Wu, C Ma, JG Lou, R Cheng.
ArXiv, 2024.
[ArXiv]
[Github]
Viseval: A benchmark for data visualization in the era of large language models.
N Chen, Y Zhang, J Xu, K Ren, Y Yang.
IEEE Transactions on Visualization and Computer Graphics, 2024.
[ArXiv]
Table meets llm: Can large language models understand structured table data? a benchmark and empirical study.
Y Sui, M Zhou, M Zhou, S Han, D Zhang.
WSDM, 2024.
[ArXiv]
| Date | Task | Title | Paper | HomePage | Github | DataSets |
|---|---|---|---|---|---|---|
| 2021 | Software | [Codexglue] Codexglue: A machine learning benchmark dataset for code understanding and generation. | [ArXiv] | - | [Github] | - |
| 2021 | Software | [HumanEval] Evaluating large language models trained on code. | [ArXiv] | - | [Github] | - |
| 2021 | Software | [APPS] Measuring coding challenge competence with apps. | [ArXiv] | - | [Github] | - |
| 2021 | Software | [MBPP] Program synthesis with large language models. | [ArXiv] | - | [Github] | - |
| 2023 | Software | [ClassEval] Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. | [ArXiv] | - | [Github] | - |
| 2023 | Software | [Codescope] Codescope: An execution-based multilingual multitask multidimensional benchmark for evaluating llms on code understanding and generation. | [ArXiv] | - | [Github] | - |
| 2023 | Software | [StudentEval] StudentEval: a benchmark of student-written prompts for large language models of code. | [ArXiv] | - | - | - |
| 2023 | Software | Testing LLMs on Code Generation with Varying Levels of Prompt Specificity. | [ArXiv] | - | [Github] | - |
| 2023 | Software | Text-to-sql empowered by large language models: A benchmark evaluation. | [ArXiv] | - | [Github] | - |
| 2024 | Software | Competition-Level Problems are Effective LLM Evaluators. | [ACL] | [Homepage] | - | - |
| 2024 | Software | Benchmarking the text-to-sql capability of large language models: A comprehensive evaluation. | [ArXiv] | - | - | - |
| 2024 | Software | Livecodebench: Holistic and contamination free evaluation of large language models for code. | [ArXiv] | - | [Github] | - |
| 2024 | Software | Codereval: A benchmark of pragmatic code generation with generative pre-trained models. | [ICSE] | - | - | - |
Pptc benchmark: Evaluating large language models for powerpoint task completion.
Y Guo, Z Zhang, Y Liang, D Zhao, N Duan.
ArXiv, 2023.
[ArXiv]
[Github]
PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion.
Z Zhang, Y Guo, Y Liang, D Zhao, N Duan.
ArXiv, 2024.
[ArXiv]
[Github]
KIWI: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions.
F Xu, K Lo, L Soldaini, B Kuehl, E Choi, D Wadden.
ArXiv, 2024.
[ArXiv]
[Homepage]
Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning about Change).
K Valmeekam, A Olmo, S Sreedharan, S Kambhampati.
arXiv:2206.10498, 2022.
On the planning abilities of large language models (a critical investigation with a proposed benchmark).
K Valmeekam, S Sreedharan, M Marquez, A Olmo, S Kambhampati.
arXiv:2302.06706, 2023.
On the Planning Abilities of Large Language Models--A Critical Investigation.
K Valmeekam, M Marquez, S Sreedharan, S Kambhampati.
Thirty-seventh Conference on Neural Information Processing Systems, 2023.
PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change.
K Valmeekam, M Marquez, A Olmo, S Sreedharan, S Kambhampati.
Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench.
K Valmeekam, K Stechly, S Kambhampati.
arXiv:2409.13373, 2024.
Agentsims: An open-source sandbox for large language model evaluation.
J Lin, H Zhao, A Zhang, Y Wu, H Ping, Q Chen.
arXiv:2308.04026, 2023.
[ArXiv]
[Homepage]
Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents.
Z Liu, W Yao, J Zhang, L Xue, S Heinecke, R Murthy, Y Feng, Z Chen, JC Niebles, D Arpit, et al.
arXiv:2308.05960, 2023.
[ArXiv]
[Homepage]
Smartplay: A benchmark for llms as intelligent agents.
Y Wu, X Tang, TM Mitchell, Y Li.
arXiv:2310.01557, 2023.
[ArXiv]
[Homepage]
Agentbench: Evaluating llms as agents.
X Liu, H Yu, H Zhang, Y Xu, X Lei, H Lai, Y Gu, H Ding, K Men, K Yang, S Zhang, X Deng, et al.
ICLR, 2024.
[ICLR]
[Homepage]
Webarena: A realistic web environment for building autonomous agents.
S Zhou, FF Xu, H Zhu, X Zhou, R Lo, A Sridhar, et al.
arXiv, 2023.
[ArXiv]
[Homepage]
Artificial-General-Intelligence-Testing-Resources.
Resources for AGI & Embodied AI (EAI) Testing.
[Github]
Fft: Towards harmlessness evaluation and analysis for llms with factuality, fairness, toxicity.
S Cui, Z Zhang, Y Chen, W Zhang, T Liu, S Wang, T Liu.
arXiv:2311.18580, 2023.
[ArXiv]
[Github]
Safety assessment of chinese large language models.
H Sun, Z Zhang, J Deng, J Cheng, M Huang.
arXiv:2304.10436, 2023.
[ArXiv]
[Github]
Safetybench: Evaluating the safety of large language models with multiple choice questions.
Z Zhang, L Lei, L Wu, R Sun, Y Huang, C Long, et al.
arXiv, 2023.
[ArXiv]
[Github]
Sc-safety: A multi-round open-ended question adversarial safety benchmark for large language models in chinese.
L Xu, K Zhao, L Zhu, H Xue.
arXiv:2310.05818, 2023.
[ArXiv]
[Github]
Trustgpt: A benchmark for trustworthy and responsible large language models.
Y Huang, Q Zhang, L Sun.
arXiv:2306.11507, 2023.
[ArXiv]
[Github]
Trustworthy llms: a survey and guideline for evaluating large language models' alignment.
Y Liu, Y Yao, JF Ton, X Zhang, R Guo, H Cheng, Y Klochkov, MF Taufiq, H L.
arXiv:2308.05374, 2023.
[ArXiv]
[Github]
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models.
B Wang, W Chen, H Pei, C Xie, M Kang, C Zhang, C Xu, Z Xiong, R Dutta, R Schaeffer, et al.
NeurIPS, 2023.
[ArXiv]
[Github]
Towards ai safety: A taxonomy for ai system evaluation.
B Xia, Q Lu, L Zhu, Z Xing.
arXiv:2404.05388, 2024.
[ArXiv]
Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection.
T Hartvigsen, S Gabriel, H Palangi, M Sap, D Ray, E Kamar.
arXiv:2203.09509, 2022.
[ArXiv]
[Github]
A chinese prompt attack dataset for llms with evil content.
C Liu, F Zhao, L Qing, Y Kang, C Sun, K Kuang, F Wu.
arXiv:2309.11830, 2023.
[ArXiv]
[Github]
Control risk for potential misuse of artificial intelligence in science.
J He, W Feng, Y Min, J Yi, K Tang, S Li, J Zhang, K Chen, W Zhou, X Xie, W Zhang, N Yu, et al.
arXiv:2312.06632, 2023.
[ArXiv]
[Github]
Do-not-answer: A dataset for evaluating safeguards in llms.
Y Wang, H Li, X Han, P Nakov, T Baldwin.
arXiv:2308.13387, 2023.
[ArXiv]
[Github]
Examining user-friendly and open-sourced large gpt models: A survey on language, multimodal, and scientific gpt models.
K Gao, S He, Z He, J Lin, QZ Pei, J Shao, W Zhang.
arXiv:2308.14149, 2023.
[ArXiv]
[Github]
Xstest: A test suite for identifying exaggerated safety behaviours in large language models.
P Röttger, HR Kirk, B Vidgen, G Attanasio, F Bianchi, D Hovy.
arXiv:2308.01263, 2023.
[ArXiv]
[Github]
JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models.
M Zhang, X Pan, M Yang.
ArXiv, 2023.
[ArXiv]
[Github]
CARE-MI: chinese benchmark for misinformation evaluation in maternity and infant care.
T Xiang, L Li, W Li, M Bai, L Wei, B Wang, N Garcia.
Advances in Neural Information Processing Systems, 2023.
[ArXiv]
[Github]
CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain.
X Tong, B Jin, Z Lin, B Wang, T Yu.
arXiv:2402.07234, 2024.
[ArXiv]
A benchmark for understanding dialogue safety in mental health support.
H Qiu, T Zhao, A Li, S Zhang, H He, Z Lan.
CCF International Conference on Natural Language Processing and Chinese Computing (NLPCC), 2023.
[ArXiv]
[Github]
Cosafe: Evaluating large language model safety in multi-turn dialogue coreference.
E Yu, J Li, M Liao, S Wang, Z Gao, F Mi, L Hong.
arXiv:2406.17626, 2024.
[ArXiv]
[Github]
Latent jailbreak: A benchmark for evaluating text safety and output robustness of large language models.
H Qiu, S Zhang, A Li, H He, Z Lan.
arXiv:2307.08487, 2023.
[ArXiv]
[Github]
Multilingual jailbreak challenges in large language models.
Y Deng, W Zhang, SJ Pan, L Bing.
arXiv:2310.06474, 2023.
[ArXiv]
[Github]
Red teaming chatgpt via jailbreaking: Bias, robustness, reliability and toxicity.
TY Zhuo, Y Huang, C Chen, Z Xing.
arXiv:2301.12867, 2023.
[ArXiv]
Jailbreakbench: An open robustness benchmark for jailbreaking large language models.
P Chao, E Debenedetti, A Robey, M Andriushchenko, F Croce, V Sehwag, E Dobriban, et al.
arXiv:2404.01318, 2024.
[ArXiv]
[Github]
Cvalues: Measuring the values of chinese large language models from safety to responsibility.
G Xu, J Liu, M Yan, H Xu, J Si, Z Zhou, P Yi, X Gao, J Sang, R Zhang, J Zhang, C Peng, et al.
ArXiv, 2023.
[ArXiv]
[Github]
Flames: Benchmarking value alignment of chinese large language models.
K Huang, X Liu, Q Guo, T Sun, J Sun, Y Wang, Z Zhou, Y Wang, Y Teng, X Qiu, Y Wang, et al.
arXiv:2311.06899, 2023.
[ArXiv]
[Github]
CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models.
L Yu, Y Leng, Y Huang, S Wu, H Liu, X Ji, et al.
ArXiv, 2024.
[ArXiv]
[Github]
Localvaluebench: A collaboratively built and extensible benchmark for evaluating localized value alignment and ethical safety in large language models.
GI Meadows, NWL Lau, EA Susanto, CL Yu, et al.
ArXiv, 2024.
[ArXiv]
CrowS-pairs: A challenge dataset for measuring social biases in masked language models.
N Nangia, C Vania, R Bhalerao, SR Bowman.
arXiv, 2020.
[ArXiv]
[Github]
Bold: Dataset and metrics for measuring biases in open-ended language generation.
J Dhamala, T Sun, V Kumar, S Krishna, Y Pruksachatkun, KW Chang, R Gupta.
FAccT, 2021.
[ArXiv]
[Github]
BBQ: A hand-built bias benchmark for question answering.
A Parrish, A Chen, N Nangia, V Padmakumar, J Phang, J Thompson, PM Htut, SR Bowman.
ACL, 2022.
[ArXiv]
[Github]
CBBQ: A chinese bias benchmark dataset curated with human-ai collaboration for large language models.
Y Huang, D Xiong.
arXiv:2306.16244, 2023.
[ArXiv]
[Github]
Evaluating and mitigating discrimination in language model decisions.
A Tamkin, A Askell, L Lovitt, E Durmus, N Joseph, S Kravec, K Nguyen, J Kaplan, D Ganguli.
arXiv:2312.03689, 2023.
[ArXiv]
[Github]
Winoqueer: A community-in-the-loop benchmark for anti-lgbtq+ bias in large language models.
VK Felkner, HCH Chang, E Jang, J May.
arXiv:2306.15087, 2023.
[ArXiv]
[Github]
A comparative analysis to evaluate bias and fairness across large language models with benchmarks.
MY Chan, SM Wong.
arXiv, 2024.
[ArXiv]
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents.
T Yuan, Z He, L Dong, Y Wang, R Zhao, T Xia, L Xu, B Zhou, F Li, Z Zhang, R Wang, G Liu.
ArXiv, 2024.
[ArXiv]
[Github]
I Think, Therefore I am: Awareness in Large Language Models.
Y Li, Y Huang, Y Lin, S Wu, Y Wan, L Sun.
ArXiv, 2024.
[ArXiv]
[Github]
Can llms keep a secret? testing privacy implications of language models via contextual integrity theory.
N Mireshghallah, H Kim, X Zhou, Y Tsvetkov, M Sap, R Shokri, Y Choi.
ArXiv, 2023.
[ArXiv]
[Github]
Llm-pbe: Assessing data privacy in large language models.
Q Li, J Hong, C Xie, J Tan, R Xin, J Hou, X Yin, et al.
ArXiv, 2024.
[ArXiv]
[Github]
BBT-Fin: Comprehensive Construction of Chinese Financial Domain Pre-trained Language Model, Corpus and Benchmark.
Dakuan Lu, Hengkui Wu, Jiaqing Liang, Yipei Xu, Qianyu He, Yipeng Geng, Mengkun Han, Yingsi Xin, Yanghua Xiao.
ArXiv, 2023.
[ArXiv]
[Github]
CFBenchmark: Chinese financial assistant benchmark for large language model.
Y Lei, J Li, M Jiang, J Hu, D Cheng, Z Ding, C Jiang.
arXiv:2311.05812, 2023.
[ArXiv]
[Github]
FinanceBench: A New Benchmark for Financial Question Answering.
Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, Bertie Vidgen.
ArXiv, 2023.
[ArXiv]
[Github]
FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models.
Liwen Zhang, Weige Cai, Zhaowei Liu, Zhi Yang, Wei Dai, Yujie Liao, Qianru Qin, Yifei Li, Xingyu Liu, Zhiqiang Liu, Zhoufan Zhu, Anbo Wu, Xin Guo, Yun Chen.
ArXiv, 2023.
[ArXiv]
[Github]
[Datasets]
FinGPT: Open-Source Financial Large Language Models.
Hongyang Yang, Xiao-Yang Liu, Christina Dan Wang.
ArXiv, 2023.
[ArXiv]
[Github]
[Datasets]
PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance.
Q Xie, W Han, X Zhang, Y Lai, M Peng, A Lopez-Lira, J Huang.
ArXiv, 2023.
[ArXiv]
[Github]
[Datasets]
WHEN FLUE MEETS FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain.
Raj Sanjay Shah, Kunal Chawla, Dheeraj Eidnani, Agam Shah, Wendi Du, Sudheer Chava, Natraj Raman, Charese Smiley, Jiaao Chen, Diyi Yang.
ArXiv, 2022.
[ArXiv]
[Github]
[Datasets]
PubMedQA: A Dataset for Biomedical Research Question Answering.
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, Xinghua Lu.
EMNLP, 2019.
[ArXiv]
[Github]
What disease does this patient have? a large-scale open domain question answering dataset from medical exams.
Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, Peter Szolovits.
Applied Sciences, 2021.
[ArXiv]
[Github]
[Datasets]
MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering.
Ankit Pal, Logesh Kumar Umapathi, Malaikannan Sankarasubbu.
PMLR, 2022.
[ArXiv]
[Github]
[Datasets]
Benchmarking Large Language Models on CMExam - A comprehensive Chinese Medical Exam Dataset.
Junling Liu, Peilin Zhou, Yining Hua, Dading Chong, Zhongyu Tian, Andrew Liu, Helin Wang, Chenyu You, Zhenhua Guo, Lei Zhu, Michael Lingzhi Li.
NeurIPS, 2023.
[Paper]
[Github]
ExplainCPE: A Free-text Explanation Benchmark of Chinese Pharmacist Examination.
Dongfang Li, Jindi Yu, Baotian Hu, Zhenran Xu, Min Zhang.
EMNLP, 2023.
[ArXiv]
[Github]
CMB: A Comprehensive Medical Benchmark in Chinese.
Xidong Wang, Guiming Hardy Chen, Dingjie Song, Zhiyi Zhang, Zhihong Chen, Qingying Xiao, Feng Jiang, Jianquan Li, Xiang Wan, Benyou Wang, Haizhou Li.
ArXiv, 2023.
[ArXiv]
[Datasets]
MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large Language Models in Medicine.
Jie Xu, Lu Lu, Sen Yang, Bilin Liang, Xinwei Peng, Jiali Pang, Jinru Ding, Xiaoming Shi, Lingrui Yang, Huan Song, Kang Li, Xin Sun, Shaoting Zhang.
ArXiv, 2023.
[ArXiv]
PromptCBLUE: A Chinese Prompt Tuning Benchmark for the Medical Domain.
Wei Zhu, Xiaoling Wang, Huanran Zheng, Mosha Chen, Buzhou Tang.
ArXiv, 2023.
[ArXiv]
[Datasets]
Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench.
Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu.
ArXiv, 2023.
[ArXiv]
[Github]
Large Language Models Encode Clinical Knowledge.
Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, et al.
Nature, 2023.
[HomePage]
MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models.
Cai, Y., Wang, L., Wang, Y., de Melo, G., Zhang, Y., Wang, Y., & He, L.
AAAI, 2024.
[Paper]
[Github]
Evaluation of ChatGPT-generated medical responses: a systematic review and meta-analysis.
Q Wei, Z Yao, Y Cui, B Wei, Z Jin, X Xu.
Journal of Biomedical Informatics, 2024.
[Paper]
Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI.
M Abbasian, E Khatibi, I Azimi, D Oniani, Z Shakeri Hossein Abad, A Thieme, R Sriram, et al.
NPJ Digital Medicine, 2024.
[Paper]
JEC-QA: A Legal-Domain Question Answering Dataset.
Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, Maosong Sun.
ArXiv, 2023.
[Paper]
CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review.
Dan Hendrycks, Collin Burns, Anya Chen, Spencer Ball.
ArXiv, 2021.
[ArXiv]
[Github]
[Datasets]
LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning.
Neel Guha, Daniel E. Ho, Julian Nyarko, Christopher Ré.
ArXiv, 2022.
[ArXiv]
[Github]
LAiW: A Chinese Legal Large Language Models Benchmark (A Technical Report).
Yongfu Dai, Duanyu Feng, Jimin Huang, Haochen Jia, Qianqian Xie, Yifang Zhang, Weiguang Han, Wei Tian, Hao Wang.
ArXiv, 2023.
[ArXiv]
[Github]
LawBench: Benchmarking Legal Knowledge of Large Language Models.
Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai Chen, Zongwen Shen, Jidong Ge.
ArXiv, 2023.
[ArXiv]
[Github]
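An Analysis of Evaluation Framework Roadmaps for Judicial Large Language Models (司法大语言模型评估框架路线分析).
李海涛, 艾清遥, 吴玥悦, 刘奕群.
CAAI, 2023.
Evaluation Metrics and Assessment Methods for Legal Large Language Models (法律大模型评估指标和测评方法).
许建峰, 刘程远, 况琨, 何浩, 孙常龙, 李宝善, 魏斌, 杨力, 金耀辉, 吴飞.
Chinese Association for Artificial Intelligence (CAAI), 2024.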
[Paper]
| Date | Task | Title | Paper | HomePage | Github | DataSets |
|---|---|---|---|---|---|---|
| 2023 | Software | Empower large language model to perform better on industrial domain-specific question answering. | [ArXiv] | - | [Github] | - |
| 2023 | Software | Exploring the effectiveness of llms in automated logging generation: An empirical study. | [ArXiv] | - | [Github] | - |
| 2023 | Software | OpsEval: A Comprehensive Task-Oriented AIOps Benchmark for Large Language Models. | [ArXiv] | - | [Github] | - |
| 2024 | Software | CloudEval-YAML: A Practical Benchmark for Cloud Native YAML Configuration Generation. | [MLSys] | - | [Github] | - |
| 2024 | Software | CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery. | [ArXiv] | - | [Github] | - |
Curriculum-Driven Edubot: A Framework for Developing Language Learning Chatbots Through Synthesizing Conversational Data.
Y Li, S Qu, J Shen, S Min, Z Yu.
arXiv:2309.16804, 2023.
[Paper]
CK12: A Rounded K12 Knowledge Graph Based Benchmark for Chinese Holistic Cognition Evaluation.
W You, P Wang, C Li, Z Ji, J Bai.
Proceedings of the AAAI Conference on Artificial Intelligence, 2024.
[Paper]
[Github]
Adapting large language models for education: Foundational capabilities, potentials, and challenges.
Q Li, L Fu, W Zhang, X Chen, J Yu, W Xia, W Zhang, R Tang, Y Yu.
arXiv:2401.08664, 2024.
[Paper]
E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models.
J Hou, C Ao, H Wu, X Kong, Z Zheng, D Tang, C Li, X Hu, R Xu, S Ni, M Yang.
arXiv:2401.15927, 2024.
[ArXiv]
[Github]
Large language models for education: A survey and outlook.
S Wang, T Xu, H Li, C Zhang, J Liang, J Tang, PS Yu, Q Wen.
arXiv:2403.18105, 2024.
[ArXiv]
[Github]
| Date | Task | Title | Paper | HomePage | Github | DataSets |
|---|---|---|---|---|---|---|
| 2023 | Comprehensive | Benchmarking large language models as ai research agents. | [ArXiv] | - | [Github] | - |
| 2023 | Comprehensive | GPT vs Human for Scientific Reviews: A Dual Source Review on Applications of ChatGPT in Science. | [ArXiv] | - | - | - |
| 2023 | Comprehensive | LLMs for science: Usage for code generation and data analysis. | [JSEP] | - | - | - |
| 2023 | Comprehensive | MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation. | [ArXiv] | - | [Github] | - |
| 2023 | Comprehensive | Scibench: Evaluating college-level scientific problem-solving abilities of large language models. | [ArXiv] | - | [Github] | - |
| 2023 | Comprehensive | Scieval: A multi-level large language model evaluation benchmark for scientific research. | [AAAI] | - | [Github] | - |
| 2023 | Comprehensive | The sciqa scientific question answering benchmark for scholarly knowledge. | [SR] | - | [Github] | [DataSets] |
| 2024 | Biomedical | Bioinfo-Bench: A Simple Benchmark Framework for LLM Bioinformatics Skills Evaluation. | [bioRxiv] | - | [Github] | - |
| 2023 | Chemistry | [ChemLLMBench] Do large language models understand chemistry? a conversation with chatgpt. | [JCIM] | - | [Github] | - |
| 2024 | Chemistry | [ChemLLMBench] What can large language models do in chemistry? a comprehensive benchmark on eight tasks. | [NeurIPS] | - | [Github] | - |
| 2024 | Geoscience | [GeoBench] K2: A foundation language model for geoscience knowledge understanding and utilization. | [WSDM] | - | [Github] | - |
| 2023 | Materials | [MaScQA] MaScQA: A Question Answering Dataset for Investigating Materials Science Knowledge of Large Language Models. | [ArXiv] | - | [Github] | - |
To be refreshed...
TeleQnA: A Benchmark Dataset to Assess Large Language Models Telecommunications Knowledge.
Ali Maatouk, Fadhel Ayed, Nicola Piovesan, Antonio De Domenico, Merouane Debbah, Zhi-Quan Luo.
ArXiv, 2023.
[ArXiv]
[Github]
An Empirical Study of NetOps Capability of Pre-Trained Large Language Models.
Yukai Miao, Yu Bai, Li Chen, Dan Li, Haifeng Sun, Xizheng Wang, Ziqiu Luo, Yanyu Ren, Dapeng Sun, Xiuting Xu, Qi Zhang, Chao Xiang, Xinchi Li.
ArXiv, 2023.
[ArXiv]
[Datasets]
NetConfEval: Can LLMs Facilitate Network Configuration?
C Wang, M Scazzariello, A Farshin, S Ferlin, D Kostić, M Chiesa.
Proceedings of the ACM on Networking, 2024.
[ArXiv]
NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear Domain.
A Acharya, S Munikoti, A Hellinger, S Smith, S Wagle, S Horawalavithana.
ArXiv, 2023.
[ArXiv]
[Github]
Open-transmind: A new baseline and benchmark for 1st foundation model challenge of intelligent transportation.
Y Shi, F Lv, X Wang, C Xia, S Li, S Yang, T Xi, G Zhang.
CVPR, 2023.
[Paper]
[Github]
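Industrial Large Models: Architecture, Key Technologies, and Typical Applications (工业大模型:体系架构、关键技术与典型应用).
任磊, 王海腾, 董家宝, et al.
SCIENTIA SINICA Informationis, 2024 (under review).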
Evaluating the Effectiveness of GPT Large Language Model for News Classification in the IPTC News Ontology.
B Fatemi, F Rabbi, AL Opdahl.
ArXiv, 2023.
[Paper]
How Good is ChatGPT in Giving Advice on Your Visualization Design.
NW Kim, G Myers, B Bach.
arXiv:2310.09617, 2023.
[Paper]
Llmrec: Benchmarking large language models on recommendation task.
J Liu, C Liu, P Zhou, Q Ye, D Chong, K Zhou, Y Xie, Y Cao, S Wang, C You, PS Yu.
arXiv:2308.12241, 2023.
[Paper]
Gameeval: Evaluating llms on conversational games.
D Qiao, C Wu, Y Liang, J Li, N Duan.
arXiv:2308.10032, 2023.
[Paper]
[Github]
AvalonBench: Evaluating LLMs Playing the Game of Avalon.
J Light, M Cai, S Shen, Z Hu.
NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
[Paper]
[Github]
Artificial-General-Intelligence-Testing-Resources.
Resources for AGI & Embodied AI (EAI) Testing.
[Github]
A User-Centric Benchmark for Evaluating Large Language Models.
J Wang, F Mo, W Ma, P Sun, M Zhang, et al.
ArXiv, 2024.
[ArXiv]
[Github]
Understanding User Experience in Large Language Model Interactions.
J Wang, W Ma, P Sun, M Zhang, JY Nie.
ArXiv, 2024.
[ArXiv]
Hi-tom: A benchmark for evaluating higher-order theory of mind reasoning in large language models.
Y He, Y Wu, Y Jia, R Mihalcea, Y Chen, N Deng.
arXiv:2310.16755, 2023.
[ArXiv]
[Github]
Sotopia: Interactive evaluation for social intelligence in language agents.
X Zhou, H Zhu, L Mathur, R Zhang, H Yu, Z Qi, LP Morency, Y Bisk, D Fried, G Neubig, et al.
arXiv:2310.11667, 2023.
[ArXiv]
[Homepage]
Academically intelligent LLMs are not necessarily socially intelligent.
R Xu, H Lin, X Han, L Sun, Y Sun.
arXiv:2403.06591, 2024.
[ArXiv]
[Homepage]
InterIntent: Investigating Social Intelligence of LLMs via Intention Understanding in an Interactive Game Context.
Z Liu, A Anand, P Zhou, J Huang, J Zhao.
arXiv, 2024.
[ArXiv]
Evaluating and Modeling Social Intelligence: A Comparative Study of Human and AI Capabilities.
J Wang, C Zhang, J Li, Y Ma, L Niu, J Han, Y Peng, Y Zhu, L Fan.
arXiv:2405.11841, 2024.
[ArXiv]
[Github]
ToMBench: Benchmarking Theory of Mind in Large Language Models.
Z Chen, J Wu, J Zhou, B Wen, G Bi, G Jiang, Y Cao, M Hu, Y Lai, Z Xiong, M Huang.
arXiv:2402.15052, 2024.
[ArXiv]
[Github]
Testing theory of mind in large language models and humans.
JWA Strachan, D Albergo, G Borghini, O Pansardi, E Scaliti, S Gupta, K Saxena, A Rufo, et al.
Nature Human Behaviour, 2024.
[ArXiv]
Emotionally numb or empathetic? evaluating how llms feel using emotionbench.
J Huang, MH Lam, EJ Li, S Ren, W Wang.
arXiv, 2023.
[ArXiv]
[Github]
Can Generative Agents Predict Emotion?
C Regan, N Iwahashi, S Tanaka, M Oka.
arXiv:2402.04232, 2024.
[ArXiv]
[Github]
EmoBench: Evaluating the Emotional Intelligence of Large Language Models.
S Sabour, S Liu, Z Zhang, JM Liu, J Zhou, AS Sunaryo, J Li, T Lee, R Mihalcea, M Huang.
arXiv:2402.12071, 2024.
[ArXiv]
[Github]
GIEBench: Towards Holistic Evaluation of Group Identity-based Empathy for Large Language Models.
L Wang, Y Jin, T Shen, T Zheng, X Du, C Zhang, W Huang, J Liu, S Wang, G Zhang, L Xiang, et al.
arXiv:2406.14903, 2024.
[ArXiv]
[Github]
A Comprehensive Evaluation of Quantization Strategies for Large Language Models.
M Zhang, X Pan, M Yang.
ACL, 2024.
[ArXiv]
Evaluating the Generalization Ability of Quantized LLMs: Benchmark, Analysis, and Toolbox.
Y Liu, Y Meng, F Wu, S Peng, H Yao, C Guan, C Tang, X Ma, Z Wang, W Zhu.
arXiv:2406.12928, 2024.
[ArXiv]
MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases.
R Murthy, L Yang, J Tan, TM Awalgaonkar, Y Zhou, S Heinecke, S Desai, J Wu, R Xu, S Tan, et al.
arXiv:2406.10290, 2024.
[ArXiv]
OpenCarbonEval: A Unified Carbon Emission Estimation Framework in Large-Scale AI Models.
Z Yu, Y Wu, Z Deng, Y Tang, XP Zhang.
arXiv:2405.12843, 2024.
[ArXiv]
Multimodal-Data-Optimization-Resources.
Test DataSets Evaluation
[Github]
Multimodal-Data-Generation-Resources.
Test DataSets Generation
[Github]
Are large language model-based evaluators the solution to scaling up multilingual evaluation?
R Hada, V Gumma, A de Wynter, H Diddee, M Ahmed, M Choudhury, K Bali, S Sitaram.
arXiv:2309.07462, 2023.
[ArXiv]
Automated evaluation of personalized text generation using large language models.
Y Wang, J Jiang, M Zhang, C Li, Y Liang, Q Mei, M Bendersky.
arXiv:2310.11593, 2023.
[ArXiv]
Calibrating LLM-Based Evaluator.
Y Liu, T Yang, S Huang, Z Zhang, H Huang, et al.
arXiv, 2023.
[ArXiv]
Can large language models be an alternative to human evaluations?
CH Chiang, H Lee.
arXiv:2305.01937, 2023.
[ArXiv]
Chateval: Towards better llm-based evaluators through multi-agent debate.
CM Chan, W Chen, Y Su, J Yu, W Xue, S Zhang, J Fu, Z Liu.
arXiv:2308.07201, 2023.
[ArXiv]
CRITIQUELLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation.
P Ke, B Wen, Z Feng, X Liu, X Lei, J Cheng, S Wang, A Zeng, Y Dong, H Wang, J Tang, et al.
ArXiv, 2023.
[ArXiv]
[Github]
Generative judge for evaluating alignment.
J Li, S Sun, W Yuan, RZ Fan, H Zhao, P Liu.
arXiv:2310.05470, 2023.
[ArXiv]
G-eval: Nlg evaluation using gpt-4 with better human alignment.
Y Liu, D Iter, Y Xu, S Wang, R Xu, C Zhu.
arXiv:2303.16634, 2023.
[ArXiv]
JudgeLM: Fine-tuned Large Language Models are Scalable Judges.
L Zhu, X Wang, X Wang.
ArXiv, 2023.
[ArXiv]
[Github]
Prd: Peer rank and discussion improve large language model based evaluations.
R Li, T Patel, X Du.
arXiv:2307.02762, 2023.
[ArXiv]
[Github]
Split and merge: Aligning position biases in large language model based evaluators.
Z Li, C Wang, P Ma, D Wu, S Wang, C Gao, et al.
arXiv, 2023.
[ArXiv]
Wider and deeper llm networks are fairer llm evaluators.
X Zhang, B Yu, H Yu, Y Lv, T Liu, F Huang, H Xu, Y Li.
arXiv:2308.01862, 2023.
[ArXiv]
Large Language Models are not Fair Evaluators.
P Wang, L Li, L Chen, Z Cai, D Zhu, B Lin, Y Cao, Q Liu, T Liu, Z Sui.
ACL, 2024.
[ArXiv]
Aligning with human judgement: The role of pairwise preference in large language model evaluators.
Y Liu, H Zhou, Z Guo, E Shareghi, I Vulic, A Korhonen, N Collier.
arXiv:2403.16950, 2024.
[ArXiv]
Agent-as-a-Judge: Evaluate Agents with Agents.
M Zhuge, C Zhao, D Ashley, W Wang, D Khizbullin, Y Xiong, Z Liu, E Chang, et al.
arXiv:2410.10934, 2024.
[ArXiv]
An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers.
H Huang, Y Qu, J Liu, M Yang, T Zhao.
arXiv:2403.02839, 2024.
[ArXiv]
CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution.
M Cao, A Lam, H Duan, H Liu, S Zhang, K Chen.
arXiv:2410.16256, 2024.
[ArXiv]
[Github]
Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework.
M Li, Z Liu, S Deng, S Joty, NF Chen, MY Kan.
arXiv:2405.15329, 2024.
[ArXiv]
Length-controlled alpacaeval: A simple way to debias automatic evaluators.
Y Dubois, B Galambosi, P Liang, TB Hashimoto.
arXiv:2404.04475, 2024.
[ArXiv]
Leveraging large language models for nlg evaluation: A survey.
Z Li, X Xu, T Shen, C Xu, JC Gu, C Tao.
arXiv:2401.07103, 2024.
[ArXiv]
Peer-review-in-LLMs: Automatic Evaluation Method for LLMs in Open-environment.
KP Ning, S Yang, YY Liu, JY Yao, ZH Liu, Y Wang, M Pang, L Yuan.
arXiv:2402.01830, 2024.
[ArXiv]
Pre: A peer review based large language model evaluator.
Z Chu, Q Ai, Y Tu, H Li, Y Liu.
arXiv:2401.15641, 2024.
[ArXiv]
Prometheus 2: An open source language model specialized in evaluating other language models.
S Kim, J Suk, S Longpre, BY Lin, J Shin, S Welleck, G Neubig, M Lee, K Lee, M Seo.
arXiv:2405.01535, 2024.
[ArXiv]
Self-taught evaluators.
T Wang, I Kulikov, O Golovneva, P Yu, W Yuan, et al.
arXiv, 2024.
[ArXiv]
The Critique of Critique.
S Sun, et al.
arXiv:2401.04518v1, 2024.
Evaluating large language models at evaluating instruction following.
Z Zeng, J Yu, T Gao, Y Meng, T Goyal, D Chen.
ICLR, 2024.
[ArXiv]
Flask: Fine-grained language model evaluation based on alignment skill sets.
S Ye, D Kim, S Kim, H Hwang, S Kim, Y Jo, J Thorne, J Kim, M Seo.
ICLR, 2024.
[ArXiv]
Benchmarking foundation models with language-model-as-an-examiner.
Y Bai, J Ying, Y Cao, X Lv, Y He, X Wang, J Yu, K Zeng, Y Xiao, H Lyu, J Zhang, J Li, L Hou.
Advances in Neural Information Processing Systems, 2024.
[ArXiv]
Efficiently measuring the cognitive ability of llms: An adaptive testing perspective.
Y Zhuang, Q Liu, Y Ning, W Huang, R Lv, Z Huang, G Zhao, Z Zhang, Q Mao, S Wang, et al.
arXiv:2306.10512, 2023.
[ArXiv]
Large language model routing with benchmark datasets.
T Shnitzer, A Ou, M Silva, K Soule, Y Sun, et al.
arXiv, 2023.
[ArXiv]
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models.
J Cheng, Y Lu, X Gu, P Ke, X Liu, Y Dong, H Wang, J Tang, M Huang.
arXiv:2406.16714, 2024.
[ArXiv]
[Github]
Efficient benchmarking (of language models).
Y Perlitz, E Bandel, A Gera, O Arviv, L Ein-Dor, E Shnarch, N Slonim, M Shmueli-Scheuer, et al.
arXiv:2308.11696, 2023.
[ArXiv]
MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures.
J Ni, F Xue, X Yue, Y Deng, M Shah, K Jain, et al.
arXiv, 2024.
[ArXiv]
[Github]
tinyBenchmarks: evaluating LLMs with fewer examples.
FM Polo, L Weber, L Choshen, Y Sun, G Xu, et al.
arXiv, 2024.
[ArXiv]
[Github]
Dynabench: Rethinking Benchmarking in NLP.
Douwe Kiela, et al.
arXiv, 2021.
Beyond static datasets: A deep interaction approach to llm evaluation.
J Li, R Li, Q Liu, et al.
arXiv, 2023.
[ArXiv]
LLMEval: A Preliminary Study on How to Evaluate Large Language Models.
Y Zhang, M Zhang, H Yuan, S Liu, Y Shi, T Gui, Q Zhang, X Huang.
Proceedings of the AAAI Conference on Artificial Intelligence, 2024.
[HomePage]
[Paper]
[Github]
Have Seen Me Before? Automating Dataset Updates Towards Reliable and Timely Evaluation.
Jiahao Ying, et al.
arXiv:2402.11894v2, 2024.
LiveBench: A challenging, contamination-free LLM benchmark.
C White, S Dooley, M Roberts, A Pal, B Feuer, et al.
arXiv, 2024.
[HomePage]
[Paper]
Beyond static AI evaluations: advancing human interaction evaluations for LLM harms and risks.
L Ibrahim, S Huang, L Ahmad, M Anderljung.
arXiv:2405.10632, 2024.
[ArXiv]
Branch-solve-merge improves large language model evaluation and generation.
Swarnadeep Saha, et al.
arXiv:2310.15123v1, 2023.
Evaluating general-purpose AI with psychometrics.
X Wang, L Jiang, J Hernandez-Orallo, D Stillwell, L Sun, F Luo, X Xie.
arXiv:2310.16379, 2023.
[ArXiv]
State of what art? A call for multi-prompt LLM evaluation.
M Mizrahi, G Kaplan, D Malkin, R Dror, D Shahaf, G Stanovsky.
Transactions of the Association for Computational Linguistics, 2024.
[TACL]
Evals
OpenAI
[Github]
Language Model Evaluation Harness.
EleutherAI
[Github]
DeepEval.
Confident AI
[Github]
OpenCompass
Sinan large model evaluation platform
Shanghai AI Laboratory
[HomePage]
[Github]
FlagEval
Libra large model evaluation platform
Beijing Academy of Artificial Intelligence (BAAI)
[HomePage]
[Github]
Cleva: Chinese language models evaluation platform.
Y Li, J Zhao, D Zheng, ZY Hu, Z Chen, X Su, Y Huang, S Huang, D Lin, MR Lyu, L Wang.
ArXiv, 2023.
[ArXiv]
GPT-Fathom.
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond.
S Zheng, Y Zhang, Y Zhu, C Xi, P Gao, X Zhou, KCC Chang.
ArXiv, 2023.
[ArXiv]
[Github]
Catwalk.
Catwalk: A Unified Language Model Evaluation Framework for Many Datasets.
D Groeneveld, A Awadalla, I Beltagy, A Bhagia, I Magnusson, H Peng, O Tafjord, P Walsh, et al.
ArXiv, 2023.
[ArXiv]
[Github]
LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking.
F Dalvi, M Hasanain, S Boughorbel, B Mousi, S Abdaljalil, N Nazar, A Abdelali, et al.
ArXiv, 2023.
[ArXiv]
[Github]
HumanELY: Human evaluation of LLM yield, using a novel web-based evaluation tool.
R Awasthi, S Mishra, D Mahapatra, A Khanna, K Maheshwari, J Cywinski, F Papay, P Mathur.
medRxiv, 2023.
[ArXiv]
UltraEval.
UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs.
C He, R Luo, X Han, Z Liu, M Sun, et al.
ArXiv, 2024.
[ArXiv]
[Github]
FreeEval.
FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models.
Z Yu, C Gao, W Yao, Y Wang, Z Zeng, W Ye, J Wang, Y Zhang, S Zhang.
ArXiv, 2024.
[ArXiv]
[Github]
OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety.
C Liu, L Yu, J Li, R Jin, Y Huang, L Shi, J Zhang, et al.
ArXiv, 2024.
[HomePage]
Clean-eval: Clean evaluation on contaminated large language models.
W Zhu, H Hao, Z He, Y Song, Y Zhang, H Hu, Y Wei, R Wang, H Lu.
arXiv:2311.09154, 2023.
[Paper]
Data contamination through the lens of time.
M Roberts, H Thakur, C Herlihy, C White, S Dooley.
arXiv:2310.10628, 2023.
[Paper]
Investigating data contamination in modern benchmarks for large language models.
C Deng, Y Zhao, X Tang, M Gerstein, A Cohan.
arXiv:2311.09783, 2023.
[Paper]
NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark.
O Sainz, JA Campos, I García-Ferrero, J Etxaniz, OL de Lacalle, E Agirre.
arXiv:2310.18018, 2023.
[Paper]
[Github]
Rethinking benchmark and contamination for language models with rephrased samples.
S Yang, WL Chiang, L Zheng, JE Gonzalez, I Stoica.
arXiv:2311.04850, 2023.
[Paper]
[Github]
Task contamination: Language models may not be few-shot anymore.
C Li, J Flanigan.
Proceedings of the AAAI Conference on Artificial Intelligence, 2024.
[Paper]
Investigating data contamination for pre-training language models.
M Jiang, KZ Liu, M Zhong, R Schaeffer, S Ouyang, J Han, S Koyejo.
arXiv:2401.06059, 2024.
[Paper]
KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models.
Z Yu, C Gao, W Yao, Y Wang, W Ye, J Wang, X Xie, Y Zhang, S Zhang.
ArXiv, 2024.
[ArXiv]
[Github]
Large language models sensitivity to the order of options in multiple-choice questions.
P Pezeshkpour, E Hruschka.
arXiv:2308.11483, 2023.
[Paper]
Don't make your LLM an evaluation benchmark cheater.
K Zhou, Y Zhu, Z Chen, W Chen, WX Zhao, X Chen, Y Lin, JR Wen, J Han.
arXiv:2311.01964, 2023.
[Paper]
[Github]
Inadequacies of large language model benchmarks in the era of generative artificial intelligence.
TR McIntosh, T Susnjak, N Arachchilage, T Liu, P Watters, MN Halgamuge.
arXiv:2402.09880, 2024.
[Paper]
LMSYS Org
UC Berkeley.
[Homepage]
| Name | Organization | HomePage | Github | Scholar | Benchmark |
|---|---|---|---|---|---|
| Sun Maosong | Tsinghua University | [homepage] | - | [scholar] | - |
| Tang Jie | Tsinghua University | [homepage] | - | [scholar] | - |
| Huang Minlie | Tsinghua University | [homepage] | - | [scholar] | - |
| Zheng Haitao | Tsinghua University | [homepage] | - | [scholar] | - |
| Ye Wei | Peking University | [homepage] | - | [scholar] | - |
| Qiu Xipeng | Fudan University | [homepage] | - | [scholar] | - |
| Xiao Yanghua | Fudan University | [homepage] | - | [scholar] | - |
| Xiong Deyi | Tianjin University | [homepage] | [github] | [scholar] | - |
| Chen Kai | Shanghai AI Lab | - | - | [scholar] | - |
| Zhang Songyang | Shanghai AI Lab | - | - | [scholar] | - |
NeurIPS (Datasets and Benchmarks Track).
[Homepage]
Patronus AI.
[Homepage]