FDU-NLP LLMEval Team

LLMEval-1 Public

[AAAI 2024] LLMEval Phase I dataset — 17 categories, 453 questions, 2186 annotators for Chinese LLM evaluation

LLMEval-2 Public

[AAAI 2024] LLMEval Phase II dataset — professional domain evaluation across 12 academic disciplines

LLMEval-Fair Public

[ACL 2026] A large-scale longitudinal study on robust and fair evaluation of LLMs — 200K+ generative questions across 13 disciplines

LLMEval-Med Public

[EMNLP 2025] A real-world clinical benchmark for medical LLMs with physician validation — 2,996 questions from EHRs

Python 27 1

Llmeval-Gaokao2024-Math Public

LLM evaluation on 2024 Chinese Gaokao Mathematics — zero-contamination benchmark with dual prompt formats

LLMEval-Logic Public

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening (80% public release; 20% private holdout)

Python 9
