LeetCodeDataset is a dataset comprising Python LeetCode problems designed for training and evaluating Large Language Models (LLMs).
💻 Hugging Face Datasets 📄 [Paper](https://arxiv.org/abs/2504.14655)
The dataset adheres to the human-eval problem file format.
- `task_id`: The LeetCode problem's question title slug, which corresponds to the problem URL.
- `question_id`: The LeetCode problem's question ID.
- `difficulty`: The problem's difficulty level (Easy, Medium, or Hard).
- `tags`: The problem's topic tags, e.g. `['Array', 'Hash Table']`.
- `problem_description`: The problem description, including examples and constraints.
- `starter_code`: The starter code to solve the problem.
- `estimated_date`: The estimated release date.
- `prompt`: The prefix for the completion, such as basic imports.
- `completion`: The completion without the prompt.
- `entry_point`: The function name used for evaluation.
- `test`: A function to check test cases.
- `input_output`: Test cases.
- `query`: The query, including the problem description and starter code.
- `response`: The correct response.
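For a quick look at the schema, you can read a record directly from a JSONL release. This is a minimal sketch; the file path follows the `./data/LeetCodeDataset-v0.3.1-test.jsonl` naming used in the evaluation example below:

```python
import json

# Read the first record from a local JSONL release of the dataset.
with open("./data/LeetCodeDataset-v0.3.1-test.jsonl") as f:
    record = json.loads(f.readline())

print(record["task_id"], record["difficulty"], record["tags"])
print(record["starter_code"])   # the signature the model must complete
print(record["query"][:200])    # problem description + starter code
```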
LeetCodeDataset can be used for training as follows:
- The dataset is split into training and test sets. Problems are ordered by `question_id`, with those having larger `question_id` values used for the test set.
- Use `query` as the query and `response` as the response when training the LLM on the training split (see the sketch below).
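For example, turning each training record into a prompt/response pair for supervised fine-tuning might look like this. A minimal sketch: the chat-message layout is a common convention, not something the dataset prescribes, and the train-file path is a hypothetical analogue of the test-file path used below:

```python
import json

def to_sft_example(record: dict) -> dict:
    # Map the dataset fields onto a generic chat-style SFT format.
    return {
        "messages": [
            {"role": "user", "content": record["query"]},
            {"role": "assistant", "content": record["response"]},
        ]
    }

# Hypothetical path; adjust to wherever your train split lives.
with open("./data/LeetCodeDataset-v0.3.1-train.jsonl") as f:
    sft_data = [to_sft_example(json.loads(line)) for line in f]
```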
The number of problems in each version and split is as follows:
| Version | Train | Test |
|---|---|---|
| v0.1.0 | 1570 | 175 |
| v0.2.0 | 1890 | 200 |
| v0.3.0 | 2386 | 386 |
| v0.3.1 | 2641 | 228 |
To evaluate model predictions, install the package and run `eval_lcd`:

```bash
git clone https://github.com/newfacade/LeetCodeDataset
cd LeetCodeDataset
pip install -e .
```

```bash
eval_lcd --version v0.3.1 \
    --split test \
    --input_file ./data/LeetCodeDataset-v0.3.1-test.jsonl \
    --predict_column completion
```

Arguments:

- `version`: The dataset version.
- `split`: `test` or `train`.
- `input_file`: A JSONL file containing the problems and predictions for the specified LeetCodeDataset, with a `task_id` and a prediction per line.
- `predict_column`: The column name of the prediction in `input_file`. For example, lines like `{'task_id': 'two_sum', 'output': 'To solve the problem of finding two indices ...'}` use `--predict_column output`.
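A prediction file in this shape can be produced by pairing each `task_id` with the model's output. Below is a minimal sketch; `generate` is a hypothetical stand-in for your model call:

```python
import json

def generate(query: str) -> str:
    """Hypothetical stand-in for your model call (API or local inference)."""
    return "..."  # replace with real model output

with open("./data/LeetCodeDataset-v0.3.1-test.jsonl") as f:
    problems = [json.loads(line) for line in f]

with open("predictions.jsonl", "w") as out:
    for p in problems:
        row = {"task_id": p["task_id"], "output": generate(p["query"])}
        out.write(json.dumps(row) + "\n")
```

The resulting `predictions.jsonl` is then evaluated with `--predict_column output`.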
You can also perform custom evaluations using the `evaluate_functional_correctness` command, which is consistent with human-eval.
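Since the files follow the human-eval format, a single sample can also be checked by executing the `prompt`, a candidate completion, the `test` harness, and a final `check(entry_point)` call together. The sketch below is an assumption based on that format, not the package's own harness, and it runs code unsandboxed (the actual framework isolates execution):

```python
def check_sample(record: dict, candidate: str) -> bool:
    # Assemble a program in the human-eval style: prompt (imports etc.),
    # candidate code, the test harness, then the final check call.
    # WARNING: exec() here is unsandboxed; use proper isolation in practice.
    program = (
        record["prompt"] + candidate + "\n"
        + record["test"] + "\n"
        + f"check({record['entry_point']})"
    )
    try:
        exec(program, {"__name__": "__check__"})
        return True
    except Exception:
        return False
```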
The dataset was constructed as follows:

- Metadata Acquisition, including:
  - question id: unique numeric identifier
  - question: URL-related string (serves as the primary task ID)
  - problem description
  - starter code
- Canonical Solution Verification
  - Retrieved reference solutions from GitHub open-source datasets
  - Validated solution correctness through LeetCode's official execution environment
- Entry Point Identification: implemented text pattern matching to detect target functions (a sketch follows this list)
- Test Case Generation
- Automated Evaluation Framework
  - Developed a sandboxed execution environment for safe code evaluation
  - Implemented a trial-and-error mechanism to execute canonical solutions against generated inputs
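For instance, entry-point identification via text pattern matching might look like the sketch below. The regex and helper name are illustrative assumptions, not the pipeline's actual implementation; LeetCode starter code conventionally defines the target function as a method of a `Solution` class:

```python
import re

def find_entry_point(starter_code: str) -> str | None:
    # Match the first function defined in the starter code, e.g.
    #   class Solution:
    #       def twoSum(self, nums, target):
    match = re.search(r"def\s+([A-Za-z_]\w*)\s*\(", starter_code)
    return match.group(1) if match else None

starter = "class Solution:\n    def twoSum(self, nums, target):\n        pass"
print(find_entry_point(starter))  # -> "twoSum"
```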
Related work:

- Pre-SFT: Let Models Decide on Supervisory Data for Fine-Tuning
- Preference Modeling: Binary Discrimination Versus Imitation Learning
- Policy Filtration in RLHF to Fine-Tune LLM for Code Generation
- AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence
- Breaking the Attention Trap in Code LLMs: A Rejection Sampling Approach to Enhance Code Execution Prediction
- code-r1
To cite LeetCodeDataset:

```bibtex
@misc{xia2025leetcodedatasettemporaldatasetrobust,
      title={LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs},
      author={Yunhui Xia and Wei Shen and Yan Wang and Jason Klein Liu and Huifeng Sun and Siyue Wu and Jian Hu and Xiaolong Xu},
      year={2025},
      eprint={2504.14655},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.14655},
}
```