# LLMs and Planning This repo has the code for four papers: 1. The code in 'plan-bench' subdirectory belongs to the paper ["PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change"](https://arxiv.org/abs/2206.10498) 2. The code in 'llm_planning_analysis' subdirectory belongs to the paper ["On the Planning Abilities of Large Language Models--A Critical Investigation"](https://arxiv.org/abs/2305.15771) 3. The code in 'llm_planning_analysis/back_prompting_parallel.py' consists of an implementation of the [LLM-Modulo framework](https://openreview.net/forum?id=Th8JPEmH4z) 4. **NEW**: 'llm_planning_analysis' subdirectory also contains the code for the paper ["A Systematic Evaluation of the Planning and Scheduling Abilities of the Reasoning Model o1"](https://openreview.net/forum?id=FkKBxp0FhR) # PlanBench Static Test Set Leaderboard The leaderboard below shows the performance of the models on the PlanBench static test set with zero-shot prompting. Check out llm_planning_analysis/results/ folder for the detailed files. For Blocksworld Hard, the results are included in results/backprompting/ folder. | Model Name | Model Type | Blocksworld - NL - 600 instances | Mystery Blocksworld - NL - 600 instances | Randomized Mystery Blocksworld - NL - 600 instances | Blocksworld Hard - PDDL - 110 instances | |------------|------------|----------------------------------|----------------------------------------|--------------------------------------------------|----------------------------------------| | Deepseek R1 | LRM | 99.1% | 43.3% | 25.8% | 53.6% | | o1-preview | LRM | 97.8% | 52.8% | 37.3% | 23.65% | | o1-mini | LRM | 56.6% | 19.1% | 3.5% | 10% | | Claude-3.5 Sonnet | LLM | 54.8% | 0% | - | - | | GPT-4o | LLM | 35.5% | 0% | - | - | | LLaMA-3.1 405B | LLM | 62.6% | 0.8% | - | - | | Claude 3 Opus | LLM | 59.3% | 0% | - | - | | LLaMA-3 70B | LLM | 34.16% | 0% | - | - | | GPT-4 | LLM | 34.6% | 0% | - | - | | Gemini 1.5 Pro | LLM | 23.8% | - | - | - | > Note: LLM = Large Language Model, LRM = Language Reasoning Model, NL = Natural Language Prompting, PDDL = Planning Domain Definition Language Prompting ## Submitting to the Leaderboard Kindly submit results of any new models by submitting a pull request with the result file and the leaderboard will be updated. ## Citation(s) PlanBench - _NeurIPS 2023 Datasets and Benchmarks Track_: ``` @article{valmeekam2023planbench, title={Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change}, author={Valmeekam, Karthik and Marquez, Matthew and Olmo, Alberto and Sreedharan, Sarath and Kambhampati, Subbarao}, journal={Advances in Neural Information Processing Systems}, volume={36}, pages={38975--38987}, year={2023} } ``` On the Planning Abilities of Large Language Models - _NeurIPS 2023 Spotlight_: ``` @article{valmeekam2023planning, title={On the planning abilities of large language models-a critical investigation}, author={Valmeekam, Karthik and Marquez, Matthew and Sreedharan, Sarath and Kambhampati, Subbarao}, journal={Advances in Neural Information Processing Systems}, volume={36}, pages={75993--76005}, year={2023} } ``` A Systematic Evaluation of the Planning and Scheduling Abilities of the Reasoning Model o1 - _TMLR_: ``` @article{valmeekam2025a, title={A Systematic Evaluation of the Planning and Scheduling Abilities of the Reasoning Model o1}, author={Karthik Valmeekam and Kaya Stechly and Atharva Gundawar and Subbarao Kambhampati}, journal={Transactions on Machine Learning Research}, issn={2835-8856}, year={2025}, url={https://openreview.net/forum?id=FkKBxp0FhR}, note={} } ``` ## Star History [![Star History Chart](https://api.star-history.com/svg?repos=karthikv792/LLMs-Planning&type=Date)](https://www.star-history.com/#karthikv792/LLMs-Planning&Date)