We provide a result file named single-pass-eval.jsonl, which contains the evaluation results of the trajectories generated by the model.
For test-time scaling, we provide 40patches.jsonl, which contains the 40 patches used, and 40tests.jsonl, which contains the 40 tests used. We also include the evaluation results of the patches finally selected after test-time scaling in test-time-final-single-pass-eval.jsonl.
The evaluation code will be released soon.
single-pass-eval.jsonl
For each item in the file:
- item['instance_id'] is the unique label of the instance.
- item['patch'] is the patch generated by the model to fix the bug described in the issue.
- item['judge_res']['EVAL_EXEC_str'] is the test execution log after applying item['result']['model_patch'].
- item['judge_res']['test_status'] records the details of FAIL_TO_PASS, PASS_TO_PASS, FAIL_TO_FAIL, and PASS_TO_FAIL.
- item['judge_res']['resolved'] is true if there are no test failures.
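A minimal sketch of reading this file and computing the fraction of resolved instances, using only the instance_id and judge_res.resolved fields described above. It assumes one JSON object per line; the helper names (iter_jsonl, load_jsonl, resolved_rate, unresolved_ids) are hypothetical and not part of the evaluation code to be released.

```python
import json
from pathlib import Path


def iter_jsonl(lines):
    """Yield one dict per non-blank line of JSON Lines text."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def load_jsonl(path):
    """Read a .jsonl file into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return list(iter_jsonl(f))


def resolved_rate(items):
    """Return (resolved, total, rate) based on item['judge_res']['resolved']."""
    total = len(items)
    resolved = sum(1 for item in items if item["judge_res"]["resolved"])
    return resolved, total, (resolved / total if total else 0.0)


def unresolved_ids(items):
    """List the instance_id of every item that was not resolved."""
    return [item["instance_id"] for item in items if not item["judge_res"]["resolved"]]


if __name__ == "__main__":
    # Assumption: the file sits in the current working directory.
    path = Path("single-pass-eval.jsonl")
    if path.exists():
        items = load_jsonl(path)
        resolved, total, rate = resolved_rate(items)
        print(f"resolved {resolved}/{total} ({rate:.1%})")
```

Parsing from an iterable of lines (rather than only from a path) keeps the logic easy to check on a few hand-written records.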
40patches.jsonl
This file is used for test-time scaling and contains 40 independent rollouts of the patches.
40tests.jsonl
This file is used for test-time scaling and contains 40 independent rollouts of the tests.
test-time-final-single-pass-eval.jsonl
This file contains the evaluation log of the patch ultimately selected for each instance after test-time scaling. Each item has a structure similar to that of single-pass-eval.jsonl.
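Because the items here follow a structure similar to single-pass-eval.jsonl, the two files can be compared per instance. The sketch below assumes both files carry instance_id and judge_res.resolved on every line; the function names are hypothetical and this is not the authors' evaluation code.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a .jsonl file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def resolved_by_id(items):
    """Map instance_id -> resolved flag taken from item['judge_res']['resolved']."""
    return {item["instance_id"]: bool(item["judge_res"]["resolved"]) for item in items}


def compare_runs(single_items, final_items):
    """Return (gained, lost, both) sets of instance ids present in both runs.

    gained: resolved only after test-time scaling
    lost:   resolved only in the single-pass run
    both:   resolved in both runs
    """
    single = resolved_by_id(single_items)
    final = resolved_by_id(final_items)
    shared = single.keys() & final.keys()
    gained = {i for i in shared if final[i] and not single[i]}
    lost = {i for i in shared if single[i] and not final[i]}
    both = {i for i in shared if single[i] and final[i]}
    return gained, lost, both


if __name__ == "__main__":
    # Assumption: both files sit in the current working directory.
    single_path = Path("single-pass-eval.jsonl")
    final_path = Path("test-time-final-single-pass-eval.jsonl")
    if single_path.exists() and final_path.exists():
        gained, lost, both = compare_runs(load_jsonl(single_path), load_jsonl(final_path))
        print(f"resolved in both runs: {len(both)}")
        print(f"resolved only after test-time scaling: {len(gained)}")
        print(f"resolved only in single pass: {len(lost)}")
```

Restricting the comparison to instance ids present in both files avoids miscounting if either file is missing some instances.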