🌸 BigCodeBench Leaderboard
BigCodeBench evaluates LLMs with practical and
challenging programming tasks.
📝 Notes
- Evaluated using BigCodeBench version 0.1.0.
- Hard Set vs Full Set:
  - Hard Set: A subset of ~150 BigCodeBench tasks that are more user-facing and challenging.
  - Full Set: The full set of 1140 BigCodeBench tasks.
- Models are ranked according to (calibrated) Pass@1 using greedy decoding (see the sketches after these notes). Setup details can be found here.
- Complete vs Instruct:
  - Complete: Code completion based on the structured, long-context docstring. This variant tests whether the models are good at coding.
  - Instruct (🔥Vibe Check🔥): Code generation based on brief NL-oriented instructions. This variant tests whether the models truly understand human intent well enough to write the code. The two prompt styles are illustrated after these notes.
- Wondering about the relative performance among models, or the current progress of the task solve rate? Check out the 🤗 Hugging Face Leaderboard!
- 💤 indicates models with at least a 1% difference between the calibrated Pass@1 and the original one. What does this imply? Instruction-tuned models can be lazy, omitting essential code parts and thus failing on some tasks. Therefore, we add the missing parts during evaluation and report the calibrated Pass@1 score by default (see the calibration sketch after these notes).
- ✨ marks models evaluated in a chat setting, while others perform direct code completion; both modes are sketched after these notes. We note that some instruction-tuned models lack a chat template in their tokenizer configuration.
- Model providers have the responsibility to avoid data contamination. Models trained on closed data can be affected by contamination.
- 💚 means open weights and open data. 💙 means open weights and open SFT data, but the base model is not data-open. What does this imply? 💚💙 models open-source the data such that one can concretely reason about contamination.
- "Size" here is the amount of activated model weight during inference.
🤗 More Leaderboards
In addition to the BigCodeBench leaderboards, it is recommended to build a comprehensive understanding of LLM coding ability through a diverse set of benchmarks and leaderboards, such as:

🙏 Acknowledgements
- We thank the EvalPlus team for providing the leaderboard template.
- We are grateful for the significant contributions from the BigCode community.