BigCodeBench Leaderboard

📝 Notes

Evaluated using BigCodeBench;
Hard Set vs Full Set:
Hard Set: A subset of ~150 BigCodeBench tasks which is more user-facing and challenging.
Full Set: The full set of 1140 BigCodeBench tasks.
Models are ranked according to (calibrated) Pass@1 using greedy decoding. Setup details can be found here.
Complete vs Instruct:
Complete: Code Completion based on the structured long-context docstring. This variant tests if the models are good at coding.
Instruct (🔥Vibe Check🔥): Code Generation based on the brief NL-oriented instructions. This variant tests if the models are really capable enough to understand human intents to code.
Wonder the relative performance among models, or the current progress of task solve rate? Check out the 🤗 Hugging Face Leaderboard!
🧠 indicates an evaluation setup without response prefilling during generation, potentially leading to the reasoning process.
✨ marks models evaluated using a chat setting, while others perform direct code completion. We note that some instruction-tuned models miss the chat template in their tokenizer configuration.
Model providers have the responsibility to avoid data contamination. Models trained on close data can be affected by contamination.
💚 means open weights and open data. 💙 means open weights and open SFT data, but the base model is not data-open. What does this imply? 💚💙 models open-source the data such that one can concretely reason about contamination.
"Size" here is the number of model parameters during inference.

🌸 BigCodeBench Leaderboard

BigCodeBench evaluates LLMs with practical and challenging programming tasks.

📝 Notes

🤗 More Leaderboards

🙏 Acknowledgements