CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
With the increasing code reasoning capabilities of existing large language models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need for more challenging and comprehensive benchmarks that effectively test their sophisticated competition-level coding abilities. Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. To bridge this gap, we introduce CodeElo, a standardized competition-level code generation benchmark that effectively addresses all of these challenges for the first time. The CodeElo benchmark is built mainly on the official CodeForces platform and aligns with it as closely as possible. We compile the recent six months of contest problems on CodeForces with detailed information such as contest divisions, problem difficulty ratings, and problem algorithm tags. We introduce a unique judging method in which solutions are submitted directly to the platform, and we develop a reliable Elo rating calculation system that aligns with the platform, is comparable with human participants, and has lower variance. Testing on CodeElo, we provide the Elo ratings of 30 popular open-source and 3 proprietary LLMs for the first time. The results show that o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of 1578 and 1261, respectively, while other models struggle even with the easiest problems, placing in the lowest 20 percent of all human participants. We also conduct detailed analysis experiments on performance across algorithm categories and on using C++ versus Python, which suggest directions for future studies.
Discussion
Host: Hey everyone, and welcome back to the podcast! I'm your host, Leo, and today we've got a really fascinating topic lined up. We're diving into the world of AI, specifically large language models, and how they're tackling the challenge of coding at a competitive level. It’s a bit different from the usual stuff we cover, so I’m super excited to explore this.
Guest: Hey Leo, thanks for having me! Yeah, this is a really hot area right now. We've seen AI get incredibly good at generating text, but code is a whole different beast. It requires such precision and logical reasoning that it's a real test of a model's abilities, so it's quite interesting to see how far these models have come in competitive coding scenarios.
Host: Exactly! And that's what makes this research so intriguing. We aren't talking about simple scripts here, we’re talking about problems that require the kind of creative problem-solving and strategic thinking that humans use in coding competitions. It's not just about syntax, it's about algorithms, optimization, and handling complex scenarios. What kind of benchmarks have been used to assess this capability, and why is it so crucial to assess models in competitive environments?
Guest: That's a great question, Leo. Traditionally, we've seen a lot of benchmarks that focus on the functional correctness of code. Models might be asked to write a function that sorts a list or performs a simple calculation. These benchmarks are really useful for understanding basic code generation abilities, but they don't really capture the nuances and challenges of competitive coding. You know, the kind of challenges you see on platforms like CodeForces or AtCoder. Competitive coding often requires more intricate algorithms, optimization strategies, and handling of large datasets, all under time pressure. That's why it's important to test models in a more realistic competitive setting. Moreover, standardized testing gives us a common benchmark to compare against and iterate on; without one, it's hard to know which methods and models are actually progressing.
Host: Right, and I guess that’s where the idea of using CodeForces comes in? It’s a well-known platform for competitive programming, where people from all over the world compete. So why is it ideal for benchmarking LLMs and what makes it different from other existing benchmarks?
Guest: Absolutely, Leo. CodeForces is perfect because it provides real-world problems created by humans, with a large pool of competitors solving them. It gives a genuine sense of how an AI model would stack up against human programmers. Previous benchmarks often rely on self-generated test cases or scraped problems that are not robust enough, leading to many false positives. In a real competition, the test cases are usually quite robust, designed to catch even the slightest error and trick competitors, and creating such a set of test cases is not easy. Additionally, CodeForces offers problems of varying difficulty, from very basic to extremely hard, allowing us to test models at all skill levels. That level of complexity isn't always captured in standard coding tasks. Another crucial aspect is that CodeForces runs its own judging system, including special judges: many problems don't have a unique correct answer, so a dedicated checker program is needed to determine whether a given output is valid. In our study, we found that around 30% of problems need special judges, which most existing benchmarks cannot accommodate. And the execution environment is exactly the same one human competitors use, which makes the comparison fair. These all contribute to making it a great choice for this kind of testing.
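To make the idea of a special judge concrete, here is a minimal Python sketch. The toy problem (output any derangement of 1..n), the function name check, and its interface are illustrative assumptions rather than anything from CodeElo itself; real CodeForces checkers are typically separate programs (often written in C++ with the testlib library), but the principle is the same: validate whatever the contestant printed against the problem's constraints instead of comparing it to one fixed reference answer.

```python
# Hypothetical special-judge checker sketch: suppose a problem asks for ANY
# permutation of 1..n with no fixed points (a derangement). There is no single
# correct output, so instead of string comparison, the checker validates
# whatever the contestant printed against the problem's constraints.

def check(n: int, output_line: str) -> bool:
    """Return True if the contestant's output is a valid derangement of 1..n."""
    try:
        values = list(map(int, output_line.split()))
    except ValueError:
        return False                      # non-numeric output
    if sorted(values) != list(range(1, n + 1)):
        return False                      # not a permutation of 1..n
    return all(v != i + 1 for i, v in enumerate(values))  # no fixed points


if __name__ == "__main__":
    print(check(4, "2 1 4 3"))   # True: a valid derangement
    print(check(4, "1 2 3 4"))   # False: every element is a fixed point
    print(check(4, "2 2 4 3"))   # False: not a permutation
```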
Host: Okay, that makes a lot of sense. It's not just about solving the problem, it's about solving it efficiently and correctly under the same rules as human participants. So, how does this new benchmark, CodeElo, actually work? You mentioned submitting the models' outputs directly, which is kind of interesting. How do you do that?
Guest: Right, so we built a system that acts like a virtual participant: it automatically submits the models' code solutions to the CodeForces platform for testing. We used an automatic submission bot to do this, without any human intervention. The platform's judges then evaluate the submissions just as they would for any human, with fully accurate feedback. This is how we achieve zero false positives: solutions are checked on the platform against human-designed, robust test cases, with no access to those test cases needed on our side. It also means we don't have to recreate the test environments or the special judges, since CodeForces provides them. And importantly, we're using their dedicated execution environment, which ensures that program running time is judged consistently and accurately, regardless of the computing environment used for testing. That matters because timing is a crucial factor in competitive coding.
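As a rough illustration of that "virtual participant" loop, here is a hypothetical Python sketch. None of these names come from the paper: generate_solution, submit_to_platform, and fetch_verdict are placeholders for the model call, the automation that submits code through the CodeForces web interface, and polling the platform's judge for a verdict, and the retry budget is made up.

```python
# Illustrative outline of a "virtual participant" loop (hypothetical sketch; the
# actual CodeElo bot is not detailed in this discussion). The three callables
# stand in for (1) the model generating a solution, (2) the automation that
# submits code to the CodeForces web interface, and (3) polling the official
# judge for a verdict.

import time
from typing import Callable, Tuple

def evaluate_problem(
    generate_solution: Callable[[], str],          # model writes a candidate program
    submit_to_platform: Callable[[str], str],      # bot submits it, returns a submission id
    fetch_verdict: Callable[[str], str],           # poll the platform's judge
    max_attempts: int = 3,                         # illustrative retry budget, not the paper's
    poll_seconds: float = 5.0,
) -> Tuple[bool, int]:
    """Submit model-generated solutions until one is accepted or the budget runs out.

    Returns (solved, failed_attempts_before_success) so that a rating scheme can
    penalize wrong submissions made before the first accepted one.
    """
    for attempt in range(max_attempts):
        code = generate_solution()
        submission_id = submit_to_platform(code)
        verdict = fetch_verdict(submission_id)
        while verdict == "TESTING":                # wait for the official judge to finish
            time.sleep(poll_seconds)
            verdict = fetch_verdict(submission_id)
        if verdict == "ACCEPTED":
            return True, attempt                   # solved after `attempt` failed tries
    return False, max_attempts
```

Returning the count of failed attempts before the first accepted submission is what lets a rating scheme penalize wrong tries, which connects to the Elo discussion below.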
Host: That's a pretty clever setup, a bot acting like a contestant, getting feedback directly from the platform, that solves a lot of problems. It not only gives us more accurate results, but it also makes the whole testing process much more aligned with the actual competition. So what exactly is the output of this whole process, what kind of metric do you use to benchmark these LLMs' capabilities?
Guest: Yeah, exactly. So besides pass rates, which are a typical metric for coding tasks, we also introduce a standardized Elo rating system. If you're familiar with chess or other competitive games, the Elo rating is a way to estimate a participant's skill level relative to others. We adopted a system similar to the official CodeForces one, but with some changes to reduce variance and make it more reliable for benchmarking. This rating is really helpful because it lets us directly compare models' abilities with those of human participants. It's not just pass or fail; it's about seeing where they fall on the skill spectrum. Unlike simple pass@n metrics, which ignore failed submissions made before a successful one and don't account for problem difficulty, the Elo rating penalizes earlier failed attempts and rewards models for attempting and solving harder problems, since those contribute more to the rating. In short, it gives us a more nuanced and meaningful evaluation of the models' performance.
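To ground what an Elo-style estimate looks like, here is an illustrative Python sketch, a reconstruction of the general idea rather than the paper's exact formula: given the ratings of the human participants in a contest and the rank a model achieved, binary-search for the rating whose expected rank, computed from the standard Elo win probability, matches that achieved rank.

```python
# Illustrative sketch of an Elo-style rating estimate (a reconstruction of the
# general idea, not the paper's exact formula). Given the ratings of the human
# participants in a contest and the rank the model achieved, find the rating r
# whose *expected* rank matches that achieved rank. The expected rank uses the
# standard Elo win probability 1 / (1 + 10^((r - r_i) / 400)).

from typing import List

def expected_rank(r: float, opponent_ratings: List[float]) -> float:
    """Expected finishing rank of a player rated r against the given opponents."""
    # Each opponent i beats the player with probability 1/(1+10^((r - r_i)/400));
    # summing those probabilities (plus 1 for the player) gives the expected rank.
    return 1.0 + sum(1.0 / (1.0 + 10.0 ** ((r - ri) / 400.0)) for ri in opponent_ratings)

def estimate_rating(achieved_rank: float, opponent_ratings: List[float]) -> float:
    """Binary-search the rating whose expected rank equals the achieved rank."""
    lo, hi = 0.0, 4000.0                      # plausible rating range on CodeForces
    for _ in range(60):                       # expected_rank decreases monotonically in r
        mid = (lo + hi) / 2.0
        if expected_rank(mid, opponent_ratings) > achieved_rank:
            lo = mid                          # expected rank too deep: rating is too low
        else:
            hi = mid
    return (lo + hi) / 2.0

if __name__ == "__main__":
    field = [800.0, 1200.0, 1500.0, 1900.0, 2400.0]   # made-up contest field
    print(round(estimate_rating(2.0, field)))          # rating implied by finishing 2nd
```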
Host: That’s fascinating! So it's not just about whether they can solve a problem, but also how well they solve it, which is key in a competition. You've mentioned some models standing out, right? What did you discover when testing various large language models with the CodeElo benchmark?
Guest: Yes, we tested a variety of open-source and proprietary models, 33 in total, and the results were really quite interesting. We found that models that use a reasoning process similar to OpenAI’s o1, with a long chain of thought, tended to perform best. For instance, the o1-mini model achieved the highest Elo rating, surpassing nearly 90 percent of human participants. The open-source model QwQ-32B-Preview also did well, placing in about the 60th percentile of human participants. This really highlights the importance of reasoning capabilities when solving complex coding problems. On the other hand, most other models struggled significantly, even with the simpler problems. Many of them fell within the lowest 10th or 20th percentiles of the human Elo distribution, which suggests there’s still a lot of room for improvement in the coding abilities of most current models. And generally, larger models tended to perform better than smaller ones, although the majority of open-source LLMs remain below a rating of 700.
Host: That’s quite a wide performance gap, from the 90th percentile down to the 10th or 20th, it shows a huge variation in model capabilities. Given that, what can you tell us about models' strengths and weaknesses based on your analysis? Are there specific kinds of coding problems where they excel, and what types are more challenging for them?
Guest: That's a key part of our research, Leo. We categorized problems by algorithm tags, and what we found is that models perform relatively well on problems that involve math or implementation. These are areas where models seem to grasp the logic and carry out the steps quite well. They also tend to do well on sorting algorithms. However, they struggle quite a bit with problems involving dynamic programming, or 'DP,' and trees. These are the kinds of problems that require more complex strategies, often a deeper understanding of the problem constraints, and more complex thought processes. This highlights a need to enhance the models’ abilities in these areas. And it also shows the potential for this benchmark to help drive progress by allowing developers to target areas where models are weakest.
Host: That’s really insightful. The fact that they have trouble with dynamic programming and trees suggests they may struggle with a kind of reasoning that is more akin to a strategic, top-down approach rather than just step-by-step execution. I wonder, does the choice of programming language matter in all of this? Do models perform differently depending on whether they’re coding in Python or C++?
Guest: That's a really good point, Leo, and it’s something that actually surprised us. It turns out that language choice does matter significantly. We found that models generally perform best when they use C++ rather than Python, even though most models default to Python. It’s interesting, because human participants in coding competitions typically use C++ for its speed and efficiency, and we saw a similar trend in models, underscoring how important C++ is in runtime-constrained settings. When we prompted models to use either C++ or Python, the C++ outputs consistently scored best across all models, sometimes by a notable margin. This contrasts with previous benchmarks, which primarily use Python, and it hints that models are better suited to competitive environments when they write C++. Our results further suggest that evaluating models with C++ is the way to unlock their best performance. It also points to a gap in their training: models should learn to recognize when C++ is the better tool for speed, and that's something model developers might want to consider in future training.
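For illustration, here is a hypothetical sketch of the kind of language-controlled prompting being described. The exact prompts used in the study aren't shown in this discussion, so the wording and the build_prompt and extract_code helpers below are assumptions, not the paper's pipeline.

```python
# Hypothetical sketch of language-controlled prompting (the study's exact
# prompts are not shown here). The idea: explicitly ask the model for a C++
# (or Python) solution and extract the fenced code block from its reply
# before submitting it to the judge.

import re
from typing import Optional

def build_prompt(problem_statement: str, language: str = "C++") -> str:
    """Ask for a complete solution in the requested language, in one code block."""
    return (
        f"Solve the following competitive programming problem in {language}.\n"
        "Read from standard input and write to standard output.\n"
        "Reply with a single fenced code block containing the full program.\n\n"
        f"{problem_statement}"
    )

def extract_code(reply: str) -> Optional[str]:
    """Pull the first fenced code block (```cpp ... ``` or plain ```) out of a reply."""
    match = re.search(r"```(?:cpp|c\+\+|python)?\s*\n(.*?)```", reply, re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None
```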