Humanity's Last Exam
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
Discussion
Host: Hey everyone, and welcome back to the podcast! Today, we're diving into something that’s been causing quite a stir in the AI world—a new benchmark that's trying to push large language models to their absolute limits. I'm your host, Leo, and I’m super excited to unpack this with all of you.
Host: We’re not just talking about another dataset of questions here, folks. It's called ‘Humanity’s Last Exam,’ or HLE, which, frankly, sounds pretty dramatic, doesn’t it? It’s designed to be incredibly challenging, almost like the final academic hurdle for these AI models. Think of it as, well, the 'last exam'. It’s got me thinking a lot about what it really means to measure intelligence.
Host: So, we're going to go over the different aspects of this exam. We'll start with an overview, then we'll dig into how they came up with this exam, what sort of questions we're talking about, how they're evaluating the performance of these models, and then discuss what it all means for the future. I think this is going to be a really interesting discussion, so let's get started!
Host: Alright, let's jump right in with the introduction to Humanity's Last Exam, or HLE. As I mentioned, it's designed to be an incredibly tough benchmark, right? The creators are saying that current popular benchmarks are basically too easy for state-of-the-art LLMs. Models are acing them with 90% accuracy or more, which makes it tough to really gauge how far these models have come and how much further they can go. It's like giving calculus students basic algebra problems and thinking you're testing their abilities. We need something that really challenges their deep reasoning and problem-solving skills, especially in academic areas.
Host: This really highlights a problem with the current evaluation system. We've been relying on the same benchmarks for so long that the models have practically memorized them. It's less about intelligence at that point and more about recall. The idea of HLE is to move the goalpost and keep pushing the frontier of what AI can do, or rather, what we can measure that they can do. They're really emphasizing closed-ended academic questions that are precise and can be automatically graded, which allows for large-scale testing and measures accuracy rather than creativity or open-ended problem solving.
Host: And I find it fascinating that this isn’t just some random collection of questions. It's supposed to be at the frontier of human knowledge, which is ambitious. It's not something you can just Google the answer to. They emphasize questions that require deep understanding and problem-solving. I'm really curious about how they're ensuring this level of difficulty. How do you even begin to create a dataset like that?
Host: Yeah, and that brings us to the related work. What I found really interesting is that they didn't just wake up one day and decide to create HLE out of the blue. They built upon what's already out there. So, we're talking about the history of benchmarks for LLMs, which is really important. There have been a lot of benchmarks focused on things like general language understanding, code generation, and even mathematical reasoning, which are all crucial, right? We've seen things like MMLU, which tries to cover a broad range of subjects, but as we've touched upon, these are becoming increasingly saturated, which basically means that models are reaching peak performance on these tests.
Host: And the thing about those benchmarks is that they're very useful for tracking the rapid improvement in LLMs. They have allowed us to pinpoint areas where models excel and where they fall short. But the saturation of these benchmarks kind of highlights a need to redefine the scope of the evaluations. You can't keep measuring a car's top speed on the same short stretch of road. Eventually, you'll need a longer track to see how fast it can truly go, and that's kind of what HLE represents in the field of AI. It’s about pushing the boundary and seeing how advanced these AI tools really are. It’s like how some athletes are pushing the limit in running, and every time someone sets a record, it gives us more room to explore. It also means that the next benchmark might be even harder.
Host: That’s a great analogy, Leo. It’s not enough to just see how fast they are on a familiar track; we need to test their endurance and adaptability on new and challenging courses. So, with these established benchmarks losing their effectiveness, the focus has really shifted towards creating more challenging tests. We've seen efforts to develop multi-modal tests that involve images and text, not just text-based inputs, or ones that filter questions more rigorously and bring in domain experts to create more advanced academic questions. And this is precisely where HLE fits into this landscape. It basically combines these approaches which makes it more robust. This means that they use subject matter experts to write the tests while still trying to maintain the broad coverage of MMLU.
Host: Exactly, it's not just about throwing more difficult questions at the AI; it's about creating questions that really tap into expert-level knowledge. It's a move from quantity to quality, in a sense. HLE aims to measure the gap between where current AI stands and the capabilities of expert humans, particularly in very specific academic domains. It's like comparing the abilities of an AI in a certain field to the level of an expert PhD in the same field, and the questions are designed to test exactly that. And it does this using closed-ended questions rather than open-ended ones. It's a fundamental difference from tests that evaluate more general skills.
Host: And it's interesting that they are comparing it specifically to open-ended domain assessments. There is a difference between answering set questions and doing actual creative work. HLE is designed to measure the models’ ability to tackle those structured, expert-level problems, rather than open-ended tasks. That distinction is very important because it highlights what these tests are meant to do – provide a focused measurement of technical and scientific knowledge, and it's not meant to encompass all aspects of intelligence, especially the creative ones. It’s like evaluating whether a chef can follow a recipe versus creating a brand-new dish.
Host: Right, and that leads us very nicely into the dataset itself. So, Humanity's Last Exam consists of a whopping 3,000 super challenging questions, spanning over a hundred different subjects. They've released these questions publicly, which gives AI researchers access to a vast and difficult dataset. And I think it's worth noting that they're doing this while holding back a portion of the questions as a private test set. That's to ensure the models aren't just overfit to the public set but can also tackle new questions. It's a good strategy to prevent 'cheating' by basically memorizing the questions.
Host: It's like having a practice exam and then a real exam, ensuring that students aren't just memorizing answers but really understand the concepts, which is an excellent approach to evaluating AI. Now, about the collection process, this is something else entirely. I was blown away that this is a global collaborative effort. They've got questions from nearly a thousand subject matter experts affiliated with over 500 institutions in over 50 countries. It's essentially the whole world coming together to make one incredibly difficult exam for AI. Most of these contributors are professors, researchers, and those with advanced degrees, so we know they're serious experts in their field. The sheer scale of this collection is pretty impressive.
Host: Absolutely. The diversity in terms of geography and expertise is key here. You want to make sure that this isn't just a test for a certain type of knowledge or a specific part of the world. It really has to challenge the models from every angle possible. It makes sense for a test designed to test the very boundaries of human knowledge to be created through a global effort with a lot of subject experts. It’s also a good way to avoid bias that might creep in if it was only put together by a small group of people or a single institution.
Host: And on that note of diversity, let's talk about the question style. HLE includes both exact-match questions, where models have to give a precise answer, and multiple-choice questions, where they select from five or more options. About 10% of these questions also involve images, which introduces another layer of complexity by combining visual and textual information. Interestingly, most of the questions, about 80%, are exact-match, with the rest being multiple-choice, which means these models need to be very specific and accurate in their answers rather than just guessing or relying on pattern recognition. It tests their deep understanding and their accuracy in recall and computation.
Host: That's right, and the level of detail they require is fascinating. For each question, contributors need to include not just the question itself but also the answer, a detailed rationale explaining the solution, the academic subject, and even the contributor's name and institution. It is very meticulous, and it's probably what gives HLE a lot of integrity. It's a way of maintaining accountability and ensuring the accuracy of these questions. They're not just creating a test; they're creating a resource and setting the standard for the quality and detail involved in building such tests. This isn't just some random guessing game, but an academic exercise at its core.
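To make that structure concrete, here is a minimal sketch of what a single HLE-style record could contain, based purely on the fields mentioned above. The field names and types are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HLEQuestion:
    """Illustrative record for one HLE-style question (field names are assumptions)."""
    question: str                 # full question text, possibly with LaTeX
    answer: str                   # short, verifiable ground-truth answer
    rationale: str                # detailed solution explaining how to reach the answer
    subject: str                  # academic subject, e.g. "mathematics"
    answer_type: str              # "exact_match" or "multiple_choice"
    choices: Optional[list[str]]  # five or more options for multiple-choice, else None
    image_path: Optional[str]     # set for the ~10% of questions that include an image
    contributor: str              # contributor's name
    institution: str              # contributor's affiliation
```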
Host: And the submission format is quite rigorous, as it should be. The questions must be precise, unambiguous, solvable and, crucially, not searchable. That means models can't just pull the answers from the web; they really have to know their stuff. It makes sense that all submissions need to be original or, at the very least, non-trivial syntheses of published information, meaning that while unpublished research is acceptable, contributors shouldn't just be lifting material. This shows they are not testing the model's access to information but rather its actual understanding of it. It's an exercise to see how well these models can independently reason.
Host: It also highlights that these questions require graduate level or expert knowledge. So, it's about testing the models' depth of understanding, their precision, and accuracy. They are not just testing if the models know something but rather if they know it to a very high degree of accuracy and specificity. The questions often delve into specific details such as precise historical events, very niche trivia, or local customs, which all have very specific, objective answers. They're also taking measures to prevent AI from simply regurgitating memorized data by tweaking parameters such as the number of choices so the models don’t just luck into the right answer by guessing.
Host: And I think it's interesting that they require contributors to use clear English, with technical terminology and LaTeX formatting where appropriate. It's all to ensure the questions are as precise and unambiguous as possible. As for the answers, for exact-match questions they must be short and easily verifiable to support automatic grading, which is key to making sure the whole thing runs smoothly. They've also forbidden subjective interpretations and open-ended questions to make sure every question has an objective solution and avoids personal bias. That's why the inclusion of detailed solutions is important; it's how they confirm the accuracy of a question, but also a check to make sure it is well-posed.
Host: Yeah, and I think one last interesting thing about this dataset creation is that they have a prize pool of half a million dollars. It's a way to attract the experts who can produce high-quality questions. So, 5,000 USD for each of the top 50 questions and then 500 USD for each of the next 500. It's quite an incentive and also a great way to get people to participate who are actually experts in those fields. It's not only a financial incentive; the opportunity for co-authorship is another motivating factor, meaning anyone whose question is accepted into HLE gets to be a co-author. All this helps make sure that HLE is a high-quality, rigorous test designed by experts.
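As a quick sanity check, the two tiers described above do add up to the stated half-million-dollar pool:

```python
# Prize pool arithmetic as described: top 50 questions at $5,000 each, next 500 at $500 each.
top_tier = 50 * 5_000          # 250,000
second_tier = 500 * 500        # 250,000
print(top_tier + second_tier)  # 500,000 -- the half-million-dollar pool
```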
Host: Absolutely, the review process is just as thorough as the collection process. First of all, they do an LLM difficulty check, which is quite smart. Before even sending a question to a human reviewer, they test it against several frontier LLMs to ensure that the question is actually challenging enough. If the AI can easily solve it, it's immediately rejected. That means they use the very models they're trying to evaluate to filter the questions, and it shows how this evaluation process is both thorough and innovative.
Host: Exactly, it’s an iterative process, a sort of pre-screening by AI, if you will, that filters out the easier questions. And they logged a lot of attempts, over 70,000, resulting in about 13,000 questions that stumped the AI models, which then moved on to expert human review, and I think that really underscores the sheer scale and meticulousness of this whole process. It’s also another way to ensure the quality and high standard that they want the exam to be.
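In spirit, that pre-screening stage works something like the sketch below: a submitted question only moves on to human review if none of a panel of frontier models answers it correctly. The query_model and answers_match callables are assumptions standing in for whatever model-calling and answer-checking code the authors actually used.

```python
from typing import Callable, Iterable

def passes_difficulty_check(
    question: str,
    reference_answer: str,
    models: Iterable[str],                      # names of the frontier models to try
    query_model: Callable[[str, str], str],     # assumed wrapper: (model, question) -> answer
    answers_match: Callable[[str, str], bool],  # assumed checker comparing two answers
) -> bool:
    """Return True only if every frontier model fails the question."""
    for model in models:
        predicted = query_model(model, question)
        if answers_match(predicted, reference_answer):
            return False  # at least one model solved it, so it is too easy for HLE
    return True  # every model failed; the question moves on to expert human review
```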
Host: Right, and then comes the human expert review. These reviewers have advanced degrees in their fields, which means they're qualified to assess the validity and quality of these questions, and the questions go through two rounds of reviews. The first round is all about refining the submissions, where questions can receive between one and three reviews, depending on the content. They are scored using a standardized rubric and the reviewers also give feedback to help improve the questions, so it’s an iterative process between question submissions and reviews.
Host: And the second round is a more select process. In this round, organizers and reviewers select the top-rated questions from the first round, using a different rubric, and this is where they decide which ones should actually be included in the final HLE dataset. It’s another check to make sure the best questions end up in the dataset. And it also helps maintain the standards and ensures they are fit for the purpose of HLE. They’re not just taking all the questions that stumped the AI, but only the ones that are well-posed, insightful, and accurate.
Host: That's right, and because the questions can be very advanced and specialized, it's impossible for the reviewers to fully verify every solution; they don't spend more than five minutes on a question, so they focus on whether it aligns with the guidelines. And I think that shows the process is rigorous but also realistic about the challenges of such a comprehensive review. And because of those challenges, they welcome community feedback, which is why they're planning a public feedback period to further refine the dataset. It also allows the community to participate and be a part of the process, and it makes sure the dataset is continually improving.
Host: Yeah, and that idea of continuous improvement leads us to the evaluation phase. So how did the LLMs actually perform? This is where it gets really interesting. After all that rigorous collection and reviewing, they put the state-of-the-art LLMs to the test. They used a standardized system prompt to guide the models' responses, and the models had to explicitly state their reasoning and their final answer. The evaluations were done using GPT-4o as the judge to verify each model's answer against the provided answer, so it's also a test of GPT-4o's ability to assess information. And I think that helps add another layer of consistency to the whole process.
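A rough sketch of what that judged evaluation loop could look like: the candidate model is prompted to show its reasoning and a final answer, and a judge model (GPT-4o in the setup described above) is asked whether that final answer matches the reference. The prompt wording and the generate / judge callables are illustrative assumptions, not the paper's exact implementation.

```python
from typing import Callable

# Illustrative prompts -- the wording is an assumption, not the paper's exact prompt.
SYSTEM_PROMPT = (
    "Answer the question. Explain your reasoning step by step, then give your "
    "final answer on a line starting with 'Final answer:'."
)
JUDGE_PROMPT = (
    "Question: {question}\nReference answer: {reference}\nModel response: {response}\n"
    "Does the model's final answer match the reference answer? Reply 'yes' or 'no'."
)

def evaluate(questions: list[dict],
             generate: Callable[[str, str], str],   # assumed wrapper around the candidate model
             judge: Callable[[str], str]) -> float:  # assumed wrapper around the judge model
    """Return accuracy over the questions, as decided by the judge model."""
    correct = 0
    for q in questions:
        response = generate(SYSTEM_PROMPT, q["question"])
        verdict = judge(JUDGE_PROMPT.format(
            question=q["question"], reference=q["answer"], response=response))
        correct += verdict.strip().lower().startswith("yes")
    return correct / len(questions)
```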
Host: Absolutely. What's striking here is the accuracy, which was generally low across all models. This demonstrates a real gap between where these LLMs are and where human experts are. This is important because the dataset is designed to filter out questions that the AI could easily answer, which implies the questions that made it into the dataset are really testing the cutting-edge capabilities of these models. It's a validation of the rigorous collection process, and it indeed highlights a significant area for improvement. It also gives us a better idea of what to expect from the AI as it gets more advanced.
Host: And that low score isn't just due to random guessing. The models do achieve some accuracy, and there's a certain amount of chance involved: on multiple-choice questions they can sometimes land on the right answer without knowing the reasoning behind it, and sometimes they even score below random chance. So it's not a simple case of them just guessing and getting it wrong. They've left those questions in the set instead of removing them, which shows they're aiming for a realistic measure of model performance rather than just creating a test where the models fail all the time.
Host: That's right, it's a very intentional choice and also a reminder that progress, in this area at least, will not be a smooth, linear, upward trend. That's why they emphasize that small changes in those low accuracy scores are not necessarily indicative of significant progress; they just show where the baseline is at the moment. The other significant evaluation is of model calibration error, which is just as important as accuracy, perhaps even more so. Models should be able to recognize when they are uncertain rather than confidently stating wrong answers, the failure we all know as hallucination or confabulation.
Host: And the numbers are quite shocking, actually. The models show very high RMS calibration error scores, which basically means that their confidence levels are badly misaligned with their actual accuracy. This suggests that a model often provides incorrect answers with high confidence and doesn't understand its own limitations, which can be a bigger problem than simply not knowing the answer. It's like an overconfident student who gets everything wrong while believing they're right, which, as we all know, can be dangerous, especially when these models are used in more critical and sensitive fields. It's not only about getting the answer right but also about recognizing when it doesn't know the answer.
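For anyone wondering what that metric actually captures: RMS calibration error groups predictions into bins by the model's stated confidence and measures the root-mean-square gap between each bin's average confidence and its observed accuracy, weighted by bin size. Here is a generic sketch of that computation; the exact binning used in the paper may differ.

```python
import numpy as np

def rms_calibration_error(confidences, correct, n_bins=10):
    """Generic RMS calibration error: the RMS gap between stated confidence and
    observed accuracy across confidence bins, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)  # model's stated confidence in [0, 1]
    correct = np.asarray(correct, dtype=float)          # 1.0 if the answer was judged correct
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, sq_err = len(confidences), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences >= lo) & (confidences < hi) if hi < 1.0 else (confidences >= lo)
        if in_bin.any():
            gap = confidences[in_bin].mean() - correct[in_bin].mean()
            sq_err += (in_bin.sum() / total) * gap ** 2
    return float(np.sqrt(sq_err))

# Example: a model that is always 90% confident but right only half the time
# gets an RMS calibration error of about 0.4.
print(rms_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))
```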