OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain
As a typical and practical application of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) techniques have gained extensive attention, particularly in vertical domains where LLMs may lack domain-specific knowledge. In this paper, we introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework, including (1) a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios; (2) a multi-dimensional evaluation data generation approach, which combines GPT-4-based automatic generation and human annotation, achieving an 87.47% acceptance ratio in human evaluations of generated instances; (3) a multi-stage evaluation system that evaluates both retrieval and generation performance, resulting in a comprehensive evaluation of the RAG pipeline; and (4) robust evaluation metrics derived from rule-based and LLM-based methods, enhancing the reliability of assessments through manual annotations and supervised fine-tuning of an LLM evaluator. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets and highlights the performance variations of RAG systems across diverse topics and tasks, revealing significant opportunities for RAG models to improve their capabilities in vertical domains. We open-source the code of our benchmark at https://github.com/RUC-NLPIR/OmniEval.
Discussion
Host: Hey everyone, and welcome back to the podcast! Today, we're diving into the fascinating world of AI, specifically focusing on how we're evaluating these powerful language models, especially when they're trying to do complex things in specialized areas. We're not just chatting about AI in general; we're going to zoom in on a very specific and, frankly, super important area: finance.
Host: Think about it, we rely on these systems for advice, for insights, and increasingly, for handling very sensitive and important financial data. That's why we need to make sure we can evaluate these RAG (Retrieval-Augmented Generation) systems properly. And to do that, we need high quality, comprehensive benchmarks. It's not as simple as just asking a question and checking if the answer is right. We have to get down into the weeds, and that's what we're doing today.
Host: So, I'm excited to introduce a really cool paper that does just that: it's titled OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain. It's from a team at the Gaoling School of Artificial Intelligence at Renmin University of China, and it introduces a new benchmark for evaluating RAG systems in the financial sector. What they're doing here really pushes the boundaries of what's possible and how detailed we can get.
Host: Before we jump into the details, let’s just quickly talk about why we need benchmarks at all. It's like having a standardized test, right? If everyone is testing their model in slightly different ways, it’s hard to compare results. So, benchmarks make sure we’re all playing on the same level field.
Host: Exactly, it’s all about creating consistent, reliable ways to measure how well these RAG systems are performing in real-world scenarios. It's not just about having them give correct answers, but also how they are using the information available and how they handle the diversity of different tasks. This paper, OmniEval, tackles this in a really impressive way that's worth taking a deeper look at.
Host: Alright, let's dive in! So, the main thing here is that they've created this benchmark called OmniEval, and it's designed to be, as they put it, 'omnidirectional'. I found that term really interesting. It's not just a one-size-fits-all test; it's looking at RAG systems from many different angles and scenarios.
Host: What really sets OmniEval apart is its comprehensive evaluation framework. It's not just looking at whether a system gives a right answer, which is what many previous evaluations did. They've designed a matrix-based RAG scenario evaluation system that divides evaluation along two axes: task classes and financial topics. We're talking about five query types, from extractive questions to multi-hop reasoning, long-form, contrast, and conversational Q&A. These are all really distinct, demanding, and common types of financial queries.
Host: And it's not just about the tasks; they also break it down by financial topics. They've got 16 different categories, covering a lot of ground. This combination of task types and financial topics gives us this very structured and detailed way to evaluate the RAG systems. It’s like creating a map where each intersection is a specific scenario we want to test. The authors actually show these matrices visually and it really illustrates the comprehensive nature of the benchmark.
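To make that matrix idea concrete, here's a minimal sketch (in Python) of how a task-by-topic evaluation grid could be organized. The three topic names and the helper functions are purely illustrative placeholders, not OmniEval's actual taxonomy or code; only the five task classes come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass

# The five task classes described in the paper.
TASKS = ["extractive_qa", "multi_hop_reasoning", "long_form_qa",
         "contrast_qa", "conversational_qa"]

# Placeholder topics: the benchmark defines 16 financial topics,
# but these three names are only for illustration.
TOPICS = ["funds", "stocks", "insurance"]

@dataclass
class TestInstance:
    task: str        # one of TASKS
    topic: str       # one of TOPICS
    question: str
    reference: str   # gold answer / evidence

def build_matrix(instances):
    """Group test instances into a (task, topic) evaluation matrix."""
    matrix = defaultdict(list)
    for inst in instances:
        matrix[(inst.task, inst.topic)].append(inst)
    return matrix

def report(matrix, score_fn):
    """Score every cell so weaknesses show up per scenario, not only in aggregate."""
    for (task, topic), cell in sorted(matrix.items()):
        avg = sum(score_fn(inst) for inst in cell) / len(cell)
        print(f"{task:22s} x {topic:10s}: {avg:.3f}")
```

Keeping the grid explicit means results can be reported per cell, which is exactly the fine-grained breakdown the hosts describe next.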
Host: It's brilliant, because in the real world, financial queries are so diverse. It’s not just a matter of asking for a simple number or a fact; sometimes you need comparisons, sometimes complex reasoning, or a detailed summary of information. By building this framework, OmniEval mirrors real-world usage much better than previous benchmarks. I also like that they specifically chose to look at the financial domain because it’s where LLMs often struggle due to their lack of specialized knowledge. Financial language is so specific and relies on very particular terminologies and information sets.
Host: Absolutely, and to further complicate things, financial data changes all the time! It's not a static knowledge base. So any system trying to work with that needs to be able to retrieve and process new information effectively. That’s why RAG systems have the potential to be really powerful in this area because they’re designed to pull in up-to-date information, but we really need ways to reliably evaluate how well that actually works.
Host: Okay, so we've got the matrix-based scenario evaluation. Next up is the multi-dimensional evaluation data generation. How did they actually create all of the questions and answers needed to properly test these RAG models? The paper outlines a really interesting method that combines GPT-4-based automatic generation with human annotation. This isn't a totally automated, hands-off approach; they're using a hybrid technique.
Host: Right, so, they leveraged GPT-4 to generate a large number of diverse examples automatically, making it adaptable and flexible for other areas too. The automation allows them to generate much more data and to do so in a systematic way, which is a big deal in benchmark design. But, just relying on AI to create the datasets would lead to problems – AI might introduce biases or generate questions that don’t really make sense. So they didn't stop there.
Host: Exactly, that's where the human annotation comes in, and that part is essential. They had human experts evaluate the automatically generated instances, checking if they made sense, were factually correct, and relevant to the financial context. It's the human element that ensures the quality of the dataset. They report an impressive 87.47% acceptance rate after this human evaluation, which speaks to the effectiveness of their automatic data generation pipeline. This hybrid approach is pretty unique and very smart, in my view.
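As a rough illustration of that generate-then-filter idea, here's a minimal sketch in Python. The prompt wording, the `call_gpt4` wrapper, and the interactive filter are hypothetical stand-ins, not the authors' actual pipeline; the only number taken from the paper is the 87.47% acceptance ratio mentioned above.

```python
import json

def call_gpt4(prompt: str) -> str:
    """Hypothetical wrapper around a GPT-4 chat-completion call."""
    raise NotImplementedError

def generate_candidates(document: str, task: str, topic: str, n: int = 3):
    """Ask the generator model for QA pairs grounded in one corpus document."""
    prompt = (
        f"Based only on the document below, write {n} {task} questions about the "
        f"financial topic '{topic}', each with a reference answer.\n"
        f"Return JSON: [{{\"question\": \"...\", \"answer\": \"...\"}}]\n\n{document}"
    )
    return json.loads(call_gpt4(prompt))

def human_filter(candidates):
    """Keep only instances an annotator accepts for quality, factuality, and relevance.
    The paper reports roughly 87.47% of generated instances surviving this stage."""
    accepted = []
    for cand in candidates:
        verdict = input(f"Accept this instance? (y/n)\n{json.dumps(cand, ensure_ascii=False)}\n> ")
        if verdict.strip().lower() == "y":
            accepted.append(cand)
    return accepted
```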
Host: I agree completely. It’s important to have that human oversight to avoid introducing biases into the benchmark dataset. I mean, imagine using an AI to test another AI, if the training data has some quirks, that is not going to be an ideal evaluation setting. And the fact that they achieved such a high acceptance rate after human review really highlights how well their pipeline is working. This is setting up not just a large dataset but a quality dataset.
Host: And that leads us to the third key piece of OmniEval which is its multi-stage evaluation system. It doesn't just look at the final answer; it also evaluates the retrieval stage. This is a huge step because it recognizes that a RAG system is only as good as its ability to find the right information to start with.
Host: Yeah, I think that is so crucial. In a real-world RAG system, the first step, retrieval, is super important. If you can’t get the correct or relevant information from the data source to begin with, the best language model in the world isn’t going to help the system give a reasonable answer. It's like the foundation of a building; if the foundation is weak, the whole thing might crumble. And this is especially critical in finance, where accurate, up-to-date information is non-negotiable.
Host: Exactly, open-domain retrievers that aren't trained specifically on financial content might just not find the right context to answer the questions, which leads to inaccurate or completely irrelevant responses. OmniEval evaluates both the retriever and the generator, which is really important for showing where a system is strong and where the problems are coming from.
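Here's a minimal sketch, in Python, of what scoring the two stages separately can look like; `retriever`, `generator`, and the attribute names on `example` are generic stand-ins rather than OmniEval's actual interfaces.

```python
def evaluate_rag(example, retriever, generator, k=5):
    """Score retrieval and generation separately so failures can be localized."""
    # Stage 1: retrieval -- did the gold evidence surface at all?
    retrieved = retriever.search(example.question, top_k=k)
    hit = any(doc.id in example.gold_doc_ids for doc in retrieved)

    # Stage 2: generation -- conditioned on whatever was actually retrieved.
    context = "\n\n".join(doc.text for doc in retrieved)
    answer = generator.generate(question=example.question, context=context)

    return {
        "retrieval_hit_at_k": float(hit),  # rule-based retrieval signal
        "answer": answer,                  # scored downstream by Rouge / an LLM judge
    }
```

Reporting the two stages side by side is what lets you tell a retrieval failure apart from a generation failure, which is exactly the diagnostic value discussed next.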
Host: This is exactly the kind of detailed analysis that's needed to make RAG systems truly useful in complex domains like finance. It’s not enough to just say, “Oh, it answered correctly.” We need to know if it found the right documents, if it used them effectively, and then if the final answer was good. And that’s where the multi-dimensional evaluation metrics come in.
Host: Yes, their choice of metrics is really thorough. They went with both rule-based and LLM-based evaluation metrics. Rule-based metrics, like MAP (Mean Average Precision) and Rouge, which are useful for comparing text, are solid foundations for judging retrieval performance and textual similarity, but they’re not the whole picture and they don't do well with complex human-like answers.
Host: Exactly, metrics like MAP and Rouge-L are great for things like comparing text similarity, but they often miss the bigger picture when it comes to evaluating how well these RAG systems understand the content they're processing. For example, a system could give a 'correct' answer that's missing important context, or it might be factually correct but lack crucial elements of comprehension. They're pretty useful but limited when trying to truly judge a complex answer, the kind an LLM provides. That's where LLM-based metrics come in.
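To ground the rule-based side, here's a small worked example of MAP (Mean Average Precision) in Python; the document IDs are toy values for illustration only.

```python
def average_precision(ranked_ids, relevant_ids):
    """AP: mean of precision@k over the ranks k where a relevant document appears."""
    hits, precisions = 0, []
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """MAP: average AP over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example: relevant docs d1 and d4 are retrieved at ranks 1 and 3,
# so AP = (1/1 + 2/3) / 2 = 0.833...
print(average_precision(["d1", "d2", "d4"], {"d1", "d4"}))
```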
Host: Exactly, that's where things get really interesting. They used LLMs to evaluate the responses based on some pretty sophisticated metrics: accuracy, completeness, hallucination, utilization, and numerical accuracy. That's a level above a simple word match. Accuracy, as they define it, isn’t just about matching words in the answer; it’s about semantic consistency, or, how well the system understands the question and formulates the answer in a meaningful way. It's measuring whether the system provides meaningful responses.
Host: Completeness is about whether an answer fully addresses all aspects of a question, which is extremely important for long-form questions that often need to cover different angles to give a really good and useful answer. Hallucination checks whether the LLM is making up things that aren't supported by the retrieved information, which is a major problem with RAG systems. Utilization measures how well the system uses the documents it retrieves to form the answer, again addressing cases where a RAG model doesn't make good use of the content provided. And then, crucially in finance, we've got numerical accuracy. These metrics go beyond simple text matching to assess the real-world quality of the responses.
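One way to picture the LLM-judged side is a scoring loop like the following sketch; the prompt wording, the 0-to-1 scale, and the `call_llm` parameter are illustrative assumptions, not the paper's actual judge prompt.

```python
import re

JUDGE_PROMPT = """You are grading a RAG system's answer on one dimension: {metric}.
Question: {question}
Retrieved passages: {context}
Model answer: {answer}
Reference answer: {reference}
Reply with 'SCORE: <number between 0 and 1>' and a one-sentence justification."""

METRICS = ["accuracy", "completeness", "hallucination",
           "utilization", "numerical_accuracy"]

def judge(example, answer, context, call_llm):
    """Ask an evaluator LLM for a score per dimension; call_llm is any chat wrapper."""
    scores = {}
    for metric in METRICS:
        reply = call_llm(JUDGE_PROMPT.format(
            metric=metric, question=example.question, context=context,
            answer=answer, reference=example.reference))
        match = re.search(r"SCORE:\s*([01](?:\.\d+)?)", reply)
        scores[metric] = float(match.group(1)) if match else None
    return scores
```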
Host: I think it's really important that they've identified these five metrics. It's a good balance of different aspects of performance. Because the LLM itself makes the assessment, it's much more flexible and better able to determine whether the system really understands the content and answers correctly. It's not just about getting a right answer; it's also about how the answer is formulated.
Host: And to make the LLM-based metrics even more reliable, they didn't just use a zero-shot LLM to evaluate the responses. They fine-tuned a smaller LLM, specifically Qwen2.5-7B-Instruct, on a dataset of human-annotated evaluation results, and found it to be more accurate, reaching 74.4% accuracy against human evaluation. This fine-tuning step ensures that the evaluator LLM isn't just randomly judging; it's trained to provide really accurate feedback.
Host: It's a really rigorous approach and again shows how thoughtful this benchmark is. It makes sense to me that fine-tuning a smaller LLM evaluator on human annotations is a great way to go. It not only helps provide very accurate ratings but also makes the evaluation process more efficient. I mean, running a huge LLM to evaluate every response is not feasible at scale. A fine-tuned model that can do the job with high accuracy is really a good approach.
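For a sense of what the fine-tuning data for such an evaluator might look like, here's a sketch of one training record in JSONL form; the field names and the example content are hypothetical, not the paper's actual schema.

```python
import json

# Illustrative supervised fine-tuning record: a human-annotated judgment that the
# evaluator model (e.g., Qwen2.5-7B-Instruct) learns to reproduce.
record = {
    "instruction": "Rate the answer's hallucination on a 0-1 scale given the passages.",
    "input": {
        "question": "What net asset value did the fund report?",
        "retrieved_passages": ["..."],
        "model_answer": "...",
        "reference_answer": "...",
    },
    "output": "SCORE: 1.0  The answer is fully supported by the passages.",
}

with open("evaluator_sft.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```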
Host: Absolutely, and the scale of the benchmark is impressive. They have around 11.4k automatically generated test examples, 1.7k human-annotated examples, and even a 3k training set for fine-tuning. They've produced a large and diverse set of data for both testing and development, which is a real asset for researchers working on RAG models. All of this is openly available, which is another important step.
Host: And they didn't just build a benchmark; they actually used it to test several different RAG setups. They evaluated multiple retrievers paired with several open-source LLMs, which shows how systems perform across a variety of settings. They used retrievers like BGE-M3, BGE-large-zh, GTE-Qwen2-1.5b, and jina-zh, and tested them with language models like Qwen2.5-72B-Instruct, Llama3.1-70B-Instruct, Deepseek-v2-chat, and Yi1.5-34B. It provides really solid results showing how various systems do across all the different types of data, and they also released a lot of comparison visualizations.
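A sweep over those combinations is conceptually just a nested loop; in this sketch the loader functions and `evaluate_rag` (from the earlier sketch) are stand-ins, and only the model names come from the paper.

```python
RETRIEVERS = ["BGE-M3", "BGE-large-zh", "GTE-Qwen2-1.5b", "jina-zh"]
GENERATORS = ["Qwen2.5-72B-Instruct", "Llama3.1-70B-Instruct",
              "Deepseek-v2-chat", "Yi1.5-34B"]

def run_grid(test_set, load_retriever, load_generator, evaluate_rag):
    """Run every retriever x generator pairing over the same test set."""
    results = {}
    for r_name in RETRIEVERS:
        retriever = load_retriever(r_name)
        for g_name in GENERATORS:
            generator = load_generator(g_name)
            results[(r_name, g_name)] = [
                evaluate_rag(ex, retriever, generator) for ex in test_set
            ]
    return results
```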
Host: And what did the results show? The most important finding was that RAG system performance really does vary across topics and tasks; they found no single system that works best in every scenario, which really highlights the importance of the fine-grained analysis that OmniEval provides. They also found that using RAG generally improves the results of the LLMs, and that despite all of the systems they tested, there's still a lot of room for improving RAG in specialized domains like finance. The research shows that RAG systems are good, but they definitely still need a lot of work.
Host: It's really interesting that the results show that even the most state-of-the-art RAG systems still struggle in financial domains, which is something we suspected but it’s good to have data to support the claim. This really underscores how important it is to have a specialized benchmark like OmniEval. It’s not just about improving performance on general tasks; it’s about enabling AI to be reliable and useful in areas that require domain-specific knowledge and reasoning.
Host: I also found it interesting that retrieval quality really matters. The GTE-Qwen2-1.5b model outperformed the other retrievers, reportedly because it was itself fine-tuned from an LLM, which really showcases the importance of having stronger, domain-specific retrieval models to supplement these RAG systems. This really underscores the importance of a comprehensive evaluation: it's not enough to just evaluate the final answer; we need to know how well each component of the system is performing.
Host: And this is where the detailed analysis that OmniEval provides really shines. The matrix-based approach allows the researchers to examine the performance of these RAG systems in specific scenarios and discover those areas where the models need to improve. For example, they found that tasks like multi-hop reasoning and conversational QA are harder than other tasks. This really allows for targeted improvements in different models and makes the process much more efficient.
Host: Absolutely, it's really important to not just evaluate things in aggregate, but also specifically across different tasks and topics. Because once you find the areas where a model is weak, you can really focus on exactly what's needed to improve performance in that scenario. It isn't enough to just add more training data; instead, you really need to target specific areas.
Host: And I think that's why a benchmark like OmniEval is so valuable. It’s not just a pass/fail test; it’s a tool for diagnosis. It gives researchers the insights they need to develop better RAG systems, and it's this very detailed level of analysis that's key for the future of AI in domains that require expertise. They created a very detailed tool that can really be used to help further RAG development.
Host: So, to sum it up, this paper presents a really impressive step forward in how we evaluate RAG systems. They've created a robust and highly detailed benchmark, OmniEval, that combines automated generation with human oversight, a multi-stage evaluation, and a sophisticated set of evaluation metrics. The results show that RAG systems still need more work, but the paper also sets a framework for how future evaluations should be built.