Evaluating Generative AI: The Evolution Beyond Public Benchmarks
In the evolving landscape of artificial intelligence, particularly with the rise of generative AI, the methods for evaluating these systems have transformed significantly. Our recent episode of the ODSC Ai X Podcast featured Jason Lopatecki, co-founder of Arize AI, who shed light on the challenges and innovations in evaluating generative AI models. In this blog, we’ll explore the key points of Jason’s discussion on evaluating generative AI, focusing on the paradigm shift away from public benchmarks and toward task-specific evaluations.
You can listen to the full podcast on Spotify, Apple/iTunes, and SoundCloud.
Jason will also be speaking at ODSC West this October! Check out his talk, “Demystifying LLM Evaluation,” there!
The Problem with Public Benchmarks for Evaluating Generative AI
Public benchmarks have long been a staple in AI evaluation. These standardized datasets allow developers and researchers to compare model performance. However, Jason was quick to point out their limitations. According to him, public benchmarks are not as valuable as they once were, particularly in the context of large language models (LLMs) and generative AI.
A key issue with public benchmarks is data leakage. When benchmarks are widely available, it’s easy for data from the test sets to end up in training datasets, either inadvertently or through deliberate fine-tuning. This can result in artificially inflated performance scores that don’t reflect the model’s ability to handle real-world data. Moreover, public benchmarks tend to degrade over time. Jason referred to the “half-life” of public tests, where the value of a benchmark diminishes as models get optimized specifically to excel at those tests, rather than generalizing well to new data or tasks.
Jason’s take on the issue is straightforward: public benchmarks, while useful for a high-level view, are not the best indicators of a model’s performance on the tasks that matter to individual organizations. Instead, companies should focus on building their own test sets that are directly aligned with their specific tasks and use cases.
Task-Specific Evaluations: The New Gold Standard
The shift from public benchmarks to task-specific evaluations is a crucial development in AI evaluation. Jason emphasized that models should be tested on the tasks they will actually perform. This means creating custom test sets tailored to the specific applications of the AI system.
Why is this so important? Generative AI models, especially foundation models like GPT-4 or Mistral, are often trained on a wide variety of data, making them capable of handling many different tasks. However, when these models are deployed in specialized industries such as finance or healthcare, their real-world performance depends on how well they handle domain-specific challenges. Evaluating models on a few hundred examples that reflect the actual work they’ll be doing is far more valuable than seeing how they perform on a generic, public benchmark.
Jason underscored that building a small test set of 50–100 samples, hand-labeled by experts, can provide a strong foundation for evaluation. This approach allows for more accurate assessments of whether the model is performing the specific tasks it’s designed to handle. It also helps teams catch issues like hallucinations or misinterpretations early in the process.
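To make this concrete, here is a minimal sketch of what evaluating against a small, hand-labeled test set could look like in Python. The JSONL file name, the label scheme, and the placeholder model call are illustrative assumptions, not part of Jason’s or Arize’s tooling.

```python
import json

# Assumed format: one JSON object per line with an "input" and an
# expert-provided "expected" label (e.g., "supported" / "hallucinated").
def load_test_set(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

def evaluate(model_fn, test_set: list[dict]) -> float:
    """Run the model over each hand-labeled example and report agreement."""
    correct = 0
    for example in test_set:
        prediction = model_fn(example["input"])  # your model or pipeline call
        if prediction.strip().lower() == example["expected"].strip().lower():
            correct += 1
    return correct / len(test_set)

if __name__ == "__main__":
    test_set = load_test_set("golden_set.jsonl")       # 50-100 expert-labeled rows
    accuracy = evaluate(lambda x: "supported", test_set)  # placeholder model_fn
    print(f"Agreement with expert labels: {accuracy:.1%}")
```

Even something this simple gives a team a repeatable, task-specific score to track as prompts, models, or pipelines change.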
The Role of LLMs as Judges
One of the more innovative evaluation techniques discussed by Jason was using LLMs themselves to evaluate other models — a concept he referred to as “LLMs as judges.” This involves using a generative AI model to assess outputs, especially for subjective tasks like determining whether a summary is accurate or whether the tone of a conversation is appropriate.
LLMs as judges represent a major departure from traditional evaluation methods, which typically involve hard-coded metrics like precision or recall. Instead, AI systems are now tasked with evaluating their peers, making this method more dynamic and adaptable. However, Jason acknowledged the concerns around this approach, particularly the introduction of potential biases and the question of which LLMs should be used as judges.
To mitigate some of these issues, Jason recommends that teams start with larger, more expensive models for evaluation. This ensures a higher level of accuracy and reliability before moving to smaller, cost-optimized models. Importantly, he also highlighted the value of reasoning explanations provided by these judge models. When a model not only gives an evaluation but also provides a rationale for its judgment, it becomes much easier to trust the results and fine-tune the evaluation process.
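As an illustration, here is a hedged sketch of an LLM-as-judge call using the OpenAI Python client. The prompt wording, the "gpt-4o" judge model, and the accurate/inaccurate label scheme are assumptions chosen for the example, not Arize’s implementation; the key idea is asking the judge for a rationale before its final label.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating a summary against its source document.
Source document:
{document}

Summary:
{summary}

First explain your reasoning in one or two sentences, then on the final line
answer with exactly one word: "accurate" or "inaccurate"."""

def judge_summary(document: str, summary: str, model: str = "gpt-4o") -> tuple[str, str]:
    """Ask a larger 'judge' model for a label plus a written rationale."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(document=document, summary=summary)}],
        temperature=0,
    )
    text = response.choices[0].message.content.strip()
    *rationale_lines, label = text.splitlines()
    return label.strip().lower(), " ".join(rationale_lines).strip()
```

Keeping the rationale alongside the label makes it much easier to audit the judge’s decisions and to spot systematic biases in how it scores outputs.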
Challenges of Agentic AI
Jason also delved into the complexities of evaluating agentic AI systems, which involve multiple calls to LLMs, iterative processes, and interactions with external tools or APIs. Unlike traditional machine learning models that produce a single output based on a fixed set of inputs, agentic AI systems are more dynamic and often involve multiple decision points.
Evaluating these systems requires breaking down the process into smaller components. Jason stressed the importance of evaluating each decision point or function call individually, ensuring that every part of the system is functioning as expected. This type of evaluation is much more complex than simply testing whether a model can generate the right answer to a question.
For companies looking to adopt agentic AI, Jason’s advice is to start simple. By building a basic version of the agent and getting it to work reliably, teams can avoid the pitfalls of premature optimization. Once the system is functional, they can then move on to more complex evaluations, gradually increasing the sophistication of their models and workflows.
Moving Toward Evaluation-Driven Development
One of the most compelling ideas Jason introduced was the concept of evaluation-driven development. In this approach, evaluations are not just an afterthought but an integral part of the development process. By embedding evaluations into the CI/CD pipelines, teams can continuously test and refine their models as they develop them. This ensures that new changes don’t introduce regressions and that the model is always moving toward better performance.
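As a rough illustration of how this fits into a CI/CD pipeline, an evaluation can be expressed as an ordinary test that fails the build when scores regress. The threshold, file name, and placeholder pipeline below are assumptions for the sketch.

```python
# test_model_quality.py -- run by pytest in CI so that a change which drops
# evaluation scores below an agreed bar fails the build.
import json

ACCURACY_THRESHOLD = 0.85             # assumed regression bar; tune per task
GOLDEN_SET_PATH = "golden_set.jsonl"  # hand-labeled examples, checked into the repo

def run_pipeline(text: str) -> str:
    """Placeholder for the application's real LLM pipeline."""
    return "supported"

def test_accuracy_does_not_regress():
    with open(GOLDEN_SET_PATH) as f:
        examples = [json.loads(line) for line in f]
    correct = sum(
        run_pipeline(ex["input"]).strip().lower() == ex["expected"].strip().lower()
        for ex in examples
    )
    accuracy = correct / len(examples)
    assert accuracy >= ACCURACY_THRESHOLD, (
        f"Accuracy {accuracy:.1%} fell below the {ACCURACY_THRESHOLD:.0%} bar"
    )
```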
Tools like Phoenix, an open-source project led by Arize AI, play a significant role in this process. Phoenix allows teams to track evaluations, experiments, and traces, offering deep insights into how their models are performing. This kind of observability is essential for understanding where a model might be going wrong and how it can be improved.
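Getting started with Phoenix is lightweight; the snippet below follows its public quickstart pattern for launching the local app, though exact APIs may vary by version.

```python
# pip install arize-phoenix
import phoenix as px

# Launch the local Phoenix app; traces and evaluations sent to it can then be
# inspected in the browser UI.
session = px.launch_app()
print(session.url)  # open this URL to explore traces and evaluation results
```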
The Future of Evaluating Generative AI
As generative AI continues to evolve, the methods we use to evaluate it will also need to adapt. Jason pointed out that we’re still in the early days of understanding how to properly evaluate these systems, particularly in complex, agentic workflows. But the tools and techniques are rapidly improving.
Looking forward, we can expect to see more integration of LLMs as judges, more sophisticated task-specific evaluations, and a greater emphasis on evaluation-driven development. As AI models become more complex, so too will the methods we use to evaluate them, but the goal remains the same: ensuring that AI systems are reliable, trustworthy, and effective at solving the problems they were designed to address.