Evaluating Agent Tool Selection — Testing if First Really is the Worst

ODSC - Open Data Science
Jan 24, 2025

Editor’s note: Sinan Ozdemir is a speaker for the month-long AI Builders Summit starting on January 15th! Be sure to check out his session on January 29th, “Modern AI Agents from A-Z: Building Agentic AI to Perform Complex Tasks,” there!

At its most basic, an AI Agent is not much more than a Generative LLM (GPT-4, Claude, Llama, etc.) reasoning through tasks and running tools like APIs, image generation models (like how ChatGPT uses DALL-E to make images), and code executors.

Your Basic AI Agent relies on an auto-regressive LLM’s ability to think through a task and select the right tool at the right time.
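
To make that concrete, here is a minimal sketch of the kind of tool-selection prompt such an agent builds. The tool names, descriptions, and the call_llm() helper are illustrative placeholders, not the exact framework used in this experiment.

```python
# A minimal sketch of a tool-selection prompt for a basic agent.
# The tool names, descriptions, and call_llm() helper are illustrative
# placeholders, not the exact framework used in this experiment.
TOOLS = {
    "Crypto Lookup Tool": "Look up prices and listings for cryptocurrencies and NFTs.",
    "Google Spreadsheet Tool": "Read from and write to a Google Sheet.",
    "Python Tool": "Execute a short Python snippet and return the result.",
}

def build_prompt(task: str) -> str:
    tool_list = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    return (
        "You are an agent. Pick exactly one tool for the task below.\n"
        f"Available tools:\n{tool_list}\n\n"
        f"Task: {task}\n"
        "Respond with the tool name only."
    )

# chosen_tool = call_llm(build_prompt("Check the status of my NFT listings"))
```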

Agents are useful in theory, but in practice, they often fall short. Evaluating agents therefore becomes a key imperative to make sure things stay on track, and we can evaluate agents on several levels:

  1. Making sure the agent’s final answer is accurate and helpful
  2. Checking that the latency/speed of the system is good enough
  3. Checking each reasoning step to make sure the agent is tackling the problem efficiently and correctly (a minimal per-run record covering all three levels is sketched below)
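
One way to keep all three levels in view is to log them together for every run. This is only a hedged sketch; the field names are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, field

# A sketch of a per-run evaluation record covering the three levels above.
# Field names are illustrative, not a prescribed schema.
@dataclass
class AgentEvalRecord:
    task: str
    final_answer_correct: bool       # level 1: was the final answer accurate and helpful?
    latency_seconds: float           # level 2: how long did the run take?
    reasoning_steps: list[str] = field(default_factory=list)  # level 3: step-by-step trace
```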

One of the more underrated evaluation criteria for agents is quantifying the LLM’s ability to select the right tool for a task. It can be easy to overlook this criterion, because if the final answer is right, that would imply the agent selected the right tools along the way, right? Well, maybe not. The agent may have selected the wrong tool twice before fumbling its way to the right one, and that would hurt both latency and overall accuracy.

Moreover, there are underlying issues with the deep learning architecture that virtually every LLM is based on: the Transformer. While there’s no doubt that the invention of the Transformer was one of the greatest advancements in NLP in the last several decades, there’s one particular type of bias it falls prey to quite often: positional bias.

Depending on where the tools appear in the agent prompt, tools listed later can end up toward the middle of the prompt, where information can get ignored due to positional bias

Positional bias essentially means the LLM has a tendency to pay more attention to (pun intended) tokens at the start or end of the prompt while glossing over tokens in the middle. You may have heard this called the “lost-in-the-middle” problem. This is a big deal when it comes to agents, especially if the LLM favors tools listed earlier in the prompt while glossing over later tools, which often land toward the middle of the overall prompt. As a result, the LLM could pick the wrong tool.
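
To see how tool descriptions end up mid-prompt, here is a rough check built on the earlier prompt sketch; it simply reports how far into the prompt each (assumed) tool description starts.

```python
# A rough illustration of why later tools drift toward the middle of the prompt,
# reusing build_prompt and TOOLS from the earlier sketch. Real agent prompts
# also append instructions, scratchpads, and the task after the tool list,
# which pushes the list even further from the ends.
prompt = build_prompt("Check the status of my NFT listings")
for name in TOOLS:
    position = prompt.index(name) / len(prompt)
    print(f"{name}: starts about {position:.0%} of the way into the prompt")
```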

Testing Tool Selection — The Experiment

To properly investigate tool selection in agents, let’s run a simple test. Our test will run in 3 stages:

Stage 1 — Setup

  1. Write a test set of fairly simple tasks, each of which (mostly) obviously matches a single tool. I have 5 tools in total. Examples include:
  • Check the status of my NFT listings → Crypto Lookup Tool
  • Add a new row and just write “To do” in it → Google Spreadsheet Tool
  • Convert 98 degrees Fahrenheit to Celsius using Python → Python Tool
  2. Define several LLMs to test. I tested several from OpenAI and Anthropic, a Mistral model, a few Llama models, and a Gemini model. (A sketch of this setup follows the list.)
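
For reference, here is a hedged sketch of what this setup might look like in code. The task wording follows the examples above, while the model identifiers are illustrative guesses rather than the exact models tested.

```python
# A sketch of the Stage 1 setup: each test case pairs a simple task with the
# single tool that should obviously handle it. The model identifiers below are
# illustrative, not the exact models used in the experiment.
TEST_SET = [
    {"task": "Check the status of my NFT listings", "correct_tool": "Crypto Lookup Tool"},
    {"task": "Add a new row and just write 'To do' in it", "correct_tool": "Google Spreadsheet Tool"},
    {"task": "Convert 98 degrees Fahrenheit to Celsius using Python", "correct_tool": "Python Tool"},
]

LLMS_TO_TEST = ["gpt-4o", "claude-3-5-haiku", "mistral-large", "llama-3-70b", "gemini-1.5-pro"]
```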

Stage 2 — Run the Agent + Log results

  1. Choose an n (I chose n=10)
  2. For each test datapoint, and for each LLM, shuffle the tool order and pass it into the agent framework n times.
  • For each run, log the correct tool index, the chosen tool index, and whether the agent was correct (see the sketch after this list).
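
Here is a sketch of that loop, reusing TOOLS, TEST_SET, and LLMS_TO_TEST from the earlier sketches. The run_agent() helper is hypothetical and stands in for whatever agent framework you are testing.

```python
import random

# A sketch of the Stage 2 loop. run_agent(llm, tools, task) is a hypothetical
# helper that returns the name of the tool the agent picked; swap in your own
# agent framework here.
N = 10
results = []

for llm in LLMS_TO_TEST:
    for case in TEST_SET:
        for _ in range(N):
            tools = list(TOOLS)        # the tool names, in a fresh order each run
            random.shuffle(tools)      # randomize the order shown to the LLM
            chosen = run_agent(llm, tools, case["task"])
            results.append({
                "llm": llm,
                "correct_index": tools.index(case["correct_tool"]),
                "chosen_index": tools.index(chosen) if chosen in tools else -1,
                "correct": chosen == case["correct_tool"],
            })
```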

Stage 3 — Calculate Results

  1. Calculate the accuracy, precision, recall, and F1 for each LLM’s tool selection, along with broken-down metrics (to see these, check out my upcoming AI Builders talk)
  2. Calculate the % difference between how often each tool index was chosen and how often that index was correct, to see whether the LLMs favored any particular tool indices (a sketch of these calculations follows the list).
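
A sketch of those calculations over the `results` logged above follows; per-tool precision, recall, and F1 are left out here for brevity.

```python
from collections import Counter

def accuracy(rows):
    # Fraction of runs where the agent picked the correct tool.
    return sum(r["correct"] for r in rows) / len(rows)

def index_skew(rows, n_tools=5):
    # Per-index over/under-selection: positive % means that tool index was
    # chosen more often than it was actually correct.
    chosen = Counter(r["chosen_index"] for r in rows)
    correct = Counter(r["correct_index"] for r in rows)
    return {
        i: 100 * (chosen[i] - correct[i]) / correct[i]
        for i in range(n_tools)
        if correct[i]
    }

per_llm_accuracy = {
    llm: accuracy([r for r in results if r["llm"] == llm])
    for llm in LLMS_TO_TEST
}
```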

Tool Selection Accuracy can vary greatly between LLMs

Depending on which LLM I tried, there were pretty stark differences in tool selection accuracy. It’s tempting to look at this and say “Oh, OK, so Anthropic’s Claude 3.5 Haiku is clearly the best LLM for agents,” but that would be misleading. This is a test of my own agent framework, with my own defined tools, on my own test data. I hope you will take this post and my talk as a framework to follow when testing your own agents!

To no one’s surprise, the choice of LLM impacted overall tool selection accuracy

Positional Bias is Real

The graph below shows the average % difference between how often the agent chose a particular tool index (there are 5 bars because I had 5 tools) and how often that tool index was actually correct. So a 9.51% in the first bar means that, on average, the LLMs chose the first tool in the list 9.51% more often than that index was correct. For example, if that tool index was the correct tool index 95 times during the test, the LLM actually chose that tool index roughly 104 times ((104 − 95) / 95 ≈ 9.5%). At the same time, the tools listed further down were under-chosen, showing evidence of a positional bias.

On average, the chosen LLMs tended to over-select tools in earlier indexes

You might be thinking that it was the smaller open source models that really skewed the results, but if you look at the results broken down by model provider, even OpenAI models fall victim to positional bias:

Even the “gold standard” OpenAI LLMs fall victim to positional biases

Conclusion

Our experiment highlights that evaluating an agent’s performance goes beyond simply checking the final answer and how long it took to get there. Even when an agent ultimately reaches the correct solution, inefficient tool selection driven by inherent biases can impact accuracy, latency, and consistency.

Moreover, even the most advanced LLMs from top providers like OpenAI and Google are not immune to these challenges. The over-selection of tools appearing earlier in the list underscores the need for robust testing frameworks and deeper investigations into the LLM’s decision-making process.

The takeaway? Don’t assume a strong final output implies flawless tool selection. Use testing frameworks like the one shared here to rigorously test, iterate, and refine your agents for better real-world performance. And remember, the right tool at the right time isn’t just magical — it’s measurable.

Check out the code for this experiment and much more at my upcoming talk during the AI Builders Summit, Modern AI Agents from A-Z: Building Agentic AI to Perform Complex Tasks. See you there!

About the Author/AI Builders Summit Speaker:

Sinan Ozdemir is a mathematician, data scientist, NLP expert, lecturer, and accomplished author. He is currently applying his extensive knowledge and experience in AI and Large Language Models (LLMs) as the founder and CTO of LoopGenius, transforming the way entrepreneurs and startups market their products and services.
