LLM mistakes to avoid (Part II)
The key to effective model evaluation is to start from a clear-cut use case
This is part II of a series on Mistakes to avoid when using LLM-powered chatbots for business. Part I is here.
Mistake 2: Starting to evaluate model performance before you have a clear and specific business use case
As the AI arms race intensifies, the biggest names in tech keep rushing not only to release new and improved versions of their LLMs, but also to facilitate comparisons between language models and their different versions through automated evaluation tools with predefined metrics such as accuracy, robustness, and friendliness.
The reality, though, is that to successfully compare LLMs and identify the smartest model available to power a particular business chatbot, you first need a concrete use case in mind.

This is because each large language model has its own “personality”, strengths, and weaknesses. While GPT-4 remained the dominant model for over a year, various comparative studies now show Google's Gemini Advanced outperforming it in some tasks, Anthropic’s Claude 3 Opus in others, and so forth.
In practice, even two tasks that look similar may lead to dramatic differences in model performance.
For instance, consider a conversational AI tool designed to retrieve information from large bodies of documents. On the surface, it looks like a fairly specific scenario. But when we look deeper, the underlying technical problem may diverge enough that the use case requires further refinement before we can make the right comparison between LLMs and their corresponding prompting strategies.
One of the key questions to ask here is, “What kind of information needs to be retrieved?”
Imagine that a chatbot is being designed to help legal professionals extract, from a large volume of files, content that meets a known requirement (e.g., “Give me a list of all the references to liability amounts greater than one million dollars in this set of contracts.”)
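As a rough illustration of this kind of “known requirement” retrieval, the sketch below shows one way such a request might be framed as a prompt. The helper function, prompt wording, and output format are hypothetical choices for illustration, not taken from any specific product.

```python
# A hypothetical sketch of how a "known requirement" extraction task could be
# framed as a prompt. The function name and wording are illustrative only.
def build_extraction_prompt(requirement: str, contract_text: str) -> str:
    """Build a prompt asking the model to list every passage matching a known requirement."""
    return (
        "You are reviewing a legal contract.\n"
        f"Requirement: {requirement}\n\n"
        "List every passage in the contract below that meets the requirement, "
        "quoting each passage verbatim and citing its section number. "
        "If nothing matches, answer 'No matches found.'\n\n"
        f"Contract:\n{contract_text}"
    )

# Example usage:
# prompt = build_extraction_prompt(
#     "References to liability amounts greater than one million dollars",
#     contract_text,
# )
```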

This problem is similar to the task described in an experiment performed by Arize AI and documented in the article The Needle In a Haystack Test. In this experiment, various LLMs were prompted to answer the question of what the best thing to do in San Francisco was, based on an essay on a different topic in which the answer was “hidden” at different positions. The experiment concluded that different models need different prompting strategies to perform well on the task, and that small differences in prompts could lead to big changes in retrieval accuracy.
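To make that setup concrete, here is a minimal sketch of a needle-in-a-haystack style check. It is not the Arize AI implementation: the needle sentence, the insertion depths, the model name, and the use of the OpenAI SDK are placeholder assumptions to be swapped for whatever LLM and client you are actually evaluating.

```python
# Minimal needle-in-a-haystack style check (illustrative sketch only).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

NEEDLE = "The best thing to do in San Francisco is to eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"


def build_haystack(filler_text: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    cut = int(len(filler_text) * depth)
    return filler_text[:cut] + "\n" + NEEDLE + "\n" + filler_text[cut:]


def ask(context: str) -> str:
    """Ask the model the question, restricting it to the provided document."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,
        messages=[
            {"role": "system", "content": "Answer using only the provided document."},
            {"role": "user", "content": f"Document:\n{context}\n\nQuestion: {QUESTION}"},
        ],
    )
    return response.choices[0].message.content


def run_depth_sweep(filler_text: str) -> dict:
    """Check, for each insertion depth, whether the model recovers the hidden detail."""
    return {
        depth: "Dolores Park" in ask(build_haystack(filler_text, depth))
        for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
    }
```

Repeating the same sweep with different prompt wordings is what exposes how sensitive each model's retrieval accuracy is to the prompting strategy.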

In the scenario above, the LLM can be informed in advance exactly what object is to be retrieved. But what if, instead of a haystack hiding a needle, you have an auto salvage yard filled with old cars and trucks where you’re hoping to find any kind of junkyard treasure?

A problem of this nature can be found in the HBR article Use GenAI to Uncover New Insights into Your Competitors. The article describes how an analysis using generative AI allowed a company to flag a valuable detail that had initially been overlooked by a team of analysts poring over a 200-page annual report from a rival manufacturer of heavy industry products.
The detail (the purchase of a plot of land in India) could be interpreted as a clear sign that the competitor intended to build a factory to expand into the Indian market. That piece of information, contained in 14 of the 33,660 lines of text in the report, made it possible for the company to start making informed decisions in response to the rival’s expansion months before it became a reality.
Trying to evaluate models for information retrieval without knowing whether we’re optimizing for a needle-in-a-haystack challenge or a treasure-in-the-junkyard competition is like trying to evaluate a runner without knowing whether we’re looking for a sprinter or a marathoner. Each modality requires vastly different skill sets, and consequently demands distinct performance assessment tests.
To avoid mistakes when testing LLMs for an AI-powered business chatbot, start by understanding exactly what your use case is. To be trusted, the evaluation approach and metrics should be defined only after the business application is understood at the “sprint or marathon” level, rather than just at the “running race” level.