Mistakes to avoid when using LLM-powered chatbots for business
Part I: Why tests designed for humans shouldn't be trusted to measure LLMs' capabilities
On the basis of my work as a data science consultant, I’m writing a 3-part series describing common mistakes I’ve seen organizations make when adopting LLM-powered chatbots for business purposes. If you’re interested in exploring AI chatbots for individual productivity instead, a list of recommended articles can be found at the end of this post.
It’s all about large language models (LLMs) these days.
With their innate ability to accept arbitrary inputs from users and handle digressions and follow-on questions, LLM-powered conversational bots are swiftly replacing traditional AI assistants and rules-based chatbots. And this makes sense, given their potential to enhance customer service, improve employee interactions, reduce response times, and cover a wider range of work-related tasks.
LLM creators, cloud providers, and software vendors are quick to promise AI tools that “comprehend and respond to your queries immediately, even those pertaining to internal, confidential data” and “provide exceptional service round the clock” with “virtually no initial setup time”. However, bringing risks down to an acceptable level and achieving positive ROI with an LLM-powered chatbot are much more complex (and costly) than they would have you believe.
I’m not going to focus on costs here, other than noting that most organizations using proprietary or even open source LLMs are currently relying on heavily subsidized services whose true cost is far higher than what those organizations are currently paying. (More on the costs of both proprietary and open source LLMs can be found here.)
This and future installments of this series will center the discussion on common mistakes to avoid to prevent business initiatives involving LLM-powered chatbots from having disappointing or even disastrous results.
Mistake 1: Overestimating the abilities of LLMs
Hiding behind the awe-inspiring and fast-evolving natural conversation abilities of the current breed of LLMs are weaknesses that, combined with our tendency to anthropomorphize (i.e., to attribute human characteristics or behavior to an object), broaden the scope for misuse of language models.
As Cal Newport illustrates in his article Can an A.I. Make Plans?, LLMs can ace SAT exams and beat us at chess, yet falter on basic math or planning activities that an average person handles with ease.
The following fictitious example was adapted from a real-life scenario.
A vendor that provides email marketing services decided to offer a conversational chatbot to help customers automate marketing tasks. One of the tasks the chatbot was charged with was helping marketers compose marketing emails.
The solution passed numerous tests involving promotional offers and mathematical calculations. But after going live, the chatbot began to fail at tasks like the following:
You are a marketer in charge of communications for an e-commerce business. Write a message inviting customers to take advantage of a time-sensitive promotion: get 20% off the original price of $87.50 on a pair of high-end headphones when purchasing within the next two days. The promotion can be combined with another 20% discount applied to the already discounted price when they sign up to a free account during checkout. Include the total discount percentage the customer will get when taking advantage of the two promotional offers.
In early April 2024, I tried this prompt with the free versions of OpenAI’s ChatGPT (based on GPT-3.5), Microsoft’s Copilot (based on GPT-4), and Anthropic’s Claude 3 Sonnet (Claude’s second-most intelligent model, behind Claude 3 Opus).
The only chatbot to get it right was Claude. Both ChatGPT and Copilot provided the wrong percentage (“Sign up for a free account, and we'll sweeten the deal by applying an additional 20% discount to the already discounted price. That means you'll get a total of 40% off relative to the original price!”), when in reality the total discount is 36% under the straightforward compounding rule the prompt called for: the customer pays 80% of 80% of the original price, i.e., 64% of it.
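To make the arithmetic concrete, here is a minimal sketch in Python. The price and percentages come from the prompt above; the variable names are mine.

```python
# Two successive 20% discounts applied to the prompt's $87.50 headphones.
original_price = 87.50

price_after_first = original_price * (1 - 0.20)       # $70.00 after the first 20% off
price_after_second = price_after_first * (1 - 0.20)   # $56.00 after the second 20% off

total_discount = 1 - price_after_second / original_price
print(f"Total discount: {total_discount:.0%}")        # prints 36%, not 40%
```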
You may be thinking, “The technology is getting better all the time; I wouldn’t be surprised if the paid versions of the models that failed the test are already getting it right.” And this wouldn’t surprise me either: the playing field is constantly changing as LLM creators fine-tune their models further.
But as LLMs’ behaviors change over time, it’s wrong to assume those changes always yield better outcomes. Studies like How Is ChatGPT’s Behavior Changing over Time? show how language models can experience performance degradation in a variety of tasks—including their ability to provide correct answers to math problems.
In particular, given how many experiments keep revealing LLMs’ surprising failure modes, it’s a bad idea to take their astonishing performance on bar exams, SAT math tests, and other tests designed for humans as evidence that they can be trusted to carry out, on their own, work involving things like quantitative analysis or strategic planning. In such cases, with the appropriate guardrails an LLM-powered chatbot can add value by introducing a user-friendly layer to the delivery of answers and self-service actions, but it cannot successfully replace a computational system.
For instance, consider a provider of decision-support technologies for the aviation industry. Clearly, using an AI chatbot to produce, on its own, answers to questions like “When is this plane expected to land?” would make no sense. On the other hand, the same chatbot might become a great companion to a model that calculates flight arrival estimates based on weather conditions, radar data, and other factors, helping users interpret complex analytics and answering follow-on questions in human-like language.
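A rough sketch of this division of labor might look like the following (all names, numbers, and the stubbed-out LLM call are illustrative assumptions, not a reference to any particular vendor’s API): the deterministic model produces the estimate, and the chatbot is only asked to phrase it.

```python
from datetime import datetime, timedelta

def estimate_arrival(departure: datetime, flight_hours: float, headwind_delay_min: float) -> datetime:
    """Deterministic estimator standing in for a real flight-arrival model
    fed by weather conditions, radar data, and other operational inputs."""
    return departure + timedelta(hours=flight_hours, minutes=headwind_delay_min)

def phrase_answer_with_llm(facts: dict) -> str:
    """Hypothetical wrapper around whatever LLM client you use. The model only
    rephrases facts computed upstream; it never performs the calculation itself."""
    prompt = (
        "Explain this flight-arrival estimate to a passenger in one friendly "
        f"sentence, without changing any numbers: {facts}"
    )
    # return your_llm_client.complete(prompt)  # assumption: swap in your actual LLM call
    return prompt  # placeholder so the sketch runs without an API key

eta = estimate_arrival(datetime(2024, 4, 2, 14, 30), flight_hours=5.5, headwind_delay_min=12)
print(phrase_answer_with_llm({"flight": "XY123", "estimated_arrival": eta.isoformat()}))
```

The design choice is the point: the number the user sees is always computed by the purpose-built model, and the LLM’s role is confined to presentation and follow-on conversation.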
What I’m saying here may seem obvious to people working in organizations that have already made significant progress integrating data and analytics across their core functions. But in companies left behind in the quest to embrace advanced data analytics, it’s common to see executives dazzled by a new favorite buzzword, “generative AI,” and placing unwarranted trust in the “cognitive abilities” demonstrated by the newest versions of LLMs.
Whether AI chatbots are used in customer interactions or only in internal processes to help staff complete their tasks faster, the risks arising from overestimating their impressive ability to respond to human prompts should not be ignored. (Given the many limitations and challenges of LLMs, it’s hardly surprising to see the biggest names in tech investing heavily in the next wave of computing beyond language models.)
This is not to say that LLM-powered chatbots can’t be put to good use in business environments.
The point here is that cognitive tests designed for humans are not a reliable measure of an LLM’s abilities and limitations, and while their “apparent reasoning” makes them suitable for some specific real-world tasks, the technical and managerial barriers to making LLM-powered chatbots perform well in high-stakes business scenarios are hard to overstate.
Part II of this series can be found here:
LLM mistakes to avoid (Part II): The key for effective model evaluation is to start from a clear-cut use case
Looking for advice on using LLMs for personal productivity?
Recommended reading:
From the archives:
How to increase your individual productivity with free generative AI
On Modern Analyst: