Why game-changers are unlikely to originate from split testing
And how to broaden your thinking about experimentation
As a data scientist, I’m happy when a business problem I’m working on requires experimentation to validate a hypothesis. Every experiment, even a failed one, provides us with important information and can point out the gaps and flaws in our ideas so that we can take timely action to fix what’s not working.
A problem arises when companies start to equate experimentation with split testing. As you probably already know, split testing, also called A/B testing, randomly assigns subjects to one treatment or another (patients prescribed drug A vs. drug B, consumers exposed to price A vs. price B, etc.). The process starts with a hypothesis (“drug B is better than the existing standard drug A”, “price B is more profitable than the current price A”).
Such tests are common in web design and marketing, where results are readily measured. In fact, failing to run an A/B test before committing to a marketing change can cost a company millions in revenue, as Basecamp learned from a failed redesign that removed the signup form from their landing page.
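To make the mechanics concrete, here’s a minimal sketch of how such a test is typically evaluated, using made-up conversion counts and a standard two-proportion z-test (the numbers and the statsmodels call are purely illustrative, not from any specific experiment):

```python
# Minimal split-test evaluation: did variation B convert better than control A?
# All counts below are hypothetical.
from statsmodels.stats.proportion import proportions_ztest

conversions = [420, 480]      # conversions observed in A and B
visitors = [10_000, 10_000]   # visitors randomly assigned to each variant

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the difference is unlikely to be pure chance;
# it says nothing about whether B was worth testing in the first place.
```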
However, an overreliance on split tests is a hindrance to innovation.
When split testing becomes an unhealthy obsession
Years ago I was working on a software-as-a-service (SaaS) application that had an obvious flaw that made it very difficult for users to get their job done. The solution to improve user experience was obvious to anyone who understood the “job” the software was hired to do. Still, because the startup relied on data generated from A/B tests for every product decision, it was hard at first to convince the decision-makers to approve the change without running a comparison experiment first.
A good analogy would be a restaurant chain with a floor plan where the kitchen is divided into two rooms separated by a large dining area. To grill some vegetables, the cook has to go back and forth, walking around tables to the other side to wash the produce, then carrying it back to chop it, then crossing the dining area again to finally start grilling.
Imagine setting up an experiment to test the hypothesis, “Bringing the whole kitchen to one side of the dining area (variation) will make the cooking process smoother and the cook happier compared to keeping it divided by a dining area (control).” That experiment would not only be needless, but also cause avoidable delays in fixing the issue across all restaurants in the chain.
In the case of the startup with the SaaS product, the situation was similar. The time required to set up a valid A/B test would only have increased the frustration of the customers waiting for improvements. Among the subset of users we interviewed and the employees who knew the needs of the target audience well, there was full agreement that the proposed change would dramatically improve the user experience. It took some effort to convince the leadership team that moving forward without a split test was the proper response, but in the end a quick improvement in customer satisfaction and renewal metrics spoke for itself.
Why split tests are unlikely to create a game-changer
Imagine that you’re in charge of improving user adoption of a dashboard offered as an add-on to customers of a SaaS product, with the goal of increasing conversion after a free trial.
If you suggest a split test of bar charts vs. gauges to improve chart readability, your boringly rational, incremental idea is likely to be approved without question. Suggest instead experimenting with a no-charts, narrative-based analytics tool that tells a story with the data, and your attempt to “think outside the box” might put you through an endless approval process just to get the idea tested.
The sad reality is that teams fixated on split testing often get stuck in a cycle of incremental changes that, at most, yield modest returns on the investment. A/B tests—or A/B/C/D experiments where you test multiple versions of a feature, price, headline, color, and so on—are not a panacea. If they were, applications like Google+, a product made by a company with plenty of A/B testing expertise and that welcomed 90 million users in its first year, would still be around rather than being featured in product graveyard collections.
Smart companies understand that true innovation happens outside the usual assumptions and conventional thinking. At times, as advertising executive Rory Sutherland likes to say, we need to “rewrite the brief.”
How to improve your experimentation process
Two key tactics can help organizations improve how they think about experimentation to avoid the pitfalls of split testing.
First, make sure you are working from first principles.
Radically superior solutions are rarely achieved by testing slight variations on the same theme. To quote from a Farnam Street Blog post that explains how first principles work,
When we take what already exists and improve on it, we are in the shadow of others. It’s only when we step back, ask ourselves what’s possible, and cut through the flawed analogies that we see what is possible.
For instance, in one of my projects, following nurses on the job to see how they used the software I was working on helped immensely in understanding the “real job to be done” by the product. This approach allowed us to “rewrite the brief” and consider alternative solutions: what if instead of requiring nurses to use a writing instrument to record patient notes on an iPad (as every competitor product did), we provided a voice-to-text functionality that automatically generated the written notes as they talked to patients?
Exploring multiple perspectives through first principles helps to expand the number of alternatives under consideration and mitigates the risk of getting stuck in a “local optimum”. In that project, rather than merely achieving incremental improvements by testing the efficacy of different writing tools via A/B experiments, we were able to envision what turned out to be a better solution: a voice-to-text feature that took us to product-market fit much faster.
Second, avoid the tyranny of the spreadsheet.
On one hand, we should never settle for a messy innovation process that skips experimentation altogether. On the other hand, it’s often possible to build and validate our knowledge of the problem and solution spaces without extensive data collection or endless A/B testing that, in practice, might simply create an unwarranted patina of scientific credibility.
As I’ve highlighted in a previous post,
Quantitative data is not a panacea. In two decades working for all kinds of businesses, from Fortune 500 companies to tiny startups, I’ve witnessed time and again successes that couldn’t be projected on a numeric spreadsheet. To strike gold, sometimes we need to rely on curiosity and qualitative insights extracted from talking to and observing our target audience.
For example, many data-driven companies fail to recognize that the more variants you add to an experiment, the easier it is to find support for some spurious, self-serving narrative. If you are comparing results across multiple treatment groups, you are asking multiple questions: Is A different from B? Is B different from C? Is A different from C? And with each question, you increase the chance of being “fooled by randomness”: the more comparisons you run, the greater the probability that something will emerge as “significant” just by chance.
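A quick back-of-the-envelope calculation shows how fast this compounds. Assuming each pairwise comparison is tested independently at a 5% significance level (a simplifying assumption for illustration), the chance of at least one spurious “win” grows like this:

```python
# Probability of at least one false positive across k independent
# comparisons, each tested at alpha = 0.05 (illustrative assumption).
alpha = 0.05
for k in (1, 3, 6, 10, 20):
    p_any_false_positive = 1 - (1 - alpha) ** k
    print(f"{k:>2} comparisons -> {p_any_false_positive:.0%} chance of a spurious win")
# With a single comparison the risk is 5%; with 20 comparisons it climbs to roughly 64%.
```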
Rory Sutherland points out in his book Alchemy: The Dark Art and Curious Science of Creating Magic in Brands, Business, and Life that for all we obsess about scientific methods, “it’s far more common for a mixture of luck, experimentation and instinct to provide the decisive breakthrough; reason only comes into play afterward.”
. . .
It pays to remember that oftentimes all the data we need to develop breakthrough products and services can be acquired from small experiments that validate our knowledge of the problem to be solved and the criteria our customers use to measure value. That combination of small experiments may create a roadmap for innovation and growth that’s far superior to the conventional data-driven, A/B-test-focused model that demands huge sample sizes to achieve statistical significance.
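As a rough illustration of why that conventional model gets expensive, here’s a hedged power calculation; the baseline rate, lift, and power target are assumed numbers, not figures from any real product:

```python
# Sample size needed per variant to detect a small incremental lift
# (4.0% -> 4.5% trial-to-paid conversion) at alpha = 0.05 with 80% power.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect = proportion_effectsize(0.045, 0.040)   # Cohen's h for the two rates
n_per_variant = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(f"~{n_per_variant:,.0f} users per variant")   # well over ten thousand users
```

Detecting a half-point lift already requires more than ten thousand users per arm; smaller lifts or additional variants push the requirement far higher.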
Sure, incrementally better decisions add up to a lot of value over time (probably), but maybe we’re just stuck in a local optimum and getting many small changes right will never get us to where we want to go. A modification of the famous Henry Ford quote kind of works here: you can’t A/B test your way from selling horses to selling cars. And a corollary: if you’re testing a horse against a car, you definitely don’t need an A/B test.
— Locally Optimal from Causal Engineering