How to run an experiment
I’m going to be writing a lot more about design’s role in leveraged power dynamics over the next few months. What questions do you have that you’d like me to answer? Nothing is too small or tactical!
Today we’ll talk about the third and final pillar of value-based design: experimentation. Experimentation is the application of measurement to a specific design decision in order to de-risk its implementation for your business’s customers.
The third pillar of value-based design measures the economic impact of design decisions through experimentation. The reasons for this are twofold:
- All design is speculative until it’s put in front of paying customers. Experimentation allows you to understand the specific impact that a design decision will have on the business.
- Experimentation is a hedge on risk. Keep what works, throw away what doesn’t, and grow the business accordingly.
Experimentation is an extension of the scientific method to the design process. First, you state a hypothesis: that a specific change (the design decision) will improve a specific metric (conversions, ARPU, and so on) by a specific magnitude (5%, say). Then, you show the control (the original design) and the variant (the new design) to equal proportions of your customers. Finally, you measure which performs better, use your findings as research to inform your future design direction, and repeat with a new change.
Done right, experimentation allows the value-based designer to surrender their ego to the needs, desires, and motivations of the business’s customers – which ultimately puts the customers in control.
Experimentation is both powerful and easy to screw up. That’s why it makes the most sense to hire a value-based designer who can soberly tell the team when & how to run experiments. We sometimes have to deliver bad news: that an idea has to be tested in the first place, or that you can’t peek at the test, or that an experiment favored the control. But we’re also frequently the last line of defense against you harming your own business.
First and foremost: all A/B tests are experiments, but not all experiments are A/B tests. One-off experiments are considerably higher-risk and more advanced, and I don’t recommend them for most applications. With that in mind, today we’ll talk about how to prioritize your design decisions and run an A/B test.
Prioritization
Every optimization program has a common thread: making sense of all of the design decisions you could test. Prioritization is the process of fairly scoring design decisions and sorting them in order of what you should experiment on. This allows individual egos to take a back seat, flattening the org chart and embracing rational decision-making in the process.
Every value-based designer has their own way of prioritizing, and many rely on established methodologies. Let’s take a look at a few of the most popular ones and weigh the strengths & weaknesses of each.
PIE
PIE is WiderFunnel’s prioritization framework. The three letters in this particular TLA correspond to each of the metrics you’ll be scoring: potential, importance, and ease.
Of the four frameworks we’re analyzing here, PIE is the most similar to the one that we use at Draft, at least on its face. Where it differs is in how each of the values is calculated – and how it sorts each hypothesis. The potential value is mostly grounded in research, and the importance value mostly connects to past tests. This is great if you run lots of experiments, and less great if you don’t already have an established experimentation program.
ICE
ICE stands for “impact, confidence, ease” – the three metrics used to evaluate new experiment ideas.
This seems like a great idea, but the actual calculation of each value is mostly left up to the reader. I’d only implement ICE if you have a clear sense of what goes into the calculation of each value. Make sure you write down your criteria to get your team on the same page before you start using this framework.
PXL
PXL is the framework that they use at ConversionXL. It weighs 10 different factors, and (usually) scores them a 1 or a 0 based on whether each factor is true or not.
PXL is by far the simplest of the frameworks we’re discussing here, and its questions are easy to answer. You can tell whether something is above the fold or not, for example. But it has a weird complication around research: you can end up with a relatively high-scoring idea that isn’t supported by research. At Draft, we don’t test anything that isn’t supported by research.
Draft Method
This one is ours, and it’s a little more involved than what you typically see at an agency. That’s because we made it primarily for our own use. We’ve open-sourced it because we’d love other consultancies to use it, but really it’s a nice-to-have for educating prospective and current clients.
In it, we break out 3 key metrics: feasibility, impact, and business alignment. Each of these is scored from 1 to 10, and then the three scores are summed up. We re-assess the whole experimentation queue quarterly.
This framework is great if you have the time and resources to devote at least one part-time person to optimization. It doesn’t work so well if your short-term business strategy keeps shifting, or if you lack optimization maturity throughout the organization.
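If it helps to see the arithmetic, here’s a minimal sketch of that scoring in PHP. The experiment names and scores are invented, but the method really is just three 1–10 scores, summed and sorted:

```php
<?php
// Hypothetical experiment queue: each idea gets three 1–10 scores.
$queue = [
    ['name' => 'Rewrite pricing page headline', 'feasibility' => 9, 'impact' => 6, 'alignment' => 8],
    ['name' => 'Simplify checkout to one step',  'feasibility' => 4, 'impact' => 9, 'alignment' => 9],
    ['name' => 'Add trust badges to cart',       'feasibility' => 8, 'impact' => 3, 'alignment' => 5],
];

// The total score is the simple sum of the three metrics.
foreach ($queue as &$idea) {
    $idea['score'] = $idea['feasibility'] + $idea['impact'] + $idea['alignment'];
}
unset($idea);

// Sort the queue so the highest-scoring idea gets tested first.
usort($queue, fn($a, $b) => $b['score'] <=> $a['score']);

foreach ($queue as $idea) {
    echo "{$idea['score']}  {$idea['name']}\n";
}
```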
Running an A/B test
Now comes the fun part: A/B testing!
This lesson is long, so you might want to print or save it for later. But after reading it, you should go from not knowing that A/B testing is even a thing to getting the hang of a new framework.
First, get and set up a framework
You’ll first want to sign up for an A/B testing framework. These allow you to run and analyze new tests. VWO & Convert are our two favorites. Then, install the framework using its instructions. Most of these involve adding a JavaScript snippet to the <head> tag of every page on your site.
The easiest way to run a test: your framework’s editor
The core of most A/B testing frameworks is a WYSIWYG editor. In there, you click on certain elements, change what needs to be changed in a given variant, and the framework’s tracking code uses JavaScript DOM-editing fanciness to deliver the variant to the right proportion of your customers.
This is terrific for smaller tests on coloration, copy, and imagery. But what happens when you want to test a more radical rework? Or if you have some more dynamic, complicated content? That’s when you have to get others involved – and yes, you’ll have to get your hands dirty with code.
Working with code is a core part of any solid A/B testing strategy. You can’t just change calls to action and expect them to land every single time. You will need to work with code for most of your tests.
The context
For a very long time, most testing efforts were hard-coded and self-hosted. People would create their own frameworks that handled all the work of delivery, tracking, and reporting. For example, take a look at A/Bingo, a Ruby framework for running your own tests. That’s one of the more sophisticated open source ones, should you still wish to roll your own.
Around the same time, Google Website Optimizer, now rolled under the Analytics tarp, did roughly the same thing – but it didn’t have the WYSIWYG component, it was hard to analyze (this is Google Analytics, after all, where nothing is allowed to be easy or convenient), and most of your experiments still needed to be server-side.
We’ve come a long way since then. You can click, change text, add some goals, and deploy your test. If it weren’t for the convenience provided by contemporary WYSIWYG frameworks, value-based experimentation wouldn’t exist.
Worst case, you can leave the analysis up to the big framework providers now. You don’t have to sweat the statistical analysis (much) or wring your hands over whether a variant is actually generating more revenue for you. That’s extremely good – not only for your own testing efforts, but for getting more people in the fold as well.
Yet this is still not enough.
When the framework alone isn’t enough
When is your framework’s WYSIWYG editor not enough?
Any dynamic content. Testing pricing? You’ll want to haul a developer in. Testing something that changes from region to region? That won’t work, either. Frameworks are usually only good at handling static content – unless you plan to remove the control’s element entirely and code it by hand, which could trigger any of a litany of JavaScript bugs.
Big, fancy JavaScript-y things. Got a carousel that dances across the screen for no good reason? You’re in for a world of hurt if you try to rewrite all the JavaScript to turn that off directly in your framework. It’s not impossible, but the development work will be substantially easier if you build your own solution.
Multi-page tests. Do you want to test changing the name of your product from, say, “Amazon Echo” to “Amazon Badonkadonk”? I mean, it’s your company, Jeff, but it seems like you might have “Amazon Echo” listed in many different places, including in images and <title> tags. Frameworks are terrific at tracking across sessions, but unless you have the whole site set up to declare “Amazon Echo” as a variable, and you’re ready to swap it out at a moment’s notice, a single line of injected JavaScript won’t do the trick – you’ll probably want something on the back end to change it out everywhere.
Massive site reworks. Did you just redesign your site, and you’re looking to measure the new thing’s performance against the old thing? Yeah, I would strongly caution against using a WYSIWYG editor for such an undertaking.
This is a lot, right? And some of it seems pretty critical. Bold reworks? Pricing changes? That’s all testing 101 for most of us. My take: if you just want to dabble in testing with the lowest-impact changes, by all means rely solely on a framework. But if you really want to build a solid practice that will endure no matter what you do to your site, you’re going to have to work with code – or enlist the help of someone who can do it for you.
Redirecting the page: the 20% solution
Your framework should let you create a split page test, which shunts the right proportion of visitors to a whole separate variant page. Put another way, if you run control on /home and variant on /home/foo, 50% of your visitors should be automatically redirected to /home/foo, and measured accordingly. This allows you to bypass the WYSIWYG editor for a home-rolled solution.
You can always create a new page and forward people there – and that allows you to make whatever changes you’d like. This only works for tests on a single page, though: if you’re making changes that affect any other pages on your site (say, with pricing), you’ll want to go with something a little more involved. Fortunately, most of the work is in your first-time setup, and you’ll be able to reuse it in your future testing efforts.
Creating a whole new site: the big rework solution
For situations where you’re taking the wrecking ball to your whole site and evaluating a new version, you can create a whole new set of static pages on the root of the site. (Rails allows you to do this in the /public/ folder.) Consider making the variant home page /welcome/index.html alongside your existing /home/index.html – and then create synonyms for the other pages: /plans/ instead of /pricing/, /join/ instead of /signup/, etc.
Then, use your split page test to send your variant’s traffic to /welcome/ instead of /home/. This is particularly good for SaaS businesses whose funnels are typically three pages (home, pricing, sign up); for ecommerce sites that involve many different pages and a lot of dynamic functionality, you may want to deploy to a whole different server, and redirect people to a subdomain like (say) shop.example.com instead of example.com.
GET queries: the 80% solution
For most substantial tests, I recommend clients set up a solution that redirects the page using a GET query.
For those who don’t know, GET queries are when you have ?foo=1 in your URL, which allows a page to grab variables for its own dynamic processing. For example, on most New York Times articles of a sufficient length, appending ?pagewanted=all to the end of a URL allows you to view the whole thing on one page. Additional GET queries are delimited with an ampersand, so ?pagewanted=all&foo=1 sets two variables: pagewanted to all and foo to 1. You can then pull GET queries in using your dynamic language of choice.
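If your site runs on PHP, for instance, pulling those values in is a one-liner apiece. A quick sketch, using the pagewanted and foo examples from above:

```php
<?php
// For a URL like /article.php?pagewanted=all&foo=1
$pagewanted = $_GET['pagewanted'] ?? 'single';            // "all", or a default if absent
$foo        = isset($_GET['foo']) ? (int) $_GET['foo'] : 0; // 1, or 0 if absent

echo "pagewanted={$pagewanted}, foo={$foo}\n";
```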
You’ll be employing a split page test here as well: just redirect your variant’s traffic to something like http://example.com/?v=1. Then, index.php pulls in the GET variable, determines that it’s been set to 1, and serves a variant (or changes the appropriate content) instead.
```php
<?php
// index.php decides which template to serve, based on the ?v= GET query.
$variant = isset($_GET['v']) ? (int) $_GET['v'] : 0;

if ($variant === 1) {
    include 'index-variant.php';
    // index-variant.php also appends ?v=1 to every link that points to a local site page
} else {
    include 'index-control.php';
}
```
Two tactics are important here:
I always maintain two different pages that allow me to vet what’s on control and what’s on variant. That way, I don’t need to go mucking around in the actual code. I can always use variables to switch stuff like pricing out on the pages themselves.
See the note in that code about appending ?v=1 to every local link? That’s so we can keep people on a variant through the whole conversion funnel. You don’t just want people to hit index.php?v=1; you want people to go to /pricing/?v=1, /signup/?v=1, etc.
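Here’s a rough sketch of one way to handle that appending on the server side, again assuming PHP. The helper name is mine, and it assumes your templates call it whenever they print an internal link:

```php
<?php
// Hypothetical helper: append the current variant flag to internal links
// so visitors stay in the variant for the whole funnel.
function variant_url(string $path): string
{
    $variant = isset($_GET['v']) ? (int) $_GET['v'] : 0;
    if ($variant !== 1) {
        return $path; // control: leave links untouched
    }
    $separator = (strpos($path, '?') === false) ? '?' : '&';
    return $path . $separator . 'v=1';
}

// In a template: echo '<a href="' . variant_url('/pricing/') . '">Pricing</a>';
```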
What if the customer notices ?v=1 being appended, and deletes it in a refresh for some outlandish reason? Your framework will set a cookie that persists across the session, recognize that the visitor is supposed to be receiving a variant, and bring ?v=1 back – like a very persistent cockroach of URL modification.
What if the customer doesn’t have cookies enabled? They won’t get the test in the first place, and they’ll always be served control. Their decision will keep them from being counted in our test’s final analysis as well. (This goes for any tests you run, ever – not just these crazy ones.)
This should cover most things
I’m sure that you’re reading this and thinking of some situation where everything would catastrophically break for you. And that may be true for your context. But a little upfront reworking will go a long way in testing – and it’ll allow you considerably more freedom from the constraints of your testing framework’s WYSIWYG editor. Doing this work is critical to ensuring that you have a durable, long-term testing strategy that makes you money into perpetuity.
Note, in each of these cases, how the testing framework is never fully cut out of the picture. You’re still using it to redirect traffic and, crucially, gather insights into what your customers are doing. You’re still using it to plan and call a test. But you’re offloading the variant generation onto your own plate – which allows you near-infinite latitude in what you can test.
You don’t want to come up with a really promising test idea, run into the limitations of your framework, and then get into a political battle about what to do next. It will be enormously frustrating for you.
It’s far easier to have clarity in how to proceed – and creating the tests on your own allows you to proceed with confidence. Otherwise, you’ll probably just deploy the variant to production without testing it. And that’s not really the point of having a testing strategy, now is it?
Experiment goals
Let’s say you have a clear sense of what to change in a test. How do you measure its impact?
This should be obvious. It’s not. Let’s dive into how to configure the best goals possible for a test, and how to pay attention to them when it’s time to call the final result.
First, configure revenue
Your tests should exist to generate revenue for your business – otherwise, there’s no reason for you to run a test. Revenue is the first and most important goal for any A/B test.
Revenue tracking exists in all major A/B testing frameworks. On the confirmation page for your purchase, you should add revenue tracking to your snippet.
When you’re in your framework, then, add revenue tracking as your primary goal, and set the goal to the correct URL.
Next, confirming the sale
Just as you have to install revenue tracking on the confirmation page, you should also create a second goal that tracks views of this page. This allows you to confirm that you’re getting enough qualified, wallet-out traffic.
The page should be unique to the customer’s experience of the site. In a SaaS app, your thank you page shouldn’t be the same URL as the product’s dashboard. In an ecommerce site, your thank you page shouldn’t be the same as the order status page.
Then, tracking every step in the funnel
You should configure goals that match every step in your funnel, and take them a lot less seriously than the goals that actually close a sale. This allows you to get a whole portrait of your funnel, which is valuable for assessing whether there are any significant drop-off points which may be best addressed with further testing.
Finally, engagement
Both VWO & Convert have "engagement" as a goal, to show the proportion of customers that actually engage with a page. This is basically a freebie, so add it just to make sure the test is generating data.
I don’t think engagement is a valuable metric otherwise – not even on blogs or news sites. If you must measure your readers’ attention – which, to be clear, is very rare in A/B testing – it’s much better for you to come up with more granular metrics.
How to configure a goal
A goal can be configured in a variety of ways. You can also add more than one URL, in case you need to use multiple wildcards in aggregate.
CSS Selectors
This tracks clicks and taps on any CSS selector you desire. Here’s a huge reference of them. These are great for:
- Clicks to your site’s navigation.
- Clicks to your site’s footer.
- Clicks anywhere on a pricing table, measured against clicks on actual CTA buttons within that table.
- Clicks to a primary CTA button that’s scattered in multiple places on your home page.
- Clicks to an add to cart button, when there are multiple ways to check out (think PayPal, Amazon, on-site, etc).
- Attempts to enter a search query, versus submissions of that query.
- Focuses on a form field, such as to enter one’s email address. (This is great for fixing validation bugs!)
Overall, CSS selector goals are great for addressing one-off issues with the usability of the site – and they’re especially great for validating the insights that you get from your heat maps.
Form submissions
Your framework should be able to track form submissions to a specific URL as well. You’ll want to enter the URL the form submits to (its action attribute), if it has one.
If it doesn’t have a direct target, and you don’t have any other forms that someone can use at this point in the funnel, use a wildcard with an asterisk so it can capture all form submissions everywhere.
Page views & link clicks
This is the most common goal by far: tracking views of a specific page. And it comes in two forms:
Page views track hits to a specific page, regardless of source.
Link clicks track clicks on links to a specific page from the test page you’re measuring.
I use each of these goals to measure each step in a funnel, usually with page view goals. Why? Ultimately, it doesn’t matter how much tire-kicking the person is doing, as long as they convert. And I can get a sense for the quantity of tire-kicking by running a heat map.
You can measure views to a specific page by indicating the exact URL ("URL matches", in the string-matching queries below). You can measure views to many pages by using other string-matching queries:
"Matches pattern" uses wild-card matching to specify a page. For example, all URLs with blah can be specified with *blah*. All URLs with blah and bleh can be specified with *bl?h*.
"Contains" simply returns all URLs containing that string. As with the blah example above, simply typing blah will do the same thing.
"Starts with" and "ends with" are what you think – but you need to begin "starts with" using the correct trailing protocol. "https://example.com" will only work on "https://example.com", not "https://www.example.com" or "http://example.com".
"URL matches regex" uses regular expressions to match your URLs, which is terrific for highly complex string matching. Rather than spend ten thousand words teaching you about regular expressions, I refer you to an authoritative course.
What if your goal isn’t firing?
You might be using the wrong pattern for your URLs, or the target URL may not have your framework’s tracking code.
In the former case, you need to change the goal so it accurately reflects what’s happening on the site – and you probably need to flush your test’s data and start over.
In the latter case, you need to install the tracking code on the corresponding test pages. Make sure it appears in the right place on each page, especially keeping in mind that many apps sideload code.
Measure as carefully as you test
Don’t forget that it’s quite possible to over-measure – and act on the wrong insights.
First, remember that revenue trumps all other goals. You are testing to generate revenue, always.
Next, revenue-related goals trump other goals. Hits to a confirmation page correspond directly to revenue, as long as you aren’t providing something for free.
Finally, rank behavior-based goals last. Clicks on navigation, engagement, and the like are sometimes useful for framing design decisions, but consider them more like a part of your research process than an actual testing outcome.
Calling a winner
Now, we’re going to answer the biggest question that I get from our clients: how do we call a winning variant in an experiment?
That should be a clear-cut answer. It is not. I am going to spend a lot of words explaining why.
The final calculations
I use chi-squared tests in order to calculate statistical significance for my clients. This shows how confident we should be about the test – and how much of a lift is likely to occur once we roll the change out to everyone. You probably made chi-squared calculations if you ever took a college-level statistics class.
I only call variations if they show a statistically significant change at 95% or higher confidence. 99% is ideal, but we generally only get that with high traffic and conversion numbers.
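If you’re curious what that arithmetic looks like for a simple control-versus-variant comparison, here’s a minimal sketch: a 2×2 chi-squared statistic checked against the usual critical values. The numbers are invented, and your framework’s math will be more involved than this:

```php
<?php
// Invented example data: visitors and conversions for control and variant.
$control = ['conversions' => 130, 'visitors' => 5000];
$variant = ['conversions' => 170, 'visitors' => 5000];

// Build the 2x2 table: converted vs. not converted, per group.
$observed = [
    [$control['conversions'], $control['visitors'] - $control['conversions']],
    [$variant['conversions'], $variant['visitors'] - $variant['conversions']],
];

$total     = $control['visitors'] + $variant['visitors'];
$colTotals = [
    $observed[0][0] + $observed[1][0],
    $observed[0][1] + $observed[1][1],
];

// Chi-squared statistic: sum of (observed - expected)^2 / expected.
$chiSquared = 0.0;
foreach ($observed as $cells) {
    $rowTotal = array_sum($cells);
    foreach ($cells as $col => $observedCount) {
        $expected    = $rowTotal * $colTotals[$col] / $total;
        $chiSquared += ($observedCount - $expected) ** 2 / $expected;
    }
}

// Critical values for one degree of freedom.
$significantAt95 = $chiSquared >= 3.841;
$significantAt99 = $chiSquared >= 6.635;

printf("chi-squared = %.2f, 95%%: %s, 99%%: %s\n",
    $chiSquared,
    $significantAt95 ? 'yes' : 'no',
    $significantAt99 ? 'yes' : 'no');
```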
The reality
This is how experiments play out in an idealized state. It is also how you analyze tests if you roll your own framework and calculate everything by hand.
In the real world, you’re using a framework to make and analyze your changes. Each framework has its own proprietary system for analyzing tests – and they’re liable to continue changing. As a result, you’re probably working with a whole different set of calculations – and each product considers their methodology a trade secret, so you’re dumping everything into a black box.
It’s tempting to look at a test the day after it launches and think you increased sales by 420%. You did not. Frameworks often call variations prematurely – even though you don’t have enough traffic yet. Just ignore them, and don’t peek until you know you have enough visitors.
Frameworks also have a weird foible which pops an all-caps warning if you don’t run your test for a specific integer number of weeks. So, if you launch a test on a Monday and call it 17 days later on a Thursday, your framework will probably force you to run it until that next Monday. They do this in order to account for any variations that may occur over the weekend. In practice, though, I have found little behavioral difference in most of my clients’ sites over the weekend – just less traffic. And if you’re getting enough traffic to call something confidently, the number of days you run a test ultimately doesn’t matter: the number of hits does. Soldier through their crappy dialog and call the test anyway.
Calculating sample size
Testing is worthless without valid statistical significance for your findings – and you need enough qualified, wallet-out traffic to get there. If you want a lengthy mathematical explanation for this, here’s one.
What does this mean for you in practice? Before you run any test, you need to calculate the amount of people that should be visiting it. Fortunately, there’s an easy way to calculate your minimum traffic. Let’s go into how to do this!
Maximum timeframe
First off, you should be getting the minimum traffic for an A/B test within a month’s time. Why a month? Several reasons:
- It won’t be worth your organization’s time or resources to run tests so infrequently. You are unlikely to get a high ROI from A/B testing within a year of effort.
- One-off fluctuations in signups – whether from outreach campaigns, holidays, or other circumstances – are more likely to influence your test results.
- Your organization will be more likely to spend time fighting over the meaning of small variations in data. That is not a positive outcome of A/B testing.
- You will not be able to call tests unless they’re total home runs, for reasons I’ll describe below.
Sample size is calculated with two numbers:
- Your conversion rate. If you don’t have this already calculated, you should configure a goal for your "thank you" page in Google Analytics and calculate your conversion rate accordingly.
- The minimum detectable effect (or MDE) you want from the test, as a relative percentage of your conversion rate. This is subjective, and contingent on your hypothesis.
A note on minimum detectable effect
The lower the minimum detectable effect, the more visitors you need to call a test. Do you think that a new headline will double conversions? Great, your minimum detectable effect is 100%. Do you think it’ll move the needle less? Then your minimum detectable effect should be lower.
Put another way, if you want to be certain that a test causes a small lift in revenue-generating conversions – let’s say 5% – then you will need more traffic than a hypothesis that causes your conversions to double. This is because it’s easier to statistically call big winners than small winners. It also means that the less traffic you have, the fewer tests you’ll be able to call.
You should not reverse-engineer your minimum detectable effect from your current traffic levels. A test either fulfills your hypothesis or it doesn’t, and science is historically quite unkind to those who try to cheat statistics.
How to calculate sample size
I use Evan Miller’s sample size calculator for all of my clients. You throw your conversion rate and MDE in there, and set the level of confidence you want your test to reach.
I recommend at least 95% confidence for all tests. Why? Because anything less means you still have a high chance for a null result in practice. Lower confidence raises the chance that you’ll run a test, see a winner, roll it out, and still have it lose in the long run.
Let’s say your conversion rate is 3% and your hypothesis’s MDE is 10% – so you’re trying to run a test that conclusively lifts your conversion rate to 3.3%. Here’s an example of how I fill this form out.
Note that the resulting number there is per variation. Are you running a typical A/B test with a control and 1 variant? You’ll need to double the resulting number to get your true minimum traffic. Are you running a test with 3 variants? Quadruple the number. You get the idea. This can result in very high numbers very quickly.
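If you’d rather see the arithmetic than trust a form, here’s a rough sketch of the standard normal-approximation formula for sample size, using the 3% baseline and 10% MDE from the example above (with 80% power). Evan Miller’s calculator uses a more exact method, so expect its numbers to differ slightly:

```php
<?php
// Inputs from the example above: 3% baseline conversion rate,
// 10% relative minimum detectable effect, 95% confidence, 80% power.
$baseline = 0.03;
$mde      = 0.10;                    // relative lift we want to detect
$target   = $baseline * (1 + $mde);  // 3.3%

$zAlpha = 1.96;   // two-sided z-score for 95% confidence
$zBeta  = 0.8416; // z-score for 80% power

$pooled = ($baseline + $target) / 2;

// Standard two-proportion sample size formula, per variation.
$numerator = $zAlpha * sqrt(2 * $pooled * (1 - $pooled))
           + $zBeta  * sqrt($baseline * (1 - $baseline) + $target * (1 - $target));
$perVariation = ($numerator ** 2) / (($target - $baseline) ** 2);

// Remember to multiply by the number of groups, control included.
$groups       = 2;
$totalTraffic = (int) ceil($perVariation) * $groups;

printf("Per variation: %d, total (control + 1 variant): %d\n",
    (int) ceil($perVariation), $totalTraffic);
```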
If you see a number that’s clearly beyond the traffic you’d ever expect to get in a month, work on one-off optimizations to your funnel instead. Don’t A/B test. It’ll be a waste of your company’s time and resources. Testing isn’t how the struggling get good, it’s how the good get better.
This week, for paid members
- This week’s paid lesson provides a brief primer on leveraged power structures in design. This is essential for understanding how to sell work and get it shipped.
- Our design of the week presents the weirdest pre-release that we’ve ever seen. $1 for… something?
- And our fortnightly teardown is for spice brand Diaspora. We’ve done them before, but they subtly redesigned – how does it fare?
Want in? Join us now – named one of the best ecommerce communities going on the web.
Already a member? Log in here and take a look at what’s new.
Links
- One of my design pals just wrapped up a Fulbright in North Carolina, home to one of my two alma maters, and his multipartite summary of his time there brought back lots of memories and conveyed the messiness & glory of that place well.
- A brief read on the restructuring that will occur in design to restore how it is presented & received. You’re buying profitable thinking when you buy design. You are not buying the results of that design, although design is what creates the results, and the results matter. Holding this duality in your head as you buy design is absolutely vital to the ongoing practice. (Archive link.)
- Arranging things on top of things is a compositional challenge, but it brings about a lovely richness when it’s done well. More on how to attain it.
- The highly profitable practice of information architecture is, as always, under attack. It’s wild, since not only is information architecture one of the most important things you can nail on a store, it’s also something that countless usability tests prove out. This week’s teardown is also focused on low-quality information architecture. A trend, no? Perhaps a competitive advantage? More.
- Retention tactics, a major focus of value-based design.
- Notes on the dark patterns inherent in a popular design tool. Because design is a form of leveraged power, and because a focus on results will corrode the industry and destabilize power dynamics away from design, I believe this particular tool began the conclusion of the contemporary design industry. “But ‘collaboration!’”, you say. No.