Online Metrics You Can Take To The Bank
Consider this scenario: Your client (or boss) wants to determine whether one version of a display ad will perform better than another. You build the two ads, holding everything constant except the element you want to test, and traffic them into one (or more) placements in your ad server, rotating 50/50 against each other. After a while, you check the results and... voila! Version A has a CTR 25% higher than Version B.
What's your immediate next step?
1) Declare Version A the winner and roll it out to more placements.
2) Let the test keep running to see if you can get a higher lift than 25%, because you think that bigger gains are required to pronounce a true winner.
Let's take each of these answers in turn and discuss it in more detail.
Declare Version A the winner
A 25% lift sounds impressive, but how do you know that this result is reliable? Many advertisers make decisions using the fewest impressions possible, which is a worthy goal because no one wants to waste impressions. However, when your CTR is low (say 0.04%) and your sample sizes are small, your results can carry a large amount of error, meaning there is little certainty that the performance you are seeing will continue to hold true.
Hold the 50/50 rotation to get a higher lift
Going for even better gains from your tests is admirable, but is it necessary? In the absence of better analysis, many believe that you need to look for a minimum amount of improvement to determine a winner. The reality is, if you have a high level of confidence (statistical confidence, that is), the exact amount of lift is not as important. As we'll see later in this article, relatively small gains in performance in a test environment can mean huge gains when rolled out to the entire media buy.
How do you know that the 25% increase in performance you've seen with Version A will hold true over time? Did you run enough impressions? Is percentage lift even the appropriate metric to use?
If you answered "neither" to the original question, give yourself a pat on the back. Given the information in the example, you don't know enough to determine whether your results are reliable. You need to understand more about the numbers before moving forward.
Here are two examples to illustrate how a smaller percentage difference between two versions of an ad is actually a more reliable prediction of future performance. (To keep everyone's interest, I won't go into the formulas used for the calculations.)
Let's say you gave the test a total of 400,000 impressions (200,000 for each version). Version A got 100 clicks (CTR = 0.05%) and Version B got 80 clicks (0.04%). Version A is performing 25% better than Version B. However:
The overall test has an error rate of 15%. This is too high to be confident that you have enough of a sample size to get a reliable read on the results. You should look for an error rate of 10% or less.
There is not a statistically significant difference between the performance of these two versions. Our confidence level that Version A will always beat Version B is only 86%; 90% is the minimum confidence you should use, and ideally you should look for 95% to declare a winner.
Now let's say you gave the same test 1,500,000 impressions total (750,000 for each version). Version A got 350 clicks (CTR = 0.047%) and Version B got 300 clicks (CTR = 0.04%). Version A is only performing 17% better than Version B. Doesn't sound very impressive, does it? However:
Because of the higher impression levels, the test has an error rate of only 8%, low enough to be sure that we have a big enough sample size to predict results.
Although the difference between the two versions is only 17%, you are 95% confident that Version A will beat Version B. That's only a 5% chance you will be wrong. Not bad odds.
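The article skips the formulas, but for the curious, here is one plausible way to reproduce the figures in both scenarios. This is a sketch under two assumptions: that "error rate" means the 95% margin of error on the pooled CTR, expressed relative to that CTR, and that "confidence" comes from a two-tailed two-proportion z-test. The function and variable names are mine, not from the article.

```python
import math

def normal_cdf(z):
    # Standard normal cumulative distribution, via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ab_test_summary(clicks_a, imps_a, clicks_b, imps_b):
    """Return (lift, error_rate, confidence) for a two-version CTR test."""
    ctr_a = clicks_a / imps_a
    ctr_b = clicks_b / imps_b
    lift = ctr_a / ctr_b - 1.0

    # Pooled CTR across both versions
    pooled = (clicks_a + clicks_b) / (imps_a + imps_b)

    # "Error rate" (assumption): 95% margin of error on the pooled CTR,
    # expressed relative to that CTR
    se_pooled = math.sqrt(pooled * (1.0 - pooled) / (imps_a + imps_b))
    error_rate = 1.96 * se_pooled / pooled

    # Confidence that the two versions truly differ:
    # two-proportion z-test, two-tailed
    se_diff = math.sqrt(pooled * (1.0 - pooled) * (1.0 / imps_a + 1.0 / imps_b))
    z = (ctr_a - ctr_b) / se_diff
    confidence = 1.0 - 2.0 * (1.0 - normal_cdf(abs(z)))
    return lift, error_rate, confidence

# Scenario 1: 200,000 impressions per version
lift, err, conf = ab_test_summary(100, 200_000, 80, 200_000)
# lift ~= 25%, error rate ~= 15%, confidence ~= 86%

# Scenario 2: 750,000 impressions per version
lift, err, conf = ab_test_summary(350, 750_000, 300, 750_000)
# lift ~= 17%, error rate ~= 8%, confidence ~= 95%
```

Note how the larger sample in Scenario 2 shrinks both the error rate and the uncertainty, even though the lift itself is smaller.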
These scenarios become much more complex when testing multiple parts of an ad at the same time. Without a lot of careful coding of creative and manual data collection, it is nearly impossible in the ad-server environment to get the information needed to evaluate a test that involves more than two or three versions. Imagine what would happen if you wanted to test and optimize 9, 27, or even more versions.
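To make the combinatorics concrete, here is a quick sketch (the creative elements are hypothetical) showing how fast the version count grows when you test several parts of an ad at once:

```python
from itertools import product

# Hypothetical creative elements: 3 headlines, 3 images, 3 calls-to-action
headlines = ["headline_1", "headline_2", "headline_3"]
images = ["image_1", "image_2", "image_3"]
ctas = ["cta_1", "cta_2", "cta_3"]

# Every combination is a distinct version that must be trafficked,
# tracked, and compared against all the others
versions = list(product(headlines, images, ctas))
print(len(versions))  # prints 27
```

Three elements with three variants each already yields 27 versions, and each one needs enough impressions for a statistically confident read, which is exactly why multivariate tests strain the ad-server workflow.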
Direct marketers have used the calculations illustrated above for decades, because they had to be as sure as possible that their creative would deliver results; making the wrong decision about a direct mail piece can be very costly. Online is undoubtedly more nimble, and it is typically less costly and time-consuming to get new messages into market. But does that mean we online marketers should be less careful about measuring and predicting results? I think not.
Therefore, I'd like to add something to David Ogilvy's quote: "Make sure you can take your results to the bank."