## A Real World Example of A/B Website Testing

A/B testing has gained a lot of interest in recent years as a practical method for improving the results from websites.  My business partner and I produced a mobile application, RecallCheck, that relied on a database we built from FDA and USDA websites.  During our project, the FDA introduced a new website for reporting food-related issues, called the Reportable Food Registry.  For our purposes (using a mobile phone to scan bar codes), the result we cared most about was the quality and quantity of UPC codes.  Below is a detailed statistical analysis, before and after the FDA made its changes, of the quantity of UPC codes included in published recall notices.  While our interest was the UPC codes, the same process can be applied to any change on a website.

For the impatient, let me save you reading everything and say: there were fewer UPC codes.

Our investigation began when we plotted the average number of UPCs per notice. While the decline from the beginning of the year might have been the result of the large recalls dying off, we were surprised that the number remained low after September.  That month, we changed our practice and started contacting companies that didn’t include UPCs in the recall notice when we created a database entry.  Our expectation was that the average number of UPCs in notices would rise as a result of our efforts.  But just looking at the chart didn’t allow us to draw any conclusions, since the change could have been the result of either normal variance or the waning of the large recalls.  Understanding whether September 2009 represented any sort of change would require the application of statistics.

The first question we applied statistical tools to was: is there a difference in the number of UPCs per notice before and after September 2009?  To answer this question we used Analysis of Variance (ANOVA).  For the analysis, we used a null hypothesis where the means of the before and after groups were equal, and an alternative hypothesis stating they were not equal.  In traditional statistical notation this is stated as:

H0: μ_before = μ_after
H1: μ_before ≠ μ_after

To perform the calculations, we used the FOSS spreadsheet Gnumeric, because it includes a number of statistical tools not found in other spreadsheets.

Anova: Single Factor

SUMMARY

| Groups | Count | Sum   | Average | Variance |
|--------|------:|------:|--------:|---------:|
| Before |  5.00 | 15.25 |    3.05 |     0.85 |
| After  |  4.00 |  7.05 |    1.76 |     0.98 |

ANOVA

| Source of Variation |    SS |   df |   MS |    F | P-value | F critical |
|---------------------|------:|-----:|-----:|-----:|--------:|-----------:|
| Between Groups      |  3.68 | 1.00 | 3.68 | 4.07 |    0.08 |       3.59 |
| Within Groups       |  6.34 | 7.00 | 0.91 |      |         |            |
| Total               | 10.02 | 8.00 |      |      |         |            |
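As a sanity check on the spreadsheet output, the F statistic can be recomputed directly from the summary statistics in the table above (counts, means, and variances).  The sketch below does this in Python, using SciPy only to look up the critical value; the numbers are taken from the table, and the variable names are our own.

```python
from scipy.stats import f

# Summary statistics from the ANOVA table above
n1, mean1, var1 = 5, 3.05, 0.85      # Before group
n2, mean2, var2 = 4, 1.7625, 0.98    # After group (7.05 / 4)

grand_mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)

# Between-groups and within-groups sums of squares
ss_between = n1 * (mean1 - grand_mean) ** 2 + n2 * (mean2 - grand_mean) ** 2
ss_within = (n1 - 1) * var1 + (n2 - 1) * var2

df_between = 1            # number of groups minus 1
df_within = n1 + n2 - 2   # 7

F = (ss_between / df_between) / (ss_within / df_within)
f_crit = f.ppf(0.90, df_between, df_within)  # critical value at 90% confidence

print(round(F, 2), round(f_crit, 2))  # matches the table up to rounding
```

Running this reproduces F ≈ 4.07 and an F critical of ≈ 3.59, matching the Gnumeric output up to rounding.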

To evaluate whether the ANOVA result is statistically significant, an F-test is used.  In traditional statistical language, the decision rule is: reject the null hypothesis if F > F critical.  If the left-hand side is greater than the right-hand side, we conclude that there is a difference between the before and after groups (known in statistical parlance as “rejecting the null hypothesis”).  Using the table above, we can see that F equals 4.07 and the critical value of F (the right-hand side) equals 3.59.  Therefore, we can conclude that at a 90% level of confidence, the null hypothesis can be rejected.

Unless you’ve taken statistics, that conclusion isn’t terribly useful.  It means there is a statistically meaningful difference (at the lower 90% confidence level) between the before and after groups.  It doesn’t tell us whether the before group is larger or smaller, only that there is a difference.  Determining whether the before group is larger or smaller requires another statistical test.

To determine whether the before-September numbers are larger than the after numbers, we used the test statistic for unpaired samples with unequal variances (Welch’s t-test).  In this case, the null hypothesis was that the before and after means are the same, and the alternative hypothesis was that the before mean is larger than the after mean.  In statistical notation, this is expressed as:

H0: μ_before = μ_after
H1: μ_before > μ_after

The calculations from Gnumeric were:

|                              | Before | After |
|------------------------------|-------:|------:|
| Mean                         |   3.05 |  1.76 |
| Variance                     |   0.85 |  0.98 |
| Observations                 |   5.00 |  4.00 |
| Hypothesized Mean Difference |   0.00 |       |
| Observed Mean Difference     |   1.29 |       |
| df                           |   6.34 |       |
| t Stat                       |   2.00 |       |
| P (T<=t) one-tail            |   0.04 |       |
| t Critical one-tail          |   1.92 |       |
| P (T<=t) two-tail            |   0.09 |       |
| t Critical two-tail          |   2.42 |       |
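The unequal-variance (Welch) t statistic can likewise be recomputed from just the summary numbers in the table, without the raw data.  The sketch below assumes those summary values and uses SciPy for the one-tailed critical value; the Welch–Satterthwaite formula yields the non-integer degrees of freedom shown in the table.

```python
import math
from scipy.stats import t

# Summary statistics from the table above
n1, mean1, var1 = 5, 3.05, 0.85      # Before
n2, mean2, var2 = 4, 1.7625, 0.98    # After (7.05 / 4)

# Welch t statistic: difference in means over the combined standard error
se = math.sqrt(var1 / n1 + var2 / n2)
t_stat = (mean1 - mean2) / se

# Welch-Satterthwaite degrees of freedom (non-integer, ~6.3)
df = (var1 / n1 + var2 / n2) ** 2 / (
    (var1 / n1) ** 2 / (n1 - 1) + (var2 / n2) ** 2 / (n2 - 1)
)

t_crit = t.ppf(0.95, df)  # one-tailed critical value

print(round(t_stat, 2), round(t_crit, 2))  # close to the table, up to rounding
```

This reproduces t ≈ 2.00 and a one-tailed critical value of ≈ 1.92–1.93, in line with the Gnumeric output.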

To interpret the results, we used the one-tailed decision rule: reject the null hypothesis if t > t critical.  Looking at the table above, we can see that t (the left-hand side) was 2.00.  The right-hand side, also known as the critical value of t, was 1.92.  Therefore, we can conclude that the null hypothesis should be rejected.  In other words, there were more UPCs per notice, on average, before September than after.

## What Does it Mean?

So, the statistics tell us (despite the low number of examples) that after September 2009 the average number of UPCs per notice dropped, despite the fact that we were making more effort to include UPCs.  Looking at potential reasons, we noticed that this was the same time the FDA cut over to their new reporting system.  After looking through the new system, we found that it was only possible to enter one UPC per incident report, as seen in this image: To use the example above, if there are multiple packages with different UPC codes (8 count, 12 count, 32 count, etc.), there is only space for one UPC, even though there might be many different UPC codes for “The Best Cookies Ever.”  This means that someone entering a report would either have to leave it with one UPC (or maybe none, since they aren’t required), or complete the very long and detailed form over and over again to account for the different UPCs.

Our conclusion is that the new form drove down the average number of UPCs reported per recall.