Mathematical points for A/B testing

By using controlled experiments, such as A/B tests, we can evaluate ideas on the Web. Controlled experiments reveal the relationship between changes in the user's experience and the impact on their behaviour. With A/B testing, companies can increase their Return On Investment (ROI) by providing a better experience for the customers who visit their websites.

However, to quote an anonymous commentator: “The difference between theory and practice is larger in practice than the difference between theory and practice in theory”. The concept of A/B testing seems simple and easy to understand in theory. In practice, however, several important factors are rarely understood, so we discuss these limitations here. A/B testing is a multidisciplinary field with a broad range of aspects. In this article, we focus only on the aspects that matter for analysing data; the others are out of scope. We provide brief, practical notes aimed primarily at the business analyst who does not have much of a mathematics background.

Before explaining these important notes, we present some real case studies to demonstrate how companies can benefit from A/B testing. Some more examples can be found here. Please take a look, particularly if you have doubts about the potential of A/B testing to select the best variants of your website to increase conversions.

Case studies

• Doctor FootCare compared two versions of their website: one with the discount code (variant A) and one with the discount code removed (variant B). They noticed that variant B increased the conversion rate by 6.5% compared to variant A. With a discount code on the website, people without coupons started to think twice about whether they were paying too much.
• The Microsoft Office help page compared two variants of the website. In variant A, users were asked for feedback using two separate pop-ups: one for a Yes/No answer to the question of whether the article was helpful, and another one asking why. Microsoft replaced this widget with a 5-star rating and reported that the new version provides a better experience for users.
• MSN home page had a critical question about how many ads should be placed on the page. Using an A/B test gave them the insight to scrap the idea of adding more ads on their website.
• Amazon introduced a recommendation system (variant B) in 2006 and compared it with their older version (variant A). This comparison revealed an increase of 3% in Amazon’s revenue, which translated into several hundred million dollars.
• Marriott added a new online reservation form that resulted in an extra $30 million in bookings.
• Coach compared two variants for search engines. An A/B test helped Coach to find that the new variant was 200% more effective.

To help you develop a successful A/B test, we offer several practical lessons focusing on data gathering, design of experiments, minimum sample size and confidence intervals.

Data gathering

The most important step is mining the data. Firstly, we need to record several metrics for all the end users who visit the website during an experiment. By using event-triggered filtering, we restrict the analysis to only those users affected by the experiment. Rich and flawless data enable us to analyse the results using machine learning and data mining techniques. How the data is actually gathered is out of scope, because this article is aimed exclusively at data analysts, but it is important to check multiple data types. A common error is to reach the wrong conclusion by using the wrong data.

Design of experiments

Several scientific authors recommend testing one factor at a time. In our experience, this advice is too restrictive. Instead of traditional A/B tests, we recommend multivariable testing, which includes more than one factor in an experiment. Multivariable testing lets us test multiple factors in a short time. Moreover, we can measure the interaction between factors: the combined effect of two factors may differ from the sum of their two individual effects. In the statistical field of Design of Experiments, one major research area is minimising the number of user groups needed for a test. Some methods, such as fractional factorial designs and Taguchi methods, are superior in theory. In practice, however, they have some shortcomings, and new evidence suggests we do not need methods that require a huge amount of time and data.
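
To make the idea of an interaction between factors concrete, here is a minimal Python sketch for a test with two on/off factors (four user groups). The function name `factorial_effects` and all conversion rates are hypothetical and invented for illustration; they do not come from any of the case studies above.

```python
def factorial_effects(rates):
    """Split a 2x2 multivariable test into individual effects and an interaction.

    `rates` maps (factor_1_on, factor_2_on) to the conversion rate
    observed for the user group that saw that combination.
    """
    baseline = rates[(False, False)]
    effect_1 = rates[(True, False)] - baseline   # factor 1 alone
    effect_2 = rates[(False, True)] - baseline   # factor 2 alone
    combined = rates[(True, True)] - baseline    # both factors together
    # Interaction: how far the combined effect departs from the plain sum.
    interaction = combined - (effect_1 + effect_2)
    return effect_1, effect_2, combined, interaction


# Hypothetical conversion rates, invented for illustration only.
observed_rates = {
    (False, False): 0.100,  # neither change shown
    (True, False): 0.110,   # only factor 1 changed
    (False, True): 0.105,   # only factor 2 changed
    (True, True): 0.125,    # both changes shown
}

effect_1, effect_2, combined, interaction = factorial_effects(observed_rates)
print(f"factor 1 alone:  {effect_1:+.3f}")
print(f"factor 2 alone:  {effect_2:+.3f}")
print(f"both together:   {combined:+.3f}")
print(f"interaction:     {interaction:+.3f}")
```

A non-zero interaction means that two separate one-factor tests would not have predicted the result of changing both factors at once. In a real analysis the interaction should also be tested for statistical significance before acting on it.
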

Determine the minimum sample size

A/B testing is inherently a statistics-based process: we form a hypothesis about the relationship between two datasets and test whether that relationship is statistically significant. In this article, we do not elaborate on basic statistical concepts such as Type I error and statistical significance, Type II error and statistical power, confidence intervals and so on. Instead, we explain some simple formulas that can be used without an advanced mathematical background.

The first step in an A/B test is to calculate the minimum required sample size, which depends on the effect that we aim to detect and the variability of the metric that we measure. The following formula gives the required minimum sample size when the desired confidence level is 95% and the desired power is 80%:

$$n = \frac{16σ^2}{Δ^2}$$

In this formula, $n$ is the number of users in each variant, $σ^2$ is the variance of the measured metric, and $Δ$ is the amount of change that we wish to detect. However, some researchers suggest a more conservative formula for calculating the sample size, with 90% power:

$$n = (4rσ / Δ)^2$$

Here $r$ is the number of variants. For an A/B test, which has only two variants, $r$ is equal to 2.

So the variability of the metric affects the sample size. Consequently, if we do not have access to enough users for A/B testing, we can slightly change the question and define new metrics with less variability. Another factor that affects the sample size is the desired sensitivity: the smaller the change $Δ$ that we want to detect, the more users we need.

To evaluate whether a treatment group is different from a control group, a statistical test should be conducted. One test that can be used is a t-test:

$$t = \frac{\bar O_B - \bar O_A}{\hat σ_d}$$

where $\bar O_B$ and $\bar O_A$ are the estimated metric values (for example, the averages) for variants B and A, $\hat σ_d$ is the estimated standard deviation of the difference between the two metrics, and $t$ is the test result.

We compare the absolute value of $t$ with the threshold that corresponds to our confidence level (for 95%, the threshold is 1.96). If $|t|$ is larger than the threshold, we reject the Null Hypothesis (often referred to as $H_0$). In simple words, the Null Hypothesis states that there is no relationship (or difference) among the measured groups. We also assume throughout that the sample sizes are large enough for the data to be treated as Normally distributed.
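
The two sample-size formulas and the t-test in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a full analysis pipeline: the function names are our own, the input numbers are hypothetical, and $\hat σ_d$ is estimated with the usual standard error for the difference between two independent group averages, which the article does not spell out.

```python
import math


def min_sample_size(sigma, delta):
    """Users per variant for 95% confidence and 80% power: n = 16*sigma^2 / delta^2."""
    return math.ceil(16 * sigma**2 / delta**2)


def conservative_sample_size(sigma, delta, variants=2):
    """More conservative size for 90% power: n = (4*r*sigma / delta)^2."""
    return math.ceil((4 * variants * sigma / delta) ** 2)


def t_statistic(mean_a, mean_b, var_a, var_b, n_a, n_b):
    """t = (O_B - O_A) / sigma_d.

    Assumption: the two groups are independent, so sigma_d is estimated
    as sqrt(var_a / n_a + var_b / n_b).
    """
    sigma_d = math.sqrt(var_a / n_a + var_b / n_b)
    return (mean_b - mean_a) / sigma_d


# Hypothetical metric: standard deviation 30, and we want to detect a change of 1.5.
sigma, delta = 30.0, 1.5
n_basic = min_sample_size(sigma, delta)
n_conservative = conservative_sample_size(sigma, delta, variants=2)
print("users per variant (80% power):", n_basic)
print("users per variant (90% power, conservative):", n_conservative)

# Hypothetical results after running the test with n_basic users in each variant.
t = t_statistic(mean_a=50.0, mean_b=52.0, var_a=900.0, var_b=900.0,
                n_a=n_basic, n_b=n_basic)
THRESHOLD_95 = 1.96  # threshold for a 95% confidence level
print(f"t = {t:.2f}")
print("reject the Null Hypothesis" if abs(t) > THRESHOLD_95
      else "cannot reject the Null Hypothesis")
```

Note how halving `delta` quadruples both sample sizes, which is why the desired sensitivity matters so much when traffic is limited.
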

Confidence Interval

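A confidence interval gives a range of plausible values for the size of the effect. It also shows whether the effect is statistically significant: if the interval does not contain zero, the difference is significant at the corresponding confidence level. The 95% confidence interval for the absolute effect can be calculated as follows: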

$$\text{CI Limits} = \bar O_B - \bar O_A \pm 1.96 \times \hat σ_d$$

The percentage change is usually more intuitive than the absolute change and can be calculated as follows:

$$\text{Pct Diff} = \frac{\bar O_B - \bar O_A}{\bar O_A} \times 100\%$$
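
As a rough illustration of the two formulas in this section, the following Python sketch computes the 95% confidence limits and the percentage difference. The function names and all input values (including the value used for $\hat σ_d$) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def confidence_limits(mean_a, mean_b, sigma_d, z=1.96):
    """Limits of the interval for the absolute effect: (O_B - O_A) +/- z * sigma_d.

    z = 1.96 corresponds to a 95% confidence level.
    """
    diff = mean_b - mean_a
    margin = z * sigma_d
    return diff - margin, diff + margin


def percent_difference(mean_a, mean_b):
    """Relative change of variant B over variant A, in percent."""
    return (mean_b - mean_a) / mean_a * 100.0


# Hypothetical averages for variants A and B, and a hypothetical sigma_d.
mean_a, mean_b, sigma_d = 50.0, 52.0, 0.5

lower, upper = confidence_limits(mean_a, mean_b, sigma_d)
pct = percent_difference(mean_a, mean_b)
# The effect is significant at the 95% level when the interval excludes zero.
significant = lower > 0 or upper < 0

print(f"95% CI for the absolute effect: [{lower:.2f}, {upper:.2f}]")
print(f"percentage difference: {pct:.1f}%")
print("statistically significant:", significant)
```

Reporting the interval alongside the percentage change gives a business reader both the likely size of the effect and how uncertain it is.
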

Summary

In this article, we have concentrated on the mathematical points of A/B testing for non-mathematicians. A/B testing can be conducted with several tools; leading the list is Google Optimize 360, which we will discuss in upcoming articles. However, whatever tool we use, our understanding is incomplete if we neglect the points summarised in this article.

Thank you for browsing this post. Stay tuned for more from Internetrix.