A/B Test survival kit

Choose invariant metrics for Sanity Check

Invariant metrics should not be affected by the test. Use them to confirm that units were assigned to groups at random.
Choose evaluation metrics for Effect Size Test
These are the metrics the test is expected to affect.

Size calculation

http://www.evanmiller.org/ab-testing/sample-size.html

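alpha: probability of rejecting the null hypothesis when it is true (false positive).
beta: probability of failing to reject the null hypothesis when it is false (false negative).
sensitivity (statistical power) = 1 - beta, often set to 80%.
baseline conversion: probability of a success event before the change.
dmin: minimum detectable effect, the smallest change worth detecting.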
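
To make the inputs above concrete, here is a minimal sketch of a per-group sample size estimate. It uses a common textbook approximation for a two-sided two-proportion z-test, so its result may differ somewhat from the linked calculator. The function name is hypothetical, and it assumes dmin is an absolute difference (for example, 0.02 means 10% to 12%).

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(baseline, dmin, alpha=0.05, beta=0.2):
    """Approximate sample size per group (hypothetical helper).

    Assumes a two-sided test and that dmin is an absolute change
    in the conversion probability.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(1 - beta)        # critical value for power
    p1 = baseline
    p2 = baseline + dmin
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / dmin ** 2)


if __name__ == "__main__":
    # Illustrative inputs only: 10% baseline, 2 point minimum detectable effect.
    print(sample_size_per_group(baseline=0.10, dmin=0.02))
```

A smaller dmin or a smaller alpha both increase the required sample size, which is why dmin should be fixed before the test starts.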

Sanity Checks

For a count metric that should split evenly at random between the two groups, use p = 0.5:
- SE = SQRT(0.5*0.5/(N_Ctrl + N_Exp))
- CI = [0.5 - SE * z_score, 0.5 + SE * z_score]
- Observed fraction = N_Exp / (N_Ctrl + N_Exp)
For a probability metric:
- p = N_sucessCtrl / N_Ctrl
- SE = SQRT(p*(1-p)/ N_Ctrl)
- CI = [p - SE * z_score, p + SE * z_score]
- Observed fraction = N_sucessExp / N_Exp

Make sure the observed fraction falls within the CI.
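
The two checks above can be sketched as follows. This is only an illustration that follows the formulas in these notes as written; the function names and the sample counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def count_sanity_check(n_ctrl, n_exp, alpha=0.05):
    """Check that the experiment group's share of units is close to 0.5."""
    z_score = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(0.5 * 0.5 / (n_ctrl + n_exp))
    lower, upper = 0.5 - se * z_score, 0.5 + se * z_score
    observed = n_exp / (n_ctrl + n_exp)
    return lower, upper, observed, lower <= observed <= upper


def probability_sanity_check(success_ctrl, n_ctrl, success_exp, n_exp, alpha=0.05):
    """Check that the experiment probability lies in the CI built from control."""
    z_score = NormalDist().inv_cdf(1 - alpha / 2)
    p = success_ctrl / n_ctrl
    se = sqrt(p * (1 - p) / n_ctrl)
    lower, upper = p - se * z_score, p + se * z_score
    observed = success_exp / n_exp
    return lower, upper, observed, lower <= observed <= upper


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(count_sanity_check(n_ctrl=5000, n_exp=5050))
    print(probability_sanity_check(500, 5000, 520, 5050))
```

If either check fails, investigate the assignment mechanism before looking at the evaluation metrics.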

A/B Test calculation:

For categorical or proportion variables: use a z-test for statistical significance.
For numerical variables: use a t-test for statistical significance.
Effect Size Test (proportional variable)
Pooled probability p = (N_sucessCtrl + N_sucessExp) / (N_Ctrl + N_Exp)
Pooled SE = SQRT(p*(1-p)*(1/N_Ctrl + 1/N_Exp))
Margin of error = pooled SE * z_score
d = (N_sucessExp / N_Exp) - (N_sucessCtrl / N_Ctrl)
CI = [d - margin of error, d + margin of error]
- Statistically significant: the CI does not include 0.
- Practically significant: the CI does not overlap the range [-dmin, dmin].
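
The effect size test above can be sketched in a few lines. The function name, the returned dictionary keys, and the example counts are hypothetical. The practical significance rule encodes one reading of these notes: the whole CI must lie outside [-dmin, dmin].

```python
from math import sqrt
from statistics import NormalDist


def effect_size_test(success_ctrl, n_ctrl, success_exp, n_exp, dmin, alpha=0.05):
    """Pooled two-proportion CI for the difference d (hypothetical helper)."""
    z_score = NormalDist().inv_cdf(1 - alpha / 2)
    p_pool = (success_ctrl + success_exp) / (n_ctrl + n_exp)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_exp))
    margin = se_pool * z_score
    d = success_exp / n_exp - success_ctrl / n_ctrl
    lower, upper = d - margin, d + margin
    return {
        "d": d,
        "ci": (lower, upper),
        # Significant when the CI excludes 0.
        "statistically_significant": lower > 0 or upper < 0,
        # Assumption: practical significance means the CI clears dmin entirely.
        "practically_significant": lower > dmin or upper < -dmin,
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(effect_size_test(1000, 10000, 1300, 10000, dmin=0.01))
```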

Sign Test:

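Count the number of success events (for example, days on which the experiment group beat the control group) out of the total number of trials, as defined by your hypothesis. Then compute a binomial p-value with p = 0.5. http://graphpad.com/quickcalcs/binomial1.cfm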
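
As an alternative to the online calculator, the two-sided binomial p-value can be computed directly. This is a minimal sketch assuming a success probability of 0.5 under the null hypothesis; the function name is hypothetical.

```python
from math import comb


def sign_test_p_value(successes, trials):
    """Two-sided binomial p-value for a sign test with p = 0.5."""
    # With p = 0.5 the distribution is symmetric, so double the larger tail.
    k = max(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Illustrative case: the experiment wins on 7 out of 7 days.
    print(sign_test_p_value(7, 7))  # 0.015625
```

Compare the result with alpha to decide whether the direction of the effect is consistent enough to be significant.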

Bonferroni Correction:

Apply when the hypothesis combines metrics with OR (any single significant metric counts). It reduces false positives by lowering alpha: new alpha = alpha / number of metrics.
Also apply when running multiple A/B tests. The more A/B tests are running, the more likely it is that at least one of them appears significant by chance.
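
A short sketch of both points: the adjusted alpha, and how quickly the chance of at least one false positive grows with the number of tests. The function names and p-values are hypothetical, and the family-wise error rate formula assumes the tests are independent.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the adjusted alpha and a significance flag per p-value."""
    if not p_values:
        raise ValueError("p_values must not be empty")
    adjusted_alpha = alpha / len(p_values)
    return adjusted_alpha, [p < adjusted_alpha for p in p_values]


def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    # Made-up p-values for three metrics.
    print(bonferroni([0.01, 0.04, 0.03]))
    print(family_wise_error_rate(0.05, 10))
```

Bonferroni is conservative when the metrics are correlated, so treat the adjusted alpha as a safe upper bound rather than an exact threshold.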

Question

What are the most important features an A/B testing solution must have?
Number of invariant metrics and evaluation metrics.

Keyword

A/B testing framework
Sums, counts, means, medians, percentiles (25th, 75th, 90th), ratios, probabilities and rates.

Causal Inference methodology:

An A/B test, or controlled experiment, is the gold standard for causal studies. However, it is not always possible to run one. In those cases, observational causal studies can still provide insight.
1. Interrupted Time Series: alternate the A/B variants over time on the same population. Use cases: marketing, pricing.
2. Interleaved Experiments: display the A/B variants in mixed order. Use cases: ranking, recommendation.
3. Regression Discontinuity Design: apply the A/B variants according to a threshold on the population. Use cases: scoring algorithms.
4. Instrumental Variables (IV) and Natural Experiments: an IV is used to approximate random assignment. Sometimes a natural experiment produces assignment that is as good as random, as in a controlled experiment.
5. Propensity Score Matching: segment the population on common confounders to build comparable control and treatment populations. The propensity score is a single number constructed from the observed covariates. Use cases: ad campaigns.
6. Difference in Differences: find a population comparable to the treated population and assume that both follow the same trend. The difference between the two populations' before-and-after changes is the treatment effect. Use cases: geo experiments.
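
The arithmetic behind Difference in Differences (item 6) is simple enough to sketch. The function name and the numbers are made up for illustration, and the estimate is only meaningful if the parallel-trend assumption holds.

```python
from statistics import mean


def difference_in_differences(treat_before, treat_after, comp_before, comp_after):
    """Treatment effect as (treated change) minus (comparison change)."""
    treated_change = mean(treat_after) - mean(treat_before)
    comparison_change = mean(comp_after) - mean(comp_before)
    return treated_change - comparison_change


if __name__ == "__main__":
    # Hypothetical daily sales in a treated region and a comparison region.
    effect = difference_in_differences(
        treat_before=[100, 100], treat_after=[130, 130],
        comp_before=[90, 90], comp_after=[100, 100],
    )
    print(effect)  # 20
```

Here the treated region rose by 30 while the comparison region rose by 10, so 20 is attributed to the treatment.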

Written on August 4, 2023