A/B Test survival kit
Choose invariant metrics for Sanity Check
- Make sure these metrics depend only on random assignment, so they should not differ between control and experiment.
Choose evaluation metrics for Effect Size Test
- These are the metrics the test is expected to affect.
Sample size calculation
http://www.evanmiller.org/ab-testing/sample-size.html
- alpha: probability of rejecting the null hypothesis when it is true (false positive).
- beta: probability of failing to reject the null hypothesis when it is false (false negative).
- sensitivity (power) = 1 - beta (often 80%).
- baseline conversion: current probability of a success event.
- dmin: minimum detectable effect, the smallest change worth detecting.
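The parameters above can be combined into a rough per-group sample size. The sketch below uses a common normal-approximation formula for a two-sided test of two proportions; it is not necessarily the exact formula behind the calculator linked above, so results may differ slightly. The function name `sample_size_per_group` is hypothetical, and `dmin` is assumed to be an absolute difference (0.02 means 10% to 12%).

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(baseline, dmin, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-proportion test.

    Assumption: dmin is an absolute change in the conversion probability.
    """
    p1 = baseline
    p2 = baseline + dmin
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for 1 - beta
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / dmin ** 2)


n = sample_size_per_group(baseline=0.10, dmin=0.02)
print(f"about {n} units per group")
```

Halving `dmin` roughly quadruples the required sample size, which is why choosing a realistic minimum detectable effect matters.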
Sanity Checks
- For a count metric under random assignment, use p = 0.5:
  - SE = SQRT(0.5 * 0.5 / (N_Ctrl + N_Exp))
  - CI = [0.5 - SE * z_score, 0.5 + SE * z_score]
  - Observed fraction = N_Exp / (N_Ctrl + N_Exp)
- For a probability metric:
  - p = N_successCtrl / N_Ctrl
  - SE = SQRT(p * (1 - p) / N_Ctrl)
  - CI = [p - SE * z_score, p + SE * z_score]
  - Observed fraction = N_successExp / N_Exp
Make sure the observed fraction falls inside the CI.
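As a minimal sketch, the two sanity checks can be written directly from the formulas above. The function names and the 95% default confidence level are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_score(confidence=0.95):
    """Two-sided z-score for the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def count_sanity_check(n_ctrl, n_exp, confidence=0.95):
    """Check that the experiment group's share of units is consistent with p = 0.5."""
    se = sqrt(0.5 * 0.5 / (n_ctrl + n_exp))
    margin = se * z_score(confidence)
    lower, upper = 0.5 - margin, 0.5 + margin
    observed = n_exp / (n_ctrl + n_exp)
    return lower, upper, observed, lower <= observed <= upper


def probability_sanity_check(success_ctrl, n_ctrl, success_exp, n_exp, confidence=0.95):
    """Check that the experiment's probability falls in the CI built from control."""
    p = success_ctrl / n_ctrl
    se = sqrt(p * (1 - p) / n_ctrl)
    margin = se * z_score(confidence)
    lower, upper = p - margin, p + margin
    observed = success_exp / n_exp
    return lower, upper, observed, lower <= observed <= upper


print(count_sanity_check(10000, 10100))
print(probability_sanity_check(1000, 10000, 1010, 10000))
```

If either check returns `False` in the last position, investigate the assignment or logging before looking at any evaluation metric.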
A/B Test calculation:
- For categorical or proportion variables: use a z-test for statistical significance.
- For numerical variables: use a t-test for statistical significance.
Effect Size Test (proportional variable)
- Pooled probability p = (N_successCtrl + N_successExp) / (N_Ctrl + N_Exp)
- Pooled SE = SQRT(p * (1 - p) * (1/N_Ctrl + 1/N_Exp))
- Margin of error = pooled SE * z_score
- d = (N_successExp / N_Exp) - (N_successCtrl / N_Ctrl)
- CI = [d - margin of error, d + margin of error]
  - Statistically significant: the CI does not include 0.
  - Practically significant: the CI lies entirely outside the range [-dmin, dmin].
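The effect size test above can be sketched as a single function. The name `effect_size_test`, the returned dictionary keys, and the 95% default confidence level are illustrative assumptions, and `dmin` is assumed to be an absolute difference.

```python
from math import sqrt
from statistics import NormalDist


def effect_size_test(success_ctrl, n_ctrl, success_exp, n_exp, dmin, confidence=0.95):
    """Pooled two-proportion confidence interval for d = p_exp - p_ctrl."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_pool = (success_ctrl + success_exp) / (n_ctrl + n_exp)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_exp))
    margin = se_pool * z
    d = success_exp / n_exp - success_ctrl / n_ctrl
    lower, upper = d - margin, d + margin
    return {
        "d": d,
        "ci": (lower, upper),
        # Statistically significant: the CI excludes 0.
        "statistically_significant": not (lower <= 0 <= upper),
        # Practically significant: the whole CI lies beyond +/- dmin.
        "practically_significant": lower > dmin or upper < -dmin,
    }


result = effect_size_test(1000, 10000, 1200, 10000, dmin=0.01)
print(result)
```

A result can be statistically significant without being practically significant, for example when the CI excludes 0 but still overlaps the range [-dmin, dmin].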
Sign Test:
- Count the number of success events out of the total number of trials, as defined by your hypothesis, and compute the binomial p-value with a calculator such as http://graphpad.com/quickcalcs/binomial1.cfm
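The same two-sided binomial p-value can be computed locally. This is a minimal sketch that assumes the null hypothesis is a success probability of 0.5; the function name `sign_test_p_value` is hypothetical, and the linked calculator may report additional quantities.

```python
from math import comb


def sign_test_p_value(successes, trials):
    """Two-sided binomial p-value under the null hypothesis p = 0.5."""
    # By symmetry at p = 0.5, double the tail at least as extreme as observed.
    k = max(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * tail)


print(sign_test_p_value(7, 7))   # 0.015625
print(sign_test_p_value(9, 10))
```

Compare the p-value against alpha: a value below alpha means the success count is unlikely under a 50/50 null.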
Bonferroni Correction:
- Apply when the hypothesis uses OR across metrics (any one significant metric counts as a positive result). It reduces false positives by lowering alpha: new alpha = alpha / number of metrics.
- Apply when running multiple A/B tests. The more A/B tests are running, the more likely it is that at least one of them appears significant by chance.
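A short sketch of the correction, together with the family-wise error rate it guards against. The function names are illustrative, and the error-rate formula assumes the tests are independent.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni(p_values, alpha=0.05):
    """Return the corrected alpha and which p-values remain significant."""
    corrected_alpha = alpha / len(p_values)
    return corrected_alpha, [p < corrected_alpha for p in p_values]


print(family_wise_error_rate(0.05, 10))
print(bonferroni([0.01, 0.02, 0.04]))
```

The correction is conservative: with many correlated metrics it can hide real effects, so it fits best when the number of metrics is small.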
Question
- What are the most important features an A/B testing solution must have?
- Number of invariant metrics and evaluation metrics.
Keyword
- A/B testing framework
- Sum, count, mean, median, percentiles (25th, 75th, 90th), ratios, probabilities, and rates.
Causal Inference methodology:
- An A/B test, or controlled experiment, is the gold standard for causal studies. However, it is not always possible to run one. In those cases, observational causal studies can still provide insights.
- Interrupted Time Series: alternate the A/B variants on the same population over time. Use cases: marketing, pricing.
- Interleaved Experience: display the A/B variants in mixed order. Use cases: ranking, recommendation.
- Regression Discontinuity Design: apply the A/B variants according to a threshold on the population. Use cases: scoring algorithms.
- Instrumental Variables (IV) and Natural Experiments: an IV is used to approximate random assignment. Sometimes a natural experiment produces assignment that is as good as random, as in a controlled experiment.
- Propensity Score Matching: segment the population on common confounders to build comparable control and treatment populations. The observed covariates are combined into a single number, the constructed propensity score. Use cases: ad campaigns.
- Difference in Differences: find a population comparable to the experiment population and assume the two follow the same trend. The difference between the two populations' before-and-after changes is the treatment effect. Use cases: geo experiments.
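As a small illustration of the Difference in Differences arithmetic described above, the sketch below compares the mean change in a treated population with the mean change in a comparable one. The function name and all numbers are made up for the example, and it assumes both populations would have followed the same trend without the treatment.

```python
from statistics import mean


def difference_in_differences(treat_before, treat_after, ctrl_before, ctrl_after):
    """Treatment effect = change in treated group minus change in comparison group."""
    treat_change = mean(treat_after) - mean(treat_before)
    ctrl_change = mean(ctrl_after) - mean(ctrl_before)
    return treat_change - ctrl_change


# Hypothetical metric values per period for two regions.
effect = difference_in_differences(
    treat_before=[100, 102, 98],
    treat_after=[130, 128, 132],
    ctrl_before=[90, 91, 89],
    ctrl_after=[105, 104, 106],
)
print(effect)
```

Here the treated region rises by 30 and the comparison region by 15, so 15 is attributed to the treatment under the parallel-trend assumption.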
Written on August 4, 2023