A/B Test survival kit

Choose invariant metrics for Sanity Check

  • Make sure these metrics are split by random assignment, so they should not differ between control and experiment.

Choose evaluation metrics for Effect Size Test

  • These metrics are expected to be affected by the test.

Sample size calculation

http://www.evanmiller.org/ab-testing/sample-size.html

  • alpha: probability of rejecting the null hypothesis when it is true (false positive).
  • beta: probability of failing to reject the null hypothesis when it is false (false negative).
  • sensitivity (power) = 1 - beta (often 80%).
  • baseline conversion: probability of a success event.
  • dmin: minimum detectable effect.
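
The four inputs above are enough to estimate a sample size. The sketch below uses the common textbook approximation for a two-sided test of two proportions. It assumes dmin is an absolute difference (0.02 means two percentage points) and equal group sizes. The function name is hypothetical, and the result may differ slightly from the linked calculator, which may use a different variant of the formula.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline, dmin, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-proportion z-test.

    baseline: baseline conversion rate (probability of a success event)
    dmin:     minimum detectable effect, as an absolute difference (assumption)
    alpha:    false positive rate
    power:    sensitivity, i.e. 1 - beta
    """
    p1 = baseline
    p2 = baseline + dmin
    p_bar = (p1 + p2) / 2

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)

    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / dmin ** 2)


# Illustrative inputs only: 10% baseline, 2 percentage points minimum effect.
n_per_group = sample_size_per_group(baseline=0.10, dmin=0.02)
```

A smaller dmin or a higher power both push the required sample size up, which is usually the main trade-off to discuss before launching the test.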

Sanity Checks

  • If the invariant metric is a count that random assignment should split evenly, use p = 0.5
    • SE = SQRT(0.5*0.5/(N_Ctrl + N_Exp))
    • CI = [0.5 - SE * z_score, 0.5 + SE * z_score]
    • Observed fraction = N_Exp / (N_Ctrl + N_Exp)
  • If the invariant metric is a probability
    • p = N_successCtrl / N_Ctrl
    • SE = SQRT(p*(1-p)/ N_Ctrl)
    • CI = [p - SE * z_score, p + SE * z_score]
    • Observed fraction = N_successExp / N_Exp

Make sure the observed fraction is within the CI.
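
Both sanity checks can be sketched in a few lines of Python. This follows the formulas above directly; the function names and the sample counts are hypothetical, and a 95% confidence level is assumed by default.

```python
import math
from statistics import NormalDist


def _z_score(confidence):
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def count_sanity_check(n_ctrl, n_exp, confidence=0.95):
    """Check that a randomly assigned count splits roughly 50/50."""
    se = math.sqrt(0.5 * 0.5 / (n_ctrl + n_exp))
    margin = _z_score(confidence) * se
    lower, upper = 0.5 - margin, 0.5 + margin
    observed = n_exp / (n_ctrl + n_exp)
    return lower, upper, observed, lower <= observed <= upper


def probability_sanity_check(success_ctrl, n_ctrl, success_exp, n_exp,
                             confidence=0.95):
    """Check that the experiment rate falls in the CI built from control."""
    p = success_ctrl / n_ctrl
    se = math.sqrt(p * (1 - p) / n_ctrl)
    margin = _z_score(confidence) * se
    lower, upper = p - margin, p + margin
    observed = success_exp / n_exp
    return lower, upper, observed, lower <= observed <= upper


# Made-up counts, for illustration only.
print(count_sanity_check(n_ctrl=10000, n_exp=10100))
print(probability_sanity_check(1000, 10000, 1020, 10000))
```

If the last element of either result is False, the assignment or the tracking probably needs investigation before the evaluation metrics are looked at.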

A/B Test calculation:

  • For a categorical or proportion variable: use a z-test for statistical significance.
  • For a numerical variable: use a t-test for statistical significance.
  • Effect Size Test (proportion variable)

  • Pooled probability p = (N_successCtrl + N_successExp) / (N_Ctrl + N_Exp)
  • Pooled SE = SQRT(p*(1-p)*(1/N_Ctrl + 1/N_Exp))
  • Margin of error = pooled SE * z_score
  • d = (N_successExp / N_Exp) - (N_successCtrl / N_Ctrl)
  • CI = [d - margin of error, d + margin of error]

    • Statistically significant: the CI does not include 0.
    • Practically significant: the CI lies entirely beyond dmin, i.e. it does not overlap [-dmin, dmin].
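
The effect size test can be sketched as follows. The pooled probability, pooled SE, margin of error and CI follow the formulas above. The practical significance rule (the whole CI lies beyond dmin in either direction) is one common reading and is an assumption here, as are the function name and the sample counts.

```python
import math
from statistics import NormalDist


def effect_size_test(success_ctrl, n_ctrl, success_exp, n_exp, dmin,
                     confidence=0.95):
    """Confidence interval for the difference of two proportions."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

    p_pool = (success_ctrl + success_exp) / (n_ctrl + n_exp)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_exp))
    margin = z * se_pool

    d = success_exp / n_exp - success_ctrl / n_ctrl
    lower, upper = d - margin, d + margin

    return {
        "d": d,
        "ci": (lower, upper),
        # The CI excludes 0.
        "statistically_significant": lower > 0 or upper < 0,
        # Assumption: the CI lies entirely above dmin or entirely below -dmin.
        "practically_significant": lower > dmin or upper < -dmin,
    }


# Made-up counts, for illustration only.
result = effect_size_test(1000, 10000, 1200, 10000, dmin=0.01)
print(result)
```

A result that is statistically but not practically significant usually means the change is real but too small to justify launching on its own.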

Sign Test:
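
A sign test is commonly used as a cross-check on the effect size test: count the days on which the experiment group beat the control group and ask how likely that count would be if each day were a fair coin flip. A minimal sketch, assuming day-by-day comparisons with ties already removed; the function name is hypothetical.

```python
from math import comb


def sign_test_p_value(n_positive, n_total):
    """Two-sided exact binomial p-value under the null p = 0.5.

    n_positive: number of days where experiment > control
    n_total:    number of days compared (ties excluded, by assumption)
    """
    # The null distribution is symmetric, so take the more extreme side
    # and double its tail probability.
    k = max(n_positive, n_total - n_positive)
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2 * tail)


# Made-up example: experiment wins on 19 of 23 days.
p_value = sign_test_p_value(19, 23)
print(p_value < 0.05)
```

If the sign test and the effect size test disagree, breaking the data down by day or by segment is a reasonable next step.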

Bonferroni Correction:

  • Apply when the hypothesis combines metrics with OR (any one significant metric counts). It reduces false positives by lowering alpha: new alpha = alpha / number of metrics.
  • Apply when running multiple A/B tests. The more A/B tests are running, the more likely it is that at least one of them appears significant by chance.
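
A short sketch of why the correction matters. The second function assumes the tests are independent, which is a simplification; both function names are hypothetical.

```python
def bonferroni_alpha(alpha, n_metrics):
    """Per-metric alpha after Bonferroni correction."""
    return alpha / n_metrics


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests.

    Assumes the tests are independent (simplifying assumption).
    """
    return 1 - (1 - alpha) ** n_tests


# With 10 tests at alpha = 0.05 the chance of a spurious "win" is large;
# with the corrected alpha it falls back below 0.05.
uncorrected = familywise_error_rate(0.05, 10)
corrected = familywise_error_rate(bonferroni_alpha(0.05, 10), 10)
print(uncorrected, corrected)
```

Bonferroni is conservative when the metrics are correlated, so it can cost some sensitivity in exchange for the lower false positive rate.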

Question

  • What are the most important features an A/B testing solution must have?
  • Number of invariant metrics and evaluation metrics.

Keyword

  • A/B testing framework
  • Sum, count, mean, median, percentiles (25th, 75th, 90th), ratios, probabilities and rates.

Causal Inference methodology:

  • An A/B test, or controlled experiment, is the gold standard for causal studies. However, it is not always possible to run one. In those cases, observational causal studies can provide the insights instead.
    1. Interrupted Time Series: alternate the A/B variants over time on the same population. Use cases: marketing, pricing.
    2. Interleaved Experience: display the A/B variants in mixed order. Use cases: ranking, recommendation.
    3. Regression Discontinuity Design: apply the A/B variants according to a threshold on the population. Use cases: scoring algorithms.
    4. Instrumental Variables (IV) and Natural Experiments: an IV is used to approximate random assignment. Sometimes a natural experiment produces an assignment that is as good as random, comparable to a controlled experiment.
    5. Propensity Score Matching: segment the population on common confounders to build comparable control and treatment groups, by constructing a single number, the propensity score, from the observed covariates. Use cases: ad campaigns.
    6. Difference in Differences: find a population comparable to the experiment population and assume that both follow the same trend. The difference between the two populations' before-and-after changes is the treatment effect. Use cases: geo experiments.
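
As an illustration of the last method, the difference-in-differences estimate reduces to simple arithmetic on four group means. A minimal sketch, assuming the parallel-trend assumption holds; the function name and the numbers are made up.

```python
from statistics import mean


def difference_in_differences(treated_before, treated_after,
                              comparison_before, comparison_after):
    """Treatment effect estimate from four samples of a metric.

    Relies on the assumption that both populations would have followed
    the same trend without the treatment.
    """
    treated_change = mean(treated_after) - mean(treated_before)
    comparison_change = mean(comparison_after) - mean(comparison_before)
    return treated_change - comparison_change


# Made-up daily values for a treated region and a comparable region.
effect = difference_in_differences(
    treated_before=[100, 102, 98],
    treated_after=[130, 128, 132],
    comparison_before=[90, 91, 89],
    comparison_after=[105, 104, 106],
)
print(effect)
```

Here the treated region moved by 30 and the comparison region by 15, so the part attributed to the treatment is the remaining 15.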

Written on August 4, 2023