Five ways to reduce variance in A/B testing

(bytepawn.com)

51 points | by Maro a day ago

19 comments

  • vijayer 2 hours ago

    This is a good list that includes a lot of things most people miss. I would also suggest:

    1. Tight targeting of your users in an AB test. This can be through proper exposure logging, or aiming at users down-funnel if you’re actually running a down-funnel experiment. If your new iOS and Android feature is going to be launched separately, then separate the experiments.

    2. Making sure your experiment runs in 7-day increments. Averaging out weekly seasonality can be important in reducing variance but also ensures your results accurately predict the effect of a full rollout.

    Everything mentioned in this article, including stratified sampling and CUPED, is available out of the box on Statsig. Disclaimer: I’m the founder, and this response was shared by our DS Lead.

    • wodenokoto 2 hours ago

      > 2. Making sure your experiment runs in 7-day increments. Averaging out weekly seasonality can be important in reducing variance but also ensures your results accurately predict the effect of a full rollout.

      There are of course many seasonalities: day/night, weekly, monthly, yearly, so it can be difficult to decide how broadly you want to collect data. But I remember interviewing at a very large online retailer, and they did their A/B tests in an hour because they "would collect enough data points to be statistically significant", and that never sat right with me.

  • kqr 3 hours ago

    > Winsorizing, ie. cutting or normalizing outliers.

    Note that outliers are often your most valuable data points[1]. I'd much rather stratify than cut them out.

    By cutting them out you indeed get neater data, but it no longer represents the reality you are trying to model and learn from, and you run a large risk of drawing false conclusions.

    [1]: https://entropicthoughts.com/outlier-detection
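
    For concreteness, here is a minimal sketch of the two options on made-up data (plain numpy; the 1%/99% cut-offs and the "whale" stratum are arbitrary illustration choices, not anything from the article):

      import numpy as np

      rng = np.random.default_rng(0)

      # Made-up revenue-per-user sample with a heavy tail.
      revenue = np.concatenate([rng.exponential(5.0, 9_900),    # typical users
                                rng.exponential(500.0, 100)])   # heavy-spending outliers

      # Winsorizing: clip everything outside the 1st..99th percentile.
      lo, hi = np.quantile(revenue, [0.01, 0.99])
      winsorized = np.clip(revenue, lo, hi)

      # Stratifying instead: keep the outliers, but analyze the heavy spenders
      # as their own stratum so they don't dominate the overall variance.
      is_whale = revenue > hi
      print("typical:", (~is_whale).sum(), round(revenue[~is_whale].mean(), 2))
      print("whales: ", is_whale.sum(), round(revenue[is_whale].mean(), 2))
      print("raw mean:", round(revenue.mean(), 2),
            "winsorized mean:", round(winsorized.mean(), 2))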

    • chashmataklu 36 minutes ago

      TBH depends a lot on the business you're experimenting with and who you're optimizing for. If you're Lime Bike, you don't want to skew results because of a Doordasher who's on a bike for the whole day because their car is broken.

      If you’re a retailer or a gaming company, you probably care about your "whales", who’d get winsorized out. It depends on whether you’re trying to move the topline or trying to move the "typical".

  • sunir 3 hours ago

    One of the most frustrating results I found is that A/B split tests often resolved into a winner within the sample size range we set; however, if I left the split running over a longer period of time (e.g. a year), the difference would wash out.

    I had retargeting in a 24-month split by accident and found that, after accounting for the cost, it didn’t matter in the long term. We could bend the conversion curve but not change the people who would convert.

    And yes, we did capture more revenue in the short term, but over the long term the cost of the ads netted it all out to zero or less than zero. And yes, we turned off retargeting after conversion. The result was that customers who weren’t retargeted eventually bought anyway.

    Has anyone else experienced the same?

    • kqr 3 hours ago

      > We could bend the conversion curve but not change the people who would convert.

      I think this is very common. I talked to salespeople who claimed that customers on 2.0 are happier than those on 1.0, which they had determined by measuring satisfaction in the two groups and getting a statistically significant result.

      What they didn't realise was that almost all of the customers on 2.0 had been those that willingly upgraded from 1.0. What sort of customer willingly upgrades? The most satisfied ones.

      Again: they bent the curve, didn't change the people. I'm sure this type of confounding-by-self-selection is incredibly common.

    • Adverblessly 2 hours ago

      Obviously it depends on the exact test you are running, but a factor that is frequently ignored in A/B testing is that often one arm of the experiment is the existing state vs. another arm that is some novel state, and such novelty can itself have an effect. E.g. it doesn't really matter if this widget is blue or green, but changing it from one color to the other temporarily increases user attention to it, until they are again used to the new color. Users don't actually prefer your new flow for X over the old one, but because it is new they are trying it out, etc.

    • bdjsiqoocwk 3 hours ago

      > One of the most frustrating results I found is that A/B split tests often resolved into a winner within the sample size range we set; however, if I left the split running over a longer period of time (e.g. a year), the difference would wash out.

      Doesn't that just mean there's no difference? Why is that frustrating?

      Does the frustration come from the expectation that any little variable might make a difference? Should I use red buttons or blue buttons? Maybe if the product is shit, the color of the buttons doesn't matter.

      • admax88qqq 3 hours ago

        > Maybe if the product is shit, the color of the buttons doesn't matter.

        This should really be on a poster in many offices.

  • usgroup an hour ago

    Adding covariates to the post-test analysis can reduce variance. One instance of this is CUPED, but there are lots of covariates which are easier to add (e.g. request type, response latency, day of week, user info, etc.).
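
    A minimal sketch of the CUPED part, assuming a single pre-experiment metric is used as the covariate (numpy only; the data and the strength of the correlation are made up):

      import numpy as np

      rng = np.random.default_rng(1)
      n = 10_000

      # Made-up data: pre-experiment spend (covariate X) and in-experiment spend (metric Y).
      pre = rng.gamma(2.0, 10.0, n)
      post = 0.8 * pre + rng.normal(0.0, 5.0, n) + 2.0

      # CUPED: subtract the part of Y that the pre-experiment covariate explains.
      theta = np.cov(post, pre)[0, 1] / np.var(pre, ddof=1)
      post_cuped = post - theta * (pre - pre.mean())

      # Same mean (the treatment-effect estimate is unchanged), much lower
      # variance, hence tighter confidence intervals at the same sample size.
      print("var(Y):      ", round(np.var(post), 1))
      print("var(Y_cuped):", round(np.var(post_cuped), 1))
      print("mean(Y):", round(post.mean(), 2), " mean(Y_cuped):", round(post_cuped.mean(), 2))

    The other covariates mentioned (request type, day of week, etc.) can go in the same way, via a regression of the metric on treatment plus covariates.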

  • pkoperek 3 hours ago

    Good read. Does anyone know if any of the experimentation frameworks actually use these methods to make the results more reliable (e.g. automatically apply winsorization or attempt to make the split sizes even)?

  • withinboredom 4 hours ago

    Good advice! From working on an internal A/B testing platform, we had built-in tooling to do some of this stuff after the fact. I don't know of any off-the-shelf A/B testing tool that can do this.

    • ulf-77723 2 hours ago

      Worked at an A/B test SaaS company as a solutions engineer, and to my knowledge every vendor is capable of delivering solutions for those problems.

      Some advertise those capabilities, but the big ones take them for granted. Usually, before a test is developed, the project manager will help raise the critical questions about the test setup.

      • chashmataklu 28 minutes ago

        Pretty sure most don't. Most A/B test SaaS vendors cater to lightweight clickstream optimization, which is why they don't have features like stratified sampling. Internal systems are light-years ahead of most SaaS vendors.

    • alvarlagerlof 3 hours ago

      Pretty sure that http://statsig.com can.

  • kqr 4 hours ago

    See also sample unit engineering: https://entropicthoughts.com/sample-unit-engineering

    Statisticians have a lot of useful tricks to get higher-quality data out of the same cost (i.e. sample size).

    Another topic I want to learn properly is running multiple experiments in parallel in a systematic way to get faster results and be able to control for confounding. Fisher advocated for this as early as 1925, and I still think we're learning that lesson today in our field: sometimes the right strategy is not to try one thing at a time and keep everything else constant.
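
    A rough sketch of that idea, under the assumption that the two treatments are assigned independently so a single regression can estimate both effects at once (a 2x2 factorial; statsmodels, with made-up data and effect sizes):

      import numpy as np
      import pandas as pd
      import statsmodels.formula.api as smf

      rng = np.random.default_rng(2)
      n = 20_000

      # Two experiments run in parallel: each user is independently assigned
      # to arm A (0/1) and arm B (0/1), giving a 2x2 factorial design.
      df = pd.DataFrame({
          "A": rng.integers(0, 2, n),
          "B": rng.integers(0, 2, n),
      })
      # Made-up outcome: A has a real effect, B has none, plus noise.
      df["y"] = 0.3 * df["A"] + 0.0 * df["B"] + rng.normal(0.0, 1.0, n)

      # One regression estimates both effects (and their interaction) at once,
      # each controlling for the other instead of holding it constant.
      model = smf.ols("y ~ A * B", data=df).fit()
      print(model.summary().tables[1])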

    • authorfly 3 hours ago

      Can you help me understand why we would use sample unit engineering/bootstrapping? Imagine we don't care about between-subjects variance (and thus p-values in t-tests/A/B tests); in that case, it doesn't help us, right?

      I just feel intuitively that it's masking the variance by converting it into within-subjects variance arbitrarily.

      Here's my layman-ish interpretation:

      P-values are easier to obtain when the variance is reduced. But we established p-values and the 0.05 threshold before these techniques existed. With the new techniques reducing the SD, which p-values are computed from, you need to counteract the reduction in SD with a harsher p-value threshold in order to obtain the same number of true positive experiments as when p-values were originally proposed. In other words, allowing more experiments to have less variance in group tests and reach statistical significance whenever there is an effect size is not necessarily advantageous, especially if we consider the purpose of statistics and A/B testing to be rejecting the null hypothesis rather than showing significant effect sizes.

      • kqr 3 hours ago

        Let's use the classic example of the "Lady tasting tea". Someone claims to be able to tell, by taste alone, whether the milk was added to the cup before or after the tea.

        We can imagine two versions of this test. In both, we serve 12 cups of tea, six of which have had milk added first.

        In one of the experiments, we keep everything else the same: same quantities of milk and tea, same steeping time, same type of tea, same source of water, etc.

        In the other experiment, we randomly vary quantities of milk and tea, steeping time, type of tea etc.

        Both of these experiments are valid, both have the same 5 % risk of false positives (given by the null hypothesis that any judgment by the Lady is a coinflip). But you can probably intuit that in one of the experiments, the Lady has a greater chance of proving her acumen, because there are fewer distractions. Maybe she is able to discern milk-first-or-last by taste, but this gets muddled up by all the variations in the second experiment. In other words, the cleaner experiment is more sensitive, but it is not at a greater risk of false positives.

        The same can be said of sample unit engineering: it makes experiments more sensitive (i.e. we can detect a finer signal for the same cost) without increasing the risk of false positives (which is fixed by the type of test we run.)

        ----

        Sometimes we only care about detecting a large effect, and a small effect is clinically insignificant. Maybe we are only impressed by the Lady if she can discern despite distractions of many variations. Then removing distractions is a mistake. But traditional hypothesis tests of that kind are designed from the perspective of "any signal, however small, is meaningful."

        (I think this is even a requirement for using frequentist methods. They need an exact null hypothesis to compute probabilities from.)
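
        A quick simulation of that last point, with a plain two-sample t-test rather than the tea design (all numbers made up): under the null the rejection rate stays at the nominal 5% whether the data are noisy or clean, while under a real effect the low-variance version detects it far more often.

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(3)

          def rejection_rate(effect, sigma, n=200, reps=2_000, alpha=0.05):
              """Share of simulated experiments in which a two-sample t-test rejects."""
              hits = 0
              for _ in range(reps):
                  a = rng.normal(0.0, sigma, n)     # control
                  b = rng.normal(effect, sigma, n)  # treatment
                  if stats.ttest_ind(a, b).pvalue < alpha:
                      hits += 1
              return hits / reps

          for sigma in (2.0, 0.5):  # "noisy" vs "cleaned-up" experiment
              fp = rejection_rate(effect=0.0, sigma=sigma)     # null: false positives
              power = rejection_rate(effect=0.2, sigma=sigma)  # real effect: sensitivity
              print(f"sigma={sigma}: false positives ~{fp:.3f}, power ~{power:.3f}")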

  • sanchezxs an hour ago

    Yes.