But obviously the right move is not to just never change the product because you can't prove that the changes are good - that's effectively applying a very strong prior in favor of the control group, and that's problematic. Nor should you just roll out whatever crap your product people throw at the wall: while there is a slight bias in most experiments in favor of the variant, it's very slight, so your feature designers are probably building harmful stuff about half the time. You should apply some filter to make sure they're helping the product and not just doing a random walk through design space.
The best simple strategy in a real world where most effect sizes are small and you never have the option to gather more data really is to do the dumb thing: run experiments for as long as you can, pick whichever variant seems like it's winning, rinse and repeat.
Yes, you're going to be picking the wrong variant way more often than your analysts would prefer, but that's way better than never changing the product or holding out for the very few hugely impactful changes that you are properly powered for. On average, over the long run, blindly picking the bigger number will stack small changes, and while a lot of those will turn out to be negative, your testing will bias somewhat in favor of positive ones and add up over time. And this strategy will provably beat one that does Proper Statistics and demands 95% confidence or whatever equivalent Bayesian criteria you use, because it leaves room to accept the small improvements that make up the vast majority of feature space.
There's an equivalent and perhaps simpler way to justify this, which is to throw out the group labels: if we didn't know which one was the control and we had to pick which option was better, then quite obviously, regardless of how much data we have, we just pick the one that shows better results in the sample we have. Including if there's just a single user in each group! In an early product, this is TOTALLY REASONABLE, because your current product sucks, and you have no reason to think that the way it is should not be messed with. Late lifecycle products probably have some Chesterton's fence stuff going on, so maybe there's more of an argument to privilege the control, but those types of products should have enough users to run properly powered tests.
Yes! (Correct) pre-registration is everything. ("Correct" meaning: There's no point "pre-registering" if you fail to account for the number of tests you'll do -- but hopefully the fact that you have thought to pre-register at all is a strong indication that you should be performing such corrections.)
That said, I agree with the other poster here about how important this really is for startups. It's critical to know if the drug really improves lung function; it's probably not critical to know whether the accent colour on your landing page should be mauve or aqua blue.
This isn't just a startup thing. It's common at FAANG too.
Not only are experiments commonly multi-arm, you also repeat your experiment (usually after making some changes) if the previous experiment failed / did not pass the launch criteria.
This is further complicated by the fact that launch criteria are usually not well defined ahead of time. Unless it's a complete slam dunk, you won't know until your launch meeting whether the experiment will be approved for launch or not. It's mostly vibe-based, determined by tens or hundreds of "relevant" metric movements, often decided on the whim of the stakeholder sitting at the launch meeting.
On the one hand, this is a very nicely presented explanation of how to run statistically significant A/B style tests.
It's worth emphasizing, though, that if your startup hasn't achieved product-market fit yet, this kind of thing is a huge waste of time! Build features, see if people use them.
“This kind of thing” being running AB tests at all.
There’s no reason to run AB / MVT tests at all if you’re not doing them properly.
…or, you could have a product that does not hinge on some micro-optimization of a website layout.
Exactly. This is what micro-optimization looks like on the product side.
On the third point (peeking at p-values), I created an A/A test simulator that compares peeking vs not peeking in terms of false positive rate: https://stop-early-stopping.osc.garden/
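For anyone who would rather see the effect than take it on faith, here is a minimal sketch of the same idea in Python (all traffic numbers are hypothetical): run A/A experiments where both arms are identical, check a pooled two-proportion z-test at several interim looks, and compare the false positive rate against looking only once at the end.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def z_test_p(wins_a, n_a, wins_b, n_b):
    """Two-sided pooled two-proportion z-test p-value."""
    p_pool = (wins_a + wins_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (wins_a / n_a - wins_b / n_b) / se
    return 2 * (1 - norm.cdf(abs(z)))

n_sims, n_per_arm, looks = 2000, 10_000, 10
base_rate = 0.05  # both arms identical: an A/A test, so any "winner" is a false positive
peek_fp = final_fp = 0
for _ in range(n_sims):
    a = rng.random(n_per_arm) < base_rate
    b = rng.random(n_per_arm) < base_rate
    checkpoints = np.linspace(n_per_arm // looks, n_per_arm, looks, dtype=int)
    ps = [z_test_p(a[:n].sum(), n, b[:n].sum(), n) for n in checkpoints]
    peek_fp += any(p < 0.05 for p in ps)  # stop at the first "significant" peek
    final_fp += ps[-1] < 0.05             # only ever look once, at the end
print(f"false positive rate with peeking: {peek_fp / n_sims:.1%}")
print(f"false positive rate, single look: {final_fp / n_sims:.1%}")
```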
> This isn't academic nit-picking. It's how medical research works when lives are on the line. Your startup's growth deserves the same rigor.
But does it, really? A lot of companies sell... well, let's say "not important" stuff. Most companies don't cost people's lives when you get it wrong. If you A/B test user signups for a startup that sells widgets, people aren't living or dying based on the results. The consequences of getting it wrong are... you sell fewer widgets?
While I understand the overall point of the post -- and agree with it! -- I do take issue with this particular point. A lot of companies are, arguably, _too rigorous_ when it comes to testing.
At my last company, we spent 6 weeks waiting for stat sig. But within 48 hours, we had a positive signal. Conversion was up! Not statistically significant, but trending in the direction we wanted. But to "maintain rigor," we waited 6 weeks before turning it... and the final numbers were virtually the same as the 48 hour numbers.
Note: I'm not advocating stopping tests as soon as something trends in the right direction. The third scenario in the post points this out as a flaw! I do like their proposal for "peeking" and subsequent testing.
But, really, let's just be realistic about what level of "rigor" is required to make decisions. We aren't shooting rockets into space. We're shipping software. We can change things if we get them wrong. It's okay. The world won't end.
IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals. If its goals are "stat sig on every test," then sure, treat it like someone might die if you're wrong. (I would argue that you have the wrong goals, in this case, but I digress...)
But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.
I see where you are coming from, and overtesting is a thing, but I really believe that the baseline quality of all software out there is terrible. We are just so used to it that it's been normalized. Hardly a day goes by when I'm not annoyed by a bug that somebody with more attention to quality would not have let through.
It's not about space-rocket levels of rigor; it's about a higher bar than the current state.
(Besides, Elon's rockets are failing left and right, in contrast to what NASA achieved in the 60s, so there are some lessons there too.)
Completely agree. The sign up flow for your startup does not need the same rigor as medical research. You don’t need transportation engineering standards for your product packaging, either. They’re just totally different levels of risk.
I could write pages on this (I’ve certainly spoken for hours) but the adoption of a scientific research mindset is very limiting for A/B testing. You don’t need all the status quo bias of null hypothesis testing.
At the same time, it’s quite impressive how people are able to adapt. An organization experienced with A/B testing will start doing things like multivariate corrections in their heads.
For anyone spinning this stuff up, go Bayesian from the start. You’ll end up there, whether you realize it or not. (People will look at p-values in consideration of prior evidence).
0.05 (or any Bayesian equivalent) is not a magic number. It’s really quite high for a default. Harder sciences (the ones not in replication crisis) use much stricter values by default.
Adjust the confidence required to the cost of the change and the risk of harm. If you’re at the point of testing, the cost of change may be zero (content). It may be really high, it may be net negative!
But in most cases, at a startup, you should be going after wins that are way more impactful and end up having p-values lower than 0.05, anyway. This is easy to say, but don’t waste your time coming up with methods to squeeze out more signal. Just (just lol) make better changes to your product so that the methods don’t matter. If p=0.00001, that’s going to be a better signal than p=0.05 with every correction in this article.
If you’re going to pick any fanciness from the start (besides Bayes), make it anytime-valid methods. You’re certainly already going to be peeking (as you should), so have your data reflect that.
> You don’t need all the status quo bias of null hypothesis testing.
You don't have to make the status quo be the null hypothesis. If you make a change, you probably already think that your change is better or at least neutral, so make that the null. If you get a strong signal that your change is actually worse, rejecting the null, revert the change.
Not "only keep changes that are clearly good" but "don't keep changes that are clearly bad."
This is a reasonable approach, particularly when you’re looking at moving towards a bigger redesign that might not pay off right away. I’ve seen it called “non-inferiority test,” if you’re curious.
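For anyone curious what that looks like mechanically, here is a minimal sketch of a non-inferiority-style check on conversion rates (all numbers hypothetical): the null is "the new variant is worse than control by more than the margin," and rejecting it is how you conclude the change is safe to keep.

```python
import numpy as np
from scipy.stats import norm

def non_inferiority_p(conv_new, n_new, conv_old, n_old, margin=0.01):
    """One-sided test of H0: p_new <= p_old - margin ("meaningfully worse")
    against H1: p_new > p_old - margin ("no worse than the margin")."""
    p_new, p_old = conv_new / n_new, conv_old / n_old
    se = np.sqrt(p_new * (1 - p_new) / n_new + p_old * (1 - p_old) / n_old)
    z = (p_new - p_old + margin) / se
    return 1 - norm.cdf(z)

# hypothetical numbers: the redesign converts at 9.6% vs 10.0% for the old page
p = non_inferiority_p(conv_new=960, n_new=10_000, conv_old=1000, n_old=10_000)
print(f"p = {p:.3f}")  # a small p rules out "worse by more than 1 point"; here it's borderline
```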
Especially for startups with a small user base.
Not having many users means that getting to stat sig will take longer (if you get there at all).
Sometimes you just need to trust your design/product sense, assert that some change you’re making is better, and push it without an experiment. Too often, people use experimentation for CYA reasons so they can never be blamed for making a misstep.
100% this. I’ve seen people get too excited to A/B test everything even when it’s not appropriate. For us, changing prices was a common A/B test when the relatively low number of conversions meant the tests took 3 months to run! I believe we’ve moved away from that, now.
The company has a large user base; it’s just that SaaS doesn’t have the same conversion numbers as, say, e-commerce.
The idea that you should be going after bigger wins than 0.05 misses the point. The p-value is a function of both the effect size and the sample size. If you have a big effect you’ll see it even with small data.
Completely agree on the Bayesian point though, and the importance of defining the loss function. Getting people used to talking about the strength of the evidence rather than statistical significance is a massive win most of the time.
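To make the effect-size point concrete, here is a small sketch with made-up conversion counts and a plain pooled two-proportion z-test:

```python
import numpy as np
from scipy.stats import norm

def two_prop_p(x1, n1, x2, n2):
    """Two-sided pooled two-proportion z-test p-value."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - norm.cdf(abs(z)))

# big effect, small sample: 10% -> 20% conversion, 500 users per arm
print(two_prop_p(50, 500, 100, 500))          # ~1e-5: visible despite the small n
# small effect, same sample: 10% -> 11%, 500 users per arm
print(two_prop_p(50, 500, 55, 500))           # ~0.6: invisible at this sample size
# same small effect, big sample: 10% -> 11%, 50,000 users per arm
print(two_prop_p(5000, 50_000, 5500, 50_000)) # tiny again: it shows up with enough users
```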
It does, if you assume you care about the validity of the results or about making changes that improve your outcomes.
The degree of care can be different in less critical contexts, but then you shouldn’t lie to yourself about how much you care.
But there’s an opportunity cost that needs to be factored in when waiting for a stronger signal.
There is, but you can decide that up front. There are tools that will show you how long it’ll take to reach statistical significance. You can then decide if you want to wait that long or accept a softer p-value.
Even if you have to be honest with yourself about how much you care about being right, there’s still a place for balancing priorities. Two things can be true at once.
Sometimes someone just has to make imperfect decisions based on incomplete information, or make arbitrary judgment calls. And that’s totally fine… But it shouldn’t be confused with data-driven decisions.
The two kinds of decisions need to happen. They can both happen honestly.
One solution is to gradually shift traffic to your most likely winner.
But continue a percentage of A/B/n testing as well.
This allows for a balancing of speed vs. certainty.
Do you use any tool for this, or do you simply crank the dial up slightly each day?
There are multi armed bandit algorithms for this. I don’t know the names of the public tools.
This is especially useful for something where the value of the choice is front loaded, like headlines.
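As a rough illustration, here is a minimal epsilon-greedy sketch with hypothetical signup rates: most traffic drifts toward the best-looking variant while a small slice keeps exploring, which is the front-loaded-value behavior described above.

```python
import numpy as np

rng = np.random.default_rng(42)

def epsilon_greedy(true_rates, epsilon=0.1, n_users=50_000):
    """Serve one user at a time: explore a random variant with probability
    epsilon (or while any variant is unseen), otherwise exploit the best
    observed rate so far."""
    k = len(true_rates)
    shows, wins = np.zeros(k), np.zeros(k)
    for _ in range(n_users):
        if rng.random() < epsilon or shows.min() == 0:
            arm = rng.integers(k)                   # explore
        else:
            arm = int(np.argmax(wins / shows))      # exploit
        shows[arm] += 1
        wins[arm] += rng.random() < true_rates[arm] # simulated signup
    return shows, wins / np.maximum(shows, 1)

shows, observed = epsilon_greedy([0.10, 0.11, 0.12])  # hypothetical true rates
print("traffic per variant:", shows)
print("observed rates:     ", observed.round(3))
```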
Can this be solved by setting p=0.50?
Make your expectations explicit instead of implicit. 0.05 is completely arbitrary. If you are comfortable with a 50/50 chance of being right, make your threshold less rigorous.
I think at that point you may as well skip the test and just make the change you clearly want to make!
Or collect some data and see if the net effect is positive. It’s possibly worth collecting some data though to rule out negative effects?
> The consequences of getting it wrong are... you sell fewer widgets?
If that’s the difference between success and failure then that is pretty important to you as a business owner.
> do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive
That’s a reasonable, and in plenty of contexts the absolute best, approach to take. But don’t call it A/B testing, because it’s not.
> We aren't shooting rockets into space.
Most of us aren't, indeed. So, still aligned with your perspective, it's good to take into consideration what we're currently working on and what the possible implications are. Sometimes the line is not so obvious, though. If we design a library or framework that isn't tied to some inconsequential outcome, it's no longer obvious which policy makes the most sense.
The thing is though, you're just as likely to be not improving things.
I think we can realize another reason to just ship it. Startups need to be always moving. You need to keep turning the wheel to help keep everyone busy and keep them from fretting about your slow growth or high churn metrics. Startups need lots of fighting spirit. So it's still probably better to ship it rather than admit defeat and suffer bad vibes.
Allow me to rephrase what I think you’re saying:
Startups need to ship because they need to have a habit of moving constantly to survive. Stasis is death for a startup.
The other thing is that in those medical contexts, the choice is often between "use this specific treatment under consideration, or do nothing (i.e., use existing known treatments)". Is anyone planning to fold their startup if they can't get a statistically significant read on which website layout is best? Another way to phrase "do no harm" is to say that a null result just means "there is no reason to change what you're doing".
> Most companies don't cost people's lives when you get it wrong.
True, but it usually costs money to fix it. I think the themes of "this only matters if lives are on the line" or "it's too rigorous" are straw-men.
We have limited resources -- time, money, people. We'd like to avoid deploying those resources badly. Statistical inference can be one way to give us more information so we avoid using our resources badly, but as you note, statistical inference also has costs: we have to spend resources to get the data we need to do the inference, plus other costs. We can estimate the costs of getting sufficient data using sample size estimation methods. For go/no-go decision-making, if the cost of getting the decision wrong isn't something like at least 10x the cost of doing the statistical inference, I don't think it's worth doing the inference. It may be worth doing the inference for _other_ reasons, but those reasons are out of scope.
As an example, a common use of statistical inference in medical research is to compare the efficacy of a treatment with a placebo. Some of the motivation is to decide whether to invest more resources in developing the treatment, not because people will die if they get a false positive stating that the treatment is effective when it isn't.
> A lot of companies are, arguably, _too rigorous_ when it comes to testing.
My experience in industry has been the opposite. Companies like the idea of data-driven decision-making, but then they discover pain points. They should have some idea of how much of a change they're looking to detect (i.e., an effect size). They should estimate how much data they're likely to need to run their tests (i.e., sample size estimation). They have to consider other issues like model misfit, calibration, multiple-testing corrections, and so on. Then they also have to rig up the infra to be able to _do_ the testing, collect the data, analyze the results, and communicate the results to their internal stakeholders. These pain points are why companies like Eppo and StatSig exist -- A/B testing ends up being more high-touch than developers expect.
Messing up any one of these issues can yield "flaky tests," which developers hate. Failing to gather a sufficiently large sample size for a given effect size is a pretty common failure mode.
> But to "maintain rigor," we waited 6 weeks before turning it... and the final numbers were virtually the same as the 48 hour numbers.
It's difficult to tell precisely what you mean by "maintain rigor" here. The only context I can gather is that whatever procedure you were using needed more data to satisfy the preconditions behind its nominal design criteria -- usually, its nominal false positive rate. I don't think this is an issue of rigor -- it's an issue of statistical modeling and correctness.
Sometimes, it's possible to use different methods that may require less data at the cost of more (or different) modeling assumptions. Failing to satisfy the assumptions of a test can increase its false positive rate. Whether that matters is really up to you.
> I do like their proposal for "peeking" and subsequent testing.
What the post is suggesting is not a proposal, but a standard class of frequentist statistical inference methods called sequential testing. Daniël Lakens has a good online textbook (https://lakens.github.io/statistical_inferences/) that briefly discusses these methods in Chapter 10 and provides further references.
> We're shipping software. We can change things if we get them wrong.
That's usually true -- as long as you have the resources needed to make those changes, and are willing to spend them that way.
> IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals.
While I don't disagree with the sentiment, I think you're conflating rigor with correctness here.
> If its goals are "stat sig on every test", then sure, treat it like someone might die if you're wrong.
I think that's a false equivalence. Even the American Statistical Association has issued a statement on p-values (see https://www.amstat.org/asa/files/pdfs/p-valuestatement.pdf) that includes "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold."
> But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.
If those are your goals, just ship it; I don't think it makes sense to justify the effort to test in this situation, especially if, as you argue, it's financially feasible to roll back the change or pivot if it doesn't work.
Keep in mind that frequent A/B tests burn statistical “credit.” Any time you ship a winner at p = 0.05 you’ve spent 5 % of your false-positive budget. Do that five times in a quarter and the chance at least one is noise is 1 – 0.95⁵ ≈ 23 %.
There are several approaches you can take to reduce that source of error:
Quarterly alpha ledger
Decide how much total risk you want this quarter (say 10 %). Divide the remaining α by the number of experiments left and make that the threshold for the next launch. Forces the “is this button-color test worth 3 % of our credibility?” conversation. More info: “Sequential Testing in Practice: Why Peeking Is a Problem and How to Fix It” (https://medium.com/@aisagescribe/sequential-testing-in-pract...).
Benjamini–Hochberg (BH) for metric sprawl
Once you watch a dozen KPIs, Bonferroni buries real lifts. BH ranks all the p-values at the end, then sets the cutoff so that, in expectation, only 5 % of declared winners are false positives. You keep power, and you can run the same BH step on the primary metric from every experiment each quarter to catch lucky launches (a code sketch follows below). More info: “Controlling False Discoveries: A Guide to BH Correction in Experimentation” (https://www.statsig.com/perspectives/controlling-false-disco...).
Bayesian shrinkage + 5 % “ghost” control for big fleets
FAANG-scale labs run hundreds of tests and care about 0.1 % lifts. They pool everything in a simple hierarchical model; noisy effects get pulled toward the global mean, so only sturdy gains stay above water. Before launch, they sanity-check against a small slice of traffic that never saw any test. Cuts winner’s-curse inflation by ~30 %. Clear explainer: “How We Avoid A/B Testing Errors with Shrinkage” (https://eng.wealthfront.com/2015/10/29/how-we-avoid-ab-testi...) and (https://www.statsig.com/perspectives/informed-bayesian-ab-te...)
<10 tests a quarter: alpha ledger or yolo; dozens of tests and KPIs: BH; hundreds of live tests: shrinkage + ghost control.
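For the BH step above, here is a minimal sketch of the standard step-up procedure (the end-of-quarter p-values are made up):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of which results to declare winners while keeping
    the expected false discovery rate at q (Benjamini-Hochberg step-up)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m   # rank-scaled thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = np.nonzero(passed)[0].max()           # largest rank clearing its threshold
        reject[order[:cutoff + 1]] = True              # reject everything up to that rank
    return reject

# made-up end-of-quarter p-values for a dozen KPIs
pvals = [0.001, 0.004, 0.019, 0.03, 0.04, 0.05, 0.11, 0.2, 0.34, 0.5, 0.7, 0.9]
print(benjamini_hochberg(pvals))  # only the two smallest survive in this example
```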
> the chance at least one is noise is 1 – 0.95⁵ ≈ 23 %
Yes, but that's not really the big deal that you're making it out to be, since it's (usually) not an all-or-nothing thing. Usually, the wins are additive. The chance of each winner being genuine is still 95% (assuming no p-hacking), and so the expected number of wins out of those 5 will be 0.95 * 5 = 4.75 wins (by linearity of expectation), which is a solid win rate.
Good point. The 23% in the example refers to the worst case where 5 tests are all null throughout the period.
Hard disagree with this. Unlike medical experiments, the cost of being wrong in startup experiments is very low: you thought there was a small effect and there was none. It’s usually just a matter of pushing one variant vs another and moving on.
There are certainly scenarios where more rigor is appropriate, but usually those come from trying to figure out why you’re seeing a certain effect and how that should affect your overall company strategy.
My advice for startups is to run lots of experiments, do bad statistics, and know that you’re going to have some false positives so that you don’t take every result as gospel.
The danger, I think, is less the numbers and more whether what you are measuring makes sense. E.g., sure, your A beats B in click-through rate. But if the person then thinks "fuck, I was duped" and closes the browser, then that's no good.
I like the points and I'll probably link to this.
I'll add one from my experience as a PM dealing with very "testy" peers in early stage startups: don't do any of this if you don't have {enough} users -- rely on intuition and focus on the core product.
I was waiting for that comment to appear.
If your core product isn't any good, A/B testing seems like rearranging the deck chairs on the Titanic.
If the goal is for the company to maximize profit from having the best page, this is an instance of a very well-studied problem https://en.m.wikipedia.org/wiki/Multi-armed_bandit?useskin=v... and one can do much better than statistical significance testing. (If the goal is to validate scientific theories, or there are other extenuating factors, things may be different.)
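As a sketch of what a bandit approach can look like in practice, here is Thompson sampling for Bernoulli signups with hypothetical rates: each user is shown the page whose sampled posterior conversion rate is highest, so traffic shifts toward the better page without waiting for a significance verdict.

```python
import numpy as np

rng = np.random.default_rng(7)

def thompson_bernoulli(true_rates, n_users=50_000):
    """Beta-Bernoulli Thompson sampling over page variants."""
    k = len(true_rates)
    alpha, beta = np.ones(k), np.ones(k)  # uniform Beta(1, 1) priors
    for _ in range(n_users):
        arm = int(np.argmax(rng.beta(alpha, beta)))  # sample each posterior, pick the max
        converted = rng.random() < true_rates[arm]   # simulated signup
        alpha[arm] += converted
        beta[arm] += 1 - converted
    return alpha - 1, alpha + beta - 2               # conversions, impressions per page

conversions, impressions = thompson_bernoulli([0.10, 0.11, 0.12])  # hypothetical true rates
print("impressions per page:", impressions)
print("observed rates:      ", (conversions / impressions).round(3))
```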
> Back to the dashboard experiment: after you applied the Bonferroni correction you got... nothing.
I guess you got something: users are not sensitive to these changes, or any effect is too small to detect with your current sample size/test setup.
In a startup scenario, I'd quickly move on and possibly ship all developed options if good enough.
Also, running A/B tests might not be the most appropriate method in such a scenario. What about user-centric UX research methods?
When reading this article, be aware that there are some percent signs missing, and their absence might cause confusion. For example:
> After 9 peeks, the probability that at least one p-value dips below 0.05 is: 1 − (1 − 0.05)^9 = 37.
There should be a percent sign after that 37. (Probabilities cannot be greater than one.)
> Imagine you're a product manager trying to optimize your website’s dashboard. Your goal is to increase user signups.
This would be Series B or later right? I don't really feel like it's a core startup behavior.
Good news: no p-value threshold needs to be passed to switch from one UI layout to another. As long as they all cost the same amount of money to host/maintain/whatever, the point estimate is sufficient. The reason is, at the end of the day, some layout has to be shown, and if each option had an equal number of visitors during the test, you can safely pick the one with the most signups.
When choosing one of several A/B test options, a hypothesis test is not needed to validate the choice.
Yes, but assuming it was enhancing something already there, it was all pointless work.
P-values are something new for me, so the post starts with prerequisites in mind that I'm missing. Though I can go search by myself, would anyone have some online resources I can follow and test myself against to recommend, please?
Here you go
https://pmc.ncbi.nlm.nih.gov/articles/PMC5187603/
Wouldn't the correct tool here be a multi-armed bandit optimization, like an epsilon-greedy algorithm?
If you like reading blogs, I suggest Spotify's: https://engineering.atspotify.com/2023/03/choosing-sequentia...
1 - (1-0.95)^9 = 64
Did they generate this blog post with AI? That math be hallucinating. Don’t need a calculator to see that.
Read a few more posts and it shouts GPT occasionally. Plus the author's (as I like to call them still) role is listed as 'Content Engineer' which isn't inspiring either. Too bad, the topics sounded interesting.
I’m so confused by the math in this article. It’s also not 37. I can’t be the only person scratching their head.
Probably mangled the expression. (0.95)^9 = 63%, 1 - (0.95)^9 = 37%
Hmm, (1 - (1-0.95))^9 also = 63%. No idea why 64, closest I can see is 1-(0.95)^20 or 1-(1-0.05)^20 = 64%.
Yeah, I thought he was talking about 1 out of 20 features, but that's kind of why I was wondering if AI had written it. Sometimes it'll have mis-aligned figures etc.
I don't have any first hand experience with customer facing startups, SaaS or otherwise. How common is rigorous testing in the first place?
As you scale it improves. More often at a small scale you ask users and they’ll give you invaluable information. As you scale you abstract folks into buckets. At about 1 million MAU I’ve found A/B testing and p-values start to make sense.
Great read, thanks! Could you dive a little deeper into example 2 & pre-registration? Conceptually I understand how the probability of false positives increases with the number of variants.
But how does a simple act such as "pre-registration" change anything? It's not as if observing another metric that already existed changes anything about what you experimented with.
If you have many metrics that could possibly be construed as "this was what we were trying to improve", that's many different possibilities for random variation to give you a false positive. If you're explicit at the start of an experiment that you're considering only a single metric a success, it turns any other results you get into "hmm, this is an interesting pattern that merits further exploration" and not "this is a significant result that confirms whatever I thought at the beginning."
It's basically a variation on the multiple comparisons problem, but sneakier: it's easy to spend an hour going through data and, over that time, test dozens of different hypotheses. At that point, whatever p-value you'd compute for a single comparison isn't relevant, because after that many comparisons you'd expect at least one to dip below an uncorrected p = 0.05 by random chance.
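A quick simulation of that effect, assuming twenty independent metrics and no true difference between arms; in theory 1 − 0.95²⁰ ≈ 64% of such experiments will show at least one "significant" metric.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

n_sims, n_metrics, n_per_arm = 2000, 20, 1000
runs_with_false_positive = 0
for _ in range(n_sims):
    # an A/B test with no real effect on any of the 20 metrics
    a = rng.normal(size=(n_metrics, n_per_arm))
    b = rng.normal(size=(n_metrics, n_per_arm))
    # two-sample z-test per metric (unit variance assumed, for simplicity)
    z = (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(2 / n_per_arm)
    p = 2 * (1 - norm.cdf(np.abs(z)))
    runs_with_false_positive += (p < 0.05).any()
print(f"experiments with at least one 'significant' metric: "
      f"{runs_with_false_positive / n_sims:.0%}")  # about 64%
```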
There are many resources that will explain this rigorously if you search for the term “p-hacking”.
The TLDR as I understand it is:
All data has patterns. If you look hard enough, you will find something.
How do you tell the difference between random variance and an actual pattern?
It’s simple and rigorously correct to only search the data for a single metric; other methods, e.g. the Bonferroni correction (divide the significance threshold α by the number of comparisons k), exist, but are controversial (1).
Basically, are you a statistician? If not, sticking to the best practices in experimentation means your results are going to be meaningful.
If you see a pattern in another metric, run another experiment.
[1] - https://pmc.ncbi.nlm.nih.gov/articles/PMC1112991/
1−(1−0.05)^9=64 (small mistake; should be ^20)
Aside from the p-values, I don't understand the reasoning behind whatever "experiment" is being used for the A/B testing. What test is being done whose result is interpreted as "A is winner"? The discussion is about those being separate comparisons, and yeah, okay, but what are they comparisons of? Each group in isolation vs. all the others? If (as the article says) the hypothesis is "layout influences signup behavior" then it seems more reasonable to do a chi-squared test on a contingency table of layout vs. signed-up-or-didn't, which would give you one p-value for "is there anything here at all".
And then, if there isn't... it means you can just ship whatever you want! The real root cause of p-hacking is glossed over in the article: "Nobody likes arriving empty-handed to leadership meetings." This is the corporate equivalent of "no one will publish a null result", and is just as harmful here. The statistical techniques described are fine, but there's not necessarily a reason to fortify your stats against multiple comparisons rather than just accepting a null result.
And you can, because of the other thing I kept thinking when reading this: you have to ship something. There isn't really a "control" condition if you're talking about building a website from scratch. So whether the result is null doesn't really matter. It's not like comparing different medicines or fertilizers or something where if none of them work you just do nothing; there is no "do nothing" option in this situation. So why not just take a simple effect measurement (e.g., proportion who signed up) and pick the layout that performs best? If that result is statistically significant, great, it means you picked the best one, and if it's not, it just means it doesn't matter which one you pick, so the one you picked is still fine. (And if you have an existing design and you're trying to see if a new one will be better, the null result just means "there's no reason to switch", which means the existing design is also fine.)
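A minimal sketch of the omnibus chi-squared test suggested above, with a hypothetical layout-by-signup table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# hypothetical counts per layout: signed up vs. did not
table = np.array([
    [120, 880],   # layout A
    [135, 865],   # layout B
    [150, 850],   # layout C
    [118, 882],   # layout D
])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
# one p-value for "does layout influence signups at all?";
# only if it's small is it worth asking which layout differs
```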
I would look in the direction of SPRT (the sequential probability ratio test):
https://en.wikipedia.org/wiki/Sequential_probability_ratio_t...
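A minimal sketch of Wald's SPRT for a single conversion rate; the rates and error levels (p0, p1, alpha, beta) below are placeholders, not recommendations:

  import math

  def sprt_decision(outcomes, p0=0.05, p1=0.07, alpha=0.05, beta=0.20):
      """Wald's SPRT: H0 says the conversion rate is p0, H1 says it is p1.
      Takes a sequence of 0/1 outcomes and stops as soon as it can decide."""
      upper = math.log((1 - beta) / alpha)  # cross this -> accept H1
      lower = math.log(beta / (1 - alpha))  # cross this -> accept H0
      llr = 0.0
      for n, converted in enumerate(outcomes, start=1):
          llr += math.log(p1 / p0) if converted else math.log((1 - p1) / (1 - p0))
          if llr >= upper:
              return "accept H1 (lift)", n
          if llr <= lower:
              return "accept H0 (no lift)", n
      return "keep sampling", len(outcomes)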
Why don't the p-values in the first figure sum to 1?
related book: Trustworthy Online Controlled Experiments
https://experimentguide.com/
Don't do any of this. It's very outdated advice. And you're going to get it wrong anyway. These threshold adjustment methods were invented before we had access to reasonable computers.
There's a far simpler method that covers every case: permutation tests. https://bookdown.org/ybrandvain/Applied-Biostats/perm1.html
You shuffle the data. Say you want to know if viewing time is affected by color. Literally shuffle the color labels against the viewing times, recompute your statistic, and repeat many times. Then look at that null distribution: is the value you actually observed extreme compared to what shuffling alone produces?
As long as you shuffle everything related to your experiment and you don't double dip into the data, you're going to get things right.
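A minimal sketch of that shuffle for the viewing-time-by-color example (pure Python, hypothetical data, difference in means as the test statistic):

  import random
  from statistics import mean

  def permutation_test(times_red, times_blue, n_perm=10_000, seed=0):
      """Two-sided permutation test on the difference in mean viewing time."""
      rng = random.Random(seed)
      observed = abs(mean(times_red) - mean(times_blue))
      pooled = list(times_red) + list(times_blue)
      n_red = len(times_red)
      hits = 0
      for _ in range(n_perm):
          rng.shuffle(pooled)  # break any real link between color and time
          diff = abs(mean(pooled[:n_red]) - mean(pooled[n_red:]))
          if diff >= observed:
              hits += 1
      return (hits + 1) / (n_perm + 1)  # p-value: how extreme was the real split?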
This also has the big advantage that it doesn't overcorrect like traditional methods, which apply corrections so strict that eventually it's impossible to get a significant result.
This post hasn't even begun scratching the surface on what can go wrong with traditional tests. Just don't use them.
This has nothing to do with speed or rigor. Permutation tests are much simpler to run and faster to analyze. Sadly we keep teaching crappy statistics to our students.
Permutation tests don't account for family-wise error rate effects, so I'm curious why you would say that "it doesn't overcorrect like traditional methods".
I'm also curious why you say those "cover every case", because permutation tests tend to be underpowered, and also tend to be cumbersome when it comes to constructing confidence intervals of statistics, compared to something like the bootstrap.
Don't get me wrong -- I like permutation tests, especially for their versatility, but as one tool out of a bunch of methods.
Even though this post says exactly the thing that most Proper Analysts will say (and write long LinkedIn posts about, where other Proper Analysts congratulate them for standing up for Proper Analysis in the face of Evil And Stupid Business Dummies who just want to make bad decisions based on too little data), it's wrong. The Jedi Bell Curve meme is in full effect on this topic, and I say this as someone who took years to get over the midwit hump and correct my own mistaken beliefs.
The business reality is, you aren't Google. You can't collect a hundred million data points for each experiment that you run so that you can reliably pick out 0.1% effects. Most experiments will have a much shorter window than any analyst wants them to, and will have far too few users, with no option to let them run longer. You still have to make a damned decision, now, and move on to the next feature (which will also be tested in a heavily underpowered manner).
Posts like this say that you should be really, REALLY careful about this, and apply Bonferroni corrections and make sure you're not "peeking" (or, if you do peek, apply corrections that are even more conservative), preregister, etc. All the math is fine, sure. But if you take this very seriously and are in the situation that most startups are in, where the data is extremely thin and you need to move extremely fast, the end result is that you should reject almost every experiment (and, if you're leaning on tests, every feature). That's the "correct" decision, academically, because most features lie in the sub-5% impact range on almost any metric you care about, and with a small number of users you'll never have enough power to pick out effects that small (typically you'd want maybe 100k, depending on the metric you're looking at, and YOU probably have a fraction of that many users).
But obviously the right move is not to just never change the product because you can't prove that the changes are good - that's effectively applying a very strong prior in favor of the control group, and that's problematic. Nor should you just roll out whatever crap your product people throw at the wall: while there is a slight bias in most experiments in favor of the variant, it's very slight, so your feature designers are probably building harmful stuff about half the time. You should apply some filter to make sure they're helping the product and not just doing a random walk through design space.
The best simple strategy in a real world where most effect sizes are small and you never have the option to gather more data really is to do the dumb thing: run experiments for as long as you can, pick whichever variant seems like it's winning, rinse and repeat.
Yes, you're going to be picking the wrong variant way more often than your analysts would prefer, but that's way better than never changing the product or holding out for the very few hugely impactful changes that you are properly powered for. On average, over the long run, blindly picking the bigger number will stack small changes, and while a lot of those will turn out to be negative, your testing will bias somewhat in favor of positive ones and add up over time. And this strategy will provably beat one that does Proper Statistics and demands 95% confidence or whatever equivalent Bayesian criteria you use, because it leaves room to accept the small improvements that make up the vast majority of feature space (there's a toy simulation of this at the end of this comment).
There's an equivalent and perhaps simpler way to justify this, which is to throw out the group labels: if we didn't know which one was the control and we had to pick which option was better, then quite obviously, regardless of how much data we have, we just pick the one that shows better results in the sample we have. Including if there's just a single user in each group! In an early product, this is TOTALLY REASONABLE, because your current product sucks, and you have no reason to think that the way it is should not be messed with. Late lifecycle products probably have some Chesterton's fence stuff going on, so maybe there's more of an argument to privilege the control, but those types of products should have enough users to run properly powered tests.
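If you want to sanity-check the "pick the bigger number" claim, here's a toy simulation to play with. Every number in it (baseline rate, effect distribution, sample size, number of experiments) is made up, so treat it as a sketch of the argument, not proof:

  import random
  from math import sqrt

  random.seed(1)

  def observed_rates(base, effect, n):
      """One underpowered A/B test: return the observed conversion rates."""
      a = sum(random.random() < base for _ in range(n)) / n
      b = sum(random.random() < base + effect for _ in range(n)) / n
      return a, b

  def significant(a, b, n, z_crit=1.96):
      """Crude pooled two-proportion z-test at ~95% confidence."""
      p = (a + b) / 2
      se = sqrt(2 * p * (1 - p) / n)
      return se > 0 and abs(b - a) / se > z_crit

  lift_pick_bigger = 0.0  # ship the variant whenever it merely looks better
  lift_stat_sig = 0.0     # ship only on a statistically significant win
  for _ in range(2000):
      effect = random.gauss(0.001, 0.004)  # mostly tiny true effects, some negative
      a, b = observed_rates(base=0.05, effect=effect, n=2000)
      if b > a:
          lift_pick_bigger += effect
      if b > a and significant(a, b, n=2000):
          lift_stat_sig += effect
  print(lift_pick_bigger, lift_stat_sig)  # accumulated *true* lift per strategy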
Example 01 is basically the “green jellybeans cause acne” problem
https://xkcd.com/882/
Yes! (Correct) pre-registration is everything. ("Correct" meaning: There's no point "pre-registering" if you fail to account for the number of tests you'll do -- but hopefully the fact that you have thought to pre-register at all is a strong indication that you should be performing such corrections.)
That said, I agree with the other poster here about how important this really is for startups. It's critical to know if the drug really improves lung function; it's probably not critical to know whether the accent colour on your landing page should be mauve or aqua blue.