The Gambler’s Fallacy and Every Ecommerce A/B Test Ever
- Belinda Anderton
- Oct 3
- 4 min read
I collect data. Mostly about being wrong. It’s exceptionally useful because being wrong teaches you where your intuition fails, and nowhere does intuition fail more spectacularly than in probability and statistics.
I came to mathematics late. Failed it spectacularly in school, actually. The kind of failure where you convince yourself you’re “just not a maths person” and move on with your life. Then somewhere in my twenties, I discovered I loved it. Not the abstract proofs or the theoretical elegance (though those have their place), but the applied statistics, the probability, the pattern recognition in data. Turns out I’m phenomenal at it. I just needed it to be about something real instead of arbitrary textbook problems.
I’ve applied this same pattern recognition to people. For years I called it instinct (that gut feeling about whether someone was trustworthy, competent, or full of it). Now I realize it was never instinct. It was data signals. Microexpressions, word choices, behavioral consistency, response times, what people prioritize when under pressure. Hundreds of tiny data points synthesized into a pattern. The same skill that lets me spot anomalies in conversion data lets me read people. It’s just applied statistics with a different dataset.
This is relevant because the gambler’s fallacy is both a math problem and a pattern recognition problem, and most people who work in ecommerce are better at the latter than the former. Which is exactly how they get it wrong.
The gambler’s fallacy is simple: after a coin lands heads five times in a row, people believe tails is “due.” It’s not. Each flip is independent. The coin has no memory. The probability remains 50/50. Everyone knows this intellectually, but almost nobody believes it emotionally. This same cognitive failure destroys ecommerce A/B testing every single day.
Why your weeklong test means nothing.
Here’s a scenario I’ve seen dozens of times: a team tests two button colors. After a week, variant B has a 3.2% conversion rate versus variant A’s 2.9%. Someone declares victory, ships variant B, and moves on. Six weeks later, conversion is back to 2.9%. What happened?
Random variation happened. Natural noise in the data. The appearance of a pattern where none existed.
Let’s do the actual math. Assuming 1,000 visitors per variant over that week, variant A converted 29 people and variant B converted 32 people. The question isn’t whether B performed better in that sample (it did, by definition). The question is: what’s the probability this difference represents a real effect versus random chance?
This requires calculating statistical significance using a two-proportion z-test. The formula for the z-score is:
z = (p₂ - p₁) / √[p(1-p)(1/n₁ + 1/n₂)]
Where p₁ and p₂ are the two observed conversion rates, n₁ and n₂ are the visitors per variant, and p is the pooled conversion rate: (29 + 32) / (1000 + 1000) = 0.0305
Plugging in the numbers:
z = (0.032 - 0.029) / √[0.0305(0.9695)(0.002)]
z = 0.003 / √0.0000592
z = 0.003 / 0.0077
z ≈ 0.39
A z-score of 0.39 corresponds to a two-tailed p-value of about 0.70.
In plain language: there’s a 70% chance you’d see this size difference (or larger) purely by random chance even if the buttons were identical. This is not statistically significant. You’ve learned nothing except that you don’t understand statistics.
For 95% confidence (the standard threshold), you need a z-score of at least 1.96, which corresponds to a p-value of 0.05 or less.
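If you'd rather not do this by hand, here's a minimal sketch of the same calculation in Python, using only the standard library (the function name is mine, not from any testing tool), with the hypothetical 29-vs-32 numbers from above:

```python
# A minimal sketch of the two-proportion z-test described above (stdlib only).
# Numbers are the hypothetical example: 29/1,000 vs 32/1,000 conversions.
from statistics import NormalDist
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p = two_proportion_z_test(29, 1000, 32, 1000)
print(f"z = {z:.2f}, p = {p:.2f}")  # z ≈ 0.39, p ≈ 0.70 -- nowhere near 0.05
```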
The early stopping problem.
The gambler’s fallacy’s evil twin is early stopping based on “obvious” winners. A test runs for 48 hours. Variant B is winning by 8%. Someone says “this is clearly better, why wait?” and ships it. This is statistical suicide.
With small sample sizes, random variation produces dramatic swings. If you run 20 A/B tests and stop each one the moment any variant hits “significance,” the chance of at least one false positive is roughly 64%, not 5%. This is called p-hacking, and it’s a big part of why so many published research findings fail to replicate.
The math behind this involves understanding that statistical significance at α = 0.05 means: if there’s truly no difference, you’ll incorrectly reject the null hypothesis 5% of the time. But if you peek at your results repeatedly and stop the moment you hit significance, you’re conducting multiple hypothesis tests, each with its own 5% false positive rate. The cumulative probability of at least one false positive across k tests is approximately: 1 - (0.95)^k
For 20 tests: 1 - (0.95)^20 ≈ 0.64
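The arithmetic compounds quickly. Here's a tiny sketch of it (assuming independent looks, which is the same simplification the formula above makes):

```python
# How the false-positive probability compounds across k independent tests,
# each run at alpha = 0.05 (same simplification as the 1 - (0.95)^k formula).
alpha = 0.05
for k in (1, 5, 10, 20):
    p_at_least_one = 1 - (1 - alpha) ** k
    print(f"{k:>2} tests -> {p_at_least_one:.0%} chance of a false positive")
# Prints 5% for 1 test, 23% for 5, 40% for 10, and 64% for 20.
```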
This is why you need to determine your sample size in advance and commit to it. The sample size calculation requires knowing your baseline conversion rate, minimum detectable effect, and desired statistical power. For a baseline of 3%, detecting a 20% relative improvement (0.6 percentage points) with 80% power and 95% confidence requires approximately 13,900 visitors per variant (a quick way to check this yourself is sketched below).
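Here's a rough sketch of that calculation, using the standard two-proportion normal approximation (stdlib only; the function and parameter names are mine, not from any particular testing tool):

```python
# A rough sketch of the sample-size calculation, using the standard
# two-proportion normal approximation (stdlib only).
from statistics import NormalDist
import math

def visitors_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect the lift at the given power."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% confidence
    z_power = NormalDist().inv_cdf(power)          # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

print(visitors_per_variant(0.03, 0.20))  # ≈ 13,900 visitors per variant
```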
Most ecommerce teams test with a few hundred visitors and declare victory. They’re reading tea leaves.
Pattern recognition in noise.
Humans are spectacular at finding patterns. We’re so good at it that we find them where they don’t exist. This served us well evolutionarily (better to see a predator in random shadows than miss a real one), but it’s catastrophic for data analysis.
I’ve watched teams attribute conversion changes to button color, page layout, or copy variations when the actual cause was: it was Tuesday, the weather was different, there was a competing promotion, a celebrity wore the product, an email went out, the traffic source changed, or pure random variation. None of these factors were controlled for.
The A/B test measured noise and called it signal.
The correct approach requires:
- Pre-determined sample size based on power analysis
- No peeking until you hit that sample size
- Controlling for known confounds (day of week, traffic source, seasonality)
- Understanding that statistical significance ≠ practical significance
- Running follow-up tests to confirm findings
What I actually do.
I still run A/B tests. I just accept that most of them will find nothing, which is itself information. I calculate required sample sizes before starting. I resist the urge to stop early. I treat “significant” results with suspicion until replicated. I collect data about being wrong because that’s how you eventually become slightly less wrong.
The gambler’s fallacy isn’t believing the coin is due for tails. It’s believing you can predict the next flip based on the last five. In ecommerce, it’s believing your 48-hour test with 200 visitors told you anything except that you don’t understand math.
The coin has no memory. Neither does your traffic.