Interpretation does matter
Whether you’re building a product, running a marketing campaign, or doing research, interpreting experiment results can be tricky. It’s easy to fall into common traps that distort your conclusions or waste your time. Let’s talk about some of the biggest mistakes people make, why they matter, and how you can avoid them.
1. Statistical Significance: What It Really Means
A lot of folks think “not statistically significant” means “no effect.” That’s not true! If your confidence interval includes zero, it just means you can’t be sure there’s an effect. But you also can’t be sure there isn’t one: there could be a big effect hiding in there.
Why this matters: If your confidence intervals are wide, you might be missing large effects simply because your data isn’t precise enough.
What to do: Don’t just look for whether zero is in the interval. Consider the range of possible effects. If you want to be sure you’re not causing harm, check that the lower bound of your confidence interval is above any value you’d consider risky.
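A minimal sketch of that idea (not from the original post): instead of only asking “does the 95% CI contain zero?”, compare the interval’s lower bound against a harm threshold chosen before the experiment. The group names, the threshold, and the simulated data below are all hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=4.0, size=500)    # e.g. minutes per session
treatment = rng.normal(loc=10.2, scale=4.0, size=500)

# Difference in means and a 95% confidence interval around it
diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
z = stats.norm.ppf(0.975)
lower, upper = diff - z * se, diff + z * se

harm_threshold = -0.5  # the largest drop you are willing to risk (hypothetical)

print(f"Estimated effect: {diff:.2f}  (95% CI: {lower:.2f} to {upper:.2f})")
if lower > harm_threshold:
    print("Lower bound clears the harm threshold: unlikely to be causing meaningful harm.")
else:
    print("The CI still includes harmful effects: the data are too imprecise to rule them out.")
```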
2. Small Experiments = Big Problems
Running an experiment with too few participants is like trying to predict the weather by looking out the window for five minutes. Small sample sizes lead to noisy results and wide confidence intervals, making it hard to detect real changes.
Why this matters: You could miss something big, or get fooled by random noise.
What to do: Estimate the size of the effect you care about and calculate how many participants you need to reliably detect it (this is called a “power calculation”). Don’t rely on default sample sizes; tailor the experiment to the effect you actually need to detect.
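Here is a minimal sketch of a power calculation, assuming you can express the smallest effect you care about as a standardized effect size (Cohen’s d). The numbers (a 0.2-unit lift on a metric with standard deviation 4.0) are hypothetical.

```python
from statsmodels.stats.power import tt_ind_solve_power

minimum_effect = 0.2      # smallest change you care about, in metric units
metric_std = 4.0          # standard deviation of the metric, from historical data
effect_size = minimum_effect / metric_std  # Cohen's d

n_per_group = tt_ind_solve_power(
    effect_size=effect_size,
    alpha=0.05,            # false positive rate you will tolerate
    power=0.8,             # probability of detecting the effect if it really exists
    alternative="two-sided",
)
print(f"Participants needed per group: {int(round(n_per_group))}")
```

Tiny effects relative to the metric’s noise drive the required sample size up quickly, which is exactly why defaults are a poor substitute for doing this arithmetic for your own metric.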
3. Health Check Failures: Red Flags
Health checks are designed to catch problems in your experiment setup, like unbalanced groups or unexpected data issues. If a health check fails, it’s a warning sign.
Why this matters: Ignoring these warnings means drawing conclusions from a broken comparison, so the numbers you report may not mean what you think they do.
What to do: Investigate and fix the underlying issue before drawing conclusions. Only in rare, well-understood cases should you proceed despite a failed health check.
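One of the most common health checks is a sample ratio mismatch (SRM) test: if you intended a 50/50 split, a chi-square test tells you whether the observed group sizes are plausible under that split. A minimal sketch; the counts are hypothetical.

```python
from scipy.stats import chisquare

observed = [50_912, 49_088]          # users actually assigned to control / treatment
expected = [sum(observed) / 2] * 2   # what a true 50/50 split would give

stat, p_value = chisquare(observed, f_exp=expected)
print(f"SRM check p-value: {p_value:.4f}")
if p_value < 0.001:
    print("Health check failed: the split is off. Investigate assignment before trusting results.")
else:
    print("No evidence of a sample ratio mismatch.")
```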
4. Don’t Dismiss Results Because of Pre-Experiment Differences
Sometimes, the groups in your experiment differ even before the experiment starts. This can happen by chance, especially with random assignment.
Why this matters: It’s tempting to throw out the results when you see these differences, but that’s usually unnecessary. Statistical methods can adjust for these imbalances.
What to do: Use methods that account for pre-experiment differences, and remember that confidence intervals already factor in the possibility of random imbalances.
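One common way to do this is regression adjustment: include each user’s pre-experiment value of the metric as a covariate when estimating the treatment effect. The sketch below uses simulated data and hypothetical column names, and is one illustration of the general approach rather than a specific recipe from the post.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 2_000
pre_metric = rng.normal(10, 4, n)        # metric measured before the experiment
treated = rng.integers(0, 2, n)          # random assignment to control (0) or treatment (1)
post_metric = pre_metric * 0.6 + treated * 0.3 + rng.normal(0, 3, n)

df = pd.DataFrame({"post": post_metric, "pre": pre_metric, "treated": treated})

# Unadjusted vs. adjusted estimate of the treatment effect
unadjusted = smf.ols("post ~ treated", data=df).fit()
adjusted = smf.ols("post ~ treated + pre", data=df).fit()

print("Unadjusted effect:", round(float(unadjusted.params["treated"]), 3))
print("Adjusted effect:  ", round(float(adjusted.params["treated"]), 3))
print("Adjusted 95% CI:  ", adjusted.conf_int().loc["treated"].round(3).tolist())
```

Adjusting for the pre-period metric typically tightens the confidence interval as well, since it soaks up variance that has nothing to do with the treatment.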
5. Beware of P-Hacking
P-hacking happens when you keep searching through different metrics, time periods, or subgroups until you find something statistically significant. This practice increases the risk of “false positives”: results that look real but are actually just due to chance.
Why this matters: The more you look, the more likely you are to find something that isn’t actually there.
What to do: Decide what you’re going to measure before you start, and stick to it. If you do explore, treat those findings as ideas for future experiments, not as proof.
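If you do end up looking at many metrics, at least correct for multiple comparisons before taking any single “significant” result seriously. A minimal sketch using a Benjamini-Hochberg correction; the metric names and p-values are hypothetical.

```python
from statsmodels.stats.multitest import multipletests

metric_names = ["clicks", "time_on_site", "purchases", "returns", "support_tickets"]
raw_p_values = [0.04, 0.30, 0.01, 0.70, 0.048]  # hypothetical per-metric p-values

# Benjamini-Hochberg controls the expected share of false discoveries
reject, adjusted_p, _, _ = multipletests(raw_p_values, alpha=0.05, method="fdr_bh")

for name, p, adj, sig in zip(metric_names, raw_p_values, adjusted_p, reject):
    print(f"{name:15s} raw p={p:.3f}  adjusted p={adj:.3f}  significant after correction: {sig}")
```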
6. A/A Tests and False Positives
An A/A test is when you split your sample into two groups but don’t apply any treatment. Even in these cases, you might see some statistically significant results just by chance.
Why this matters: This is a normal part of statistical testing and doesn’t mean your system is broken.
What to do: Expect a small percentage of false positives, especially if you’re looking at many metrics.
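You can see how large that “small percentage” is by simulating A/A tests yourself: draw both groups from the same distribution, run a t-test, and count how often it comes back significant. At a 0.05 significance level you should see roughly 5% false positives, which is expected behavior, not a broken system. A minimal, self-contained sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_per_group, alpha = 1_000, 500, 0.05

false_positives = 0
for _ in range(n_tests):
    a = rng.normal(10, 4, n_per_group)
    b = rng.normal(10, 4, n_per_group)   # same distribution: no true effect
    _, p = stats.ttest_ind(a, b)
    false_positives += p < alpha

print(f"Significant results in {n_tests} A/A tests: {false_positives} "
      f"({false_positives / n_tests:.1%}, expected about {alpha:.0%})")
```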
7. Filtering on Post-Experiment Variables Can Bias Results