A/B Testing Sample Size Calculator
Estimate how many users you need in each variant before launching an experiment. This calculator uses a two-sample proportion test to help you plan statistically sound A/B tests based on baseline conversion rate, minimum detectable effect, confidence level, and statistical power.
Calculator
Expert Guide to Using an A/B Testing Sample Size Calculator
An A/B testing sample size calculator helps marketers, product managers, analysts, and growth teams answer one of the most important experimental design questions: how many users do we need before we can trust the outcome of a test? It sounds simple, but this planning step is often the difference between a reliable learning program and a series of inconclusive experiments. If your sample is too small, random noise can dominate the result. If your sample is unnecessarily large, you may waste valuable traffic and slow decision making. A strong calculator brings balance by tying sample requirements to statistical confidence, statistical power, baseline performance, and the minimum effect you care about detecting.
At a practical level, an A/B test sample size estimate tells you the number of visitors or users required in each variant, often control and treatment, to detect a difference in conversion rates. This matters because many optimization teams focus heavily on design changes, targeting strategies, and messaging while underestimating the impact of experimental planning. A more disciplined planning process leads to cleaner measurement, faster prioritization, and fewer false wins. If you know in advance that a tiny uplift would require a massive amount of traffic, you can either revise your hypothesis or reserve that experiment for a higher traffic page.
Why sample size matters so much in experimentation
Every A/B test operates under uncertainty. Even if your true conversion rate is stable, observed results vary from sample to sample simply because of randomness. Statistical testing is designed to manage this uncertainty, but it works only when enough observations are collected. Sample size is therefore central to two risks:
- False positives: concluding that a variation won when it did not actually improve the metric.
- False negatives: failing to detect a real improvement because the experiment was underpowered.
Most serious experimentation programs target a 95% confidence level and 80% power. In plain language, the team accepts a 5% chance of declaring a difference when none actually exists and wants an 80% chance of detecting a real effect at least as large as the minimum detectable effect. These settings are conventions rather than laws, but they reflect a widely accepted compromise between rigor and operational speed.
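As a rough illustration, the confidence and power settings map onto critical values of the standard normal distribution. The sketch below uses only Python's standard library and assumes a two-sided test; the function name `critical_values` is hypothetical and not part of any calculator described here.

```python
from statistics import NormalDist

def critical_values(confidence=0.95, power=0.80):
    """Return (z_alpha, z_beta) for a two-sided test (an assumption of this sketch)."""
    alpha = 1.0 - confidence              # accepted false positive rate
    std_normal = NormalDist()             # mean 0, standard deviation 1
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)
    z_beta = std_normal.inv_cdf(power)
    return z_alpha, z_beta

z_alpha, z_beta = critical_values(0.95, 0.80)
print(round(z_alpha, 2), round(z_beta, 2))  # 1.96 0.84
```

Raising either setting pushes the corresponding critical value up, which is why stricter confidence or higher power costs more traffic.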
The four core inputs you need
- Baseline conversion rate: your current control rate. If your page converts at 10%, the variability and sample requirement will differ from a page converting at 1% or 40%.
- Minimum detectable effect, or MDE: the smallest lift worth detecting. A 2% relative lift and a 20% relative lift imply very different sample requirements.
- Confidence level: commonly 90%, 95%, or 99%. Higher confidence requires more observations.
- Statistical power: commonly 80% or 90%. Higher power also increases the needed sample.
The most frequently misunderstood input is the MDE. Teams often choose an MDE because it looks optimistic rather than because it is meaningful. A better method is to tie MDE to business value. For example, if a signup flow converts at 8% and each additional signup creates measurable downstream revenue, you can estimate the smallest lift that justifies implementation costs. That uplift becomes the effect you want enough power to detect.
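One way to tie MDE to business value is a simple break-even calculation. The sketch below is illustrative only: the function name and every input except the 8% baseline are hypothetical, and it assumes each extra signup is worth a fixed amount over a fixed payback window.

```python
def break_even_relative_lift(baseline_rate, visitors_per_month,
                             value_per_conversion, implementation_cost,
                             payback_months):
    """Smallest relative lift whose extra conversions repay the cost."""
    baseline_conversions = baseline_rate * visitors_per_month * payback_months
    extra_conversions_needed = implementation_cost / value_per_conversion
    return extra_conversions_needed / baseline_conversions

# Hypothetical inputs; only the 8% baseline comes from the text.
lift = break_even_relative_lift(
    baseline_rate=0.08,
    visitors_per_month=50_000,
    value_per_conversion=20.0,
    implementation_cost=4_000.0,
    payback_months=6,
)
print(f"Break-even relative lift: {lift:.2%}")
```

A lift below the break-even value would not repay the change even if it were real, so an MDE at or above that value is a reasonable starting point.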
How the calculator works behind the scenes
This calculator uses the standard sample size logic for comparing two proportions, which is appropriate for conversion metrics like clicks, signups, purchases, starts, and completed forms. The baseline conversion rate represents the control group. The MDE determines the target treatment conversion rate. Then the calculator uses critical values from the normal distribution to account for your selected confidence level and power. The output is the required sample size per variant and the total traffic needed across both variants.
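The steps above can be sketched in a few lines of Python. The source does not state which exact variant of the two-proportion formula the calculator uses, so this sketch assumes the common unpooled normal approximation with a two-sided test and a relative MDE. Pooled-variance or continuity-corrected variants give slightly different numbers. The function name `sample_size_per_variant` is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    """Users per variant for a two-sided, two-proportion z-test (unpooled variance)."""
    p1 = baseline                              # control conversion rate
    p2 = baseline * (1.0 + relative_mde)       # target treatment rate
    z_alpha = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)                        # round up to whole users

per_variant = sample_size_per_variant(0.05, 0.10)
print(per_variant, 2 * per_variant)            # per variant, total across both
```

Because the effect size is squared in the denominator, halving the MDE roughly quadruples the required sample.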
For equal allocation, the classic setup assumes 50% of traffic goes to the control and 50% goes to the treatment. If you use a different split, the smaller group fills more slowly and the total sample needed for the same power goes up, so the test usually runs longer. Equal allocation is usually the most statistically efficient choice when the primary goal is to compare two versions under similar conditions.
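To see how an uneven split affects the requirement, the sketch below generalizes the unpooled normal approximation to a treatment-to-control ratio. It is an assumption-laden illustration, not the calculator's own code; `group_sizes` and `treatment_share` are hypothetical names, and the 5.0% to 5.5% scenario is only an example.

```python
import math
from statistics import NormalDist

def group_sizes(p1, p2, treatment_share=0.5, confidence=0.95, power=0.80):
    """Return (control_n, treatment_n) for a given share of traffic in treatment."""
    ratio = treatment_share / (1.0 - treatment_share)   # treatment_n / control_n
    z_alpha = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    # Variance of the difference, expressed per control user.
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2) / ratio
    control_n = math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
    treatment_n = math.ceil(control_n * ratio)
    return control_n, treatment_n

for share in (0.5, 0.8):
    control_n, treatment_n = group_sizes(0.05, 0.055, treatment_share=share)
    print(share, control_n, treatment_n, control_n + treatment_n)
```

In this sketch the 80/20 split needs a noticeably larger total than the 50/50 split to reach the same power.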
| Scenario | Baseline rate | MDE | Confidence / Power | Approximate sample per variant |
|---|---|---|---|---|
| Homepage CTA test | 5.0% | +10% relative lift | 95% / 80% | About 31,200 users |
| Checkout completion test | 40.0% | +5% relative lift | 95% / 80% | About 9,500 users |
| Lead form redesign | 12.0% | +15% relative lift | 95% / 80% | About 5,400 users |
| Low traffic B2B pricing page | 2.0% | +20% relative lift | 95% / 80% | About 21,100 users |
These examples show a pattern many teams discover only after running several tests: small effects on low conversion pages need a lot of traffic. That is why prioritization matters. A test on a low traffic page with a tiny expected lift can be statistically expensive. In contrast, a test on a higher converting funnel step or a page with much larger traffic can often produce interpretable results faster.
Interpreting relative uplift versus absolute change
Many calculators let you express MDE as either a relative lift or an absolute change. The distinction is crucial. A relative lift of 10% means you multiply the baseline by 1.10. If the baseline is 10%, the target becomes 11%, which is a 1 percentage point absolute increase. In contrast, an absolute increase of 10 percentage points would move 10% to 20%, which is dramatically larger and would require far less traffic to detect because the effect is much bigger.
When stakeholders discuss goals informally, they often mix these terms. To avoid confusion, always state both values. For example, instead of saying, “We want a 10% improvement,” say, “We want to detect a relative lift of 10%, which means a movement from 10.0% to 11.0%.” This level of precision keeps forecasts aligned across analytics, design, and leadership.
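A small helper can make the "state both values" habit automatic. This is a minimal sketch; `describe_mde` is a hypothetical name, and it assumes rates are passed as fractions (0.10 for 10%).

```python
def describe_mde(baseline, relative_lift=None, absolute_change=None):
    """Return a sentence stating the MDE in both relative and absolute terms."""
    if (relative_lift is None) == (absolute_change is None):
        raise ValueError("Provide exactly one of relative_lift or absolute_change")
    if relative_lift is not None:
        target = baseline * (1.0 + relative_lift)
    else:
        target = baseline + absolute_change
    relative = target / baseline - 1.0
    points = (target - baseline) * 100.0       # percentage points
    return (f"{baseline:.1%} -> {target:.1%} "
            f"({relative:+.0%} relative, {points:+.1f} percentage points)")

print(describe_mde(0.10, relative_lift=0.10))
print(describe_mde(0.10, absolute_change=0.10))
```

The two calls use the same "10%" figure but describe very different targets, which is exactly the ambiguity the explicit wording removes.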
Real world benchmarks and what they imply
Publicly reported conversion rates vary significantly by industry, funnel step, and audience quality. Broad benchmark studies often show ecommerce purchase conversion rates in the low single digits, while email opt-in or account creation flows can be much higher depending on traffic intent. The lower the baseline conversion rate, the noisier observed conversion counts are relative to their size, and the more traffic you typically need to detect a given relative change.
| Planning factor | If you lower it | If you raise it | Practical impact |
|---|---|---|---|
| Confidence level | Less required sample | More required sample | Higher confidence is more conservative. |
| Power | Less required sample | More required sample | Higher power reduces missed real wins. |
| MDE size | More required sample | Less required sample | Small lifts are expensive to detect. |
| Baseline rate (at a fixed relative MDE) | More required sample | Less required sample | Use recent stable data, not outdated averages. |
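The directions in the table can be checked numerically. The sketch below varies one input at a time around a reference scenario (5% baseline, 10% relative MDE, 95% confidence, 80% power). It assumes the unpooled normal approximation for two proportions; `sample_size` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def sample_size(baseline, relative_mde, confidence, power):
    """Users per variant, two-sided two-proportion z-test (unpooled variance)."""
    p1, p2 = baseline, baseline * (1.0 + relative_mde)
    z = (NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
         + NormalDist().inv_cdf(power))
    return math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)

reference = dict(baseline=0.05, relative_mde=0.10, confidence=0.95, power=0.80)
variations = {
    "reference": {},
    "confidence 99%": {"confidence": 0.99},
    "power 90%": {"power": 0.90},
    "MDE halved to 5%": {"relative_mde": 0.05},
}
for label, change in variations.items():
    n = sample_size(**{**reference, **change})   # override one input at a time
    print(f"{label:<18} {n:>8,}")
```

Of the three levers, shrinking the MDE is by far the most expensive, because the required sample scales with the inverse square of the effect.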
Common mistakes when calculating sample size
- Using a guessed baseline: if the baseline rate is stale or estimated from a different audience, the sample forecast can be misleading.
- Choosing an unrealistic MDE: if the expected lift is too ambitious, the plan may understate the true traffic needed.
- Stopping early when results look promising: peeking inflates false positive risk when not handled with a proper sequential framework.
- Ignoring multiple comparisons: if you test many variations or many primary metrics, false positive risk increases.
- Running the test during unstable periods: promotions, outages, holidays, and tracking changes can distort assumptions and outcomes.
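For the multiple comparisons point, one common and conservative remedy is the Bonferroni correction, which divides the accepted false positive rate across the number of comparisons. The source does not prescribe a specific correction, so treat this as one possible sketch; the function name is hypothetical and the formula assumes the unpooled two-proportion approximation.

```python
import math
from statistics import NormalDist

def sample_size_with_bonferroni(p1, p2, comparisons=1, alpha=0.05, power=0.80):
    """Per-variant sample size with alpha split across `comparisons` tests."""
    adjusted_alpha = alpha / comparisons             # Bonferroni correction
    z_alpha = NormalDist().inv_cdf(1.0 - adjusted_alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)

single = sample_size_with_bonferroni(0.05, 0.055, comparisons=1)
three = sample_size_with_bonferroni(0.05, 0.055, comparisons=3)
print(single, three)   # three comparisons need more users per variant
```

Planning for the stricter threshold up front is cheaper than discovering after the test that none of the comparisons survives correction.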
A particularly damaging mistake is launching a test without first estimating duration. If your required total sample is 80,000 users and your eligible traffic is only 10,000 per month, the test needs about eight months to finish. That is often too long if seasonality or product changes will interfere. In those cases, you may need to simplify the test, raise the minimum detectable effect, or choose a different metric with a higher event rate.
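The duration check is simple arithmetic, sketched below with the figures from the paragraph above. It assumes steady eligible traffic and that every eligible user enters the test; `estimated_duration` is a hypothetical name.

```python
import math

def estimated_duration(total_sample, eligible_users_per_month):
    """Return (months, whole weeks rounded up) needed to collect the sample."""
    months = total_sample / eligible_users_per_month
    weeks = math.ceil(months * 52 / 12)      # convert months to weeks, round up
    return months, weeks

months, weeks = estimated_duration(80_000, 10_000)
print(f"{months:.1f} months, roughly {weeks} weeks")
```

If the result exceeds the window in which conditions stay stable, that is a signal to revisit the MDE or the metric before launch.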
Best practices for trustworthy experiment planning
- Start with a single primary metric, such as purchase conversion or signup completion.
- Use the latest stable baseline from the same audience and device mix.
- Set MDE based on business value, not optimism.
- Commit to a runtime plan before launching the test.
- Monitor data quality, sample allocation, and tracking parity across variants.
- Once the planned sample size is reached, evaluate statistical significance, practical significance, and implementation cost together.
It is also wise to build a documentation habit. Record the baseline, MDE, confidence, power, sample requirement, launch date, and stopping rule in a test brief. This creates accountability and reduces retroactive interpretation. Mature experimentation teams treat pre-test planning as seriously as post-test analysis.
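A test brief can be as simple as an immutable record. The sketch below is one possible shape, not a prescribed format: the class name `ExperimentBrief` and all field values are hypothetical placeholders, and the sample figure assumes the unpooled two-proportion approximation.

```python
from dataclasses import dataclass, asdict
from datetime import date

@dataclass(frozen=True)   # frozen: fields cannot be changed after creation
class ExperimentBrief:
    baseline_rate: float
    relative_mde: float
    confidence: float
    power: float
    sample_per_variant: int
    launch_date: date
    stopping_rule: str

brief = ExperimentBrief(
    baseline_rate=0.05,
    relative_mde=0.10,
    confidence=0.95,
    power=0.80,
    sample_per_variant=31_231,        # placeholder from a planning calculation
    launch_date=date(2025, 1, 6),     # placeholder date
    stopping_rule="Stop when both variants reach the planned sample size",
)
print(asdict(brief))
```

Freezing the record means the plan cannot be quietly edited after results arrive, which supports the goal of reducing retroactive interpretation.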
Authoritative sources for deeper statistical guidance
If you want to go beyond a basic planning calculator, consult high quality methodological references. The U.S. Census Bureau provides helpful material on statistical testing concepts. The National Center for Biotechnology Information hosts educational resources on sample size and hypothesis testing through U.S. government supported infrastructure. For a more academic treatment of probability and inference, the Penn State Department of Statistics offers university-level statistical instruction that can help teams understand confidence, power, and test design in greater depth.
When to question the calculator output
No calculator should be treated as a substitute for judgment. If your experiment has clustered users, repeated exposures, heavy segmentation needs, multiple treatment arms, or major novelty effects, the simple two-proportion model may understate complexity. The same caution applies if your metric is continuous rather than binary, such as average order value or revenue per visitor. Those cases need a different power analysis framework. Still, for a classic conversion-focused A/B test, a well-configured sample size calculator is one of the most useful planning tools available.
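For a continuous metric, a common starting point is the normal-approximation formula for comparing two means. The sketch below assumes equal variances, equal allocation, and a two-sided test; the function name and the example inputs (a standard deviation of 40 and a target difference of 2, in the same currency units) are hypothetical. Heavily skewed revenue data may need a more careful treatment than this.

```python
import math
from statistics import NormalDist

def sample_size_for_means(std_dev, min_detectable_diff, confidence=0.95, power=0.80):
    """Users per variant to detect an absolute difference in means."""
    z_alpha = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    n = 2.0 * (z_alpha + z_beta) ** 2 * std_dev ** 2 / min_detectable_diff ** 2
    return math.ceil(n)

# Hypothetical inputs for an average order value test.
print(sample_size_for_means(std_dev=40.0, min_detectable_diff=2.0))
```

The key extra input is the metric's standard deviation, which has to be estimated from recent data just like the baseline conversion rate.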
In summary, an A/B testing sample size calculator is not just a mathematical convenience. It is a decision quality tool. It helps you determine whether an idea is testable, how long a test should run, and whether your traffic can support the question you want to answer. Teams that calculate sample size before launching experiments tend to avoid inconclusive tests, make more disciplined tradeoffs, and build stronger trust in experimentation overall. Use the calculator above to estimate your requirements, sanity check your assumptions, and plan tests that are both statistically credible and operationally realistic.