A/B Test Sample Size Calculator
Estimate how many visitors you need per variation before launching an experiment. This calculator uses a standard two-proportion sample size model to help marketers, product teams, CRO specialists, and analysts plan statistically sound A/B tests.
Interactive Sample Size Calculator
Enter your baseline conversion rate, minimum detectable effect, confidence level, power, and traffic assumptions. The calculator returns the required sample size per variant, total sample size, expected conversions, and a rough time estimate.
Expert Guide: How an A/B Test Sample Size Calculator Improves Experiment Quality
An A/B test sample size calculator helps you answer one of the most important questions in experimentation: how much traffic do you need before you can trust a result? Teams often focus on design, copy, offer strategy, page speed, or funnel steps, but the statistical planning stage is where strong experiments begin. If your test is too small, you can miss a meaningful improvement. If your test is oversized, you may waste time and traffic that could have been used for other learning opportunities. A well-built sample size estimate balances rigor, speed, and business practicality.
For most web experiments, the core problem is comparing two proportions. In plain language, you are checking whether version B converts at a meaningfully different rate than version A. To make that comparison reliable, you choose a baseline conversion rate, a minimum detectable effect, a significance level, and a desired statistical power. The calculator then estimates how many users or sessions need to see each version before your test has a reasonable chance of detecting that effect.
In practical terms: smaller effects require much larger samples, higher confidence requires more traffic, and stronger power also increases sample requirements. This is why experiment planning is not just a math exercise. It directly affects your roadmap, testing cadence, and time to insight.
What each calculator input means
- Baseline conversion rate: your current estimate of performance on the metric you care about, such as lead rate, purchase rate, account creation, or click-through rate.
- Minimum detectable effect: the smallest relative lift worth detecting. If your baseline is 5% and your minimum detectable effect is 10%, the calculator plans to detect a change from 5.00% to 5.50%.
- Confidence level: often set at 95%, which corresponds to a significance level (alpha) of 5%. This controls how strict you are about false positives. A stricter threshold means you need more data.
- Power: usually 80% or 90%. Power is the probability of detecting a real effect that is at least as large as your minimum detectable effect.
- Test type: two-sided tests check for both positive and negative change. One-sided tests only look in one direction and are less common in robust business experimentation.
- Traffic volume and number of variants: these inputs convert sample size into a rough duration estimate.
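The inputs above map onto a handful of numeric quantities. A minimal Python sketch of that translation follows. The function name `planning_parameters` and its argument names are illustrative assumptions, not part of the calculator itself, and only the standard library is used.

```python
from statistics import NormalDist


def planning_parameters(baseline, relative_mde, confidence=0.95,
                        power=0.80, two_sided=True):
    """Translate calculator-style inputs into formula-ready quantities.

    Hypothetical helper for illustration; all names are assumptions.
    """
    alpha = 1.0 - confidence                    # significance level
    target = baseline * (1.0 + relative_mde)    # expected variant rate
    tail = alpha / 2.0 if two_sided else alpha  # two-sided tests split alpha
    return {
        "alpha": alpha,
        "target_rate": target,
        "absolute_difference": target - baseline,
        "z_alpha": NormalDist().inv_cdf(1.0 - tail),  # critical value
        "z_beta": NormalDist().inv_cdf(power),        # power quantile
    }


params = planning_parameters(baseline=0.05, relative_mde=0.10)
for name, value in params.items():
    print(f"{name}: {value:.4f}")
```

With a 5% baseline and a 10% relative minimum detectable effect, the target rate comes out at 5.5%, matching the example in the list.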
Why sample size matters so much
Many failed experimentation programs do not fail because the ideas are weak. They fail because the statistical design is loose. Undersized tests often create noisy outcomes, false confidence, and arguments between teams. One stakeholder sees a temporary lift and wants to ship. Another sees volatility and wants to wait. Without enough traffic, both may be reacting to randomness.
Sample size planning solves that issue by turning a vague debate into a structured threshold. Before the test starts, everyone agrees on the rules. That improves governance and reduces the temptation to stop early when one line on a dashboard looks exciting. It also helps set realistic expectations. If your baseline conversion rate is low and your target uplift is tiny, the required sample size can be very large. Knowing that upfront helps you decide whether the test is worth running, whether to test a stronger change, or whether to use a more sensitive metric higher in the funnel.
The key tradeoff: sensitivity versus speed
An A/B test sample size calculator makes one reality very clear: if you want to detect tiny improvements, you need a lot of traffic. This is not a flaw in the calculator. It is a reflection of statistical reality. Teams with limited traffic often benefit from designing bolder experiments so the minimum detectable effect is larger. Larger effects are easier to detect, which means you can make decisions faster. By contrast, mature programs with very high traffic can afford to test smaller optimizations because they can gather large samples quickly.
| Scenario | Baseline Conversion Rate | Relative MDE | Expected Variant Rate | Approximate Sample Per Variant |
|---|---|---|---|---|
| Email signup test | 5.0% | 20% | 6.0% | 8,100 to 8,300 |
| SaaS trial form test | 5.0% | 10% | 5.5% | 31,000 to 31,500 |
| Ecommerce checkout test | 3.0% | 10% | 3.3% | 53,000 to 53,500 |
| High intent landing page test | 10.0% | 15% | 11.5% | 6,600 to 6,800 |
The figures above are consistent with standard two-proportion sample size planning for a two-sided test at 95% confidence and 80% power. The exact value changes slightly depending on whether the calculation uses pooled assumptions, continuity corrections, or one-sided versus two-sided thresholds, but the pattern is stable. Small uplifts and low baseline rates make tests expensive in terms of traffic.
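The scenarios in the table can be checked independently with a short script. The sketch below assumes the unpooled normal approximation for two proportions, a two-sided test, and no continuity correction; the function name `sample_size_per_variant` is hypothetical and the calculator itself may use a slightly different variant of the formula.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde,
                            confidence=0.95, power=0.80):
    """Approximate visitors needed per variant (unpooled, two-sided)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)   # noise term
    effect = (p2 - p1) ** 2                    # squared signal
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect)


# Scenario name -> (baseline rate, relative MDE), taken from the table.
scenarios = {
    "Email signup test": (0.05, 0.20),
    "SaaS trial form test": (0.05, 0.10),
    "Ecommerce checkout test": (0.03, 0.10),
    "High intent landing page test": (0.10, 0.15),
}

results = {name: sample_size_per_variant(p, mde)
           for name, (p, mde) in scenarios.items()}
for name, n in results.items():
    print(f"{name}: {n:,} per variant")
```

Switching to a pooled variance or adding a continuity correction shifts each figure by a small percentage, which is why the table reports ranges rather than single values.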
How the underlying calculation works
For conversion-focused A/B tests, the standard approach models the outcome as a Bernoulli event: each user either converts or does not convert. The sample size formula then estimates how many observations are needed to distinguish the baseline rate from the target variant rate at your chosen error thresholds. In essence, the calculator compares the expected signal, which is the effect size you care about, to the expected noise, which comes from natural variability in user behavior.
As baseline rates move closer to 50%, variance tends to rise. As the absolute difference between A and B gets smaller, the squared difference in the formula's denominator shrinks, so sample size rises sharply: halving the detectable difference roughly quadruples the required sample. This is why a 1 percentage point lift may be easy to detect when the baseline is 5% and the target is 6%, but a 0.1 percentage point lift from 5.0% to 5.1% can require a very large audience.
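One common closed form for this comparison, under the unpooled normal approximation, is sketched below. The symbols are notation introduced here for illustration: p_1 is the baseline rate, p_2 the target variant rate, alpha the significance level, 1 - beta the power, and z_q the q-quantile of the standard normal distribution. The calculator may use a slightly different variant, such as a pooled variance or a continuity correction.

```latex
n_{\text{per variant}} \approx
\frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}
      \left[\, p_1(1-p_1) + p_2(1-p_2) \,\right]}
     {(p_2 - p_1)^{2}}
```

For a two-sided test at 95% confidence and 80% power, the first factor is about (1.96 + 0.84)^2, or roughly 7.85. Moving from 5% to 6%, the variance term is 0.1039 and the squared difference is 0.0001, which gives roughly 8,150 visitors per variant. Moving from 5.0% to 5.1%, the squared difference drops to 0.000001 and the same formula gives roughly 750,000 per variant.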
Common mistakes when using a sample size calculator
- Using a guess instead of a recent baseline: If your baseline conversion estimate is stale or pulled from a different segment, your planning may be off. Use recent, relevant data from the exact audience entering the experiment.
- Choosing an unrealistically small minimum detectable effect: Teams often enter a tiny lift because they would love to find it, not because they can realistically detect it. That creates impractically long tests.
- Ignoring seasonality and traffic quality: Not all traffic is equally stable. Campaign swings, holidays, or product launches can alter variance and conversion behavior.
- Stopping early: Looking at a dashboard every day is normal. Declaring a winner before the planned sample is reached is not. Early peeking can inflate false positives.
- Testing too many variants with too little traffic: Every additional variant divides traffic and increases duration, especially when power targets remain unchanged.
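The effect of stopping early can be illustrated with a small simulation of A/A tests, where no real difference exists. This is a sketch under simplifying assumptions: looks happen after equally sized batches of traffic, and the z-statistic at each look is modelled directly as a scaled sum of standard normal draws rather than from simulated visitors. The function name and default values are hypothetical.

```python
import math
import random


def false_positive_rates(n_experiments=2000, n_looks=10,
                         z_critical=1.96, seed=7):
    """Simulate A/A tests and compare two stopping rules.

    Assumes equal batches between looks, so the z-statistic at look k
    behaves like a sum of k standard normal draws divided by sqrt(k).
    """
    rng = random.Random(seed)
    any_look = 0     # a "winner" was declared at some look
    final_look = 0   # a "winner" was declared only at the planned end
    for _ in range(n_experiments):
        running_sum = 0.0
        crossed = False
        z = 0.0
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(look)
            if abs(z) > z_critical:
                crossed = True
        any_look += crossed
        final_look += abs(z) > z_critical
    return any_look / n_experiments, final_look / n_experiments


peeking_rate, planned_rate = false_positive_rates()
print(f"false positives when stopping at any look: {peeking_rate:.3f}")
print(f"false positives at the planned end only:   {planned_rate:.3f}")
```

Under these assumptions, the rate of false winners when any look can end the test is well above the nominal 5%, while evaluating only at the planned sample size stays close to it.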
How to choose a realistic minimum detectable effect
The best minimum detectable effect is not purely statistical. It is economic. Ask what level of uplift would change a business decision. Suppose a signup page gets 100,000 monthly visitors and a completed signup is worth a projected $12 in expected value. Even a modest lift could be financially meaningful. On the other hand, if implementation is costly, a tiny effect might not justify engineering effort. Your minimum detectable effect should reflect the smallest improvement worth shipping and maintaining.
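The economic framing can be made concrete with simple arithmetic. The sketch below uses the 100,000 monthly visitors and $12 per signup from the example; the 5% baseline and 10% relative lift are assumptions added purely for illustration, as is the function name.

```python
def monthly_value_of_lift(monthly_visitors, baseline,
                          relative_lift, value_per_conversion):
    """Extra monthly value produced by a given relative lift."""
    extra_conversions = monthly_visitors * baseline * relative_lift
    return extra_conversions * value_per_conversion


# Assumed for illustration: 5% baseline signup rate, 10% relative lift.
value = monthly_value_of_lift(
    monthly_visitors=100_000,
    baseline=0.05,
    relative_lift=0.10,
    value_per_conversion=12.0,
)
print(f"Extra monthly value: ${value:,.0f}")
```

Under those assumptions, the lift adds 500 signups per month, worth about $6,000. Comparing that figure with build and maintenance cost is one way to decide whether a given minimum detectable effect is worth planning for.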
| Confidence / Power Choice | Typical Use Case | Impact on Sample Size | Operational Implication |
|---|---|---|---|
| 90% confidence / 80% power | Fast directional learning | Lowest among common choices | Quicker tests, higher risk of false positives |
| 95% confidence / 80% power | Standard product and CRO testing | Balanced default | Good compromise between rigor and speed |
| 95% confidence / 90% power | Higher stakes product or pricing tests | Noticeably larger than standard | More reliable detection, longer runtime |
| 99% confidence / 90% power | Very high risk decisions | Largest sample requirement | Useful when false positives are especially costly |
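The relative impact described in the table can be approximated numerically. In the unpooled two-proportion formula, confidence and power enter only through the squared sum of two normal quantiles, so the ratio of that term to the 95% / 80% default shows how much each choice scales the sample. This sketch assumes two-sided tests; the function name is hypothetical.

```python
from statistics import NormalDist


def z_multiplier(confidence, power):
    """Squared sum of the critical value and the power quantile."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) ** 2


reference = z_multiplier(0.95, 0.80)   # the standard default
choices = [(0.90, 0.80), (0.95, 0.80), (0.95, 0.90), (0.99, 0.90)]
relative = {choice: z_multiplier(*choice) / reference for choice in choices}

for (confidence, power), ratio in relative.items():
    print(f"{confidence:.0%} confidence / {power:.0%} power: "
          f"{ratio:.2f}x the default sample")
```

Relative to the default, the four rows come out at roughly 0.8x, 1.0x, 1.3x, and 1.9x, holding the baseline rate and minimum detectable effect fixed.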
Real-world interpretation of results
Once the calculator gives you a per-variant sample size, convert it into a time estimate using your experiment traffic. If you need 31,000 users per variant and you have 100,000 eligible monthly visitors with a 50/50 split, each variant receives about 50,000 visitors per month. In that case, the traffic requirement can theoretically be met in roughly 0.62 months, or about 19 days. But do not treat that estimate as the only constraint. Most teams should also run tests across full weekly cycles so weekday and weekend behavior are represented fairly.
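The conversion from sample size to duration, including rounding up to whole weeks, can be sketched as follows. The function name, the even traffic split across variants, and the 30-day month are assumptions made for illustration.

```python
import math


def estimated_duration_days(sample_per_variant, monthly_visitors,
                            variants=2, days_per_month=30):
    """Return (minimum days, days rounded up to full weeks).

    Assumes traffic is split evenly across variants and arrives at a
    constant daily rate, which real traffic rarely does exactly.
    """
    daily_per_variant = monthly_visitors / days_per_month / variants
    raw_days = math.ceil(sample_per_variant / daily_per_variant)
    full_week_days = math.ceil(raw_days / 7) * 7
    return raw_days, full_week_days


raw, rounded = estimated_duration_days(31_000, 100_000)
print(f"Minimum: {raw} days; with full weekly cycles: {rounded} days")
```

For the example above, the minimum is 19 days, which rounds up to 21 days once full weekly cycles are required. Raising `variants` to 3 or more shows how each extra arm stretches the schedule.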
It is also smart to sanity-check the expected number of conversions. If your sample is large but your actual conversions remain very low, practical interpretation becomes harder. This is one reason some teams choose a funnel metric with a higher event rate for earlier-stage iteration, then validate with a lower-funnel metric later.
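A quick way to run that sanity check is to multiply the per-variant sample by the baseline and target rates. The sketch below reuses the 31,000-visitor example with a 5% baseline and a 10% relative lift; the function name is hypothetical.

```python
def expected_conversions(sample_per_variant, baseline, relative_mde):
    """Expected conversion counts for control and variant arms."""
    control = sample_per_variant * baseline
    variant = sample_per_variant * baseline * (1 + relative_mde)
    return round(control), round(variant)


control, variant = expected_conversions(31_000, 0.05, 0.10)
print(f"Expected conversions: control {control:,}, variant {variant:,}")
```

Here the plan expects about 1,550 conversions in the control arm and 1,705 in the variant arm, so the whole test rests on a gap of roughly 155 conversions.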
Authoritative sources worth consulting
Experimentation works best when grounded in strong statistical practice and credible public guidance. For foundational statistical concepts, useful references include the National Institute of Standards and Technology, which provides technical material on measurement and applied statistics. For probability and statistical education, see the Carnegie Mellon University Department of Statistics. For broad public education on health and behavioral research methods, including careful treatment of evidence and uncertainty, the National Center for Biotechnology Information offers extensive methodological resources. While these sources are not CRO tools, they reinforce the statistical principles behind sound experiment planning.
When not to rely on a standard sample size model alone
The classic A/B test sample size calculator is ideal for binary outcomes like converted versus not converted. However, some experiments involve revenue per visitor, time on task, retention curves, or non-independent user behavior. In those cases, additional methods may be more appropriate. Sequential testing frameworks, Bayesian approaches, CUPED variance reduction, cluster-based designs, and uplift modeling can all change planning assumptions. If your experiment involves network effects, repeated exposure, or strong user heterogeneity, involve a statistician or experienced experimentation lead.
Best practices before launching any experiment
- Define one primary metric and document it clearly.
- Choose guardrail metrics such as bounce rate, revenue per user, or error rate.
- Set the sample size threshold before the test begins.
- Decide in advance whether the test is one-sided or two-sided.
- Plan for full business cycles, not just minimum traffic accumulation.
- Check tracking quality and event deduplication before launch.
- Avoid changing targeting or creative mid-test unless the experiment is restarted.
Final takeaway
An A/B test sample size calculator is not just a convenience widget. It is a planning tool that protects decision quality. By estimating traffic needs before launch, you reduce false certainty, improve stakeholder alignment, and make better use of your experimentation pipeline. If the estimated sample is too large, that is valuable insight. It may mean you need a bigger idea, a higher-traffic segment, a more sensitive metric, or a longer runway. If the sample is manageable, you can launch with confidence knowing the test is designed to answer a meaningful question.
Use the calculator above as a practical starting point for test planning. The strongest experimentation teams do not just run more tests. They run well-powered tests, choose meaningful effects, and interpret results in the context of both statistical evidence and business value.