A/B Test Guide Sample Size Calculator
Estimate how many users you need per variant to detect a meaningful conversion lift with statistical confidence.
How to use an A/B test guide sample size calculator effectively
An A/B test guide sample size calculator helps you answer one of the most important questions in experimentation: how much traffic do I need before I can trust the outcome? Many teams launch split tests without a rigorous estimate, then stop too early when the numbers look promising. That creates false positives, wasted development time, and poor product decisions. A proper sample size calculation prevents that by setting the traffic requirement up front.
In practical terms, sample size is the number of visitors, sessions, or users needed in each variant to detect a meaningful difference between version A and version B. In a conversion-focused test, the calculation usually depends on five variables: baseline conversion rate, minimum detectable effect, confidence level, statistical power, and traffic allocation. Once those are known, you can estimate both the required exposure and a realistic test duration.
What each input means
- Baseline conversion rate: your current observed conversion rate. If your existing landing page converts at 5%, this is the starting point.
- Minimum detectable uplift: the smallest relative improvement worth detecting. A 10% uplift on a 5% baseline means you want to detect a move from 5.0% to 5.5%.
- Confidence level: how strict you want to be about false positives. A 95% confidence level is common and corresponds to a 5% significance threshold.
- Statistical power: the probability that your test detects a real effect if it exists. Most teams choose 80% or 90% power.
- Traffic split: how visitors are allocated between variants. Equal allocation is usually the most efficient for two-variant tests.
The calculator uses a two-proportion sample size formula with an equal-variance approximation, the standard approach for binary outcomes such as conversions. It estimates the sample size per variant, then multiplies by the number of variants to show the total traffic requirement.
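Key idea: Smaller expected lifts require much larger sample sizes, because the requirement grows roughly with the inverse square of the difference you want to detect. Detecting a 5% relative uplift needs roughly 15 times more traffic than detecting a 20% uplift, and the absolute numbers are largest when the baseline conversion rate is low.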
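The calculation described above can be sketched in a few lines of Python. This is a minimal sketch, not the calculator's actual code: the function name `sample_size_per_variant` is hypothetical, and the formula shown (normal approximation with unpooled variances, two-sided test, 50/50 split) is one common variant that is assumed here because the calculator's exact formula is not shown.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_uplift, confidence=0.95, power=0.80):
    """Approximate users per variant for a two-sided test with a 50/50 split.

    Assumed formula (normal approximation, unpooled variances):
    n = (z_alpha/2 + z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / (p2 - p1)^2
    """
    p1 = baseline
    p2 = baseline * (1 + relative_uplift)  # expected variant rate
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


per_variant = sample_size_per_variant(0.05, 0.10)
print("per variant:", per_variant, "total for two variants:", 2 * per_variant)

# Compare a small and a large target uplift on the same 5% baseline.
ratio = sample_size_per_variant(0.05, 0.05) / sample_size_per_variant(0.05, 0.20)
print("a 5% uplift needs about", round(ratio), "times the traffic of a 20% uplift")
```

For a 5% baseline and a 10% relative uplift this gives a little over 31,000 users per variant. Other formula variants (for example, pooled variances or continuity corrections) give slightly different numbers.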
Why sample size matters in A/B testing
Without enough observations, random variation can dominate the test outcome. A variant may look better simply because of chance. This is especially dangerous when conversion rates are low or when the expected improvement is small. Teams often react to early noise, ship the wrong experience, and later discover that the measured gain does not persist.
Sample size planning also improves organizational discipline. It forces stakeholders to agree on what effect is meaningful before seeing the data. That means the decision rule is not changed mid-experiment. In mature experimentation programs, this step is standard because it aligns product, analytics, and marketing teams around a shared threshold for action.
Common testing errors caused by poor sample planning
- Peeking too early: checking significance every day and stopping as soon as one variant appears to win.
- Underpowered tests: running experiments that never had enough traffic to detect the intended effect.
- Inflated expectations: assuming a large uplift to justify a small sample, then missing realistic improvements.
- Ignoring allocation: using uneven traffic splits without realizing they increase runtime.
- Skipping seasonality: ending a test before it spans a full business cycle such as weekdays and weekends.
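The cost of peeking can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not a description of any particular testing tool: it models an A/A test (no true difference) with a normal approximation, where each check adds one equally sized batch of data, and the function name `false_positive_rate` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(looks, simulations=4000, confidence=0.95, seed=0):
    """Share of simulated A/A tests declared significant at any of `looks` checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(simulations):
        running_sum = 0.0
        for k in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # evidence from one more batch of data
            z = running_sum / sqrt(k)           # cumulative z-statistic at check k
            if abs(z) > z_crit:                 # stop as soon as a "winner" appears
                hits += 1
                break
    return hits / simulations


single_look = false_positive_rate(looks=1)
ten_looks = false_positive_rate(looks=10)
print(f"one planned analysis: {single_look:.1%} false positives")
print(f"stopping at the first of 10 checks: {ten_looks:.1%} false positives")
```

With a single planned analysis the false positive rate stays close to the nominal 5%. Stopping at the first significant result across ten checks pushes it well above that, even though no real effect exists.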
Real-world sample size examples
The table below illustrates how traffic needs can change dramatically based on baseline rate and minimum detectable uplift. These are approximate values for a two-sided test with 95% confidence and 80% power, using balanced traffic.
| Baseline conversion rate | Target relative uplift | Expected variant rate | Approx. sample size per variant | Approx. total sample |
|---|---|---|---|---|
| 2.0% | 10% | 2.2% | 80,600+ | 161,300+ |
| 5.0% | 10% | 5.5% | 31,000+ | 62,000+ |
| 5.0% | 20% | 6.0% | 8,000+ | 16,000+ |
| 10.0% | 10% | 11.0% | 14,700+ | 29,400+ |
Notice that a baseline of 5% with a 10% relative uplift can need over 30,000 users per variant. That surprises many teams because the absolute change is only 0.5 percentage points, yet proving it reliably requires substantial traffic.
Interpreting confidence and power
Confidence level and power are often confused, but they answer different questions. Confidence controls your tolerance for false alarms. With a 95% confidence level, you are using a threshold that would wrongly declare a difference only 5% of the time when no true effect exists. Power controls your sensitivity. At 80% power, you have an 80% chance of detecting the target effect if it is real.
Increasing either one increases the required sample size, all else equal. If you move from 95% to 99% confidence, the bar becomes stricter and the traffic requirement rises. The same happens when power increases from 80% to 90%. These settings are not “better” in isolation. They must match the cost of error in your business context.
| Testing choice | Business effect | Typical implication |
|---|---|---|
| 95% confidence, 80% power | Common balance for product and marketing tests | Moderate sample sizes, widely accepted standard |
| 99% confidence, 80% power | Stricter false-positive control | Larger sample, longer tests |
| 95% confidence, 90% power | More sensitive to real effects | Larger sample than 80% power |
| 90% confidence, 80% power | Faster directional learning | Smaller sample, higher false-positive risk |
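To put rough numbers on the combinations in the table, the sketch below computes the per-variant requirement for a 5.0% to 5.5% change under each setting. It assumes a two-sided test, a 50/50 split, and the unpooled normal-approximation formula. The helper name `users_per_variant` is hypothetical, and the calculator may use a slightly different formula.

```python
from math import ceil
from statistics import NormalDist


def users_per_variant(p1, p2, confidence, power):
    """Two-sided, equal-split approximation with unpooled variances (an assumed formula)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2) + NormalDist().inv_cdf(power)
    return ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)


# (confidence, power) pairs from the table, ordered from least to most demanding.
settings = [(0.90, 0.80), (0.95, 0.80), (0.95, 0.90), (0.99, 0.80)]
sizes = {s: users_per_variant(0.05, 0.055, *s) for s in settings}
for (confidence, power), n in sizes.items():
    print(f"{confidence:.0%} confidence, {power:.0%} power: {n:,} users per variant")
```

Under these assumptions the requirement rises from roughly 24,600 users per variant at 90% confidence and 80% power to roughly 46,500 at 99% confidence and 80% power, with the 95% settings in between.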
How to choose a minimum detectable effect
The minimum detectable effect, or MDE, is one of the most strategic inputs in any A/B test guide sample size calculator. If you choose an unrealistically large uplift, your test becomes easier to run but less useful. If you choose a tiny uplift, the sample size may become impractical. The right value depends on economics.
Start by asking what improvement would justify implementation. If changing a signup flow takes engineering time, design review, QA, and analytics support, maybe a 1% relative lift is too small to matter. On the other hand, for a very high-traffic revenue page, even a 2% relative lift might be valuable enough to pursue. The MDE should reflect the smallest effect worth acting on, not the largest effect you hope to see.
A practical method for setting MDE
- Estimate the annual or quarterly value of a 1 percentage point absolute conversion gain.
- Translate that into a realistic relative lift based on your baseline rate.
- Compare the potential value with implementation cost and risk.
- Choose the smallest lift that still makes the test worth shipping if confirmed.
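The steps above can be sketched as a simple break-even calculation. Everything in this example is hypothetical: the function name, the parameter names, and the traffic, value, and cost figures are illustrative assumptions, and the sketch ignores risk and any effects beyond the chosen period.

```python
def break_even_lift(baseline, visitors_per_period, value_per_conversion, implementation_cost):
    """Return (value of a 1-point absolute gain, break-even gain in points, break-even relative lift)."""
    # Step 1: value of a 1 percentage point absolute conversion gain over the period.
    value_per_point = visitors_per_period * 0.01 * value_per_conversion
    # Step 3: absolute gain (in percentage points) needed to pay back the implementation cost.
    break_even_points = implementation_cost / value_per_point
    # Step 2: the same gain expressed as a relative lift on the baseline rate.
    break_even_relative = (break_even_points / 100) / baseline
    return value_per_point, break_even_points, break_even_relative


# Hypothetical inputs: 5% baseline, 1,000,000 visitors per year,
# 40 currency units per conversion, 60,000 implementation cost.
value_per_point, points, relative = break_even_lift(0.05, 1_000_000, 40, 60_000)
print(f"value of a 1-point gain: {value_per_point:,.0f}")
print(f"break-even gain: {points:.2f} points, or {relative:.1%} relative lift")
```

With these made-up inputs, a relative lift of about 3% pays back the change within a year, so an MDE somewhat above that would be a defensible starting point. The final choice still depends on risk and on how much traffic the resulting sample size demands.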
Why runtime estimates matter
Traffic alone does not determine whether a test is operationally sound. Runtime matters because user behavior changes by day of week, campaign mix, pay cycle, and season. A test that reaches sample size in three days may still be unreliable if your business has a seven-day cycle. As a rule, many experimentation teams prefer to run tests for at least one full weekly cycle and often two, provided that the pre-calculated sample size is also reached.
The calculator estimates runtime from the required sample and your daily traffic, adjusted for the smallest allocation share: the test is complete only when the variant receiving the least traffic reaches its required sample. If you use a 60/40 split instead of 50/50, the smaller variant collects only 40% of daily traffic rather than 50%, so the test takes longer to complete.
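A minimal sketch of this runtime estimate follows. The function names and the traffic figures are hypothetical, and the sketch holds the per-variant requirement fixed across splits, which is a simplification. Rounding up to whole weeks reflects the weekly-cycle rule of thumb above and is not necessarily something the calculator does.

```python
from math import ceil


def runtime_days(n_per_variant, daily_visitors, smallest_share=0.5):
    """Days until the variant with the smallest traffic share reaches n_per_variant."""
    if not 0 < smallest_share <= 0.5:
        raise ValueError("smallest_share must be in (0, 0.5] for a two-variant test")
    return ceil(n_per_variant / (daily_visitors * smallest_share))


def round_up_to_full_weeks(days):
    """Extend the runtime so the test covers complete weekly cycles."""
    return ceil(days / 7) * 7


# Hypothetical inputs: 31,231 users per variant, 4,000 eligible visitors per day.
balanced = runtime_days(31_231, 4_000, smallest_share=0.5)
uneven = runtime_days(31_231, 4_000, smallest_share=0.4)
print("50/50 split:", balanced, "days;", "60/40 split:", uneven, "days")
print("planned runtime in full weeks:", round_up_to_full_weeks(balanced), "days")
```

In this example the 50/50 split needs 16 days and the 60/40 split needs 20, and both round up to three full weeks.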
Best practices for using this calculator in production
- Use recent baseline data. If your current conversion rate is stale or based on a different channel mix, your estimate will be off.
- Keep the primary metric stable. Do not change the success metric mid-test.
- Prefer balanced allocation. For a fixed total sample, a 50/50 split minimizes the variance of the estimated difference and reaches the target sample fastest.
- Account for data quality. Bot traffic, duplicate users, and tracking loss can invalidate the assumptions.
- Do not stop because of early spikes. Reach the planned sample size and enough calendar coverage before deciding.
- Segment after significance with caution. Looking at many segments increases the chance of false discoveries.
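The caution about segments can be made concrete with a short calculation. This sketch assumes the segments are independent and that no segment has a true effect, which real segments rarely satisfy exactly. The function names are hypothetical, and the per-segment threshold shown is the standard Bonferroni correction.

```python
def chance_of_any_false_positive(num_segments, alpha=0.05):
    """Probability that at least one of several independent null comparisons looks significant."""
    return 1 - (1 - alpha) ** num_segments


def bonferroni_threshold(num_segments, alpha=0.05):
    """Per-segment significance threshold that keeps the overall false positive rate near alpha."""
    return alpha / num_segments


for k in (1, 5, 10, 20):
    print(f"{k:>2} segments: {chance_of_any_false_positive(k):.0%} chance of a false discovery, "
          f"Bonferroni threshold {bonferroni_threshold(k):.4f}")
```

With ten segments tested at a 5% threshold, the chance of at least one spurious "winner" is about 40%.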
Authoritative statistics resources
If you want to go deeper on hypothesis testing, power, and sample size, these sources are useful references:
- NIST Engineering Statistics Handbook
- Penn State STAT 500 course materials on hypothesis testing and inference
- U.S. Census Bureau statistical guidance resources
Final takeaways
An A/B test guide sample size calculator is more than a convenience tool. It is a planning framework that helps you protect against false wins and missed opportunities. Before launching a test, define your baseline, pick an economically meaningful MDE, set an appropriate confidence level and power, and estimate runtime based on actual traffic allocation. Once the test starts, avoid changing the rules.
If your calculated sample size seems too large, that is not a sign the math is broken. It usually means one of three things: your expected uplift is too small relative to your traffic, your baseline conversion rate is low, or your confidence requirements are strict. In that case, consider testing bigger changes, aggregating traffic across a longer period, or prioritizing higher-impact pages. The goal is not to force significance. The goal is to make decisions you can trust.