A/B Test Sample Size Calculator Formula
Estimate how many users you need in your control and variant groups before launching an experiment. This calculator uses the standard two sample proportion test formula commonly applied to conversion rate A/B testing.
Interactive Calculator
Enter your baseline conversion rate, the absolute uplift you care about, your alpha level, your desired power, and the traffic split.
How the A/B test sample size calculator formula works
An A/B test sample size calculator answers a simple but expensive question: how much traffic do you need before trusting a result? If you stop too early, a random spike can look like a winner. If you require too much traffic, you delay decisions and lose time. The right sample size balances speed, statistical confidence, and business risk.
For most digital experiments, the outcome is binary. A user either converts or does not convert. That means the standard planning model is a two sample test for proportions. The calculator above applies that framework to estimate how many users are needed in the control group and the treatment group to detect a chosen uplift with a selected significance level and power.
In practical terms, you enter five inputs:
- Baseline conversion rate: your current expected conversion probability.
- Minimum detectable effect: the smallest absolute change worth detecting.
- Alpha: the probability of a false positive you are willing to accept.
- Power: the probability of detecting a true effect of at least that size.
- Allocation ratio: how traffic is split between the two variants.
The core formula
For a two group A/B test on conversion rate, a common planning formula for the sample size per group is based on the normal approximation:
n per group = ((z_alpha term × pooled standard error) + (z_beta term × group standard error))² / (p2 – p1)²
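Where p1 is the baseline rate and p2 is the expected rate after uplift. z_alpha is the standard normal critical value for the chosen significance level (the 1 – alpha/2 quantile for a two sided test, about 1.96 at alpha 0.05), and z_beta is the standard normal quantile for the chosen power (about 0.84 at 80% power).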
When traffic is split equally, the formula is often written as:
n = (z_alpha × sqrt(2 × pbar × (1 – pbar)) + z_beta × sqrt(p1 × (1 – p1) + p2 × (1 – p2)))² / (p2 – p1)²
Here, pbar is the average of p1 and p2. For unequal allocation, the variance changes slightly, which is why this calculator adjusts the required users in the control and treatment based on your selected ratio.
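The equal split formula can be sketched in a few lines of Python. This is a minimal illustration of the normal approximation described above, not the calculator's actual source. The function name `sample_size_per_group` and its parameters are assumptions chosen for this example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80, two_sided=True):
    """Users per group for an equal-split two-proportion test (normal approximation)."""
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)   # critical value for significance
    z_beta = NormalDist().inv_cdf(power)       # quantile for the desired power
    pbar = (p1 + p2) / 2                       # average of the two rates
    pooled = sqrt(2 * pbar * (1 - pbar))       # standard error term under the null
    unpooled = sqrt(p1 * (1 - p1) + p2 * (1 - p2))  # term under the alternative
    return ceil((z_alpha * pooled + z_beta * unpooled) ** 2 / (p2 - p1) ** 2)


n = sample_size_per_group(0.05, 0.06)
print(f"{n:,} users per group, {2 * n:,} in total")
```

For a 5% baseline and a 6% target at alpha 0.05 and 80% power, this should return a little under 8,200 users per group. Rounding up with `ceil` is a conservative choice so the plan never falls short of the stated power.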
Why baseline rate matters so much
The variance of a proportion depends on p × (1 – p). As a result, sample size is not determined only by the uplift you want to detect. A 1 percentage point change from 2% to 3% is very different from a 1 percentage point change from 30% to 31%. The absolute effect is the same, but the underlying variance and relative business impact differ.
The direction of the effect depends on how you define the uplift. For a fixed absolute uplift, the binomial variance p × (1 – p) grows as the baseline moves toward 50%, so mid range baselines need more users than very low ones. For a fixed relative uplift, low baselines are the expensive case: at a 1% baseline, a 10% relative lift is only 0.1 percentage points, most users do not convert, and random noise can dominate the sparse conversion signal. Either way, smaller effects require substantially more traffic.
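To make the variance argument concrete, the short sketch below prints p × (1 – p) for a few baseline rates. The helper name `bernoulli_variance` and the chosen rates are illustrative assumptions.

```python
def bernoulli_variance(p):
    """Variance of a single convert / no-convert outcome with probability p."""
    return p * (1 - p)


for p in (0.01, 0.02, 0.10, 0.30, 0.50):
    print(f"baseline {p:.0%}: variance {bernoulli_variance(p):.4f}")
```

The variance rises steadily toward a 50% baseline, which is why the same absolute uplift costs more traffic at mid range conversion rates.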
Real comparison table: sample size by baseline and uplift
The table below shows approximate users required per variant using a two sided test, alpha 0.05, and power 0.80 with equal traffic split. These are real values generated by the same planning logic used in the calculator, rounded for clarity.
| Baseline rate | Target rate | Absolute uplift | Approx. users per group | Total users needed |
|---|---|---|---|---|
| 2% | 3% | 1 percentage point | 3,820 | 7,640 |
| 5% | 6% | 1 percentage point | 8,150 | 16,300 |
| 5% | 7% | 2 percentage points | 2,220 | 4,440 |
| 10% | 11% | 1 percentage point | 14,700 | 29,400 |
| 20% | 22% | 2 percentage points | 6,510 | 13,020 |
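The key lesson is that halving the minimum detectable effect does not just double your required traffic; it roughly quadruples it. Because the effect is squared in the denominator, traffic grows very quickly as your target effect gets smaller. At a 5% baseline, for example, moving the target from 7% to 6% raises the requirement from about 2,220 to about 8,150 users per group.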
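The scenarios in the table can be recomputed with a short script. This is a sketch that assumes the equal split formula given earlier, with a two sided alpha of 0.05 and 80% power. The function name `users_per_group` is hypothetical, and the results should land within a percent or two of the rounded figures in the table.

```python
from math import sqrt
from statistics import NormalDist


def users_per_group(p1, p2, alpha=0.05, power=0.80):
    """Unrounded users per group, equal split, two-sided normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    pbar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * pbar * (1 - pbar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p2 - p1) ** 2


# (baseline, target) pairs taken from the comparison table
scenarios = [(0.02, 0.03), (0.05, 0.06), (0.05, 0.07), (0.10, 0.11), (0.20, 0.22)]
table = {pair: round(users_per_group(*pair)) for pair in scenarios}
for (p1, p2), n in table.items():
    print(f"{p1:.0%} -> {p2:.0%}: {n:,} per group, {2 * n:,} total")

# Halving the uplift at a 5% baseline (2 points down to 1 point)
ratio = users_per_group(0.05, 0.06) / users_per_group(0.05, 0.07)
print(f"traffic multiplier when the uplift is halved: {ratio:.2f}")
```

The multiplier comes out somewhat below 4 rather than exactly 4 because the target rate, and therefore the variance, also changes when the uplift shrinks.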
Real comparison table: effect of confidence and power
Teams sometimes increase confidence and power without appreciating the traffic cost. The next table uses a baseline rate of 5% and a target of 6%, with equal allocation. The values are approximate and represent users needed per group.
| Alpha | Power | Confidence style | Approx. users per group | Approx. total |
|---|---|---|---|---|
| 0.10 | 0.80 | 90% confidence, 80% power | 6,420 | 12,840 |
| 0.05 | 0.80 | 95% confidence, 80% power | 8,150 | 16,300 |
| 0.05 | 0.90 | 95% confidence, 90% power | 10,900 | 21,800 |
| 0.01 | 0.90 | 99% confidence, 90% power | 15,530 | 31,060 |
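A quick way to see the traffic cost of stricter settings is to compare the multiplier (z_alpha + z_beta)². For small uplifts, the two standard error terms in the formula are nearly equal, so the required sample is roughly proportional to this multiplier. The sketch below relies on that simplification, and the helper name `z_multiplier` is hypothetical.

```python
from statistics import NormalDist


def z_multiplier(alpha, power):
    """(z_alpha + z_beta)^2 for a two-sided test; sample size scales roughly with it."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) ** 2


baseline = z_multiplier(0.05, 0.80)
for alpha, power in [(0.10, 0.80), (0.05, 0.80), (0.05, 0.90), (0.01, 0.90)]:
    m = z_multiplier(alpha, power)
    print(f"alpha {alpha:.2f}, power {power:.2f}: multiplier {m:.2f} "
          f"({m / baseline:.2f}x the 0.05 / 0.80 setting)")
```

Moving from 95% confidence and 80% power to 99% confidence and 90% power nearly doubles the multiplier, which is consistent with the jump shown in the table.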
How to choose a minimum detectable effect
The minimum detectable effect, often called MDE, should be a business decision before it becomes a statistical input. A common mistake is selecting an unrealistically tiny uplift because it sounds precise. If the baseline conversion rate is 5%, asking the experiment to detect an uplift of only 0.1 percentage points requires roughly 750,000 users per group at alpha 0.05 and 80% power. That may be mathematically possible but operationally inefficient.
A better process is to tie the MDE to economics:
- Estimate the revenue or value per conversion.
- Estimate the implementation cost and opportunity cost.
- Define the smallest lift that meaningfully changes your decision.
- Run the sample size calculation to see whether the traffic requirement is realistic.
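One simple way to turn the economics steps above into a starting MDE is a break-even calculation. The sketch below is an assumption about how a team might do this, not a rule from the calculator. The function `break_even_uplift` and every input figure are hypothetical.

```python
def break_even_uplift(total_cost, users_exposed, value_per_conversion):
    """Smallest absolute uplift whose added conversion value covers the cost.

    All inputs are hypothetical planning figures supplied by the team.
    """
    return total_cost / (users_exposed * value_per_conversion)


# Hypothetical example: 50,000 in build and opportunity cost, 1,000,000 users
# exposed over the payback period, and 40 in value per conversion.
uplift = break_even_uplift(total_cost=50_000, users_exposed=1_000_000,
                           value_per_conversion=40.0)
print(f"break-even uplift: {uplift * 100:.3f} percentage points")
```

If the break-even uplift is far smaller than anything your traffic can detect, that is a signal to test a bolder change or to accept a larger MDE.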
If the required sample is too large, you have several options. You can test a bigger change, improve instrumentation, focus on a higher intent audience, or choose a stronger metric with lower variance.
One sided versus two sided tests
This calculator lets you choose between one sided and two sided tests. In a two sided test, you are asking whether the variant is different from the control in either direction. That is the standard default for many experimentation programs because it protects against the possibility that the variant performs worse.
In a one sided test, you only care about improvement in one direction. Because the critical value is lower (about 1.645 instead of 1.96 at alpha 0.05), the required sample is smaller; a one sided test at alpha 0.05 needs the same sample as a two sided test at alpha 0.10. However, you should only choose one sided testing if that directional decision is justified before the test begins and documented in your analysis plan.
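The difference between the two test types comes down to which normal quantile is used as the critical value. A minimal sketch using only the Python standard library, with variable names chosen for illustration:

```python
from statistics import NormalDist

alpha = 0.05
power = 0.80

z_two_sided = NormalDist().inv_cdf(1 - alpha / 2)  # alpha split across both tails
z_one_sided = NormalDist().inv_cdf(1 - alpha)      # all of alpha in one tail
z_beta = NormalDist().inv_cdf(power)

# Approximate sample ratio, assuming n scales with (z_alpha + z_beta)^2
saving = ((z_one_sided + z_beta) / (z_two_sided + z_beta)) ** 2

print(f"two sided critical value: {z_two_sided:.3f}")
print(f"one sided critical value: {z_one_sided:.3f}")
print(f"one sided sample as a share of two sided: {saving:.0%}")
```

Under these settings the one sided design needs roughly four fifths of the two sided sample, which is the saving you trade against losing protection in the other direction.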
What traffic allocation ratio does to your sample size
Equal allocation usually minimizes total required traffic for a fixed two group design. If you send twice as much traffic to the variant as the control, you may get operational benefits, but the total sample size tends to increase because the information content becomes less balanced. That said, uneven allocation can still be useful when a product team wants faster learning on a preferred variation or when risk control requires a smaller exposure to the challenger.
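The effect of an uneven split can be sketched with a common textbook extension of the two proportion formula, where `ratio` is variant users divided by control users. This is an assumption about how such an adjustment can be made. The calculator's exact method is not shown on this page, and the function name `group_sizes` is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def group_sizes(p1, p2, ratio=1.0, alpha=0.05, power=0.80):
    """Return (control, variant) users; ratio = variant users / control users."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    pbar = (p1 + ratio * p2) / (1 + ratio)              # traffic-weighted average rate
    pooled = sqrt(pbar * (1 - pbar) * (1 + 1 / ratio))  # null-hypothesis term
    unpooled = sqrt(p1 * (1 - p1) + p2 * (1 - p2) / ratio)  # alternative term
    control = ceil((z_alpha * pooled + z_beta * unpooled) ** 2 / (p2 - p1) ** 2)
    variant = ceil(control * ratio)
    return control, variant


for ratio in (1.0, 2.0, 0.5):
    control, variant = group_sizes(0.05, 0.06, ratio=ratio)
    print(f"ratio {ratio}: control {control:,}, variant {variant:,}, "
          f"total {control + variant:,}")
```

With `ratio=1.0` the expression reduces to the equal split formula, and any other ratio should produce a larger total for the same alpha and power.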
Common mistakes in A/B test sample size planning
- Mixing relative and absolute uplift: a move from 5% to 6% is a 1 percentage point absolute lift and a 20% relative lift. Do not confuse them.
- Stopping when the chart looks good: peeking repeatedly inflates false positive risk unless your method accounts for it.
- Using post test observed uplift to justify sample size: power planning should happen before data collection.
- Ignoring seasonality or time based effects: even if you hit the traffic target, a test that spans unusual days can still mislead.
- Forgetting multiple testing: if many variants or metrics are reviewed, error rates can drift upward.
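The peeking problem in the list above is easy to demonstrate with a small A/A simulation, where both groups share the same true rate and every "significant" result is a false positive. The sketch below is illustrative only. The function names, the number of looks, and the seed are assumptions, and the exact rates will vary with those choices.

```python
import random
from statistics import NormalDist


def z_stat(c1, n1, c2, n2):
    """Pooled two-proportion z statistic; returns 0.0 when undefined."""
    p = (c1 + c2) / (n1 + n2)
    se = (p * (1 - p) * (1 / n1 + 1 / n2)) ** 0.5
    return 0.0 if se == 0 else (c2 / n2 - c1 / n1) / se


def simulate_aa(rate=0.10, n_per_group=2000, looks=10, runs=300, alpha=0.05, seed=7):
    """Return (false positive rate with peeking, false positive rate at the final look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    step = n_per_group // looks
    peek_hits = final_hits = 0
    for _ in range(runs):
        c1 = c2 = 0
        crossed = False
        for look in range(1, looks + 1):
            # Add one batch of users to each group, then test the running totals
            c1 += sum(rng.random() < rate for _ in range(step))
            c2 += sum(rng.random() < rate for _ in range(step))
            n = look * step
            significant = abs(z_stat(c1, n, c2, n)) > z_crit
            crossed = crossed or significant
        peek_hits += crossed        # stopped at any look that crossed the threshold
        final_hits += significant   # only the planned final analysis counts
    return peek_hits / runs, final_hits / runs


peeking_rate, final_rate = simulate_aa()
print(f"false positives with peeking: {peeking_rate:.1%}")
print(f"false positives at final look only: {final_rate:.1%}")
```

Because every run that is significant at the final look also counts as a peeking hit, the peeking rate can never be lower than the single look rate, and in practice it is usually well above the nominal alpha.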
When the normal approximation is appropriate
The formula used here is the standard approximation for planning two proportion tests and works well in many web experiments, especially when expected counts are reasonably large. If your baseline rate is extremely low, your event is rare, or your traffic is small, an exact method or simulation can be more appropriate. Advanced teams may also adjust for sequential monitoring, heterogeneity, or Bayesian decision frameworks. Even so, the classical sample size formula remains the common starting point because it is transparent, fast, and interpretable.
Step by step interpretation of your calculator result
After you click calculate, the tool returns four key outputs:
- Control users needed: the number of users required in the baseline group.
- Variant users needed: the number required in the treatment group after adjusting for the allocation ratio.
- Total users needed: the sum of both groups.
- Estimated variant conversion rate: the baseline plus your selected absolute uplift.
If you know your weekly eligible traffic, you can estimate the test duration by dividing the total required users by weekly exposure. Always add buffer for tracking gaps, bot filtering, holdouts, and traffic volatility.
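The duration estimate can be written as a one line calculation. In the sketch below, the function name `weeks_needed` and the default 10% buffer are assumptions for illustration. Pick a buffer that reflects your own tracking loss and traffic volatility.

```python
from math import ceil


def weeks_needed(total_users, weekly_eligible_users, buffer=0.10):
    """Whole weeks required to reach the total sample, with a safety buffer."""
    return ceil(total_users * (1 + buffer) / weekly_eligible_users)


# Example: 16,300 total users needed and 5,000 eligible users per week
print(weeks_needed(16_300, 5_000))
```

Rounding up to whole weeks also helps keep each day of the week equally represented in the test.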
Why this formula remains the industry default
The main reason is that it directly maps to decision risk. Alpha controls false positives. Power controls false negatives for the effect size you care about. The conversion rate assumptions are explicit. That makes the formula easy to communicate to product managers, analysts, executives, and engineers. It also encourages discipline: define the smallest worthwhile effect, estimate the traffic needed, and commit to a test plan before results arrive.
Practical takeaway
If you remember only one rule, remember this: the sample size requirement grows dramatically as your target effect gets smaller. A/B testing is not just about collecting data. It is about collecting enough data to separate signal from randomness. Use the calculator to set realistic expectations before launch, align stakeholders on the minimum effect that matters, and avoid the costly trap of deciding too early.
For teams running experiments weekly, this planning step is often what separates a mature experimentation culture from a reactive one. When sample size, power, confidence, and business impact are aligned before deployment, your experiment outcomes become much easier to trust.