A/B Sample Size Calculator
Estimate how many users you need in each test variant before launching an experiment. This calculator uses a standard two-proportion z-test framework to estimate per-variant sample size from your baseline conversion rate, minimum detectable effect, significance level, and statistical power. It then uses your expected traffic to project how long the test will take to reach that sample.
How to use an A/B sample size calculator correctly
An A/B sample size calculator helps you decide how much traffic your experiment needs before you can trust the outcome. In practical terms, it protects teams from two costly mistakes: ending a test too early and running a test that was never large enough to detect a meaningful difference. If you work in product, growth, UX, or ecommerce, sample size planning is one of the most important disciplines in experimentation because it anchors your decisions in probability instead of hope.
Most A/B tests compare two conversion rates. The control version, often called A, represents the current experience. The treatment version, B, contains a change such as new messaging, layout, price presentation, onboarding flow, or checkout design. Before the test starts, you need to answer a few statistical questions: what is your current baseline conversion rate, how small of an improvement is still valuable, what level of false-positive risk are you willing to tolerate, and how much chance do you want of detecting a true effect if it exists? Those inputs determine sample size.
What the calculator is actually estimating
This calculator estimates the number of users required in each variant for a two-proportion z-test. It uses the baseline conversion rate, your minimum detectable effect, the significance level, the desired power, and the test sidedness. In experimentation terms:
- Baseline conversion rate is your current observed success rate, such as purchases, sign-ups, or clicks.
- Minimum detectable effect, or MDE, is the smallest relative uplift worth detecting. If your baseline conversion rate is 10% and your MDE is 10%, the calculator assumes the variant conversion rate you care about is 11%.
- Significance level controls the Type I error rate, also called alpha. At 0.05, you accept a 5% risk of falsely declaring a difference when no real difference exists.
- Power is the complement of the Type II error rate, also called beta. With 80% power, your test has an 80% chance of detecting the target effect if it is real and a 20% chance of missing it.
- Test sidedness determines whether you are checking for any difference at all or only for movement in one expected direction.
These assumptions produce a planning estimate, not a guarantee. Real experiments can be noisier than planned because of traffic quality changes, instrumentation issues, novelty effects, or unstable user behavior. Even so, sample size planning remains the best first step for serious test design.
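For readers who want to see the arithmetic, one common form of the two-proportion sample size formula is sketched below. It assumes equal allocation between variants and the unpooled-variance approximation; the exact variant used by any given calculator may differ (pooled variance or a continuity correction changes the result slightly). The symbols $p_1$ (baseline rate), $p_2$ (target variant rate), and $n$ (users per variant) are notation introduced here for illustration.

```latex
p_2 = p_1\,(1 + \mathrm{MDE}), \qquad
n \;\approx\; \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\bigl[\,p_1(1-p_1) + p_2(1-p_2)\,\bigr]}{(p_2 - p_1)^{2}}
```

Here $1-\beta$ is the power, and a one-sided test replaces $z_{1-\alpha/2}$ with $z_{1-\alpha}$. As a worked case, a 10% baseline with a 10% relative MDE gives $p_2 = 0.11$; with $\alpha = 0.05$ and 80% power, $(1.960 + 0.842)^2 \approx 7.85$, the variance term is $0.09 + 0.0979 = 0.1879$, and the squared difference is $0.0001$, so $n \approx 7.85 \times 0.1879 / 0.0001 \approx 14{,}750$ users per variant.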
Why sample size matters so much in A/B testing
Underpowered tests are one of the most common reasons teams make weak product decisions. If your test is too small, even a genuinely useful improvement can look indistinguishable from random noise. That leads to false negatives, where you reject a winning idea simply because the experiment lacked sensitivity. On the other side, if you repeatedly peek at a tiny test and stop when the numbers look good, you increase your chance of false positives.
Sample size planning creates a disciplined framework. It tells your stakeholders, ahead of launch, roughly how many users are needed and about how long the experiment should run. This improves test governance, helps avoid midstream pressure to stop early, and makes tradeoffs visible. For example, if leadership wants to detect a 2% relative lift at 99% confidence with 95% power on a low-converting page, the calculator will likely show a very large traffic requirement. That conversation is healthy because it forces the business to decide whether the expected value of learning justifies the cost of running the test.
Typical inputs and how they affect the result
1. Baseline conversion rate
Baseline rate changes the math substantially. Extremely low conversion rates often require larger samples to detect small relative changes because the absolute difference between control and variant is tiny. For example, moving from 2.0% to 2.2% is only a 0.2 percentage-point increase even though it is a 10% relative lift.
2. Minimum detectable effect
The smaller the MDE, the larger the required sample. This is often the strongest lever. If the business only cares about large wins, the test can finish faster. If small gains are financially meaningful, plan for more traffic and longer duration.
3. Significance level
A stricter significance threshold such as 0.01 lowers the chance of false positives but increases sample size. In most product and marketing contexts, 0.05 is a common default because it balances rigor and feasibility.
4. Statistical power
Higher power means a greater chance of catching a true effect, but it comes at the cost of larger samples. Many teams use 80% as a planning standard. If the decision is especially high stakes, 90% power may be justified.
5. Traffic allocation and split
Even if your required sample is fixed, actual test duration depends on how much traffic enters the experiment and how that traffic is split between variants. A 50/50 split is statistically efficient for two versions. Uneven splits may be useful for business risk management, but they typically slow down completion because the smaller arm becomes the bottleneck.
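The cost of an uneven split can be made concrete with a small sketch. It assumes a two-sided test and the unpooled-variance two-proportion formula, generalized so that the treatment arm receives `ratio` users for every control user. The function name `arm_sizes` and its parameters are hypothetical, chosen only for this illustration.

```python
from math import ceil
from statistics import NormalDist


def arm_sizes(baseline, rel_mde, ratio=1.0, alpha=0.05, power=0.80):
    """Return (control_n, treatment_n) for a two-sided two-proportion z-test.

    ratio is treatment users per control user (an assumed parameter name);
    ratio=1.0 is a 50/50 split, ratio=0.25 is an 80/20 split.
    """
    p1 = baseline
    p2 = baseline * (1 + rel_mde)
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    # Var(p2_hat - p1_hat) = p1(1-p1)/n1 + p2(1-p2)/(ratio * n1)
    n_control = z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2) / ratio) / (p2 - p1) ** 2
    return ceil(n_control), ceil(n_control * ratio)


if __name__ == "__main__":
    for label, ratio in [("50/50", 1.0), ("80/20", 0.25)]:
        control, treatment = arm_sizes(0.10, 0.10, ratio=ratio)
        print(f"{label}: control={control:,} treatment={treatment:,} "
              f"total={control + treatment:,}")
```

Under these assumptions the 80/20 split needs fewer users in the small arm but noticeably more users in total, which is why uneven allocation tends to lengthen a test.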
Illustrative sample size scenarios
The table below shows approximate per-variant sample sizes for common A/B testing scenarios using a two-sided test, 95% confidence, and 80% power. Values are rounded and shown for planning intuition rather than as audited clinical-grade calculations.
| Baseline conversion rate | Relative uplift | Target variant rate | Approx. users per variant | Approx. total sample |
|---|---|---|---|---|
| 5.0% | 10% | 5.5% | 31,200 | 62,400 |
| 10.0% | 10% | 11.0% | 14,700 | 29,400 |
| 20.0% | 10% | 22.0% | 6,500 | 13,000 |
| 10.0% | 5% | 10.5% | 57,800 | 115,600 |
Notice how sensitive the required traffic is to the MDE. A test seeking a 5% relative lift at a 10% baseline needs dramatically more data than a test seeking a 10% relative lift on the same page. This is why advanced experimentation programs prioritize tests based on expected business impact and traffic availability.
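Scenarios like these can be approximated with a short script. This is a minimal sketch assuming the unpooled-variance two-proportion formula with equal allocation; `users_per_variant` and its parameter names are hypothetical, and other formula variants (pooled variance, continuity correction) will give somewhat different figures.

```python
from math import ceil
from statistics import NormalDist


def users_per_variant(baseline, rel_mde, alpha=0.05, power=0.80, two_sided=True):
    """Approximate users needed in each arm of a two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + rel_mde)          # relative uplift -> target rate
    tail = alpha / 2 if two_sided else alpha
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - tail)         # critical value for significance
    z_power = nd.inv_cdf(power)            # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    scenarios = [(0.05, 0.10), (0.10, 0.10), (0.20, 0.10), (0.10, 0.05)]
    for baseline, mde in scenarios:
        n = users_per_variant(baseline, mde)
        print(f"baseline={baseline:.1%} mde={mde:.0%} -> "
              f"{n:,} per variant, {2 * n:,} total")
```

Because the absolute difference between the two rates is squared in the denominator, halving the MDE at the same baseline raises the requirement to close to four times the original.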
Benchmarks for confidence and power
Choosing the right alpha and power should reflect the cost of being wrong. A pricing test, checkout flow experiment, or compliance-sensitive message may deserve stricter settings than a low-risk copy test on a blog landing page. The table below summarizes common defaults.
| Setting | Common value | Interpretation | Typical use case |
|---|---|---|---|
| Alpha | 0.05 | 5% false-positive risk | General product and marketing experimentation |
| Alpha | 0.01 | 1% false-positive risk | High-stakes decisions with strong governance |
| Power | 0.80 | 80% chance to detect the planned effect | Standard planning default |
| Power | 0.90 | 90% chance to detect the planned effect | Important launches or expensive follow-up decisions |
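To get a feel for what stricter settings cost, the sketch below compares the factor by which alpha and power scale the required sample in the standard z-test formula, relative to the 0.05 / 0.80 default. It assumes a two-sided test unless told otherwise; `z_multiplier` is a hypothetical helper name.

```python
from statistics import NormalDist


def z_multiplier(alpha, power, two_sided=True):
    """(z_alpha + z_power)^2: the factor through which alpha and power
    enter the z-test sample size formula."""
    tail = alpha / 2 if two_sided else alpha
    nd = NormalDist()
    return (nd.inv_cdf(1 - tail) + nd.inv_cdf(power)) ** 2


base = z_multiplier(0.05, 0.80)
for alpha, power in [(0.05, 0.80), (0.05, 0.90), (0.01, 0.80), (0.01, 0.90)]:
    factor = z_multiplier(alpha, power) / base
    print(f"alpha={alpha:.2f} power={power:.2f} -> {factor:.2f}x the default sample")
```

With everything else held fixed, moving to 90% power costs roughly a third more traffic, tightening alpha to 0.01 costs roughly half again as much, and doing both nearly doubles the requirement.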
How to interpret the output
- Users per variant tells you how many observations each arm needs under the assumptions entered.
- Total sample size is the total number of participants across both A and B.
- Expected variant conversion rate converts your relative uplift into a concrete target rate.
- Estimated duration uses your monthly eligible traffic and allocation to estimate how long it may take to reach the required sample.
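The duration estimate is simple arithmetic, sketched here under stated assumptions: traffic arrives evenly across a 30-day month, and the allocated traffic is split equally among the variants. The function name `estimated_days` and its parameters are hypothetical.

```python
from math import ceil


def estimated_days(users_per_variant, monthly_traffic, allocation=1.0,
                   num_variants=2, days_per_month=30):
    """Days needed for every variant to reach its required sample.

    allocation is the share of eligible traffic entering the experiment
    (assumed parameter name); traffic is assumed to be split evenly.
    """
    daily_in_test = monthly_traffic * allocation / days_per_month
    daily_per_variant = daily_in_test / num_variants
    return ceil(users_per_variant / daily_per_variant)


if __name__ == "__main__":
    # Illustrative numbers only: 14,750 users per variant, 120,000 eligible
    # users per month, half of them allocated to the experiment.
    print(estimated_days(14_750, 120_000, allocation=0.5))  # 15
```

Real traffic is rarely this even, so the figure is best read as a lower bound on calendar time rather than a promise.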
If the projected duration is too long, you have several levers. You can increase allocated traffic, focus the test on a higher-traffic area, target a larger MDE, simplify the experiment to reduce variance, or postpone the test until you have more traffic. What you should not do is quietly lower rigor after launch without documenting the tradeoff.
Common mistakes teams make
- Guessing at the baseline: If the control conversion rate is stale or seasonal, your planning estimate can be off. Use recent and representative data.
- Choosing an unrealistic MDE: Teams often ask for detection of tiny uplifts that are not feasible with available traffic.
- Ignoring practical significance: A statistically significant result is not always economically meaningful. Tie MDE to business value.
- Stopping early: Repeatedly checking results and ending on a favorable spike inflates false-positive risk.
- Changing primary metrics mid-test: This weakens inferential integrity unless the changes are pre-registered and accounted for.
- Running too many variants: Multi-arm tests dilute traffic and often demand much larger samples than teams expect.
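The early-stopping problem in the list above can be illustrated with a small A/A simulation, in which both arms share the same true rate so every "significant" result is a false positive. This is a sketch with assumed settings (400 simulated tests, 5 interim looks, a pooled two-sided z-test at alpha 0.05); the function names and parameters are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, alpha=0.05):
    """Two-sided pooled two-proportion z-test with n users per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    z = (conv_b - conv_a) / n / se
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)


def simulate_aa(runs=400, looks=5, users_per_look=200, rate=0.10, seed=7):
    """Return (false-positive rate when stopping at any significant look,
    false-positive rate when testing once at the end)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _run in range(runs):
        conv_a = conv_b = n = 0
        flagged_at_any_look = False
        significant = False
        for _look in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            significant = is_significant(conv_a, conv_b, n)
            flagged_at_any_look = flagged_at_any_look or significant
        peeking_hits += flagged_at_any_look
        final_hits += significant  # result of the last look only
    return peeking_hits / runs, final_hits / runs


if __name__ == "__main__":
    peeking, final = simulate_aa()
    print(f"stop at first significant look: {peeking:.1%}")
    print(f"single test at the end:         {final:.1%}")
```

In runs of this kind the peeking rate typically lands well above the nominal 5%, while the single final test stays near it. Sequential testing methods exist to correct for this, but they require their own planning.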
One-sided vs two-sided testing
Two-sided tests are the default in many experimentation programs because they detect both improvement and harm. That matters when a change could plausibly reduce conversion or create unexpected friction. A one-sided test can lower the sample size requirement, but it should be used carefully and only when a negative result would lead to exactly the same decision as no result at all. If you would reverse course after seeing a large drop, then a two-sided framework is usually more honest.
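The size of the one-sided saving can be estimated directly from the normal quantiles, since sidedness only changes the critical value in the z-test formula. The sketch below assumes everything else (baseline, MDE, power) is held fixed; `one_sided_saving` is a hypothetical helper name.

```python
from statistics import NormalDist


def one_sided_saving(alpha=0.05, power=0.80):
    """Fractional reduction in required sample when switching from a
    two-sided to a one-sided z-test at the same alpha and power."""
    nd = NormalDist()
    z_power = nd.inv_cdf(power)
    two_sided = (nd.inv_cdf(1 - alpha / 2) + z_power) ** 2
    one_sided = (nd.inv_cdf(1 - alpha) + z_power) ** 2
    return 1 - one_sided / two_sided


if __name__ == "__main__":
    print(f"{one_sided_saving():.0%} fewer users with a one-sided test")
```

At alpha 0.05 and 80% power the saving is roughly a fifth of the sample, which is meaningful but rarely large enough to justify giving up the ability to detect harm.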
When this calculator is most useful
This type of A/B sample size calculator is ideal for binary outcomes like sign-up completed, purchase completed, trial started, or CTA clicked. It is especially useful during experiment intake, prioritization, and stakeholder planning. It helps answer questions such as: Can this test finish in two weeks? Is the expected upside large enough to justify engineering work? Should we use all traffic or start with a lower allocation? Will a smaller expected lift make the test infeasible?
However, if you are testing revenue per user, retention, average order value, or metrics with highly skewed distributions, a proportion-based calculator may not be the right tool. In those cases you may need a calculator designed for means, variance reduction techniques, or sequential methods.
Trusted references and further reading
For readers who want more rigorous statistical grounding, the following public resources are helpful:
- U.S. Census Bureau guidance on sample size concepts
- Penn State STAT resources on hypothesis testing and power
- NIST statistical references and methodology resources
These sources do not all focus exclusively on product experimentation, but they provide reliable foundations in inferential statistics, error rates, and sample planning. If your organization runs many experiments each month, it is worth developing a documented testing policy that standardizes alpha, power, MDE selection, and stopping rules.
Final takeaway
An A/B test without a sample size plan is like a budget without a target. You may spend time and attention, but you do not know whether the result is dependable. A proper sample size calculator gives your team a rational starting point. It frames the tradeoff between speed and certainty, aligns expectations before launch, and reduces the temptation to overreact to noisy early data. In mature experimentation programs, this discipline compounds over time because it improves both the quality of decisions and the credibility of the testing function.
Use the calculator above to estimate your required traffic, then treat the result as part of a broader decision framework. Pair it with clean instrumentation, a pre-defined primary metric, a sensible run-time window, and strong test governance. When those pieces work together, A/B testing becomes a high-confidence engine for continuous improvement rather than a source of random anecdotes.