A/B Test Size Calculator
Estimate the required sample size for a statistically valid A/B test, understand runtime based on your traffic, and visualize how the minimum detectable uplift changes your experiment cost.
How an A/B Test Size Calculator Helps You Design Better Experiments
An A/B test size calculator answers one of the most important questions in experimentation: how much traffic do you need before you can trust the result? Many teams know how to launch a split test, but far fewer know whether they have enough observations to detect a meaningful lift. That gap creates a serious risk. If your test is underpowered, you may stop too early, declare a winner that is not real, or miss a valuable improvement because the experiment never had a fair chance to detect it.
At its core, an A/B test compares two proportions. In conversion optimization, those proportions are usually conversion rates: purchases, signups, form completions, or clicks. The calculator on this page estimates the sample size required for a two-sample proportion test. It uses the baseline conversion rate, the minimum detectable uplift, your target confidence level, statistical power, and the traffic split between variations.
That matters because every experiment is a tradeoff between speed and certainty. If you want to detect very small changes, your sample requirement grows sharply. If you are willing to test only for larger lifts, you can run the experiment faster, but you may overlook smaller wins that are still meaningful at scale. This is why skilled growth teams do not ask only, “Can we run a test?” They ask, “What effect size matters to the business, and what traffic investment is justified to measure it?”
What the calculator is actually measuring
When you enter a baseline conversion rate and a minimum detectable uplift, the tool converts those assumptions into two proportions:
- Control rate: your current best estimate of conversion performance.
- Treatment rate: the smallest improved rate that you want the test to reliably detect.
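The conversion from a baseline rate and a relative uplift into the two proportions can be sketched as follows. This is a minimal illustration, not the calculator's actual code; the function names `treatment_rate` and `absolute_lift_points` are hypothetical and assume the uplift is entered as a relative change.

```python
def treatment_rate(baseline_rate: float, relative_uplift: float) -> float:
    """Return the smallest improved rate the test should reliably detect.

    Both arguments are fractions, e.g. 0.05 for 5% and 0.20 for a 20% relative uplift.
    """
    if not 0 < baseline_rate < 1:
        raise ValueError("baseline_rate must be strictly between 0 and 1")
    rate = baseline_rate * (1 + relative_uplift)
    if not 0 < rate < 1:
        raise ValueError("the implied treatment rate must be strictly between 0 and 1")
    return rate


def absolute_lift_points(baseline_rate: float, relative_uplift: float) -> float:
    """Return the same uplift expressed in percentage points."""
    return (treatment_rate(baseline_rate, relative_uplift) - baseline_rate) * 100


# A 5% baseline with a 20% relative uplift implies a 6% treatment rate,
# which is an absolute increase of 1 percentage point.
print(f"{treatment_rate(0.05, 0.20):.2%}")
print(f"{absolute_lift_points(0.05, 0.20):.2f} percentage points")
```

Keeping the relative and absolute views side by side makes it harder to confuse the two when setting targets.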
The calculator then combines those rates with your selected confidence and power settings. Confidence level controls the risk of a false positive, often called Type I error. Power controls the risk of a false negative, often called Type II error. In simpler language:
- Higher confidence means you demand stronger evidence before calling a winner.
- Higher power means you want a better chance of detecting a real difference when it truly exists.
Both are good goals, but both increase required sample size. That is why many teams choose 95% confidence and 80% power as a practical starting point. Those settings are common because they strike a useful balance between rigor and cost.
The statistical formula behind sample size for A/B tests
For a standard A/B test comparing two conversion rates, the required equal-size sample per variation can be estimated with a normal approximation for two proportions. In plain language, the formula considers the natural variability in each conversion rate and then asks how many observations are needed before the expected difference is larger than the random noise.
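$$
n = \frac{\left( z_{1-\alpha/2}\sqrt{2\,\bar{p}\,(1-\bar{p})} + z_{1-\beta}\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^2}{(p_2 - p_1)^2}
$$

where n = sample size per variation, p1 = baseline rate, p2 = expected treatment rate, p_bar = (p1 + p2) / 2, and z_{1-α/2} and z_{1-β} are the z-scores for the chosen confidence level and power.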
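As a rough sketch, the normal-approximation estimate for an equal split could be written in Python like this. It assumes a two-sided test and the pooled form that uses p_bar; the function name `sample_size_per_variant` is an illustrative choice, and the calculator on this page may differ in details such as rounding.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(
    baseline_rate: float,
    relative_uplift: float,
    confidence: float = 0.95,
    power: float = 0.80,
) -> int:
    """Estimate visitors needed in each variation of a 50 / 50 test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_uplift)
    p_bar = (p1 + p2) / 2

    # z-scores for a two-sided significance test and the requested power.
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)

    # Noise under "no difference" (pooled) plus noise under the expected difference.
    noise = z_alpha * sqrt(2 * p_bar * (1 - p_bar)) + z_beta * sqrt(
        p1 * (1 - p1) + p2 * (1 - p2)
    )
    return ceil(noise**2 / (p2 - p1) ** 2)


# 10% baseline, 20% relative uplift: a little over 3,800 visitors per variation.
print(sample_size_per_variant(0.10, 0.20))
```

Because the difference p2 - p1 is squared in the denominator, halving the detectable difference roughly quadruples the required sample.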
If you use an uneven split such as 60 / 40 or 70 / 30, the total sample requirement rises because one group has less information. Equal splits are usually the most efficient for learning, which is why many experimentation platforms default to 50 / 50 traffic allocation unless there is a strong business reason to protect one variant.
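The cost of an uneven split can be illustrated with a small sketch. It uses a simplified, unpooled variance form of the two-proportion approximation, which is an assumption made here for brevity; `total_sample_for_split` and `control_share` are hypothetical names.

```python
from math import ceil
from statistics import NormalDist


def total_sample_for_split(
    p1: float,
    p2: float,
    control_share: float = 0.5,
    confidence: float = 0.95,
    power: float = 0.80,
) -> int:
    """Estimate the total visitors needed when control receives `control_share` of traffic."""
    if not 0 < control_share < 1:
        raise ValueError("control_share must be strictly between 0 and 1")
    treatment_share = 1 - control_share

    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)

    # Each group's variance is divided by its share of total traffic,
    # so the smaller group contributes disproportionately to the noise.
    variance = p1 * (1 - p1) / control_share + p2 * (1 - p2) / treatment_share
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


for share in (0.5, 0.6, 0.7):
    total = total_sample_for_split(0.10, 0.12, control_share=share)
    print(f"{share:.0%} control: about {total:,} total visitors")
```

Under these assumptions, moving from 50 / 50 to 70 / 30 raises the total requirement by roughly a fifth for a 10% to 12% comparison.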
Why sample size planning matters more than most teams realize
Sample size is not just a technical detail for statisticians. It is one of the biggest drivers of experiment quality, roadmap speed, and business credibility. Here is what often goes wrong when teams ignore it:
- Premature stopping. A test shows an early lift, stakeholders get excited, and someone ends the test before enough data accumulates. Early swings are common and often disappear.
- False negatives. A test is too small, so a real but modest improvement fails to reach significance. The team concludes that the idea did not work even though it might have delivered revenue at scale.
- Misaligned business expectations. Leadership may expect weekly wins, but the traffic profile may only support reliable readouts every few weeks for realistic effect sizes.
- Poor prioritization. Teams may spend time on low-impact tests that require huge sample sizes, when higher-contrast ideas could be validated more efficiently.
Using an A/B test size calculator before launch creates alignment. It tells stakeholders how long the test should run, what minimum lift is realistic to measure, and whether the available traffic supports the desired learning goal.
Reference table: confidence, power, and common z-score inputs
The following values are standard statistical constants used in sample size planning.
| Setting | Meaning | Error rate | Approximate z-score |
|---|---|---|---|
| 90% confidence | Accepts a higher false positive risk than 95% | Alpha = 0.10, two-sided | 1.645 |
| 95% confidence | Most common benchmark in business experiments | Alpha = 0.05, two-sided | 1.960 |
| 99% confidence | Very conservative evidence threshold | Alpha = 0.01, two-sided | 2.576 |
| 80% power | 20% chance of missing a true effect of the target size | Beta = 0.20 | 0.842 |
| 90% power | Stronger sensitivity, larger sample needed | Beta = 0.10 | 1.282 |
| 95% power | High sensitivity, often expensive in traffic terms | Beta = 0.05 | 1.645 |
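The z-scores in the table can be reproduced with Python's standard library, which is a convenient way to check other confidence or power settings. The helper names `z_for_confidence` and `z_for_power` are illustrative, and the confidence helper assumes a two-sided test as in the table.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def z_for_confidence(confidence: float) -> float:
    """Two-sided critical value: alpha is split evenly between both tails."""
    alpha = 1 - confidence
    return STANDARD_NORMAL.inv_cdf(1 - alpha / 2)


def z_for_power(power: float) -> float:
    """One-sided value: beta = 1 - power sits in a single tail."""
    return STANDARD_NORMAL.inv_cdf(power)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_for_confidence(level):.3f}")
for level in (0.80, 0.90, 0.95):
    print(f"{level:.0%} power: z = {z_for_power(level):.3f}")
```

Note that 90% confidence (two-sided) and 95% power share the same z-score of about 1.645, because both leave 5% in a single tail.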
Example scenarios using the same statistical method
To make the tradeoffs concrete, here are sample calculations based on a 95% confidence level, 80% power, and equal traffic allocation. These values are representative outputs from the same two-proportion approach used in the calculator.
| Baseline conversion rate | Minimum detectable uplift | Expected treatment rate | Approximate sample per variant | Approximate total sample |
|---|---|---|---|---|
| 2.0% | 10% | 2.2% | 80,000+ | 160,000+ |
| 5.0% | 15% | 5.75% | 14,000+ | 28,000+ |
| 10.0% | 20% | 12.0% | 3,800+ | 7,600+ |
| 20.0% | 10% | 22.0% | 6,400+ | 12,800+ |
These examples reveal an important pattern: low baseline conversion rates and small target uplifts are expensive. If your site converts at 2% and you want to detect a 10% relative lift, you need a lot of traffic. On the other hand, if the baseline is higher or the change is more dramatic, the sample requirement drops substantially.
How to choose a realistic minimum detectable effect
The minimum detectable effect, often shortened to MDE, is the smallest change you care about enough to act on. It should not be chosen arbitrarily; it should come from business context. Ask questions like:
- What lift would materially affect revenue, lead volume, or retention?
- How much test traffic can we realistically invest without delaying other experiments?
- Is the proposed change bold enough to plausibly move behavior by that amount?
Teams often make one of two mistakes. First, they choose an extremely small MDE because they want precision. That can force impractically long tests. Second, they choose a huge MDE just to get a quick result, even though such a large change is unlikely. A strong workflow is to define several potential MDEs and compare the traffic cost. The chart above helps with that by showing how sample size changes as the uplift assumption moves up or down.
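Comparing several candidate MDEs side by side could look like the sketch below. It assumes 95% confidence, 80% power, an equal split, and a simplified unpooled form of the two-proportion approximation; `mde_cost_table` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(p1, relative_uplift, confidence=0.95, power=0.80):
    """Unpooled normal approximation for an equal split (illustrative simplification)."""
    p2 = p1 * (1 + relative_uplift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def mde_cost_table(baseline_rate, candidate_uplifts):
    """Pair each candidate relative uplift with its per-variant sample size."""
    return [(u, sample_size_per_variant(baseline_rate, u)) for u in candidate_uplifts]


for uplift, n in mde_cost_table(0.05, [0.05, 0.10, 0.15, 0.20]):
    print(f"{uplift:.0%} relative uplift: about {n:,} visitors per variant")
```

Laying the candidates out this way makes the traffic cost of a smaller MDE explicit before anyone commits to a target.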
How long should an A/B test run?
A calculator can estimate runtime by dividing the total required sample by the number of eligible visitors the experiment receives per month, then converting the result to days where that is more useful. This gives a directional duration estimate rather than a guarantee. However, you should also respect your business cycle. If behavior differs by weekday, pay period, season, or campaign calendar, the test should span enough time to capture those patterns. Even when a test appears to reach the target sample quickly, it is wise to avoid ending before a full business cycle has passed.
As a practical standard, many experimentation teams try to run tests for at least one to two full weeks and long enough to include the same mix of weekdays and weekends for both variants. If the required sample implies a much shorter period, consider whether seasonality or campaign timing still argues for a longer run.
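A runtime estimate that respects these guidelines might be sketched as follows. The two-week floor, the rounding up to whole weeks, and the 30-day month are assumptions chosen for illustration; `estimated_runtime_days` is a hypothetical helper.

```python
from math import ceil


def estimated_runtime_days(
    total_sample: int,
    monthly_eligible_visitors: int,
    min_days: int = 14,
    days_per_month: int = 30,
) -> int:
    """Estimate test duration in days, rounded up to whole weeks."""
    if monthly_eligible_visitors <= 0:
        raise ValueError("monthly_eligible_visitors must be positive")
    daily_visitors = monthly_eligible_visitors / days_per_month
    raw_days = ceil(total_sample / daily_visitors)

    # Enforce the minimum duration, then round up to full weeks so every
    # weekday appears the same number of times.
    weeks = ceil(max(raw_days, min_days) / 7)
    return weeks * 7


print(estimated_runtime_days(7_700, 30_000))    # fast test, held to the two-week floor
print(estimated_runtime_days(160_000, 60_000))  # traffic-limited test
```

Only visitors who are eligible to enter the experiment should be counted in `monthly_eligible_visitors`.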
Common mistakes when using an A/B test size calculator
- Using total site traffic instead of eligible test traffic. Only count visitors who actually see the experiment.
- Ignoring uneven allocation. A 70 / 30 split may feel safer, but it often increases runtime.
- Mixing absolute and relative lift. A change from 5% to 6% is a 1 percentage point increase but a 20% relative uplift.
- Peeking too often and stopping opportunistically. Traditional sample size formulas assume a fixed horizon unless you use a sequential testing framework.
- Running tests with contaminated data. Tracking errors, inconsistent audience assignment, and implementation bugs can invalidate even a perfectly sized test.
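The effect of peeking can be demonstrated with a small simulation of A/A tests, where both variants share the same true rate and every "winner" is a false positive. This is an illustrative sketch with arbitrary parameters (10 looks, 200 visitors per variant per look, 500 simulated experiments), not a description of any specific platform.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_stat(conv_a: int, conv_b: int, n: int) -> float:
    """Pooled two-proportion z statistic for two groups of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):
        return 0.0
    return (conv_b - conv_a) / n / sqrt(2 * pooled * (1 - pooled) / n)


def false_positive_rates(rate=0.10, looks=10, visitors_per_look=200,
                         experiments=500, confidence=0.95, seed=7):
    """Return (rate when stopping at any significant look, rate at the fixed horizon)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    peeking_hits = fixed_hits = 0
    for _ in range(experiments):
        conv_a = conv_b = n = 0
        crossed = False
        for _look in range(looks):
            # Both variants convert at the same true rate (no real effect).
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            if abs(z_stat(conv_a, conv_b, n)) > z_crit:
                crossed = True  # a peeking team would have stopped here
        peeking_hits += crossed
        fixed_hits += abs(z_stat(conv_a, conv_b, n)) > z_crit
    return peeking_hits / experiments, fixed_hits / experiments


if __name__ == "__main__":
    peeking, fixed = false_positive_rates()
    print(f"False positives with peeking: {peeking:.1%}")
    print(f"False positives at fixed horizon: {fixed:.1%}")
```

The fixed-horizon rate should land near the nominal 5%, while the peeking rate is typically several times higher, which is why fixed-sample formulas should not be combined with opportunistic stopping.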
Best practices for trustworthy experimentation
- Define the primary metric before launch.
- Choose confidence and power deliberately, not by habit alone.
- Set an MDE tied to business value.
- Estimate runtime using only eligible traffic.
- Keep traffic allocation balanced when possible.
- Validate instrumentation before the experiment starts.
- Resist stopping early based on temporary significance swings.
- Document assumptions so stakeholders understand the decision framework.
Authoritative resources for deeper statistical guidance
If you want to go deeper into power analysis, hypothesis testing, and experimental design, these sources are excellent starting points:
- NIST Engineering Statistics Handbook
- Penn State STAT 415: Introduction to Mathematical Statistics
- NIH NCBI overview of statistical significance and hypothesis testing
Final takeaway
An A/B test size calculator is not just a forecasting tool. It is a planning tool that helps you run experiments with discipline. By understanding the relationship between baseline rate, desired uplift, confidence, power, and traffic allocation, you can build tests that are both practical and statistically credible. The best experimentation programs do not chase significance at all costs. They define meaningful effects, measure them rigorously, and make decisions with enough data to trust the outcome.
Use the calculator above before you launch any test. It will help you estimate how much audience you need, how long the experiment should run, and whether your target uplift is realistic for the traffic you have available. That upfront planning can save weeks of wasted effort and lead to better product, marketing, and conversion decisions.