A/B Testing Sample Size Calculator
Estimate the number of visitors you need per variation before launching a statistically sound experiment. This calculator uses a standard two-proportion sample size approach for conversion rate testing.
Expert guide to using an A/B testing sample size calculator correctly
An A/B testing sample size calculator helps you decide how much traffic you need before you can trust the outcome of an experiment. If you run a website test without enough visitors, your data can look exciting while still being mostly noise. If you wait for far more traffic than necessary, you slow your experimentation program, delay learning, and reduce business momentum. That is why sample size is not just a statistical detail. It is an operational decision that influences speed, confidence, and revenue.
For conversion experiments, the most common setup compares two proportions: the control conversion rate and the variant conversion rate. The calculator on this page estimates the sample size required per variation using your baseline conversion rate, your minimum detectable effect, your chosen confidence level, and your desired statistical power. In practical terms, it answers a simple question: how many users must I observe before I can realistically detect the lift I care about?
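One common fixed-horizon formula for this two-proportion setup is sketched below. It is the unpooled-variance version and is offered as an illustration, not as a statement of the exact formula this page's calculator implements; other variants (for example, pooled variance or a continuity correction) give slightly different numbers. Here $p_1$ is the baseline rate, $p_2$ is the variant rate implied by the relative minimum detectable effect, $\alpha$ is one minus the confidence level, and $1-\beta$ is the power.

```latex
n_{\text{per variant}} \approx
\frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}
      \left[\,p_1(1-p_1) + p_2(1-p_2)\,\right]}
     {\left(p_2 - p_1\right)^{2}},
\qquad p_2 = p_1\,(1 + \text{MDE})
```

With a 5% baseline, a 10% relative lift, 95% confidence, and 80% power, this works out to roughly 31,200 visitors per variant.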
Why sample size matters in A/B testing
Many teams focus on design ideas, copy changes, or new offers, but strong experimentation starts with measurement discipline. A test with poor sample planning can produce three common problems:
- False negatives: a real uplift exists, but the test does not have enough power to detect it.
- False positives: the data appears significant too early because random variation is mistaken for a real effect.
- Unstable decisions: results flip as more traffic arrives, creating confusion and reducing trust in experimentation.
Good sample size planning protects against those issues. It aligns the test with a realistic effect size, helps stakeholders understand how long the test will run, and prevents constant peeking or early stopping based on incomplete evidence. In mature optimization programs, sample size planning is often as important as hypothesis quality.
The four inputs that drive your sample size
Most A/B testing sample size calculators rely on four core inputs:
- Baseline conversion rate: your current performance, such as 5% of visitors completing a sign-up.
- Minimum detectable effect: the smallest relative improvement you care enough to detect, such as a 10% lift over the baseline.
- Confidence level: how strict you want to be about avoiding false positives, commonly 95%.
- Statistical power: the probability of detecting a true effect if it really exists, commonly 80% or 90%.
Each input changes the result in intuitive ways. For a fixed relative lift, lower baseline rates usually require larger samples because conversions are rarer and the absolute difference to detect is smaller. Smaller minimum detectable effects require larger samples because you are trying to detect subtler differences. Higher confidence levels require larger samples because the test standard is stricter. Higher power also requires more traffic because you want a better chance of catching real improvements.
Practical rule: if your expected improvement is small, your required sample size rises quickly. Teams often underestimate how expensive it is to detect a tiny lift such as 3% or 5% relative improvement.
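To see how quickly the requirement grows, here is a minimal Python sketch using a common unpooled two-proportion approximation. The function name `sample_size_per_variant` and its parameters are illustrative assumptions, not this page's actual implementation, and the output can differ slightly from other calculators depending on the formula variant and how z-values are rounded.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, confidence=0.95, power=0.80):
    """Approximate visitors per variant for a two-sided two-proportion test.

    Assumes an equal traffic split and the unpooled-variance approximation.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)  # relative MDE -> variant rate
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Sweep the target lift at a 5% baseline, 95% confidence, 80% power.
for lift in (0.03, 0.05, 0.10, 0.20):
    n = sample_size_per_variant(0.05, lift)
    print(f"{lift:.0%} relative lift -> {n:,} visitors per variant")
```

Because the absolute difference is squared in the denominator, halving the target lift roughly quadruples the required sample.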
How the calculator interprets minimum detectable effect
On this page, the minimum detectable effect is treated as a relative lift versus the baseline conversion rate. If your baseline is 5% and your minimum detectable effect is 10%, the calculator assumes the variant target is 5.5%. That means the absolute difference being measured is 0.5 percentage points. This distinction matters because relative and absolute improvements are often confused in experimentation discussions.
For example, moving from 5% to 6% is a 1 percentage point absolute increase, but a 20% relative increase. Always make sure the team agrees on which one is being used when estimating experiment impact.
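The conversions between relative and absolute lift are simple enough to write down explicitly. The helper names below are hypothetical and exist only to make the arithmetic from the two examples above concrete.

```python
def variant_rate(baseline, lift):
    """Variant conversion rate implied by a relative lift over the baseline."""
    return baseline * (1 + lift)


def relative_lift(baseline, variant):
    """Relative lift of the variant rate over the baseline rate."""
    return (variant - baseline) / baseline


def absolute_lift_points(baseline, variant):
    """Absolute difference between the two rates, in percentage points."""
    return (variant - baseline) * 100


target = variant_rate(0.05, 0.10)
print(f"10% relative lift on a 5% baseline: variant rate {target:.2%}, "
      f"{absolute_lift_points(0.05, target):.2f} points absolute")
print(f"5% to 6%: {relative_lift(0.05, 0.06):.0%} relative, "
      f"{absolute_lift_points(0.05, 0.06):.2f} points absolute")
```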
Confidence level and power explained without jargon
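A 95% confidence level means you are setting a relatively high bar before saying the variant truly differs from the control: if there is actually no difference, a test run to its planned sample size will flag one only about 5% of the time. In simple terms, you want to reduce the chance of being fooled by random fluctuation. Statistical power, often set to 80%, tells you how likely the test is to detect a real effect of the size you care about, which also means accepting a 20% chance of missing it.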
These two numbers create a trade-off between rigor and speed. Stronger standards improve reliability, but they also increase sample size and runtime. That is why many product and growth teams standardize on 95% confidence and 80% power. It is a practical middle ground for many web experiments.
| Confidence level | Approximate two-sided Z value | What it means in practice | Typical use case |
|---|---|---|---|
| 90% | 1.645 | Faster tests, slightly greater risk of false positives | Early-stage optimization or lower-risk tests |
| 95% | 1.960 | Balanced standard used by many experimentation teams | General website, landing page, and product tests |
| 99% | 2.576 | Very strict threshold, much larger sample requirements | High-stakes changes or highly regulated decisions |
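The Z values in the table are the two-sided critical values of the standard normal distribution. As a quick check, they can be reproduced with Python's standard library; the helper names `two_sided_z` and `power_z` are illustrative.

```python
from statistics import NormalDist


def two_sided_z(confidence):
    """Critical value for a two-sided test at the given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def power_z(power):
    """Normal quantile matching the desired power (one-sided by construction)."""
    return NormalDist().inv_cdf(power)


for confidence in (0.90, 0.95, 0.99):
    print(f"{confidence:.0%} confidence -> z = {two_sided_z(confidence):.3f}")

for power in (0.80, 0.90):
    print(f"{power:.0%} power -> z = {power_z(power):.3f}")
```

Sample size formulas typically add the confidence and power quantiles together and square the sum, which is why raising either setting increases the required traffic.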
Real sample size examples
The table below shows representative sample size estimates per variant using a standard two-sided two-proportion setup. These examples use 80% power and common web testing assumptions. The figures illustrate how quickly required traffic grows when baseline rates are low or when the target lift is small.
| Baseline conversion rate | Target relative lift | Variant conversion rate | Confidence | Estimated sample per variant |
|---|---|---|---|---|
| 5.0% | 10% | 5.5% | 95% | 31,240 visitors |
| 5.0% | 10% | 5.5% | 90% | 24,640 visitors |
| 10.0% | 15% | 11.5% | 95% | 6,694 visitors |
| 20.0% | 10% | 22.0% | 95% | 6,505 visitors |
These values show why low-conversion pages are harder to test efficiently. If only a small fraction of visitors convert, random noise can dominate the signal for a long time. By contrast, if your baseline is stronger or your expected lift is larger, the test becomes easier to power.
How to choose a realistic minimum detectable effect
The minimum detectable effect should not be chosen at random. It should connect to business value. Ask yourself what level of lift would justify design, engineering, or opportunity cost. If your test can only detect a 20% lift but your expected improvements are usually in the 3% to 8% range, your program may systematically miss valuable wins. If you set the effect too low, such as 1% relative lift, the sample size may become so large that the test is impractical.
A practical method is to estimate the revenue or lead value of different conversion lifts. Then define the smallest lift that is meaningful enough to act on. That becomes the minimum detectable effect for planning. This approach keeps sample size grounded in economics rather than guesswork.
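One way to make this concrete is to price each candidate lift and pick the smallest one that pays for itself. The sketch below is only an illustration: the traffic, value per conversion, cost, and payback horizon are made-up numbers, and both function names are hypothetical.

```python
def monthly_value_of_lift(monthly_visitors, baseline, lift, value_per_conversion):
    """Extra monthly value if the conversion rate improves by a relative lift."""
    extra_conversions = monthly_visitors * baseline * lift
    return extra_conversions * value_per_conversion


def smallest_worthwhile_lift(candidate_lifts, cost, horizon_months, **traffic):
    """Smallest candidate lift whose value over the horizon covers the cost."""
    for lift in sorted(candidate_lifts):
        value = monthly_value_of_lift(lift=lift, **traffic) * horizon_months
        if value >= cost:
            return lift
    return None  # no candidate justifies the cost


# Hypothetical inputs: 100,000 visitors/month, 5% baseline, 40 per conversion.
traffic = dict(monthly_visitors=100_000, baseline=0.05, value_per_conversion=40.0)
mde = smallest_worthwhile_lift(
    [0.01, 0.02, 0.05, 0.10], cost=30_000, horizon_months=6, **traffic
)
print(f"Planning MDE: {mde:.0%} relative lift")
```

The lift returned here is a planning input, not a forecast. It only says which effect size is worth being able to detect.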
How to estimate runtime from sample size
Sample size tells you how much data you need, but runtime tells you whether the test is operationally feasible. The calculator above uses your average daily test visitors and traffic allocation to estimate the number of days required. If the projected duration is too long, you have several choices:
- Increase the amount of traffic assigned to the experiment.
- Choose a larger minimum detectable effect that better matches business significance.
- Focus the test on a higher-conversion segment.
- Simplify the experiment so it can run on a page or audience with more traffic.
- Reduce the number of simultaneous variants if traffic is being spread too thin.
One more operational note: most tests should run through complete business cycles, often at least one to two weeks, to account for weekday and weekend behavior. Even if a sample target is hit early, stopping before a full cycle can distort interpretation if traffic quality varies by day.
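The runtime arithmetic can be sketched as follows. This assumes that "traffic allocation" means the share of daily visitors entering the experiment and that traffic is split equally across variants; the function names are hypothetical and the calculator on this page may define its inputs differently.

```python
from math import ceil


def estimated_days(sample_per_variant, daily_visitors, allocation=1.0, variants=2):
    """Days needed to reach the per-variant sample with an equal split."""
    total_needed = sample_per_variant * variants
    daily_in_test = daily_visitors * allocation  # visitors entering the test per day
    return ceil(total_needed / daily_in_test)


def round_up_to_full_weeks(days, minimum_weeks=1):
    """Extend the runtime to whole weeks so every weekday is covered equally."""
    return max(ceil(days / 7), minimum_weeks) * 7


# Example: 31,240 visitors per variant, 4,000 daily visitors, 50% in the test.
days = estimated_days(31_240, daily_visitors=4_000, allocation=0.5)
print(f"Raw estimate: {days} days, planned runtime: {round_up_to_full_weeks(days)} days")
```

Rounding up to whole weeks reflects the full-business-cycle advice above: a test that needs 32 days of traffic is planned for 35.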
Common mistakes when using an A/B testing sample size calculator
- Using a guessed baseline: if your baseline conversion rate is outdated or pulled from a different audience, the sample estimate can be badly off.
- Confusing absolute and relative lift: this can lead to dramatic underestimation or overestimation of required traffic.
- Checking results too often: repeatedly peeking and stopping the first time the result looks significant inflates the false positive rate well above the level implied by your confidence setting.
- Ignoring practical significance: a statistically significant result is not automatically meaningful for the business.
- Testing too many variants at once: splitting traffic across multiple versions increases time to reach the required sample.
- Not accounting for implementation quality: tracking errors, uneven targeting, and page bugs can invalidate a perfectly sized test.
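The peeking problem from the list above is easy to demonstrate with a small simulation. The sketch below is a simplified model, not a full experiment simulator: it assumes no true difference between variants, treats each equal-sized batch of traffic as adding one standard normal increment to the test statistic, and stops at the first look where the z-score exceeds 1.96. The function name and defaults are illustrative.

```python
import random
from math import sqrt


def false_positive_rate(looks, trials=20_000, z_crit=1.96, seed=1):
    """Share of A/A tests declared 'significant' when checked `looks` times."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = 0.0
        for k in range(1, looks + 1):
            total += rng.gauss(0.0, 1.0)  # evidence from one more equal-sized batch
            if abs(total / sqrt(k)) > z_crit:  # z-score after k batches
                hits += 1
                break  # stop early, as a peeking team would
    return hits / trials


fixed = false_positive_rate(looks=1)
peeking = false_positive_rate(looks=10)
print(f"One planned look: {fixed:.1%} false positives")
print(f"Ten looks, stop at first significance: {peeking:.1%} false positives")
```

With a single planned look the rate stays near the nominal 5%, while stopping at the first significant result across ten looks pushes it several times higher.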
When you may need more advanced methods
This calculator is ideal for standard conversion tests with two variations and an equal split. However, some experiments require more specialized planning. Examples include multi-armed tests, revenue per visitor metrics, sequential testing frameworks, Bayesian decision models, and experiments with strong seasonality or clustering effects. In those cases, a simple fixed-horizon sample size estimate is still useful as a benchmark, but it may not capture every nuance of the test design.
If your metric is not binary, such as average order value or time on page, you may need a different formula based on means and variance. If your traffic split is intentionally unequal, the sample size per group changes. If your experiment affects multiple steps in a funnel, the effective sample for downstream metrics can be much smaller than total page visitors.
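For a continuous metric, a commonly used approximation replaces the proportion variances with the metric's variance. The sketch below assumes equal group sizes, a similar standard deviation in both groups, and a two-sided test; `sample_size_for_means` is a hypothetical helper, and heavily skewed metrics such as revenue may need more care than this formula provides.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_means(std_dev, min_difference, confidence=0.95, power=0.80):
    """Approximate sample per group to detect a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * std_dev ** 2 * (z_alpha + z_beta) ** 2 / min_difference ** 2)


# Hypothetical example: order value with a standard deviation of 40,
# and a minimum difference of 2 in average order value.
n = sample_size_for_means(std_dev=40, min_difference=2)
print(f"{n:,} orders per group")
```

Note that the unit here is whatever the metric is measured on. For average order value that is orders, not page visitors, which ties back to the point about smaller effective samples for downstream metrics.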
Recommended defaults for most growth teams
If you are building a practical experimentation routine and need a starting standard, these defaults work well for many websites:
- Use a 95% confidence level.
- Use 80% power for routine optimization tests.
- Base the baseline conversion rate on recent, representative data.
- Set the minimum detectable effect according to economic impact, not wishful thinking.
- Commit to a full planned runtime and avoid ad hoc early stopping.
These choices are not universal rules, but they are sensible defaults that balance discipline and execution speed. Teams that test frequently often gain more from consistency than from endlessly changing statistical settings from one experiment to the next.
Authoritative resources for deeper study
If you want to go beyond a calculator and understand the statistical foundations, review these high-quality references:
- NIST Engineering Statistics Handbook on sample size and power
- Penn State STAT resources on inference for proportions
- CDC overview of confidence intervals and statistical interpretation
Final takeaway
An A/B testing sample size calculator is not just a convenience. It is one of the simplest ways to improve experiment quality before a test even starts. By planning sample size up front, you avoid premature conclusions, align expectations with stakeholders, and create a more reliable learning system. The best experiment is not the one with the flashiest design change. It is the one that can produce a trustworthy answer within a realistic timeframe.
If you use the calculator on this page thoughtfully, you will be able to estimate per-variant traffic, total audience requirement, and test duration with much more confidence. That makes it easier to prioritize experiments, manage calendars, and focus effort on tests that are both statistically sound and commercially meaningful.