A/B Test Sample Size Calculation

A/B Test Sample Size Calculator

Estimate how many users you need in control and variant before launching an experiment. This calculator uses a standard two-proportion sample size approach for conversion rate testing, helping teams avoid underpowered experiments and misleading winners.

Experiment inputs

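  • Baseline conversion rate: the current control conversion rate, entered as a percentage, such as 10 for 10%.
  • Effect type: choose whether your minimum detectable effect is absolute or relative.
  • Minimum detectable effect: for example, 10 means a +10% uplift if relative is selected.
  • Significance level (alpha): lower alpha reduces false positives but increases sample size.
  • Power: higher power lowers missed wins but requires more traffic.
  • Test type: two-sided is standard for most product and marketing tests.
  • Allocation ratio: use 1 for an equal split, or 0.5 if the variant gets half of the control traffic.
  • Daily eligible users: used to estimate test duration.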

Results

Enter your experiment assumptions and click calculate to see the required sample size per variation, total traffic, estimated duration, and a chart visualization.

Expert guide to A/B test sample size calculation

A/B test sample size calculation is the discipline of estimating how many observations you need before you can trust the outcome of an experiment. In product optimization, email testing, landing page design, and pricing experiments, the temptation is often to start a test quickly, watch the dashboard, and stop as soon as one option appears ahead. That approach feels efficient, but it is one of the fastest ways to create false confidence. The result is often a team that declares winners too early, ships changes that do not generalize, and slowly loses trust in experimentation.

The purpose of sample size planning is simple: decide in advance how much data is required to detect a meaningful change with a specified degree of reliability. In an A/B test, you usually compare two conversion rates, such as the percentage of visitors who purchase, submit a form, or click a button. If your baseline conversion rate is known or reasonably estimated, and you know the smallest lift worth acting on, then you can calculate the number of users needed in the control and treatment groups.

At a practical level, sample size depends on four core inputs: baseline conversion rate, minimum detectable effect, significance level, and statistical power. If any one of those assumptions changes, the required traffic can change dramatically. A tiny effect size with high power and strict significance can turn a one-week test into a multi-month test. A bigger expected lift or a more lenient alpha can shrink the traffic requirement, but not without tradeoffs.

Why sample size matters so much

Underpowered experiments are one of the biggest hidden costs in optimization programs. When a test is underpowered, real improvements often fail to reach significance. That creates false negatives, meaning your team misses beneficial changes because the experiment did not have enough users to separate signal from noise. The opposite problem also occurs: if you repeatedly peek at data and stop early, random variation can masquerade as a strong uplift.

  • Too small a sample increases the chance of inconclusive results and false negatives.
  • Stopping too early increases the chance of false positives if you are not using a sequential framework.
  • Oversized tests waste time and traffic that could be used for the next experiment.
  • Poor assumptions around baseline rate or effect size can produce unrealistic timelines.
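The effect of peeking can be illustrated with a small simulation of A/A tests, where both groups share the same true conversion rate and every "significant" result is therefore a false positive. This is a rough sketch under assumed settings: the function names, the 10 interim looks, the 200 users per group per look, and the 10% rate are all hypothetical choices, not taken from the calculator.

```python
import random
from math import sqrt
from statistics import NormalDist

def is_significant(conv_a, conv_b, n, z_crit):
    """Pooled two-proportion z-test for two groups of n users each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False  # no conversions (or all conversions): nothing to test
    return abs(conv_b - conv_a) / n / se > z_crit

def simulate_aa(n_sims=400, looks=10, users_per_look=200, rate=0.10, seed=7):
    """Return (false positive rate with peeking, rate at the final look only)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)  # two-sided alpha = 0.05
    peeking_hits = fixed_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        flagged_at_any_look = False
        significant = False
        for look in range(1, looks + 1):
            # Both arms draw from the same rate, so there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            significant = is_significant(conv_a, conv_b, look * users_per_look, z_crit)
            flagged_at_any_look = flagged_at_any_look or significant
        peeking_hits += flagged_at_any_look  # stopped at the first "win"
        fixed_hits += significant            # only the final planned look counts
    return peeking_hits / n_sims, fixed_hits / n_sims

peeking_rate, fixed_rate = simulate_aa()
print(f"false positives with peeking: {peeking_rate:.1%}")
print(f"false positives at the planned end: {fixed_rate:.1%}")
```

Because the final look is one of the looks a peeking team would act on, the peeking rate can never be lower than the fixed-horizon rate, and in practice it tends to be noticeably higher.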

A good sample size calculation lets teams align experimentation plans with business reality. If the required audience is too large, that is not a failure of the math. It is a signal. Maybe the proposed change is too small to be economically relevant. Maybe the page does not receive enough traffic. Maybe a different primary metric, stronger design change, or broader user segment would make the test more feasible.

The main inputs in an A/B test sample size calculation

1. Baseline conversion rate. This is the current performance of your control group. If your site converts at 10%, then 10 out of 100 users convert on average. Sample size is sensitive to this number because the variance of a proportion, p(1 - p), depends on both the conversion rate and the non-conversion rate.

2. Minimum detectable effect, or MDE. This is the smallest change worth detecting. You may think in absolute points, such as an increase from 10% to 11%, or relative uplift, such as +10%, which also moves 10% to 11%. Smaller MDEs require larger sample sizes because the experiment must distinguish a subtle difference from normal noise.
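The two ways of expressing an MDE both reduce to a target variant rate. A minimal sketch of that conversion follows; the function name `target_rate` and its parameters are hypothetical and are not taken from the calculator itself.

```python
def target_rate(baseline: float, mde: float, relative: bool) -> float:
    """Return the expected variant conversion rate as a proportion.

    baseline: control conversion rate as a proportion (0.10 for 10%).
    mde: relative uplift as a fraction (0.10 for +10%) when relative is True,
         otherwise an absolute change in proportion units (0.01 for +1 point).
    """
    rate = baseline * (1 + mde) if relative else baseline + mde
    if not 0 < rate < 1:
        raise ValueError("target rate must fall strictly between 0 and 1")
    return rate

# A +10% relative uplift and a +1 point absolute lift both move 10% to 11%.
print(round(target_rate(0.10, 0.10, relative=True), 4))
print(round(target_rate(0.10, 0.01, relative=False), 4))
```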

3. Significance level, alpha. Common values are 0.05 or 0.01. This controls your false positive rate under the testing framework. A lower alpha means you demand stronger evidence before calling a winner. The cost is more required traffic.

4. Power. Power is the probability that the test will detect the effect if the true effect is at least as large as your MDE. A common target is 0.80, while more conservative organizations use 0.90. Higher power reduces false negatives, but again requires more traffic.
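To make the definition of power concrete, the sketch below approximates the power of a two-sided test for a given per-group sample size. It assumes equal groups and the normal approximation, and the function name `achieved_power` is a hypothetical helper rather than part of the calculator.

```python
from math import sqrt
from statistics import NormalDist

def achieved_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test with equal groups."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference between the two observed rates.
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    # Probability that the observed z statistic clears the critical value,
    # ignoring the negligible chance of rejecting in the wrong direction.
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)

# The same 10% -> 11% effect at two different sample sizes.
for n in (3_000, 14_748):
    print(f"n = {n:,} per group: power = {achieved_power(0.10, 0.11, n):.2f}")
```

Running a test with far fewer users than planned does not make the effect disappear; it only lowers the chance that the test will detect it.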

5. Allocation ratio. Many tests split traffic 50/50, but some teams send more users to control for risk management. Unequal allocation increases the total sample size needed compared with an equal split.
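The cost of an unequal split can be checked numerically. The sketch below uses the unpooled normal approximation and treats the ratio as variant traffic divided by control traffic; the function name `sizes_for_ratio` is hypothetical, and the calculator's exact handling of the ratio is not documented here.

```python
from math import ceil
from statistics import NormalDist

def sizes_for_ratio(p1, p2, ratio=1.0, alpha=0.05, power=0.80):
    """Return (n_control, n_variant) when variant traffic = ratio * control traffic."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    # Variance of the difference, scaled to one control user: the variant
    # term grows when the variant receives a smaller share of traffic.
    variance = p1 * (1 - p1) + p2 * (1 - p2) / ratio
    n_control = ceil((z_a + z_b) ** 2 * variance / (p2 - p1) ** 2)
    n_variant = ceil(ratio * n_control)
    return n_control, n_variant

for ratio in (1.0, 0.5, 2.0):
    n_c, n_v = sizes_for_ratio(0.10, 0.11, ratio)
    print(f"ratio {ratio}: control {n_c:,}, variant {n_v:,}, total {n_c + n_v:,}")
```

Under these assumptions the equal split gives the smallest total, and moving the ratio in either direction raises it.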

The standard formula behind the calculator

For a classic two-proportion z-test, one common approximation uses the baseline rate p1 and expected variant rate p2. The required sample depends on the difference p2 - p1, the pooled variance, and the critical values from the normal distribution.

In plain English, the formula says this: the noisier your data and the smaller your expected effect, the more users you need. The confidence threshold contributes one z value, and the power target contributes another z value. Those z values are why stricter thresholds grow sample size quickly.
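One common way to write this approximation for an equal split is shown below. The page does not state which exact variant the calculator implements, so treat this as a representative form. The symbols are assumptions of this sketch: alpha is the two-sided significance level, 1 - beta is the power, and p-bar is the average of the two rates.

```latex
n_{\text{per group}} \;\approx\;
\frac{\left( z_{1-\alpha/2}\,\sqrt{2\,\bar{p}\,(1-\bar{p})}
      \;+\; z_{1-\beta}\,\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^{2}}
     {(p_2 - p_1)^{2}},
\qquad \bar{p} = \frac{p_1 + p_2}{2}

% A simpler unpooled variant that gives nearly the same answer
% when p_1 and p_2 are close:
n_{\text{per group}} \;\approx\;
\frac{\left( z_{1-\alpha/2} + z_{1-\beta} \right)^{2}
      \left[ p_1(1-p_1) + p_2(1-p_2) \right]}
     {(p_2 - p_1)^{2}}

% Worked example: p_1 = 0.10, p_2 = 0.11, alpha = 0.05 (two-sided), power = 0.80
n \;\approx\; \frac{(1.960 + 0.842)^{2} \times (0.0900 + 0.0979)}{(0.01)^{2}}
  \;\approx\; 7.85 \times 1879 \;\approx\; 14{,}750
```

The squared difference in the denominator is why halving the detectable effect roughly quadruples the required sample.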

| Setting | Common value | Approximate z value | Interpretation |
|---|---|---|---|
| Two-sided alpha | 0.10 | 1.645 | Looser evidence threshold, smaller required sample |
| Two-sided alpha | 0.05 | 1.960 | Standard level for many business experiments |
| Two-sided alpha | 0.01 | 2.576 | More conservative, larger required sample |
| Power | 0.80 | 0.842 | Widely used balance between rigor and feasibility |
| Power | 0.90 | 1.282 | Better protection against false negatives |
| Power | 0.95 | 1.645 | Very cautious, often expensive in traffic |
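These z values come from the inverse of the standard normal cumulative distribution. A short sketch using Python's standard library follows; the helper names `z_for_alpha` and `z_for_power` are hypothetical.

```python
from statistics import NormalDist

def z_for_alpha(alpha, two_sided=True):
    """Critical value for the significance level; a two-sided test splits alpha across both tails."""
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail)

def z_for_power(power):
    """z value associated with the power target (power = 1 - beta)."""
    return NormalDist().inv_cdf(power)

for alpha in (0.10, 0.05, 0.01):
    print(f"two-sided alpha {alpha}: z = {z_for_alpha(alpha):.3f}")
for power in (0.80, 0.90, 0.95):
    print(f"power {power}: z = {z_for_power(power):.3f}")
```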

How to choose a realistic minimum detectable effect

The MDE is often chosen poorly. Teams sometimes pick a tiny value because they would love to detect even a small lift. But the right MDE is not the smallest imaginable improvement. It is the smallest improvement that would justify implementation cost, engineering time, brand risk, and opportunity cost. If a redesign takes two sprints and QA time, maybe a 0.2% absolute lift is not worth pursuing. If an email subject line test can be shipped instantly, a smaller MDE may be perfectly reasonable.

A useful framing is to connect MDE to business value. Ask how much additional revenue, lead volume, or retention would be created by the change if it worked. Then compare that gain against the cost of developing, launching, and maintaining the new experience. This pushes experimentation toward economically meaningful thresholds instead of purely statistical ones.
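One simple way to apply this framing is a break-even calculation: the absolute lift at which the added value equals the cost of the change. This is a deliberately crude sketch. The function name and all three inputs are hypothetical planning figures, and it ignores maintenance cost, discounting, and effects that decay over time.

```python
def breakeven_absolute_lift(change_cost, eligible_users, value_per_conversion):
    """Smallest absolute lift (as a proportion) whose added value covers the cost.

    eligible_users and value_per_conversion must cover the same time window
    over which the change is expected to pay for itself.
    """
    return change_cost / (eligible_users * value_per_conversion)

# Hypothetical figures: a 20,000 build cost, 1,000,000 eligible users per year,
# and 40 in value per conversion.
lift = breakeven_absolute_lift(
    change_cost=20_000, eligible_users=1_000_000, value_per_conversion=40
)
print(f"break-even lift: {lift * 100:.3f} percentage points")
```

An MDE set well below the break-even lift asks the test to detect improvements that would not pay for themselves.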

Worked scenarios and practical traffic planning

The table below shows example sample sizes for equal-split tests using a two-sided alpha of 0.05 and 80% power. These are representative outputs from the standard two-proportion calculation. Notice how shrinking the effect size inflates the required sample per group.

| Baseline conversion rate | Target variant rate | Absolute lift | Relative lift | Approximate sample per group | Total sample |
|---|---|---|---|---|---|
| 5.0% | 5.5% | 0.5 points | 10% | 31,200 | 62,400 |
| 10.0% | 11.0% | 1.0 point | 10% | 14,700 | 29,400 |
| 20.0% | 22.0% | 2.0 points | 10% | 6,500 | 13,000 |
| 10.0% | 10.5% | 0.5 points | 5% | 57,800 | 115,600 |
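The figures in the table can be reproduced to within rounding using the unpooled normal approximation. The sketch below assumes an equal split; the function name `sample_size_per_group` is hypothetical, and the calculator's own implementation may differ slightly in how it pools variance or rounds.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Users needed in each group for a two-sided two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_b = NormalDist().inv_cdf(power)          # power target
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_a + z_b) ** 2 * variance_sum / (p2 - p1) ** 2)

scenarios = [(0.05, 0.055), (0.10, 0.11), (0.20, 0.22), (0.10, 0.105)]
for p1, p2 in scenarios:
    n = sample_size_per_group(p1, p2)
    print(f"{p1:.1%} -> {p2:.1%}: {n:,} per group, {2 * n:,} total")
```

Comparing the second and fourth scenarios shows the quadratic cost of small effects: halving the lift from 1.0 to 0.5 points roughly quadruples the sample.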

These numbers illustrate a core truth of experimentation: detecting small effects is expensive. If your page receives only a few thousand users per week, a low MDE may be unrealistic. That is why many high-performing experimentation teams prioritize bold hypotheses over tiny cosmetic changes. Bigger changes generate bigger expected effects, which can make tests more affordable and more informative.

Interpreting the output from this calculator

This calculator estimates the sample required in both control and treatment based on your assumptions. It also computes an estimated run time from your daily eligible traffic. Treat that duration as a planning estimate, not a guarantee. Real experiments are affected by day-of-week patterns, traffic seasonality, outages, bot filtering, user eligibility rules, and how strictly your traffic is randomized.

  1. Enter your current conversion rate as accurately as possible.
  2. Select whether your MDE is relative or absolute.
  3. Choose alpha and power based on your risk tolerance.
  4. Set the traffic allocation ratio if your test is not a 50/50 split.
  5. Estimate daily eligible users, not total site visitors.
  6. Use the resulting duration to decide whether the experiment is feasible.
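The duration step in the list above is simple arithmetic: total required users divided by daily eligible users. A minimal sketch follows, with an optional rounding to whole weeks as one way to cover day-of-week patterns. Both function names are hypothetical, and the whole-week rounding is a planning convention assumed here, not something the calculator is documented to do.

```python
from math import ceil

def estimated_days(n_control, n_variant, daily_eligible_users):
    """Days needed to collect both groups at the given daily eligible traffic."""
    if daily_eligible_users <= 0:
        raise ValueError("daily eligible users must be positive")
    return ceil((n_control + n_variant) / daily_eligible_users)

def round_up_to_full_weeks(days):
    """Extend the run to whole weeks so each weekday is sampled equally often."""
    return ceil(days / 7) * 7

days = estimated_days(14_700, 14_700, daily_eligible_users=2_000)
print(f"minimum run time: {days} days")
print(f"rounded to full weeks: {round_up_to_full_weeks(days)} days")
```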

Common mistakes that distort sample size planning

One common mistake is basing the baseline conversion rate on a short or noisy historical window. If the baseline jumps around from campaign to campaign, use a representative average rather than a single unusual week. Another mistake is confusing visitors with sessions or counting total traffic rather than eligible users. If only mobile users on a product page see the test, that is the denominator you should plan around.

Another frequent issue is setting an overly optimistic MDE. Teams may assume a 20% lift because that makes the math look easier, even though prior experiments suggest most changes produce only 2% to 8% relative improvement. When assumptions are unrealistic, the test plan becomes unreliable.

  • Do not change the primary metric after the test starts.
  • Do not ignore sample ratio mismatch if traffic is not splitting as expected.
  • Do not stop simply because a dashboard briefly shows significance.
  • Do not use the same traffic in multiple overlapping experiments without understanding interaction effects.
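A sample ratio mismatch check can be sketched as a chi-square goodness-of-fit test on the observed group sizes. The function name `srm_p_value` is hypothetical, and the threshold at which a team treats a mismatch as a real problem is a policy choice that this sketch does not settle.

```python
from math import erfc, sqrt

def srm_p_value(n_control, n_variant, expected_control_share=0.5):
    """p-value of a chi-square test (1 degree of freedom) on the traffic split."""
    total = n_control + n_variant
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_variant - expected_variant) ** 2 / expected_variant)
    # For 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))

print(f"10,000 vs 10,050: p = {srm_p_value(10_000, 10_050):.3f}")
print(f"10,000 vs 10,400: p = {srm_p_value(10_000, 10_400):.4f}")
```

A very small p-value suggests the split itself is off, in which case the assignment or logging pipeline deserves investigation before any conversion result is trusted.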

What authoritative sources say about power and sample size

If you want a deeper statistical foundation, several highly credible public resources explain hypothesis testing, power, and sample size determination. The NIST Engineering Statistics Handbook provides practical statistical guidance from a U.S. government standards body. Penn State’s online statistics materials at online.stat.psu.edu explain power and hypothesis testing clearly in an educational setting. For broader trial design concepts including error rates and planning considerations, the U.S. Food and Drug Administration also publishes extensive methodological material on statistical testing frameworks.

Although these sources are not written specifically for product managers running button color tests, the mathematical principles are the same. A/B testing in digital products still relies on core inference concepts: Type I error, Type II error, effect size, and variance estimation. The business context changes, but the statistical foundation does not.

When a classic fixed-horizon calculation is not enough

Some experimentation programs use sequential methods, Bayesian approaches, or always-valid inference frameworks instead of a fixed-horizon z-test. If that is your setup, the exact sample size planning process may differ. Sequential methods can allow continuous monitoring with adjusted decision rules, while Bayesian methods focus on posterior probabilities and expected loss rather than frequentist p-values. Even so, traffic planning still matters. You still need enough information to make a confident decision, and tiny effects still take a long time to detect.

For many organizations, the fixed-horizon approach remains a strong default because it is easy to explain, easy to audit, and consistent across teams. The key is to define the rules before launch and stick to them during execution.

Final takeaway

A/B test sample size calculation is not just a statistical formality. It is the planning mechanism that determines whether an experiment is credible, actionable, and worth running. A disciplined sample size process helps teams protect themselves from noisy wins, disappointing false starts, and wasted product cycles. If your result suggests the test is too large to be practical, that insight is valuable. It means you should revisit the hypothesis, target a bigger effect, improve traffic quality, or choose a more sensitive metric.

Use the calculator above as a decision tool, not just a number generator. Enter assumptions carefully, challenge unrealistic MDE targets, and align your statistical design with business value. That is how experimentation programs become both rigorous and fast.

Educational note: this calculator provides planning estimates based on a standard normal approximation for two proportions. For highly regulated decisions, extreme conversion rates, or complex experimental designs, consult a statistician.
