A/B Test Size Calculator

Estimate the sample size needed per variant before you launch an experiment. This calculator uses a standard two-proportion power analysis for conversion-rate tests and helps you balance confidence, power, and minimum detectable effect.

  • Baseline conversion rate: your current control conversion rate. Example: 5 means 5%.
  • Minimum detectable effect: the relative improvement you want to reliably detect. Example: 10 means +10% lift.
  • Confidence level: higher confidence reduces false positives but requires more traffic.
  • Statistical power: higher power reduces false negatives and usually increases sample size.
  • Variant allocation: 50 means an even split. Uneven splits usually increase total sample requirements.
  • Weekly eligible visitors: used to estimate runtime based on your available experiment traffic.

How an A/B test size calculator helps you plan stronger experiments

An A/B test size calculator estimates how many users, sessions, or visitors you need in each experiment variant before you can trust the outcome. In practice, this is one of the most important steps in experimentation because underpowered tests often create expensive confusion. Teams launch a test, wait a few days, see a tempting difference, and stop early. Later the result fails to replicate. The most common reason is simple: the experiment never had enough data to distinguish random noise from a meaningful change.

A proper sample size estimate solves that problem by linking your business assumptions to your statistical target. The calculator above uses baseline conversion rate, minimum detectable effect, confidence level, statistical power, and traffic allocation. Together, these inputs define how sensitive your experiment should be and how much evidence you need before declaring a result. If your baseline conversion rate is low, your minimum detectable lift is small, or your confidence and power are high, your required sample size will grow quickly.

For growth teams, product managers, UX researchers, and CRO specialists, this matters because traffic is limited. Every week spent on an oversized experiment has an opportunity cost, while every week spent on an undersized experiment can produce false wins or false losses. The right target is not “the smallest possible sample,” but the smallest sample that still gives decision-grade evidence.

What the calculator is actually measuring

In a classic website or product A/B test, you compare two proportions: the share of users who convert in the control group and the share who convert in the variant group. A conversion might be a purchase, signup, click, trial activation, appointment booking, or any binary outcome. The sample size calculation estimates the number of observations needed per group for a two-sample proportion test.

The most important terms are:

  • Baseline conversion rate: your current expected conversion probability in the control.
  • Minimum detectable effect (MDE): the smallest lift worth detecting, often expressed as a relative percentage.
  • Confidence level: equal to 1 minus alpha, where alpha is the probability of a false positive. For example, 95% confidence corresponds to alpha = 0.05.
  • Power: the probability that the test detects a real effect of your chosen size.
  • Allocation ratio: how traffic is split between control and variant.

The relationship between these variables is intuitive once you see it in business terms. If you want to detect a large improvement, you need less traffic because the signal is easier to see. If you want to detect a tiny lift, you need much more traffic. If you demand 99% confidence instead of 95%, you require stronger evidence. If you increase power from 80% to 90%, you are asking the experiment to be more reliable at finding a true effect, so sample size rises again.
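The relationship above can be made concrete with a short sketch. The page does not show the exact formula the calculator uses, so the code below assumes a common normal-approximation formula for a two-sided, two-proportion z-test with an even split; the function name `sample_size_per_group` and its parameters are illustrative, and other calculators (for example, ones that use a pooled variance) may return slightly different numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline, relative_lift, confidence=0.95, power=0.80):
    """Approximate users needed per group (even split, two-sided z-test).

    baseline and relative_lift are fractions: 0.05 means 5%, 0.10 means +10%.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)              # target variant rate
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)    # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)         # unpooled variance of the difference
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# 5% baseline, +10% relative lift, 95% confidence, 80% power
print(sample_size_per_group(0.05, 0.10))
```

Under these assumptions the example comes out at roughly 31,000 users per group. Raising confidence or power, or shrinking the lift, increases the result.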

Why small lifts are expensive to measure

Many organizations want to detect very small changes, such as a 2% relative uplift in conversion. That sounds reasonable, but the traffic burden can become massive. Suppose your baseline conversion rate is 5%. A 2% relative lift means moving from 5.00% to 5.10%, an absolute difference of just 0.10 percentage points. Detecting a change that small is difficult because natural variation in conversion data can easily hide it. By contrast, a 20% relative lift would move 5.00% to 6.00%, a full 1.00 percentage point difference, which is much easier to detect.

Baseline conversion | Relative lift | Variant conversion | Absolute difference | Interpretation
5.0% | 2% | 5.1% | 0.1 percentage points | Very subtle signal, typically requires large traffic volumes.
5.0% | 10% | 5.5% | 0.5 percentage points | Common planning target for conversion optimization programs.
5.0% | 20% | 6.0% | 1.0 percentage points | Easier to detect, often feasible with moderate traffic.
5.0% | 50% | 7.5% | 2.5 percentage points | Large effect, often visible faster if the change is truly impactful.
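To see how quickly the traffic burden changes across the rows of the table above, the sketch below converts each relative lift into a variant rate and an absolute difference, then attaches an approximate per-group sample size. It assumes 95% confidence, 80% power, an even split, and an unpooled normal-approximation formula; the helper `describe_lift` is hypothetical and is not the calculator's own code.

```python
from math import ceil
from statistics import NormalDist

BASELINE = 0.05
# Combined z-score for 95% confidence (two-sided) and 80% power
Z_TOTAL = NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.80)


def describe_lift(relative_lift, baseline=BASELINE):
    """Return (variant rate, absolute difference, approx. users per group)."""
    variant = baseline * (1 + relative_lift)
    diff = variant - baseline
    variance = baseline * (1 - baseline) + variant * (1 - variant)
    n = ceil(Z_TOTAL ** 2 * variance / diff ** 2)
    return variant, diff, n


for lift in (0.02, 0.10, 0.20, 0.50):
    variant, diff, n = describe_lift(lift)
    print(f"{lift:>4.0%} lift -> variant {variant:.2%}, "
          f"diff {diff * 100:.2f} pp, ~{n:,} per group")
```

Because the absolute difference is squared in the denominator, the 2% row needs hundreds of thousands of users per group under these assumptions, while the 50% row needs fewer than two thousand.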

Practical interpretation of confidence and power

Confidence level and power are often treated as technical settings, but they are really business tradeoffs. A 95% confidence level corresponds to alpha = 0.05: if there is truly no difference between variants, you accept about a 5% chance of calling a winner anyway. An 80% power target means that if the true effect is at least your chosen MDE, your test has at least an 80% chance of detecting it. These defaults are widely used because they balance caution and feasibility.

If your organization tests high-risk changes, such as pricing, policy, financial disclosures, or medical messaging, you may prefer stricter thresholds. If you are running low-risk UI experiments with abundant traffic, you may still use 95% confidence and 80% power for consistency. The key is not to pick thresholds arbitrarily; choose them in a way that matches the cost of a wrong decision.

Typical planning assumptions used by experimentation teams

Scenario | Confidence | Power | Common use case | Tradeoff
Lean experimentation | 90% | 80% | Fast-moving product iterations with moderate risk. | Lower sample size, higher tolerance for false positives.
Standard optimization | 95% | 80% | Default choice for most web and product A/B tests. | Balanced rigor and speed.
High assurance testing | 95% | 90% | Revenue-critical or customer-sensitive experiments. | More traffic required, fewer false negatives.
Very strict evidence | 99% | 90% | High-stakes decisions where acting on noise is costly. | Sample size can become very large.
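The traffic cost of each scenario can be compared without fixing a baseline or MDE. In the usual normal-approximation formula, sample size is proportional to the squared sum of the two z-scores, so the ratio of those terms gives a rough multiplier against the 95% / 80% default. This is a sketch under that assumption (two-sided test); the names `z_factor` and `relative_sample_size` are illustrative.

```python
from statistics import NormalDist

SCENARIOS = {
    "Lean experimentation": (0.90, 0.80),
    "Standard optimization": (0.95, 0.80),
    "High assurance testing": (0.95, 0.90),
    "Very strict evidence": (0.99, 0.90),
}


def z_factor(confidence, power):
    """Squared sum of z-scores; sample size scales with this term."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) ** 2


def relative_sample_size(confidence, power, reference=(0.95, 0.80)):
    """Sample size multiplier versus the reference confidence/power pair."""
    return z_factor(confidence, power) / z_factor(*reference)


for name, (confidence, power) in SCENARIOS.items():
    multiplier = relative_sample_size(confidence, power)
    print(f"{name}: {multiplier:.2f}x the standard sample")
```

Under this approximation the lean setting needs about 0.79x the standard sample, high assurance about 1.34x, and very strict evidence about 1.9x.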

How to choose a realistic minimum detectable effect

The minimum detectable effect is one of the most misunderstood inputs. It should not be the lift you hope for; it should be the smallest lift that would justify implementing the variant. If a design change would only be worth shipping if it raises signup rate by at least 8% in relative terms, then 8% is a more sensible planning target than 1% or 2%.

A strong way to select MDE is to combine economics and prior evidence:

  1. Estimate the value of one additional conversion.
  2. Measure the implementation cost and the cost of delayed decisions.
  3. Review historic A/B tests to see what lifts are common for similar changes.
  4. Set your MDE at the smallest effect that still produces meaningful business impact.
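One way to turn the steps above into a number is a simple break-even calculation. The model below is an assumption for illustration, not something the article prescribes: it treats the smallest worthwhile relative lift as the implementation cost divided by the value of baseline conversions over a chosen time horizon. All names and figures are hypothetical, and the sketch ignores the cost of delayed decisions and the evidence from historic tests.

```python
def break_even_relative_lift(implementation_cost, visitors_in_horizon,
                             baseline, value_per_conversion):
    """Relative lift at which extra conversion value equals implementation cost.

    Assumed model: extra value = visitors * baseline * lift * value_per_conversion.
    """
    baseline_value = visitors_in_horizon * baseline * value_per_conversion
    if baseline_value <= 0:
        raise ValueError("baseline value must be positive")
    return implementation_cost / baseline_value


# Hypothetical inputs: 20,000 cost, 1,000,000 visitors over the horizon,
# 5% baseline conversion, 40 value per conversion
lift = break_even_relative_lift(20_000, 1_000_000, 0.05, 40)
print(f"Break-even relative lift: {lift:.1%}")
```

A break-even figure like this is only a floor. The planning MDE would normally sit above it, at the smallest lift that still produces meaningful business impact.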

If your MDE is too small, your tests run forever. If it is too large, you may miss worthwhile but modest wins. That is why many mature programs define several experiment tiers, such as navigation tests, pricing tests, and onboarding tests, each with different expected impact ranges.

Why uneven traffic allocation usually costs more

Some teams prefer sending only 10% or 20% of traffic to a variant to reduce exposure. That can be sensible for risky tests, but it reduces statistical efficiency. The most traffic-efficient split for a standard two-variant test is usually 50/50. Once you push allocation away from balance, total required traffic tends to increase because one arm becomes starved of data. The calculator above incorporates this by estimating sample needs under your chosen allocation ratio and then reporting both per-group and total requirements.
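The efficiency loss from an uneven split can be sketched as follows. The page does not show how the calculator handles allocation, so this assumes a two-sided z-test with unpooled variance, where each arm's variance is divided by its share of traffic. The function name `sample_sizes_for_split` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_sizes_for_split(baseline, relative_lift, variant_share=0.5,
                           confidence=0.95, power=0.80):
    """Return (control users, variant users, total users) for a given split."""
    if not 0 < variant_share < 1:
        raise ValueError("variant_share must be between 0 and 1")
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z = (NormalDist().inv_cdf(1 - (1 - confidence) / 2)
         + NormalDist().inv_cdf(power))
    # Each arm's variance is inflated by the inverse of its traffic share
    variance_term = (p1 * (1 - p1) / (1 - variant_share)
                     + p2 * (1 - p2) / variant_share)
    total = z ** 2 * variance_term / (p2 - p1) ** 2
    control_n = ceil(total * (1 - variant_share))
    variant_n = ceil(total * variant_share)
    return control_n, variant_n, control_n + variant_n


for share in (0.5, 0.2, 0.1):
    control_n, variant_n, total_n = sample_sizes_for_split(0.05, 0.10, share)
    print(f"{share:.0%} to variant: control {control_n:,}, "
          f"variant {variant_n:,}, total {total_n:,}")
```

With a 5% baseline and a 10% relative lift, moving from a 50/50 split to a 90/10 split multiplies the total requirement by well over two under these assumptions, which is the cost of starving one arm of data.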

Common mistakes that damage sample size planning

  • Stopping early: peeking daily and ending a test when the result looks favorable inflates error rates.
  • Using total site traffic instead of eligible traffic: only visitors who truly enter the test should count.
  • Ignoring seasonality: weekends, holidays, campaigns, and promotions can shift baseline behavior.
  • Planning for too many metrics: multiple comparisons can raise your false positive risk.
  • Overestimating baseline rate: if your real baseline is lower than planned, you may underpower the test.
  • Confusing relative and absolute lift: a 10% relative lift from 5% is 5.5%, not 15%.
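The first mistake in the list, stopping early, can be demonstrated with a small A/A simulation in which both arms share the same true conversion rate, so every "significant" result is a false positive. This is a minimal sketch: the rate, number of looks, users per look, and run count are arbitrary illustrative values, and the test is a plain pooled two-proportion z-test at the 5% level.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_stat(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic; 0.0 when there is no variance."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se


def simulate_aa_tests(rate=0.10, looks=10, users_per_look=200, runs=300, seed=1):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)
    peeking_hits = fixed_hits = 0
    for _run in range(runs):
        conv_a = conv_b = n = 0
        crossed_any = significant = False
        for _look in range(looks):
            # Add one batch of users to each arm, then test the running totals
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            significant = abs(z_stat(conv_a, n, conv_b, n)) > z_crit
            crossed_any = crossed_any or significant
        peeking_hits += crossed_any   # stopped at the first "win" at any look
        fixed_hits += significant     # judged only at the planned end
    return peeking_hits / runs, fixed_hits / runs


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"False positives with peeking: {peeking_rate:.1%}")
print(f"False positives at planned end only: {fixed_rate:.1%}")
```

Any run that is significant at the final look also counts as a peeking hit, so the peeking rate can never be lower than the fixed-horizon rate, and in practice it is usually much higher.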

How to use the calculator step by step

  1. Enter your current baseline conversion rate as a percentage.
  2. Choose the minimum detectable lift you care about as a relative percentage.
  3. Select confidence level and power based on your organization’s tolerance for risk.
  4. Set your variant allocation percentage. Use 50% when efficiency matters most.
  5. Enter weekly eligible visitors to estimate duration.
  6. Click calculate and review required sample per group, total sample, estimated runtime, and target variant rate.
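The runtime estimate in the final step is simple arithmetic: total required sample divided by weekly eligible visitors. The sketch below assumes every eligible visitor enters the experiment, and it rounds up to whole weeks so that each day of the week is represented equally; that rounding is a planning choice made here, not something the page specifies. The function name and the example figures are hypothetical.

```python
from math import ceil


def estimated_runtime_weeks(total_sample, weekly_eligible_visitors):
    """Return (exact weeks of traffic needed, whole weeks to plan for)."""
    if weekly_eligible_visitors <= 0:
        raise ValueError("weekly_eligible_visitors must be positive")
    exact_weeks = total_sample / weekly_eligible_visitors
    return exact_weeks, ceil(exact_weeks)


# Hypothetical example: 62,462 total users needed, 20,000 eligible visitors per week
exact, whole = estimated_runtime_weeks(62_462, 20_000)
print(f"{exact:.1f} weeks of traffic, so plan for {whole} full weeks")
```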

If the runtime is too long, do not immediately lower confidence or power. First ask whether the MDE is realistic, whether the audience can be broadened responsibly, or whether the experiment should run on a higher-frequency funnel event before being validated on a downstream business metric.

Interpreting results responsibly

Sample size calculators are planning tools, not guarantees. Real-world data can deviate from assumptions. Your actual conversion rate may vary during the test, and implementation quality or segmentation can alter the effect size. Use the estimate as a disciplined starting point, then monitor test health carefully. Keep randomization clean, ensure analytics tags fire consistently, and avoid introducing major confounders while the experiment is active.

It is also worth remembering that statistical significance is not the same as business significance. A tiny statistically significant improvement may not justify engineering effort, support complexity, or long-term design debt. Likewise, a result that narrowly misses significance but points to a high-value opportunity may deserve a follow-up test with sharper targeting, better instrumentation, or a stronger treatment.

Final takeaway

Good experimentation is not just about creative ideas. It is about planning those ideas with enough rigor to make trustworthy decisions. An A/B test size calculator helps you do exactly that by translating business goals into traffic requirements. When you choose a realistic MDE, maintain a clear confidence and power standard, and match your runtime to real eligible traffic, you dramatically improve the quality of your experimentation program.

Use the calculator above before every major test. It will help you avoid underpowered experiments, unrealistic schedules, and premature conclusions while giving stakeholders a clear expectation for how long the test should run and what kind of uplift it can genuinely detect.
