A/B Sample Size Calculator

Estimate how many users you need in each variant before launching an A/B test. This calculator is designed for conversion rate experiments and helps you balance baseline rate, minimum detectable effect, confidence level, and statistical power.

  • Baseline conversion rate: Current control conversion rate. Example: 10 means 10%.
  • MDE type: Choose whether your target uplift is relative or absolute.
  • Minimum detectable effect: Example: 10 relative means +10%; 1 absolute means +1 percentage point.
  • Confidence level: Internally this uses alpha = 1 - confidence, two-sided testing.
  • Power: Higher power reduces false negatives but increases required traffic.
  • Traffic split: 50 means a balanced 50/50 split. Uneven allocation needs more total traffic.
  • Test label: Optional label used in the result summary.

Expert Guide to Using an A/B Sample Size Calculator

An A/B sample size calculator helps you answer one of the most important questions in experimentation: how much traffic do you need before a test can reliably detect a real effect? If you launch an experiment without enough users, you may stop too early, miss a real win, or trust a result that is mostly noise. If you overestimate the needed sample, you may slow down decision-making and keep valuable product changes waiting longer than necessary. This is why sample size planning is a core part of conversion optimization, product analytics, growth marketing, and user research.

For a classic A/B test, you compare two versions of a page, message, checkout flow, pricing presentation, or product experience. Version A is your control, and version B is your variation. The calculator on this page estimates how many observations you need in each group for a two-sided test of two conversion rates. It uses your baseline conversion rate, your minimum detectable effect, your significance level, your desired power, and your traffic split to compute a practical target sample size.

What sample size means in A/B testing

Sample size is the number of users, sessions, visitors, or eligible exposures required in each arm of the experiment. For conversion rate tests, each user is usually treated as a Bernoulli outcome: either they convert or they do not. The more variable the outcome and the smaller the effect you want to detect, the larger the sample you need. This is why small uplifts can demand surprisingly high traffic.

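Key principle: the smaller the expected improvement, the larger the sample size required to detect it with confidence. Required sample grows roughly with the inverse square of the absolute difference, so halving the detectable difference about quadruples the traffic you need. Detecting a 1% relative lift is far harder than detecting a 20% relative lift.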

The five inputs that matter most

  1. Baseline conversion rate: This is the expected performance of your control experience. If your current signup rate is 10%, that becomes the starting point for the calculation.
  2. Minimum detectable effect, or MDE: This is the smallest improvement you care enough to act on. It may be entered as a relative lift, such as +10%, or as an absolute increase, such as +1 percentage point.
  3. Significance level: Often represented by alpha, this reflects how much false-positive risk you accept. A 95% confidence level corresponds to alpha = 0.05 for a two-sided test.
  4. Power: Power is the probability that your test will detect a real effect of at least your MDE. Many teams use 80% or 90% power.
  5. Allocation ratio: A balanced 50/50 traffic split is generally the most sample-efficient approach. Uneven splits require more total observations.

Why baseline conversion rate changes the answer

Baseline conversion rate strongly influences sample size because binomial variance, p(1 - p), depends on the rate itself. Variance peaks at a 50% rate and shrinks toward 0% and 100%. That does not make low-baseline tests cheap, however. When the MDE is relative, a low baseline implies a very small absolute difference, and required sample grows with the inverse square of that difference. A test with a 2% baseline and a 10% relative MDE has to detect a gap of only 0.2 percentage points, which can require an enormous audience. That is why using a calculator is better than relying on intuition.

Baseline rate | Relative lift target | Expected variant rate | Typical implication
2.0% | 10% | 2.2% | Very small absolute change, usually requires a large sample
5.0% | 10% | 5.5% | Moderate traffic may still be needed because the difference is only 0.5 points
10.0% | 10% | 11.0% | Often more feasible for growth teams with steady weekly traffic
20.0% | 10% | 22.0% | Larger absolute movement, usually easier to detect than tiny low-rate shifts
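To put rough numbers on the rows above, the sketch below computes a per-arm sample size for each baseline at a 10% relative lift. It assumes 95% confidence, 80% power, a 50/50 split, and the common unpooled-variance normal approximation; the function name `per_arm_sample_size` is illustrative, and the calculator on this page may use a slightly different variant of the formula.

```python
from math import ceil
from statistics import NormalDist


def per_arm_sample_size(baseline, relative_lift, confidence=0.95, power=0.80):
    """Approximate users per arm for a two-sided test of two proportions.

    Assumption: unpooled-variance normal approximation with equal allocation.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


for baseline in (0.02, 0.05, 0.10, 0.20):
    n = per_arm_sample_size(baseline, 0.10)
    print(f"{baseline:.0%} baseline, +10% relative lift: {n:,} users per arm")
```

Under these assumptions the required sample falls steadily as the baseline rises, because the same relative lift translates into a larger absolute gap.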

Relative versus absolute MDE

Many teams confuse relative and absolute effect sizes. Suppose your baseline conversion rate is 10%. A 10% relative lift means your target rate is 11%, because 10% multiplied by 1.10 equals 11%. A 1 percentage point absolute lift also leads to 11%, but only in this specific case. At a 2% baseline, a 10% relative lift means 2.2%, whereas a 1 percentage point absolute lift means 3.0%, which is much larger. Always make sure the MDE you enter reflects the real business threshold for action.
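The conversion described above can be sketched in a few lines. The helper name `target_rate` and its `mode` parameter are hypothetical, and rates are expressed as fractions (0.10 for 10%) rather than the percentages entered in the form.

```python
def target_rate(baseline, mde, mode="relative"):
    """Return the expected variant conversion rate implied by an MDE.

    baseline and the result are fractions (0.10 means 10%).
    For mode="relative", mde is a fractional lift (0.10 means +10%).
    For mode="absolute", mde is in rate units (0.01 means +1 percentage point).
    """
    if mode == "relative":
        return baseline * (1 + mde)
    if mode == "absolute":
        return baseline + mde
    raise ValueError("mode must be 'relative' or 'absolute'")


# The two definitions agree at a 10% baseline and diverge at a 2% baseline.
print(target_rate(0.10, 0.10, "relative"), target_rate(0.10, 0.01, "absolute"))
print(target_rate(0.02, 0.10, "relative"), target_rate(0.02, 0.01, "absolute"))
```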

Confidence level and false positives

Most A/B testing teams use a 95% confidence level, which corresponds to a 5% significance level in a two-sided setting. That does not mean there is only a 5% chance your specific result is wrong. Rather, it means that if there were actually no difference between A and B and you repeated this process many times, about 5% of those tests would show a statistically significant difference by chance alone. A stricter threshold such as 99% lowers false-positive risk, but it increases the required sample size.
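As a rough illustration of why a stricter threshold costs traffic, the sketch below converts a confidence level into a two-sided critical value and compares the resulting sample multiplier against a 95% confidence, 80% power reference. The function names are illustrative, and the multiplier assumes the usual normal-approximation formula in which sample size scales with the square of the summed critical values.

```python
from statistics import NormalDist


def two_sided_critical_value(confidence):
    """z value that leaves alpha / 2 in each tail, where alpha = 1 - confidence."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def relative_sample_cost(confidence, power, ref_confidence=0.95, ref_power=0.80):
    """Sample size relative to a reference setting, all other inputs held fixed."""
    def z_sum_squared(conf, pwr):
        return (two_sided_critical_value(conf) + NormalDist().inv_cdf(pwr)) ** 2

    return z_sum_squared(confidence, power) / z_sum_squared(ref_confidence, ref_power)


for confidence in (0.90, 0.95, 0.99):
    z = two_sided_critical_value(confidence)
    cost = relative_sample_cost(confidence, 0.80)
    print(f"confidence {confidence:.0%}: z = {z:.3f}, relative sample = {cost:.2f}x")
```

Under these assumptions, moving from 95% to 99% confidence at 80% power needs roughly 1.5 times the sample.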

Authoritative statistical references from public institutions can help if you want to go deeper into Type I and Type II errors, power, and sample planning. For example, the NIST Engineering Statistics Handbook provides high-quality explanations of hypothesis testing concepts. Penn State’s STAT program also offers clear educational material on significance and power. For broader evidence-based research methods, the National Library of Medicine is another trusted public source.

Power and false negatives

Power measures your ability to detect a true effect. If your test has 80% power and the real effect equals your chosen MDE, the experiment should detect it about 80% of the time in repeated testing; larger true effects are detected more often, and smaller ones less often. Low power raises the chance of false negatives. This is one reason underpowered tests are so dangerous: they often produce inconclusive outcomes, encourage repeat testing, and waste traffic.
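To see how power depends on traffic, the sketch below estimates the power of a two-sided test for a given number of users per arm. It assumes equal allocation and an unpooled normal approximation, and it ignores the negligible chance of a significant result in the wrong direction; `approximate_power` is an illustrative name, not part of the calculator.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(p1, p2, n_per_arm, confidence=0.95):
    """Approximate power to detect p2 versus p1 with n_per_arm users in each arm."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Standard error of the difference between the two observed rates.
    standard_error = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm)
    return NormalDist().cdf(abs(p2 - p1) / standard_error - z_alpha)


# 10% baseline versus an 11% variant at several traffic levels.
for n in (2_000, 5_000, 15_000, 30_000):
    print(f"{n:>6,} users per arm: power is about {approximate_power(0.10, 0.11, n):.0%}")
```

With only a few thousand users per arm, this example falls well short of the 80% convention, which is the underpowered situation described above.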

Setting | Common choice | Reason teams use it | Trade-off
Confidence level | 95% | Balanced standard for controlling false positives | Needs more sample than 90%
Power | 80% | Widely accepted practical baseline | Higher false-negative risk than 90%
Power | 90% | Better detection of true wins | Requires meaningfully more traffic
Traffic split | 50/50 | Most efficient allocation for fixed total traffic | Less traffic left for the current winner during the test

Why a 50/50 split is usually best

When your goal is to detect a difference between two variants, equal allocation is statistically efficient. If you send 80% of traffic to the control and 20% to the variation, the smaller arm becomes the bottleneck. That does not mean unequal splits are never valid. Sometimes you intentionally protect revenue by limiting the exposure of a riskier variant. Just remember that an uneven allocation generally increases total sample size and may lengthen the test duration.
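The cost of an uneven split can be approximated with a simple inflation factor. The sketch below assumes both arms have roughly the same outcome variance, which is reasonable when the two conversion rates are close; under that assumption the total sample scales by 1 / (4w(1 - w)), where w is the share of traffic sent to one arm. The function name is illustrative.

```python
def allocation_inflation(control_share):
    """Total-sample multiplier relative to a 50/50 split.

    Assumption: both arms have approximately equal variance.
    control_share is the fraction of traffic sent to control (0.8 means 80%).
    """
    if not 0 < control_share < 1:
        raise ValueError("control_share must be strictly between 0 and 1")
    return 1 / (4 * control_share * (1 - control_share))


for share in (0.5, 0.6, 0.7, 0.8, 0.9):
    factor = allocation_inflation(share)
    print(f"{share:.0%}/{1 - share:.0%} split: {factor:.2f}x the total sample of 50/50")
```

Under this approximation an 80/20 split needs about 56% more total traffic than a balanced split, while a 60/40 split costs only a few percent.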

How the calculator works

This calculator estimates sample size for a two-proportion z-test. It takes the baseline conversion rate for version A, derives the expected conversion rate for version B based on your MDE, and uses standard normal critical values for alpha and power. It then adjusts the required observations for the selected traffic split. The result is an estimated number of observations needed in each arm and in total. The chart visualizes the expected conversion rates and the required audience by group.
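The steps described above can be sketched as follows. This is a minimal illustration rather than the page's actual implementation: it assumes an unpooled-variance normal approximation, rates given as fractions, and a hypothetical function name and return shape. Other calculators use a pooled variance under the null hypothesis or a continuity correction, so their results can differ slightly.

```python
from math import ceil
from statistics import NormalDist


def ab_sample_size(baseline, mde, mde_type="relative",
                   confidence=0.95, power=0.80, control_share=0.5):
    """Approximate sample sizes for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + mde) if mde_type == "relative" else baseline + mde
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("rates must lie strictly between 0 and 1 and must differ")
    if not 0 < control_share < 1:
        raise ValueError("control_share must be strictly between 0 and 1")

    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided alpha
    z_beta = NormalDist().inv_cdf(power)

    # Variance of the rate difference, per unit of total sample, for this split.
    variance = p1 * (1 - p1) / control_share + p2 * (1 - p2) / (1 - control_share)
    total = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

    n_control = ceil(total * control_share)
    n_variant = ceil(total * (1 - control_share))
    return {"control": n_control, "variant": n_variant,
            "total": n_control + n_variant, "variant_rate": p2}


print(ab_sample_size(0.10, 0.10))                      # balanced split
print(ab_sample_size(0.10, 0.10, control_share=0.8))   # 80/20 split
```

For a 10% baseline and a 10% relative MDE at 95% confidence and 80% power, this sketch lands a little under 15,000 users per arm with a balanced split, and the 80/20 split needs noticeably more in total.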

How to interpret the result

  • Per-variant sample size: the target number of observations needed in each group under your allocation plan.
  • Total sample size: the sum across both groups.
  • Expected variant rate: the conversion rate implied by your MDE.
  • Estimated lift: the relative improvement of version B over the control that the experiment is powered to detect.

These outputs help you decide whether the test is practical. If the required sample is much higher than your expected weekly traffic, you may need to choose a larger MDE, improve targeting, reduce test complexity, or combine low-volume pages into a broader experiment. Sometimes the right choice is not to test at all if the potential business gain is too small relative to traffic constraints.

Common mistakes that lead to bad sample size decisions

  1. Using an unrealistic MDE: Teams often choose tiny uplifts because they sound attractive, but those uplifts may be impossible to detect within a reasonable timeframe.
  2. Ignoring seasonality or campaign swings: Traffic quality can shift over time, which affects the observed baseline and test duration.
  3. Stopping early after a promising trend: Early peeking can inflate false positives if not handled with a proper sequential testing method.
  4. Changing the primary metric mid-test: This undermines the validity of the original power calculation.
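  5. Using sessions instead of users without thinking through repeat behavior: If repeat visitors are common, sessions from the same person are correlated, so the independence assumption behind the calculation is weakened and the effective sample is smaller than the session count suggests.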
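The early-peeking problem in item 3 can be illustrated with a small simulation of A/A tests, where no real difference exists. The sketch below is a simplification: instead of simulating individual users, it assumes each interim batch contributes a standard normal increment to the test statistic (a large-sample approximation), and it declares a "win" if any look crosses the fixed-horizon threshold. The function name and parameters are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rates(looks=10, simulations=2000, confidence=0.95, seed=0):
    """Compare false-positive rates with and without peeking, under no true effect."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(simulations):
        running_sum = 0.0
        crossed_at_any_look = False
        z = 0.0
        for look in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)   # one equally sized batch of data
            z = running_sum / sqrt(look)         # z statistic on all data so far
            if abs(z) > z_crit:
                crossed_at_any_look = True
        peeking_hits += crossed_at_any_look      # stopped at the first "significant" look
        fixed_hits += abs(z) > z_crit            # judged only at the planned end
    return peeking_hits / simulations, fixed_hits / simulations


peeking, fixed = false_positive_rates()
print(f"stop at first significant look: {peeking:.1%} false positives")
print(f"single look at planned end:     {fixed:.1%} false positives")
```

In this setup the fixed-horizon rate stays near the nominal 5%, while checking at every look and stopping on the first significant result pushes the false-positive rate well above it.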

Practical recommendations for experiment teams

  • Choose an MDE tied to real business value, not just statistical curiosity.
  • Use 95% confidence and 80% to 90% power as a reasonable starting point for most product and marketing tests.
  • Favor 50/50 allocation unless risk management clearly justifies a different split.
  • Estimate test duration before launch by dividing the total required sample (both arms combined) by average daily eligible traffic.
  • Document your assumptions so stakeholders understand what the result means and what it does not mean.
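The duration estimate in the list above is simple arithmetic, sketched below. The function name is hypothetical, and the option to round up to whole weeks is an added assumption (a common way to cover full weekday and weekend cycles), not something the calculator is stated to do.

```python
from math import ceil


def estimated_test_days(total_sample, daily_eligible_users, round_to_full_weeks=True):
    """Days needed to collect total_sample users across both arms."""
    if daily_eligible_users <= 0:
        raise ValueError("daily_eligible_users must be positive")
    days = ceil(total_sample / daily_eligible_users)
    if round_to_full_weeks:
        # Assumption: run whole weeks so every day of the week is represented equally.
        days = ceil(days / 7) * 7
    return days


# Example: about 29,500 users in total, 2,000 eligible users per day.
print(estimated_test_days(29_500, 2_000, round_to_full_weeks=False))
print(estimated_test_days(29_500, 2_000))
```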

Example planning workflow

Imagine your current signup conversion rate is 10%. Your team says any uplift smaller than 10% relative is not large enough to justify engineering effort. You choose 95% confidence and 80% power with a 50/50 split. In this setup, the expected variant conversion rate becomes 11%. The calculator estimates the users needed in each arm to reliably detect that change. If the result implies a six-week runtime but your team can only wait two weeks, you now have an informed trade-off: increase the MDE threshold, expand traffic, or postpone the experiment.
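One way to explore the "increase the MDE threshold" option is to invert the calculation: given the users per arm you can collect in the time available, find the smallest relative lift the test is powered to detect. The sketch below does this by bisection, assuming the unpooled normal approximation, a 50/50 split, and illustrative function names; the traffic figure in the example is made up.

```python
from math import ceil
from statistics import NormalDist


def per_arm_sample_size(p1, p2, confidence=0.95, power=0.80):
    """Approximate users per arm (unpooled normal approximation, equal split)."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def smallest_relative_mde(baseline, users_per_arm, confidence=0.95, power=0.80):
    """Smallest relative lift detectable with users_per_arm, found by bisection."""
    low = 1e-4                                    # lift assumed too small to detect
    high = (1 - baseline) / baseline * 0.999      # keeps the variant rate below 100%
    if per_arm_sample_size(baseline, baseline * (1 + high), confidence, power) > users_per_arm:
        raise ValueError("not enough traffic to detect any lift at this baseline")
    for _ in range(60):
        mid = (low + high) / 2
        needed = per_arm_sample_size(baseline, baseline * (1 + mid), confidence, power)
        if needed > users_per_arm:
            low = mid      # lift too small for the available traffic
        else:
            high = mid     # detectable, so try a smaller lift
    return high


# Hypothetical budget: two weeks of traffic yields about 5,000 users per arm.
mde = smallest_relative_mde(0.10, 5_000)
print(f"smallest detectable relative lift: about {mde:.1%}")
```

The bisection works because the required sample decreases steadily as the target lift grows, so there is a single crossover point between "too small to detect" and "detectable" for a given budget.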

When this calculator is most useful

An A/B sample size calculator is ideal for binary outcomes such as signup conversion, purchase conversion, lead form completion, click-through to a key step, or feature adoption. It is less appropriate for continuous metrics like revenue per visitor unless the underlying model is changed. It also assumes a standard fixed-horizon design. If your experimentation platform uses Bayesian methods or sequential monitoring, the logic for planning may differ.

Final takeaway

Strong experimentation starts before the first visitor enters the test. By setting a realistic baseline, choosing a meaningful MDE, and planning around confidence and power, you dramatically improve the quality of your decisions. A good A/B sample size calculator does more than produce a number. It forces clear thinking about what counts as a win, how much uncertainty you can tolerate, and whether your traffic can support the question you want to answer.
