A/B Test Calculate Sample Size

Use this sample size calculator to estimate how many visitors you need in each variation before launching an A/B test. Enter your baseline conversion rate, minimum detectable effect, confidence level, and statistical power to get a practical estimate for experiments on landing pages, checkout flows, email campaigns, and product experiences.

Calculator Inputs

Baseline conversion rate: Your current conversion rate for the control variant.
Minimum detectable effect: Relative lift you want to reliably detect.
Daily traffic: Used to estimate how many days the test may need to run.
This calculator estimates the required sample size for comparing two conversion rates in a classic fixed-horizon A/B test using a normal approximation. It is best for directional planning before an experiment goes live.

Results

Enter your assumptions and click Calculate Sample Size to see the required visitors per variant, expected conversions, and estimated test duration.

Expert Guide: How to Calculate Sample Size for an A/B Test

When marketers, product managers, growth teams, and UX researchers ask how to calculate A/B test sample size, they are really asking a deeper question: “How much evidence do we need before we can trust the outcome?” Sample size is the foundation of valid experimentation. If your test is too small, random noise can easily look like a win or hide a real improvement. If your test is too large, you may waste traffic, time, and revenue opportunities. Getting sample size right is one of the most important decisions in experimentation design.

An A/B sample size calculation estimates how many users must see the control and variant before you can detect a meaningful difference with a chosen level of confidence. In conversion testing, that difference is usually measured between two proportions, such as signup rate, click-through rate, lead form completion, or checkout completion. The sample requirement depends on four major factors: your baseline conversion rate, your minimum detectable effect, your significance threshold, and your desired statistical power.

Why sample size matters so much

Suppose your current landing page converts at 10%. You launch a new design and hope it improves conversion by 15% relative, which would move the rate to 11.5%. That sounds small, but in statistical terms it can require thousands of visitors per variation to detect reliably. The reason is simple: binary outcomes are noisy. Any single visitor either converts or does not. To separate a true lift from chance variation, you need enough observations for the signal to become visible.

  • Too little traffic creates underpowered tests that often return inconclusive results.
  • Too much traffic can delay decisions and expose too many users to weaker experiences.
  • Correct sample sizing helps teams plan test duration, reduce false conclusions, and prioritize high-impact experiments.
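To put numbers on that noise, here is a rough sketch (the helper name is illustrative and not part of the calculator). It computes the standard error of the difference between two observed conversion rates for the 10% versus 11.5% example and compares it with the 1.5-point lift being sought.

```python
import math

def std_error_of_difference(p1, p2, n_per_variant):
    """Standard error of (observed rate B - observed rate A) with n visitors each."""
    return math.sqrt(p1 * (1 - p1) / n_per_variant + p2 * (1 - p2) / n_per_variant)

p1, p2 = 0.10, 0.115          # control rate and hoped-for variant rate
lift = p2 - p1                # 1.5 percentage points

for n in (500, 2000, 7000):
    se = std_error_of_difference(p1, p2, n)
    print(f"n={n:>5} per variant: noise (1 SE) = {se:.4f}, lift / noise = {lift / se:.2f}")
```

At 500 visitors per variant the noise is larger than the lift itself. Near 7,000 the lift is roughly 2.8 standard errors, which is about what 95% confidence combined with 80% power requires.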

The key inputs in an A/B test sample size calculation

Every sound calculator asks for a small set of assumptions. Understanding these inputs is more important than memorizing any formula.

  1. Baseline conversion rate: This is the expected conversion rate for the control group. If your current product page converts at 4%, use 4% as the baseline. Historical analytics data, not intuition, should drive this number.
  2. Minimum detectable effect, or MDE: This is the smallest uplift worth detecting. If your baseline is 10% and your MDE is 15% relative, your target variant rate becomes 11.5%. Smaller effects require larger samples.
  3. Confidence level: Often set to 95%, this controls the probability of a false positive. At 95% confidence, the implied significance level is 5%.
  4. Statistical power: Commonly 80% or 90%, this controls your ability to detect a real effect if one exists. Higher power requires more traffic.
  5. Traffic split: A balanced 50/50 split usually minimizes the total sample size. Uneven allocation increases the number of total users needed.

The practical formula behind the calculator

For a two-sample comparison of conversion rates, many planning calculators use a normal approximation for two proportions. In simplified form, the required sample per group can be estimated from the baseline rate p1, the expected variant rate p2, the pooled rate, the z-score tied to your significance threshold, and the z-score tied to your desired power. The effect size is the absolute difference p2 – p1. Required sample size scales roughly with the inverse square of this difference, so halving the detectable effect roughly quadruples the traffic needed.
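One common textbook form of this approximation is written out below as a sketch. The page does not show the calculator's exact formula, so the pooled-variance form and the symbols are assumptions: n is the sample per group, alpha is one minus the confidence level, and beta is one minus the power.

```latex
n = \frac{\left( z_{1-\alpha/2}\,\sqrt{2\,\bar{p}\,(1-\bar{p})}
        + z_{1-\beta}\,\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^{2}}
         {(p_2 - p_1)^{2}},
\qquad \bar{p} = \frac{p_1 + p_2}{2}
```

With p1 = 0.10, p2 = 0.115, 95% confidence (z = 1.960) and 80% power (z = 0.842), this gives roughly 6,700 visitors per group.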

This is why tiny expected uplifts are expensive to test. Detecting a 1% relative improvement is far harder than detecting a 20% relative improvement. In many businesses, the right question is not only “What can we detect?” but “What change would actually matter financially?” If a tiny uplift would not justify engineering effort or design cost, there is little value in sizing a test around it.
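The relationship between effect size and traffic can be sketched in a few lines of Python. This is a minimal illustration that assumes the pooled two-proportion normal approximation with a two-sided test and a 50/50 split. The function name and defaults are hypothetical, not the calculator's actual implementation.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    """Approximate visitors per variant for a two-sided, fixed-horizon test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)          # expected variant rate
    p_bar = (p1 + p2) / 2                       # pooled rate under equal allocation
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# Compare a tiny, a moderate, and a large relative lift on a 10% baseline.
for mde in (0.01, 0.05, 0.20):
    n = sample_size_per_variant(0.10, mde)
    print(f"{mde:.0%} relative lift: {n:,} visitors per variant")
```

Under these assumptions, the 1% lift needs several hundred times more traffic than the 20% lift.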

Reference z-scores used in experimentation planning

Assumption | Typical setting | Approximate z-score | Meaning in practice
Confidence level | 90% two-sided | 1.645 | Less strict than 95%, needs somewhat less traffic but raises false-positive risk.
Confidence level | 95% two-sided | 1.960 | The most common default for business experimentation.
Confidence level | 99% two-sided | 2.576 | Stricter threshold, useful where decision errors are costly.
Power | 80% | 0.842 | Widely used baseline for product and marketing tests.
Power | 90% | 1.282 | Improves sensitivity, but increases required sample size.
Power | 95% | 1.645 | Used when missing a true effect would be expensive.
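These z-scores do not need to be memorized. As a small sketch, they can be derived from the standard normal distribution using Python's standard library (the two helper names are illustrative):

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """Two-sided critical value: splits alpha evenly between both tails."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def z_for_power(power):
    """One-sided quantile corresponding to the desired power."""
    return NormalDist().inv_cdf(power)

for c in (0.90, 0.95, 0.99):
    print(f"confidence {c:.0%}: z = {z_for_confidence(c):.3f}")
for p in (0.80, 0.90, 0.95):
    print(f"power {p:.0%}: z = {z_for_power(p):.3f}")
```

Note that 90% two-sided confidence and 95% power share the same z-score because both correspond to the 95th percentile of the standard normal distribution.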

Example sample size scenarios

To make sample size planning concrete, consider several realistic examples. These estimates assume a 95% confidence level, 80% power, and a balanced 50/50 allocation. Values are approximate and are meant to illustrate the scale of traffic required.

Baseline conversion rate | Relative uplift target | Expected variant rate | Approximate visitors per variant
3.0% | 10% | 3.3% | About 53,000
5.0% | 10% | 5.5% | About 31,000
10.0% | 15% | 11.5% | About 6,700
20.0% | 10% | 22.0% | About 6,500

The pattern is worth noting. For the same relative uplift, lower baseline rates need substantially larger samples because the absolute lift is smaller. A test moving from 3.0% to 3.3% may be strategically valuable, but it is statistically demanding. Conversely, larger absolute differences are easier to detect. This is why high-traffic websites can experiment on micro-optimizations while lower-traffic businesses often need to focus on larger design or offer changes.
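The scenarios in the table can be recomputed with a short script. This sketch assumes the pooled two-proportion normal approximation with a two-sided test at 95% confidence and 80% power. Calculators that use a different variant of the formula, or a continuity correction, will give somewhat different figures. The function and variable names are illustrative.

```python
import math
from statistics import NormalDist

Z_ALPHA = NormalDist().inv_cdf(0.975)   # 95% confidence, two-sided
Z_BETA = NormalDist().inv_cdf(0.80)     # 80% power

def visitors_per_variant(p1, p2):
    """Pooled normal approximation for two proportions, equal allocation."""
    p_bar = (p1 + p2) / 2
    root = (Z_ALPHA * math.sqrt(2 * p_bar * (1 - p_bar))
            + Z_BETA * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil((root / (p2 - p1)) ** 2)

# (baseline rate, relative uplift target) pairs from the table
scenarios = [(0.03, 0.10), (0.05, 0.10), (0.10, 0.15), (0.20, 0.10)]

for baseline, uplift in scenarios:
    variant_rate = baseline * (1 + uplift)
    n = visitors_per_variant(baseline, variant_rate)
    print(f"{baseline:.1%} -> {variant_rate:.1%}: about {n:,} per variant")
```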

How confidence level and power affect the result

Confidence and power are often misunderstood. The confidence level sets the significance threshold: at 95% confidence, a test with no real effect still has a 5% chance of being flagged as significant. Power is the probability of detecting a real effect at least as large as the MDE; at 80% power, a true effect of that size is still missed 20% of the time. Raising either one increases sample size. For many business teams, 95% confidence and 80% power offer a practical balance. However, if you are testing pricing, financial compliance messaging, or a major product change, you may want stricter settings.

  • Higher confidence means fewer false wins, but more required traffic.
  • Higher power means a better chance to catch a true lift, but also more traffic.
  • Smaller MDE dramatically increases the sample size, often more than teams expect.
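As a rough guide to how much extra traffic stricter settings cost, the sketch below uses the simpler unpooled form of the approximation, in which sample size is proportional to the squared sum of the two z-scores. That proportionality is an assumption of this simplified form, and the function name is illustrative.

```python
from statistics import NormalDist

def traffic_multiplier(confidence, power, ref_confidence=0.95, ref_power=0.80):
    """Sample size relative to the reference settings, holding rates and MDE fixed."""
    def z_sum_squared(conf, pw):
        z_alpha = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # two-sided threshold
        z_beta = NormalDist().inv_cdf(pw)
        return (z_alpha + z_beta) ** 2
    return z_sum_squared(confidence, power) / z_sum_squared(ref_confidence, ref_power)

for conf, pw in [(0.90, 0.80), (0.95, 0.80), (0.95, 0.90), (0.99, 0.80), (0.99, 0.90)]:
    print(f"confidence {conf:.0%}, power {pw:.0%}: "
          f"{traffic_multiplier(conf, pw):.2f}x the traffic of 95% / 80%")
```

Under the same simplification, sample size also scales with the inverse square of the absolute difference, so halving the MDE costs roughly four times the traffic. That is usually a bigger jump than any of the confidence or power changes above.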

Why a 50/50 split is usually best

In fixed-horizon A/B testing, a balanced split is typically the most efficient. If one variant gets far less traffic than the other, the underrepresented group becomes the bottleneck. For example, a 70/30 split may be useful for risk control when testing a sensitive experience, but it usually increases the total sample requirement compared with a 50/50 split. Unless there is a clear business reason to skew traffic, equal allocation is generally the best default.
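The cost of a skewed split can be estimated with a small sketch. It assumes the unpooled normal approximation, where each group's variance contribution is divided by its share of traffic. The function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist

def total_sample_size(p1, p2, control_share=0.5, confidence=0.95, power=0.80):
    """Approximate total visitors (both groups) for a given traffic split."""
    variant_share = 1 - control_share
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Variance of the difference, scaled by total visitors, for this split.
    variance_term = p1 * (1 - p1) / control_share + p2 * (1 - p2) / variant_share
    return math.ceil((z_alpha + z_beta) ** 2 * variance_term / (p2 - p1) ** 2)

even = total_sample_size(0.10, 0.115, control_share=0.5)
skewed = total_sample_size(0.10, 0.115, control_share=0.7)
print(f"50/50 split: {even:,} total visitors")
print(f"70/30 split: {skewed:,} total visitors ({skewed / even:.2f}x)")
```

For the 10% to 11.5% example, the 70/30 split needs roughly a fifth more total traffic than the balanced split under these assumptions.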

Common mistakes that ruin sample size planning

Even with a good calculator, teams can still make bad decisions if they misuse the assumptions. Here are common pitfalls to avoid:

  1. Using stale baseline data: A baseline from six months ago may no longer reflect current traffic quality, seasonality, device mix, or campaign sources.
  2. Choosing an unrealistic MDE: If you assume a huge uplift just to reduce sample size, your test may end up underpowered for the real effect.
  3. Stopping early: Repeatedly peeking at results before the planned sample is reached inflates false-positive risk under classic fixed-sample designs.
  4. Ignoring sample ratio mismatch: If your intended split was 50/50 but the observed assignment counts differ by more than chance would explain, that usually signals a technical problem in randomization or tracking that can bias results.
  5. Testing during abnormal periods: Promotions, outages, holidays, or tracking changes can invalidate baseline assumptions.
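Of the pitfalls above, sample ratio mismatch is the easiest to check mechanically. The sketch below runs a chi-square goodness-of-fit test on the observed assignment counts using only the standard library. The function name is illustrative, and the threshold for treating a low p-value as a mismatch is a team decision rather than something this page specifies.

```python
import math

def srm_p_value(control_visitors, variant_visitors, expected_control_share=0.5):
    """P-value for the observed split under the intended allocation."""
    total = control_visitors + variant_visitors
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi_square = ((control_visitors - expected_control) ** 2 / expected_control
                  + (variant_visitors - expected_variant) ** 2 / expected_variant)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi_square / 2))

print(f"5,000 vs 5,050: p = {srm_p_value(5000, 5050):.3f}")   # plausible by chance
print(f"5,000 vs 5,300: p = {srm_p_value(5000, 5300):.4f}")   # worth investigating
```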

How to choose a realistic minimum detectable effect

The MDE should not be an arbitrary guess. A good MDE is tied to business value. Ask what size improvement would materially affect revenue, lead volume, retention, or user success. For example, if a 3% relative lift in signup rate would generate enough incremental value to justify implementation, then use that as your planning target. If a 3% lift would not matter operationally, testing for it may be unnecessary.

A practical way to set MDE is to combine business impact with historical experiment performance. If most UX changes produce lifts in the 5% to 12% range, planning every test around a 1% relative gain may be unrealistic. On the other hand, mature funnels often produce small gains, so your organization may need either more traffic or more selective prioritization.
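It can also help to run the calculation in reverse and ask what relative lift your available traffic can realistically detect. The sketch below does this by bisection over the pooled normal approximation. It assumes required sample size falls steadily as the lift grows, caps the search at a 100% relative lift, and uses hypothetical function names.

```python
import math
from statistics import NormalDist

def required_per_variant(p1, p2, confidence=0.95, power=0.80):
    """Pooled normal approximation, two-sided test, 50/50 split."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    root = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
            + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return (root / (p2 - p1)) ** 2

def smallest_detectable_relative_mde(baseline, visitors_per_variant,
                                     confidence=0.95, power=0.80):
    """Smallest relative lift detectable with the given traffic, or None."""
    low = 1e-4
    high = min(1.0, 0.999 * (1 - baseline) / baseline)   # keep variant rate below 100%
    if required_per_variant(baseline, baseline * (1 + high),
                            confidence, power) > visitors_per_variant:
        return None   # even the largest lift considered needs more traffic
    for _ in range(60):   # bisection: shrink the bracket around the threshold
        mid = (low + high) / 2
        needed = required_per_variant(baseline, baseline * (1 + mid), confidence, power)
        if needed > visitors_per_variant:
            low = mid
        else:
            high = mid
    return high

mde = smallest_detectable_relative_mde(0.10, 6693)
print(f"Detectable relative lift with 6,693 visitors per variant: {mde:.1%}")
```

If the result is well above the lifts your past experiments have produced, that is a signal to test bolder changes or to prioritize pages with more traffic.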

Interpreting the calculator output

When you use the calculator above, the most important number is the required sample per variant. You should also pay attention to the total sample, the expected variant conversion rate, and the estimated run time based on your daily traffic. A test might be statistically possible but operationally impractical if it needs 10 weeks to complete. In those cases, the best response is usually to increase the expected effect size by testing a bigger change, increase traffic volume, or accept lower sensitivity (a larger MDE or lower power) if the business risk supports it.
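The run-time estimate is simple arithmetic, sketched below. The function name is illustrative, and rounding up to whole weeks is an added assumption (a common way to cover weekday and weekend behavior) rather than something the calculator is stated to do.

```python
import math

def estimate_test_duration(visitors_per_variant, daily_eligible_visitors, variants=2):
    """Return (days, weeks) needed to reach the planned sample in every variant."""
    total_needed = visitors_per_variant * variants
    days = math.ceil(total_needed / daily_eligible_visitors)
    weeks = math.ceil(days / 7)       # round up so the test spans full weeks
    return days, weeks

days, weeks = estimate_test_duration(6700, daily_eligible_visitors=1000)
print(f"6,700 per variant at 1,000 eligible visitors/day: {days} days (~{weeks} weeks)")

days, weeks = estimate_test_duration(31000, daily_eligible_visitors=800)
print(f"31,000 per variant at 800 eligible visitors/day: {days} days (~{weeks} weeks)")
```

The second case lands at about 12 weeks, which is the kind of result that should prompt a rethink of the MDE or the page being tested.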

How this connects to trustworthy experimentation

Reliable sample sizing is part of a broader experimentation discipline. Good tests also require stable randomization, clean event tracking, a clearly defined primary metric, guardrail metrics, and a documented stopping rule. Agencies, CRO teams, and internal product groups that consistently win with experimentation rarely succeed because they have a magic calculator. They succeed because they pair planning discipline with strong implementation discipline.

For readers who want methodologically sound references, these public resources are excellent starting points: the National Institute of Standards and Technology provides statistical engineering resources, the U.S. Census Bureau publishes survey methodology material related to sampling and statistical precision, and Penn State University STAT Online offers accessible educational coverage of hypothesis testing and sample size concepts.

A simple workflow for planning your next test

  1. Pull recent, trustworthy baseline conversion data from analytics or your experimentation platform.
  2. Define the smallest uplift that matters commercially.
  3. Select confidence and power based on decision risk.
  4. Use a 50/50 split unless there is a risk-based reason not to.
  5. Estimate run time using realistic daily eligible traffic, not total site sessions.
  6. Lock the test plan before launch so you are not changing assumptions midstream.

In summary, calculating A/B test sample size is not just a statistical exercise. It is a planning tool for better decision-making. It aligns business expectations with traffic reality, keeps teams from overreacting to random fluctuations, and improves the quality of experiment results. If you know your baseline rate, have a credible MDE, and choose sensible confidence and power settings, you can forecast the traffic you need and run more trustworthy experiments with fewer surprises.
