Bayesian A/B Test Sample Size Calculator

Estimate how many visitors you need per variant using a practical Bayesian planning model. Adjust baseline conversion rate, minimum detectable uplift, posterior confidence target, prior strength, and allocation split to plan faster, more reliable experiments.

  • Baseline conversion rate (%): current control conversion rate, such as 5.00 for 5%.
  • Expected uplift (%): relative lift over control. Example: 10 means B is expected to improve on A by 10%.
  • Posterior confidence target: decision threshold for declaring that B beats A.
  • Power target: higher power requires larger samples but reduces false negatives.
  • Prior effective sample size: Bayesian prior weight, centered on the baseline rate.
  • Allocation split: percentage of traffic allocated to variant B. Control receives the rest.

How a Bayesian A/B Test Sample Size Calculator Works

A Bayesian A/B test sample size calculator helps experimenters estimate how much traffic they need before a decision becomes trustworthy. Instead of asking only whether a result is statistically significant under a null hypothesis, Bayesian planning focuses on the probability that one variation is better than another after observing the data. In practical product, ecommerce, and SaaS experimentation, this framing is often easier for decision makers to understand because it aligns directly with business questions such as, “What is the probability that the new page is better?”

The calculator above uses a practical approximation for binary conversion experiments. You enter a baseline conversion rate, your minimum expected uplift, a posterior confidence threshold, a power target, a prior effective sample size, and an allocation split. The tool then estimates the number of observations needed in each variant and visualizes how posterior confidence changes as total sample size grows. This is especially useful when teams need a planning number before launching a test, even though exact Bayesian stopping behavior can vary depending on the prior, the loss function, and the final decision rule.

Key idea: Sample size planning in Bayesian testing is still a design problem. You are choosing how much evidence you need before taking action. The higher your required posterior confidence and the smaller your expected uplift, the more traffic you will need.

Why Bayesian Sample Size Planning Matters

Without a planning framework, teams often make one of two expensive mistakes. First, they stop too early and mistake random noise for a meaningful lift. Second, they run tests too long, slowing product learning and tying up valuable traffic. A good Bayesian A/B test sample size calculator creates a disciplined middle path. It clarifies what lift is worth detecting, how certain you want to be, and how much prior information you are willing to bring into the experiment.

Bayesian methods are attractive because they make it possible to update beliefs as data arrives. If your organization has historical evidence that similar variants usually move conversion by only a small amount, that knowledge can be encoded into the prior. A stronger prior can reduce the amount of new data required, but only when that prior is reasonable. If the prior is too strong and poorly calibrated, it can distort decision making. That is why many organizations start with weakly informative priors unless they have a mature experimentation program and stable baseline behavior.

Inputs used in the calculator

  • Baseline conversion rate: the expected control rate before the test begins.
  • Expected uplift: the relative lift you want to detect, such as 10% over baseline.
  • Posterior confidence target: the probability threshold needed to call a winner, such as 95%.
  • Power target: the probability of detecting the uplift when the uplift truly exists.
  • Prior effective sample size: how much historical information you want to include.
  • Allocation split: the proportion of traffic directed to the treatment variant.

Understanding the Core Math

For a conversion test, each user either converts or does not convert. That means a beta-binomial model is a natural Bayesian choice. The prior belief about the conversion rate can be written as a Beta distribution, and the observed conversions update that prior into a posterior Beta distribution. Once you have posteriors for both A and B, you can estimate the probability that B is greater than A. In many production calculators, this probability is computed by simulation or by numerical approximation.
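The Beta update and the simulated win probability can be sketched in a few lines. This is a minimal illustration, not the calculator's own code: the function names, the flat Beta(1, 1) prior, and the observed counts are all hypothetical.

```python
import random

def posterior_params(prior_alpha, prior_beta, conversions, visitors):
    """Conjugate update: Beta prior plus binomial data gives a Beta posterior."""
    return prior_alpha + conversions, prior_beta + (visitors - conversions)

def prob_b_beats_a(post_a, post_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) from two Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Draw one plausible conversion rate per arm and compare them.
        if rng.betavariate(*post_b) > rng.betavariate(*post_a):
            wins += 1
    return wins / draws

# Hypothetical observed data; Beta(1, 1) is a flat prior (an assumption).
post_a = posterior_params(1, 1, conversions=500, visitors=10_000)
post_b = posterior_params(1, 1, conversions=570, visitors=10_000)
print("posterior A:", post_a)
print("posterior B:", post_b)
print("P(B > A) ~", round(prob_b_beats_a(post_a, post_b), 3))
```

More draws reduce simulation noise at the cost of run time; a fixed seed keeps the estimate reproducible.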

This page uses a planning shortcut that combines a standard two-proportion sample size approximation with a Bayesian prior adjustment, then estimates posterior win probability using a normal approximation to the updated Beta posteriors. That makes the calculator fast, intuitive, and practical for on-page use. It is a reasonable planning estimate, though not a replacement for a full Bayesian sequential design in a regulated or high-risk environment.
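One way to picture the normal approximation step is shown below. The page does not publish its exact formula, so treat this as an assumed sketch: each Beta posterior is replaced by a normal distribution with the same mean and variance, and the win probability is read from the distribution of the difference. The function names and the example posteriors are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def beta_mean_var(alpha, beta):
    """Mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1))
    return mean, var

def win_probability_normal(post_a, post_b):
    """Approximate P(rate_B > rate_A), treating each Beta posterior as normal."""
    mean_a, var_a = beta_mean_var(*post_a)
    mean_b, var_b = beta_mean_var(*post_b)
    # The difference of two independent normals is normal with summed variances.
    z = (mean_b - mean_a) / sqrt(var_a + var_b)
    return NormalDist().cdf(z)

# Hypothetical posteriors: 500/10,000 vs 570/10,000 conversions on a flat prior.
print(round(win_probability_normal((501, 9501), (571, 9431)), 3))
```

The approximation is usually close when each arm has many conversions and many non-conversions; with very small counts, simulation from the Beta posteriors is the safer choice.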

What increases required sample size?

  1. Lower baseline conversion rates, because the same relative uplift corresponds to a smaller absolute difference.
  2. Smaller uplifts, because they are harder to distinguish from normal random variation.
  3. Higher posterior confidence thresholds, such as 99% instead of 95%.
  4. Higher power targets, such as 90% or 95% instead of 80%.
  5. Uneven traffic allocation, because balanced tests are usually more efficient.
  6. Weaker priors or no prior information, because the test must learn more from scratch.
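The first five drivers in the list can be checked with a standard two-proportion approximation. The sketch below is an assumption about the general shape of such a formula, not the calculator's implementation: it uses a one-sided threshold, ignores the prior entirely (so driver 6 is not shown), and the function and parameter names are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def total_sample_size(baseline, rel_uplift, confidence=0.95, power=0.80, share_b=0.5):
    """Total visitors across both arms (normal approximation, one-sided threshold)."""
    p_a = baseline
    p_b = baseline * (1 + rel_uplift)
    z = NormalDist().inv_cdf(confidence) + NormalDist().inv_cdf(power)
    # Each arm's variance is divided by its share of total traffic.
    variance_term = p_a * (1 - p_a) / (1 - share_b) + p_b * (1 - p_b) / share_b
    return ceil(z ** 2 * variance_term / (p_b - p_a) ** 2)

base = total_sample_size(0.05, 0.10)
print("reference (5%, +10%)", base)
print("lower baseline (2%) ", total_sample_size(0.02, 0.10))
print("smaller uplift (+5%)", total_sample_size(0.05, 0.05))
print("99% threshold       ", total_sample_size(0.05, 0.10, confidence=0.99))
print("90% power           ", total_sample_size(0.05, 0.10, power=0.90))
print("20% of traffic to B ", total_sample_size(0.05, 0.10, share_b=0.20))
```

Every variation should print a larger total than the reference line, which matches the direction of each item in the list.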

Bayesian vs Frequentist Sample Size Thinking

Frequentist sample size formulas are based on Type I error, power, and a fixed analysis plan. Bayesian sample size planning focuses on posterior evidence and decision thresholds. In practice, many experimentation teams still use a frequentist-style planning estimate as the initial traffic budget because it is transparent and familiar, then monitor Bayesian posterior metrics during the test. This hybrid workflow is common because it blends operational simplicity with decision-friendly interpretation.

| Decision standard | Common threshold | Equivalent one-sided z value | Typical use case |
|---|---|---|---|
| Posterior confidence | 90% | 1.282 | Early exploration, lower-risk changes |
| Posterior confidence | 95% | 1.645 | Standard product experiments |
| Posterior confidence | 97.5% | 1.960 | Stronger evidence requirement |
| Posterior confidence | 99% | 2.326 | High-cost or high-visibility decisions |

The values above are the one-sided standard normal critical values commonly used in sample size planning. They provide intuition for why raising the threshold from 95% to 99% can materially increase sample size. The confidence target changes the amount of statistical separation needed between the variants before the posterior win probability crosses your decision rule.
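The z values in the table can be reproduced with Python's standard library. The short sketch below also estimates how much the required sample grows when the threshold moves from 95% to 99%, assuming 80% power and a formula in which sample size scales with the square of the summed z values; the helper name is hypothetical.

```python
from statistics import NormalDist

def one_sided_z(threshold):
    """Standard normal quantile for a one-sided probability threshold."""
    return NormalDist().inv_cdf(threshold)

for threshold in (0.90, 0.95, 0.975, 0.99):
    print(f"{threshold:.1%}  z = {one_sided_z(threshold):.3f}")

# Relative sample size at a 99% threshold versus 95%, holding power at 80%.
z_power = one_sided_z(0.80)
ratio = ((one_sided_z(0.99) + z_power) / (one_sided_z(0.95) + z_power)) ** 2
print(f"99% vs 95% sample ratio: {ratio:.2f}")
```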

Worked Example

Suppose your current signup page converts at 5%, and your design team believes a new page can improve conversion by 10% relative. That means your treatment conversion expectation is 5.5%, an absolute lift of 0.5 percentage points. A half-point increase may sound small, but on large traffic volumes it can be commercially meaningful. The challenge is that small absolute lifts require substantial data to estimate reliably.

If you choose a 95% posterior confidence threshold, 80% power, and a modest prior effective sample size of 200, the calculator will usually produce a recommendation in the tens of thousands of observations per group. If you reduce the uplift target to 5% relative instead of 10%, the required traffic roughly quadruples, because required sample size scales with the inverse square of the absolute lift. This is one of the most important truths in experimentation: detecting small gains is expensive, and many tests fail not because the idea is bad but because the experiment was underpowered.
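The arithmetic behind this example can be approximated with an equal-allocation two-proportion formula. This is a sketch under stated assumptions: a one-sided 95% threshold, 80% power, and no prior adjustment, so the calculator's own output will differ somewhat. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def per_group_n(baseline, rel_uplift, confidence=0.95, power=0.80):
    """Visitors per variant for a balanced test (normal approximation, no prior)."""
    p_a = baseline
    p_b = baseline * (1 + rel_uplift)
    z = NormalDist().inv_cdf(confidence) + NormalDist().inv_cdf(power)
    return ceil(z ** 2 * (p_a * (1 - p_a) + p_b * (1 - p_b)) / (p_b - p_a) ** 2)

n_10 = per_group_n(0.05, 0.10)  # 5.0% -> 5.5%, absolute lift 0.50 points
n_05 = per_group_n(0.05, 0.05)  # 5.0% -> 5.25%, absolute lift 0.25 points
print("per group at +10% relative:", n_10)
print("per group at  +5% relative:", n_05)
print("ratio:", round(n_05 / n_10, 2))
```

Halving the absolute lift divides the squared difference in the denominator by four, which is why the smaller uplift needs far more than twice the traffic.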

| Baseline rate | Relative uplift | Absolute lift | Approximate per-group sample need | Practical takeaway |
|---|---|---|---|---|
| 2.0% | 10% | 0.20 percentage points | About 76,000+ | Very low baselines need large traffic |
| 5.0% | 10% | 0.50 percentage points | About 31,000+ | Common ecommerce planning range |
| 10.0% | 10% | 1.00 percentage point | About 14,000+ | Higher baselines reduce required sample |
| 20.0% | 10% | 2.00 percentage points | About 6,400+ | Larger absolute gaps are easier to detect |

These example magnitudes align with standard two-proportion planning assumptions under a two-sided 95% confidence level and 80% power. They are included to provide realistic intuition, not to replace the exact assumptions you enter into the calculator. Small changes in power, allocation, and prior strength can move these values meaningfully.

How Priors Change Planning

A Bayesian prior acts like historical information brought into the test before new observations arrive. In a conversion experiment, a prior centered on the baseline rate says, “Before we collect fresh data, our best guess is that performance is near the historical average.” The effective sample size determines how strong that belief is. A prior strength of 0 means the model learns entirely from the current experiment. A prior strength of 200 behaves like carrying in roughly 200 prior observations' worth of information.
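A common way to encode this is a Beta prior whose parameters sum to the effective sample size. The sketch below assumes that parameterization (the page does not state its own), and the 7% arm is a hypothetical example chosen to show how the prior's pull fades as data accumulates.

```python
def beta_prior(baseline, effective_sample_size):
    """Beta prior centered on the baseline; alpha + beta equals the prior strength."""
    alpha = baseline * effective_sample_size
    beta = (1 - baseline) * effective_sample_size
    return alpha, beta

def posterior_mean(prior, conversions, visitors):
    """Posterior mean conversion rate after combining the prior with new data."""
    alpha, beta = prior
    return (alpha + conversions) / (alpha + beta + visitors)

prior = beta_prior(0.05, 200)  # roughly Beta(10, 190)
for visitors in (100, 1_000, 10_000):
    conversions = round(0.07 * visitors)  # hypothetical arm converting at 7%
    print(visitors, round(posterior_mean(prior, conversions, visitors), 4))
```

With only 100 visitors the estimate stays close to the 5% prior; by 10,000 visitors it sits near the observed 7%. A strength of exactly 0 gives an improper Beta(0, 0) prior, so the posterior mean is undefined until some data arrives.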

When your prior is well calibrated, it can make decisions more stable, especially early in the test. It may also slightly reduce the new sample size needed to reach the same posterior threshold. However, a prior should not be used as a shortcut to force certainty. If seasonality, audience mix, acquisition channels, or page intent changed recently, an aggressive prior can be misleading. Good experimentation programs review prior calibration regularly and test whether historical assumptions still hold in the current context.

Best practices for prior selection

  • Use weak or modest priors unless you have strong historical stability.
  • Center the prior on recent, relevant baseline performance.
  • Reduce prior strength when audience composition changes.
  • Document the prior so teams understand how it affects decisions.
  • Validate prior assumptions against real experiment outcomes over time.

Interpreting the Chart

The chart shows how the posterior probability that B beats A evolves as sample size increases, assuming the expected conversion rates you entered. This is not a report of actual live performance. It is a planning curve based on your assumptions. The horizontal target line marks your selected posterior confidence threshold. The point where the win-probability curve crosses that threshold is a practical indicator that your planned sample is in the right range.
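A planning curve of this kind can be sketched by assuming each arm converts exactly at its expected rate and asking what the posterior win probability would be at each total sample size. The code below is an assumed reconstruction rather than the chart's actual logic: it uses a Beta prior centered on the baseline for both arms and a normal approximation, and all names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def expected_win_probability(total_n, baseline, rel_uplift, prior_strength=200, share_b=0.5):
    """P(B > A) if both arms convert exactly at their assumed rates."""
    rates = (baseline, baseline * (1 + rel_uplift))
    visitors = (total_n * (1 - share_b), total_n * share_b)
    means, variances = [], []
    for rate, n in zip(rates, visitors):
        # Prior pseudo-counts plus the expected conversions and non-conversions.
        alpha = baseline * prior_strength + rate * n
        beta = (1 - baseline) * prior_strength + (1 - rate) * n
        total = alpha + beta
        means.append(alpha / total)
        variances.append(alpha * beta / (total ** 2 * (total + 1)))
    z = (means[1] - means[0]) / sqrt(variances[0] + variances[1])
    return NormalDist().cdf(z)

threshold = 0.95
for total_n in range(5_000, 60_001, 5_000):
    p = expected_win_probability(total_n, baseline=0.05, rel_uplift=0.10)
    marker = "  <- at or above threshold" if p >= threshold else ""
    print(f"{total_n:>6} {p:.3f}{marker}")
```

Because the curve assumes the data land exactly on the expected rates, the crossing point carries no margin for sampling noise. A sample size planned with a power target above 50% will therefore sit to the right of the crossing.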

If the line rises slowly, that usually means your expected uplift is small relative to the baseline variance. If the line rises quickly, the test is easier to resolve. Uneven traffic allocation typically makes the curve less efficient because one arm accumulates information more slowly than the other. Balanced allocation remains the default in many experimentation programs unless there is a business reason to protect control traffic or limit risk in treatment exposure.

Common Mistakes When Using a Bayesian A/B Test Sample Size Calculator

  • Using an unrealistic uplift. Teams often input the lift they want, not the lift they can plausibly achieve.
  • Ignoring absolute differences. A 10% relative lift means very different things at 2% and 20% baseline conversion.
  • Choosing too little power. Underpowered tests waste traffic and produce inconclusive results.
  • Using a strong prior without justification. Priors should reflect evidence, not optimism.
  • Stopping when charts look good. Even in Bayesian testing, stopping rules should be defined before launch.
  • Neglecting practical significance. A statistically credible lift may still be too small to matter commercially.

When to Trust the Estimate and When to Go Deeper

This calculator is ideal for quick planning, content experiments, landing page tests, signup flow optimizations, and other binary conversion scenarios where teams need a transparent estimate. It is especially useful during roadmap prioritization, when you want to compare whether a small button-color idea is worth testing against a more substantial checkout redesign.

You should move beyond a simple on-page calculator when the stakes are high. For example, regulated environments, clinical contexts, financial products, and mission-critical user journeys may require a fully specified Bayesian design with simulation-based operating characteristics. In those settings, analysts often model sequential looks, multiple metrics, heterogeneous traffic sources, and explicit loss functions rather than relying on a single formula.

Final Takeaway

A Bayesian A/B test sample size calculator is not just a math tool. It is a decision-quality tool. By forcing you to specify a baseline rate, a meaningful uplift, a confidence threshold, and a power target, it helps turn vague testing ambition into an operational plan. The strongest experimentation teams treat sample size planning as part of product strategy: they estimate expected impact, decide how much uncertainty they can tolerate, and invest traffic where learning is most valuable.

Use the calculator to estimate your needed sample, review the posterior probability curve, and pressure-test your assumptions. If the required sample is too large for your traffic, that is useful information. It may mean you need a bigger change, a longer test window, a more stable segment, or a different success metric. Better planning leads to better experiments, and better experiments lead to better business decisions.