A/B Test Power Calculator
Estimate the statistical power of your experiment before you launch or while it is running. Enter your baseline conversion rate, expected variant conversion rate, sample size per group, significance level, and test direction to see whether your A/B test is likely to detect a true lift.
How to Use an A/B Test Power Calculator Effectively
An A/B test power calculator helps you answer one of the most important questions in experimentation: if a real effect exists, how likely is your test to detect it? Power is the probability that your statistical test will correctly reject the null hypothesis when the variant truly performs differently from the control. In practical product, growth, and ecommerce work, this determines whether your experiment is informative or just expensive noise.
Many teams are comfortable talking about statistical significance, but significance alone is not enough. An underpowered test can produce a non-significant result even when a meaningful improvement is actually present. That means a weakly planned test can cause a team to discard winning ideas simply because the experiment never had a fair chance to detect the lift. This calculator addresses that problem by estimating achieved power from your assumptions and showing the sample size needed to reach a target power level.
What statistical power means in A/B testing
Power is usually written as 1 minus beta. Beta is the probability of a false negative, which happens when your test misses a real effect. If your test has 80% power, then, under the assumed effect size and significance threshold, you have an 80% chance of detecting that effect. In many business settings, 80% is considered the minimum acceptable standard, while 90% is often used for high-stakes decisions such as pricing, signup funnel redesigns, or checkout changes.
For a typical conversion experiment, the main ingredients are straightforward. You have a control conversion rate, an expected treatment conversion rate, an equal number of visitors in each group, and an alpha level such as 0.05. The calculator then estimates how separated the two conversion rates will look once random sampling error is taken into account.
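As a rough illustration of how these ingredients combine, the sketch below computes power with a normal approximation for a two-proportion z-test. The function name `ab_test_power`, the unpooled standard error, and the example rates are assumptions made for this illustration; the calculator itself may use a slightly different variant of the formula.

```python
from math import sqrt
from statistics import NormalDist

def ab_test_power(p_control, p_variant, n_per_group, alpha=0.05, two_sided=True):
    """Approximate power of a two-proportion z-test with equal group sizes.

    Hypothetical helper: uses the unpooled normal approximation. A one-sided
    test here means testing only for an improvement (variant > control).
    """
    norm = NormalDist()
    # A two-sided test splits alpha across both tails.
    z_crit = norm.inv_cdf(1 - alpha / 2) if two_sided else norm.inv_cdf(1 - alpha)
    # Standard error of the difference between the two observed rates.
    se = sqrt((p_control * (1 - p_control) + p_variant * (1 - p_variant)) / n_per_group)
    # How many standard errors separate the two assumed rates.
    z_effect = (p_variant - p_control) / se
    if two_sided:
        # Probability of landing in either rejection region.
        return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)
    return norm.cdf(z_effect - z_crit)

# Illustrative numbers only: 10% baseline, 12% variant, 5,000 visitors per group.
print(f"{ab_test_power(0.10, 0.12, 5_000):.1%}")
```

When the two rates are identical, the two-sided result collapses to alpha, which is a quick sanity check that the function behaves like a significance test.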
Why underpowered A/B tests are expensive
Underpowered tests waste more than traffic. They also consume engineering resources, analyst attention, stakeholder trust, and calendar time. Imagine that your baseline conversion rate is 5% and you expect a lift to 5.5%. That is only a 0.5 percentage point difference, which sounds small but can be highly valuable in a large funnel. However, a small absolute difference is hard to detect statistically unless you send enough users into the experiment.
Suppose you launch with only a few thousand visitors per variant. If the true lift is 10% relative, your test may fail to reach significance simply because the standard error around each observed conversion rate is too large. Without a power check, a team might conclude the idea failed. In reality, the experiment design failed.
Inputs explained
- Baseline conversion rate: the expected conversion rate of your control experience.
- Expected variant conversion rate: the conversion rate you believe the treatment can achieve.
- Visitors per variant: the number of users assigned to each arm of the experiment.
- Alpha: the significance level, meaning the false positive rate you accept when there is no true difference. Common choices are 0.10, 0.05, and 0.01.
- One-sided or two-sided: whether you are testing only for improvement or for any difference in either direction.
- Target power: the planning benchmark used to estimate the sample size you should aim for.
Interpreting the output
The calculator returns an achieved power estimate, the absolute lift, the relative lift, and a recommended sample size per group to reach your chosen target power. It also plots a power curve. That curve is one of the most useful planning tools because it shows how quickly power improves as traffic increases. In many tests the curve rises slowly at first, then more sharply, and finally flattens near 100% as additional traffic yields diminishing returns.
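To make the shape of a power curve concrete, here is a minimal sketch that evaluates power over a range of sample sizes and prints a crude text chart. The function `power_curve`, the sample-size grid, and the 5.0% to 5.5% scenario are assumptions chosen for illustration, not the calculator's actual charting logic.

```python
from math import sqrt
from statistics import NormalDist

def power_curve(p_control, p_variant, sample_sizes, alpha=0.05):
    """Two-sided power at each per-group sample size (unpooled normal approximation)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    variance_sum = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    curve = []
    for n in sample_sizes:
        z_effect = abs(p_variant - p_control) / sqrt(variance_sum / n)
        power = norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)
        curve.append((n, power))
    return curve

# Assumed scenario: 5.0% baseline, 5.5% variant, 5,000 to 80,000 visitors per group.
curve = power_curve(0.05, 0.055, range(5_000, 80_001, 5_000))
for n, power in curve:
    bar = "#" * round(power * 40)  # one '#' per 2.5 points of power
    print(f"{n:>7,}  {power:6.1%}  {bar}")
```

Reading the printed rows from top to bottom shows where extra traffic still buys meaningful power and where the curve has effectively flattened.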
If your achieved power is below 80%, you should be cautious about any non-significant outcome. It may be better to keep the test running longer, increase allocation, reduce variance in the metric, or choose a larger minimum detectable effect that better aligns with meaningful business value. If your power is above 80% and the test still fails to detect the effect, your assumed lift may simply be too optimistic.
Benchmarks and Common Planning Scenarios
Below is a planning table for equal-sized groups using a two-sided alpha of 0.05 and target power of 80%. These values are approximate but realistic for conversion experiments. They highlight why low baseline rates often require very large sample sizes to detect small changes.
| Baseline rate | Variant rate | Absolute lift | Relative lift | Approx. sample size per group |
|---|---|---|---|---|
| 5.0% | 5.5% | 0.5 percentage points | 10% | 31,224 |
| 5.0% | 6.0% | 1.0 percentage point | 20% | 8,147 |
| 10.0% | 11.0% | 1.0 percentage point | 10% | 14,730 |
| 20.0% | 22.0% | 2.0 percentage points | 10% | 6,503 |
Notice how the required sample size changes with both baseline rate and absolute effect size. Even when the relative lift is the same, the variance of the underlying Bernoulli process matters: a 10% relative lift needs about 31,000 visitors per group at a 5% baseline but only about 6,500 at a 20% baseline. This is why experiments in low-conversion environments can take much longer than stakeholders expect.
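The planning table can be approximated with the textbook sample size formula for comparing two proportions. The sketch below is an assumption about the kind of formula behind such tables, using unpooled variances; `required_n_per_group` is a hypothetical name. Its results land within a fraction of a percent of the table values, and the small gaps are what you would expect between pooled and unpooled variants of the formula.

```python
from math import ceil
from statistics import NormalDist

def required_n_per_group(p_control, p_variant, alpha=0.05, power=0.80, two_sided=True):
    """Approximate visitors needed per group for a two-proportion z-test."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2) if two_sided else norm.inv_cdf(1 - alpha)
    z_beta = norm.inv_cdf(power)
    # Sum of the Bernoulli variances of the two arms.
    variance_sum = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p_variant - p_control) ** 2
    return ceil(n)

# The four scenarios from the planning table.
scenarios = [(0.05, 0.055), (0.05, 0.06), (0.10, 0.11), (0.20, 0.22)]
for p_control, p_variant in scenarios:
    n = required_n_per_group(p_control, p_variant)
    print(f"{p_control:.1%} -> {p_variant:.1%}: about {n:,} per group")
```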
Critical values used in common experimentation settings
Analysts often ask how one-sided and two-sided tests affect planning. The table below summarizes standard z critical values used in many calculators and planning sheets.
| Alpha | Test type | Critical z value | Typical use case |
|---|---|---|---|
| 0.10 | Two-sided | 1.645 | Exploratory product research |
| 0.05 | Two-sided | 1.960 | Standard business experimentation |
| 0.01 | Two-sided | 2.576 | High-risk decisions or many repeated tests |
| 0.05 | One-sided | 1.645 | Pre-registered directional hypotheses |
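These critical values are quantiles of the standard normal distribution, so they can be reproduced with Python's standard library. The helper name `critical_z` is an assumption for this sketch.

```python
from statistics import NormalDist

def critical_z(alpha, two_sided=True):
    """Standard normal critical value for a given significance level."""
    # A two-sided test puts alpha / 2 in each tail; a one-sided test puts all of alpha in one.
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail)

for alpha, two_sided in [(0.10, True), (0.05, True), (0.01, True), (0.05, False)]:
    label = "two-sided" if two_sided else "one-sided"
    print(f"alpha={alpha:.2f} {label}: z = {critical_z(alpha, two_sided):.3f}")
```

The one-sided value at alpha 0.05 matches the two-sided value at alpha 0.10 because both leave 5% in the upper tail.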
Best Practices for Running Well-Powered A/B Tests
- Estimate a realistic minimum detectable effect. Avoid planning around a lift that is larger than historical test outcomes suggest. Overstating the expected effect can make your test appear better powered than it really is.
- Choose one primary metric. Power calculations are most meaningful when tied to a clearly defined primary outcome such as purchase conversion, signup completion, or click-through rate.
- Do not peek and stop early without a proper sequential method. Repeated looks inflate false positive risk and distort the simple fixed-horizon power logic.
- Use equal allocation when practical. A 50-50 split usually maximizes power for a given total sample size.
- Account for business cycles. Even if you hit the numeric sample size quickly, running through at least one full weekly cycle often helps avoid weekday or weekend bias.
- Plan for data quality loss. Bot filtering, identity stitching issues, and analytics lag can all reduce the usable sample.
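Two of the practices above lend themselves to a quick calculation: the effect of the traffic split on power, and padding the sample for expected data loss. The sketch below is illustrative only; `power_with_split`, `inflate_for_data_loss`, the 60,000-user total, and the 5% loss rate are all assumptions, and the split calculation goes beyond the equal-allocation case the calculator covers.

```python
from math import ceil, sqrt
from statistics import NormalDist

def power_with_split(p_control, p_variant, total_n, variant_share, alpha=0.05):
    """Two-sided power when `variant_share` of `total_n` users see the variant."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    n_variant = total_n * variant_share
    n_control = total_n - n_variant
    # Each arm contributes its own variance, scaled by its own sample size.
    se = sqrt(p_control * (1 - p_control) / n_control
              + p_variant * (1 - p_variant) / n_variant)
    z_effect = abs(p_variant - p_control) / se
    return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)

def inflate_for_data_loss(n_required, expected_loss_rate):
    """Users to assign per group so that about n_required remain after filtering."""
    return ceil(n_required / (1 - expected_loss_rate))

# Same total traffic, different splits (assumed 5.0% -> 5.5% scenario).
for share in (0.5, 0.3, 0.1):
    power = power_with_split(0.05, 0.055, 60_000, share)
    print(f"{share:.0%} of traffic to variant: power = {power:.1%}")

# Padding a 31,224-per-group target for an assumed 5% loss of usable data.
print(inflate_for_data_loss(31_224, 0.05))
```

Because the two arms have slightly different variances, the exact optimum is not precisely 50-50, which is why the guidance says "usually".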
One-sided versus two-sided tests
One-sided tests can be more powerful because the rejection region is concentrated in one direction. However, they are only appropriate when a negative effect would lead to the same decision as no effect at all (you would not ship either way) and when the directional claim is made before data collection. In product experimentation, many teams default to two-sided tests because product changes can help or hurt behavior in unexpected ways. A lighter page can increase speed but reduce trust. A more visible button can raise clicks but hurt qualified conversions later in the funnel.
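One way to see the power advantage of a one-sided test is to compare the traffic each design needs for the same target. The sketch below assumes the unpooled textbook formula and a 5.0% to 5.5% scenario at alpha 0.05 and 80% power; `required_n` is a hypothetical helper, not the calculator's code.

```python
from math import ceil
from statistics import NormalDist

def required_n(p_control, p_variant, alpha=0.05, power=0.80, two_sided=True):
    """Approximate per-group sample size for a two-proportion z-test."""
    norm = NormalDist()
    # The only difference between the designs is how alpha is placed in the tails.
    z_alpha = norm.inv_cdf(1 - alpha / 2) if two_sided else norm.inv_cdf(1 - alpha)
    z_beta = norm.inv_cdf(power)
    variance_sum = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p_variant - p_control) ** 2)

two_sided_n = required_n(0.05, 0.055, two_sided=True)
one_sided_n = required_n(0.05, 0.055, two_sided=False)
print(f"two-sided: {two_sided_n:,} per group")
print(f"one-sided: {one_sided_n:,} per group")
print(f"one-sided needs {1 - one_sided_n / two_sided_n:.0%} less traffic")
```

The saving is real but modest, which is part of why many teams prefer to keep the protection of a two-sided test.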
Relationship between significance and power
Teams sometimes believe lowering alpha from 0.05 to 0.01 always improves statistical rigor with no downside. In reality, a stricter alpha raises the evidence threshold, so the same sample size produces lower power. That tradeoff may be worthwhile for high-risk experiments, but it should be explicit. If you want both low false positives and high power, you usually need more traffic.
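The tradeoff can be made visible by holding the sample size fixed and varying alpha. The following sketch assumes a two-sided test, a 5.0% to 5.5% scenario, and 30,000 visitors per group; `power_at_alpha` is a hypothetical name, and the figures depend on the normal approximation used.

```python
from math import sqrt
from statistics import NormalDist

def power_at_alpha(p_control, p_variant, n_per_group, alpha):
    """Two-sided power of a two-proportion z-test at a given alpha."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    se = sqrt((p_control * (1 - p_control) + p_variant * (1 - p_variant)) / n_per_group)
    z_effect = abs(p_variant - p_control) / se
    return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)

# Same traffic, three evidence thresholds.
results = {alpha: power_at_alpha(0.05, 0.055, 30_000, alpha) for alpha in (0.10, 0.05, 0.01)}
for alpha, power in results.items():
    print(f"alpha = {alpha:.2f}: power = {power:.1%}")
```

Each step toward a stricter alpha gives up power at the same sample size, so the stricter threshold has to be paid for with more traffic.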
Worked Example
Assume your current signup page converts at 5.0% and your redesigned page is expected to convert at 5.5%. You select a two-sided alpha of 0.05 and plan for 30,000 visitors per variant. Under the normal approximation, achieved power comes out at roughly 78%, just under the conventional 80% benchmark: the assumed effect is small, but the sample size is substantial. If you reduce traffic to 10,000 visitors per variant, power drops to roughly 35%. In that case a non-significant result would tell you very little.
Now suppose the same test is expected to lift conversion from 5.0% to 6.0% instead. The required sample size falls to roughly a quarter (about 8,100 instead of about 31,200 per group in the planning table), because doubling the absolute difference cuts the required sample by about a factor of four. This is why understanding business-relevant effect sizes matters so much. A change can be statistically detectable yet commercially unimportant, or commercially important yet statistically difficult to detect in your traffic environment.
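The arithmetic behind this worked example can be retraced step by step. The sketch below assumes the unpooled normal approximation, so its figures may differ slightly from what the calculator reports; all variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()
p_control, p_variant, alpha = 0.050, 0.055, 0.05

# Step 1: two-sided critical value for alpha = 0.05.
z_crit = norm.inv_cdf(1 - alpha / 2)

# Step 2: combined Bernoulli variance of the two arms.
variance_sum = p_control * (1 - p_control) + p_variant * (1 - p_variant)

powers = {}
for n in (30_000, 10_000):
    # Step 3: standard error of the observed difference at this sample size.
    se = sqrt(variance_sum / n)
    # Step 4: assumed lift expressed in standard errors.
    z_effect = (p_variant - p_control) / se
    # Step 5: probability that the test statistic clears the critical value.
    powers[n] = norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)
    print(f"n={n:>6,}  se={se:.5f}  z_effect={z_effect:.2f}  power={powers[n]:.1%}")
```

Cutting the sample to a third shrinks the lift from about 2.7 standard errors to about 1.6, which is what pulls power from near 80% down to the mid-30s.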
When this calculator is most useful
- Before launching an experiment to estimate whether the test is feasible.
- During roadmap planning to compare traffic demands across ideas.
- While an experiment is running to understand whether more time is needed.
- When explaining to stakeholders why small lifts often require very large samples.
- When deciding whether to combine low-traffic segments or simplify a test design.
Limitations to keep in mind
This calculator uses a standard normal approximation for a two-sample test of proportions with equal group sizes. That is a strong and practical default for many digital experiments, but real-world experimentation can be more complex. Unequal allocation, clustered users, repeated measures, CUPED adjustments, multiple comparisons, sequential testing frameworks, and metric non-independence all affect true operating characteristics. Use this tool as a robust planning aid, not a substitute for a full experimental design review when stakes are high.
If you want deeper statistical references, these authoritative resources are worth reading: the NIST/SEMATECH e-Handbook of Statistical Methods, Penn State’s Department of Statistics materials, and Stanford’s statistics course resources. They provide rigorous background on hypothesis testing, sampling variability, and power analysis concepts that underpin sound A/B testing.
Final takeaway
A/B testing is not just about finding winners. It is about making decisions with calibrated uncertainty. A power calculator helps you decide whether your test is capable of revealing the truth you care about. If your expected effect is small, your sample must be large. If your alpha is strict, your traffic requirement increases. If your test is underpowered, a null result is often inconclusive rather than informative. The smartest experimentation teams treat power analysis as a standard planning step, not an afterthought.