A/B Test Duration Calculator
Estimate how long your experiment should run based on baseline conversion rate, minimum detectable effect, confidence level, statistical power, and available traffic. This calculator uses a two-sample proportion test approximation to estimate the required sample size per variant and the likely duration of an A/B test.
Experiment Inputs
Enter realistic assumptions before launch so your test reaches meaningful statistical sensitivity.
Sample Size and Runtime Projection
The chart compares the required visitors per variant, total required visitors, and projected days to completion using your assumptions.
How to use an A/B test duration calculator effectively
An A/B test duration calculator helps marketers, product managers, UX researchers, and growth teams answer a deceptively simple question: how long should a test run before you trust the result? In practice, this question sits at the center of experimentation quality. End a test too early and you risk acting on noise. Run it too long and you waste time, delay product decisions, and expose users to suboptimal experiences longer than necessary. A reliable duration estimate gives teams a planning baseline before a single visitor enters the experiment.
The idea behind this calculator is straightforward. Your expected baseline conversion rate, the smallest lift worth detecting, your desired confidence level, and your target power determine how many observations each variant needs. Once you know the necessary sample size, duration becomes a traffic problem. If your site sends enough qualified visitors into the test every day, you may finish in a week. If traffic is modest or your effect size is tiny, the same test might need several weeks or even months.
Most teams get duration wrong because they focus only on traffic volume and ignore the statistical sensitivity of the experiment. A page with 100,000 monthly users might still be difficult to test if the conversion event is rare, like a trial activation or completed checkout. Similarly, teams often choose unrealistic minimum detectable effects, hoping for a dramatic gain, when in reality many UX changes move results by only a few percent in relative terms. That difference matters because smaller effects require dramatically larger samples.
What the calculator is measuring
This calculator estimates the required sample size for a two-variant test comparing conversion rates. It assumes an equal split between control and variant, then uses a normal approximation for a two-sample proportion test. The most important inputs are:
- Baseline conversion rate: the current expected conversion probability of the control experience.
- Minimum detectable effect: the smallest relative uplift worth identifying, such as 10% over baseline.
- Confidence level: how strict you want to be in avoiding false positives.
- Statistical power: how likely the test is to detect a true effect if that effect really exists.
- Daily eligible traffic: the number of visitors who actually qualify for the experiment each day.
- Traffic allocation: the portion of overall eligible traffic assigned to the test.
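For reference, one common form of the normal approximation for a two-sample proportion test is sketched below. It is an assumption about the calculation, not a statement of this calculator's exact formula (some tools use a pooled-variance variant that gives slightly different numbers). The symbols are introduced here for illustration: $p_1$ is the baseline rate, $p_2$ is the rate implied by the relative MDE, $\alpha$ is one minus the confidence level, $1-\beta$ is the power, and $n$ is the number of visitors per variant.

$$
p_2 = p_1\,(1 + \mathrm{MDE}), \qquad
n \approx \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \,\bigl[\,p_1(1-p_1) + p_2(1-p_2)\,\bigr]}{(p_2 - p_1)^2}
$$

With an equal split the test needs about $2n$ visitors in total, and the projected runtime in days is roughly $2n$ divided by the number of visitors entering the test each day.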
If your baseline conversion rate is 5% and your minimum detectable effect is 10%, the test is designed to detect a lift from 5.0% to 5.5%. That sounds small, but in many high-volume businesses, a half-point absolute improvement can be commercially significant. The statistical challenge is that detecting small differences reliably requires many observations, especially when confidence and power are set at standard levels such as 95% confidence and 80% power.
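The 5.0% to 5.5% example can be sketched in a few lines of Python. This is a minimal illustration that assumes a two-sided test, an equal split, and the unpooled-variance form of the normal approximation; the function name `sample_size_per_variant` and its parameters are hypothetical, and the calculator itself may use a slightly different formula.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    """Approximate visitors needed per variant (two-sided test, equal split)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # rate the test is sized to detect
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)  # unpooled variance (assumption)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


n = sample_size_per_variant(baseline=0.05, relative_mde=0.10)
print(f"{n:,} visitors per variant, {2 * n:,} in total")
```

Under these assumptions the result is on the order of 31,000 visitors per variant, which is why a half-point absolute lift is expensive to confirm.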
Why duration matters more than many teams expect
Duration is not merely a calendar estimate. It affects the operational quality of the test itself. If a test runs across different days of the week, pay cycles, campaign shifts, or seasonal demand patterns, your observed conversion rate may move for reasons unrelated to the variant. This is why mature experimentation programs usually want a test to run through complete business cycles, such as at least one or two full weeks, even if raw sample targets are reached earlier.
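One way to apply the full-business-cycle rule is to round the raw estimate up to whole weeks. The sketch below assumes a seven-day cycle; the function name `planned_days` and the `min_weeks` parameter are illustrative, not part of the calculator.

```python
from math import ceil


def planned_days(required_per_variant, daily_visitors_per_variant, min_weeks=1):
    """Return (raw days to reach the sample, days rounded up to full weeks)."""
    raw_days = ceil(required_per_variant / daily_visitors_per_variant)
    full_weeks = max(min_weeks, ceil(raw_days / 7))
    return raw_days, full_weeks * 7


# 31,300 visitors per variant at 4,000 visitors per variant per day
print(planned_days(31_300, 4_000))  # (8, 14)
```

Here the sample target is reached on day 8, but the plan runs 14 days so that both variants see two complete weeks of traffic.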
Another reason duration matters is user mix. Traffic at noon on a weekday may behave differently from mobile evening traffic or weekend visitors. Longer tests give each variant a sample that covers these segments in representative proportions. Teams that stop as soon as a dashboard shows a winner are acting on repeated peeks at incomplete data, which increases the risk of false discoveries. A duration calculator creates a pre-commitment device: before launch, you define a sensible sample target and estimated runtime, then avoid reacting emotionally to early fluctuations.
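The cost of peeking can be illustrated with a small simulation. The sketch below is an idealized model, not a replay of real traffic: it assumes there is no true difference between variants, that data arrives in equal-sized batches, and that the z-statistic after each batch behaves like a standardized sum of independent normal increments. The function name `false_positive_rate` and its parameters are hypothetical.

```python
import random
from statistics import NormalDist


def false_positive_rate(looks, simulations=2000, confidence=0.95, seed=7):
    """Share of A/A tests declared 'significant' when stopping at the first
    significant look out of `looks` equally spaced interim analyses."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(simulations):
        cumulative = 0.0
        for look in range(1, looks + 1):
            cumulative += rng.gauss(0, 1)   # evidence added by one more batch
            z = cumulative / look ** 0.5    # z-statistic on all data so far
            if abs(z) > z_crit:
                hits += 1                   # stopped early on a false alarm
                break
    return hits / simulations


single_look = false_positive_rate(looks=1)
ten_looks = false_positive_rate(looks=10)
print(f"one planned analysis: {single_look:.1%}")
print(f"stop at first significant of 10 looks: {ten_looks:.1%}")
```

With a single planned analysis the false-positive rate stays near the nominal 5%, while stopping at the first significant look out of ten pushes it well above that level.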
How each input changes the estimate
- Higher baseline conversion rates usually reduce required sample size for a given relative effect compared with very low-rate events, because the same relative lift corresponds to a larger absolute difference.
- Smaller minimum detectable effects increase duration sharply. Detecting a 5% relative lift can require roughly 15 times the sample needed for a 20% relative lift, because sample size scales approximately with the inverse square of the effect.
- Higher confidence levels increase sample size. Moving from 90% to 95% or 99% means you demand stronger evidence before declaring a result significant.
- Higher power also increases sample size. This reduces the probability of missing a real improvement.
- Lower traffic allocation extends duration. If you throttle the experiment to 25% of traffic, runtime can increase roughly fourfold versus full allocation.
| Scenario | Baseline CVR | MDE | Confidence / Power | Approximate Sample per Variant | Interpretation |
|---|---|---|---|---|---|
| High intent landing page | 10.0% | 10% relative lift | 95% / 80% | About 14,700 | Moderate sample need because the event is common. |
| Typical SaaS signup flow | 5.0% | 10% relative lift | 95% / 80% | About 31,300 | Standard planning case for many product teams. |
| Checkout completion | 2.0% | 10% relative lift | 95% / 80% | About 80,800 | Low-rate events require substantially more traffic. |
| Enterprise demo request | 1.0% | 10% relative lift | 95% / 80% | About 163,000 | Very low rates are hard to test without huge traffic. |
The pattern in the table above is the key lesson for experimentation strategy. The lower the baseline conversion rate, the harder it is to detect modest improvements. Teams working on low-frequency outcomes often need to test upstream proxy metrics, improve instrumentation, or consolidate changes into larger redesigns that can plausibly generate bigger lifts. Otherwise, duration can become impractical.
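The baseline pattern can be reproduced approximately with a short sweep. The sketch assumes the unpooled normal approximation with a two-sided test and equal split; because the exact formula behind the table is not stated, the output lands close to the rounded table values without matching them exactly. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    """Approximate visitors per variant under the unpooled normal approximation."""
    p1, p2 = baseline, baseline * (1 + relative_mde)
    z_sum = (NormalDist().inv_cdf(1 - (1 - confidence) / 2)
             + NormalDist().inv_cdf(power))
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z_sum ** 2 * variance / (p2 - p1) ** 2)


# Same 10% relative MDE, progressively rarer conversion events
sizes = {b: sample_size_per_variant(b, 0.10) for b in (0.10, 0.05, 0.02, 0.01)}
for baseline, n in sizes.items():
    print(f"baseline {baseline:.1%}: about {n:,} visitors per variant")
```

Cutting the baseline from 10% to 1% raises the requirement by roughly a factor of eleven under these assumptions.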
Choosing a realistic minimum detectable effect
The minimum detectable effect, often abbreviated MDE, is one of the most misunderstood settings in testing. It is not a prediction of what you think the uplift will be. It is the smallest effect you care enough to detect reliably. If your business would only ship a change if it improves conversion by at least 8%, then an 8% MDE may be appropriate. If even a 2% lift is financially meaningful, then your test should be planned around 2%, with the understanding that the required runtime may become much longer.
Teams often choose an MDE that is too large because they want a shorter test. This can backfire. If the true uplift is 4% but you size the test to detect only lifts of 15% or more, the test is underpowered for the effect that actually matters to the business. It may come back inconclusive even though the variant is beneficial. Strong experimentation programs select an MDE based on economic significance, not wishful thinking.
| Relative MDE | Absolute Change from 5.0% Baseline | Approximate Sample per Variant at 95% / 80% | Effect on Runtime |
|---|---|---|---|
| 20% | 5.0% to 6.0% | About 8,100 | Fastest among these scenarios |
| 10% | 5.0% to 5.5% | About 31,300 | Common planning benchmark |
| 5% | 5.0% to 5.25% | About 122,000 | Much longer test required |
| 2% | 5.0% to 5.1% | About 753,000 | Often impractical without massive traffic |
This table demonstrates why tiny uplifts are so difficult to validate. Sample size grows roughly with the inverse square of the target effect, so halving the MDE about quadruples the required sample. If your organization wants to optimize high-traffic pages by fractions of a percentage point, you should expect either very long tests or a specialized sequential testing framework. For many teams, pragmatic experimentation means prioritizing bigger opportunities first.
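The scaling can be made explicit with a rough simplification. The expression below is a sketch that assumes the normal approximation for two proportions and treats the two variants' variances as equal, which is reasonable when the lift is small. The symbols are introduced here for illustration: $p$ is the baseline rate, $m$ is the relative MDE, and $z_{1-\alpha/2}$ and $z_{1-\beta}$ are the normal quantiles for the chosen confidence and power.

$$
n \;\approx\; \frac{2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2\, p\,(1-p)}{(p\,m)^2}
\;=\; \frac{2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 (1-p)}{p\,m^2}
$$

Two things follow. Because $n \propto 1/m^2$, halving the MDE multiplies the sample by about four. Because $n \propto (1-p)/p$, a lower baseline rate raises the sample needed for the same relative lift.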
Confidence level and power in practical terms
Confidence level and power are often explained in mathematical language, but their operational meaning is simple. Confidence level is about avoiding false alarms. With a 95% confidence level, you accept roughly a 5% chance of declaring a difference when the variant and control actually perform the same. Power is about sensitivity. At 80% power, if the true effect equals your chosen MDE, your test has an 80% chance of detecting it, and a larger true effect is detected more often.
Increasing confidence from 95% to 99% or power from 80% to 90% is not wrong, but it makes experiments more expensive in traffic terms. This is often justified in high-stakes contexts, such as pricing, healthcare communication, public policy messaging, or major checkout flow changes. In lower-risk website optimization, many teams use 95% confidence and 80% power as a reasonable balance between rigor and speed.
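The traffic cost of stricter settings can be estimated from the normal quantiles alone, since in the usual approximation the required sample is proportional to the squared sum of the two z-values. The sketch below assumes a two-sided test; the function name `z_multiplier` is hypothetical.

```python
from statistics import NormalDist


def z_multiplier(confidence, power):
    """Squared sum of z-values; sample size is proportional to this factor."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) ** 2


reference = z_multiplier(0.95, 0.80)
settings = [(0.90, 0.80), (0.95, 0.80), (0.95, 0.90), (0.99, 0.80), (0.99, 0.90)]
ratios = {s: z_multiplier(*s) / reference for s in settings}
for (confidence, power), ratio in ratios.items():
    print(f"{confidence:.0%} / {power:.0%}: {ratio:.2f}x the 95% / 80% sample")
```

Under this approximation, raising power from 80% to 90% costs about a third more traffic, and raising confidence from 95% to 99% costs about half again as much.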
Traffic allocation and experiment operations
Duration estimates are only as good as your traffic assumptions. If your site gets 50,000 visitors per day but only 30% of them are in the target country, logged in, on the right device, and eligible for the experiment, then the relevant daily traffic is 15,000 visitors, not 50,000. The same issue appears when teams deliberately throttle experiments to reduce risk. Allocating only 20% of users to the test may be operationally wise during a rollout, but it extends completion time roughly fivefold compared with full allocation.
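The traffic arithmetic can be sketched as follows. The example reuses the 50,000 visitors and 30% eligibility from the paragraph above together with a sample target of about 31,300 per variant; the function name `estimated_days` and its parameters are illustrative assumptions.

```python
from math import ceil


def estimated_days(required_per_variant, daily_visitors,
                   eligible_share=1.0, allocation=1.0, variants=2):
    """Days needed for all variants to reach the per-variant sample target."""
    daily_in_test = daily_visitors * eligible_share * allocation
    total_required = required_per_variant * variants
    return ceil(total_required / daily_in_test)


full = estimated_days(31_300, 50_000, eligible_share=0.30)
throttled = estimated_days(31_300, 50_000, eligible_share=0.30, allocation=0.20)
print(full, throttled)  # 5 21
```

Only 15,000 of the 50,000 daily visitors can enter the test, so full allocation needs 5 days, while a 20% allocation stretches the same test to 21 days.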
You should also think about test contamination and repeat visitors. If the same users return frequently, your experimentation platform should maintain consistent assignment. If instrumentation quality is poor or conversion tracking is delayed, the actual runtime can diverge from the planned runtime. Good duration planning therefore includes not just raw traffic, but also audience eligibility, data quality, and the expected lag between exposure and conversion.
When a duration estimate should not be trusted blindly
- Your traffic is highly seasonal or campaign-driven and the next few weeks are not representative.
- The conversion event happens with a long delay after exposure, such as subscription renewal or downstream revenue.
- You are testing more than two variants, which increases complexity and often requires correction for multiple comparisons.
- Your experiment uses non-binary outcomes, ratio metrics, or clustered users rather than simple independent conversions.
- You plan to stop the test whenever significance appears, without a pre-registered stopping rule.
In these cases, the calculator is still useful as a first-pass planning tool, but expert statistical review may be needed. A simple duration model is most appropriate for classic A/B tests with binary outcomes, stable traffic, clean randomization, and a fixed analysis plan.
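As a rough illustration of why extra variants lengthen a test, the sketch below applies a Bonferroni correction, which is one simple and conservative choice rather than the method this calculator uses. It assumes each treatment variant is compared against control with a two-sided test; the function name `bonferroni_inflation` is hypothetical.

```python
from statistics import NormalDist


def bonferroni_inflation(treatment_variants, confidence=0.95, power=0.80):
    """Factor by which per-variant sample size grows when the significance
    level is split evenly across comparisons against control."""
    alpha = 1 - confidence
    z_beta = NormalDist().inv_cdf(power)
    z_single = NormalDist().inv_cdf(1 - alpha / 2)
    z_adjusted = NormalDist().inv_cdf(1 - alpha / treatment_variants / 2)
    return ((z_adjusted + z_beta) / (z_single + z_beta)) ** 2


for k in (1, 2, 3):
    print(f"{k} treatment variant(s): {bonferroni_inflation(k):.2f}x per-variant sample")
```

The per-variant requirement rises with each added comparison, and the total traffic rises faster still because there are more variants to fill.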
Best practices for better experimentation decisions
- Estimate duration before launch and communicate the expected runtime to stakeholders.
- Use a baseline conversion rate drawn from recent, representative data rather than intuition.
- Define the minimum business-relevant effect, not the most optimistic lift.
- Avoid peeking and stopping early just because a dashboard looks promising.
- Run through full business cycles when possible, even if sample thresholds are met sooner.
- Confirm instrumentation, event definitions, and audience eligibility before the test starts.
- Document exclusions, allocation, and any ramp-up periods that affect exposure counts.
For teams that want deeper statistical references, several authoritative educational and government resources explain hypothesis testing, power, and sample size planning in more detail. Useful references include the NIST Engineering Statistics Handbook, Penn State’s STAT 500 resources, and the University of California, Berkeley’s statistics department materials. These sources are valuable when you need more depth than a planning calculator can provide.
Final takeaway
An A/B test duration calculator is best viewed as a planning tool for disciplined experimentation. It forces you to define the baseline, articulate the smallest meaningful improvement, and confront the traffic required to detect that improvement with rigor. That alone improves decision quality. The strongest teams do not use duration estimates to chase quick wins. They use them to prioritize tests, align stakeholders, avoid premature conclusions, and design experiments that can actually answer the business question at hand.
If your estimate comes back longer than expected, that is not a failure. It is useful information. It may mean you should broaden the change, choose a higher-frequency metric, increase traffic allocation, simplify the variant set, or reconsider whether the idea is worth testing at all. In experimentation, the cost of a slow but valid answer is often lower than the cost of a fast but unreliable one.