Calculate Prediction Interval in R
Use this interactive calculator to estimate a prediction interval for a new observation from a regression model, then learn how to reproduce the result in R with correct syntax, interpretation, and practical modeling guidance.
How to calculate a prediction interval in R
A prediction interval estimates the range where a future individual observation is likely to fall, given a fitted statistical model and a chosen confidence level. In R, this task is usually done with the predict() function on a fitted model such as lm(). While many users know how to produce a fitted value, confusion often begins when deciding whether to request a confidence interval or a prediction interval. The difference matters because each interval answers a different question.
If your goal is to estimate the expected average response at a given predictor value, you usually want a confidence interval. If your goal is to estimate where an actual single future case may land, you want a prediction interval. Since individual outcomes vary around the regression line, a prediction interval must account for both uncertainty in the estimated mean and the random scatter of future observations. That extra uncertainty makes prediction intervals noticeably wider.
The core formula behind a prediction interval
For a linear regression model, the prediction interval for a new observation at predictor value x0 is commonly expressed as:
y-hat ± t* × s × sqrt(1 + h)
- y-hat is the predicted value from the model
- t* is the critical value from the t distribution for the selected confidence level and residual degrees of freedom
- s is the residual standard error
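- h is the leverage term for the new predictor location; in simple linear regression, h = 1/n + (x0 - x-bar)^2 / sum((xi - x-bar)^2), so it grows as x0 moves away from the mean of the observed x values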
By contrast, a confidence interval for the mean response uses:
y-hat ± t* × s × sqrt(h)
The only visible difference is the 1 + h term instead of just h, but that difference is conceptually large. The added 1 captures the random variation of a brand-new observation.
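The two formulas can be checked against predict() directly. The following is a minimal sketch, assuming the built-in cars dataset (dist ~ speed) as a stand-in model and an arbitrary new value of speed = 15; neither comes from this article's example.

```r
# Illustrative model on the built-in `cars` data (an assumption, not the article's data).
model <- lm(dist ~ speed, data = cars)
x0    <- data.frame(speed = 15)   # arbitrary new predictor value

# Pieces of the formula
X     <- model.matrix(model)                               # design matrix used in the fit
x0row <- model.matrix(delete.response(terms(model)), x0)   # design row for the new point
h     <- drop(x0row %*% solve(crossprod(X)) %*% t(x0row))  # leverage term h
s     <- summary(model)$sigma                              # residual standard error
tstar <- qt(0.975, df.residual(model))                     # t* for a 95% two-sided interval
yhat  <- drop(x0row %*% coef(model))                       # predicted value y-hat

# Prediction interval uses sqrt(1 + h); confidence interval uses sqrt(h)
manual_pi <- c(fit = yhat,
               lwr = yhat - tstar * s * sqrt(1 + h),
               upr = yhat + tstar * s * sqrt(1 + h))
manual_ci <- c(fit = yhat,
               lwr = yhat - tstar * s * sqrt(h),
               upr = yhat + tstar * s * sqrt(h))

# Same quantities from predict()
builtin_pi <- predict(model, newdata = x0, interval = "prediction", level = 0.95)
builtin_ci <- predict(model, newdata = x0, interval = "confidence", level = 0.95)

print(rbind(manual = manual_pi, builtin = builtin_pi[1, ]))
```

If the manual and built-in rows agree, the formula components have been extracted correctly.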
How R calculates prediction intervals
In R, the standard workflow looks like this:
    model <- lm(y ~ x, data = mydata)
    newdata <- data.frame(x = 10)
    predict(model, newdata = newdata, interval = "prediction", level = 0.95)

This returns a matrix with three columns:
- fit: the predicted value
- lwr: the lower bound of the prediction interval
- upr: the upper bound of the prediction interval
If you instead write interval = "confidence", R returns the confidence interval for the mean response rather than the interval for an individual future observation. This distinction is one of the most common sources of errors in applied analysis, especially in business forecasting, quality control, and biomedical studies where the user wants to predict a single future case.
Example in R
    sales_model <- lm(sales ~ ad_spend, data = marketing)
    new_campaign <- data.frame(ad_spend = 5000)
    predict(sales_model, newdata = new_campaign, interval = "prediction", level = 0.95)

If the fitted value is 120 units, the residual standard error is 15, the leverage is 0.08, and the residual degrees of freedom are 28, the 95% prediction interval is approximately:
- fit = 120
- t* ≈ 2.048
- standard error for prediction = 15 × sqrt(1 + 0.08) ≈ 15.59
- margin of error ≈ 31.93
- prediction interval ≈ [88.07, 151.93]
That means a future individual observation is expected to fall in that range about 95% of the time, assuming the model assumptions hold and the new observation is drawn from the same process as the training data.
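The arithmetic above can be reproduced in a few lines of R. This is a sketch that takes the four summary inputs from the example as given; the variable names are illustrative.

```r
# Inputs taken from the worked example above
fit   <- 120    # predicted value
s     <- 15     # residual standard error
h     <- 0.08   # leverage of the new point
df    <- 28     # residual degrees of freedom
level <- 0.95

tstar   <- qt(1 - (1 - level) / 2, df)   # about 2.048
se_pred <- s * sqrt(1 + h)               # about 15.59
margin  <- tstar * se_pred               # about 31.93

pi_bounds <- c(lwr = fit - margin, upr = fit + margin)
print(round(pi_bounds, 2))               # about 88.07 and 151.93
```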
Prediction interval versus confidence interval
Because people often compare these two intervals, it helps to see the practical difference clearly.
| Feature | Confidence Interval | Prediction Interval |
|---|---|---|
| Main purpose | Estimate the mean response at x0 | Estimate a future individual response at x0 |
| Formula factor | s × sqrt(h) | s × sqrt(1 + h) |
| Width | Narrower | Wider |
| Typical use | Inference about expected average outcome | Forecasting a single future case |
| R syntax | interval = "confidence" | interval = "prediction" |
Numerical comparison using the same model inputs
Using y-hat = 120, residual standard error = 15, leverage = 0.08, and df = 28, you get the following 95% intervals:
| Statistic | Confidence Interval | Prediction Interval |
|---|---|---|
| Critical t value | 2.048 | 2.048 |
| Standard error term | 15 × sqrt(0.08) = 4.24 | 15 × sqrt(1.08) = 15.59 |
| Margin of error | 8.69 | 31.93 |
| Lower bound | 111.31 | 88.07 |
| Upper bound | 128.69 | 151.93 |
| Total width | 17.38 | 63.86 |
The prediction interval is about 3.7 times as wide as the confidence interval in this example (63.86 versus 17.38). That is not a mistake. It reflects genuine uncertainty in the next individual outcome.
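Because both intervals share the same t* and s, the ratio of their widths depends only on the leverage h. The sketch below computes that ratio for the example's h = 0.08 and for a few other leverage values chosen purely for illustration.

```r
# Width ratio (prediction / confidence) = sqrt((1 + h) / h); t* and s cancel out
h <- 0.08
width_ratio <- sqrt((1 + h) / h)
print(round(width_ratio, 2))   # about 3.67 for the example above

# Illustrative leverage values (assumptions, not from the example)
h_grid <- c(0.02, 0.08, 0.25, 0.50)
ratio_table <- data.frame(h = h_grid,
                          ratio = round(sqrt((1 + h_grid) / h_grid), 2))
print(ratio_table)
```

The ratio shrinks as leverage grows, but the prediction interval always remains the wider of the two.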
Step-by-step process in R
1. Fit your regression model using lm() or another supported modeling function.
2. Create a newdata data frame containing the predictor values of interest.
3. Call predict() with interval = "prediction".
4. Set the confidence level using the level argument.
5. Inspect the returned fit, lower bound, and upper bound.
6. Validate assumptions before interpreting the interval as reliable.
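The steps above can be sketched end to end as follows. The example assumes the built-in mtcars dataset, a model of mpg on wt, three arbitrary new weights, and a 90% level; all of these are illustrative choices rather than part of the original example.

```r
# 1. Fit the model (illustrative: fuel economy vs. weight in `mtcars`)
model <- lm(mpg ~ wt, data = mtcars)

# 2. New predictor values; the column name must match the formula
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))

# 3-4. Request prediction intervals at the chosen level
pred <- predict(model, newdata = new_cars, interval = "prediction", level = 0.90)

# 5. Inspect fit, lwr, and upr next to the inputs
result <- cbind(new_cars, pred)
print(result)

# 6. Look at residual diagnostics before trusting the interval
#    (plots are drawn only in an interactive session)
if (interactive()) plot(model, which = 1:2)
```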
Assumptions you should not ignore
Prediction intervals in ordinary least squares regression depend on several assumptions. Violations do not always make the model useless, but they can make the interval too narrow or too wide, so that actual coverage differs from the nominal level.
- Linearity: the relationship between predictors and outcome is correctly represented by the model.
- Independent errors: the errors for different observations are not correlated with one another.
- Constant variance: error variance is reasonably stable across predictor values.
- Approximate normality of errors: the t-based prediction interval relies on the shape of the error distribution itself, so this assumption matters even in large samples, unlike a confidence interval for the mean, where large samples reduce its importance.
- Comparable future cases: the new observation should come from the same process as the data used to fit the model.
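Some of these assumptions can be screened numerically. The sketch below uses the built-in cars dataset as a stand-in and three rough checks chosen for illustration; they are quick screens, not a substitute for inspecting residual plots.

```r
# Illustrative model on built-in data (an assumption, not the article's data)
model       <- lm(dist ~ speed, data = cars)
res         <- residuals(model)
fitted_vals <- fitted(model)

# Normality of residuals: Shapiro-Wilk test (small p-values suggest non-normality)
sw <- shapiro.test(res)

# Constant variance, rough screen: do absolute residuals trend with fitted values?
spread_trend <- cor(abs(res), fitted_vals)

# Independence, rough screen: lag-1 correlation of residuals in row order
# (only meaningful when row order reflects time or collection sequence)
lag1 <- cor(res[-1], res[-length(res)])

checks <- list(shapiro_p = sw$p.value,
               spread_trend = spread_trend,
               lag1_autocorrelation = lag1)
print(checks)

# Visual checks: residuals vs. fitted and a normal Q-Q plot
if (interactive()) plot(model, which = 1:2)
```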
If you are extrapolating far beyond your observed x range, the formal interval from R may still compute, but the real-world reliability can be poor. A mathematically valid interval is not automatically a substantively credible forecast.
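One simple safeguard is to flag new points that lie outside the observed predictor range before reading their intervals. This is a minimal sketch, assuming the built-in cars dataset and four arbitrary speeds, two of which are deliberately outside the data.

```r
# Illustrative model on built-in data (an assumption, not the article's data)
model   <- lm(dist ~ speed, data = cars)
x_range <- range(cars$speed)   # observed range of the predictor

# Arbitrary new points: the last two are beyond the observed range
new_x <- data.frame(speed = c(15, 25, 40, 60))
pred  <- predict(model, newdata = new_x, interval = "prediction")

out <- data.frame(
  speed        = new_x$speed,
  inside_range = new_x$speed >= x_range[1] & new_x$speed <= x_range[2],
  width        = unname(pred[, "upr"] - pred[, "lwr"])
)
print(out)
```

The formal width grows as the new point moves away from the centre of the data, but it only reflects uncertainty under the fitted model. It says nothing about whether the linear form still holds out there.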
When the interval gets wider
Your prediction interval becomes wider when any of the following happen:
- The residual standard error is large
- The confidence level increases from 90% to 95% or 99%
- The sample size is smaller, which raises both the t critical value and the leverage term
- The new x value has high leverage
- The model fit is weak and unexplained variability is high
This is one reason that prediction intervals are more honest than point forecasts alone. A single predicted number can look precise even when the plausible range is broad.
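The effect of the confidence level and the degrees of freedom can be seen directly from the t critical values. The sketch below reuses s = 15 and h = 0.08 from the earlier example; the degrees-of-freedom values other than 28 are arbitrary illustrations.

```r
# t critical values: rise with the confidence level, fall with degrees of freedom
conf_levels <- c(0.90, 0.95, 0.99)
dfs         <- c(5, 28, 200)   # 28 is from the example; 5 and 200 are illustrative

tcrit <- outer(dfs, conf_levels,
               function(df, level) qt(1 - (1 - level) / 2, df))
dimnames(tcrit) <- list(paste0("df=", dfs), c("90%", "95%", "99%"))
print(round(tcrit, 3))

# Margin of error of the prediction interval at df = 28, s = 15, h = 0.08
s <- 15
h <- 0.08
half_width <- tcrit["df=28", ] * s * sqrt(1 + h)
print(round(half_width, 2))
```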
Common mistakes when trying to calculate a prediction interval in R
1. Using interval = "confidence" by accident
This returns the interval for the mean response, not the interval for a future individual case. At the same level and predictor values it is always narrower than the prediction interval, so it understates the uncertainty in a new observation.
2. Forgetting to supply newdata correctly
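The variable names in newdata must exactly match the predictor names used in the model formula. If a name is missing or misspelled, R typically stops with an "object not found" error or, if an object with that name happens to exist elsewhere in the workspace, may silently use it instead. Fitting with a data argument, as in lm(y ~ x, data = mydata) rather than lm(mydata$y ~ mydata$x), lets predict() match the names in newdata.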
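A quick way to catch this is to compare the names in newdata with the predictors the model expects before calling predict(). The sketch below uses simulated data with hypothetical column names (sales, ad_spend) that mirror the earlier example; the numbers are made up for illustration.

```r
# Simulated stand-in for the marketing data (hypothetical values)
set.seed(1)
marketing <- data.frame(ad_spend = runif(30, 1000, 9000))
marketing$sales <- 20 + 0.02 * marketing$ad_spend + rnorm(30, sd = 15)
sales_model <- lm(sales ~ ad_spend, data = marketing)

good_newdata <- data.frame(ad_spend = 5000)
bad_newdata  <- data.frame(adspend = 5000)   # misspelled column name

# Predictors the model needs, and which ones each data frame is missing
required     <- all.vars(delete.response(terms(sales_model)))
missing_good <- setdiff(required, names(good_newdata))
missing_bad  <- setdiff(required, names(bad_newdata))
print(missing_bad)

# Correct names: returns a matrix with fit, lwr, upr
ok <- predict(sales_model, newdata = good_newdata, interval = "prediction")

# Wrong names: capture the error message instead of stopping the script
bad <- tryCatch(
  predict(sales_model, newdata = bad_newdata, interval = "prediction"),
  error = function(e) conditionMessage(e)
)
print(bad)
```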
3. Interpreting the interval as covering 95% of all future data exactly
The interval is conditional on the model being correctly specified and the assumptions being approximately valid. In practice, actual coverage can differ from the nominal level.
4. Ignoring transformations
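If your model transforms the response, for example lm(log(y) ~ x), the interval is calculated on the transformed scale. Applying the inverse transformation to the lower and upper bounds gives an interval on the original scale, but that interval is no longer symmetric around the back-transformed fit, and the back-transformed fit is not an estimate of the mean response. Transformations of the predictors, such as polynomial terms, do not change the scale of the interval.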
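For a log-transformed response, one common approach is to exponentiate the interval bounds. The following is a sketch on simulated data; the data-generating values and variable names are assumptions made only for illustration.

```r
# Simulated data with multiplicative errors (hypothetical values)
set.seed(42)
n <- 60
x <- runif(n, 1, 10)
y <- exp(0.5 + 0.2 * x + rnorm(n, sd = 0.3))
d <- data.frame(x = x, y = y)

# Model fitted on the log scale
log_model <- lm(log(y) ~ x, data = d)
new_x     <- data.frame(x = 5)

# Interval on the log scale, as returned by predict()
pi_log <- predict(log_model, newdata = new_x, interval = "prediction", level = 0.95)

# Back-transform the bounds to the original scale of y
pi_orig <- exp(pi_log)
print(pi_log)
print(pi_orig)
```

Because exp() is monotone, the back-transformed bounds keep their nominal coverage under the model assumptions, but the interval becomes asymmetric. Under normal errors on the log scale, exp(fit) corresponds to the median of y rather than its mean.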
5. Extrapolating beyond the data range
Even if R returns a prediction interval, the estimate may be unstable if the new point is far from the observed predictor values.
Why use a calculator if R already computes it?
A calculator like the one above is useful because it reveals the mechanics hidden inside predict(). You can see exactly how the fitted value, residual standard error, leverage, degrees of freedom, and confidence level combine to produce the final range. This is especially helpful for teaching, auditing analysis pipelines, checking model outputs, or documenting a report for nontechnical stakeholders.
It also helps you understand why intervals differ between scenarios. For example, in a model with several predictors, two new cases can have the same fitted value while one has much higher leverage; the interval for that case will be wider. That is not due to random luck. It is because predictions become less stable at unusual or extreme predictor settings.
Final takeaway
To calculate a prediction interval in R, the most direct method is to fit your model and call predict(…, interval = "prediction"). The interval you get reflects both uncertainty in the estimated mean response and the natural spread of future observations. Compared with confidence intervals, prediction intervals are wider and usually more appropriate when forecasting a single future case.
If you remember just one rule, let it be this: use confidence intervals for the mean, and use prediction intervals for an individual outcome. That small syntax choice in R changes the interpretation substantially.
Educational note: this calculator follows the standard t-based regression interval formula for a single prediction point. For generalized linear models, mixed models, time-series models, or heteroskedastic settings, interval estimation may require a different method.