How Is Variable Importance Calculated in Random Forest?
Expert Guide: How Variable Importance Is Calculated in Random Forest
Variable importance in a random forest is a way to measure how much each predictor contributes to the model. If you have ever trained a random forest and seen a ranked list of features, what you are really looking at is a summary of how strongly each variable helped the forest make better predictions. This sounds simple, but there are actually multiple ways to calculate feature importance, and each method reflects a different idea of what “important” means.
In practice, the two most common approaches are permutation importance and mean decrease in impurity, often abbreviated as MDI. Both are widely used, both can be useful, and both can also be misleading if interpreted without context. Understanding the calculation behind them helps you decide whether a top ranked variable is truly informative, merely correlated with another predictor, or favored by the splitting mechanics of tree based models.
What random forests are doing under the hood
A random forest is an ensemble of decision trees. Each tree is trained on a bootstrap sample of the training data, and at each split the algorithm considers only a random subset of features. This combination of bagging and random feature selection reduces variance and makes the final model more robust than a single tree.
Every tree in the forest repeatedly asks questions such as “is age greater than 45?” or “is median income below 62000?” in order to partition the data into more homogeneous groups. Those splits reduce uncertainty. In classification, that uncertainty is often measured by Gini impurity or entropy. In regression, it is usually measured by variance reduction or mean squared error reduction. Variable importance quantifies how much a given feature contributed to those improvements or to final predictive performance.
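The impurity measures named above can be written in a few lines. The following is a minimal sketch using only the Python standard library; the function names and the example nodes are illustrative assumptions, not part of any particular random forest package.

```python
import math
from collections import Counter
from statistics import pvariance


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def variance(targets):
    """Population variance, a common node impurity for regression trees."""
    return pvariance(targets)


mixed_node = ["yes"] * 6 + ["no"] * 4   # hypothetical 60/40 class mix
pure_node = ["yes"] * 10                # hypothetical single-class node

print(gini_impurity(mixed_node))        # about 0.48
print(gini_impurity(pure_node))         # 0.0
print(entropy(mixed_node))              # about 0.971
print(variance([10.0, 12.0, 14.0]))     # about 2.667
```

A node that mixes classes has positive impurity, and a pure node has none, which is why splits that separate the classes register as impurity reductions.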
Method 1: Mean decrease in impurity
Mean decrease in impurity is the fastest and most direct importance measure available from a trained random forest. Every time a feature is used to split a node, the model computes how much that split reduced impurity. For example, suppose a classification node has a Gini impurity of 0.48 before the split and a weighted child impurity of 0.31 after the split. The impurity decrease attributable to that split is 0.48 - 0.31 = 0.17. Before it is credited to the feature, this decrease is usually weighted by the number or proportion of training samples reaching the node, so a split near the root counts for more than an equally sharp split deep in the tree.
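The arithmetic for a single split can be sketched as follows. The class counts and the assumption that 100 of 1,000 training samples reach the node are hypothetical, chosen so that the parent impurity is 0.48 and the weighted child impurity lands close to the 0.31 used in the example above.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, scaled by the share of all
    training samples that reach the parent node."""
    n_parent = len(parent)
    children = (
        (len(left) / n_parent) * gini(left)
        + (len(right) / n_parent) * gini(right)
    )
    return (n_parent / n_total) * (gini(parent) - children)


# Hypothetical node holding 60 samples of class "a" and 40 of class "b".
left = ["a"] * 50 + ["b"] * 10
right = ["a"] * 10 + ["b"] * 30
parent = left + right

raw = weighted_impurity_decrease(parent, left, right, n_total=len(parent))
weighted = weighted_impurity_decrease(parent, left, right, n_total=1000)
print(round(raw, 4))       # decrease at the node itself, about 0.1633
print(round(weighted, 4))  # same decrease scaled by 100 / 1000, about 0.0163
```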
The forest adds together all impurity reductions for each feature across all trees. A feature that repeatedly creates strong splits receives a larger cumulative value. To compare variables on a common scale, these sums are often normalized by the total impurity decrease from all features.
The core idea is:
- Find every split in every tree that uses the feature.
- Compute impurity decrease at each of those splits.
- Weight the decrease by the number or proportion of observations passing through the node.
- Sum the weighted decreases across the forest.
- Optionally divide by the total decrease across all features to convert to a percentage.
A simplified formula for a single feature is:
MDI(feature) = sum of weighted impurity decreases for that feature / sum of weighted impurity decreases for all features
This is what many software packages report as built in feature importance. It is quick because the forest already computed these split statistics during training.
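The aggregation steps listed above can be sketched with plain Python. The forest here is a hypothetical list of split records rather than a real fitted model, and averaging over trees is one common convention; it does not change the normalized shares.

```python
from collections import defaultdict


def mean_decrease_in_impurity(trees, normalize=True):
    """Aggregate per-split weighted impurity decreases into per-feature MDI.

    `trees` is a list of trees; each tree is a list of
    (feature_name, weighted_impurity_decrease) pairs, one per split.
    """
    totals = defaultdict(float)
    for splits in trees:
        for feature, decrease in splits:
            totals[feature] += decrease
    # Average over trees so the raw value does not grow with forest size.
    importances = {f: total / len(trees) for f, total in totals.items()}
    if normalize:
        grand_total = sum(importances.values())
        importances = {f: v / grand_total for f, v in importances.items()}
    return importances


# Hypothetical split records from a three-tree forest.
forest = [
    [("income", 0.20), ("latitude", 0.10), ("income", 0.05)],
    [("latitude", 0.25), ("density", 0.05)],
    [("income", 0.15), ("latitude", 0.15), ("density", 0.05)],
]

mdi = mean_decrease_in_impurity(forest)
for feature, share in sorted(mdi.items(), key=lambda kv: -kv[1]):
    print(f"{feature:10s} {share:.1%}")
```

In this toy forest, latitude collects half of the total impurity decrease, income 40%, and density 10%.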
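The table below shows illustrative values for four features. Each share is relative to the total impurity decrease across all features in the model, so the four shares shown need not sum to 100%.

| Feature | Total Weighted Impurity Decrease | Normalized MDI | Interpretation |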
|---|---|---|---|
| Median Income | 125.4 | 14.9% | Strong and frequent splitter across many trees |
| Population Density | 92.1 | 11.0% | Moderate contribution to node purity |
| Education Rate | 66.7 | 7.9% | Useful but less dominant than top variables |
| Latitude | 154.2 | 18.4% | Very influential structure variable in this example |
Why MDI can be useful
- It is computed immediately from the fitted forest.
- It captures how often and how effectively a feature splits the data.
- It works naturally for both classification and regression forests.
- It is convenient for fast model inspection and rough feature ranking.
Limitations of mean decrease in impurity
MDI is not a perfect measure of true predictive value. It can be biased toward variables with many possible split points, such as continuous variables or high cardinality categorical variables. It can also distribute importance unevenly among correlated predictors. If two features carry similar information, the forest may prefer one in some trees and the other in different trees, causing unstable rankings. A feature can therefore look less important than it really is simply because another correlated variable absorbed some of its splitting power.
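The bias toward variables with many split points can be seen in a small simulation. This is a sketch under stated assumptions: both features are pure noise, the labels are random, and only the best single split at one node is examined rather than a full forest.

```python
import random


def gini_from_counts(pos, n):
    """Gini impurity of a binary node with `pos` positives out of `n`."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_split_decrease(values, labels):
    """Largest Gini decrease over all thresholds on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(label for _, label in pairs)
    parent = gini_from_counts(total_pos, n)
    best = 0.0
    left_pos = 0
    for i in range(1, n):
        left_pos += pairs[i - 1][1]       # positives among the first i rows
        if pairs[i][0] == pairs[i - 1][0]:
            continue                      # no threshold fits between equal values
        children = (
            (i / n) * gini_from_counts(left_pos, i)
            + ((n - i) / n) * gini_from_counts(total_pos - left_pos, n - i)
        )
        best = max(best, parent - children)
    return best


rng = random.Random(0)
n_samples, n_trials = 100, 200
binary_total = continuous_total = 0.0
for _ in range(n_trials):
    labels = [rng.randint(0, 1) for _ in range(n_samples)]
    binary_noise = [rng.randint(0, 1) for _ in range(n_samples)]  # one candidate split
    continuous_noise = [rng.random() for _ in range(n_samples)]   # up to 99 candidate splits
    binary_total += best_split_decrease(binary_noise, labels)
    continuous_total += best_split_decrease(continuous_noise, labels)

print("binary noise:    ", binary_total / n_trials)
print("continuous noise:", continuous_total / n_trials)
```

Neither feature carries any signal, yet the continuous one achieves a larger best-split decrease on average simply because it offers more thresholds to choose from.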
This is why many practitioners complement or replace MDI with permutation importance, especially when interpretability matters more than speed.
Method 2: Permutation importance
Permutation importance asks a more intuitive question: how much does model performance deteriorate if one feature is destroyed? To calculate it, you first evaluate the trained random forest on a validation set or out of bag sample and record a baseline score. Then you randomly shuffle the values of one feature, breaking the relationship between that feature and the target while leaving all other columns unchanged. Finally, you score the model again.
If the feature matters, the score will drop. If the feature carries little unique information, the score will barely change. The larger the drop, the more important the feature.
The calculation is:
- Measure baseline performance on a holdout or out of bag sample.
- Permute one feature column at random.
- Recompute model performance with the same fitted forest.
- Subtract permuted score from baseline score.
- Repeat several times and average the score drop for stability.
A simple formula is:
Permutation Importance = Baseline Score - Permuted Score
Many analysts also convert the score drop into a percentage relative to the baseline:
Relative Importance % = ((Baseline Score - Permuted Score) / Baseline Score) × 100
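The procedure above can be sketched in plain Python. The `model` function is a hypothetical stand-in for a fitted forest (it reads column 0 and ignores column 1), and the data is synthetic; with a real forest you would pass its prediction function and a holdout set instead.

```python
import random
from statistics import mean, stdev


def accuracy(model, rows, targets):
    """Fraction of rows the model classifies correctly."""
    return sum(model(row) == y for row, y in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, column, n_repeats=10, seed=0):
    """Mean and standard deviation of the accuracy drop after shuffling one column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, targets)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with only the chosen column replaced.
        permuted_rows = [
            row[:column] + [value] + row[column + 1:]
            for row, value in zip(rows, shuffled)
        ]
        drops.append(baseline - accuracy(model, permuted_rows, targets))
    return mean(drops), stdev(drops)


def model(row):
    """Hypothetical fitted model: relies on column 0 only."""
    return int(row[0] > 0.5)


data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(500)]
targets = [int(row[0] > 0.5) for row in rows]

for column in (0, 1):
    drop, spread = permutation_importance(model, rows, targets, column)
    print(f"column {column}: mean drop = {drop:.3f}, std = {spread:.3f}")
```

Shuffling the column the model relies on costs roughly half the accuracy, while shuffling the ignored column costs nothing. Reporting the standard deviation alongside the mean shows how stable the estimate is across repeats.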
| Feature | Baseline Accuracy | Accuracy After Permutation | Absolute Drop | Relative Drop |
|---|---|---|---|---|
| Median Income | 0.860 | 0.790 | 0.070 | 8.14% |
| Population Density | 0.860 | 0.822 | 0.038 | 4.42% |
| Education Rate | 0.860 | 0.841 | 0.019 | 2.21% |
| Latitude | 0.860 | 0.771 | 0.089 | 10.35% |
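The two drop columns in the table follow directly from the formulas above. A short check, using only the accuracy values from the table:

```python
baseline = 0.860
permuted_accuracy = {
    "Median Income": 0.790,
    "Population Density": 0.822,
    "Education Rate": 0.841,
    "Latitude": 0.771,
}

results = {}
for feature, permuted in permuted_accuracy.items():
    absolute_drop = baseline - permuted
    relative_drop = 100 * absolute_drop / baseline
    results[feature] = (round(absolute_drop, 3), round(relative_drop, 2))

# Print features from largest to smallest absolute drop.
for feature, (absolute_drop, relative_drop) in sorted(
    results.items(), key=lambda kv: -kv[1][0]
):
    print(f"{feature:20s} {absolute_drop:.3f} {relative_drop:.2f}%")
```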
Why permutation importance is often stronger for interpretation
Permutation importance directly evaluates predictive reliance. Instead of asking how useful a variable was during tree construction, it asks how much the fitted model depends on that variable to make accurate predictions. That makes it easier to explain to business stakeholders, researchers, and auditors. If shuffling a feature causes a large drop in score, then the model genuinely relied on that feature.
It also reduces some of the split selection bias that affects impurity based importance. However, permutation importance is not bias free either. Correlated features remain a challenge. If two variables contain nearly identical information, permuting one may not lower performance much because the other still acts as a substitute. As a result, both variables may appear less important than expected even though the pair is collectively critical.
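The substitution effect can be illustrated with a toy ensemble. Everything here is an assumption made for the sketch: three exact copies of one signal, a hypothetical three-stump majority vote standing in for a forest, and a grouped permutation that shuffles several columns with one shared row order.

```python
import random


def forest_predict(row):
    """Hypothetical three-stump ensemble; each stump reads a different column."""
    votes = sum(int(value > 0.0) for value in row)
    return int(votes >= 2)


def accuracy(rows, targets):
    """Fraction of rows the ensemble classifies correctly."""
    return sum(forest_predict(r) == y for r, y in zip(rows, targets)) / len(rows)


def permute_columns(rows, columns, rng):
    """Shuffle the given columns together, using one shared row order."""
    order = list(range(len(rows)))
    rng.shuffle(order)
    permuted = [list(row) for row in rows]
    for target_index, source_index in enumerate(order):
        for column in columns:
            permuted[target_index][column] = rows[source_index][column]
    return permuted


rng = random.Random(1)
signal = [rng.gauss(0.0, 1.0) for _ in range(1000)]
rows = [[s, s, s] for s in signal]        # three perfectly correlated copies
targets = [int(s > 0.0) for s in signal]

baseline = accuracy(rows, targets)
single_drop = baseline - accuracy(permute_columns(rows, [0], rng), targets)
group_drop = baseline - accuracy(permute_columns(rows, [0, 1, 2], rng), targets)
print("drop after permuting one copy: ", single_drop)
print("drop after permuting all three:", group_drop)
```

Permuting a single copy leaves accuracy untouched because the other two stumps still outvote it, while permuting the whole group removes about half the accuracy. Real predictors are rarely exact duplicates, so the effect is usually less extreme than in this sketch.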
Comparing the two methods
- MDI measures contribution to impurity reduction during tree building.
- Permutation importance measures impact on predictive performance after training.
- MDI is faster and built in.
- Permutation importance is slower but often more faithful for interpretation.
- MDI can favor continuous or high cardinality variables.
- Permutation importance is sensitive to feature correlation and evaluation sample choice.
Classification versus regression
The overall logic is the same in both settings, but the underlying score differs. In classification, the baseline and permuted scores may be accuracy, AUC, log loss, or F1 score. In regression, they may be R-squared, root mean squared error, or mean absolute error. For error metrics such as log loss, root mean squared error, and mean absolute error, lower is better, so importance is the increase in error after permutation rather than the drop in score. For impurity based importance, classification trees often use Gini impurity or entropy, while regression trees use variance reduction or squared error reduction.
If you compare importances across studies, make sure the metric is comparable. A feature with a 0.04 drop in AUC is not directly equivalent to a feature with a 0.04 drop in R-squared.
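One way to keep the sign consistent across these metrics is to record whether a higher score is better. The helper below is a hypothetical sketch, and the scores passed to it are invented for illustration.

```python
def importance_from_scores(baseline, permuted, higher_is_better=True):
    """Return a value that is positive when permuting the feature hurts the model."""
    if higher_is_better:
        return baseline - permuted   # accuracy, AUC, F1, R-squared
    return permuted - baseline       # log loss, RMSE, MAE


# Hypothetical scores for one feature under two different metrics.
auc_importance = importance_from_scores(0.91, 0.87)
rmse_importance = importance_from_scores(4.2, 5.0, higher_is_better=False)
print(round(auc_importance, 3))    # 0.04: AUC fell after permutation
print(round(rmse_importance, 3))   # 0.8: RMSE rose after permutation
```

Both values are positive, but they live on different scales and should not be compared with each other.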
How out of bag samples fit in
Random forests naturally leave some observations out of each bootstrap sample. These are called out of bag observations. Some implementations, such as the randomForest package in R, use out of bag data to estimate permutation importance without needing a separate validation set, while others, such as scikit-learn's permutation_importance function, score whatever dataset you pass in. The out of bag route is computationally efficient and often works well, though a dedicated validation or test set can still be preferable when you need a final, unbiased estimate.
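How many observations end up out of bag can be checked with a quick simulation. This sketch assumes the standard bootstrap (sampling n rows with replacement from n rows); the sample size and number of trees are arbitrary.

```python
import math
import random


def bootstrap_split(n_samples, rng):
    """Return (in_bag, out_of_bag) row indices for one tree."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(7)
n_samples = 1000
fractions = []
for _ in range(100):                      # 100 hypothetical trees
    _, oob = bootstrap_split(n_samples, rng)
    fractions.append(len(oob) / n_samples)

mean_fraction = sum(fractions) / len(fractions)
print("mean out of bag fraction:", mean_fraction)
print("theoretical limit 1/e:   ", math.exp(-1))
```

Each tree leaves out a little over a third of the rows, so every observation is out of bag for many trees and can be scored by the trees that never saw it.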
How to interpret a large or small importance value
A large importance value means the feature either created meaningful impurity reductions across the forest, strongly affected predictive performance when permuted, or both. A small value means the feature had limited unique influence in the context of the other predictors. But a small value does not automatically mean the feature is useless in the real world. It may simply be redundant with another variable, affected by limited sample size, or masked by interactions.
Feature importance is therefore best used as a diagnostic tool, not as definitive proof of causal impact. A random forest can tell you that a feature improved prediction, but it cannot by itself prove why that relationship exists.
Best practices for trustworthy variable importance
- Prefer permutation importance when interpretability matters.
- Use repeated permutations and average the results.
- Evaluate on out of sample data, not only training data.
- Inspect correlation among predictors before drawing conclusions.
- Compare MDI and permutation rankings to spot disagreements.
- Use partial dependence, ICE plots, or SHAP style methods for deeper interpretation.
- Report the metric used, the dataset split, and whether values were normalized.
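Comparing the two rankings can be as simple as a rank correlation. The sketch below uses the illustrative MDI and permutation values from the tables earlier in this guide and a hand-written Spearman formula that assumes no ties.

```python
def ranks(values):
    """Rank from 1 (largest) downward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation for tie-free data."""
    n = len(a)
    d_squared = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))


features = ["Median Income", "Population Density", "Education Rate", "Latitude"]
mdi_share = [14.9, 11.0, 7.9, 18.4]               # normalized MDI, percent
permutation_drop = [0.070, 0.038, 0.019, 0.089]   # absolute accuracy drop

print(spearman(mdi_share, permutation_drop))      # 1.0: identical rankings
for name, r_mdi, r_perm in zip(features, ranks(mdi_share), ranks(permutation_drop)):
    print(f"{name:20s} MDI rank {r_mdi}, permutation rank {r_perm}")
```

In this example the two methods agree completely. A correlation well below 1 would be a cue to look for correlated predictors or high cardinality variables.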
Real world perspective on magnitude
In many practical classification tasks, a permutation importance drop of 1% to 3% in accuracy may already be meaningful, especially in highly competitive models where overall gains are hard won. A drop above 5% often indicates a major feature. For impurity based metrics, normalized MDI shares are relative. If one variable contributes 18% of total impurity decrease, that is substantial in a model with dozens of predictors, but less remarkable in a model with only three.
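To make the last point concrete, compare the 18% share with what each feature would receive if importance were split equally. The choice of 30 predictors is a hypothetical stand-in for "dozens".

```python
def share_relative_to_uniform(share, n_features):
    """Ratio of a normalized MDI share to an equal split of 1 / n_features."""
    return share * n_features


share = 0.18
ratios = {}
for n_features in (3, 30):  # 30 is a hypothetical stand-in for "dozens of predictors"
    ratios[n_features] = share_relative_to_uniform(share, n_features)
    print(
        f"{n_features:2d} predictors: equal share = {1 / n_features:.1%}, "
        f"ratio = {ratios[n_features]:.2f}"
    )
```

With three predictors, 18% is only about half of an equal share; with thirty, it is more than five times an equal share.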
Common mistakes analysts make
- Assuming feature importance implies causation.
- Comparing raw impurity values across unrelated models.
- Ignoring the effects of correlated predictors.
- Using training set permutation instead of validation or out of bag estimates.
- Ranking variables without checking importance stability across runs.
Authoritative resources for further study
If you want to dig deeper into the statistical foundations and practical cautions, these sources are highly useful:
- National Institutes of Health: Bias in random forest variable importance measures
- Penn State University: Tree based methods and random forests overview
- University of California, Berkeley: Leo Breiman random forest documentation
Bottom line
So, how is variable importance calculated in random forest? Most often, it is calculated either by summing the impurity reduction created by each feature across all trees or by measuring the loss in model performance after permuting that feature. The first method is fast and built into the training process. The second is generally more interpretable because it quantifies how much the trained model depends on the feature for prediction.
In professional workflows, the strongest approach is rarely to trust only one number. Instead, compare impurity based importance and permutation importance, inspect correlation structure, and supplement rankings with visual diagnostics. That gives you a richer and more defensible understanding of what the forest is actually learning.