Calculate correlation between two variables and p-value in Stata
Paste two numeric variable lists, choose Pearson or Spearman correlation, and instantly compute the coefficient, test statistic, p-value, confidence interval, and a scatter chart. The guide below also shows the exact Stata commands you would use in practice.
How this maps to Stata
If your dataset already contains two variables, Stata syntax is straightforward:
- corr reports the Pearson correlation matrix.
- pwcorr with the sig option adds p-values and uses pairwise deletion for missing data.
- spearman computes rank correlation and significance tests.
- This calculator mirrors the main ideas by estimating the coefficient and a two-tailed p-value.
Expert guide: how to calculate correlation between two variables and p-value in Stata
When analysts ask how to calculate the correlation between two variables and its p-value in Stata, they are usually trying to answer a very practical question: how strongly are two variables associated, in what direction, and is that pattern likely to be more than random noise? Correlation is one of the most common exploratory tools in statistics because it condenses a relationship into a single coefficient. Stata makes this easy, but using the right command and interpreting the output correctly matters just as much as running the syntax.
At the simplest level, correlation quantifies whether larger values of one variable tend to appear with larger values of another variable, or whether larger values of one tend to come with smaller values of the other. A positive correlation suggests both variables move in the same direction. A negative correlation suggests they move in opposite directions. A coefficient near zero suggests little or no linear relationship, though it does not prove independence.
What correlation coefficient should you use?
In Stata, the most common choices are Pearson and Spearman correlation. Pearson correlation is best when both variables are continuous and the relationship is approximately linear. Spearman correlation is more robust when data are ordinal, skewed, or related monotonically but not along a straight line, because it works on ranks rather than raw values.
- Pearson correlation measures linear association on the original numeric scale.
- Spearman correlation measures monotonic association using ranked data.
- Kendall’s tau is another rank-based option, available through Stata's ktau command, though it is requested less often in Stata workflows than Pearson or Spearman.
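For completeness, the Kendall option in the list above can be run much like the other two. A minimal sketch, in which var1 and var2 are placeholder names for your own variables:

```stata
* Kendall's tau-a and tau-b with a test of independence
* (var1 and var2 are placeholders, not variables from a real dataset)
ktau var1 var2
```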
If your variables are things like blood pressure and age, income and spending, or hours studied and exam score, Pearson is often the first pass. If your variables are rankings, Likert scales, or heavily non-normal measurements with outliers, Spearman may be more defensible.
Core Stata commands for correlation and significance
The command many users start with is corr:
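A minimal form, with var1 and var2 standing in for your own variable names:

```stata
* Pearson correlation matrix for two variables
* (var1 and var2 are placeholders for variables in your dataset)
corr var1 var2
```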
This gives you the Pearson correlation coefficient, but corr does not report p-values. If you want p-values or significance markers, use pwcorr with the sig option:
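Using the same placeholder variable names, the call would look roughly like this:

```stata
* Pearson correlation with the p-value printed beneath each coefficient
* (var1 and var2 are placeholders)
pwcorr var1 var2, sig
```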
The sig option tells Stata to print significance probabilities. If you want the observation counts too, add obs:
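With both options combined, again on placeholder variables:

```stata
* sig prints p-values; obs prints the number of observations used for each pair
* (var1 and var2 are placeholders)
pwcorr var1 var2, sig obs
```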
For rank-based analysis, use:
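A minimal sketch on the same placeholder variables:

```stata
* Spearman's rho with a test of the null hypothesis of independence
* (var1 and var2 are placeholders)
spearman var1 var2
```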
Understanding the p-value in correlation analysis
The p-value answers a narrow but important hypothesis-testing question. Under the null hypothesis, the true population correlation is zero. The larger the sample correlation is in absolute value, and the larger the sample, the smaller the p-value becomes. A small p-value means that a correlation at least as extreme as the one observed would be unlikely if the true correlation were actually zero.
That does not mean the relationship is necessarily important, causal, or large in practical terms. A tiny but statistically significant correlation can appear in a huge dataset. Conversely, a moderate coefficient can fail to reach significance in a very small sample. This is why you should always report the coefficient, sample size, and p-value together.
| Absolute value of r | Common rough interpretation | Practical note |
|---|---|---|
| 0.00 to 0.19 | Very weak | May be statistically significant in large samples but often small in practical effect. |
| 0.20 to 0.39 | Weak | Suggests some association but usually not strong predictive value by itself. |
| 0.40 to 0.59 | Moderate | Often meaningful in applied work if theory supports the relationship. |
| 0.60 to 0.79 | Strong | Indicates substantial association, though outliers should still be checked. |
| 0.80 to 1.00 | Very strong | May indicate tight relationship, duplicated constructs, or potential multicollinearity. |
Worked example in Stata
Suppose you have two variables named study_hours and exam_score. To calculate the correlation and p-value in Stata, use:
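The sketch below is self-contained so it can be pasted into a do-file. The eight data rows are invented purely for illustration, so the coefficient Stata prints will not necessarily match the figures quoted in this guide. With your own data in memory, only the last line is needed.

```stata
* Toy data: invented values, for illustration only
clear
input study_hours exam_score
1 52
2 58
3 61
4 67
5 70
6 78
7 81
8 90
end

* Pearson correlation with p-value (sig) and pair count (obs)
pwcorr study_hours exam_score, sig obs
```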
With the sig and obs options, pwcorr reports the correlation matrix, the p-value beneath each coefficient, and the number of paired observations. If the output shows r = 0.96 with p < 0.001, you would interpret this as a very strong positive association, with strong evidence against the null hypothesis of zero correlation.
If you suspect the relationship is monotonic but not linear, run:
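Assuming a dataset containing study_hours and exam_score is already in memory, the rank-based version is:

```stata
* Spearman's rho for the worked example variables
spearman study_hours exam_score
```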
This is especially useful when the scatterplot bends, when distributions are skewed, or when the variables are ordinal.
How Stata calculates significance for Pearson correlation
For Pearson correlation, the test statistic is commonly based on a t distribution with n – 2 degrees of freedom:
t = r × sqrt((n – 2) / (1 – r²))
Stata converts that test statistic into a p-value. The calculator above uses the same general logic for a two-tailed significance test, so it is useful for checking your intuition before or after you run Stata.
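To see that conversion step by step, the following sketch recomputes t and the two-tailed p-value from the results that correlate leaves behind in r(rho) and r(N). The variable names var1 and var2 and the scalar names are placeholders chosen for this example.

```stata
* Assumes var1 and var2 (placeholder names) are in memory
quietly correlate var1 var2

* Copy the stored results into scalars with distinctive names
scalar r_xy = r(rho)
scalar n_xy = r(N)

* t statistic with n - 2 degrees of freedom, then the two-tailed p-value
scalar t_xy = r_xy * sqrt((n_xy - 2) / (1 - r_xy^2))
scalar p_xy = 2 * ttail(n_xy - 2, abs(t_xy))

display "r = " %6.4f r_xy "   t = " %6.3f t_xy "   two-tailed p = " %6.4f p_xy
```

Distinctive scalar names are used because, inside expressions, Stata resolves a name as a variable (or variable abbreviation) before it looks for a scalar.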
Comparison example with computed values
The table below illustrates how sample size affects the p-value. The first two rows hold the coefficient at 0.50 and change only n; the third row shows that a smaller coefficient can still be significant in a larger sample. The values are computed from the standard t-test for Pearson correlation, not taken from a real dataset.
| Sample size (n) | Correlation (r) | Approximate t statistic | Two-tailed p-value | Interpretation |
|---|---|---|---|---|
| 10 | 0.50 | 1.633 | 0.141 | Moderate coefficient but not statistically significant at 0.05. |
| 30 | 0.50 | 3.055 | 0.005 | Same coefficient becomes statistically significant with more data. |
| 100 | 0.30 | 3.113 | 0.002 | Smaller coefficient can still be highly significant in larger samples. |
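The rows of the table can be checked in Stata with display and the ttail() function, which returns the upper-tail probability of the t distribution. This is a sketch of the arithmetic only; the results should agree with the table up to rounding.

```stata
* Two-tailed p-value: 2 * ttail(n - 2, t), where t = r * sqrt((n - 2) / (1 - r^2))

* n = 10, r = 0.50
display 2 * ttail(8, 0.50 * sqrt(8 / (1 - 0.50^2)))

* n = 30, r = 0.50
display 2 * ttail(28, 0.50 * sqrt(28 / (1 - 0.50^2)))

* n = 100, r = 0.30
display 2 * ttail(98, 0.30 * sqrt(98 / (1 - 0.30^2)))
```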
Why a scatterplot should always accompany correlation
A single coefficient can hide important structure. Two datasets can share the same correlation but have very different patterns. One may be truly linear, while another may have a curved trend, an influential outlier, or separate clusters. In Stata, visual checking is easy with a scatterplot:
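Using the worked example variables, and assuming they are already in memory, a basic scatterplot is:

```stata
* The y variable comes first, then the x variable
scatter exam_score study_hours
```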
You can also add a fitted line:
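One common way to overlay a linear fit, again assuming study_hours and exam_score are in memory:

```stata
* Scatterplot with a linear fit line overlaid
twoway (scatter exam_score study_hours) (lfit exam_score study_hours)
```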
The chart in the calculator above is designed to mimic that analytical habit: compute the statistic, then inspect the visual pattern.
Pearson vs Spearman in applied research
Choosing between Pearson and Spearman is often less about software and more about the data generating process. If your variables represent meaningful continuous quantities and the scatterplot is roughly linear, Pearson is generally appropriate. If one or both variables are ordinal, heavily skewed, or full of extreme values, Spearman may be preferable because it is based on ranks and therefore less sensitive to unusual observations.
- Use Pearson for approximately linear continuous relationships.
- Use Spearman for ordinal data or monotonic but non-linear patterns.
- Check for outliers before reporting either result.
- Always report sample size because significance depends heavily on n.
- Do not interpret correlation as causation.
How to report results in academic or professional writing
A clear reporting style might look like this: “There was a moderate positive correlation between study hours and exam score, r(28) = 0.50, p = 0.005.” The number in parentheses is the degrees of freedom, n − 2. If using Spearman, a common format is: “Study hours and exam rank were positively associated, Spearman’s rho = 0.61, p = 0.001.” Include confidence intervals when possible, especially in technical reporting, because they show estimation uncertainty rather than only a binary significant or not significant conclusion.
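One common way to obtain an approximate confidence interval for a Pearson correlation is the Fisher z transformation. The sketch below applies it by hand to the illustrative values from the reporting example (r = 0.50, n = 30); the scalar names are arbitrary, and this is one standard approximation rather than output from a specific Stata command.

```stata
* Approximate 95% CI for a Pearson correlation via Fisher's z
* Inputs are the illustrative values r = 0.50 and n = 30
scalar r_xy = 0.50
scalar n_xy = 30

scalar z_xy   = atanh(r_xy)          // Fisher z transform of r
scalar se_z   = 1 / sqrt(n_xy - 3)   // approximate standard error of z
scalar z_crit = invnormal(0.975)     // critical value for 95% coverage

* Back-transform the interval endpoints to the correlation scale
display "95% CI: [" %5.3f tanh(z_xy - z_crit * se_z) ", " %5.3f tanh(z_xy + z_crit * se_z) "]"
```

The trailing // comments work in do-files; when typing interactively, leave them off.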
Common mistakes when calculating correlation in Stata
- Using Pearson on clearly ordinal data when a rank-based method would be more appropriate.
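- Ignoring missing values. corr drops any observation with a missing value in any listed variable (casewise deletion), whereas pwcorr uses every observation available for each pair, so the two can give different results when more than two variables are listed. spearman behaves like corr unless you add its pw option.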
- Over-interpreting a small p-value as proof of a large effect.
- Failing to inspect a scatterplot, which can reveal curvature or outliers hidden by the coefficient.
- Confusing significance with causality. Correlation alone cannot establish a causal mechanism.
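The missing-value point in the list above is easy to check directly. A short sketch, with var1, var2, and var3 as placeholder variable names:

```stata
* How many observations have at least one of the variables missing?
count if missing(var1, var2, var3)

* corr uses only observations complete on all three variables
corr var1 var2 var3

* pwcorr uses all observations available for each pair; obs shows the counts
pwcorr var1 var2 var3, obs
```

If the pair counts from pwcorr differ from the single observation count reported by corr, the two commands are working from different subsets of the data.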
Exact command choices in common scenarios
If you want a straightforward Pearson matrix for several variables:
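For a runnable illustration, this sketch uses the auto example dataset that ships with Stata; substitute your own variables in practice.

```stata
* Load the example dataset shipped with Stata
sysuse auto, clear

* Pearson correlation matrix for three variables
corr price mpg weight
```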
If you need pairwise correlations with significance values:
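Again using the shipped auto dataset as a stand-in for your own data:

```stata
sysuse auto, clear

* p-values (sig), pair counts (obs), and a star on coefficients with p < 0.05
pwcorr price mpg weight, sig obs star(0.05)
```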
If your variables are ranks, ratings, or skewed values:
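In the auto dataset, rep78 is a 1 to 5 repair rating with a few missing values, which makes it a reasonable stand-in for ordinal data. A sketch:

```stata
sysuse auto, clear

* Spearman's rho, pair counts, and p-values, using pairwise deletion (pw)
spearman price mpg rep78, stats(rho obs p) pw
```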
Final takeaways
To calculate correlation between two variables and p-value in Stata, the fastest practical route is usually pwcorr var1 var2, sig for Pearson correlation or spearman var1 var2 for rank-based analysis. But the command is only the beginning. Good analysis requires matching the method to the data type, checking the scatterplot, examining outliers, interpreting the effect size, and treating the p-value as one piece of evidence rather than the whole story.
If you want a quick preview before opening Stata, use the calculator above to estimate the coefficient, p-value, and confidence interval, then run the corresponding Stata command to confirm and document your analysis.