Calculate Correlation Between Two Variables in Stata
Enter paired values for two variables, choose Pearson or Spearman correlation, and instantly preview the coefficient, interpretation, and scatter chart before using the equivalent Stata command.
How to calculate correlation between two variables in Stata
When researchers ask how to calculate correlation between two variables in Stata, they are usually trying to answer a simple but important question: do two measured variables move together, and if so, how strongly? Correlation is one of the most common tools in statistical analysis because it helps summarize the direction and strength of a relationship between two numeric variables. In Stata, the process is straightforward, but making good decisions about which command to run, how to interpret the output, and when not to trust the result requires more than typing a single line of code.
This guide explains exactly how to calculate correlation between two variables in Stata, when to use Pearson versus Spearman correlation, how to read the output, and how to avoid common mistakes. The calculator above gives you the coefficient from raw data so you can check your intuition before entering your data into Stata. Once you understand the logic, the software commands become much easier to use correctly.
What correlation measures
Correlation measures the association between two variables. The correlation coefficient always lies between -1 and 1:
- +1 means a perfect positive relationship. As one variable increases, the other increases along an exact straight line (Pearson) or in exact rank order (Spearman).
- 0 means no linear relationship.
- -1 means a perfect negative relationship. As one variable increases, the other decreases along an exact straight line (Pearson) or in exact rank order (Spearman).
In applied work, you rarely see perfect values. Instead, analysts often use rough guidelines for the absolute value of the coefficient:
- 0.00 to 0.19: very weak
- 0.20 to 0.39: weak
- 0.40 to 0.59: moderate
- 0.60 to 0.79: strong
- 0.80 to 1.00: very strong
These cutoffs are contextual, not universal. A correlation of 0.30 may be meaningful in social science but unimpressive in a laboratory setting. That is why interpretation should combine statistics, theory, sample size, and measurement quality.
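As a rough sketch, the guideline bands above can be applied to the magnitude of a coefficient that Stata has just stored. The example assumes Stata's bundled auto dataset, with price and mpg as stand-ins for x and y. The band labels are the rule-of-thumb cutoffs listed above, not a built-in Stata feature.

```stata
* Illustrative only: label a Pearson coefficient using the rough bands above
sysuse auto, clear
quietly correlate price mpg
local r = r(rho)          // coefficient for the first two variables listed
local a = abs(`r')        // the bands apply to the magnitude, not the sign

if `a' < 0.20 {
    local band "very weak"
}
else if `a' < 0.40 {
    local band "weak"
}
else if `a' < 0.60 {
    local band "moderate"
}
else if `a' < 0.80 {
    local band "strong"
}
else {
    local band "very strong"
}

display "r = " %5.2f `r' " (`band')"
```

The label is only a starting point; the sign, the sample size, and the research context still need to be reported alongside it.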
Stata commands for correlation
Stata offers several ways to estimate association between variables. The most common commands are:
- correlate for Pearson correlation coefficients.
- pwcorr for pairwise correlations, often with significance levels and observation counts.
- spearman for rank correlation when your data are ordinal or not well suited to Pearson assumptions.
If you have two continuous variables and the relationship is approximately linear, the typical starting point is:

    correlate x y

If you want p-values and pairwise handling of missing values, many analysts prefer:

    pwcorr x y, sig obs

If the variables are ranked, skewed, heavily influenced by outliers, or monotonic rather than linear, you can use:

    spearman x y

Pearson versus Spearman correlation
Choosing the right coefficient matters. Pearson correlation measures the strength of a linear relationship between two numeric variables. Spearman correlation measures the strength of a monotonic relationship using ranks instead of raw values. That makes Spearman more robust when the relationship is monotonic but not linear, or when a few extreme values would otherwise dominate the result.
| Feature | Pearson | Spearman |
|---|---|---|
| Best for | Continuous variables with linear association | Ordinal, skewed, or monotonic relationships |
| Uses raw values or ranks | Raw values | Ranks |
| Sensitive to outliers | High | Lower |
| Typical Stata command | correlate x y | spearman x y |
| Coefficient range | -1 to 1 | -1 to 1 |
If your variables are exam score and study hours, Pearson is often appropriate. If your variables are satisfaction ranking and service ranking, Spearman may be better. If you are unsure, a scatter plot should be part of your workflow. A quick visual check often reveals nonlinearity, curvature, clusters, or influential points that a single coefficient can hide.
Step by step: calculate correlation between two variables in Stata
1. Inspect your data first
Before running any command, review your variables:

    summarize x y
    graph twoway scatter y x

This confirms basic ranges, missingness, and whether the relationship looks roughly linear. If the scatter plot curves strongly, Pearson may understate or misrepresent the association.
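A minimal sketch of this inspection step, assuming Stata's bundled auto dataset, with mpg and weight standing in for y and x. The fitted line is an optional extra that makes departures from linearity easier to see.

```stata
* Assumes the auto dataset shipped with Stata; mpg and weight stand in for y and x
sysuse auto, clear
summarize mpg weight                            // ranges and non-missing counts
misstable summarize mpg weight                  // reports missing values, if any
twoway (scatter mpg weight) (lfit mpg weight)   // scatter plot with a linear fit overlaid
```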
2. Run Pearson correlation
For standard correlation between two continuous variables:

    correlate x y

Stata will report the correlation matrix. For two variables, the off-diagonal entry is the value you need. Suppose Stata returns r = 0.73. That means there is a strong positive linear relationship between x and y.
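If you need the coefficient as a number rather than reading it off the matrix, correlate leaves it in stored results. A short sketch, again assuming the bundled auto dataset with price and mpg as placeholder variables:

```stata
* Assumes the auto dataset shipped with Stata
sysuse auto, clear
correlate price mpg
return list                                    // lists r(N), r(rho), and other stored results
display "r = " %6.4f r(rho) ", n = " r(N)      // r(rho) refers to the first two variables
```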
3. Add significance testing if needed
If you want significance and observation counts, use:

    pwcorr x y, sig obs

This is useful in reports because you often need both the coefficient and the p-value. For example, a result of r = 0.73, p < 0.001, n = 120 indicates a strong association: a coefficient this large would be very unlikely to arise from sampling variation alone if the true correlation were zero.
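A small sketch of the same command on the bundled auto dataset (an assumption made for illustration). The star() option is an optional addition that flags coefficients significant at the chosen level.

```stata
* Assumes the auto dataset shipped with Stata
sysuse auto, clear
pwcorr price mpg, sig obs star(0.05)
```

In the output, each cell shows the coefficient first, with the p-value and the number of observations beneath it.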
4. Run Spearman correlation when assumptions are weak
If the data are ordinal, strongly non-normal, or better described by rank order than by equal intervals, use:

    spearman x y

A Spearman coefficient of rho = 0.68 means the variables tend to increase together in rank order, even if the spacing between measurements is irregular.
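As a sketch on the bundled auto dataset (assumed here for illustration), the stats() option asks spearman to print the coefficient, the number of observations, and the p-value together:

```stata
* Assumes the auto dataset shipped with Stata
sysuse auto, clear
spearman price mpg, stats(rho obs p)
```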
5. Report the result clearly
A good write-up includes the method, coefficient, sample size, and significance if available. For example:
- Pearson correlation showed a strong positive relationship between income and spending, r = 0.73, p < 0.001, n = 120.
- Spearman rank correlation indicated a moderate positive association between pain score and disability rank, rho = 0.48, p = 0.002, n = 42.
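One way to avoid transcription errors in a write-up is to build the sentence from stored results. The sketch below assumes the bundled auto dataset and uses the results that spearman stores when given two variables. The local macro names are illustrative.

```stata
* Assumes the auto dataset shipped with Stata; macro names are illustrative
sysuse auto, clear
quietly spearman price mpg
local rho : display %4.2f r(rho)     // formatted coefficient
local p   : display %5.3f r(p)       // formatted two-sided p-value
display "Spearman rank correlation: rho = `rho', p = `p', n = " r(N)
```

If the p-value prints as 0.000, report it as p < 0.001 rather than as zero.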
Worked example with realistic statistics
Imagine a health services researcher studying whether average clinic wait time is related to patient satisfaction scores across 40 clinics. After entering the data and running correlate wait satisfaction, the result might show a negative relationship, because longer waits are associated with lower satisfaction.
| Scenario | Method | Coefficient | Sample Size | Interpretation |
|---|---|---|---|---|
| Study hours vs exam score | Pearson | 0.81 | 85 | Very strong positive linear association |
| Wait time vs patient satisfaction | Pearson | -0.62 | 40 | Strong negative linear association |
| Income rank vs spending rank | Spearman | 0.69 | 120 | Strong positive monotonic association |
| Pain severity rank vs mobility limitation rank | Spearman | 0.47 | 56 | Moderate positive monotonic association |
These values are realistic examples you might encounter in applied research. Notice that coefficients close to 0.8 are substantial, while values around 0.4 to 0.5 indicate more moderate relationships. Sample size also matters. A correlation of 0.25 may be statistically significant in a large sample, while a correlation of 0.45 may fail to reach significance in a very small one.
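The interaction between coefficient size and sample size can be checked by hand. The sketch below uses the standard t statistic for testing a zero Pearson correlation, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The sample sizes of 200 and 15 are hypothetical values chosen for illustration, not figures from the table above.

```stata
* Hypothetical sample sizes, chosen only to illustrate the point
* Two-sided p-value for H0: correlation = 0, based on a t statistic with n-2 df

local r1 = 0.25
local n1 = 200
local t1 = `r1' * sqrt((`n1' - 2) / (1 - `r1'^2))
local p1 = 2 * ttail(`n1' - 2, abs(`t1'))
display "r = 0.25, n = 200: t = " %6.3f `t1' ", p = " %6.4f `p1'

local r2 = 0.45
local n2 = 15
local t2 = `r2' * sqrt((`n2' - 2) / (1 - `r2'^2))
local p2 = 2 * ttail(`n2' - 2, abs(`t2'))
display "r = 0.45, n = 15:  t = " %6.3f `t2' ", p = " %6.4f `p2'
```

With these inputs the first p-value falls well below 0.001, while the second lands between 0.05 and 0.10, so the smaller coefficient is the significant one.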
Common mistakes when calculating correlation in Stata
Confusing correlation with causation
This is the most common mistake. Correlation does not prove that one variable causes the other. Two variables can move together because of a third variable, reverse causation, or coincidence.
Ignoring nonlinearity
Pearson correlation captures linear association. If the relationship is curved, the coefficient may be close to zero even when the variables are strongly related. Always inspect a scatter plot.
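A constructed example makes the point concrete. In the sketch below, y is completely determined by x, yet the relationship is a symmetric U-shape, so the Pearson coefficient comes out at essentially zero. The data are generated purely for illustration.

```stata
* Constructed data: y depends perfectly on x, but not linearly
clear
set obs 21
generate x = _n - 11      // integers from -10 to 10
generate y = x^2          // symmetric U-shape around zero
correlate x y             // Pearson r is essentially zero
twoway scatter y x        // the plot reveals the curve the coefficient misses
```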
Using Pearson on ordinal data without thinking
Likert scale responses and rankings are often better analyzed with Spearman correlation, especially when the interval spacing is questionable.
Failing to handle missing values properly
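The difference between correlate and pwcorr can affect the sample used. correlate applies listwise deletion: it drops any observation with a missing value on any variable in the list. pwcorr uses every observation available for each pair. With only two variables the two approaches use the same observations, but once you add a third variable the samples can diverge. If missing values exist, confirm which rule Stata is applying and report that choice consistently.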
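The difference is easy to see on a dataset that has missing values. The sketch below assumes the bundled auto dataset, where rep78 is missing for a few cars while price and mpg are complete.

```stata
* Assumes the auto dataset shipped with Stata; rep78 has some missing values
sysuse auto, clear
correlate price mpg rep78              // listwise: only cars with all three variables non-missing
pwcorr price mpg rep78, obs            // pairwise: each pair uses its own available observations
pwcorr price mpg rep78, obs listwise   // makes pwcorr use the listwise sample instead
```

Comparing the observation counts in the first two outputs shows that the price and mpg correlation is based on more cars under pairwise handling than under listwise deletion.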
Letting outliers dominate the result
A single extreme point can dramatically change Pearson correlation. Plot the data and investigate unusual values before presenting the final coefficient.
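A small constructed dataset illustrates how much one point can matter. The values below are made up for illustration, with the last pair playing the role of the extreme observation.

```stata
* Made-up data: five ordinary points plus one extreme point
clear
input x y
1 2
2 1
3 4
4 3
5 5
50 60
end

correlate x y               // Pearson with the extreme point included
correlate x y if x < 50     // Pearson without it
spearman x y                // rank-based coefficient, less affected by the extreme values
```

With the extreme point included, the Pearson coefficient is pulled much closer to 1 than the five ordinary points alone would suggest.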
How the calculator above connects to Stata
The calculator on this page computes the same core idea you would estimate in Stata. If you choose Pearson, it calculates the covariance between x and y divided by the product of their standard deviations. If you choose Spearman, it converts the raw values to ranks and then computes the Pearson correlation of those ranks.
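The rank-based definition can be checked directly in Stata. The sketch below assumes the bundled auto dataset and uses illustrative variable names for the ranks. It ranks each variable, runs a Pearson correlation on the ranks, and compares the result with spearman.

```stata
* Assumes the auto dataset shipped with Stata; rank_price and rank_mpg are illustrative names
sysuse auto, clear
egen rank_price = rank(price)    // tied values receive the average rank by default
egen rank_mpg   = rank(mpg)
correlate rank_price rank_mpg    // Pearson correlation of the ranks
spearman price mpg               // should report the same coefficient
```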
After testing values here, you can move to Stata with confidence. If the calculator returns a Pearson correlation around 0.88, you should expect a very similar coefficient in Stata from correlate x y, assuming your imported data match and your handling of missing values is consistent.
Practical interpretation tips
- Report both the sign and magnitude of the coefficient.
- State the method used: Pearson or Spearman.
- Include sample size and p-value when relevant.
- Use a scatter plot to support your interpretation.
- Discuss substantive meaning, not just statistical significance.
For example, saying “there is a statistically significant positive correlation” is incomplete. A stronger statement is: “Study hours were strongly positively correlated with exam scores, Pearson r = 0.81, p < 0.001, suggesting students who studied more tended to score higher.”
Recommended authoritative resources
If you want deeper reference material on correlation, hypothesis testing, and Stata-style interpretation, these sources are useful:
- UCLA Statistical Methods and Data Analytics: Stata resources
- NIST Engineering Statistics Handbook
- Penn State Eberly College of Science: statistics lessons
Final takeaway
To calculate correlation between two variables in Stata, start by identifying the variable type and the shape of the relationship. Use correlate or pwcorr for Pearson correlation when the relationship is linear and the variables are continuous. Use spearman when the data are ordinal, skewed, or more meaningfully interpreted as ranks. Then read the sign, size, and significance of the coefficient in context, and always support the number with a graph and subject-matter reasoning.
Used carefully, correlation is one of the fastest ways to uncover meaningful structure in your data. Used carelessly, it can oversimplify or mislead. The best analysts compute the coefficient and then interrogate what it really means.