How to Calculate a Correlation Between Two Variables in R
What correlation means in R and why it matters
When analysts ask how to calculate a correlation between two variables in R, they are usually trying to answer a very practical question: do two measured quantities tend to move together? Correlation summarizes the degree and direction of association between variables such as study time and exam score, temperature and electricity demand, or blood pressure and age. In R, this is straightforward because the language includes built-in tools for both simple correlation coefficients and formal significance tests.
The most common starting point is the cor() function. For hypothesis testing and confidence intervals, many users then move to cor.test(). The core idea is simple: you provide two numeric vectors of equal length, choose a method such as Pearson, Spearman, or Kendall, and R returns the coefficient. That coefficient always lies between -1 and 1. Values near 1 indicate a strong positive relationship, values near -1 indicate a strong negative relationship, and values near 0 suggest little or no linear relationship (Pearson) or monotonic relationship (Spearman and Kendall).
It is important to remember that correlation does not prove causation. A high correlation between two variables may arise because one affects the other, because both are influenced by a third factor, or because the pattern is partly driven by chance or outliers. Good statistical practice requires you to examine the raw data, produce a scatter plot, and consider the measurement context before drawing conclusions.
The three main correlation methods in R
Pearson correlation
Pearson correlation is the default method of cor() and the usual starting point in many analyses. It measures the strength of a linear relationship between two numeric variables. If the association is approximately linear and the data do not contain severe outliers, Pearson is often appropriate. In R, the typical code is cor(x, y, method = "pearson").
Spearman correlation
Spearman correlation is rank-based. Instead of using the raw values directly, it converts them to ranks and measures how consistently the rankings move together. This makes it useful when the relationship is monotonic but not necessarily linear, or when the data contain outliers or are ordinal in nature. In R, use cor(x, y, method = "spearman").
Kendall correlation
Kendall’s tau is another rank-based measure. It is especially useful with smaller samples or when you want an interpretable concordance-based measure. It often produces smaller absolute values than Spearman for the same data, but many statisticians appreciate its robustness and theoretical properties. In R, the command is cor(x, y, method = "kendall").
| Method | What it measures | Best use case | Typical R syntax |
|---|---|---|---|
| Pearson | Linear association between continuous variables | Approximately linear relationship with limited outlier influence | cor(x, y, method = "pearson") |
| Spearman | Monotonic association using ranks | Ordinal data, non-normal data, outliers, non-linear monotonic trends | cor(x, y, method = "spearman") |
| Kendall | Rank concordance and discordance | Small samples, tied ranks, robust nonparametric comparison | cor(x, y, method = "kendall") |
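To see how the three coefficients in the table behave on the same data, the sketch below computes each one in turn. The vectors are made-up illustrative values, not real measurements.

```r
# Made-up data: y mostly increases with x, with three neighbouring pairs swapped
x <- 1:8
y <- c(3, 1, 4, 6, 5, 8, 7, 10)

# Compute the coefficient once per method; sapply() keeps the method names
methods <- c("pearson", "spearman", "kendall")
coefs <- sapply(methods, function(m) cor(x, y, method = m))

print(round(coefs, 3))
```

For these particular values, Kendall's tau (22/28, about 0.79) comes out smaller than Spearman's rho (about 0.93), which matches the general tendency described above.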
Step-by-step: how to calculate a correlation between two variables in R
- Create or import your data. Your two variables must have the same number of observations. In R, they may be vectors or columns inside a data frame.
- Inspect the data. Check for missing values, impossible values, and outliers. Plotting the data first is one of the smartest habits in statistics.
- Choose the method. Use Pearson for linear relationships, Spearman for ranked monotonic patterns, and Kendall when you want a rank concordance measure.
- Run cor(). This returns the correlation coefficient.
- Optionally run cor.test(). This provides a p-value and, for Pearson, a confidence interval.
- Interpret the sign and magnitude. Positive means both variables tend to increase together, negative means one tends to decrease as the other increases, and values closer to the extremes indicate stronger associations.
Basic R example with vectors
Suppose you collected weekly study hours and exam scores for several students. In R, you could write:
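    # Weekly study hours and exam scores for seven students
    x <- c(2, 4, 5, 6, 8, 9, 10)
    y <- c(55, 62, 68, 71, 82, 88, 91)
    # Coefficient only
    cor(x, y, method = "pearson")
    # Coefficient plus test statistic, p-value, and confidence interval
    cor.test(x, y, method = "pearson")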
This returns the Pearson correlation coefficient and a formal statistical test. If you believe the pattern is monotonic but not strictly linear, swap the method to Spearman.
Basic R example with a data frame
Many real datasets are stored as tables. Assume a data frame named df with columns hours and score. You can write:
    # Coefficient from two columns of the data frame
    cor(df$hours, df$score, method = "pearson")
    # Formal test on the same two columns
    cor.test(df$hours, df$score, method = "pearson")
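As a self-contained variant of the data frame calls above, the sketch below builds a small df first. The data frame contents are an assumption for illustration: they reuse the study-hours values from the vector example. It also shows two alternatives to the `$` syntax, with() and the formula interface of cor.test().

```r
# Hypothetical data frame reusing the study-hours example values
df <- data.frame(
  hours = c(2, 4, 5, 6, 8, 9, 10),
  score = c(55, 62, 68, 71, 82, 88, 91)
)

# with() evaluates the expression inside df, so columns can be named directly
r <- with(df, cor(hours, score, method = "pearson"))

# The formula interface of cor.test() puts both variables on the right-hand side
test <- cor.test(~ hours + score, data = df, method = "pearson")

print(r)
print(test$estimate)
```

Both calls describe the same pair of columns, so the coefficient from cor() and the estimate stored in the cor.test() result should agree.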
How to handle missing values correctly
One of the most common reasons beginners get confused in R is missing data. If either variable contains NA values, cor() returns NA by default unless you specify what to do. The safest approach is usually to restrict the calculation to complete pairs. In R, that often looks like this:
    cor(x, y, use = "complete.obs", method = "pearson")
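You can also clean the data first using indexing or, when both variables sit in one data frame, a function such as na.omit(), which drops every row that has a missing value. Avoid calling na.omit() on each vector separately, because that can leave vectors of different lengths or shift the pairing. The key principle is that both variables must refer to the same matched observations. If record 5 is missing in one variable, you should not silently pair it with record 6 from the other.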
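The difference between the default behaviour and pairwise-complete handling can be sketched as follows. The vectors are illustrative values with NA inserted at different positions; complete.cases() is used here as one way to keep the pairing explicit.

```r
# Illustrative vectors with missing values in different positions
x <- c(2, 4, NA, 6, 8, 9, 10)
y <- c(55, 62, 68, NA, 82, 88, 91)

# Default behaviour: any NA propagates, so the result is NA
r_default <- cor(x, y)

# Option 1: let cor() drop incomplete pairs
r_use <- cor(x, y, use = "complete.obs")

# Option 2: filter explicitly so both vectors keep the same matched rows
keep <- complete.cases(x, y)
r_manual <- cor(x[keep], y[keep])

print(r_default)                    # NA
print(sum(keep))                    # number of complete pairs used
print(all.equal(r_use, r_manual))   # both options use the same pairs
```

Filtering both vectors with the same logical index guarantees that record i in x is still paired with record i in y.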
Interpreting correlation size in practice
Analysts often want a quick verbal description of the coefficient. Although there is no universal rule that fits every field, the rough guidelines below are commonly used in introductory work. Context matters. In medicine, education, psychology, economics, and engineering, what counts as large can differ substantially.
| Absolute correlation | Common interpretation | Example scenario | Practical takeaway |
|---|---|---|---|
| 0.00 to 0.19 | Very weak | Day-to-day caffeine intake and typing speed in a mixed office sample | Usually limited predictive value on its own |
| 0.20 to 0.39 | Weak | Neighborhood walkability and weekly exercise minutes | May matter when combined with other variables |
| 0.40 to 0.59 | Moderate | Study hours and exam performance | Often substantively meaningful |
| 0.60 to 0.79 | Strong | Height measured by two calibrated methods | Relationship is clear and usually visible in a plot |
| 0.80 to 1.00 | Very strong | Daily temperature in Celsius and Fahrenheit | Variables move together very closely |
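If you want these labels attached to coefficients automatically, one possible sketch is a small helper built on cut(). The function name describe_strength is hypothetical, and the cut points simply mirror the rough guideline table above rather than any universal standard.

```r
# Hypothetical helper: maps |r| to the verbal labels in the guideline table
describe_strength <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) <= 1, na.rm = TRUE))
  cut(
    abs(r),
    breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
    labels = c("very weak", "weak", "moderate", "strong", "very strong"),
    right = FALSE,          # intervals are [0, 0.2), [0.2, 0.4), ...
    include.lowest = TRUE   # with right = FALSE this closes the last interval at 1
  )
}

labels <- as.character(describe_strength(c(0.58, -0.74, 0.05)))
print(labels)
```

The sign is dropped before labelling, so -0.74 and 0.74 receive the same description; direction should be reported separately.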
Real statistics examples
Realistic, illustrative numbers help make the interpretation concrete. Consider height and weight among adults. In many population samples, height and weight show a positive correlation, but not a perfect one, because body composition, sex, age, and lifestyle create variability. In educational research, study time and test scores often show a moderate positive correlation, though motivation, prior knowledge, and test design affect the strength. In public health, age and systolic blood pressure may have a positive relationship in broad adult samples, yet medication use and health status complicate the pattern.
| Variable pair | Illustrative sample size | Illustrative correlation | Interpretation |
|---|---|---|---|
| Study hours vs exam score | 120 students | r = 0.58 | Moderate positive linear association |
| Age vs systolic blood pressure | 250 adults | r = 0.41 | Moderate positive relationship with substantial variability |
| Outdoor temperature vs home heating demand | 365 days | r = -0.74 | Strong negative relationship |
| Ranked customer satisfaction vs repeat purchase rank | 90 customers | Spearman rho = 0.63 | Strong monotonic association |
Pearson versus Spearman versus Kendall
A major decision in R is selecting the right method. Pearson assumes you care about linearity in the raw values. If a scatter plot forms a roughly straight-line cloud, Pearson is usually a natural fit. Spearman and Kendall are more flexible when the pattern is monotonic but curved, or when outliers make Pearson unstable. For example, if income rises rapidly with experience early in a career and then levels off, a rank-based correlation may better describe the pattern than Pearson.
Another distinction is interpretability. Pearson is easy to connect to linear regression and variance explained. Spearman is intuitive when you care about rank order. Kendall is preferred in some statistical traditions because it has a clean concordance interpretation based on counts of concordant and discordant pairs. There is no single best method in every scenario. The correct method depends on your scale of measurement, research question, sample size, and data quality.
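The income-and-experience scenario can be sketched with a made-up saturating curve. The formula and its constants are assumptions chosen only to produce a pattern that rises quickly and then levels off.

```r
# Made-up saturating curve: income rises quickly at first, then levels off
experience <- 0:20
income <- 30 + 50 * (1 - exp(-experience / 4))  # illustrative units

pearson  <- cor(experience, income, method = "pearson")
spearman <- cor(experience, income, method = "spearman")
kendall  <- cor(experience, income, method = "kendall")

print(round(c(pearson = pearson, spearman = spearman, kendall = kendall), 3))
```

Because income never decreases as experience grows, both rank-based coefficients equal 1, while Pearson falls below 1 because the points do not lie on a straight line.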
How to test significance with cor.test()
The cor() function gives the coefficient, but many analysts also want a p-value. That is where cor.test() becomes useful. Example:
    cor.test(x, y, method = "pearson")
This function typically returns:
- The estimated correlation coefficient
- A test statistic
- A p-value
- A confidence interval for Pearson correlation
A very small p-value means that a correlation at least as extreme as the one observed would be unlikely if the true correlation were zero, so random sampling variation alone is an implausible explanation. Still, statistical significance is not the same as practical importance. Large datasets can make tiny correlations statistically significant, while small datasets may fail to detect meaningful effects.
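The object returned by cor.test() is a list, so the individual pieces can be pulled out by name instead of being read off the printed output. A minimal sketch, reusing the study-hours vectors from the earlier example:

```r
x <- c(2, 4, 5, 6, 8, 9, 10)
y <- c(55, 62, 68, 71, 82, 88, 91)

result <- cor.test(x, y, method = "pearson")

print(result$estimate)    # sample correlation coefficient
print(result$statistic)   # t statistic
print(result$parameter)   # degrees of freedom (n - 2 for Pearson)
print(result$p.value)     # p-value for the null hypothesis of zero correlation
print(result$conf.int)    # confidence interval (95% by default, Pearson only)
```

Extracting values this way is convenient when results need to go into a table or a report rather than the console.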
Recommended workflow in R
- Plot your data with plot(x, y).
- Look for curvature, clusters, and outliers.
- Choose Pearson, Spearman, or Kendall based on the pattern and measurement scale.
- Use cor() to compute the coefficient.
- Use cor.test() for inferential output.
- Report the coefficient, sample size, method, and interpretation.
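Put together, the workflow above might look like the sketch below. The data are the illustrative study-hours values, the choice of Pearson is assumed rather than derived from the plot, and summary_row is a hypothetical name for the collected results.

```r
# Illustrative data reused from the study-hours example
x <- c(2, 4, 5, 6, 8, 9, 10)
y <- c(55, 62, 68, 71, 82, 88, 91)

# Steps 1-2: plot and inspect (only in interactive sessions, so scripts open no device)
if (interactive()) plot(x, y, xlab = "Study hours", ylab = "Exam score")

# Step 3: pick a method after looking at the plot; Pearson is assumed here
method <- "pearson"

# Step 4: coefficient
r <- cor(x, y, method = method)

# Step 5: inferential output
test <- cor.test(x, y, method = method)

# Step 6: collect what a report needs
summary_row <- data.frame(
  method  = method,
  n       = length(x),
  r       = round(r, 3),
  p_value = signif(test$p.value, 3)
)
print(summary_row)
```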
Common mistakes to avoid
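- Using unequal vector lengths. cor() stops with an error when the two vectors differ in length, because every observation in one variable must be matched to one in the other.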
- Ignoring missing values. Handle NA values explicitly.
- Choosing Pearson automatically. A rank-based method may be better for non-linear monotonic patterns.
- Failing to inspect a plot. The same correlation can come from very different shapes of data.
- Assuming causation. Correlation identifies association, not causal proof.
- Letting outliers dominate the result. A single extreme point can change Pearson substantially.
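The point about plots can be illustrated with Anscombe's quartet, which ships with R as the built-in anscombe data frame (columns x1 to x4 and y1 to y4). The sketch below computes the Pearson coefficient for each of the four pairs.

```r
# anscombe is part of R's datasets package: four x/y pairs (x1..x4, y1..y4)
data(anscombe)

r_values <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(r_values) <- paste0("set", 1:4)

# The four Pearson coefficients are nearly identical
print(round(r_values, 3))

# Plotting (interactive sessions only) reveals how different the shapes are
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  for (i in 1:4) {
    plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
         xlab = paste0("x", i), ylab = paste0("y", i))
  }
  par(op)
}
```

All four coefficients are roughly 0.82, yet the scatter plots show a linear cloud, a smooth curve, a line with one outlier, and a vertical cluster with a single high-leverage point.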
How to report correlation in a professional way
A clear report typically names the method, the coefficient, sample size, and the substantive interpretation. For example: “A Pearson correlation showed a moderate positive association between weekly study hours and exam score, r(118) = 0.58, p < .001.” Here 118 is the degrees of freedom, n - 2 for a sample of 120. If you use Spearman, report it as rho, and if you use Kendall, report tau. Mention any data cleaning decisions, such as using complete cases only.
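A report string in this style can be assembled from a cor.test() result. The helper name format_pearson is hypothetical, and the rounding and p-value conventions are one common choice rather than a fixed rule.

```r
# Hypothetical helper that formats a Pearson result as "r(df) = value, p ..."
format_pearson <- function(x, y) {
  test <- cor.test(x, y, method = "pearson")
  p <- test$p.value
  # Report very small p-values as a bound instead of a long decimal
  p_text <- if (p < 0.001) "p < .001" else sprintf("p = %.3f", p)
  sprintf("r(%d) = %.2f, %s",
          as.integer(test$parameter),   # degrees of freedom, n - 2
          unname(test$estimate),        # correlation coefficient
          p_text)
}

x <- c(2, 4, 5, 6, 8, 9, 10)
y <- c(55, 62, 68, 71, 82, 88, 91)
report <- format_pearson(x, y)
print(report)
```

For Spearman or Kendall you would swap the method and the leading symbol (rho or tau); note that those tests do not carry the same degrees-of-freedom parameter, so the r(df) layout applies to Pearson only.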
Final takeaway
If you want to know how to calculate a correlation between two variables in R, the practical answer is this: organize your paired data, inspect it visually, choose the appropriate method, and compute the coefficient with cor(). If you need significance testing, confidence intervals, or a formal hypothesis test, use cor.test(). Pearson is the default for linear numeric relationships, while Spearman and Kendall are strong alternatives for ranked, non-normal, or monotonic data. When used carefully, correlation is one of the fastest and most informative tools for exploring relationships in data analysis.