Calculate The Median Of A Variable In Stata

How to calculate the median of a variable in Stata

The median is one of the most important descriptive statistics in data analysis because it identifies the middle value of an ordered distribution. When you calculate the median of a variable in Stata, you are finding the point where half of the observations fall below and half fall above. This is especially useful when a variable is skewed, includes outliers, or represents income, house prices, medical spending, wages, time-to-event measures, or any other measure where extreme observations can distort the mean.

In practical Stata work, there is more than one way to get the median. The best command depends on what you need. If you want a quick summary for one variable, summarize, detail is often the fastest option. If you need the median by group, egen or collapse may be better. If you are producing tables for reports, tabstat can be more flexible. Understanding these differences helps you choose the right tool and avoid inefficient workflows.

What the median tells you

The median is robust. That means it is far less sensitive to extreme values than the mean. Suppose you are analyzing household income. If most households earn between $45,000 and $90,000 but a few observations exceed $1 million, the mean can rise sharply, giving the impression that the typical household earns more than it actually does. The median remains focused on the midpoint of the distribution, which is why economists, public health researchers, and policy analysts rely on it so often.

  • The median is ideal for skewed data.
  • It is less affected by outliers than the average.
  • It works well for income, rent, age, healthcare costs, and waiting times.
  • It is easy to explain to non-technical audiences.
  • It is often the preferred measure of central tendency in official statistics.
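
The household income scenario above can be sketched in a few lines of Stata. The six values below are hypothetical incomes in thousands of dollars, chosen only to mimic the pattern described (most households between 45 and 90, one above 1,000); they are not real data.

```stata
* Hypothetical incomes in thousands of dollars (illustrative values only)
clear
input income
45
52
60
75
90
1200
end

* summarize, detail leaves the mean in r(mean) and the median in r(p50)
quietly summarize income, detail
display "Mean:   " %9.1f r(mean)
display "Median: " %9.1f r(p50)
```

With these values the single large observation pulls the mean above 250, while the median stays at 67.5, the average of the two middle values 60 and 75.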

Quick Stata commands to get the median

Here are the most common ways to calculate the median of a variable in Stata. Assume your variable is called income.

summarize income, detail
tabstat income, stat(median p25 p75 mean n)
egen med_income = median(income)
bysort region: egen region_med_income = median(income)
collapse (median) income, by(region)

Each command serves a different use case:

  1. summarize income, detail shows the median in the percentiles column, on the row labeled 50%, and stores it in r(p50).
  2. tabstat is convenient when you want a cleaner summary table with multiple statistics.
  3. egen with median() creates a new variable that repeats the overall median on every observation, or the group median when combined with bysort.
  4. collapse reduces the dataset to grouped medians, which is useful for final reporting files.

Best starting point: summarize, detail

If your goal is simply to inspect the median for a single variable, the standard approach is:

summarize income, detail

Stata will display summary statistics including the mean, standard deviation, and percentiles. The median is the 50th percentile: it appears on the row labeled 50% in the output and is stored as r(p50). This method is fast and requires no data transformation. It is ideal during exploratory data analysis when you are trying to understand a variable before modeling it.

However, this command does not create a new variable (it only leaves the median in the stored result r(p50)), and it is not always the most efficient way to automate group-based analysis. If your project involves repeated workflow steps, a command that generates the median as a variable may be more useful.
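
As a minimal sketch of using the stored result, the loop below prints the median of several variables without reading the full detail output each time. It assumes the auto example dataset that ships with Stata, so the variable names price, mpg, and weight come from that file rather than from this article.

```stata
* Assumes the auto example dataset shipped with Stata
sysuse auto, clear

* Loop over variables and pull the median from the stored result r(p50)
foreach v of varlist price mpg weight {
    quietly summarize `v', detail
    display "`v': median = " r(p50)
}
```

The quietly prefix suppresses the summary table, so only the display lines appear in the results window.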

Creating a median variable with egen

If you want the median available inside your dataset, use egen:

egen med_income = median(income)

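This creates a new variable called med_income with the same median repeated for every observation. That may seem redundant at first, but it becomes very useful in programming, graphing, and conditional comparisons. For example, you can create an indicator for whether each observation is above the median. Because Stata treats missing values as larger than any number, restrict the comparison to nonmissing values so that missing incomes are not flagged as above the median:

gen above_median = income > med_income if !missing(income)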

You can also calculate medians within groups:

bysort region: egen region_med_income = median(income)

Now each observation receives the median income for its region. This is a common workflow in labor economics, epidemiology, and education research where analysts compare observations relative to their local or institutional distribution.
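
The group workflow can be tried on a small example. The six observations below are hypothetical, and above_region_med is an illustrative variable name, not one used elsewhere in this article.

```stata
* Hypothetical data: two regions, one missing income
clear
input str5 region income
"North" 40
"North" 50
"North" 90
"South" 20
"South" 30
"South" .
end

* Median income within each region (missing incomes are ignored)
bysort region: egen region_med_income = median(income)

* Flag observations above their own region's median, skipping missing incomes
generate byte above_region_med = income > region_med_income if !missing(income)

list, sepby(region)
```

Here the North median is 50 and the South median is 25, the average of 20 and 30. The if !missing(income) qualifier leaves the flag missing for the observation with no income instead of coding it as 1.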

Using tabstat for cleaner reporting

When you want a more polished descriptive table, tabstat is excellent:

tabstat income, stat(n mean median p25 p75 sd)

This command is convenient because it shows the median directly instead of requiring you to interpret percentile output. It also lets you report several statistics at once. If you need medians by subgroup, add the by() option:

tabstat income, stat(n median p25 p75) by(region)

This is often one of the best choices for publication-ready exploratory summaries, especially when you need to compare multiple groups.

Grouped medians with collapse

Suppose you need a compact dataset containing one row per state, school, hospital, or year, with the median of a variable for each group. In that case:

collapse (median) income, by(region)

This replaces the current dataset with a grouped summary dataset. That is powerful, but it is also destructive because the original data structure is lost unless you save a copy first. A common safe workflow is:

preserve
collapse (median) income, by(region)
list
restore

This pattern allows you to inspect grouped medians without permanently changing your working data.
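
The same pattern extends to several statistics at once. The sketch below uses hypothetical data, and the target names med, p25, p75, and n are illustrative choices rather than required names.

```stata
* Hypothetical data: seven observations in two regions
clear
input str5 region income
"North" 40
"North" 50
"North" 90
"South" 20
"South" 30
"South" 60
"South" 80
end

preserve

* One row per region with the median, quartiles, and nonmissing count
collapse (median) med=income (p25) p25=income (p75) p75=income (count) n=income, by(region)
list, noobs

restore

* The original seven observations are back in memory
count
```

Naming each target with the target=source form lets one collapse call produce several summaries of the same variable; without distinct names the statistics would collide on the name income.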

How Stata handles missing values

Stata excludes system missing values (.) and extended missing values (.a through .z) from summary statistics, including the median. That is helpful, but you still need to check whether your data contain coded missing values such as 999, 9999, or 999999, or negative placeholders like -1. Those are ordinary numbers to Stata and are not treated as missing unless you recode them.

replace income = . if income == 999999
replace income = . if income < 0

Before computing the median, always check data quality. A mistaken placeholder can dramatically shift the result, especially in smaller samples.
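
To see how much a placeholder can move the result, the sketch below computes the median before and after recoding. The values are hypothetical, and mvdecode is used here as an alternative to the replace lines above for converting a code to missing.

```stata
* Hypothetical data: five real values plus two 999999 placeholders
clear
input income
24
26
31
35
42
999999
999999
end

* Median with the placeholders still counted as real values
quietly summarize income, detail
display "Median before recoding: " r(p50)

* Convert the placeholder to system missing, then recompute
mvdecode income, mv(999999)
quietly summarize income, detail
display "Median after recoding:  " r(p50) "  (N = " r(N) ")"
```

With seven values the median is the fourth ordered value, 35. After recoding only five values remain, and the median falls to 31.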

Tip: If your median looks suspiciously high or low, inspect the sorted data with sort income and list income in 1/20 or list income in -20/l to review the tails of the distribution.

Worked example with odd and even sample sizes

The mathematics behind the median are simple but important. Stata follows the standard rule:

  • If the number of observations is odd, the median is the middle ordered value.
  • If the number of observations is even, the median is the average of the two middle ordered values.

Example with 9 ordered values:

24 26 31 35 35 42 48 50 71

The fifth value is 35, so the median is 35.

Example with 8 ordered values:

24 26 31 35 42 48 50 71

The two middle values are 35 and 42, so the median is 38.5.
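
Both examples can be checked directly in Stata. This is a small sketch using the values listed above; dropping observation 5 removes the duplicate 35 and turns the nine-value example into the eight-value one.

```stata
* The nine ordered values from the first example
clear
input income
24
26
31
35
35
42
48
50
71
end

* Odd count: the median is the fifth ordered value
quietly summarize income, detail
display "Median with 9 values: " r(p50)

* Drop the duplicate 35 (observation 5) to get the eight-value example
drop in 5

* Even count: the median is the average of the fourth and fifth values
quietly summarize income, detail
display "Median with 8 values: " r(p50)
```

The two display lines report 35 and 38.5, matching the hand calculation.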

Comparison table of Stata methods

| Method | Example command | Best use case | Returns or creates |
| --- | --- | --- | --- |
| Quick inspection | summarize income, detail | Single variable exploration | Median shown as p50 in output |
| Reporting table | tabstat income, stat(median p25 p75) | Readable descriptive summaries | Compact table in results window |
| Variable creation | egen med_income = median(income) | Comparisons, flags, programmatic use | New variable in dataset |
| By group summary | bysort region: egen region_med_income = median(income) | Regional or subgroup medians | Group specific median variable |
| Collapsed dataset | collapse (median) income, by(region) | Final grouped summary dataset | Reduced dataset with one row per group |

Why median matters in real world statistics

The median is not just a textbook concept. It appears constantly in official releases and public data products because it often describes a typical case better than the mean. For example, the U.S. Census Bureau regularly reports median household income because income distributions are strongly right skewed. Similarly, real estate reports often emphasize median home value or median rent because luxury observations can pull the average upward.

| Statistic | Recent real world figure | Why median is used | Source type |
| --- | --- | --- | --- |
| U.S. median household income | $80,610 in 2023 | Income is highly skewed, so the midpoint is more representative than the mean | U.S. Census Bureau |
| U.S. median age | About 39 years in recent population summaries | The median identifies the age at the center of the population distribution | U.S. Census Bureau |
| Median usual weekly earnings | Frequently reported by the BLS in quarterly earnings releases | Earnings distributions contain high end outliers that can distort averages | U.S. Bureau of Labor Statistics |

Those examples show why analysts often choose the median when communicating results to decision makers. It describes the middle of the data in a way that is both intuitive and resilient to extreme observations.

Common mistakes when calculating the median in Stata

  1. Reporting the mean when the median was intended. Plain summarize shows the mean but not the median; add the detail option to see the 50th percentile.
  2. Forgetting to clean coded missing values. Values like 99999 are not automatically excluded.
  3. Confusing percentiles with the median. The median is the 50th percentile, not the 25th or 75th.
  4. Collapsing data too early. If you run collapse without preserving the dataset, you may lose your original structure.
  5. Ignoring grouping requirements. If the research question asks for medians by state, school, or year, a single overall median is not enough.

When to combine the median with other statistics

The median is powerful, but it should rarely stand alone. A stronger descriptive profile usually includes the interquartile range, minimum and maximum, or the 25th and 75th percentiles. That gives readers a sense of spread as well as center. In Stata, this is easy with tabstat or summarize, detail. For example:

tabstat income, stat(n median p25 p75 min max)

If the median and mean are far apart, that is a strong hint that the distribution is skewed. In that case, report both and explain the difference. If the median is close to the mean, the data may be relatively symmetric.

Recommended workflow for analysts

  1. Inspect the variable and identify possible invalid codes.
  2. Recode special values to missing if needed.
  3. Use summarize, detail for a quick check.
  4. Use tabstat if you need a cleaner summary table.
  5. Use egen when the median must exist as a variable.
  6. Use collapse only when you intentionally want a reduced summary dataset.
  7. Pair the median with quartiles for stronger interpretation.

Final takeaway

If you want to calculate the median of a variable in Stata, the simplest command is usually summarize variable, detail. If you need a reusable variable, use egen. If you need grouped medians in a final summary dataset, use collapse. The right command depends on whether your priority is exploration, reporting, or data transformation. In all cases, clean your data first, check missing values carefully, and consider reporting quartiles alongside the median for a more complete picture.
