How to Calculate Distribution of a Variable in Stata
Paste your numeric values, choose whether you want a frequency distribution or grouped histogram-style distribution, and generate an instant summary with frequencies, percentages, cumulative percentages, and a chart. The guide below also shows the exact Stata commands professionals use.
- Calculates sample size, mean, median, standard deviation, minimum, and maximum
- Builds either an exact value frequency table or grouped bins
- Displays percentages and cumulative percentages for quick interpretation
- Includes practical Stata syntax for tabulate, histogram, summarize, pctile, and detail output
Expert Guide: How to Calculate Distribution of a Variable in Stata
Understanding the distribution of a variable is one of the first and most important steps in statistical analysis. In Stata, calculating a distribution can mean several related tasks: building a frequency table, measuring central tendency and spread, visualizing the pattern with a histogram, and checking whether the variable is approximately symmetric, skewed, or concentrated in a few categories. If you skip this step, you risk applying methods that do not match the data. If you do it well, you immediately learn whether a variable is continuous or discrete, whether outliers may influence your model, and whether transformations or nonparametric methods may be appropriate.
When people ask how to calculate the distribution of a variable in Stata, they usually mean one of two things. First, they may want a frequency distribution for a categorical or discrete variable, such as age group, education level, or the number of hospital visits. Second, they may want a numeric distribution summary for a continuous variable, such as income, test scores, blood pressure, or time to event. Stata supports both cases with straightforward commands, and the best analysts usually combine numeric summaries with a visual plot.
What “distribution” means in Stata
A variable’s distribution describes how values are spread across observations. In practical terms, you want answers to questions like these:
- How many observations do you have?
- Which values occur most often?
- What share of the sample falls in each category or interval?
- Where is the center of the data: mean or median?
- How dispersed are values around the center?
- Are there outliers, extreme skewness, or heavy tails?
In Stata, the command you choose depends on the variable type. For categorical or integer-valued variables, tabulate is often the fastest route. For continuous variables, summarize (with or without its detail option) and histogram are usually more informative. In more advanced work, you may also use kdensity, pctile, or normality tests such as sktest and swilk.
Core Stata commands for calculating a distribution
1. Frequency distribution with tabulate
If your variable is categorical or has a manageable number of unique values, use:
tabulate varname
This produces counts, percentages, and cumulative percentages. For example, if your variable is education_level, Stata shows the number of records in each education category. If your variable is discrete, such as the number of children in a household, tabulate gives an exact value distribution.
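As a minimal sketch, the commands below run tabulate on Stata's bundled auto dataset. The dataset and the variable rep78 (a 1 to 5 repair rating) are stand-ins chosen for illustration, not part of the original example.

```stata
* Assumption: the bundled auto dataset stands in for your own data
sysuse auto, clear

* One-way frequency table: Freq., Percent, and Cum. columns
tabulate rep78

* By default missing values are left out of the table;
* the missing option gives them their own row
tabulate rep78, missing
```

Comparing the two tables shows how much the percentages shift once missing values are counted in the denominator.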
2. Basic numeric distribution with summarize
For continuous variables, start with:
summarize income
This gives the number of observations, mean, standard deviation, minimum, and maximum. It is a good first look, but it reports neither percentiles nor skewness, so it cannot show whether the distribution is lopsided.
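The results that summarize prints are also left behind in r(), which is handy when you want to reuse them. The sketch below assumes the bundled auto dataset, with price standing in for income.

```stata
* Assumption: price from the bundled auto dataset stands in for income
sysuse auto, clear

* Basic descriptive statistics
summarize price

* summarize leaves its results in r(); reuse them instead of retyping numbers
display "N = " r(N) "   mean = " %9.2f r(mean) "   sd = " %9.2f r(sd)
display "range = " r(max) - r(min)
```

Typing return list right after summarize shows every stored result that is available.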
3. Detailed distribution with summarize, detail
A stronger option is:
summarize income, detail
This expands the output to include percentiles, variance, skewness, and kurtosis. It is especially useful when the mean and median are far apart, suggesting right or left skew.
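A short sketch of the mean versus median comparison, again assuming the bundled auto dataset with price as a stand-in for income. The detail option adds percentiles and moments to the stored results.

```stata
* Assumption: price from the bundled auto dataset stands in for income
sysuse auto, clear

* detail adds percentiles, variance, skewness, and kurtosis to r()
quietly summarize price, detail

display "mean     = " %9.1f r(mean)
display "median   = " %9.1f r(p50)
display "IQR      = " %9.1f r(p75) - r(p25)
display "skewness = " %5.2f r(skewness)
```

A mean well above the median together with positive skewness points to a right tail.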
4. Histogram for a visual distribution
To graph the variable, use:
histogram income, percent
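This shows how observations are grouped across value ranges. With the percent option, the vertical axis shows the percentage of observations in each bin rather than the default density scale. You can also overlay a normal curve: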
histogram income, normal percent
If the bars differ sharply from the overlaid curve, your variable may not be approximately normal.
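Bin choices change what a histogram appears to show, so it can help to draw more than one version. The sketch below assumes the bundled auto dataset; the graph names are hypothetical labels used only to keep the graphs separate in memory.

```stata
* Assumption: price from the bundled auto dataset stands in for income
sysuse auto, clear

* Default bins with a normal overlay
histogram price, percent normal name(price_default, replace)

* Fix the number of bins
histogram price, percent bin(10) name(price_10bins, replace)

* Or fix the bin width and the lower edge of the first bin
* (start() must not exceed the smallest observed value)
histogram price, percent width(1000) start(3000) name(price_w1000, replace)
```

If the overall shape looks similar across reasonable bin settings, the pattern is probably not an artifact of the bins.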
5. Kernel density estimate
For a smoother visual shape than a histogram, use:
kdensity income
This is helpful when you want to compare the overall shape of two groups or avoid arbitrary histogram bin choices, although the estimate still depends on the kernel bandwidth.
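One way to compare two groups is to overlay their density estimates in a single graph. This is a sketch that assumes the bundled auto dataset, with mpg as the variable and foreign as the grouping indicator; neither appears in the original example.

```stata
* Assumption: bundled auto dataset; foreign (0/1) defines the two groups
sysuse auto, clear

* Overlay one kernel density per group.
* The /// line continuation works in do-files, not at the interactive prompt.
twoway (kdensity mpg if foreign == 0) ///
       (kdensity mpg if foreign == 1), ///
       legend(order(1 "Domestic" 2 "Foreign")) ///
       xtitle("Miles per gallon") ytitle("Density")
```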
Step-by-step workflow in Stata
- Inspect the variable type. Use describe to confirm whether the variable is numeric and whether value labels are attached.
- Check missing values. Missing observations can distort percentages and summary statistics. Commands like misstable summarize or count if missing(varname) are useful.
- Run summarize. Start with the basic descriptive statistics.
- Run summarize, detail. Review percentiles, skewness, and kurtosis.
- Run tabulate or histogram. Use tabulate for exact frequencies or histogram for grouped frequency patterns.
- Interpret the distribution. Compare mean versus median, look for outliers, and examine whether values cluster in a narrow or broad range.
- Decide on next steps. If the variable is strongly skewed, consider transformations such as logs or use robust or nonparametric methods.
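The steps above can be collected into a short do-file. The version below is only a sketch: it assumes the bundled auto dataset, and the skewness cutoff of 1 is a hypothetical rule of thumb, not a Stata default.

```stata
* Assumption: bundled auto dataset; price is continuous, rep78 is discrete
sysuse auto, clear

* Inspect storage type and value labels
describe price rep78

* Check for missing values
misstable summarize price rep78
count if missing(rep78)

* Descriptive statistics, including percentiles and moments
summarize price, detail

* Decide on next steps with a simple flag.
* The cutoff of 1 is an illustrative assumption.
if abs(r(skewness)) > 1 {
    display "price looks strongly skewed: consider ln(price) or robust methods"
}
```

The flag only prompts a closer look; a histogram is still the better basis for the final decision.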
Example: Exact value distribution vs grouped distribution
Suppose you have a variable named visits that records how many clinic visits each patient had in the last year. Because this is a count variable with a limited set of integer values, tabulate visits works well. Now suppose you are analyzing annual income, stored in a variable named income. Exact-value tabulation is less useful because nearly every observation may have a distinct number. In that case, a histogram or grouped bins gives a more interpretable distribution.
| Measure | Discrete variable example: Clinic visits | Continuous variable example: Annual income |
|---|---|---|
| Best first command | tabulate visits | summarize income |
| Best visual | histogram visits, discrete percent (one bar per value) | histogram income, percent |
| Useful detail command | tabulate visits, missing | summarize income, detail |
| Interpretation focus | Most common exact counts | Shape, skewness, spread, outliers |
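If you want a grouped frequency table rather than a graph, one option is to cut the continuous variable into intervals first. The sketch below assumes the bundled auto dataset with price standing in for income; the bin width of 2,500 and the name price_bin are arbitrary choices for illustration.

```stata
* Assumption: price from the bundled auto dataset stands in for income
sysuse auto, clear

* Cut price into intervals [0,2500), [2500,5000), ... up to 17500.
* Each observation is assigned the left endpoint of its interval;
* values outside the at() range would become missing.
egen price_bin = cut(price), at(0(2500)17500)

* Grouped frequency table with percentages and cumulative percentages
tabulate price_bin
```

Choose the at() list so it covers the full observed range, otherwise observations silently drop out of the table.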
How to interpret distribution statistics
Mean and median
If the mean and median are close, the distribution may be roughly symmetric. If the mean is much larger than the median, the variable is often right-skewed. Income, health expenditures, and waiting times commonly show this pattern.
Standard deviation
The standard deviation tells you how spread out the data are around the mean. A high standard deviation relative to the mean suggests wide variability. In Stata output, this is a quick warning that a variable may contain substantial heterogeneity.
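"High relative to the mean" can be made concrete with the coefficient of variation, the standard deviation divided by the mean. A minimal sketch, assuming the bundled auto dataset and a variable that takes only positive values:

```stata
* Assumption: price from the bundled auto dataset stands in for income
sysuse auto, clear

quietly summarize price

* Coefficient of variation: spread expressed as a share of the mean
display "CV = " %5.2f r(sd)/r(mean)
```

The ratio is only meaningful for variables measured on a scale with a true zero and a positive mean.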
Percentiles
Percentiles are often more informative than the mean for skewed data. The 25th percentile, median, and 75th percentile describe the middle half of the distribution. In policy and applied research, analysts often report the 10th, 50th, and 90th percentiles to show inequality or dispersion.
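For percentiles that summarize, detail does not highlight, centile and _pctile accept any list you ask for. The sketch below assumes the bundled auto dataset; note that the two commands can use slightly different interpolation rules, so small differences between them are expected.

```stata
* Assumption: price from the bundled auto dataset stands in for income
sysuse auto, clear

* Percentiles with confidence intervals
centile price, centile(10 50 90)

* Same percentiles stored as r(r1), r(r2), r(r3) for later use
_pctile price, percentiles(10 50 90)
display "p10 = " r(r1) "   p50 = " r(r2) "   p90 = " r(r3)
display "p90/p10 ratio = " %4.2f r(r3)/r(r1)
```

The p90/p10 ratio is one common way to express dispersion in a single number.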
Skewness and kurtosis
Stata’s detailed summary reports skewness and kurtosis. Positive skewness means a longer right tail; negative skewness means a longer left tail. Kurtosis mainly reflects tail heaviness; Stata reports it on a scale where a normal distribution has kurtosis 3, so values well above 3 indicate heavier tails than normal. These moments are useful, but they should be interpreted alongside a histogram rather than in isolation.
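When you want a formal check to go with the moments, Stata offers normality tests. The sketch below assumes the bundled auto dataset and shows two commonly used commands.

```stata
* Assumption: price from the bundled auto dataset stands in for income
sysuse auto, clear

* Skewness and kurtosis test for normality
sktest price

* Shapiro-Wilk test for normality
swilk price
```

With large samples these tests reject normality for even minor departures, so read the p-values together with the histogram.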
Real-world statistics example
The table below illustrates how distribution summaries can differ across two common social science variables. The values are illustrative rather than taken from a specific dataset; they are chosen to be realistic and to show interpretation.
| Statistic | Exam score | Household income ($) |
|---|---|---|
| Observations | 1,200 | 1,200 |
| Mean | 74.8 | 68,400 |
| Median | 75.5 | 52,100 |
| Standard deviation | 10.9 | 57,300 |
| 25th percentile | 67.0 | 31,800 |
| 75th percentile | 82.0 | 81,900 |
| Skewness | -0.18 | 2.41 |
Notice the contrast. The exam score variable is close to symmetric: the mean and median are similar, and skewness is near zero. The income variable is strongly right-skewed: the mean is much larger than the median, the standard deviation is large, and skewness is high. In Stata, this would immediately suggest looking at a histogram and considering whether a log transformation is appropriate before modeling income.
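To see what a log transformation does to a right-skewed variable, compare skewness before and after. This sketch assumes the bundled auto dataset with price standing in for income; ln_price, skew_raw, and skew_log are hypothetical names. The natural log is only defined for strictly positive values.

```stata
* Assumption: price from the bundled auto dataset stands in for income
sysuse auto, clear

* ln() returns missing for zero or negative values
generate ln_price = ln(price)

quietly summarize price, detail
scalar skew_raw = r(skewness)

quietly summarize ln_price, detail
scalar skew_log = r(skewness)

display "skewness, raw = " %5.2f skew_raw "   logged = " %5.2f skew_log
```

If skewness moves closer to zero after the transformation, modeling the logged variable may be more appropriate.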
Practical Stata syntax you can use immediately
- describe income to inspect the variable and storage type
- count if missing(income) to see how many values are missing
- summarize income for basic summary statistics
- summarize income, detail for percentiles, skewness, and kurtosis
- tabulate visits for exact frequencies of a discrete variable
- histogram income, percent normal to visualize the distribution with a normal overlay
- kdensity income for a smooth density plot
- graph box income to quickly identify outliers and spread
Common mistakes when calculating distributions in Stata
- Using tabulate on a highly continuous variable. This creates a long, unhelpful table. Use a histogram or grouped summary instead.
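- Ignoring missing values. Stata drops them from summarize and, unless you add the missing option, from tabulate, so percentages are computed over nonmissing observations only. That changes interpretation when missingness is substantial.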
- Relying only on the mean. In skewed distributions, the mean can be misleading without the median and percentiles.
- Using too few or too many histogram bins. Poor bin choices can hide or exaggerate patterns.
- Assuming normality without checking. Many applied variables, especially cost and income measures, are far from normal.
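A related missing-value trap appears when you count or flag observations above a threshold: Stata treats numeric missing values as larger than any number. The sketch below assumes the bundled auto dataset, where rep78 has some missing values.

```stata
* Assumption: bundled auto dataset; rep78 contains missing values
sysuse auto, clear

* Numeric missing values sort above every number,
* so this count silently includes the missing observations
count if rep78 > 3

* Exclude them explicitly
count if rep78 > 3 & !missing(rep78)
```

The same guard belongs in any generate or replace statement that uses a greater-than comparison.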
When to report a table, a graph, or both
If your audience needs exact percentages for each category, report a frequency table. If your audience needs to understand shape, clustering, tails, or outliers, report a histogram or density plot. In formal analysis, the strongest presentation is usually both: a concise table of summary statistics and a graph. This combination gives precision and visual intuition at the same time.
How this calculator relates to Stata
The calculator above mirrors the logic you would use in Stata. If you choose an exact value distribution, it behaves like a simplified tabulate command by counting distinct values and calculating percentages. If you choose grouped bins, it approximates what you inspect with a histogram, showing how values are distributed across intervals. The summary metrics, such as mean, median, standard deviation, and range, correspond to the output you would expect from summarize and summarize, detail.
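To reproduce the calculator's exact-value output in Stata, you can type a small set of values directly into a do-file. The seven values and the variable name x below are made up for illustration.

```stata
* Assumption: a made-up list of seven values, entered by hand
clear
input x
2
3
3
5
5
5
8
end

* Exact-value frequencies, percentages, and cumulative percentages
tabulate x

* Sample size, mean, median, standard deviation, minimum, and maximum
summarize x, detail
```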
Authoritative resources for deeper learning
If you want a stronger methodological foundation, these sources are reliable places to learn more about descriptive statistics, distributions, and applied data analysis:
- NIST Engineering Statistics Handbook (.gov)
- UCLA Statistical Methods and Data Analytics: Stata Resources (.edu)
- CDC Principles of Epidemiology: Measures of Central Location and Spread (.gov)
Final takeaway
To calculate the distribution of a variable in Stata, first identify whether the variable is categorical, discrete, or continuous. Then use the command that matches the data structure: tabulate for exact frequencies, summarize and summarize, detail for descriptive statistics, and histogram or kdensity for visual shape. Always compare mean and median, review percentiles, check for missing values, and look at a graph before making modeling decisions. That workflow is simple, fast, and robust, and it prevents many common statistical mistakes.