Histograms - Data at Reed

Histograms are a useful way to visualize a single numerical variable: they show how the values of that variable spread or cluster across a set of bins.

Note: By default, Stata creates density histograms. In a density histogram, as in a probability distribution, the total area of the bars equals 1. For most people this is not the most intuitive way to view data; a frequency histogram (bar heights are counts) or a fraction histogram (bar heights are proportions of all observations) usually makes more sense.

Load your data and run the following histograms; note how the shape changes (or does not) across the different graphs. Think about which strikes you as the clearest data visualization, and why.

sysuse cancer, clear

hist age

hist age, freq

hist age, frac
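To see what the three scales mean, the bar heights can be computed by hand. The following is a minimal sketch, not how Stata itself draws the graph: it assumes four equal-width bins starting at the minimum of age, with the maximum placed in the last bin. The names k, N, start, width, agebin, count, fraction, and density are introduced here only for illustration.

```stata
* Illustrative sketch: reproduce histogram bar heights by hand.
sysuse cancer, clear
* contract (below) replaces the data in memory, so keep a copy to restore
preserve

quietly summarize age
* assumed number of bins, as in bin(4)
local k = 4
local N = r(N)
local start = r(min)
local width = (r(max) - r(min)) / `k'

* Assign each observation to a bin numbered 1..k; the maximum goes in the last bin
generate agebin = min(floor((age - `start') / `width') + 1, `k')

* Collapse to one row per bin, with the number of observations in each
contract agebin, freq(count)

generate fraction = count / `N'
generate density = fraction / `width'
list agebin count fraction density

restore
```

In the listing, count corresponds to the freq scale, fraction to the frac scale, and density to the default scale. The fractions sum to 1, and each density multiplied by the bin width gives back the fraction, which is why the total bar area of a density histogram is 1.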

The number of bins you use in a histogram can greatly affect the resulting visual. When you build a histogram, try several bin counts and see which binning scheme is most informative for your data.

hist age, frac

hist age, frac bin(4)

hist age, frac bin(10)

hist age, frac bin(20)

hist age, frac bin(30)
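Bins can also be set by width rather than by count. As a sketch of that alternative, the commands below use the width(), start(), and discrete options of histogram. The specific choices (a width of 5 years, and a start value rounded down to a multiple of 5) are arbitrary and only for illustration; the local macro name lo is hypothetical. Note that bin() and width() cannot be combined in one command.

```stata
sysuse cancer, clear

* Bins that are each 5 years wide, starting at the smallest observed age
hist age, frac width(5)

* Same width, but with bin edges aligned to multiples of 5
* (start() must not exceed the smallest observed value)
quietly summarize age
local lo = 5 * floor(r(min) / 5)
hist age, frac width(5) start(`lo')

* Treat age as discrete: one bar per distinct integer value
hist age, frac discrete
```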

Many statistical tests assume that the underlying data are normally distributed. There are rigorous ways to test this assumption; you can also take a quick look at the shape of your data by adding the normal option (abbreviated norm below) to the histogram command, which overlays a normal density curve. Try the commands below and see what you think.

hist age, frac norm

hist age, frac bin(4) norm

hist age, frac bin(10) norm

hist age, frac bin(20) norm

hist age, frac bin(30) norm
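The more rigorous checks of normality mentioned above could look like the following. This is a sketch using Stata's built-in swilk (Shapiro-Wilk test), sktest (skewness and kurtosis test), and qnorm (normal quantile plot) commands, plus the kdensity option of histogram. Which of these is appropriate depends on your sample size and analysis, so treat the selection as illustrative rather than as a recommended procedure.

```stata
sysuse cancer, clear

* Overlay both a normal curve and a kernel density estimate
hist age, frac normal kdensity

* Formal tests of the null hypothesis that age is normally distributed
swilk age
sktest age

* Graphical check: points close to the reference line suggest approximate normality
qnorm age
```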

For more on histograms, see the Stata documentation for graph twoway histogram and histogram (separate entries), the Stata graphics tutorial from UCLA, or a related FAQ on how to overlay histograms.
