Data @ Reed

A first glance at your data

Once you have loaded your data in Stata, you need to get a sense of your dataset before you move forward with analysis or visualization. (I invite you to re-read that sentence for emphasis.)

Before you do anything else, look at your data.

browse

In the data browser, numeric variables appear in black and string (text/non-numeric) variables appear in red. Any data shown in blue has been labeled, meaning that Stata "sees" the underlying (usually numeric) data while you as the user see the more human-friendly labels. (Example: a variable school_year may have the values 1, 2, 3, 4 but be labeled "first-year", "sophomore", "junior", "senior".)

Look at your data and make sure that the values make sense to you. Are there extra variables or cases because of a data import issue? Are your data formatted as you would expect? Once you have given your data a once-over, you may want to look at some basic summary statistics. (Note: to save keystrokes, you can type "br" instead of the full word "browse".)

All of the below examples use the built-in "cancer" dataset. Load this dataset with the command sysuse cancer.

sysuse cancer
browse
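When a dataset is too wide or too long to scan comfortably, browse can be limited to a few variables or to a subset of rows. The lines below are a minimal sketch, assuming the cancer dataset's variables age, drug, and died, and assuming died is coded 1 for patients who died:

```stata
sysuse cancer, clear

* show only three variables in the data browser
browse age drug died

* show only the rows where the patient died (assumes died == 1 means "died")
browse age drug if died == 1
```

The "clear" option discards whatever data is currently in memory before loading, so only use it once your own work is saved.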

summarize

returns the number of observations, mean, standard deviation, minimum, and maximum for either a single variable or the whole dataset. Specify the "detail" option to add percentiles, variance, skewness, and kurtosis.

sysuse cancer

summarize

summarize age

summarize age, detail
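summarize can also be restricted to a subset of observations, and it leaves its results behind in r() so they can be reused. A short sketch, assuming the drug variable in the cancer dataset uses the value 1 for one of the treatment groups:

```stata
sysuse cancer, clear

* summary statistics for age within a single treatment group
summarize age if drug == 1

* summarize stores its results in r(); display them directly
summarize age
display "mean age = " r(mean) "   sd = " r(sd)
```

Typing "return list" right after summarize shows everything that was stored.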

describe

returns variable name, type of variable (storage type), display format, and information on labels (value, variable). Can be used on one variable or the entire dataset.

describe

describe age
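For a large dataset the full describe output can run long. Two options trim it down; the example below is only a sketch using the same cancer dataset:

```stata
sysuse cancer, clear

* only the header: number of observations and number of variables
describe, short

* only the variable names, in a compact layout
describe, simple
```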

inspect

returns an extremely rough histogram of the data (recommendation: use a separate command, hist, for a clearer graph) as well as counts of the number of observations and how many observations are negative, zero, or positive, how many are integer or non-integer, and how many are missing. Can be used on one variable or the entire dataset.

inspect

inspect age
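Since the text histogram is so rough, it is worth drawing a real one with the histogram command (hist for short). A minimal sketch, assuming age is continuous and drug takes only a few whole-number values:

```stata
sysuse cancer, clear

* histogram of age with counts, rather than densities, on the y-axis
histogram age, frequency

* for a variable with a few whole-number values, give each value its own bar
histogram drug, discrete frequency
```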

tabulate [variable]

returns the variable name, frequency of each value, and percentage of the dataset represented by that value

sysuse cancer

tab drug
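A few options make a one-way table more informative during a first look. The sketch below uses the same drug variable; whether any values are actually missing or labeled depends on your data:

```stata
sysuse cancer, clear

* include missing values as their own row, if there are any
tabulate drug, missing

* show the underlying numeric codes instead of the value labels
tabulate drug, nolabel

* sort the rows by frequency, most common first
tabulate drug, sort
```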

tabulate [variable1] [variable2]

a cross-tabulation of your data can be useful to see how observations are spread across the categories of two variables. Example:

sysuse cancer

tab died drug
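Raw counts can be hard to compare when the groups differ in size, so percentages and a simple test of association are often added to the same table. A sketch under the assumption that died and drug are both categorical:

```stata
sysuse cancer, clear

* add column percentages: within each drug group, the share in each row of died
tabulate died drug, column

* add a Pearson chi-squared test of independence between the two variables
tabulate died drug, chi2
```

The "row" and "cell" options work the same way as "column" if you want percentages computed along a different dimension.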

codebook

builds a codebook directly from the data and labels currently in memory, so no separate codebook file is needed (well-labeled datasets simply produce more informative output). For each variable, this command reports the data type, range, units, number of unique values, and number of missing values. For numeric variables with many distinct values it also reports the mean, standard deviation, and percentiles; for variables with only a few distinct values it shows a tabulation instead. Can be used on one variable or the entire dataset.

codebook

codebook drug
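The full codebook output takes up a lot of screen space on a dataset with many variables. Two options give a quicker overview; this is a sketch, not the only way to use them:

```stata
sysuse cancer, clear

* one line per variable: observations, unique values, mean, min, max, label
codebook, compact

* report potential problems, such as variables that are constant or unlabeled
codebook, problems
```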