Data @ Reed

Missing data in Stata

Note: When working with missing data, you need to consider why that data is missing. In survey data, missing values may mean that the surveyor did not ask the question, that the respondent did not answer the question, or that the data are truly missing. (Some datasets have these three cases coded differently; others lump them together. Check your metadata/codebook to make sure you know what you are working with!) For numeric data, keep in mind that missing data are not the same as a value of zero. (This may seem obvious, but I have had many students nonchalantly say "oh, so we can just replace those with zeros..." Nope.) Consider this in the context of gas mileage. MPG = 0 is very different from MPG = "I'm not sure."
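To see why replacing missing values with zeros changes your results, here is a minimal sketch using a made-up three-row dataset (the variable names mpg and mpg_zero are hypothetical, chosen only for illustration). It compares the mean Stata reports when a value is left missing with the mean after that value is replaced by zero.

```stata
* Made-up data: two known MPG values and one missing value
clear
input float mpg
20
30
.
end

* summarize ignores the missing value: mean of 20 and 30
summarize mpg
display r(mean)

* Treating "missing" as zero adds a third observation and pulls the mean down
generate mpg_zero = cond(missing(mpg), 0, mpg)
summarize mpg_zero
display r(mean)
```

The first mean is based on two observations and the second on three, so the two numbers answer different questions.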

Different statistical software packages code missing data differently. In Stata, a missing value in a numeric variable appears as . [period] in your dataset. A missing value in a string variable is the empty string "", which appears as a blank cell.
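A small sketch of what this looks like in practice, assuming a toy dataset typed in by hand (the variable names mpg and make and the values are hypothetical). The missing() function returns true for a missing value in either kind of variable.

```stata
* Toy dataset with one numeric and one string variable
clear
input float mpg str10 make
21 "Civic"
.  "Accord"
30 ""
end

* The numeric missing value lists as a period; the string one lists as blank
list

* missing() works for both numeric and string variables
count if missing(mpg)
count if missing(make)
```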

Missing data values will affect how Stata handles your data. Some common procedures are below; for others, check the Stata documentation.

  • Summarize - uses only non-missing values
  • Tabulate - missing values excluded by default; use missing option within tab to include missing values.
  • Correlations - correlate uses only observations with non-missing values on every variable listed (listwise deletion of missing data); use pwcorr to calculate each correlation on all pairs with non-missing data (pairwise deletion of missing data).
  • Regression - if an observation is missing data for a variable in the regression model, that observation is excluded from the regression (listwise deletion of missing data)
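The behavior of these commands can be checked on the auto dataset that ships with Stata, in which rep78 is missing for 5 of the 74 cars. This is only a sketch; the choice of price, mpg, and rep78 is arbitrary.

```stata
sysuse auto, clear

* Obs column reports the number of non-missing values
summarize rep78

* Missing values are left out of the first table and shown in the second
tabulate rep78
tabulate rep78, missing

* Compare the number of observations each correlation command uses
correlate price mpg rep78
pwcorr price mpg rep78, obs

* Observations missing rep78 are excluded from the model
regress price mpg rep78
display e(N)
```

Checking the reported number of observations after each command is a quick way to confirm how many cases were actually used.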

Looking for missing values

When you load data into Stata, you will likely look at descriptive statistics or some other data summary. The command summarize reports the number of non-missing observations (Obs) for each variable; comparing that number with the total number of observations in your dataset tells you how many values are missing. Additional resources you can use to investigate missing values are the packages mdesc, mvpatterns, and misschk. These packages do not come with Stata, but can be downloaded by typing findit mdesc (or the name of another package) at the Stata command line. (More on findit and installing packages)
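If you would rather not install anything, a few built-in commands also count missing values. The sketch below again uses the auto dataset that ships with Stata; misstable is available in Stata 11 and later.

```stata
sysuse auto, clear

* Total observations minus non-missing observations = number missing
summarize rep78
display _N - r(N)

* Count missing values directly
count if missing(rep78)

* Table of every numeric variable that has at least one missing value
misstable summarize
```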

Dropping missing data

Use Stata's drop command, combined with a logical / conditional statement, to drop observations that have missing values. Examples:

Drop cases missing string data (for variable "important_string_variable")

drop if important_string_variable == ""

Drop cases missing numeric data (for variable "important_numeric_variable")

drop if important_numeric_variable == .
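One caution about comparing to a period: Stata also has extended missing values (.a through .z), and == . matches only the plain period. The sketch below, using a made-up variable x, shows the difference; missing(x) or x >= . catches every kind of numeric missing value.

```stata
* Made-up data: one real value, one . and one extended missing value .a
clear
input float x
1
.
.a
end

* Matches only the plain period
count if x == .

* Both of these match . and .a
count if missing(x)
count if x >= .
```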

Drop cases missing data (string or numeric, for variable "important_either_kind_of_variable")

drop if missing(important_either_kind_of_variable) 
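As a worked example, here is the same pattern applied to the auto dataset that ships with Stata. Counting before and after the drop is a simple habit for confirming that you removed what you meant to remove.

```stata
sysuse auto, clear

* How many observations are missing rep78?
count if missing(rep78)

* Drop them, then confirm how many observations remain
drop if missing(rep78)
count
```

Dropping changes only the data in memory; the file on disk is untouched until you save.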

 

You may also wish to recode or replace missing values; see below for more details on those operations.
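For instance, some datasets use a numeric placeholder for missing data. The sketch below assumes a hypothetical variable income in which -99 means "missing" (check your codebook for the real codes) and converts that placeholder to Stata's missing value with mvdecode.

```stata
* Made-up data in which -99 stands for "missing"
clear
input float income
52000
-99
48000
-99
end

* Convert the placeholder to Stata's missing value
mvdecode income, mv(-99)
count if missing(income)

* Same result by hand:
*   replace income = . if income == -99
* To go the other direction (missing back to a numeric code):
*   mvencode income, mv(-99)
```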

Missing values: Some relevant documentation

Data Management FAQ: Replacing missing values (stata.com)
Learning Module: Working with missing data in Stata (UCLA)
Recoding missing data (UCLA)
Missing values: patterns, counts, and more (UCLA)