Data @ Reed

Generating and replacing variables

Generating variables (generate)

You may need to add a new variable to your dataset.

This may be as simple as recording that every observation has a STATE of "Oregon":

generate STATE = "Oregon"

(Oregon is in quotes because it is a string; a single equals sign assigns a value.)

...or recording that every observation has a YEAR of 2015:

generate YEAR = 2015

(No quotes because 2015 is stored as a numeric value.)
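As a minimal sketch of the two commands above, the following creates a small empty dataset (the three observations are an assumption for illustration), adds both variables, and checks how Stata stored them. The confirm commands simply error out if a variable is not of the stated type.

```stata
* Start from an empty dataset with three observations (hypothetical size)
clear
set obs 3

* String constant: needs quotes
generate STATE = "Oregon"

* Numeric constant: no quotes
generate YEAR = 2015

* Show the storage type of each variable
describe STATE YEAR

* These exit silently when the types are as expected
confirm string variable STATE
confirm numeric variable YEAR
```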

You can also create variables based on other values in your dataset. For example, suppose you need to calculate income for a six-month period from annual data.

generate income_6mo = annual_income * 0.5

See the generate documentation for full details and explanation.
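The calculation can be tried on a few made-up rows, as in the sketch below. The income figures are hypothetical. Stata variable names must begin with a letter or underscore, so the new variable is called income_6mo here.

```stata
* Three hypothetical annual incomes
clear
input annual_income
40000
52500
61000
end

* Half of the annual figure gives the six-month income
generate income_6mo = annual_income * 0.5

list
```

Because generate evaluates the expression once per observation, each row gets its own six-month value.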

Replacing variables

Replacing variables is a variation on the theme of generating new ones, and replace is documented alongside generate in the generate documentation.

Some basic examples of replacing variables:

Replace all values of variable STATE with the string Utah

replace STATE = "Utah"

Replace values of variable STATE with the string Utah only in the first 10 observations in your dataset

replace STATE = "Utah" in 1/10

Replace values of variable STATE with the string Utah only if the value of variable FIPS is equal to 049. (FIPS codes are often stored with leading zeros, which means they are stored as string variables.) Note that we use two equals signs when making a comparison and one when assigning a value.

replace STATE = "Utah" if FIPS == "049"
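The in and if qualifiers can be seen working on a toy dataset, sketched below. The 20 observations and the FIPS values assigned to them are assumptions made up for illustration.

```stata
* Toy data: 20 observations, all starting as Oregon with FIPS "041"
clear
set obs 20
generate STATE = "Oregon"
generate FIPS = "041"

* Give observations 11 through 15 the FIPS code "049"
replace FIPS = "049" in 11/15

* "in" restricts the change to a range of observation numbers
replace STATE = "Utah" in 1/10
* The next command reports 10
count if STATE == "Utah"

* "if" restricts the change to observations meeting a condition
replace STATE = "Utah" if FIPS == "049"
* The next command reports 15 (10 from before plus 5 matched on FIPS)
count if STATE == "Utah"
```

If FIPS were stored as a numeric variable instead, the comparison would be written without quotes (for example, if FIPS == 49), since comparing a numeric variable to a string is a type mismatch.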

More on generating variables (egen)

egen (extended generate) offers additional functions for creating variables. If you need to do something more complex with your data than generate allows, egen may be a good option. Examples include creating a value based on subgroups in the data ("What is the median value of x for each group, as defined by the variable GroupID?"), computing a total of a variable overall or within groups (a running total, by contrast, comes from generate with the sum() function), calculating variables based on row or column values (minimum, standard deviation, mean, sum), and much more. See the egen documentation for full details and explanation.
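A few of these uses are sketched below on made-up data. The values and the variable names a, b, and c are hypothetical; GroupID and x follow the example in the text. In this sketch the running total is produced with generate and the sum() function, while egen's total() returns one constant per group.

```stata
* Hypothetical data: two groups, one value x, three columns a b c
clear
input GroupID x a b c
1 10 1 2 3
1 20 4 5 6
1 30 7 8 9
2  5 2 2 2
2 15 1 3 5
end

* Median of x within each group, repeated on every row of that group
egen median_x = median(x), by(GroupID)

* Row-wise mean across a, b, and c
egen row_mean = rowmean(a b c)

* Group total of x (the same value on every row of the group)
egen total_x = total(x), by(GroupID)

* Running total of x within each group, in order of x
bysort GroupID (x): generate running_x = sum(x)

list, sepby(GroupID)
```

With these values, median_x is 20 for group 1 and 10 for group 2, and running_x climbs 10, 30, 60 within group 1.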