Stata Help

Changing String Variables to Categorical Variables and Vice Versa

Sometimes string variables need to be categorical, and categorical variables need to be strings. In Stata this matters because statistical commands generally cannot use string variables: a variable stored as a string is either rejected or contributes no usable observations to the analysis. Stata therefore offers several commands to change string variables into categorical variables and vice versa.

The first case most often occurs when importing data from another source: Stata may store a variable as a string even though its values are numbers. The easiest way to tell if this is the case is to look at the Variables window. If a variable is a string, the Type will be str followed by a number (the maximum length of the string). If, for example, you had a gender variable consisting of ones and zeroes that was stored as str1, you could use the destring command. If you want to replace the existing variable, the command is

    destring [varname], replace

This replaces the specified variable with the same data, now stored in a numeric format. The replace option is not optional here: destring requires either replace or generate() and returns an error if neither is given. If you prefer to retain the existing variable, you can generate a new variable that is a numeric version of it. To do this, type

    generate [new variable name] = real([string variable])

In my example, this would look like

    generate sex2 = real(sex)

This command creates a new variable called sex2 that contains the data from my original variable (sex) stored in a numeric format.
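The two approaches above can be tried on a small made-up dataset. This is only a sketch: the variable names sex and sex2 follow the example in the text, and the three observations are invented for illustration.

```stata
* Hypothetical toy data: gender codes that arrived as text
clear
input str1 sex
"1"
"0"
"1"
end

* The storage type is reported as str1
describe sex

* Keep the original and add a numeric copy
generate sex2 = real(sex)

* Or convert the original in place
destring sex, replace

* Both variables are now numeric
describe sex sex2
```

Note that real() returns missing for any value it cannot read as a number, so it is worth checking the new variable for unexpected missing values.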

Both of these commands have a reverse. In the first case,

    tostring [varname], replace

converts the variable back to a string, and

    generate [new variable name] = string([numeric variable])

generates a new string variable with the same data as the specified numeric variable, stored as text rather than in a numeric format.
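Going in the other direction can be sketched the same way. The data and the name sex_str are invented for illustration; tostring, like destring, needs either replace or generate().

```stata
* Hypothetical toy data: a numeric 0/1 variable
clear
input byte sex
1
0
1
end

* Keep the original and add a string copy
generate sex_str = string(sex)

* Or convert the original in place
tostring sex, replace

* Both variables are now strings
describe sex sex_str
```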

The above will only work if all of the data is numeric. Sometimes it is not. Where your string variables really are strings (e.g., "female" instead of "1"), you have to tell Stata to encode the string data. Running encode makes a new numeric categorical variable whose value labels correspond to the old string values. If you do this, be aware that Stata is case sensitive: female, Female and FEmale will be treated as three different categories. Encode is a slightly more complicated command because it requires an option, generate([newvariablename]). Continuing the gender example, the full command would look something like this:

    encode gender, generate(sex)

This causes Stata to generate a new variable called "sex" that contains numeric categories based on the old variable (called "gender"). If you browse the new variable, however, it will look the same, because Stata displays the labels (not the raw numbers). The only visual clue that something is different is the color of the text: labeled numeric values are shown in blue, whereas string values are shown in red.

The opposite of encode is decode. The decode command has the same syntax as encode:

    decode [varname], generate([newvariablename])

It generates a string variable based on the value labels of a numeric categorical variable.
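A small sketch of encode and decode, again on invented data. The variable names gender and sex come from the example in the text; gender2, gender_clean and sex_clean are hypothetical names added here. The lower() step is one possible way to deal with mixed capitalization and is not part of encode itself.

```stata
* Hypothetical toy data with inconsistent capitalization
clear
input str6 gender
"female"
"male"
"Female"
"female"
end

* Create a labeled numeric variable from the strings
encode gender, generate(sex)

* With labels, then with the underlying numeric codes
tabulate sex
tabulate sex, nolabel

* "Female" and "female" were encoded as separate categories
label list sex

* Go back from the labeled numeric variable to a string
decode sex, generate(gender2)

* Optional: harmonize capitalization before encoding
generate gender_clean = lower(gender)
encode gender_clean, generate(sex_clean)
tabulate sex_clean
```

Listing the value label after encoding is a quick way to spot categories that differ only by capitalization or stray spaces.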

The most complicated cases are those in which you import data with numeric and nonnumeric characters. Google Books offers some useful information on the subject here
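For simple mixed cases, destring's ignore() and force options may be enough. The sketch below assumes a made-up income variable whose values contain thousands separators; the variable names are hypothetical.

```stata
* Hypothetical toy data: numbers stored with thousands separators
clear
input str6 income
"1,200"
"950"
"3,400"
end

* Strip the commas, then convert to numeric
destring income, generate(income_num) ignore(",")

* Alternatively, force turns anything nonnumeric into missing
destring income, generate(income_forced) force

list
```

Because force silently replaces unreadable values with missing, ignore() is usually the safer choice when you know which characters are in the way.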
