Small Questions, Big Data

In the age of the internet, it can be hard to know where to begin when trying to answer broad questions about the world: part of this dilemma is that there seems to be too much data! For example, the World Health Organization (WHO), the United Nations (UN), and the U.S. Census together have data about 200 countries, on topics from reproductive health, to viral infection rates, to education and much more. How can we ask a small, specific, geographically narrow question of all this data? This blog post will walk through this process, using health data from the WHO and the US Census, and education data from the UN. By the end, we will have a useful roadmap for asking specific questions of multiple large and difficult to navigate datasets.

Now that we have a good handle on the data we have at our hands, let’s dive in. The United Nations has some very informative data open to the public, including data on education and population of different countries. This data is great, yet it is far from tidy or easy to work with. We can see this by taking a look at a few columns from education:
T07 Enrolment in primary, secondary and tertiary education levels X3 X4 X5 X6 X7
Region/Country/Area NA Year Series Value Footnotes Source
204 Benin 2016 Students enrolled in secondary education (thousands) 992.9990 NA United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019.
454 Malawi 2017 Students enrolled in secondary education (thousands) 998.9400 NA United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019.
642 Romania 2010 Students enrolled in tertiary education (thousands) 999.5230 NA United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019.
768 Togo 2005 Students enrolled in primary education (thousands) 996.7070 NA United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019.

As we can see, this data is far from tidy. To help make this data easy to work with, we first use dplyr and tidyr:

education_tidy <- education %>%
  rename("region" = "T07",
         "country" = "Enrolment in primary, secondary and tertiary education levels",
         "year" = "X3") %>%
  select(-X6, -X7) %>%
  pivot_wider(names_from = "X4",
              values_from = "X5") %>%
  select(-Series) %>%
  filter(year != "Year")


pop_density_tidy <- pop_density %>%
   rename("region" = "T02",
         "country" = "Population, density and surface area",
         "year" = "X3") %>%
  select(-X6, -X7) %>%
  pivot_wider(names_from = "X4",
              values_from = "X5") %>%
  select(-Series) %>%
  filter(year != "Year")

Here we renamed and deleted some columns, and used pivot_wider to correctly format the data. An insightful way to use this data is to join it with some data on HIV from WHO, the World Health Organization. We can use the WHO data package to retrieve this data:

hiv <- get_data("HIV_0000000001")

Next, we will join the datasets and use functions from stringr to parse our strings correctly. First, we will only consider data from 2010, and then we will join all the data together. After this, we can look at a table of the data, looking at a few rows of some interesting columns:

# Step 1: Make some initial joins and filter the data to only include data from 2010
edu_2010 <- education_tidy %>%
  filter(year == 2010)

pop_2010 <- pop_density_tidy %>%
  filter(year == 2010)

edu_pop <- left_join(edu_2010, pop_2010) 

hiv_2010 <- hiv %>%
  filter(year == 2010) %>%
  select(-region, -year, -publishstate, -gho)

# Step 2: Join all the data together, and then extract the relevant information from the strings regarding the number of people with HIV
data_2010 <- left_join(edu_pop, hiv_2010, by = "country") %>%
  rename("people_with_hiv" = "value") %>%
  filter(people_with_hiv != "No data" | NA) %>%
  mutate(hiv_count = str_replace(people_with_hiv, " \\[.*\\]", "")) %>%
  mutate(hiv_count = str_replace_all(hiv_count, fixed(" "), "")) %>%
  mutate(hiv_count = str_replace_all(hiv_count, fixed("<"), "")) %>%
  mutate(hiv_count = as.numeric(hiv_count))

# Step 3: Make a table
data_2010 %>%
  select("country", "year", "Students enrolled in secondary education (thousands)",
         "people_with_hiv", "hiv_count", "Population density") %>%
  rename("enrollment" = "Students enrolled in secondary education (thousands)") %>%
  filter(country %in% c("United States of America", "Australia", "New Zealand", "Congo",
                        "Germany", "Brazil", "Mexico")) %>%
  kable(align = "c") %>%
  kable_styling()
country year enrollment people_with_hiv hiv_count Population density
Australia 2010 NA 21 000 [17 000–23 000] 21000 2.8839
Brazil 2010 23,538.7170 670 000 [520 000–830 000] 670000 23.4159
Congo 2010 NA 82 000 [69 000–95 000] 82000 12.5146
Germany 2010 7,663.7550 69 000 [57 000–81 000] 69000 231.8883
Mexico 2010 11,681.5300 180 000 [150 000–210 000] 180000 58.6913
New Zealand 2010 512.1950 2500 [2100–2800] 2500 16.5966
United States of America 2010 24,192.7860 990 000 [880 000–1 100 000] 990000 33.7813

This table can teach us a few things! First off, we see that the data was joined nicely and we have relevant columns from each initial data set. Second, we can see what really happened in Step 2. We were able to get rid of everything in the square brackets and any spaces in the numbers. Finally, we see that there is more work to be done as the enrollment column has some pesky commas in it. To deal with this, we can write a slick function that takes commas out of each column in our dataframe that has them where they shouldn’t be:

comma_killer <- function(column){
  column <- enquo(column)
  data_2010 <<- data_2010 %>%
    mutate(as.numeric(str_replace_all(data_2010[[!!column]], fixed(","), "")))
}

comma_killer(4)
comma_killer(7)
comma_killer(10)

To fix the class of most of the variables in this data.frame, we can run a for loop. Then we can do some basic calculations to give us relevant columns:

# Make numeric
for (i in 2:18)
{
  data_2010[[i]] <- as.numeric(data_2010[[i]])
}

# Relevant columns
data_2010 <- data_2010 %>%
  mutate(total_enrolled = 1000*(ifelse(is.na(primary_enrolled_thousands), 0, primary_enrolled_thousands) +
          ifelse(is.na(secondary_enrolled_thousands), 0, secondary_enrolled_thousands) +
          ifelse(is.na(tertiary_enrolled_thousands), 0, tertiary_enrolled_thousands)),
          total_pop = 1000000*population_millions,
          enrolled_ratio = total_enrolled / total_pop,
          hiv_ratio = hiv_count / total_pop)

Now, we’ve suppressed some code where we renamed many columns and selected only the relevant ones to us after our data join, but after those steps and what we have done above, we have a nice, tidy, dataframe that we can easily extract information from! Here is a glimpse of the dataset with some relevant columns included:

country sex_ratio population_density total_enrolled total_pop enrolled_ratio hiv_ratio hiv_count
Tunisia 99.4528 68.4555 2564000 10635200 0.2410862 0.0001316 1400
Uganda 96.3755 162.2950 8495294 32428200 0.2619724 0.0370048 1200000
Ukraine 85.7294 79.0446 7308258 45792100 0.1595965 0.0050227 230000
United States of America 97.5770 33.7813 69013497 309011500 0.2233363 0.0032038 990000
Uruguay 93.0187 19.1937 795680 3359300 0.2368589 0.0028577 9600
Uzbekistan 98.9431 67.0332 6709108 28515900 0.2352760 0.0010520 30000
Viet Nam 98.9260 283.7026 8943037 87967700 0.1016627 0.0025009 220000
Yemen 101.6399 43.8564 5260458 23154900 0.2271855 0.0002203 5100
Zambia 97.5468 18.3026 2899131 13606000 0.2130774 0.0734970 1000000
Zimbabwe 90.9805 32.8234 94611 12697700 0.0074510 0.0945053 1200000

Now that we’ve examined the UN dataset and cleaned it up significantly, let’s return to the WHO and US Census data we originally found. We can see that the WHO dataset has a column called “category” with only 9 unique values, which is much easier to examine than the 3,308 individual variables being recorded!

label display url category
MDG_0000000001 Infant mortality rate (probability of dying between birth and age 1 per 1000 live births) https://www.who.int/data/gho/indicator-metadata-registry/imr-details/1 Mortality and global health estimates
MDG_0000000003 Adolescent birth rate (per 1000 women aged 15-19 years) https://www.who.int/data/gho/indicator-metadata-registry/imr-details/4669 Sustainable development goals
MDG_0000000005 Contraceptive prevalence (%) https://www.who.int/data/gho/indicator-metadata-registry/imr-details/5 Millennium Development Goals (MDGs)

The category “Millennium Development Goals (MDGs)” seems interesting; WHO lists these goals on their website. Looking at this webpage, these all seem like valuable goals. Let’s try and figure out how much progress the world is making towards them!

In the first few dozen rows of the dataset, there are some goals related to maternal health: “Contraceptive prevalence (%)” and “Unmet need for family planning (%).” Later, we see data related to rates of caesarean section. Let’s grab these data frames, and combine them based on the variables they have in common: year, country, region (e.g. continent or subregion of a continent), and “worldbankincomegroup” for the contraceptive and family planning datasets, which is a measure of income level by country as determined by the World Bank.

# Grabbing the data
contraceptive <- get_data("MDG_0000000005")
fam_plan_need <- get_data("MDG_0000000006")
csection      <- get_data("csection5")

# Cleaning and tidying up the c-section data
c_sect <- csection %>%
  filter(value != "No data") %>%
  separate(value, into = c("value", "sd"), sep = " ") %>%
  mutate(value = as.numeric(value)) %>%
  mutate(lower_bound = str_extract(sd, "\\d*\\.\\d*"),
         upper_bound = substr(str_extract(sd, "\\-\\d*\\.\\d*"), 2, nchar(.))) %>%
  dplyr::select(1:8, 13:14, 10:12)

# Combining the data
fam_plan <- contraceptive %>%
  inner_join(fam_plan_need, by = c("year" = "year", 
                                   "country" = "country",
                                   "region" = "region",
                                   "worldbankincomegroup" = "worldbankincomegroup")) %>%
  dplyr::select(-gho.x, -gho.y) %>%
  rename("contra_rate" = "value.x",
         "unmet_fam_plan_need" = "value.y",
         "contra_pub" = "publishstate.x",
         "famplan_pub" = "publishstate.y") %>%
  inner_join(c_sect, by = c("year" = "year",
                            "country" = "country",
                            "region" = "region")) %>%
  dplyr::select(-gho, -lower_bound, -upper_bound) %>%
  rename("csec_source" = "datasource",
         "csec_perc"   = "value",
         "csec_pub"    = "publishstate")

fam_plan[1:3, 1:9] %>% 
  top_n(3) %>%
  kable(align = "c") %>%
  kable_styling()
year region contra_pub contra_rate worldbankincomegroup country famplan_pub unmet_fam_plan_need csec_pub
2007 Americas Published 72.9 NA Dominican Republic Published 11.1 Published
2007 Americas Published 72.9 NA Dominican Republic Published 11.1 Published
2007 Americas Published 72.9 NA Dominican Republic Published 11.1 Published

Note: additional columns were cut off for readability.

What patterns can we find here? Let’s see how the proportion of unmet need for family planning resources is correlated with the proportion of births by caesarean section:

fam_plan %>%
  group_by(unmet_fam_plan_need) %>%
  summarise(meanCSec = mean(csec_perc)) %>%
  ggplot(aes(x = unmet_fam_plan_need, y = meanCSec)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Correlation of Unmet Need for Family Planning and Rate of C-Section Births",
       x = "Unmet Need for Family Planning (%)",
       y = "Proportion of Births by Caesarean Sections")

There seems to be a pattern here: when there are fewer family planning resources available, there are also fewer c-sections, and we can probably assume that women in these countries don’t simply need fewer c-sections.

Now let’s compare something more directly related: family planning resources and contraceptive use.

fam_plan %>%
  ggplot(aes(x = unmet_fam_plan_need, y = contra_rate)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Correlation of Unmet Need for Family Planning and Contraceptive Use",
       x = "Unmet Need for Family Planning (%)",
       y = "Use of Contraceptives (%)")

As we can see, there also seems to be a pattern in these variables. With fewer resources for family planning, people use fewer contraceptives.

Now let’s move on to the US census data. Similar to the WHO data, it’s broken down into “concepts.” Out of curiosity, let’s search for “birth” in the concepts, to maybe find something related to the concepts we analyzed in the WHO data.

# Enter your API key before pulling any data!
census_api_key("put your key here!")
var_labels <- load_variables(2018, "acs5", cache = TRUE)

unique(var_labels[which(grepl("BIRTH", var_labels$concept)), ])[1:3, ] %>%
  kable(align = "c") %>%
  kable_styling()
name label concept
B05002_001 Estimate!!Total PLACE OF BIRTH BY NATIVITY AND CITIZENSHIP STATUS
B05002_002 Estimate!!Total!!Native PLACE OF BIRTH BY NATIVITY AND CITIZENSHIP STATUS
B05002_003 Estimate!!Total!!Native!!Born in state of residence PLACE OF BIRTH BY NATIVITY AND CITIZENSHIP STATUS

Only a few rows are shown here, but scrolling down, you could find “WOMEN 15 TO 50 YEARS WHO HAD A BIRTH IN THE PAST 12 MONTHS….” Let’s grab this data set, and add variables to indicate age, marital status, and whether or not the person in question gave birth in the past 12 months, since the current labels don’t provide this information directly. We should also clean up the location to keep county and state separate, and add variables that summarize the data by location.

birth <- get_acs("county",
                 survey = "acs5",
                 table = "B13002",
                 cache_table = TRUE)

# This next part looks long, but it's just categorizing the different rows based on the information from the census!
birth_age_marital <- birth %>%
  mutate(Marriage = case_when(grepl("_004", birth$variable) |
                              grepl("_005", birth$variable) |
                              grepl("_006", birth$variable) |
                              grepl("_013", birth$variable) |
                              grepl("_014", birth$variable) |
                              grepl("_015", birth$variable) ~ "Married",
                              grepl("_008", birth$variable) |
                              grepl("_009", birth$variable) |
                              grepl("_010", birth$variable) |
                              grepl("_017", birth$variable) |
                              grepl("_018", birth$variable) |
                              grepl("_019", birth$variable) ~ "Unmarried"),
              Age = case_when(grepl("_004", birth$variable) |
                              grepl("_008", birth$variable) |
                              grepl("_013", birth$variable) |
                              grepl("_017", birth$variable) ~ "15-19",
                              grepl("_005", birth$variable) |
                              grepl("_009", birth$variable) |
                              grepl("_014", birth$variable) |
                              grepl("_018", birth$variable) ~ "20-34",
                              grepl("_006", birth$variable) |
                              grepl("_010", birth$variable) |
                              grepl("_015", birth$variable) |
                              grepl("_019", birth$variable) ~ "35-50"),
            Birth = case_when(grepl("_004", birth$variable) |
                              grepl("_005", birth$variable) |
                              grepl("_006", birth$variable) |
                              grepl("_008", birth$variable) |
                              grepl("_009", birth$variable) |
                              grepl("_010", birth$variable) ~ "Y",
                              grepl("_013", birth$variable) |
                              grepl("_014", birth$variable) |
                              grepl("_015", birth$variable) |
                              grepl("_017", birth$variable) |
                              grepl("_018", birth$variable) |
                              grepl("_019", birth$variable) ~ "N")) %>%
  dplyr::select(-variable) %>%
  filter(!is.na(Marriage) & !is.na(Age)) %>%
  rename("Count" = "estimate", "MargOfError" = "moe") %>%
  mutate(COUNTY = substr(str_extract(NAME, ".*\\,"), 1, nchar(.) - 1),
         STATE  = substr(str_extract(NAME, "\\,.*"), 3, nchar(.))) %>%
  dplyr::select(1, 8:9, 3:7) %>%
  group_by(GEOID) %>%
  mutate(PropByGEOID = Count / sum(Count)) %>%
  group_by(GEOID, Marriage, Age) %>%
  mutate(BirthProp = Count / sum(Count))

birth_age_marital[1:3, ] %>% 
  kable(align = "c") %>%
  kable_styling()
GEOID COUNTY STATE Count MargOfError Marriage Age Birth PropByGEOID BirthProp
01001 Autauga County, Alabama 0 28 Married 15-19 Y 0.0000000 0.0000000
01001 Autauga County, Alabama 462 157 Married 20-34 Y 0.0340206 0.1693548
01001 Autauga County, Alabama 131 89 Married 35-50 Y 0.0096465 0.0277778

Finally, let’s visualize this information:

birth_age_marital %>%
  group_by(Marriage, Age, Birth) %>%
  summarise(meanProp = mean(PropByGEOID, na.rm = TRUE)) %>%
  filter(Birth == "Y") %>%
  ggplot(aes(x = Age, y = meanProp, fill = Marriage)) +
  geom_bar(stat = "identity") +
  labs(x = "Marital Status", y = "Mean Proportion of County", 
       title = "Demographics of mothers who gave birth in the past 12 months",
       subtitle = "Mean proportion across all US counties")

As we can see, the majority of women who gave birth in the United States in this 12-month period were married women between the ages of 20 and 34. One could dive deeper to find patterns by state, and join this data with other datasets that give information on family planning resources and caesarean section rates. For now, however, this information seems hopeful; comparatively, not many teenagers are giving birth.

It is clear now that there are so many directions that one could take in their own data journey. Here, we have gained insight into how education level and rates of HIV might relate to one another, a question that the CDC has also taken up in their concerns about how socioeconomic status impacts one’s health. We also explored whether the WHO is meeting its stated goals specifically in the United States, taking advantage of additional data from the US Census. There are nearly infinite questions that could be asked of these datasets, and hopefully this walk through the process of asking a question, finding data to answer it, and configuring the data in an informative way has shown how to ask small questions of big datasets.

Sources

expersso/WHO (https://github.com/expersso/WHO, accessed March, 2020)

The UN (https://data.un.org/, accessed March, 2020)

tidycensus (https://cran.r-project.org/web/packages/tidycensus/index.html, accessed March, 2020)