We can use the rvest package to scrape information from the internet into R. For example, this page on Reed College’s Institutional Research website contains a large table with data that we may want to analyze. Instead of trying to copy this data into Excel or having to manually recreate it, we can use rvest to pull the information directly into R.

# load packages 
suppressMessages(library(rvest))
suppressMessages(library(dplyr))
suppressMessages(library(reshape2))
suppressMessages(library(googleVis))

# helpful resources for using rvest 
  # vignette("selectorgadget")
  # http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/
# download html file
webpage <- html("http://www.reed.edu/ir/geographic_states.html")

# the data we want is in the first table on this page
# the html_table() command coerces the data into a data frame
webpage %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()
##              State 2007 2008 2009 2010 2011 2012 2013 2014
## 1          Alabama    0    5    1    0    0    0    1    0
## 2           Alaska    2    0    3    1    3    3    2    2
## 3          Arizona   12    7    3    3    5    8    5    5
## 4         Arkansas    0    0    0    2    0    0    0    0
## 5       California   71   65   94   96   97   85   87   99
## 6         Colorado    8   14    5   11   13    7    7    3
## 7      Connecticut    5    7    5    9   13    7    3    4
## 8         Delaware    0    1    1    0    0    0    0    0
## 9   Washington, DC    3    7    5    6    2    1    3    4
## 10         Florida    8    9    5   10    4    4    9    2
## 11         Georgia    1    7    4    3    7    2    0    1
## 12          Hawaii    5    1    3    1    2    2    3    7
## 13           Idaho    1    2    2    3    3    2    4    3
## 14        Illinois   12    5    5   12    4    3   11    9
## 15         Indiana    1    2    2    3    2    1    0    2
## 16            Iowa    1    1    2    2    1    2    2    4
## 17          Kansas    0    0    1    1    1    0    0    1
## 18        Kentucky    0    0    1    0    0    1    0    1
## 19       Louisiana    0    4    1    2    0    2    3    2
## 20           Maine    2    1    2    3    4    2    2    1
## 21        Maryland    5    3    7    7    2    5    5   10
## 22   Massachusetts   15   20   23   18   19   15   17   11
## 23        Michigan    3    2    4    2    3    3    2    2
## 24       Minnesota    9   11    9    6    6    2    7    6
## 25     Mississippi    0    0    0    0    0    1    0    0
## 26        Missouri    5    2    7    4    3    5    0    5
## 27         Montana    2    1    1    2    1    1    1    2
## 28        Nebraska    1    1    0    1    1    0    0    0
## 29          Nevada    3    0    2    0    2    3    1    0
## 30   New Hampshire    4    1    5    3    2    6    8    3
## 31      New Jersey   13    8    7    8    3    9    7    6
## 32      New Mexico    2    4    2    6    5    6    7    3
## 33        New York   26   27   25   23   27   21   24   17
## 34  North Carolina    3    4    3    3    3    1    2    4
## 35    North Dakota    0    0    0    0    0    0    0    0
## 36            Ohio    7    3    1    2    3    1    2    4
## 37        Oklahoma    1    0    1    1    0    1    5    1
## 38          Oregon   20   28   30   24   26   28   28   22
## 39    Pennsylvania    8    5    4    6    4    6    8    6
## 40    Rhode Island    4    2    3    1    2    0    1    0
## 41  South Carolina    3    0    0    1    1    0    1    1
## 42    South Dakota    0    0    0    1    1    0    0    0
## 43       Tennessee    2    2    2    1    3    1    4    2
## 44           Texas   16   14   16   19   12   14   14   17
## 45            Utah    1    1    0    2    1    4    4    1
## 46         Vermont    1    2    5    1    3    1    5    1
## 47        Virginia    2    1    5    4   12    1    5    5
## 48      Washington   28   22   32   30   32   19    8   26
## 49   West Virginia    0    0    0    1    1    1    0    0
## 50       Wisconsin    2    4    0    2    2    5    5    3
## 51         Wyoming    0    1    1    1    0    0    0    0
## 52 Foreign Schools   29   23   28   25   33   28   43   39
## 53           Total  347  330  368  373  374  320  356  347
# repeat above code but store results in a data frame
data <- 
webpage %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()
# we can now work with this data from the web as a data frame in R
# remove total row from data 
data <- 
  data %>% 
  filter(State!='Total')

# reshape data for plotting 
data_long <- melt(data, id='State')

# rename columns in long data frame 
colnames(data_long) <- c('State', 'Year', 'Matriculants')

# create and manipulate variables for plotting
data_long$Year <- as.numeric(as.character(data_long$Year))
data_long$year <- data_long$Year
data_long$state <- data_long$State
# plot data 
gvisMotionChart(data_long, "state", "year",
                yvar="Matriculants", xvar="Year",
                colorvar="State")
MotionChartID22035c576bf2
Data: data_long • Chart ID: MotionChartID22035c576bf2googleVis-0.5.6
R version 3.1.0 (2014-04-10) • Google Terms of UseDocumentation and Data Policy