If you find data that you would like to use on a website, the rvest package can bring that table into R. (Scraping data saves you the frustration of copying and pasting data from a webpage, and it also makes your work less error-prone and more reproducible.)
Before you take data from a website, always make sure you are allowed to scrape and analyze the data. You can sometimes find this information by digging into the webpage, but luckily there's an R package that will do this for you. It is called robotstxt, and while it has a great number of functions, the one that we'll be using is called paths_allowed().
First, install the robotstxt package:
install.packages("robotstxt")
and then use the function paths_allowed() with the URL of the website to check whether scraping is allowed:
robotstxt::paths_allowed("url_to_website")
This will return either TRUE or FALSE.
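For example, you could check the page this guide scrapes below. (Whether this returns TRUE depends on the site's current robots.txt, so treat the result shown in the comment as an assumption rather than a guarantee.)
robotstxt::paths_allowed("https://www.reed.edu/ir/success.html")
# returns TRUE when the site's robots.txt permits scraping this path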
You can read more about scraping data from the web here. If you still have questions about the legality of a web-scraping workflow after reading this documentation, you can contact Reed’s data librarian.
Once we know that we are allowed to scrape data from our webpage, we can use a package called rvest to actually do the scraping.
First, install and load the rvest package:
install.packages("rvest")
library(rvest)
This example works with a page reporting on the different occupations of Reed alumni.
First, save the URL of this site under a variable name so that it is easy to use later:
url <- "https://www.reed.edu/ir/success.html"
In order to access the data, R needs to know not only the URL of the website that contains the table, but also where on that page the table is located. To point R to the data:
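As a sketch of that step, here is one way it might look, assuming the table we want is the first table element on the page. (The "table" CSS selector and the occupations name are placeholders; a real page may need a more specific selector, which you can find with your browser's inspector.)
# read the whole page, locate the first table, and parse it into a data frame
occupations <- read_html(url) |>
  html_element("table") |>
  html_table()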