From the internet (with rvest)

If you find data on a website that you would like to use, the rvest package can bring that table into R. (Scraping the data saves you the frustration of copying and pasting it from a webpage, and it also makes your work less error-prone and more reproducible.)

Checking for Permission

Before you take data from a website, always make sure you are allowed to scrape and analyze it. You can sometimes find this information by digging into the webpage, but luckily there’s an R package that will do the checking for you. It is called robotstxt, and while it has many functions, the one we’ll be using is called paths_allowed().

First, install the robotstxt package:

install.packages("robotstxt")

and then use the function paths_allowed() with the URL of the website to check whether scraping is allowed.

robotstxt::paths_allowed("url_to_website")

This will return either TRUE or FALSE.
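
For example, to check the Reed page we will scrape later in this section:

robotstxt::paths_allowed("https://www.reed.edu/ir/success.html")

A result of TRUE means the site’s robots.txt file does not block scraping of that path.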

You can read more about scraping data from the web here. If you still have questions about the legality of a web-scraping workflow after reading this documentation, you can contact Reed’s data librarian.

Scraping the Data

Once we know that we are allowed to scrape data from our webpage, we can use the rvest package to do the scraping.

First, install and load the rvest package:

install.packages("rvest")
library(rvest)

This example uses a page reporting on the occupations of Reed alumni.

Next, save the URL of this site in a variable so that it is easy to use later:

url <- "https://www.reed.edu/ir/success.html"
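
As a preview, here is a minimal sketch of the full workflow. It assumes the table of interest is the first table element on the page; the steps that follow show how to locate your table precisely if it is not:

page <- read_html(url)       # download and parse the page's HTML
page |>
  html_element("table") |>   # select the first table on the page (an assumption)
  html_table()               # convert it to a data frame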

In order to access the data, R needs to know not only the URL of the website that contains the table, but also where on that page the table is located. To point R to the data:

  1. Right-click (Control-click on a Mac) on the table you want to bring into R, and choose “Inspect” from the menu that appears. This will open a panel on the right side of the window.