Chapter 5 Web Scraping and Functions

Previous sections looked at how to obtain data that was already loaded into R or how to use pre-made functions to retrieve data from websites like FanGraphs and Baseball Savant. In this section, we will explore how to scrape data (using the rvest package) from other websites you find and how to write your own functions to improve this process.

5.1 Basic Web Scraping

5.1.1 Example: Scraping Baseball-Reference Draft Data

Here is code that scrapes data on the first round of the 2004 draft from Baseball-Reference. The url in the code refers to the Baseball-Reference page for that round.

library(rvest)

url <- "https://www.baseball-reference.com/draft/?year_ID=2004&draft_round=1&draft_type=junreg&query_type=year_round&from_type_jc=0&from_type_hs=0&from_type_4y=0&from_type_unk=0"
html <- read_html(url)

first_2004 <- html %>%
  html_element("table") %>%
  html_table()

The rvest package includes the functions we are using here. It is designed to make scraping data easier to do in R.

In the code above, we are doing the following:

  • Assigning the link to an object named url. This allows us to refer to the url in code without always needing to copy and paste it.
  • Using the read_html() function to download and store the html code from the url. This function requires an internet connection to grab the html code from the website.
  • Processing the html code in the last three lines to get the data we want.
    • We name the result first_2004 and start with our html object.
    • The html object is piped into the html_element() function, which finds the first element on the page that matches what we ask for (in our case, a table).
    • This is then piped into the html_table() function, which converts the html code for our table into a data frame in R.

After running this code, you will have a data frame in your R environment containing data from the entirety of the first round of the 2004 MLB draft.

Here is what the first few rows of the dataset should look like:

Year Rnd DT OvPck FrRnd RdPck Tm Signed Bonus Name Pos WAR G_Hitter AB HR BA OPS G_Pitcher W L ERA WHIP SV Type Drafted.Out.of
2004 1 NA 1 FrRnd 1 Padres Y $3,150,000 Matt Bush (minors) SS 1.7 8 0 0 NA NA 217 12 11 3.75 1.20 15 HS Mission Bay HS (San Diego, CA)
2004 1 NA 2 FrRnd 2 Tigers Y $3,120,000 Justin Verlander (minors) RHP 80.9 24 50 0 0.100 0.200 509 257 141 3.24 1.12 0 4Yr Old Dominion University (Norfolk, VA)
2004 1 NA 3 FrRnd 3 Mets Y $3,000,000 Philip Humber (minors) RHP 0.9 9 11 0 0.091 0.182 97 16 23 5.31 1.42 0 4Yr Rice University (Houston, TX)
2004 1 NA 4 FrRnd 4 Devil Rays Y $3,200,000 Jeff Niemann (minors) RHP 4.3 9 13 0 0.077 0.154 97 40 26 4.08 1.29 0 4Yr Rice University (Houston, TX)
2004 1 NA 5 FrRnd 5 Brewers Y $2,200,000 Mark Rogers (minors) RHP 1.1 12 16 0 0.250 0.625 11 3 1 3.49 1.12 0 HS Mount Ararat School (Topsham, ME)
2004 1 NA 6 FrRnd 6 Indians Y $2,475,000 Jeremy Sowers (minors) LHP 1.6 4 4 0 0.250 0.750 72 18 30 5.18 1.44 0 4Yr Vanderbilt University (Nashville, TN)
2004 1 NA 7 FrRnd 7 Reds Y $2,300,000 Homer Bailey (minors) RHP 6.2 208 373 0 0.164 0.375 245 81 86 4.56 1.37 0 HS La Grange HS (La Grange, TX)
2004 1 NA 8 FrRnd 8 Orioles N Wade Townsend (minors) RHP NA NA NA NA NA NA NA NA NA NA NA NA 4Yr Rice University (Houston, TX)
2004 1 NA 9 FrRnd 9 Rockies Y $2,150,000 Chris Nelson (minors) SS -2.6 282 834 16 0.265 0.699 NA NA NA NA NA NA HS Redan HS (Stone Mountain, GA)
2004 1 NA 10 FrRnd 10 Rangers Y $2,025,000 Thomas Diamond (minors) RHP -0.5 16 7 0 0.000 0.125 16 1 3 6.83 1.76 0 4Yr University of New Orleans (New Orleans, LA)
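
If you want to see what html_element() and html_table() do without an internet connection, the sketch below runs the same two steps on a tiny hand-written page. The page contents (Player A, Player B) are made up for illustration, and minimal_html() is an rvest helper that turns a string of html into an object like the one read_html() returns.

```r
library(rvest)

# A tiny hand-written page (made-up data) standing in for a real website
page <- minimal_html('
  <table>
    <tr><th>Name</th><th>Pos</th><th>OvPck</th></tr>
    <tr><td>Player A</td><td>SS</td><td>1</td></tr>
    <tr><td>Player B</td><td>RHP</td><td>2</td></tr>
  </table>
')

# html_element() returns the first element matching "table";
# html_table() converts that element into a data frame
toy_draft <- page %>%
  html_element("table") %>%
  html_table()

toy_draft
```

The first row of the html table becomes the column names, and each remaining row becomes one row of the data frame.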

5.2 Writing Functions

The code above allowed us to scrape data for a single round from a single year’s draft. If we wanted first round data from multiple years or multiple rounds from a single year, we would end up with very repetitive code.

One option is to write our own function to eliminate that repetition and make the process quicker. In the url from before, two specific parts control which year and which round we gather data from: the value after year_ID= (2004) and the value after draft_round= (1).

https://www.baseball-reference.com/draft/?year_ID=2004&draft_round=1&draft_type=junreg

To verify this, you could try replacing those two values and going to the webpage in your browser. (Note: The end of the url is cropped out above in order to fit it on the page.)

To get data from any draft year/round we want, we could write a function that replaces those two values with user-supplied values.

Here is some code that would do just that:

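scrape_draft <- function(year, round) {

  # Load rvest (stops with an error if the package is not installed)
  library(rvest)

  # Build the url by inserting the requested year and round
  url <- paste0("https://www.baseball-reference.com/draft/?year_ID=",
                year,
                "&draft_round=",
                round,
                "&draft_type=junreg&query_type=year_round&from_type_jc=0&from_type_hs=0&from_type_4y=0&from_type_unk=0")

  # Download the html code for the page
  data <- url %>%
    read_html()

  # Extract the first table on the page and convert it to a data frame
  draft_data <- data %>%
    html_element("table") %>%
    html_table()

  # Return the scraped data
  draft_data
}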

Inside function(), we specify that our function will have two arguments (year and round). These correspond to the year and round values in the url from before. The paste0() function puts these into the url in the right place, and the result is stored as our url object. From here, we can do the same thing we did before to scrape data for a chosen year/round.
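
To check the paste0() step on its own, we can pull it out into a small helper and look at the url it produces before scraping anything. The name build_draft_url() is hypothetical and is not part of the function above; it is only a sketch of the url-building step.

```r
# Hypothetical helper: builds the url only, so the paste0() step
# can be inspected without downloading anything
build_draft_url <- function(year, round) {
  paste0("https://www.baseball-reference.com/draft/?year_ID=",
         year,
         "&draft_round=",
         round,
         "&draft_type=junreg&query_type=year_round&from_type_jc=0&from_type_hs=0&from_type_4y=0&from_type_unk=0")
}

draft_url <- build_draft_url(year = 2006, round = 1)
draft_url
```

Pasting the printed url into a browser is a quick way to confirm that the year and round ended up in the right place.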

The code below is an example of how we can use the function we wrote. Remember that our new scrape_draft() function has two arguments: year and round. The code below uses the function to scrape Baseball-Reference for the first round of the 2006 draft.

first_2006 <- scrape_draft(year = 2006, round = 1)

Below are the first few rows to show you that everything worked properly.

Year Rnd DT OvPck FrRnd RdPck Tm Signed Bonus Name Pos WAR G_Hitter AB HR BA OPS G_Pitcher W L ERA WHIP SV Type Drafted.Out.of
2006 1 NA 1 FrRnd 1 Royals Y $3,500,000 Luke Hochevar (minors) RHP 3.7 18 16 0 0.063 0.125 279 46 65 4.98 1.34 3
2006 1 NA 2 FrRnd 2 Rockies Y $3,250,000 Greg Reynolds (minors) RHP -1.5 31 30 0 0.167 0.460 33 6 11 7.01 1.65 0 4Yr Stanford University (Palo Alto, CA)
2006 1 NA 3 FrRnd 3 Devil Rays Y $3,000,000 Evan Longoria (minors) 3B 58.6 1986 7306 342 0.264 0.804 NA NA NA NA NA NA 4Yr California State University, Long Beach (Long Beach, CA)
2006 1 NA 4 FrRnd 4 Pirates Y $2,750,000 Brad Lincoln (minors) RHP 0.4 53 38 0 0.237 0.520 99 9 11 4.74 1.39 1 4Yr University of Houston (Houston, TX)
2006 1 NA 5 FrRnd 5 Mariners Y $2,450,000 Brandon Morrow (minors) RHP 11.1 115 24 0 0.000 0.040 334 51 43 3.96 1.31 40 4Yr University of California, Berkeley (Berkeley, CA)
2006 1 NA 6 FrRnd 6 Tigers Y $3,550,000 Andrew Miller (minors) LHP 7.8 185 74 0 0.054 0.108 612 55 55 4.03 1.34 63 4Yr University of North Carolina at Chapel Hill (Chapel Hill, NC)
2006 1 NA 7 FrRnd 7 Dodgers Y $2,300,000 Clayton Kershaw (minors) LHP 79.9 357 698 1 0.162 0.390 425 210 92 2.48 1.00 0 HS Highland Park HS (Dallas, TX)
2006 1 NA 8 FrRnd 8 Reds Y $2,000,000 Drew Stubbs (minors) OF 7.9 911 2834 92 0.242 0.704 NA NA NA NA NA NA 4Yr University of Texas at Austin (Austin, TX)
2006 1 NA 9 FrRnd 9 Orioles Y $2,100,000 Billy Rowell (minors) 3B NA NA NA NA NA NA NA NA NA NA NA NA HS Bishop Eustace Preparatory School (Pennsauken, NJ)
2006 1 NA 10 FrRnd 10 Giants Y $2,025,000 Tim Lincecum (minors) RHP 19.5 262 474 0 0.112 0.300 278 110 89 3.74 1.29 1 4Yr University of Washington (Seattle, WA)

Just like the code we made at the beginning of this section, we are able to obtain a dataset containing all of the players drafted in the 1st round of the 2006 draft.

5.2.1 Example: Completing the rest of the Draft Dataset

Now, let’s finalize the Draft Dataset we used in the first section of this chapter. We are looking to use every first round pick from the years 2004-2013. To do this, we can use the function we created to easily select the years of our choice. Then, we can use the rbind() function to put it all together.

first_2004 <- scrape_draft(year = 2004, round = 1)
first_2005 <- scrape_draft(year = 2005, round = 1)
first_2006 <- scrape_draft(year = 2006, round = 1)
first_2007 <- scrape_draft(year = 2007, round = 1)
first_2008 <- scrape_draft(year = 2008, round = 1)
first_2009 <- scrape_draft(year = 2009, round = 1)
first_2010 <- scrape_draft(year = 2010, round = 1)
first_2011 <- scrape_draft(year = 2011, round = 1)
first_2012 <- scrape_draft(year = 2012, round = 1)
first_2013 <- scrape_draft(year = 2013, round = 1)

all_draft <- rbind(first_2004, first_2005, first_2006, first_2007, first_2008,
                   first_2009, first_2010, first_2011, first_2012, first_2013)

As you can see, each first round pick from 2004-2013 has been combined into a single dataset. In a later section, we will talk about how to create loops which will make this process even faster.
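
For a closer look at what rbind() is doing, here is a minimal sketch with two small, made-up data frames. It assumes, as with our scraped tables, that every data frame has the same column names.

```r
# Two small made-up data frames with identical column names
picks_a <- data.frame(Year = c(2004, 2004), OvPck = c(1, 2))
picks_b <- data.frame(Year = c(2005, 2005), OvPck = c(1, 2))

# rbind() stacks the rows of picks_b underneath the rows of picks_a
both <- rbind(picks_a, picks_b)
both

# The combined data frame has the rows of both inputs
nrow(both)
```

If the column names differ between data frames, rbind() will stop with an error, so it is worth checking names() on each piece when a combined scrape fails.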

Note: This data is currently stored at the GitHub link here.

5.3 More Difficult Web Scraping

In the first example about MLB drafts, finding the table was easy because there was only one table on the web page. However, you may find websites with multiple tables on a page, which can cause trouble when trying to get the correct one into R.

Let’s explore the Korea Baseball Organization (KBO) Wikipedia page to learn more.

As you can see on this page, there are plenty of tables throughout the page. Let’s say that we want to use the table with each team’s stadium, capacity and year founded.

This Stanford resource is very informative and gives a more detailed example of using a CSS Selector. We will work through the example on the KBO web page here.

Like the previous example, we need to store the url as an object in R. However, we will also need to find a CSS Selector that corresponds to our specific table. To find the CSS Selector, we have to examine the html code within the web page. Below are the steps taken to do this:

  • First, right-click on the table we decided to work with.
  • Navigate to the inspect option and click it.
  • We will then see the code that has created the webpage and this is where we will find what we need.
  • Hover over each line of code in the new window until the line of code you are hovering over highlights the entire table needed. Note: It is likely that this line starts with the word “table”.
  • Then, right-click this code and choose the “Copy Selector” option.
  • Paste the copied code into an object as shown below:

url <- "https://en.wikipedia.org/wiki/KBO_League"

css_selector <- "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(75)"

So far, we have stored our url and our copied selector in objects. From here, our code will look like the draft data scraping example shown above, and it again uses functions from the rvest package. The main change from the prior example is what goes inside the html_element() function.

Instead of “table”, we will put in our CSS Selector as shown below.

library(rvest)

KBO_data <- url %>% 
    read_html() %>% 
    html_element(css = css_selector) %>% 
    html_table()

Using this process, we were able to choose a specific table on a webpage instead of defaulting to the first table. Below you can see the data we scraped from the Wikipedia page:

Team           City      Stadium                      Capacity  Founded  Joined
Doosan Bears   Seoul     Jamsil Baseball Stadium      25,000    1982     1982
Hanwha Eagles  Daejeon   Hanwha Life Eagles Park      13,000    1985     1986
Kia Tigers     Gwangju   Gwangju-Kia Champions Field  20,500    1982     1982
Kiwoom Heroes  Seoul     Gocheok Sky Dome             16,744    2008     2008
KT Wiz         Suwon     Suwon kt wiz Park            20,000    2013     2015
LG Twins       Seoul     Jamsil Baseball Stadium      25,000    1982     1982
Lotte Giants   Busan     Sajik Baseball Stadium       24,500    1975     1982
NC Dinos       Changwon  Changwon NC Park             22,112    2011     2013
Samsung Lions  Daegu     Daegu Samsung Lions Park     24,000    1982     1982
SSG Landers    Incheon   Incheon SSG Landers Field    23,000    2000     2000
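
A copied selector that counts position on the page, like table:nth-child(75), can stop matching if the page is edited. The sketch below shows two other ways to pick one table out of several, using a small made-up page so it runs offline. The table ids (standings, stadiums) and team names are assumptions for illustration; a real page may not give its tables ids at all.

```r
library(rvest)

# Made-up page with two tables; the second one is the one we want
page <- minimal_html('
  <table id="standings">
    <tr><th>Team</th><th>Wins</th></tr>
    <tr><td>Team A</td><td>80</td></tr>
  </table>
  <table id="stadiums">
    <tr><th>Team</th><th>Capacity</th></tr>
    <tr><td>Team A</td><td>25,000</td></tr>
    <tr><td>Team B</td><td>13,000</td></tr>
  </table>
')

# Option 1: a CSS selector that targets the table by its id attribute
stadiums <- page %>%
  html_element(css = "#stadiums") %>%
  html_table()

# Option 2: collect every table into a list, then pick one by position
all_tables <- page %>%
  html_elements("table") %>%
  html_table()

length(all_tables)
stadiums_by_position <- all_tables[[2]]
```

Note the difference between html_element(), which returns a single match, and html_elements(), which returns all matches. Looking through the list from option 2 is often a quick way to find the table you need on an unfamiliar page.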

5.4 Data Scraping Ethics

Now that we’ve finished the section on web scraping, it is important to note some of the ethics involved. For a more comprehensive look into the ethics of web scraping, R for Data Science is a great resource.

The legality and ethics behind web scraping are quite complicated. However, it is a good rule of thumb to make sure that the data you are scraping is:

  • Public
  • Non-personal
  • Accurate

When accessing a website, Terms and Conditions often pop up. These are a way for pages to have some sort of legal claim to the data on their page. We must respect these terms as much as possible and should not proceed with scraping when they prohibit it.

Additionally, websites with personal data on their pages should not be scraped at any time. While not totally illegal, scraping personal data is ethically very hazy and should be avoided at all costs.

Finally, we must also be careful not to overload a website’s servers. Because scraping often involves accessing a page several times in quick succession, and the servers were likely not meant to be accessed that way, we must remain courteous and be careful about how much and how often we access a webpage.
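
One simple way to space out requests is to pause between them with Sys.sleep(). The sketch below is only an illustration: scrape_with_pause() is a hypothetical helper, the default pause of 5 seconds is an arbitrary assumption rather than a recommendation from any website, and the demonstration uses a stand-in function so that nothing is actually downloaded.

```r
# Hypothetical helper: apply a scraping function to several inputs,
# pausing between requests so the server is not hit in rapid succession
scrape_with_pause <- function(inputs, scrape_fun, pause_seconds = 5) {
  results <- vector("list", length(inputs))
  for (i in seq_along(inputs)) {
    results[[i]] <- scrape_fun(inputs[[i]])
    # Wait before the next request (skipped after the last one)
    if (i < length(inputs)) Sys.sleep(pause_seconds)
  }
  results
}

# Offline demonstration with a stand-in function instead of a real scraper
fake_scrape <- function(year) data.frame(Year = year)
demo <- scrape_with_pause(2004:2006, fake_scrape, pause_seconds = 0.1)
length(demo)
```

In real use, scrape_fun would be a function that downloads one page, and the pause would be chosen with the website's terms in mind.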