3 Data Importing and Tidying
Before we begin, the following packages are loaded to help us with all stages of this project.
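The package list itself is not reproduced here. Judging from the functions called in this section, a setup along the following lines would probably be enough; the exact set of packages is an assumption, not the original list.

```r
# Assumed package list, inferred from the functions this section calls
pkgs <- c(
  "rvest",      # read_html(), html_table()
  "dplyr",      # %>%, filter(), mutate(), select(), full_join()
  "tidyr",      # separate()
  "knitr",      # kable()
  "kableExtra"  # kable_styling()
)

# Find any packages that are not installed yet
not_installed <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(not_installed) > 0) {
  message("Install first: ", paste(not_installed, collapse = ", "))
} else {
  # Attach every package so its functions can be called unqualified
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```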
We also write this helper function, which will help us with table formatting later on.
mykable <- function(df){ # takes in a data frame and prints a nicely formatted table
  kable(df) %>%
    kable_styling("striped", full_width = FALSE)
}
The data for this project is harvested from the website Basketball-Reference.com. Basketball Reference contains data on NBA players, scores, team statistics, historical records, and much other NBA-related information. It is a great resource for any kind of NBA statistical analysis and is very popular in the NBA world: many NBA analysts, writers, podcasters, and fans across America turn to this site first for information. David Leonhardt of the New York Times once praised Basketball-Reference’s database as “a gift straight from the basketball gods.”
For this particular project, the site is extremely useful, since it has nearly every piece of information about an NBA player, from his statistics and awards to his height, weight, nickname(s), and even his Twitter username. We are interested in Russell Westbrook’s statistics, in particular his triple-doubles, during his last three seasons with the Oklahoma City Thunder: 2016-17, 2017-18, and 2018-19. To that end, we use web scraping to collect Russ’ statistics for those three seasons. We take advantage of scraping functions from the rvest package in R and write a function to grab the statistics table for each of Russ’ last 3 OKC campaigns.
This function takes in the ending year of the season, executes some data-harvesting tasks to read in the data from the webpage, and performs some table transformations, including renaming, creating new variables and selecting columns that are needed in further steps.
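As a small illustration of the first step, here is how the page URL and the season label can be derived from an ending year. This is only a sketch of those two string operations (the variable names `url` and `season` are made up for the example), and nothing is downloaded.

```r
Year <- 2017  # ending year of the 2016-17 season

# Build the game-log URL; paste0() is shorthand for paste(..., sep = "")
url <- paste0("https://www.basketball-reference.com/players/w/westbru01/gamelog/", Year, "/")

# Season label: starting year, a dash, then the ending year
season <- paste0(Year - 1, "-", Year)

print(url)
print(season)
```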
RussData <- function(Year){
  # get url
  RussURL <- paste("https://www.basketball-reference.com/players/w/westbru01/gamelog/", Year, "/", sep = "")
  # grab all content from the raw html file
  RussHTML <- read_html(RussURL)
  # parse html code into data tables, use index to get our desired table
  RussStats <- html_table(RussHTML, fill = TRUE)[[8]]
  # rename the two unnamed columns (home/away marker and game result)
  names(RussStats)[6] <- "side"
  names(RussStats)[8] <- "result"
  RussStats <- filter(RussStats, GS == "1") # filter out the games he didn't play
  RussStats[RussStats == "CHO"] <- "CHA" # correct abbreviation for Charlotte
  # create a vector of teams from the Western Conference for further usage
  WestTeams <- c("LAL", "UTA", "LAC", "DEN", "DAL", "HOU", "OKC", "MEM",
                 "SAS", "PHO", "POR", "NOP", "SAC", "MIN", "GSW")
  # make sure all the stats are of numeric type
  # ([[i]] extracts the column as a vector for both data frames and tibbles)
  for (i in 11:ncol(RussStats)){
    RussStats[[i]] <- as.numeric(RussStats[[i]])
  }
  # transforming the table
  RussStats <- RussStats %>%
    separate(MP, into = c("Min", "Sec")) %>% # separate the minutes variable (min:sec)
    mutate(Season = paste(Year - 1, "-", Year, sep = ""), # create season variable
           Side = ifelse(side == "@", "Away", "Home"), # which side OKC is
           Result = ifelse(grepl("W", result), "Win", "Loss"), # game result for OKC
           OppConf = ifelse(Opp %in% WestTeams, "West", "East"), # opponent's conference
           Minutes = round(as.integer(Min) + as.integer(Sec)/60, 2), # Russ' playing time in minutes
           TripDbl = ifelse(PTS >= 10 & AST >= 10 & TRB >= 10, "Yes", "No")) %>% # triple-double
    select(Season, Result, Side, Opp, OppConf, Minutes,
           FG, FGA, `FG%`, `3P`, `3PA`, `3P%`, FT, FTA, `FT%`,
           ORB, DRB, TRB, PTS, AST, TripDbl,
           STL, BLK, TOV, PF, `+/-`)
}
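To make the derived variables concrete without hitting the website, the sketch below recomputes three of them in base R on a tiny hand-built data frame. The opponent and box-score numbers are taken from the first two rows of the preview table in this section; the `MP` strings ("36:23", "45:19") are assumed values chosen to be consistent with the minutes shown there, and `toy` and `west_teams` are names made up for this example.

```r
# Two games in the raw "min:sec" layout (MP strings are assumed, see lead-in)
toy <- data.frame(
  Opp = c("PHI", "PHO"),
  MP  = c("36:23", "45:19"),
  PTS = c(32, 51),
  TRB = c(12, 13),
  AST = c(9, 10),
  stringsAsFactors = FALSE
)

west_teams <- c("LAL", "UTA", "LAC", "DEN", "DAL", "HOU", "OKC", "MEM",
                "SAS", "PHO", "POR", "NOP", "SAC", "MIN", "GSW")

# Convert "min:sec" into decimal minutes, rounded to 2 places
parts <- strsplit(toy$MP, ":", fixed = TRUE)
toy$Minutes <- vapply(
  parts,
  function(p) round(as.integer(p[1]) + as.integer(p[2]) / 60, 2),
  numeric(1)
)

# Opponent's conference: West if the team is in the list, East otherwise
toy$OppConf <- ifelse(toy$Opp %in% west_teams, "West", "East")

# Triple-double flag: at least 10 points, 10 assists, and 10 rebounds
toy$TripDbl <- ifelse(toy$PTS >= 10 & toy$AST >= 10 & toy$TRB >= 10, "Yes", "No")

print(toy[, c("Opp", "OppConf", "Minutes", "TripDbl")])
```

The first game falls one assist short (9), so it is flagged "No", while the second clears all three thresholds.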
Now we want to combine the data from all three seasons into one big data table. To do this, we use a for loop to iterate over the year range and join the tables together. A full_join is used here to append the tables, after the full dataset is initialized with the data from the first season.
RussStats <- RussData(2017) # get data from the first season
for (Year in 2018:2019) {
  RussStats <- RussStats %>%
    full_join(RussData(Year)) # full_join to combine all the tables
}
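Because every season table has the same columns, the join above effectively stacks the tables on top of each other. The sketch below shows that row-wise append in base R, using a hypothetical stand-in function `fake_season()` with invented point totals in place of the scraper; with dplyr loaded, `bind_rows()` would be another way to do the same stacking.

```r
# Hypothetical stand-in for RussData(): two made-up games per season, no scraping
fake_season <- function(year) {
  data.frame(
    Season = paste0(year - 1, "-", year),
    PTS    = c(20, 30),
    stringsAsFactors = FALSE
  )
}

# Build one table per ending year, then stack them row-wise
season_tables <- lapply(2017:2019, fake_season)
all_seasons   <- do.call(rbind, season_tables)

# Count games per season to confirm every season made it in
print(table(all_seasons$Season))
```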
The table tidying and transforming tasks are complete, and here’s a quick glimpse of our data table:
Season | Result | Side | Opp | OppConf | Minutes | FG | FGA | FG% | 3P | 3PA | 3P% | FT | FTA | FT% | ORB | DRB | TRB | PTS | AST | TripDbl | STL | BLK | TOV | PF | +/- |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2016-2017 | Win | Away | PHI | East | 36.38 | 11 | 21 | 0.524 | 1 | 2 | 0.500 | 9 | 11 | 0.818 | 1 | 11 | 12 | 32 | 9 | No | 0 | 0 | 2 | 2 | 10 |
2016-2017 | Win | Home | PHO | West | 45.32 | 17 | 44 | 0.386 | 2 | 10 | 0.200 | 15 | 20 | 0.750 | 3 | 10 | 13 | 51 | 10 | Yes | 2 | 0 | 5 | 3 | 7 |
2016-2017 | Win | Home | LAL | West | 33.65 | 11 | 21 | 0.524 | 5 | 6 | 0.833 | 6 | 6 | 1.000 | 4 | 7 | 11 | 33 | 16 | Yes | 1 | 1 | 7 | 3 | 10 |
The attributes of this table are:
- Season: 2016-2017, 2017-2018, 2018-2019
- Result: whether Russ’ team (OKC) recorded a win or a loss
- Side: whether OKC is playing at home or on the road
- Opp: opponent - the team that is playing against OKC
- OppConf: the conference that the opponent belongs to. The NBA consists of 30 teams, divided into 2 conferences, the Eastern Conference and the Western Conference
- Minutes: Russ’ total game playing time, in minutes
- FG: number of Field Goals Russ made in a game. In basketball, a made “field goal” is a basket scored on any shot other than a free throw.
- FGA: number of Field Goals attempted by Russ
- FG%: Russ’ Field Goal Percentage in a game (FG/FGA)
- 3P, 3PA, 3P%: number of 3-pointers made, attempted, and 3-point percentage in a game
- FT, FTA, FT%: number of free throws made, attempted, and free throw percentage in a game
- ORB, DRB, TRB: number of Offensive Rebounds, Defensive Rebounds, and Total Rebounds (ORB + DRB) in a game
- PTS: number of Points Russ scored
- AST: number of Assists Russ had
- TripDbl: whether Russ recorded a triple-double
- STL: number of Steals
- BLK: number of Blocks
- TOV: number of Turnovers (losing possession of the ball to the opposing team before attempting a shot)
- PF: number of Personal Fouls
- +/- (plus/minus): a measure of a player’s contribution to his team in a game. It is calculated by counting up the points scored by the player’s team and the points scored against the player’s team while that player is on the floor, then subtracting points against from points for. A positive +/- value for a player means his team outscored the opponent by that many points while he was on the court, whereas a negative +/- indicates that the opposing team outscored his team by that many points while he was playing.
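The derived columns in this list are simple arithmetic, which the short sketch below reproduces. The shooting and rebounding numbers come from the first two rows of the preview table in this section; the on-court point totals used for plus/minus are hypothetical, picked only so that the difference comes out to +10.

```r
# Made and attempted field goals, offensive and defensive rebounds (first two table rows)
fg  <- c(11, 17)
fga <- c(21, 44)
orb <- c(1, 3)
drb <- c(11, 10)

fg_pct <- round(fg / fga, 3)  # FG% = FG / FGA        -> 0.524 0.386
trb    <- orb + drb           # TRB = ORB + DRB       -> 12 13

# Plus/minus: points for minus points against while the player is on the floor
team_pts_on_court <- 58  # hypothetical
opp_pts_on_court  <- 48  # hypothetical
plus_minus <- team_pts_on_court - opp_pts_on_court  # -> 10

print(fg_pct)
print(trb)
print(plus_minus)
```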
After finishing our data wrangling jobs, we are now ready to explore what’s interesting behind our data.