Data Science Pipelines

“Programs must be written for people to read, and only incidentally for machines to execute.”

– Harold Abelson, Structure and Interpretation of Computer Programs

Setup: Packages

# Packages
library(tidyverse)
## -- Attaching packages ------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.0.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.6
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts --------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(fgeo.data)

Setup: Data

tree <- read_csv("data/tree.csv")
## Parsed with column specification:
## cols(
##   treeID = col_integer(),
##   stemID = col_integer(),
##   tag = col_integer(),
##   StemTag = col_integer(),
##   sp = col_character(),
##   quadrat = col_integer(),
##   gx = col_double(),
##   gy = col_double(),
##   MeasureID = col_integer(),
##   CensusID = col_integer(),
##   dbh = col_double(),
##   pom = col_double(),
##   hom = col_double(),
##   ExactDate = col_integer(),
##   DFstatus = col_character(),
##   codes = col_character(),
##   nostems = col_integer(),
##   status = col_character(),
##   date = col_character()
## )

Setup: Data

tree
## # A tibble: 1,004 x 19
##    treeID stemID    tag StemTag sp    quadrat    gx    gy MeasureID
##     <int>  <int>  <int>   <int> <chr>   <int> <dbl> <dbl>     <int>
##  1    104    143  10009   10009 DACE~     113  10.3  245.    439947
##  2    119    158 100104  100104 MYRS~    1021 183.   410.    466597
##  3    180    225 100171  100174 CASA~     921 165.   410.    466623
##  4    602    736 100649  100649 GUAG~     821 149.   414.    466727
##  5    631    775  10069   10069 PREM~     213  38.3  245.    439989
##  6    647    793 100708  100708 SCHM~     821 143.   411.    466743
##  7   1086   1339  10122   10122 DRYG~     413  68.9  253.    440021
##  8   1144   1410 101285  101285 SCHM~     920 161.   395.    466889
##  9   1168   1438  10131   10131 DACE~     413  70.6  252.    440031
## 10   1380 114352 101560  149529 CASA~     820 142.   386.    466957
## # ... with 994 more rows, and 10 more variables: CensusID <int>,
## #   dbh <dbl>, pom <dbl>, hom <dbl>, ExactDate <int>, DFstatus <chr>,
## #   codes <chr>, nostems <int>, status <chr>, date <chr>

Piping

What problem does it solve? Without the pipe, every intermediate result needs its own name.

a1 <- group_by(tree, sp)
a2 <- select(a1, sp, quadrat, dbh)
a3 <- summarise(a2, dbh_mean = mean(dbh, na.rm = TRUE))
a4 <- filter(a3, dbh_mean > 350)
a4
## # A tibble: 2 x 2
##   sp     dbh_mean
##   <chr>     <dbl>
## 1 CYRRAC      357
## 2 ROYBOR      364

Piping

What problem does it solve? Without the pipe or intermediate names, nested calls must be read inside out.

filter(
  summarise(
    select(
      group_by(tree, sp), 
      sp, quadrat, dbh
    ), 
    dbh_mean = mean(dbh, na.rm = TRUE)
  ), 
  dbh_mean > 350
)
## # A tibble: 2 x 2
##   sp     dbh_mean
##   <chr>     <dbl>
## 1 CYRRAC      357
## 2 ROYBOR      364

Piping

What problem does it solve? With the pipe, the steps read top to bottom, in the order they run.

tree %>% 
  group_by(sp) %>% 
  select(sp, quadrat, dbh) %>% 
  summarise(dbh_mean = mean(dbh, na.rm = TRUE)) %>% 
  filter(dbh_mean > 350)
## # A tibble: 2 x 2
##   sp     dbh_mean
##   <chr>     <dbl>
## 1 CYRRAC      357
## 2 ROYBOR      364
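
The three styles above are interchangeable because `x %>% f(y)` is evaluated as `f(x, y)`. The sketch below checks this on a small made-up table; `toy`, its values, and the cutoff of 250 are assumptions for illustration only and are not part of the tree data.

```r
library(dplyr)

# Hypothetical toy data standing in for `tree`
toy <- tibble(
  sp  = c("A", "A", "B"),
  dbh = c(100, 300, 400)
)

# Style 1: name every intermediate result
b1 <- group_by(toy, sp)
b2 <- summarise(b1, dbh_mean = mean(dbh))
stepwise <- filter(b2, dbh_mean > 250)

# Style 2: nest the calls (read from the inside out)
nested <- filter(
  summarise(group_by(toy, sp), dbh_mean = mean(dbh)),
  dbh_mean > 250
)

# Style 3: pipe each result into the next call
piped <- toy %>%
  group_by(sp) %>%
  summarise(dbh_mean = mean(dbh)) %>%
  filter(dbh_mean > 250)

# All three produce the same table
stopifnot(identical(stepwise, nested), identical(nested, piped))
```

Only the readability differs: the piped version lists the steps in the order they run and creates no throwaway names.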

Adapted from dplyr

https://dplyr.tidyverse.org/articles/dplyr.html#piping

Learn more at http://r4ds.had.co.nz/pipes.html