Chapter 3 Data Cleaning
Shannon
janitor/clean_names, parsing/stringr
3.1 Data Import
We use these three packages in nearly every script.
library(tidyverse)
library(here)
library(janitor)
With all data import, follow a few steps:
- Stick with the functions from readr, a tidyverse package. readr functions (e.g.,
read_csv()
vs.read.csv()
from base R) are faster, more reproducible, and handle certain variable types better. - Use the
here::here()
function for readability and greater compatibility across computers - Especially for source data, pipe in
janitor::clean_names()
to tidy column names. This function removes special characters and white space from all column names.
Note, for certain file types you may need additional packages. Google sheets require googlesheets4, excel requires readxl
<- read_csv(here::here("data", "link2feed-2021-08-19.csv")) %>%
raw_df clean_names()
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## `Visit Date` = col_date(format = ""),
## `Client ID` = col_double(),
## `Client Status` = col_character(),
## `Household ID` = col_double(),
## `Household Size` = col_double(),
## `Visited Agency` = col_character(),
## `Program Name` = col_character(),
## Delivery = col_character(),
## `Individuals Served` = col_double()
## )
## Warning: 8412 parsing failures.
## row col expected actual file
## 1 -- 9 columns 10 columns '/Users/shannoncarter/Documents/JanuaryAdvisors/ja_ds_cookbook/data/link2feed-2021-08-19.csv'
## 2 -- 9 columns 10 columns '/Users/shannoncarter/Documents/JanuaryAdvisors/ja_ds_cookbook/data/link2feed-2021-08-19.csv'
## 3 -- 9 columns 10 columns '/Users/shannoncarter/Documents/JanuaryAdvisors/ja_ds_cookbook/data/link2feed-2021-08-19.csv'
## 4 -- 9 columns 10 columns '/Users/shannoncarter/Documents/JanuaryAdvisors/ja_ds_cookbook/data/link2feed-2021-08-19.csv'
## 5 -- 9 columns 10 columns '/Users/shannoncarter/Documents/JanuaryAdvisors/ja_ds_cookbook/data/link2feed-2021-08-19.csv'
## ... ... ......... .......... .............................................................................................
## See problems(...) for more details.
For basic inspection, I start with two functions, View()
and str()
.
View(raw_df)
View()
opens your entire dataset in a new window and allows sorting and filtering. I do a couple checks here:
* note number of rows and columns
* note missing data - where is data missing? Is this expected?
* check min/max of numeric columns - are they in expected ranges? are columns actually numeric?
str(raw_df)
## spec_tbl_df [8,412 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ visit_date : Date[1:8412], format: "2021-01-06" "2021-01-06" ...
## $ client_id : num [1:8412] 2745118 3512281 5919937 5350496 1733779 ...
## $ client_status : chr [1:8412] "active" "active" "active" "active" ...
## $ household_id : num [1:8412] 1424341 1875857 3260606 2937631 864430 ...
## $ household_size : num [1:8412] 1 2 4 1 2 1 1 4 1 1 ...
## $ visited_agency : chr [1:8412] "Wesley Community Center" "Wesley Community Center" "Wesley Community Center" "Wesley Community Center" ...
## $ program_name : chr [1:8412] "TEFAP Pantry Visit" "TEFAP Pantry Visit" "TEFAP Pantry Visit" "TEFAP Pantry Visit" ...
## $ delivery : chr [1:8412] "No" "No" "No" "No" ...
## $ individuals_served: num [1:8412] 1 2 4 1 2 1 1 4 1 1 ...
## - attr(*, "problems")= tibble [8,412 × 5] (S3: tbl_df/tbl/data.frame)
## ..$ row : int [1:8412] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ col : chr [1:8412] NA NA NA NA ...
## ..$ expected: chr [1:8412] "9 columns" "9 columns" "9 columns" "9 columns" ...
## ..$ actual : chr [1:8412] "10 columns" "10 columns" "10 columns" "10 columns" ...
## ..$ file : chr [1:8412] "'/Users/shannoncarter/Documents/JanuaryAdvisors/ja_ds_cookbook/data/link2feed-2021-08-19.csv'" "'/Users/shannoncarter/Documents/JanuaryAdvisors/ja_ds_cookbook/data/link2feed-2021-08-19.csv'" "'/Users/shannoncarter/Documents/JanuaryAdvisors/ja_ds_cookbook/data/link2feed-2021-08-19.csv'" "'/Users/shannoncarter/Documents/JanuaryAdvisors/ja_ds_cookbook/data/link2feed-2021-08-19.csv'" ...
## - attr(*, "spec")=
## .. cols(
## .. `Visit Date` = col_date(format = ""),
## .. `Client ID` = col_double(),
## .. `Client Status` = col_character(),
## .. `Household ID` = col_double(),
## .. `Household Size` = col_double(),
## .. `Visited Agency` = col_character(),
## .. `Program Name` = col_character(),
## .. Delivery = col_character(),
## .. `Individuals Served` = col_double()
## .. )
str()
gives a compact view of your data structure and some summary metrics. Here, check:
* Are dates in date format?
* Should any character variables be changed to factors with defined levels?
* It can be helpful for IDs to be character variables, so you can search for them with View()
Once you’ve got a feel for the data, take inventory of steps needed to get to the desired output, including: * strings that need to be parsed or separated * final shape of the data - do any columns need to be pivoted longer or wider? * joins with other data -
3.2 Data Cleaning
library(stringr)