We load a range of libraries for general data wrangling and visualisation, together with more specialised tools for time series forecasting.
# general visualisation
library('ggplot2') # visualisation
library('scales') # visualisation
library('patchwork') # visualisation
library('RColorBrewer') # visualisation
library('corrplot') # visualisation
# general data manipulation
library('dplyr') # data manipulation
library('readr') # input/output
library('vroom') # input/output
library('skimr') # overview
library('tibble') # data wrangling
library('tidyr') # data wrangling
library('purrr') # data wrangling
library('stringr') # string manipulation
library('forcats') # factor manipulation
library('fuzzyjoin') # data wrangling
# specific visualisation
library('alluvial') # visualisation
library('ggrepel') # visualisation
library('ggforce') # visualisation
library('ggridges') # visualisation
library('gganimate') # animations
library('GGally') # visualisation
library('ggthemes') # visualisation
library('wesanderson') # visualisation
library('kableExtra') # display
# Date + forecast
library('lubridate') # date and time
library('forecast') # time series analysis
#library('prophet') # time series analysis
library('timetk') # time series analysis
# Interactivity
library('crosstalk')
library('plotly')
# parallel
library('foreach')
library('doParallel')
We also define a short helper function to compute binomial confidence intervals.
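Its implementation is not reproduced here; a minimal sketch of what such a helper could look like, using the exact Clopper-Pearson interval from base R's binom.test (the function name and interface are my own assumption, not the original code), is:
# Hypothetical helper (not the original implementation): binomial confidence
# interval for x successes out of n trials via the exact Clopper-Pearson method.
binom_ci <- function(x, n, conf = 0.95) {
  ci <- binom.test(x, n, conf.level = conf)$conf.int
  tibble(p = x / n, ci_lower = ci[1], ci_upper = ci[2])
}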
We use R’s new vroom package, which reads data as fast as its onomatopoeic name implies:
train <- vroom(str_c(path,'sales_train_validation.csv'), delim = ",", col_types = cols())
prices <- vroom(str_c(path,'sell_prices.csv'), delim = ",", col_types = cols())
calendar <- read_csv(str_c(path,'calendar.csv'), col_types = cols())
sample_submit <- vroom(str_c(path,'sample_submission.csv'), delim = ",", col_types = cols())
As a first step, let’s have a quick look at the data sets using the head, summary, and glimpse tools where appropriate.
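The preview commands themselves are not reproduced in this write-up; as a sketch, previews of the kind shown below could be generated along these lines:
train %>% select(1:10) %>% head(10)   # first rows and columns of the sales data
summary(prices)                       # numeric summary of the price data
glimpse(calendar)                     # column types and example values of the calendar
summary(calendar)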
Here are the first 10 columns and rows of our training sales data:
id | item_id | dept_id | cat_id | store_id | state_id | d_1 | d_2 | d_3 | d_4 |
---|---|---|---|---|---|---|---|---|---|
HOBBIES_1_001_CA_1_validation | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
HOBBIES_1_002_CA_1_validation | HOBBIES_1_002 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
HOBBIES_1_003_CA_1_validation | HOBBIES_1_003 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
HOBBIES_1_004_CA_1_validation | HOBBIES_1_004 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
HOBBIES_1_005_CA_1_validation | HOBBIES_1_005 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
HOBBIES_1_006_CA_1_validation | HOBBIES_1_006 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
HOBBIES_1_007_CA_1_validation | HOBBIES_1_007 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
HOBBIES_1_008_CA_1_validation | HOBBIES_1_008 | HOBBIES_1 | HOBBIES | CA_1 | CA | 12 | 15 | 0 | 0 |
HOBBIES_1_009_CA_1_validation | HOBBIES_1_009 | HOBBIES_1 | HOBBIES | CA_1 | CA | 2 | 0 | 7 | 3 |
HOBBIES_1_010_CA_1_validation | HOBBIES_1_010 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 1 | 0 |
We find:
There is one column each for the IDs of item, department, category, store, and state; plus a general ID that is a combination of the other IDs plus a flag for validation.
The sales per date are encoded as columns starting with the prefix d_. These are the number of units sold per day (not the total dollar amount).
We already see that there are quite a lot of zero values.
This data set has too many columns and rows to display them all. Here are its column and row counts:
## [1] 1919 30490
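The call behind these numbers is not shown; one way to report the column and row counts in that order would be:
c(ncol(train), nrow(train))   # number of columns, number of rows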
All IDs are marked as “validation”. This refers to the initial validation period of the competition, before we ultimately predict a different 28-day period.
train %>%
mutate(dset = if_else(str_detect(id, "validation"), "validation", "training")) %>%
count(dset)
## # A tibble: 1 x 2
## dset n
## <chr> <int>
## 1 validation 30490
This data set gives us the weekly sell prices per item and store:
store_id | item_id | wm_yr_wk | sell_price |
---|---|---|---|
CA_1 | HOBBIES_1_001 | 11325 | 9.58 |
CA_1 | HOBBIES_1_001 | 11326 | 9.58 |
CA_1 | HOBBIES_1_001 | 11327 | 8.26 |
CA_1 | HOBBIES_1_001 | 11328 | 8.26 |
CA_1 | HOBBIES_1_001 | 11329 | 8.26 |
CA_1 | HOBBIES_1_001 | 11330 | 8.26 |
CA_1 | HOBBIES_1_001 | 11331 | 8.26 |
CA_1 | HOBBIES_1_001 | 11332 | 8.26 |
CA_1 | HOBBIES_1_001 | 11333 | 8.26 |
CA_1 | HOBBIES_1_001 | 11334 | 8.26 |
## store_id item_id wm_yr_wk sell_price
## Length:6841121 Length:6841121 Min. :11101 Min. : 0.010
## Class :character Class :character 1st Qu.:11247 1st Qu.: 2.180
## Mode :character Mode :character Median :11411 Median : 3.470
## Mean :11383 Mean : 4.411
## 3rd Qu.:11517 3rd Qu.: 5.840
## Max. :11621 Max. :107.320
We find:
We have the store_id and item_id columns to link this data to our training and validation data (see the join sketch below).
Prices range from $0.01 to a bit more than $107.
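As an aside, here is a sketch (my own illustration, not code from the original analysis) of how the weekly prices can be expanded to daily prices via the calendar, using one example item from the preview above:
# Illustrative only: daily prices for one item in one store, obtained by joining
# the weekly price table to the calendar via the shared wm_yr_wk week index.
item_prices <- prices %>%
  filter(item_id == "HOBBIES_1_001", store_id == "CA_1") %>%
  left_join(calendar %>% select(date, wm_yr_wk, d), by = "wm_yr_wk") %>%
  select(item_id, store_id, date, d, sell_price)
The resulting d column can then be matched to the d_ day columns of the sales data.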
The calendar data gives us date features such as weekday, month, or year, alongside two sets of event features (name and type) and binary SNAP food-stamp flags for each of the three states:
date | wm_yr_wk | weekday | wday | month | year | d | event_name_1 | event_type_1 | event_name_2 | event_type_2 | snap_CA | snap_TX | snap_WI |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2011-01-29 | 11101 | Saturday | 1 | 1 | 2011 | d_1 | NA | NA | NA | NA | 0 | 0 | 0 |
2011-01-30 | 11101 | Sunday | 2 | 1 | 2011 | d_2 | NA | NA | NA | NA | 0 | 0 | 0 |
2011-01-31 | 11101 | Monday | 3 | 1 | 2011 | d_3 | NA | NA | NA | NA | 0 | 0 | 0 |
2011-02-01 | 11101 | Tuesday | 4 | 2 | 2011 | d_4 | NA | NA | NA | NA | 1 | 1 | 0 |
2011-02-02 | 11101 | Wednesday | 5 | 2 | 2011 | d_5 | NA | NA | NA | NA | 1 | 0 | 1 |
2011-02-03 | 11101 | Thursday | 6 | 2 | 2011 | d_6 | NA | NA | NA | NA | 1 | 1 | 1 |
2011-02-04 | 11101 | Friday | 7 | 2 | 2011 | d_7 | NA | NA | NA | NA | 1 | 0 | 0 |
2011-02-05 | 11102 | Saturday | 1 | 2 | 2011 | d_8 | NA | NA | NA | NA | 1 | 1 | 1 |
## Observations: 1,969
## Variables: 14
## $ date <date> 2011-01-29, 2011-01-30, 2011-01-31, 2011-02-01, 2011-...
## $ wm_yr_wk <dbl> 11101, 11101, 11101, 11101, 11101, 11101, 11101, 11102...
## $ weekday <chr> "Saturday", "Sunday", "Monday", "Tuesday", "Wednesday"...
## $ wday <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, ...
## $ month <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ year <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, ...
## $ d <chr> "d_1", "d_2", "d_3", "d_4", "d_5", "d_6", "d_7", "d_8"...
## $ event_name_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "SuperBowl", NA, NA, N...
## $ event_type_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Sporting", NA, NA, NA...
## $ event_name_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ event_type_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ snap_CA <dbl> 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, ...
## $ snap_TX <dbl> 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, ...
## $ snap_WI <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, ...
## date wm_yr_wk weekday wday
## Min. :2011-01-29 Min. :11101 Length:1969 Min. :1.000
## 1st Qu.:2012-06-04 1st Qu.:11219 Class :character 1st Qu.:2.000
## Median :2013-10-09 Median :11337 Mode :character Median :4.000
## Mean :2013-10-09 Mean :11347 Mean :3.997
## 3rd Qu.:2015-02-13 3rd Qu.:11502 3rd Qu.:6.000
## Max. :2016-06-19 Max. :11621 Max. :7.000
## month year d event_name_1
## Min. : 1.000 Min. :2011 Length:1969 Length:1969
## 1st Qu.: 3.000 1st Qu.:2012 Class :character Class :character
## Median : 6.000 Median :2013 Mode :character Mode :character
## Mean : 6.326 Mean :2013
## 3rd Qu.: 9.000 3rd Qu.:2015
## Max. :12.000 Max. :2016
## event_type_1 event_name_2 event_type_2 snap_CA
## Length:1969 Length:1969 Length:1969 Min. :0.0000
## Class :character Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Mode :character Median :0.0000
## Mean :0.3301
## 3rd Qu.:1.0000
## Max. :1.0000
## snap_TX snap_WI
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.3301 Mean :0.3301
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
We find:
The calendar has all the relevant dates, weekdays, and months, plus binary SNAP flags per state and categorical event columns.
There are only 5 non-NA rows in the event_name_2 column; i.e. only 5 (out of 1969) instances where there is more than 1 event on a particular day.
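This can be double-checked with a quick count (a sketch; the corresponding code is not shown above):
# Number of calendar days that have a second event recorded
calendar %>%
  summarise(days_with_second_event = sum(!is.na(event_name_2)))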
There are no missing values in our sales training data:
## [1] 0
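The check itself is not displayed; a straightforward way to arrive at this count is:
sum(is.na(train))   # total number of missing values across all columns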
However, there are a lot of zero values. Here we plot the distribution of the percentage of zero values across all time series:
# For each time series, count the days with zero sales and turn that into a fraction:
# zeros are converted to NA, counted per row, and divided by the number of day columns.
bar <- train %>%
  select(-contains("id")) %>%   # drop the 6 ID columns; only the d_ day columns remain
  na_if(0) %>%
  is.na() %>%
  as_tibble() %>%
  mutate(sum = pmap_dbl(select(., everything()), sum)) %>%  # zero-sales days per series
  mutate(mean = sum/(ncol(train) - 6)) %>%                  # fraction of the 1913 day columns
  select(sum, mean)
bar %>%
ggplot(aes(mean)) +
geom_density(fill = "blue") +
scale_x_continuous(labels = scales::percent) +
coord_cartesian(xlim = c(0, 1)) +
theme_hc() +
theme(axis.text.y = element_blank()) +
labs(x = "", y = "", title = "Density for percentage of zero values - all time series")
Fig. 1: Density for the percentage of zero values across all time series.
This means that only a minority of time series have fewer than 50% zero values. The peak of the distribution is rather close to 100%.
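To put a number on this impression (my own follow-up, not part of the original figure), one could compute the share of series with more than 50% or more than 90% zero-sales days from the bar table defined above:
# Illustrative follow-up: share of time series dominated by zero-sales days
c(over_50_pct_zeros = mean(bar$mean > 0.5),
  over_90_pct_zeros = mean(bar$mean > 0.9))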
We will start our visual exploration by investigating a number of time series plots on different aggregation levels. Here is a short helper function to transform our wide data into a long format with a dates column in date format:
extract_ts <- function(df){
  min_date <- date("2011-01-29")   # date of the first sales column d_1 (see the calendar above)
  df %>%
    select(id, starts_with("d_")) %>%
    pivot_longer(starts_with("d_"), names_to = "dates", values_to = "sales") %>%
    mutate(dates = as.integer(str_remove(dates, "d_"))) %>%   # "d_42" -> 42
    mutate(dates = min_date + dates - 1) %>%                  # day index -> calendar date
    mutate(id = str_remove(id, "_validation"))                # shorten the series IDs
}
set.seed(4321)
foo <- train %>%
sample_n(50)
ts_out <- extract_ts(foo)
# Assign a fixed colour to each sampled series so the colours stay stable in the interactive plot
cols <- ts_out %>%
  distinct(id) %>%
  mutate(cols = rep_len(brewer.pal(7, "Set2"), length.out = n_distinct(ts_out$id)))
ts_out <- ts_out %>%
  left_join(cols, by = "id")
pal <- cols$cols %>%
  setNames(cols$id)
We sample 50 random time series from our training data to get an idea of what they look like.
In the following plot, you can select the ID of a time series (which is a concatenation of the item and store IDs) to display only the graphs that you’re interested in.
Currently, I don’t see a way to avoid having all time series plotted at the beginning. Select an ID from the input field to begin:
# Wrap the long-format sample in a crosstalk object so the ID filter and the plot share a selection
shared_ts <- highlight_key(ts_out)
palette(brewer.pal(9, "Set3"))
gg <- shared_ts %>%
ggplot(aes(dates, sales, col = id, group = id)) +
geom_line() +
scale_color_manual(values = pal) +
labs(x = "Date", y = "Sales") +
theme_tufte() +
NULL
# Stack the ID selector and the interactive plotly version of the ggplot, each at full width
filter <- bscols(
filter_select("ids", "Sales over time: Select a time series ID (remove with backspace key, navigate with arrow keys):", shared_ts, ~id, multiple = TRUE),
ggplotly(gg, dynamicTicks = TRUE),
widths = c(12, 12)
)
bscols(filter)