Photo by Fabio Bracht (@bracht) on Unsplash.

1 Preparations

1.2 Helper functions

A short helper function to compute binomial confidence intervals.
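Its code is folded here. As a minimal sketch (assuming exact Clopper-Pearson intervals via stats::binom.test; the actual helper may differ):

library(tidyverse)

# Hypothetical helper: exact (Clopper-Pearson) confidence interval
# for x successes out of n trials
binom_ci <- function(x, n, conf = 0.95) {
  ci <- binom.test(x, n, conf.level = conf)$conf.int
  tibble(lwr = ci[[1]], upr = ci[[2]])
}

binom_ci(5, 50)   # e.g. interval for 5 successes in 50 trials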

2 Quick Look: File structure and content

As a first step let’s have a quick look at the data sets using the head, summary, and glimpse tools where appropriate.

2.1 Training sales data

Here are the first 10 columns and rows of our training sales data:

id item_id dept_id cat_id store_id state_id d_1 d_2 d_3 d_4
HOBBIES_1_001_CA_1_validation HOBBIES_1_001 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_002_CA_1_validation HOBBIES_1_002 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_003_CA_1_validation HOBBIES_1_003 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_004_CA_1_validation HOBBIES_1_004 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_005_CA_1_validation HOBBIES_1_005 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_006_CA_1_validation HOBBIES_1_006 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_007_CA_1_validation HOBBIES_1_007 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_008_CA_1_validation HOBBIES_1_008 HOBBIES_1 HOBBIES CA_1 CA 12 15 0 0
HOBBIES_1_009_CA_1_validation HOBBIES_1_009 HOBBIES_1 HOBBIES CA_1 CA 2 0 7 3
HOBBIES_1_010_CA_1_validation HOBBIES_1_010 HOBBIES_1 HOBBIES CA_1 CA 0 0 1 0

We find:

  • There is one column each for the IDs of item, department, category, store, and state; plus a general ID that is a combination of the other IDs plus a flag for validation.

  • The sales per date are encoded as columns starting with the prefix d_. Those are the number of units sold per day (not the total amount of dollars).

  • We already see that there are quite a lot of zero values.

This data set has too many columns and rows to display them all:

## [1]  1919 30490

All IDs are marked as “validation”. This refers to the initial validation period of the competition, before we ultimately predict a different 28-day period.

## # A tibble: 1 x 2
##   dset           n
##   <chr>      <int>
## 1 validation 30490

2.2 Sales prices

This data set gives us the weekly sell prices per item and store:

store_id item_id wm_yr_wk sell_price
CA_1 HOBBIES_1_001 11325 9.58
CA_1 HOBBIES_1_001 11326 9.58
CA_1 HOBBIES_1_001 11327 8.26
CA_1 HOBBIES_1_001 11328 8.26
CA_1 HOBBIES_1_001 11329 8.26
CA_1 HOBBIES_1_001 11330 8.26
CA_1 HOBBIES_1_001 11331 8.26
CA_1 HOBBIES_1_001 11332 8.26
CA_1 HOBBIES_1_001 11333 8.26
CA_1 HOBBIES_1_001 11334 8.26
##    store_id           item_id             wm_yr_wk       sell_price     
##  Length:6841121     Length:6841121     Min.   :11101   Min.   :  0.010  
##  Class :character   Class :character   1st Qu.:11247   1st Qu.:  2.180  
##  Mode  :character   Mode  :character   Median :11411   Median :  3.470  
##                                        Mean   :11383   Mean   :  4.411  
##                                        3rd Qu.:11517   3rd Qu.:  5.840  
##                                        Max.   :11621   Max.   :107.320

We find:

  • We have the store_id and item_id to link this data to our training and validation data.

  • Prices range from $0.01 to a bit more than $107.

2.3 Calendar

The calendar data gives us date features such as weekday, month, or year, alongside 2 different event features and per-state SNAP food stamp flags:

date wm_yr_wk weekday wday month year d event_name_1 event_type_1 event_name_2 event_type_2 snap_CA snap_TX snap_WI
2011-01-29 11101 Saturday 1 1 2011 d_1 NA NA NA NA 0 0 0
2011-01-30 11101 Sunday 2 1 2011 d_2 NA NA NA NA 0 0 0
2011-01-31 11101 Monday 3 1 2011 d_3 NA NA NA NA 0 0 0
2011-02-01 11101 Tuesday 4 2 2011 d_4 NA NA NA NA 1 1 0
2011-02-02 11101 Wednesday 5 2 2011 d_5 NA NA NA NA 1 0 1
2011-02-03 11101 Thursday 6 2 2011 d_6 NA NA NA NA 1 1 1
2011-02-04 11101 Friday 7 2 2011 d_7 NA NA NA NA 1 0 0
2011-02-05 11102 Saturday 1 2 2011 d_8 NA NA NA NA 1 1 1
## Observations: 1,969
## Variables: 14
## $ date         <date> 2011-01-29, 2011-01-30, 2011-01-31, 2011-02-01, 2011-...
## $ wm_yr_wk     <dbl> 11101, 11101, 11101, 11101, 11101, 11101, 11101, 11102...
## $ weekday      <chr> "Saturday", "Sunday", "Monday", "Tuesday", "Wednesday"...
## $ wday         <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, ...
## $ month        <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ year         <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, ...
## $ d            <chr> "d_1", "d_2", "d_3", "d_4", "d_5", "d_6", "d_7", "d_8"...
## $ event_name_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "SuperBowl", NA, NA, N...
## $ event_type_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Sporting", NA, NA, NA...
## $ event_name_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ event_type_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ snap_CA      <dbl> 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, ...
## $ snap_TX      <dbl> 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, ...
## $ snap_WI      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, ...
##       date               wm_yr_wk       weekday               wday      
##  Min.   :2011-01-29   Min.   :11101   Length:1969        Min.   :1.000  
##  1st Qu.:2012-06-04   1st Qu.:11219   Class :character   1st Qu.:2.000  
##  Median :2013-10-09   Median :11337   Mode  :character   Median :4.000  
##  Mean   :2013-10-09   Mean   :11347                      Mean   :3.997  
##  3rd Qu.:2015-02-13   3rd Qu.:11502                      3rd Qu.:6.000  
##  Max.   :2016-06-19   Max.   :11621                      Max.   :7.000  
##      month             year           d             event_name_1      
##  Min.   : 1.000   Min.   :2011   Length:1969        Length:1969       
##  1st Qu.: 3.000   1st Qu.:2012   Class :character   Class :character  
##  Median : 6.000   Median :2013   Mode  :character   Mode  :character  
##  Mean   : 6.326   Mean   :2013                                        
##  3rd Qu.: 9.000   3rd Qu.:2015                                        
##  Max.   :12.000   Max.   :2016                                        
##  event_type_1       event_name_2       event_type_2          snap_CA      
##  Length:1969        Length:1969        Length:1969        Min.   :0.0000  
##  Class :character   Class :character   Class :character   1st Qu.:0.0000  
##  Mode  :character   Mode  :character   Mode  :character   Median :0.0000  
##                                                           Mean   :0.3301  
##                                                           3rd Qu.:1.0000  
##                                                           Max.   :1.0000  
##     snap_TX          snap_WI      
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000  
##  Mean   :0.3301   Mean   :0.3301  
##  3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000

We find:

  • The calendar has all the relevant dates, weekdays, and months, plus binary SNAP flags and character event columns.

  • There are only 5 non-NA rows in the event_name_2 column; i.e. only 5 (out of 1969) instances where there is more than 1 event on a particular day.

3 Visual Overview: Interactive time series plots

We will start our visual exploration by investigating a number of time series plots on different aggregation levels. Here is a short helper function to transform our wide data into a long format with a dates column in date format:
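The helper code itself is folded; here is a hedged sketch of such a function (named extract_ts here because later code chunks reference that name), assuming that the first sales column d_1 corresponds to 2011-01-29 as per the calendar:

# Hedged sketch: wide d_1, d_2, ... columns -> long (dates, sales) format
min_date <- as.Date("2011-01-29")   # date of d_1 in the calendar

extract_ts <- function(df, first_date = min_date) {
  df %>%
    pivot_longer(starts_with("d_"), names_to = "dates", values_to = "sales") %>%
    mutate(dates = first_date + as.integer(str_remove(dates, "d_")) - 1)
}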

3.1 Individual item-level time series - random sample

Here we will sample 50 random time series from our training sample to get an idea about the data.

In the following plot, you can select the id of a time series (which is a combination of the item and store IDs) to display only the graphs that you’re interested in.

Currently, I don’t see a way to avoid having all time series plotted at the beginning. Select one ID from the input field to begin:
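The widget code is folded; a hedged sketch of the sampling and the crosstalk/plotly wiring (widget options and styling in the original will differ):

library(crosstalk)
library(plotly)

set.seed(42)   # arbitrary seed, for illustration only

samp <- train %>%
  slice_sample(n = 50) %>%
  mutate(id = str_remove(id, "_validation")) %>%
  extract_ts()

shared <- SharedData$new(samp, key = ~id)

bscols(
  widths = c(3, 9),
  filter_select("ts_id", "Time series ID", shared, ~id, multiple = FALSE),
  plot_ly(shared, x = ~dates, y = ~sales, color = ~id,
          type = "scatter", mode = "lines")
)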

We find:

  • Most time series have pretty low daily count statistics, alongside the large percentage of zero numbers we already noticed. On the one hand, this suggests that spikes are not going to be overly pronounced. But it also indicates that accurate forecasts will have to deal with quite a lot of noise.

  • Some of our sample time series start in the middle of the time range, and some have long gaps in between. This is an additional challenge.

Notes:

  • I have assigned one of 7 colour-blind-friendly ColorBrewer colours (palette “Set2”) to each time series (cycling through the palette sequentially). Those will typically provide good contrast for a reasonable number of simultaneous plots, but sometimes two time series might share the same colour.

  • I got the inspiration to use the crosstalk package for building this plot from the excellent dashboard kernel by Saba Tavoosi. Make sure to check it out.

3.2 All aggregate sales

After peeking at some of the individual time series and finding a lot of zero values, we will now do some aggregation to get some decent statistics.

First off, here we plot the aggregate time series over all items, stores, categories, departments, and states. This is an interactive plot and you can use the usual plotly tools (upper right corner) to zoom, pan, and scale the view.

Fig. 3

We find:

  • The sales are generally going up, which I suppose is good news for Walmart. We can make out some yearly seasonality, and a dip at Christmas, which is the only day of the year when the stores are closed.

  • Zooming in, we can see strong weekly seasonality plus possibly some additional overlaying patterns with shorter periods than yearly.

  • The most recent 2016 sales numbers appear to grow a bit faster than in previous years.

3.3 Sales per State

To get a bit more out of these big picture views we will look at the sales per state on a monthly aggregate level. This is another interactive ggplotly graph:
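The code behind this figure is folded; a hedged sketch of the monthly state-level aggregation (reusing extract_ts from above; the original styling will differ):

library(lubridate)
library(plotly)

gg <- train %>%
  group_by(state_id) %>%
  summarise_at(vars(starts_with("d_")), sum) %>%
  rename(id = state_id) %>%
  extract_ts() %>%
  mutate(month = floor_date(dates, unit = "month")) %>%
  group_by(id, month) %>%
  summarise(sales = sum(sales), .groups = "drop") %>%
  ggplot(aes(month, sales, col = id)) +
  geom_line() +
  labs(x = "", y = "Monthly units sold", col = "State")

ggplotly(gg)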

Fig. 4

We find:

  • California (CA) sells more items in general, while Wisconsin (WI) was slowly catching up to Texas (TX) and eventually surpassed it in the last months of our training data.

  • CA has pronounced dips in 2013 and 2015 that appear to be present in the other states as well, just less severe. These dips and peaks don’t appear to always occur (see 2012) but they might primarily reflect the yearly seasonality we noticed already.

3.4 Sales per Store & Category

There are 10 stores, 4 in CA and 3 each in TX and WI, and 3 categories: Foods, Hobbies, and Household. Here we’re switching up our visualisation strategy a bit by including all of those in the same, non-interactive layout. We will again use monthly aggregates to keep the plots clean.

This visual uses a facetted view for the stores (1 facet per state) and the great patchwork package to arrange the individual plots, including a count of all category occurrences:

Fig. 5

We find:

  • “Foods” are the most common category, followed by “Household” which is still quite a bit above “Hobbies”. The number of “Household” rows is closer to the number of “Foods” rows than the corresponding sales figures, indicating that more “Foods” units are sold than “Household” ones.

  • In terms of stores, we see that the TX stores are quite close together in sales; with “TX_3” rising from the level of “TX_1” to the level of “TX_2” over the time of our training data. The WI stores “WI_1” and “WI_2” show a curious jump in sales in 2012, while “WI_3” shows a long dip over several years.

  • The CA stores are relatively well separated in store volume. Note “CA_2”, which declines to the “CA_4” level in 2015, only to recover and jump up to “CA_1” sales later in the year.

3.5 Sales per Department

Our data has 7 departments, 3 for “FOODS” and 2 each for “HOBBIES” and “HOUSEHOLD”. Together with the 3 states those are 21 levels. Let’s see whether we can fit them all in a single facet grid plot.

Fig. 5

We find:

  • “FOODS_3” is clearly driving the majority of “FOODS” category sales in all states. “FOODS_2” is picking up a bit towards the end of the time range, especially in “WI”.

  • Similarly, “HOUSEHOLD_1” is clearly outselling “HOUSEHOLD_2”. “HOBBIES_1” is on a higher average sales level than “HOBBIES_2”, but both are not showing much development over time.

3.6 Seasonalities - global

Moving on from the time series views, at least for the moment, we are changing up our visuals to study seasonalities. Here is a heat map that combines the weekly and yearly seasonalities.

Because of the general increasing trend in sales, we’re not looking at absolute sales values. Instead, we aim to model this trend using a smoothed (LOESS) fit which we then subtract from the data. The heatmap shows the relative changes. Note that I removed the Christmas dips because they would be distracting for the purpose of this plot:
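The figure code is folded; a hedged sketch of the detrend-and-aggregate logic (span and palette are illustrative, not the exact values used for the figure):

library(lubridate)

all_sales <- train %>%
  summarise_at(vars(starts_with("d_")), sum) %>%
  mutate(id = "total") %>%
  extract_ts() %>%
  filter(!str_detect(as.character(dates), "-12-25")) %>%   # drop the Christmas closures
  mutate(loess = predict(loess(sales ~ as.integer(dates - min(dates)),
                               span = 2/3, degree = 1)),
         sales_rel = (sales - loess) / mean(sales))

all_sales %>%
  mutate(wday = wday(dates, label = TRUE, week_start = 1),
         month = month(dates, label = TRUE)) %>%
  group_by(wday, month) %>%
  summarise(sales_rel = sum(sales_rel), .groups = "drop") %>%
  ggplot(aes(month, wday, fill = sales_rel)) +
  geom_tile() +
  scale_fill_distiller(palette = "Spectral") +
  labs(x = "", y = "", fill = "Relative sales")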

Fig. 6

We find:

  • The weekly pattern is strong, with Sat and Sun standing out prominently. Also Monday seems to benefit a bit from the weekend effect.

  • The months of Nov and Dec show clear dips, while the summer months May, Jun, and Jul suggest a milder secondary dip. Certain holidays, like the 4th of July, might somewhat influence these patterns; but over 5 years they should average out reasonably well.

  • I already tweaked the smoothing fit parameters a bit, but they could probably be optimised further. Still, it looks quite good for the first try.

Let’s stay with the seasonalities for a little while longer and go one or two levels deeper.

Next, we’ll look at the weekly and monthly patterns on a state level. We’re using the same smoothing approach which is shown in the upper panels for reference. Here, I will scale each individual time series by its global mean to allow for a better comparison between the three states.

The colours are the same throughout the layout; see the upper panels for reference:

Fig. 7

We find:

  • After scaling, the weekday vs weekend pattern is very similar for all 3 states, except for an interesting downturn in Sunday sales in WI.

  • The monthly seasonalities are indeed complex. There is a dip in the winter months and a second, generally shallower dip around May. WI is again the odd state out: it sells notably less in the summer compared to TX and especially CA; so much so that the Feb/Aug ratio is inverted for WI vs CA/TX.

The logical next step is to look at the Seasonalities per Category, since you wouldn’t necessarily expect food and household item shopping to follow the same patterns. I will also break up this view by state, following the insights we just saw above.

Again, we are subtracting the trend and scaling by the global mean:

# sum by cat + state, pivoting dates
foo <- train %>%
  group_by(cat_id, state_id) %>% 
  summarise_at(vars(starts_with("d_")), sum) %>% 
  ungroup() %>% 
  select(ends_with("id"), starts_with("d_")) %>%  
  pivot_longer(starts_with("d_"), names_to = "dates", values_to = "sales") %>%
  mutate(dates = as.integer(str_remove(dates, "d_"))) %>% 
  mutate(dates = min_date + dates - 1)

# fit loess and subtract
bar <- foo %>% 
  filter(!str_detect(as.character(dates), "-12-25")) %>% 
  group_by(cat_id, state_id) %>% 
  mutate(loess = predict(loess(sales ~ as.integer(dates - min(dates)) + 1, span = 2/3, degree = 1)),
         mean_sales = mean(sales)) %>% 
  mutate(sales_rel = (sales - loess)/mean_sales)

# bar %>% 
#   ggplot(aes(dates, sales, col = cat_id)) +
#   geom_line() +
#   geom_line(aes(dates, loess), col = "black") +
#   facet_grid(cat_id ~ state_id) +
#   theme_tufte() +
#   theme(legend.position = "none") +
#   labs(x = "", y = "Sales", title = "Sales per State with Seasonalities")

p1 <- bar %>% 
  ungroup() %>%
  mutate(wday = wday(dates, label = TRUE, week_start = 1)) %>% 
  group_by(cat_id, state_id, wday) %>% 
  summarise(sales = sum(sales_rel)) %>%
  unite(col = id, ends_with("id"), remove = FALSE) %>%  
  ggplot(aes(wday, sales, group = id, col = state_id)) +
  geom_line(size = 1.5) +
  theme_tufte() +
  facet_wrap(~cat_id, scales = "free", nrow = 3) +
  theme(legend.position = "top") +
  labs(x = "", y = "Relative Sales", title = "Weekly Seasonality and", col = "State")
  
p2 <- bar %>% 
  mutate(month = month(dates, label = TRUE)) %>% 
  group_by(cat_id, state_id, month) %>% 
  summarise(sales = sum(sales_rel)) %>%
  unite(col = id, ends_with("id"), remove = FALSE) %>% 
  ggplot(aes(month, sales, group = id, col = state_id)) +
  geom_line(size = 1.5) +
  theme_tufte() +
  facet_wrap(~cat_id, scales = "free_y", nrow = 3) +
  theme(legend.position = "none") +
  labs(x = "", y = "Relative Sales", title = "Monthly Seasonality by Category & State", col = "State")

layout <- "
AABBB
"

p1 + p2 + plot_layout(design = layout)
Fig. 8

We find:

  • The weekly patterns for the “foods” category are very close for all 3 states, while WI shows some deviations for “hobbies” and “household”. We also see Wisconsin’s characteristic Sunday dip in all 3 categories.

  • The monthly patterns show some interesting signatures: For “foods”, CA and TX are pretty close but WI shows that inverted summer/winter ratio. In contrast, the 3 states are much more similar to each other in the “hobbies” category. And when it comes to “household” items, CA doesn’t seem to sell as much of them during the first 3 months of the year but slightly more in the summer; compared to WI and TX.

Before we look at the additional explanatory variables, here is a visual representation of the split between training data vs validation data (public leaderboard) vs evaluation data (eventual private leaderboard):

Fig. 9

  • This shows the 5+ years of training data together with the 28 days each of validation and evaluation time range.

  • The training range goes from 2011-01-29 to 2016-04-24. The validation range spans the following 28 days from 2016-04-25 to 2016-05-22. This is what we are predicting for the initial public leaderboard. Eventually, we will be given the ground truth for this period to train our model for the final evaluation.

  • The evaluation range goes from 2016-05-23 to 2016-06-19; another 28 days. This is what the ultimate scoring will be based on.

4 Explanatory Variables: Prices and Calendar

This section will focus on the additional explanatory variables we’ve been given: item prices and calendar events. In terms of calendar features, we have already used some parameters like day-of-week or month, derived from the date, in the previous section. After studying the basic properties of these datasets, we will also connect them to the time series data.

4.1 Calendar

We need some calendar features for the item price table, so let’s start with that one.

In the quick view in section 2.3 we saw that the calendar data frame contains basic features like day-of-week (as character column weekday and as integer column wday), month, year, and of course date. Alongside the date there is also a d column which links the date to the column names in the training data. (Our time series extraction function has the starting date as a hard-coded value, but you can also convert from column name to date using the calendar file.)
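For illustration, here is a tiny hedged helper for that conversion (d_to_date is hypothetical, not part of the kernel):

d_to_date <- function(d, cal = calendar) cal$date[match(d, cal$d)]

d_to_date(c("d_1", "d_1913"))   # should give 2011-01-29 and 2016-04-24 (cf. the date ranges under Fig. 9)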

The other features deal with events and food stamps:

  • If you look at section 2.3 you’ll see that the columns event_name_2 and event_type_2 only contain 5 non-NA values, so we’re ignoring them here and focusing only on the event_*_1 features.

  • The acronym SNAP stands for Supplemental Nutrition Assistance Program. The following is copied from their website: “The Supplemental Nutrition Assistance Program (SNAP) is the largest federal nutrition assistance program. SNAP provides benefits to eligible low-income individuals and families via an Electronic Benefits Transfer card. This card can be used like a debit card to purchase eligible food in authorized retail food stores.”

Fig. 10

We find:

  • In our calendar coverage (i.e. training + validation + evaluation range) about 8% of days have a special event. Of these events, about 1/3 are Religious (e.g. Orthodox Christmas) and 1/3 are National Holidays (e.g. Independence Day). The remaining third is again split into 2/3 Cultural (e.g. Valentines Day) and 1/3 Sporting events (e.g. SuperBowl).

  • Looking at the percentage of days where purchases with SNAP food stamps are allowed in Walmart stores, we find that it is exactly the same for each of the 3 states: 650 days, or 33%. This is noteworthy.

Let’s stay with SNAP for a little bit and look at a different style of visual: a calendar view. Here, I’m showing the days of each month with days-of-week as rows and weeks-of-month as columns; essentially the way a calendar is laid out in a year view. Then I colour the SNAP days orange and show the same graph for each state:
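The construction of the foo data frame used by the plotting code below is folded; judging by the message underneath, it relies on timetk’s tk_augment_timeseries_signature(). A hedged sketch:

library(timetk)

# Hedged sketch: one row per date and state, augmented with calendar signature
# columns (day, mweek, wday.lbl, month.lbl, ...) that the plots below rely on
foo <- calendar %>%
  select(date, starts_with("snap")) %>%
  pivot_longer(starts_with("snap"), names_to = "state", values_to = "snap") %>%
  mutate(state = str_remove(state, "snap_")) %>%
  tk_augment_timeseries_signature(.date_var = date)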

## tk_augment_timeseries_signature(): Using the following .date_var variable: date
p1 <- foo %>%
  filter(!is.na(day)) %>% 
  filter(year == 2012 & month <= 6 & state == "CA") %>% 
  #mutate(mweek2 = lead(mweek,1)) %>%
  #filter(!is.na(mweek2)) %>%
  #ggplot(aes(mweek2, fct_rev(wday.lbl), fill = as.logical(snap))) + 
  ggplot(aes(mweek, fct_rev(wday.lbl), fill = as.logical(snap))) + 
  geom_tile(colour = "white") +
  geom_text(aes(label = day), size = 2.5) +
  scale_fill_manual(values = c("grey70", "orange")) +
  facet_grid(year~month.lbl) + 
  theme_tufte() +
  theme(legend.position = "none", axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 8)) +
  labs(x="", y="", title = "California")

p2 <- foo %>%
  filter(!is.na(day)) %>% 
  filter(year == 2012 & month <= 6 & state == "TX") %>% 
  # mutate(mweek2 = lead(mweek,1)) %>%
  # filter(!is.na(mweek2)) %>%
  # ggplot(aes(mweek2, fct_rev(wday.lbl), fill = as.logical(snap))) + 
  ggplot(aes(mweek, fct_rev(wday.lbl), fill = as.logical(snap))) + 
  geom_tile(colour = "white") +
  geom_text(aes(label = day), size = 2.5) +
  scale_fill_manual(values = c("grey70", "orange")) +
  facet_grid(year~month.lbl) + 
  theme_tufte() +
  theme(legend.position = "none", axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 8)) +
  labs(x="", y="", title = "Texas")

p3 <- foo %>%
  filter(!is.na(day)) %>% 
  filter(year == 2012 & month <= 6 & state == "WI") %>% 
  # mutate(mweek2 = lead(mweek,1)) %>%
  # filter(!is.na(mweek2)) %>%
  # ggplot(aes(mweek2, fct_rev(wday.lbl), fill = as.logical(snap))) + 
  ggplot(aes(mweek, fct_rev(wday.lbl), fill = as.logical(snap))) + 
  geom_tile(colour = "white") +
  geom_text(aes(label = day), size = 2.5) +
  scale_fill_manual(values = c("grey70", "orange")) +
  facet_grid(year~month.lbl) + 
  theme_tufte() +
  theme(legend.position = "none", axis.text.x = element_blank(), axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 8)) +
  labs(x="", y="", title = "Wisconsin")

p1 / p2 / p3 +
   plot_annotation(title = 'The same SNAP days each month, but different from state to state')
Fig. 11

These are just the first six months of 2012, to keep the plot manageable, but the pattern is always the same:

  • SNAP days are always the same days of the month for each month. There are always 10 of them, and the specific days are different from state to state: days 1-10 for CA, days 1-15 without 2, 4, 8, 10, 14 for TX, and days 2-15 without 4, 7, 10, 13 for WI.

  • The SNAP days for these 3 states also all happen in the first half of each month, no later than the 15th. This certainly helps to measure their impact and make predictions more robust.
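A quick, hedged way to verify the day-of-month sets listed above directly from the calendar table:

library(lubridate)

calendar %>%
  pivot_longer(starts_with("snap"), names_to = "state", values_to = "snap") %>%
  mutate(state = str_remove(state, "snap_")) %>%
  filter(snap == 1) %>%
  distinct(state, dom = mday(date)) %>%
  group_by(state) %>%
  summarise(snap_days = str_c(sort(dom), collapse = ", "))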

4.2 Item Prices

We have item price information for each item ID, which includes the category and department IDs, and its store ID, which includes the state ID. Let’s look at some average overviews first.

Here is a facet grid with overlapping density plots for price distributions within the groups of category, department, and state. Note the logarithmic scale on the x-axes:

Fig. 12

We find:

  • First of all, the distributions are almost identical between the 3 states. There are some minute differences in the “FOODS” category, but this might be due to the smoothing bandwidth. For all practical purposes, I think that we can treat the price distributions as equal.

  • There are notable differences between the categories: FOODS items are on average cheaper than HOUSEHOLD items. And HOBBIES items span a wider range of prices than the other two, even suggesting a second peak at lower prices.

  • Within the categories, we find significant differences, too:

    • Among the three FOODS departments, “FOODS_3” does not have a high-price tail.

    • The HOBBIES category is the most diverse one, with both departments having quite broad distributions but “HOBBIES_1” accounting for almost all of the items above $10. “HOBBIES_2” has a bimodal structure.

    • The HOUSEHOLD price distributions are quite similar, but “HOUSEHOLD_2” peaks at clearly higher prices than “HOUSEHOLD_1”.

Prices are provided as weekly averages. The week ID wm_yr_wk can be linked to dates (and sales) using the calendar column of the same name.

Here we do just that to extract the item price distributions per category and department for each of the year from 2011 to 2016. We use ridgeline plots provided by the ggridges package to produce stacks of overlapping density graphs:
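The plotting code is folded; a hedged sketch of the join and ridgeline layout, assuming the raw price table is loaded as prices (the object name and styling are assumptions):

library(ggridges)

prices %>%
  left_join(calendar %>% group_by(wm_yr_wk) %>% summarise(year = first(year)),
            by = "wm_yr_wk") %>%
  separate(item_id, into = c("cat_id", "dept", "num"), sep = "_", remove = FALSE) %>%
  unite("dept_id", cat_id, dept) %>%
  ggplot(aes(sell_price, factor(year), fill = dept_id)) +
  geom_density_ridges(alpha = 0.7) +
  scale_x_log10() +
  facet_wrap(~ dept_id, scales = "free_x") +
  theme(legend.position = "none") +
  labs(x = "Sell price [$, log scale]", y = "")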

Fig. 13

We find:

  • Overall, the price distributions are pretty stable over the years, with only slight increases that are likely due to inflation. For this plot, I’ve left the standard grey backgrounds so that you can use the vertical grid lines to better see the shifts to the right (remember that the scale is logarithmic). This is probably best visible in HOBBIES_1.

  • An interesting evolution is visible in HOBBIES_2, which over time becomes much more bimodal: the second peak at $1 increases in importance until it almost reaches the level of the main peak just above $2. At the same time, the small secondary peak at half a dollar in HOBBIES_1 flattens after 2012.

  • The HOUSEHOLD departments are stable. FOODS shows small changes like the relative growth of the $1 peak of FOODS_1.

4.3 Connection to time series data

Now that we have an idea about the properties of our explanatory variables, let’s see how they could influence our time series data.

Here, we’re looking at time series properties on an aggregate level. I plan for the next section to contain views of selected individual time series.

First off, this is a comprehensive view of sales volume during days with special events vs non-events for our 3 categories food, hobbies, and household. I’m showing the daily time series in the background, but it’s more instructive to look at the smoothed representations. I’m doing the smoothing per category and also globally (dashed line). This global fit will be used to compute the relative sales for the two bottom panels. In the lower right panel I only look at days with events and show the median sales for the 4 different types of events. The two bottom panels share the same y-axis labels between them.

Take your time to digest this view:

foo <- train %>%
  group_by(cat_id) %>% 
  summarise_at(vars(starts_with("d_")), sum) %>% 
  rename(id = cat_id) %>% 
  extract_ts() %>% 
  rename(cat_id = id) %>% 
  left_join(calendar %>% select(date, event_type_1), by = c("dates" = "date")) %>% 
  filter(!str_detect(as.character(dates), "-12-25")) %>% 
  group_by(cat_id) %>% 
  mutate(loess = predict(loess(sales ~ as.integer(dates - min(dates)) + 1, span = 1/2, degree = 1)),
         mean_sales = mean(sales)) %>% 
  mutate(sales_rel = (sales - loess)/mean_sales) %>% 
  mutate(is_event = !is.na(event_type_1)) %>% 
  ungroup()


p1 <- foo %>% 
  ggplot(aes(dates, sales/1e3, group = is_event, col = is_event)) +
  geom_line(aes(dates, loess/1e3), col = "black", linetype = 2) +
  geom_line(alpha = 0.3) +
  geom_smooth(method = "loess", formula = 'y ~ x', span = 2/3, se = FALSE) +
  scale_colour_manual(values = c("grey70", "red")) +
  facet_wrap(~ cat_id, scales = "free") +
  theme_hc() +
  theme(legend.position = "right") +
  labs(x = "", y = "Sales [$1k]", col = "Event", title = "Sales per Category during special events vs non-events")

p2 <- foo %>% 
  ggplot(aes(cat_id, sales_rel, fill = is_event)) +
  geom_boxplot() +
  coord_flip() +
  scale_fill_manual(values = c("grey70", "red")) +
  theme_hc() +
  theme(legend.position = "none", axis.text.y = element_blank(), axis.ticks.y = element_blank(),
        axis.title.x = element_text(size = 10)) +
  labs(x = "", y = "Relative Sales", fill = "event")

p3 <- foo %>%
  filter(is_event == TRUE) %>% 
  group_by(cat_id, event_type_1) %>% 
  summarise(sales = median(sales_rel)) %>% 
  ggplot(aes(cat_id, sales, fill = event_type_1)) +
  geom_col(position = "dodge") +
  coord_flip() +
  theme_hc() +
  theme(legend.position = "right", axis.title.x = element_text(size = 10)) +
  labs(x = "", y = "Median Relative Sales", fill = "Type")

layout <- '
AAAAAA
AAAAAA
AAAAAA
BBBCCD
BBBCCD
'

p1 + p2 + p3 + guide_area() + plot_layout(design = layout, guides = 'collect')
Fig. 14

We find:

  • For FOODS the smoothed lines of event vs non-event sales are pretty similar, while for HOBBIES the red event line is consistently below the non-events and for HOUSEHOLD the same is true after 2013. (A curious detail: before 2013 there was no real difference between those sales.)

  • This impression is confirmed by looking at the boxplots in the lower left panel, which show the relative sales after subtracting the global smoothing fit. The FOODS sales are pretty comparable between events and non-events, while the event sales for HOUSEHOLD and especially HOBBIES are notably below the non-event level.

  • In the lower right panel we break out the events by type and look at the medians of the relative sales. The first thing we see is that FOODS sales are notably higher during “Sporting” events. This makes sense, given the food culture associated with big events like the Superbowl. FOODS also shows slightly positive net sales during “Cultural” events.

  • In general, “National” and “Religious” events both lead to a relative decline in sales volume. “National” events weigh more heavily on the HOBBIES category, while the other two categories are slightly more affected by “Religious” events. HOBBIES also sees lower sales from “Cultural” events, while for FOODS and HOUSEHOLD the differences are smaller. “Sporting” has a minor impact on HOUSEHOLD and HOBBIES.

We’ve spent quite a bit of time designing this dashboard-like overview, so let’s get some mileage out of it. Here’s the equivalent plot for the 3 states:

foo <- train %>%
  group_by(state_id) %>% 
  summarise_at(vars(starts_with("d_")), sum) %>% 
  rename(id = state_id) %>% 
  extract_ts() %>% 
  rename(state_id = id) %>% 
  left_join(calendar %>% select(date, event_type_1), by = c("dates" = "date")) %>% 
  filter(!str_detect(as.character(dates), "-12-25")) %>% 
  group_by(state_id) %>% 
  mutate(loess = predict(loess(sales ~ as.integer(dates - min(dates)) + 1, span = 1/2, degree = 1)),
         mean_sales = mean(sales)) %>% 
  mutate(sales_rel = (sales - loess)/mean_sales) %>% 
  mutate(is_event = !is.na(event_type_1)) %>% 
  ungroup()


p1 <- foo %>% 
  ggplot(aes(dates, sales/1e3, group = is_event, col = is_event)) +
  geom_line(aes(dates, loess/1e3), col = "black", linetype = 2) +
  geom_line(alpha = 0.3) +
  geom_smooth(method = "loess", formula = 'y ~ x', span = 2/3, se = FALSE) +
  scale_colour_manual(values = c("grey70", "red")) +
  facet_wrap(~ state_id, scales = "free") +
  theme_hc() +
  theme(legend.position = "right") +
  labs(x = "", y = "Sales [$1k]", col = "Event", title = "Sales per State during special events vs non-events")

p2 <- foo %>% 
  ggplot(aes(state_id, sales_rel, fill = is_event)) +
  geom_boxplot() +
  coord_flip() +
  scale_fill_manual(values = c("grey70", "red")) +
  theme_hc() +
  theme(legend.position = "none", axis.text.y = element_blank(), axis.ticks.y = element_blank(),
        axis.title.x = element_text(size = 10)) +
  labs(x = "", y = "Relative Sales", fill = "event")

p3 <- foo %>%
  filter(is_event == TRUE) %>% 
  group_by(state_id, event_type_1) %>% 
  summarise(sales = median(sales_rel)) %>% 
  ggplot(aes(state_id, sales, fill = event_type_1)) +
  geom_col(position = "dodge") +
  coord_flip() +
  theme_hc() +
  theme(legend.position = "right", axis.title.x = element_text(size = 10)) +
  labs(x = "", y = "Median Relative Sales", fill = "Type")

layout <- '
AAAAAA
AAAAAA
AAAAAA
BBBCCD
BBBCCD
'

p1 + p2 + p3 + guide_area() + plot_layout(design = layout, guides = 'collect')
Fig. 15

We find:

  • Special events slightly outsell non-event days in TX before 2014; afterwards they are similar. CA and WI also show a drop around the same time, but here it’s from similar sales to lower sales. This seems to be a common pattern that starts in 2013.

  • As a result the events boxplot for WI is notably shifted toward lower values, while in TX events push the sales figures to somewhat higher values. CA is roughly the same for events vs non-events. Note that this is the global picture, which will be different pre- and post-2014.

  • The view of event types looks interesting, especially for WI where “National” events have a strong negative impact on sales numbers. WI is also the only state where “Cultural” events have lower sales numbers, especially compared to TX. In contrast, “Religious” events have the smallest, but still negative impact in WI. “Sporting” events have a positive influence in each state.

Now, for the states we also have the SNAP dates from Fig. 11. In this plot we show again a smoothed fit for the SNAP days vs other days in the top panel, but then add different plots in the lower panels. The lower left plot shows the daily sales percentages: since SNAP days make up about 1/3 of the days in each month, we sum the sales for each group (SNAP vs non-SNAP), divide by the number of days in that group, and then show each group’s share of the combined daily averages.

bar <- calendar %>% 
  select(date, starts_with("snap")) %>% 
  pivot_longer(starts_with("snap"), names_to = "state_id", values_to = "snap") %>% 
  mutate(state_id = str_replace(state_id, "snap_", ""))

foo <- train %>%
  group_by(state_id) %>% 
  summarise_at(vars(starts_with("d_")), sum) %>% 
  rename(id = state_id) %>% 
  extract_ts() %>% 
  rename(state_id = id) %>% 
  left_join(bar, by = c("dates" = "date", "state_id")) %>% 
  filter(!str_detect(as.character(dates), "-12-25")) %>% 
  mutate(snap = as.logical(snap)) %>% 
  group_by(state_id) %>% 
  mutate(loess = predict(loess(sales ~ as.integer(dates - min(dates)) + 1, span = 1/2, degree = 1)),
         mean_sales = mean(sales)) %>% 
  mutate(sales_rel = (sales - loess)/mean_sales) %>% 
  ungroup()


p1 <- foo %>% 
  ggplot(aes(dates, sales/1e3, group = snap, col = snap)) +
  geom_line(aes(dates, loess/1e3), col = "black", linetype = 2) +
  geom_line(alpha = 0.3) +
  geom_smooth(method = "loess", formula = 'y ~ x', span = 2/3, se = FALSE) +
  scale_colour_manual(values = c("grey70", "red")) +
  facet_wrap(~ state_id, scales = "free") +
  theme_hc() +
  theme(legend.position = "right") +
  labs(x = "", y = "Sales [$1k]", col = "SNAP day", title = "Sales per State on SNAP days vs other")

p2 <- foo %>% 
  group_by(state_id, snap) %>% 
  summarise(sales = sum(sales),
            ct = n()) %>% 
  mutate(sales_daily = sales/ct) %>% 
  add_tally(sales_daily, name = "total") %>% 
  mutate(perc = sales_daily/total) %>% 
  ggplot(aes(state_id, perc, fill = snap)) +
  geom_col(position = "dodge") +
  geom_label(aes(label = sprintf("%.1f %%", perc*100), group = snap), position = position_dodge(width = 1)) +
  scale_y_continuous(labels = scales::percent, breaks = c(0, seq(0.1, 0.5, 0.2))) +
  coord_cartesian(ylim = c(0, 0.6)) +
  scale_fill_manual(values = c("grey70", "red")) +
  theme_hc() +
  theme(legend.position = "none", axis.text.y = element_blank(), axis.ticks.y = element_blank()) +
  labs(x = "", y = "", title = "Daily Sales Percentage")

layout <- '
AAAA
AAAA
BBBC
'

p1 + p2 + guide_area() + plot_layout(design = layout, guides = 'collect')
Fig. 16

We find:

  • The SNAP days have clearly higher sales in every state. The largest difference to non-SNAP days is present for WI; this is abundantly clear from both the time series plots and the daily sales percentages. CA sales are closest for the two groups, but SNAP days are still almost 2 percentage points above 50%.

  • There are some slight variations over time, primarily for WI where the two curves appear to reach their biggest difference around 2013. As with all smoothing fits, these have to be taken with a grain of salt near the very edges of the data.

After looking at the last two visuals the question has to be: how does the sales category interact with the SNAP feature?

The “N” in SNAP stands for “Nutrition”, so we would naively expect that those benefits would primarily affect purchases in our FOODS category. However, keep in mind that once someone bases their shopping patterns on SNAP days, they are very likely to purchase other items while they are there.

Let’s see what the data shows. Since the SNAP days are state dependent, we have to join our data at the state level (and by date) first, before aggregating by category and SNAP. We first look at the daily sales percentage for SNAP vs other, and then plot a heatmap for weekday vs month. The heatmap values show the sum of relative sales on SNAP days minus the sum of relative sales on other days. If the numbers are positive then there are more sales on SNAP days for this weekday and month (normalised by overall volume):


p1 <- foo %>% 
  group_by(state_id, cat_id, snap) %>% 
  summarise(sales = sum(sales),
            ct = n()) %>% 
  mutate(sales_daily = sales/ct) %>% 
  add_tally(sales_daily, name = "total") %>% 
  mutate(perc = sales_daily/total) %>% 
  ggplot(aes(cat_id, perc, fill = snap)) +
  geom_col(position = "dodge") +
  facet_wrap(~ state_id, nrow = 1) +
  geom_label(aes(label = sprintf("%.1f %%", perc*100), group = snap), position = position_dodge(width = 1)) +
  scale_y_continuous(labels = scales::percent, breaks = c(0, seq(0.1, 0.5, 0.2))) +
  coord_cartesian(ylim = c(0, 0.6)) +
  scale_fill_manual(values = c("grey70", "red")) +
  theme_hc() +
  theme(legend.position = "none", axis.text.x = element_text(size = 8)) + 
  labs(x = "", y = "", title = "Daily Sales Percentage for SNAP per category")
  
p2 <- foo %>% 
  filter(state_id == "CA" & cat_id == "FOODS") %>% 
    mutate(wday = wday(dates, label = TRUE, week_start = 1),
         month = month(dates, label = TRUE),
         year = year(dates)) %>% 
  group_by(wday, month, snap) %>% 
  summarise(sales = sum(sales_rel)) %>% 
  pivot_wider(names_from = "snap", values_from = "sales", names_prefix = "snap") %>% 
  mutate(snap_effect = snapTRUE - snapFALSE) %>% 
  ggplot(aes(month, wday, fill = snap_effect)) +
  geom_tile() +
  labs(x = "Month of the year", y = "Day of the week", fill = "SNAP effect") +
  scale_fill_distiller(palette = "Spectral") +
  theme_hc() +
  theme(legend.position = "bottom") +
  labs(x = "", y = "", title = "SNAP impact by weekday & month",
       subtitle = "Relative sales of SNAP days - other days. Only FOODS category and state CA.")

p1 / p2
Fig. 17

We find:

  • As expected, the impact is largest for the FOODS category; especially in WI. However, there are indications of slight synergy effects on other categories as well.

  • The heatmap focusses on FOODS and CA (because CA has the overall largest sales numbers). We see that overall the work days Mon-Fri show stronger benefits from SNAP purchases than the weekend Sat/Sun.

  • Thursdays in November stand out. I suspect that this effect is not because of SNAP, but because of Thanksgiving. Thanksgiving is celebrated on the 4th Thursday in November every year. Here, this holiday most likely cuts into the “other” purchases and leads to the SNAP effect appearing artificially high.

  • This reminds us of a caveat we have to keep in mind here: SNAP days always fall within the first half of each month. Ideally, we would want to control for the possibility that the effect we see is first half vs second half of the month rather than SNAP vs other, or that we are seeing the impact of holidays like Thanksgiving instead.

5 Individual time series with explanatory variables

Now that we know the global impact of the explanatory variables, let’s look at some example time series to get an idea about individual effects.

We will pick 3 random items from our interactive Fig. 2. Then we extract their sales numbers and join calendar events together with SNAP flags. I’ve given some thought to how to display this efficiently, and here is my current preferred strategy: we plot the sales numbers as line charts on top of background rectangles that show the (regular) periods of SNAP days. Then we add event indicators as black points. Let me know if you have any other suggestions on how to plot this more elegantly.

To keep the focus on the SNAP periods, this plot will zoom into the period of May - Sep 2015. The three different items cover the 3 states and 3 sales categories. Following Fig. 11 we know that different states have different SNAP days; with only CA having a continuous 10-day range in our data:
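The figure code is folded; a hedged sketch of the layering for one of the items (assuming the long-format example table with columns id, dates, sales that the later price plots use):

ex <- example %>%
  filter(id == "FOODS_2_092_CA_1",
         between(dates, as.Date("2015-05-01"), as.Date("2015-09-30"))) %>%
  left_join(calendar %>% select(date, event_name_1), by = c("dates" = "date"))

snap_ca <- calendar %>%
  filter(snap_CA == 1,
         between(date, as.Date("2015-05-01"), as.Date("2015-09-30")))

ex %>%
  ggplot(aes(dates, sales)) +
  geom_rect(data = snap_ca, inherit.aes = FALSE,
            aes(xmin = date - 0.5, xmax = date + 0.5, ymin = 0, ymax = Inf),
            fill = "orange", alpha = 0.3) +
  geom_line(col = "grey30") +
  geom_point(data = filter(ex, !is.na(event_name_1)), col = "black") +
  labs(x = "", y = "Sales",
       title = "FOODS_2_092_CA_1: SNAP days (shaded) and events (points)")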

Fig. 18

We find:

We can look at price changes in a similar way. Here, I’m first extracting the intervals of constant prices by identifying the change points.

Then I’ll join those intervals, and their prices, to the sales information for our 3 example items. Since price changes happen over a longer time range, here we will look at the entire training data time range rather than just a few months as above. The visualisation is another experiment in which I encode the price as a background colour behind the time series. The idea is to be able to connect changes in price, whether an increase or a decrease, with changes in sales numbers.
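The change-point extraction is folded; a hedged sketch of how a price_intervals table (one row per stretch of constant price, as used in the code below) might be derived, again assuming the raw price table is loaded as prices:

# Hedged sketch; expanding weekly prices to daily rows makes this a large table
price_intervals <- prices %>%
  unite("id", item_id, store_id) %>%
  left_join(calendar %>% select(wm_yr_wk, date), by = "wm_yr_wk") %>%
  group_by(id) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(interval = cumsum(sell_price != lag(sell_price, default = first(sell_price)))) %>%
  group_by(id, interval, sell_price) %>%
  summarise(price_start = min(date), price_end = max(date), .groups = "drop")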

In the following plots, lighter colours mean lower prices. Note the different price ranges for each item:

p1 <- example %>% 
  filter(id == "FOODS_2_092_CA_1") %>% 
  select(id, dates, sales) %>% 
  left_join(price_intervals, by = c("id", "dates" = "price_start")) %>% 
  mutate(price_start = if_else(is.na(price_end), date(NA_character_), dates)) %>% 
  ggplot(aes(dates, sales, group = id)) +
  geom_rect(aes(xmin = price_start, xmax = price_end, ymin = 0, ymax = Inf, fill = sell_price), na.rm = TRUE) +
  #geom_line(col = scales::hue_pal()(3)[1], na.rm = TRUE) +
  geom_line(col = "grey30", na.rm = TRUE) +
  #scale_colour_hue(guide = FALSE) +
  #scale_fill_gradient(low = "grey90", high = "grey70") +
  scale_fill_viridis_c(begin = 1, end = 0.4, alpha = 0.7) +
  theme_hc() +
  theme(legend.position = "right") +
  labs(x = "", y = "Sales", fill = "Price [$]", title = "FOODS_2_092_CA_1")

p2 <- example %>% 
  filter(id == "HOUSEHOLD_2_071_TX_2") %>% 
  select(id, dates, sales) %>% 
  left_join(price_intervals, by = c("id", "dates" = "price_start")) %>% 
  mutate(price_start = if_else(is.na(price_end), date(NA_character_), dates)) %>% 
  ggplot(aes(dates, sales, group = id)) +
  geom_rect(aes(xmin = price_start, xmax = price_end, ymin = 0, ymax = Inf, fill = sell_price), na.rm = TRUE) +
  # geom_line(col = scales::hue_pal()(3)[2], na.rm = TRUE) +
  geom_line(col = "grey30", na.rm = TRUE) +
  #scale_colour_hue(guide = FALSE) +
  # scale_fill_gradient(low = "grey90", high = "grey70") +
  scale_fill_viridis_c(begin = 1, end = 0.4, alpha = 0.7) +
  theme_hc() +
  theme(legend.position = "right") +
  labs(x = "", y = "Sales", fill = "Price [$]", title = "HOUSEHOLD_2_071_TX_2")

p3 <- example %>% 
  filter(id == "HOBBIES_1_348_WI_3") %>% 
  select(id, dates, sales) %>% 
  left_join(price_intervals, by = c("id", "dates" = "price_start")) %>% 
  mutate(price_start = if_else(is.na(price_end), date(NA_character_), dates)) %>% 
  ggplot(aes(dates, sales, group = id)) +
  geom_rect(aes(xmin = price_start, xmax = price_end, ymin = 0, ymax = Inf, fill = sell_price), na.rm = TRUE) +
  # geom_line(col = scales::hue_pal()(3)[3], na.rm = TRUE) +
  geom_line(col = "grey30", na.rm = TRUE) +
  #scale_colour_hue(guide = FALSE) +
  # scale_fill_gradient(low = "grey90", high = "grey70") +
  scale_fill_viridis_c(begin = 1, end = 0.4, alpha = 0.7) +
  theme_hc() +
  theme(legend.position = "right") +
  labs(x = "", y = "Sales", fill = "Price [$]", title = "HOBBIES_1_348_WI_3")

p1 / p2 / p3 + plot_annotation(title = 'Price changes for 3 random items over full training period',
                               subtitle = "Line charts = sales curves. Background colour = sell price. Lighter colours = lower prices")
Fig. 19

We find:

Alright, let’s try to put everything together in another interactive plot. This one is essentially the setup of Fig. 18 for the entire training time range, with the prices from Fig. 19 overlaid as thick orange lines. The plotly tools allow you to zoom and pan each panel individually. In this way, you can explore each time series in different time ranges and resolutions.

Here, the vertical grey background stripes indicate SNAP dates. The black points are events, like in Fig. 18. The thick orange line shows sell prices scaled to relative values that display well in the y-axis range of the sales numbers.

Fig. 20

Those are only 3 example items, so feel free to take this code and use it to explore other time series that you are interested in or that might be causing issues in your models. I’d be curious to know of any specific examples for which you find odd or unexpected patterns.

6 Summary statistics

Let’s make a big jump from the detailed view of individual time series to a collection of overview statistics. Here we attempt to parametrise a sample of time series via a comprehensive set of fundamental parameters.

For the sake of keeping this analysis nimble we will only look at a random 5,000 time series. We can see that the statistics are reasonably representative by comparing the distribution of zero sales to the corresponding full distribution in Fig. 1 (Section 3.4).
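The code is folded; a hedged sketch of the sampling and the zero-share statistic (the original computes more parameters than shown here):

set.seed(1234)   # arbitrary seed, for illustration only

stat_sample <- train %>%
  slice_sample(n = 5000) %>%
  mutate(id = str_remove(id, "_validation")) %>%
  extract_ts() %>%
  group_by(id) %>%
  summarise(zero_share = mean(sales == 0),
            mean_sales = mean(sales),
            max_sales  = max(sales))

stat_sample %>%
  ggplot(aes(zero_share)) +
  geom_histogram(bins = 50) +
  scale_x_continuous(labels = scales::percent) +
  labs(x = "Share of days with zero sales", y = "Number of sampled items")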

Fig. 21

We find:

Let’s spend a bit more time on the zero sales. Far from being an unimportant feature, these instances can tell us quite a lot about the properties of our time series. Here we break it down by year and look at the percentage of zeros (same as the plot above) and the year of the first non-zero instance:

Fig. 22

We find:

A similar analysis can be done for the months of the year. Here our focus is slightly different: instead of looking at the starting year to determine how (in)complete a time series is, we now study which months show the lowest vs highest sales activities.

Fig. 23

We find:

This calls for a heatmap to clear up the picture:

Fig. 24

We find:

We can turn this question on its head, of course, and look at it from the perspective of last effective dates. A “Heads or Tails” view, if you like ;-)

Practically, this means simply looking for the maximum non-zero year/month instead of the minimum one. Everything else stays the same. Now I also add some numbers on top of the tile colours to make the values easier to read.

Fig. 25

We find:

Finally, let’s have a look at price changes. In Fig. 21 we visualised the magnitude of price variations, here we’ll compare their frequency and direction.

First, we will look at the number of price changes per category. This counts the number of times an item changed its price, and then plots the distribution over our three categories FOODS, HOBBIES, and HOUSEHOLD. Then, we study the direction of these price changes: we count how often the price increased and plot this number as a percentage of the total number of price changes. A price increase percentage below 50% indicates that there were more price drops than rises.

To display the price increase percentages we choose violin plots, which give us the global quartile measures while also preserving the shape of the distribution. Note that, since many items only changed price a few times, the underlying data is somewhat discrete; but the violin densities do a pretty good job of visualising differences in distribution shapes. We compare price increase percentages by category and by state:

foo <- stat_prices %>% 
  distinct(id, sell_price) %>% 
  group_by(id) %>% 
  summarise(mean_price = mean(sell_price),
            min_price = min(sell_price),
            max_price = max(sell_price),
            ct = n()) %>%
  mutate(var_price = (max_price - min_price)/ mean_price) %>% 
  separate(id, into = c("cat", "dept", "item", "state", "store"), sep = "_")

bar <- stat_prices %>% 
  distinct(id, sell_price) %>% 
  group_by(id) %>% 
  mutate(rise = (lead(sell_price, 1) - sell_price) > 0) %>% 
  filter(!is.na(rise)) %>% 
  summarise(rise_frac = mean(rise)) %>% 
  separate(id, into = c("cat", "a", "b", "state", "c"), sep = "_") %>% 
  select(-a, -b, -c)


p1 <- foo %>% 
  ggplot(aes(ct, group = cat, fill = cat)) +
#  geom_boxplot() +
#  geom_jitter(height = 0.1) +
  scale_x_log10(breaks = c(1, 2, 5, 10, 20)) +
  geom_density(bw = 0.2, alpha = 0.5) +
  theme_hc() +
  theme(legend.position = "bottom") +
  labs(x = "", y = "", fill = "", title = "# Price changes by Category")

p2 <- bar %>% 
  ggplot(aes(cat, rise_frac, fill = cat)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75), bw = 0.1) +
  scale_y_continuous(labels = scales::percent) +
  theme_hc() +
  theme(legend.position = "none") +
  labs(x = "", y = "", fill = "", title = "% price increases by Category",
       subtitle = "Violin density plot with quartiles")

p3 <- bar %>% 
  mutate(is_tx = state == "TX") %>% 
  ggplot(aes(state, rise_frac, fill = is_tx)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75), bw = 0.1) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c("grey70", "red")) +
  theme_hc() +
  theme(legend.position = "none") +
  labs(x = "", y = "", fill = "", title = "% price increases by State",
       subtitle = "TX is more balanced than CA/WI")

layout <- "
AB
AB
AB
AC
DC
"

p1 + p3 + p2 + guide_area() + plot_layout(design = layout, guides = "collect")
Fig. 26

We find:


Thanks!