5.7 Creating new columns
As we see above, you can create columns with new variables when you use group_by()
and summarize()
. Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in
two columns. For this you can use the mutate()
function. The transmute()
function is similar, but replaces old columns with the new one.
Each row in our dataset represents a week of data. It might be better if we can look at yearly trends.
So, let’s try separating out the year from PeriodStartDate into its own column. In addition to mutate()
, we will also use the year()
function from lubridate
#Get year in separate column from PeriodStartDate
<-
non_cumulative_period %>%
non_cumulative_period mutate(Year = year(PeriodStartDate))
Now we can do our grouping and summarizing again as we did above.
#Get totals for each state each year.
<-
non_cumulative_year %>% group_by(State, Year) %>% summarise(TotalCount = sum(TotalCount)) non_cumulative_period
## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.
non_cumulative_year
## # A tibble: 4,210 x 3
## # Groups: State [56]
## State Year TotalCount
## <chr> <dbl> <dbl>
## 1 ALABAMA 1909 1
## 2 ALABAMA 1910 606
## 3 ALABAMA 1911 587
## 4 ALABAMA 1912 109
## 5 ALABAMA 1913 7
## 6 ALABAMA 1914 4
## 7 ALABAMA 1915 2
## 8 ALABAMA 1916 66
## 9 ALABAMA 1917 5134
## 10 ALABAMA 1918 1651
## # … with 4,200 more rows
We could use a similar method to find the average number of cases per year.
#Get the avg number of counts across all states, each year.
<-
avg_count_by_year %>%
non_cumulative_year group_by(Year) %>%
summarise(Avg = mean(TotalCount))
avg_count_by_year
## # A tibble: 96 x 2
## Year Avg
## <dbl> <dbl>
## 1 1906 112.
## 2 1907 1100.
## 3 1908 1686.
## 4 1909 1609.
## 5 1910 2564.
## 6 1911 1833.
## 7 1912 1897.
## 8 1913 3266.
## 9 1914 1550.
## 10 1915 2329.
## # … with 86 more rows