Introduction & Research Question
In order to deepen practical knowledge in R programming language the following exploratory data analysis (EDA) was implemented. Because of the recent global coronavirus challenges we decided to investigate Germany COVID-19 statistics over the past years. Our research questions were to find out if there is a correlation between covid death rate and the amount of COVID-19 cases overall and where exactly in Germany was seen the highest amount of covid cases and track this dynamics.
The dataset we have used was obtained from Kaggle. The dataset consists of three main csv files: covid_de.csv, demgraphics_de.csv, and covid_de_vaccines.csv.
Let us start with the setup of the working space in R.
Setup
We need to get rid of any possible data in our cache
Load all the libraries which are useful for data analysis
We will use the tidyverse
package to get access to a great variety of packages such as dplyr
and ggplot2
which helps us to clean and visualize the data. The package leaflet
helps us create an interactive clustered map for a better coronavirus cases representation. RColorBrewer
allows us to pick certain color gradients and is an extremely helpful and underrated tool for EDA. Package shiny
is used to create a shiny markdown in order to have a better visualization of our final report. Finally, plotly
helps us create interactive web-based graphs.
Import Data
Let’s import 3 csv.files: covid_de
with all the data regarding covid-19 cases, deaths and recoveries in Germany, demographics_de
with the information about Germany’s population and covid_de_vaccines
with all the vaccination data
Structure of data
Let’s analyse the structure of the 3 datasets. First and foremost, we need to check the structure and extent of data by using the following commands: dim
and str
covid_de_data
## [1] 2102909 8
## tibble [2,102,909 × 8] (S3: tbl_df/tbl/data.frame)
## $ state : Factor w/ 16 levels "Baden-Wuerttemberg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ county : Factor w/ 411 levels "LK Ahrweiler",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ age_group: Factor w/ 6 levels "00-04","05-14",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ gender : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ date : Factor w/ 856 levels "2020-01-02","2020-01-23",..: 51 52 58 256 260 265 268 272 276 279 ...
## $ cases : int [1:2102909] 1 1 1 1 1 1 3 1 1 1 ...
## $ deaths : int [1:2102909] 0 0 0 0 0 0 0 0 0 0 ...
## $ recovered: int [1:2102909] 1 1 1 1 1 1 3 1 1 1 ...
covid_de_data
dataset contains more then 2Mio observation with specified state, country, age, gender, date, cases, death and recovered numbers. Last theree columns are in numeric format, the rest is character.
covid_de_vaccines
## [1] 531 9
## tibble [531 × 9] (S3: tbl_df/tbl/data.frame)
## $ date : Date[1:531], format: "2020-12-27" "2020-12-28" ...
## $ doses : num [1:531] 24421 18007 50055 64066 49939 ...
## $ doses_first : num [1:531] 24421 18007 50055 64066 49939 ...
## $ doses_second : num [1:531] 0 0 0 0 0 0 0 0 0 0 ...
## $ pfizer_cumul : num [1:531] 24412 42417 92471 156536 206473 ...
## $ moderna_cumul : num [1:531] 8 9 9 10 12 27 28 35 46 50 ...
## $ astrazeneca_cumul : num [1:531] 1 2 3 3 3 7 7 7 8 8 ...
## $ persons_first_cumul: num [1:531] 24421 42428 92483 156549 206488 ...
## $ persons_full_cumul : num [1:531] 0 0 0 0 0 0 0 0 0 0 ...
covid_de_vaccines
dataset provides some information about date of the vaccination, amount of first and second doses, and the amount of people being vaccinated with Moderna/Astra Zeneca/ Pfizer. All but one columns are numeric, date column is in Date format.
demographics_de
## [1] 192 4
## tibble [192 × 4] (S3: tbl_df/tbl/data.frame)
## $ state : chr [1:192] "Baden-Wuerttemberg" "Baden-Wuerttemberg" "Baden-Wuerttemberg" "Baden-Wuerttemberg" ...
## $ gender : chr [1:192] "female" "female" "female" "female" ...
## $ age_group : chr [1:192] "00-04" "05-14" "15-34" "35-59" ...
## $ population: num [1:192] 261674 490822 1293488 1919649 1182736 ...
demographics_de
shows state, gender and age grop of the person as well as the population. First three columns are charachter, whereas the last is numeric.
We can now see, whether the number of columns and rows is what we have expected from the original document. Further, we can already see that most of our columns are of type numeric and character.
Row checking
More importantly, we need to make sure the data is indeed imported in the correct manner, by checking some of the rows. It is known that it is usually enough to check the first, last and some arbirtrary entries of the data:
covid_de_data
## # A tibble: 6 × 8
## state county age_group gender date cases deaths recovered
## <fct> <fct> <fct> <fct> <fct> <int> <int> <int>
## 1 Baden-Wuerttemberg LK Alb-Donau… 00-04 F 2020… 1 0 1
## 2 Baden-Wuerttemberg LK Alb-Donau… 00-04 F 2020… 1 0 1
## 3 Baden-Wuerttemberg LK Alb-Donau… 00-04 F 2020… 1 0 1
## 4 Baden-Wuerttemberg LK Alb-Donau… 00-04 F 2020… 1 0 1
## 5 Baden-Wuerttemberg LK Alb-Donau… 00-04 F 2020… 1 0 1
## 6 Baden-Wuerttemberg LK Alb-Donau… 00-04 F 2020… 1 0 1
## # A tibble: 6 × 8
## state county age_group gender date cases deaths recovered
## <fct> <fct> <fct> <fct> <fct> <int> <int> <int>
## 1 Thueringen SK Weimar 80-99 M 2022-05-31 1 0 0
## 2 Thueringen SK Weimar 80-99 M 2022-06-08 1 0 0
## 3 Thueringen SK Weimar 80-99 M 2022-06-10 1 0 0
## 4 Thueringen SK Weimar <NA> F 2021-12-01 1 0 1
## 5 Thueringen SK Weimar <NA> F 2021-12-23 1 0 1
## 6 Thueringen SK Weimar <NA> F 2022-06-02 1 0 0
## # A tibble: 6 × 8
## state county age_group gender date cases deaths recovered
## <fct> <fct> <fct> <fct> <fct> <int> <int> <int>
## 1 Bayern LK Nuernberger Land 15-34 M 2020-12-12 2 0 2
## 2 Bayern LK Nuernberger Land 15-34 M 2020-12-13 6 0 6
## 3 Bayern LK Nuernberger Land 15-34 M 2020-12-14 6 0 6
## 4 Bayern LK Nuernberger Land 15-34 M 2020-12-15 12 0 12
## 5 Bayern LK Nuernberger Land 15-34 M 2020-12-16 11 0 11
## 6 Bayern LK Nuernberger Land 15-34 M 2020-12-17 6 0 6
covid_de_vaccines
## # A tibble: 6 × 9
## date doses doses_first doses_second pfizer_cumul moderna_cumul
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020-12-27 24421 24421 0 24412 8
## 2 2020-12-28 18007 18007 0 42417 9
## 3 2020-12-29 50055 50055 0 92471 9
## 4 2020-12-30 64066 64066 0 156536 10
## 5 2020-12-31 49939 49939 0 206473 12
## 6 2021-01-01 25672 25672 0 232124 27
## # ℹ 3 more variables: astrazeneca_cumul <dbl>, persons_first_cumul <dbl>,
## # persons_full_cumul <dbl>
## # A tibble: 6 × 9
## date doses doses_first doses_second pfizer_cumul moderna_cumul
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2022-06-05 1498 52 130 132885488 30592907
## 2 2022-06-06 1725 60 137 132887059 30593044
## 3 2022-06-07 32279 1047 2146 132915424 30596465
## 4 2022-06-08 40424 1087 2030 132951127 30600894
## 5 2022-06-09 49100 1400 2759 132995444 30605301
## 6 2022-06-10 33180 1068 2095 133024701 30608919
## # ℹ 3 more variables: astrazeneca_cumul <dbl>, persons_first_cumul <dbl>,
## # persons_full_cumul <dbl>
## # A tibble: 6 × 9
## date doses doses_first doses_second pfizer_cumul moderna_cumul
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021-07-14 1055928 239752 815962 63780938 7724684
## 2 2021-07-15 785037 200857 583969 64432602 7795034
## 3 2021-07-16 643614 171083 472409 64961052 7859104
## 4 2021-07-17 279090 89451 186259 65142628 7922390
## 5 2021-07-18 178975 68438 110493 65259065 7964155
## 6 2021-07-19 369095 103108 265895 65546689 8014606
## # ℹ 3 more variables: astrazeneca_cumul <dbl>, persons_first_cumul <dbl>,
## # persons_full_cumul <dbl>
demographics_de
## # A tibble: 6 × 4
## state gender age_group population
## <chr> <chr> <chr> <dbl>
## 1 Baden-Wuerttemberg female 00-04 261674
## 2 Baden-Wuerttemberg female 05-14 490822
## 3 Baden-Wuerttemberg female 15-34 1293488
## 4 Baden-Wuerttemberg female 35-59 1919649
## 5 Baden-Wuerttemberg female 60-79 1182736
## 6 Baden-Wuerttemberg female 80-99 419471
## # A tibble: 6 × 4
## state gender age_group population
## <chr> <chr> <chr> <dbl>
## 1 Thueringen male 00-04 47365
## 2 Thueringen male 05-14 92545
## 3 Thueringen male 15-34 214553
## 4 Thueringen male 35-59 384822
## 5 Thueringen male 60-79 264189
## 6 Thueringen male 80-99 57340
## # A tibble: 6 × 4
## state gender age_group population
## <chr> <chr> <chr> <dbl>
## 1 Niedersachsen female 35-59 1392647
## 2 Niedersachsen female 60-79 914560
## 3 Niedersachsen female 80-99 327165
## 4 Niedersachsen male 00-04 190079
## 5 Niedersachsen male 05-14 369705
## 6 Niedersachsen male 15-34 956570
Now we can see that the data is imported in a way we expected it to be
NA Values
Let’s now check our datasets for na values
covid_de_data
There are some missing values, but overall the data set looks very clean.
Let’s now create a nice little table which helps us with an overview of the missingness and uniqueness of values in our columns. First, we start with calculating the percentage of missing values per column
Now, we can summarise our findings in a small data summary table:
## NAs NAs_rel
## state 0 0.00
## county 0 0.00
## age_group 12366 0.59
## gender 93789 4.46
## date 0 0.00
## cases 0 0.00
## deaths 0 0.00
## recovered 0 0.00
Result: There are a tiny 0.59% of missing data in the column “age_group” and a insignificant number of missing values in the column “gender”. So, our covid_de_data
dataset looks clean enough to explore it.
covid_de_vaccines
There are no missing values in
covid_de_vaccines
dataset so we won’t explore data in every column
demographics_de
There are no missing values in
demographics_de
dataset so we won’t explore data in every column
Now we are ready to do exploratory data analysis with regards to our research questions.
EDA
Covid-19 cases and deaths progression by German State
To begin with, we would like to see the dynamics of coronavirus cases and deaths in each German State over time. So let’s create a line graphs with COVID-19 cases (we will draw this graph from November 2020, not the date of the very first recorded COVID-19 case, since the dynamics of cases progression is better seen this way) and deaths over time.
Cases progression
Deaths progression
From the “Cases progression by German State” chart we can conclude that three German States, namely Nordrhein Westfallen, Bayern and Baden Wuerttemberg have far higher coronavirus cases than the rest of Germany. Regarding deaths progression Bayern and Nordrhein Westfallen have the worst situation, however, the number of deaths in Sachsen, which does not have as many infections as Bayern, Nordrhein Westfallen and Baden Wuerttemberg, is almost the same as in Baden Wuerttemberg.
Now we can clearly see that Nordrhein Westfallen, Bayern and Baden Wuerttemberg have the most number of infections and consequently the largest number of covid- related deaths. Although Sachsen registered more than 2 times less coronavirus cases than Baden Wuerttemberg, its death toll turned out to be almost the same.
Let’s now create a table with the total number of cases and deaths by age groups
## # A tibble: 6 × 3
## age_group Cases Deaths
## <fct> <int> <int>
## 1 00-04 866017 32
## 2 05-14 3828668 31
## 3 15-34 8119047 397
## 4 35-59 9845554 7215
## 5 60-79 2920726 42309
## 6 80-99 941766 89579
Death Rate by age group
Let’s calculate the death rate for each age group and depict it
From this graph we may make a conclusion that the older a person is, the more likelihood there is that the person dies from coronavirus. Almost every tenth person aged 80-99 dies from COVID-19 as the diagram suggests
For further analysis we would like to create a table with the total number of cases and deaths among males and females by different age groups and German States
For a better visuals let’s change the data in the column “gender” - specify that “F” is “female” and “M” is “male”
Let’s now look at the relative number of infected people. For this purpose let’s join the column “population” from demographics_de
dataset and calculate the percent of infected people by each age_group
## # A tibble: 192 × 6
## age_group state gender Cases Deaths population
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 00-04 Baden-Wuerttemberg female 58955 4 261674
## 2 00-04 Baden-Wuerttemberg male 62290 4 274882
## 3 00-04 Bayern female 79856 4 306378
## 4 00-04 Bayern male 84430 2 321629
## 5 00-04 Berlin female 16080 0 93990
## 6 00-04 Berlin male 17152 1 99098
## 7 00-04 Brandenburg female 11936 0 52945
## 8 00-04 Brandenburg male 12607 0 55847
## 9 00-04 Bremen female 2898 0 16202
## 10 00-04 Bremen male 3107 0 17380
## # ℹ 182 more rows
From these pie charts we can clearly see that although the number of women, who caught coronavirus, is a few percents larger, the proportion of deaths turned out to be 6.6% higher for males
Comparison of infection rates among male and female of different age groups
It is evident from this grouped bar chart that in the age group younger than 15 years old there are more males infected. From 15 to 59 years old the proportion of females who caught COVID-19 is few percents higher than that of men. In the most vulnerable age group 80-99 the proportions are distributed almost equally.
Let’s show the comparison of the total coronavirus cases in each German State on the map
For this purpose we need to add coordinates of each State.
Vaccine Analysis
Here we are going to explore our third dataset, covid_de_vaccines
, to analyse the speed of vaccination by each of three vacines. Unfortunately, the data given is limited to the cumulated number of vaccines and date of first & secon doses, so we were not able to join the dataset with the previous ones. However, it is still interesting to know which vaccine was known to be the most popular one and how the trend has changed.
From the graph above we may clearly see the popularity of Pfizer vaccine, where more then 130M people used it by June 2022. Only around 30M people were vaccinated with Moderna, which is followed by nearly 13M getting Astra Zeneca vaccine. Notably, we may observe a speed curve of Pfizer vaccination starting from February 2021.
Now we want to check if the target vaccination rate of 90% stated by the German government was achieved
From these graphs we may conclude that 77.6% of the German population received one dose of the vaccine and 76% are fully vaccinated. However, the target of 90% set by the German government was not achieved yet. Thus, at the moment almost every fourth person has not received a jab against COVID-19 yet.