COVID-19 in Germany - 2022

Introduction to R for Social Scientists

Holionko Kateryna & Myronenko Andrii

28 March, 2023

Introduction & Research Question

In order to deepen practical knowledge in R programming language the following exploratory data analysis (EDA) was implemented. Because of the recent global coronavirus challenges we decided to investigate Germany COVID-19 statistics over the past years. Our research questions were to find out if there is a correlation between covid death rate and the amount of COVID-19 cases overall and where exactly in Germany was seen the highest amount of covid cases and track this dynamics.

The dataset we have used was obtained from Kaggle. The dataset consists of three main csv files: covid_de.csv, demgraphics_de.csv, and covid_de_vaccines.csv.

Let us start with the setup of the working space in R.

Setup

  1. We need to get rid of any possible data in our cache

  2. Load all the libraries which are useful for data analysis

We will use the tidyverse package to get access to a great variety of packages such as dplyr and ggplot2 which helps us to clean and visualize the data. The package leaflet helps us create an interactive clustered map for a better coronavirus cases representation. RColorBrewer allows us to pick certain color gradients and is an extremely helpful and underrated tool for EDA. Package shiny is used to create a shiny markdown in order to have a better visualization of our final report. Finally, plotly helps us create interactive web-based graphs.

Import Data

Let’s import 3 csv.files: covid_de with all the data regarding covid-19 cases, deaths and recoveries in Germany, demographics_de with the information about Germany’s population and covid_de_vaccines with all the vaccination data

Structure of data

Let’s analyse the structure of the 3 datasets. First and foremost, we need to check the structure and extent of data by using the following commands: dim and str

covid_de_data

## [1] 2102909       8
## tibble [2,102,909 × 8] (S3: tbl_df/tbl/data.frame)
##  $ state    : Factor w/ 16 levels "Baden-Wuerttemberg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ county   : Factor w/ 411 levels "LK Ahrweiler",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ age_group: Factor w/ 6 levels "00-04","05-14",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gender   : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ date     : Factor w/ 856 levels "2020-01-02","2020-01-23",..: 51 52 58 256 260 265 268 272 276 279 ...
##  $ cases    : int [1:2102909] 1 1 1 1 1 1 3 1 1 1 ...
##  $ deaths   : int [1:2102909] 0 0 0 0 0 0 0 0 0 0 ...
##  $ recovered: int [1:2102909] 1 1 1 1 1 1 3 1 1 1 ...

covid_de_data dataset contains more then 2Mio observation with specified state, country, age, gender, date, cases, death and recovered numbers. Last theree columns are in numeric format, the rest is character.

covid_de_vaccines

## [1] 531   9
## tibble [531 × 9] (S3: tbl_df/tbl/data.frame)
##  $ date               : Date[1:531], format: "2020-12-27" "2020-12-28" ...
##  $ doses              : num [1:531] 24421 18007 50055 64066 49939 ...
##  $ doses_first        : num [1:531] 24421 18007 50055 64066 49939 ...
##  $ doses_second       : num [1:531] 0 0 0 0 0 0 0 0 0 0 ...
##  $ pfizer_cumul       : num [1:531] 24412 42417 92471 156536 206473 ...
##  $ moderna_cumul      : num [1:531] 8 9 9 10 12 27 28 35 46 50 ...
##  $ astrazeneca_cumul  : num [1:531] 1 2 3 3 3 7 7 7 8 8 ...
##  $ persons_first_cumul: num [1:531] 24421 42428 92483 156549 206488 ...
##  $ persons_full_cumul : num [1:531] 0 0 0 0 0 0 0 0 0 0 ...

covid_de_vaccines dataset provides some information about date of the vaccination, amount of first and second doses, and the amount of people being vaccinated with Moderna/Astra Zeneca/ Pfizer. All but one columns are numeric, date column is in Date format.

demographics_de

## [1] 192   4
## tibble [192 × 4] (S3: tbl_df/tbl/data.frame)
##  $ state     : chr [1:192] "Baden-Wuerttemberg" "Baden-Wuerttemberg" "Baden-Wuerttemberg" "Baden-Wuerttemberg" ...
##  $ gender    : chr [1:192] "female" "female" "female" "female" ...
##  $ age_group : chr [1:192] "00-04" "05-14" "15-34" "35-59" ...
##  $ population: num [1:192] 261674 490822 1293488 1919649 1182736 ...

demographics_de shows state, gender and age grop of the person as well as the population. First three columns are charachter, whereas the last is numeric.

We can now see, whether the number of columns and rows is what we have expected from the original document. Further, we can already see that most of our columns are of type numeric and character.

Row checking

More importantly, we need to make sure the data is indeed imported in the correct manner, by checking some of the rows. It is known that it is usually enough to check the first, last and some arbirtrary entries of the data:

covid_de_data

## # A tibble: 6 × 8
##   state              county        age_group gender date  cases deaths recovered
##   <fct>              <fct>         <fct>     <fct>  <fct> <int>  <int>     <int>
## 1 Baden-Wuerttemberg LK Alb-Donau… 00-04     F      2020…     1      0         1
## 2 Baden-Wuerttemberg LK Alb-Donau… 00-04     F      2020…     1      0         1
## 3 Baden-Wuerttemberg LK Alb-Donau… 00-04     F      2020…     1      0         1
## 4 Baden-Wuerttemberg LK Alb-Donau… 00-04     F      2020…     1      0         1
## 5 Baden-Wuerttemberg LK Alb-Donau… 00-04     F      2020…     1      0         1
## 6 Baden-Wuerttemberg LK Alb-Donau… 00-04     F      2020…     1      0         1
## # A tibble: 6 × 8
##   state      county    age_group gender date       cases deaths recovered
##   <fct>      <fct>     <fct>     <fct>  <fct>      <int>  <int>     <int>
## 1 Thueringen SK Weimar 80-99     M      2022-05-31     1      0         0
## 2 Thueringen SK Weimar 80-99     M      2022-06-08     1      0         0
## 3 Thueringen SK Weimar 80-99     M      2022-06-10     1      0         0
## 4 Thueringen SK Weimar <NA>      F      2021-12-01     1      0         1
## 5 Thueringen SK Weimar <NA>      F      2021-12-23     1      0         1
## 6 Thueringen SK Weimar <NA>      F      2022-06-02     1      0         0
## # A tibble: 6 × 8
##   state  county              age_group gender date       cases deaths recovered
##   <fct>  <fct>               <fct>     <fct>  <fct>      <int>  <int>     <int>
## 1 Bayern LK Nuernberger Land 15-34     M      2020-12-12     2      0         2
## 2 Bayern LK Nuernberger Land 15-34     M      2020-12-13     6      0         6
## 3 Bayern LK Nuernberger Land 15-34     M      2020-12-14     6      0         6
## 4 Bayern LK Nuernberger Land 15-34     M      2020-12-15    12      0        12
## 5 Bayern LK Nuernberger Land 15-34     M      2020-12-16    11      0        11
## 6 Bayern LK Nuernberger Land 15-34     M      2020-12-17     6      0         6

covid_de_vaccines

## # A tibble: 6 × 9
##   date       doses doses_first doses_second pfizer_cumul moderna_cumul
##   <date>     <dbl>       <dbl>        <dbl>        <dbl>         <dbl>
## 1 2020-12-27 24421       24421            0        24412             8
## 2 2020-12-28 18007       18007            0        42417             9
## 3 2020-12-29 50055       50055            0        92471             9
## 4 2020-12-30 64066       64066            0       156536            10
## 5 2020-12-31 49939       49939            0       206473            12
## 6 2021-01-01 25672       25672            0       232124            27
## # ℹ 3 more variables: astrazeneca_cumul <dbl>, persons_first_cumul <dbl>,
## #   persons_full_cumul <dbl>
## # A tibble: 6 × 9
##   date       doses doses_first doses_second pfizer_cumul moderna_cumul
##   <date>     <dbl>       <dbl>        <dbl>        <dbl>         <dbl>
## 1 2022-06-05  1498          52          130    132885488      30592907
## 2 2022-06-06  1725          60          137    132887059      30593044
## 3 2022-06-07 32279        1047         2146    132915424      30596465
## 4 2022-06-08 40424        1087         2030    132951127      30600894
## 5 2022-06-09 49100        1400         2759    132995444      30605301
## 6 2022-06-10 33180        1068         2095    133024701      30608919
## # ℹ 3 more variables: astrazeneca_cumul <dbl>, persons_first_cumul <dbl>,
## #   persons_full_cumul <dbl>
## # A tibble: 6 × 9
##   date         doses doses_first doses_second pfizer_cumul moderna_cumul
##   <date>       <dbl>       <dbl>        <dbl>        <dbl>         <dbl>
## 1 2021-07-14 1055928      239752       815962     63780938       7724684
## 2 2021-07-15  785037      200857       583969     64432602       7795034
## 3 2021-07-16  643614      171083       472409     64961052       7859104
## 4 2021-07-17  279090       89451       186259     65142628       7922390
## 5 2021-07-18  178975       68438       110493     65259065       7964155
## 6 2021-07-19  369095      103108       265895     65546689       8014606
## # ℹ 3 more variables: astrazeneca_cumul <dbl>, persons_first_cumul <dbl>,
## #   persons_full_cumul <dbl>

demographics_de

## # A tibble: 6 × 4
##   state              gender age_group population
##   <chr>              <chr>  <chr>          <dbl>
## 1 Baden-Wuerttemberg female 00-04         261674
## 2 Baden-Wuerttemberg female 05-14         490822
## 3 Baden-Wuerttemberg female 15-34        1293488
## 4 Baden-Wuerttemberg female 35-59        1919649
## 5 Baden-Wuerttemberg female 60-79        1182736
## 6 Baden-Wuerttemberg female 80-99         419471
## # A tibble: 6 × 4
##   state      gender age_group population
##   <chr>      <chr>  <chr>          <dbl>
## 1 Thueringen male   00-04          47365
## 2 Thueringen male   05-14          92545
## 3 Thueringen male   15-34         214553
## 4 Thueringen male   35-59         384822
## 5 Thueringen male   60-79         264189
## 6 Thueringen male   80-99          57340
## # A tibble: 6 × 4
##   state         gender age_group population
##   <chr>         <chr>  <chr>          <dbl>
## 1 Niedersachsen female 35-59        1392647
## 2 Niedersachsen female 60-79         914560
## 3 Niedersachsen female 80-99         327165
## 4 Niedersachsen male   00-04         190079
## 5 Niedersachsen male   05-14         369705
## 6 Niedersachsen male   15-34         956570

Now we can see that the data is imported in a way we expected it to be

NA Values

Let’s now check our datasets for na values

covid_de_data

There are some missing values, but overall the data set looks very clean.

Let’s now create a nice little table which helps us with an overview of the missingness and uniqueness of values in our columns. First, we start with calculating the percentage of missing values per column

Now, we can summarise our findings in a small data summary table:

##             NAs NAs_rel
## state         0    0.00
## county        0    0.00
## age_group 12366    0.59
## gender    93789    4.46
## date          0    0.00
## cases         0    0.00
## deaths        0    0.00
## recovered     0    0.00

Result: There are a tiny 0.59% of missing data in the column “age_group” and a insignificant number of missing values in the column “gender”. So, our covid_de_data dataset looks clean enough to explore it.

covid_de_vaccines

There are no missing values in covid_de_vaccines dataset so we won’t explore data in every column

demographics_de

There are no missing values in demographics_de dataset so we won’t explore data in every column

Now we are ready to do exploratory data analysis with regards to our research questions.

EDA

Covid-19 cases and deaths progression by German State

To begin with, we would like to see the dynamics of coronavirus cases and deaths in each German State over time. So let’s create a line graphs with COVID-19 cases (we will draw this graph from November 2020, not the date of the very first recorded COVID-19 case, since the dynamics of cases progression is better seen this way) and deaths over time.

Cases progression

Deaths progression

From the “Cases progression by German State” chart we can conclude that three German States, namely Nordrhein Westfallen, Bayern and Baden Wuerttemberg have far higher coronavirus cases than the rest of Germany. Regarding deaths progression Bayern and Nordrhein Westfallen have the worst situation, however, the number of deaths in Sachsen, which does not have as many infections as Bayern, Nordrhein Westfallen and Baden Wuerttemberg, is almost the same as in Baden Wuerttemberg.

German States with the most coronavirus cases and deaths

Let’s breakdown German States by the largest total number of coronavirus cases and deaths

States with the greatest number of cases

States with the greatest number of deaths

Now we can clearly see that Nordrhein Westfallen, Bayern and Baden Wuerttemberg have the most number of infections and consequently the largest number of covid- related deaths. Although Sachsen registered more than 2 times less coronavirus cases than Baden Wuerttemberg, its death toll turned out to be almost the same.

Let’s now create a table with the total number of cases and deaths by age groups

## # A tibble: 6 × 3
##   age_group   Cases Deaths
##   <fct>       <int>  <int>
## 1 00-04      866017     32
## 2 05-14     3828668     31
## 3 15-34     8119047    397
## 4 35-59     9845554   7215
## 5 60-79     2920726  42309
## 6 80-99      941766  89579

Breakdown of coronavirus cases and deaths by age group

Total coronavirus cases by age group

Total coronavirus deaths by age group

##{-}

It is evident from the charts that the most number of coronavirus cases is recorded in the age group from 35 to 59, while the greatest number of deaths occur in 80-99 age group, despite this group has the second lowest number of cases.

Death Rate by age group

Let’s calculate the death rate for each age group and depict it

From this graph we may make a conclusion that the older a person is, the more likelihood there is that the person dies from coronavirus. Almost every tenth person aged 80-99 dies from COVID-19 as the diagram suggests

For further analysis we would like to create a table with the total number of cases and deaths among males and females by different age groups and German States

For a better visuals let’s change the data in the column “gender” - specify that “F” is “female” and “M” is “male”

Let’s now look at the relative number of infected people. For this purpose let’s join the column “population” from demographics_de dataset and calculate the percent of infected people by each age_group

## # A tibble: 192 × 6
##    age_group state              gender Cases Deaths population
##    <chr>     <chr>              <chr>  <int>  <int>      <dbl>
##  1 00-04     Baden-Wuerttemberg female 58955      4     261674
##  2 00-04     Baden-Wuerttemberg male   62290      4     274882
##  3 00-04     Bayern             female 79856      4     306378
##  4 00-04     Bayern             male   84430      2     321629
##  5 00-04     Berlin             female 16080      0      93990
##  6 00-04     Berlin             male   17152      1      99098
##  7 00-04     Brandenburg        female 11936      0      52945
##  8 00-04     Brandenburg        male   12607      0      55847
##  9 00-04     Bremen             female  2898      0      16202
## 10 00-04     Bremen             male    3107      0      17380
## # ℹ 182 more rows

Coronavirus cases and deaths distribution by gender

Covid-19 cases distribution by gender

Covid-19 deaths distribution by gender

From these pie charts we can clearly see that although the number of women, who caught coronavirus, is a few percents larger, the proportion of deaths turned out to be 6.6% higher for males

Comparison of infection rates among male and female of different age groups

It is evident from this grouped bar chart that in the age group younger than 15 years old there are more males infected. From 15 to 59 years old the proportion of females who caught COVID-19 is few percents higher than that of men. In the most vulnerable age group 80-99 the proportions are distributed almost equally.

Let’s show the comparison of the total coronavirus cases in each German State on the map

For this purpose we need to add coordinates of each State.

Coronavirus cases distribution map view

Vaccine Analysis

Here we are going to explore our third dataset, covid_de_vaccines, to analyse the speed of vaccination by each of three vacines. Unfortunately, the data given is limited to the cumulated number of vaccines and date of first & secon doses, so we were not able to join the dataset with the previous ones. However, it is still interesting to know which vaccine was known to be the most popular one and how the trend has changed.

From the graph above we may clearly see the popularity of Pfizer vaccine, where more then 130M people used it by June 2022. Only around 30M people were vaccinated with Moderna, which is followed by nearly 13M getting Astra Zeneca vaccine. Notably, we may observe a speed curve of Pfizer vaccination starting from February 2021.

Now we want to check if the target vaccination rate of 90% stated by the German government was achieved

From these graphs we may conclude that 77.6% of the German population received one dose of the vaccine and 76% are fully vaccinated. However, the target of 90% set by the German government was not achieved yet. Thus, at the moment almost every fourth person has not received a jab against COVID-19 yet.