This is a quick data analysis I did on the Kigali weather data. Any comments and recommendations are welcome.
I first got rid of some unnecessary columns such as station_id, latitude and wind direction, since they are quite static and won’t be useful here. Then I ran a summary of the data.
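For reference, dropping columns like these could look like the sketch below. The data frame `raw`, the exact column names (`station_id`, `latitude`, `wind_direction`) and the values in those columns are assumptions made up for illustration; only the weather values are taken from rows shown later in this post.

```r
# Toy stand-in for the raw station data. The three dropped columns and their
# values are assumed, not taken from the real file.
raw <- data.frame(
  station_id     = c("KGL", "KGL"),
  latitude       = c(-1.97, -1.97),
  wind_direction = c(90, 120),
  date           = c("21/04/1982", "22/04/1982"),
  precip         = c(91.9, 3.7),
  tmp_min        = c(14.4, 14.2),
  tmp_max        = c(27.2, 20.7),
  stringsAsFactors = FALSE
)

# Columns that barely change for a single station or are not analysed here.
drop_cols <- c("station_id", "latitude", "wind_direction")

# Keep everything except the columns listed above.
kigali_temp <- raw[, !(names(raw) %in% drop_cols), drop = FALSE]

names(kigali_temp)
```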
summary(kigali_temp)
## date year month day_in_month
## Length:17410 Min. :1971 Length:17410 Min. : 1.00
## Class :character 1st Qu.:1982 Class :character 1st Qu.: 8.00
## Mode :character Median :1994 Mode :character Median :16.00
## Mean :1994 Mean :15.73
## 3rd Qu.:2006 3rd Qu.:23.00
## Max. :2018 Max. :31.00
##
## doy precip tmp_min tmp_max
## Min. : 1.0 Min. : 0.000 Min. : 7.70 Min. :17.00
## 1st Qu.: 92.0 1st Qu.: 0.000 1st Qu.:14.80 1st Qu.:25.50
## Median :183.0 Median : 0.000 Median :15.70 Median :27.00
## Mean :182.9 Mean : 2.653 Mean :15.67 Mean :26.81
## 3rd Qu.:274.0 3rd Qu.: 1.800 3rd Qu.:16.60 3rd Qu.:28.20
## Max. :366.0 Max. :106.000 Max. :20.80 Max. :35.40
## NA's :93 NA's :166 NA's :158
Nothing really stands out and there are no abnormal values, except for the maximum precipitation: 106 mm would indicate a very heavy rainfall day, so I’ll look into it later. There are also some missing values, which I’ll look into too.
#checking for duplicate dates: none
anyDuplicated(kigali_temp$date)
## [1] 0
#do we have instances where tmin was greater than or equal to tmax? Nope
library(dplyr) #filter() comes from dplyr
filter(kigali_temp, tmp_min >= tmp_max)
## # A tibble: 0 x 8
## # ... with 8 variables: date <chr>, year <dbl>, month <chr>,
## # day_in_month <dbl>, doy <dbl>, precip <dbl>, tmp_min <dbl>,
## # tmp_max <dbl>
#days with more than 90 mm of rain
filter(kigali_temp, precip > 90)
## # A tibble: 2 x 8
## date year month day_in_month doy precip tmp_min tmp_max
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21/04/1982 1982 Apr 21 112 91.9 14.4 27.2
## 2 22/07/2001 2001 Jul 22 204 106 15.4 27.6
The first high value (91.9 mm) might make sense since it was in April, which is one of the wettest months in Rwanda.
As for the second value, it looks like it might be a mistake, since July is when the dry season is in full swing.
I went further and looked at the days that preceded and followed both of these high values.
with(kigali_temp, kigali_temp[(date >= "1982-04-19" & date <= "1982-04-24") |
(date>="2001-07-19" & date<="2001-07-25"), ])
## # A tibble: 13 x 8
## date year month day_in_month doy precip tmp_min tmp_max
## <date> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1982-04-19 1982 Apr 19 110 0 15.6 23.9
## 2 1982-04-20 1982 Apr 20 111 0 15.8 25.2
## 3 1982-04-21 1982 Apr 21 112 91.9 14.4 27.2
## 4 1982-04-22 1982 Apr 22 113 3.7 14.2 20.7
## 5 1982-04-23 1982 Apr 23 114 14.6 12.8 26.2
## 6 1982-04-24 1982 Apr 24 115 2.1 12.6 24.5
## 7 2001-07-19 2001 Jul 19 201 1.1 16 26.2
## 8 2001-07-20 2001 Jul 20 202 9.6 16 27.8
## 9 2001-07-21 2001 Jul 21 203 0.1 15.8 27.2
## 10 2001-07-22 2001 Jul 22 204 106 15.4 27.6
## 11 2001-07-23 2001 Jul 23 205 0 15.2 29.2
## 12 2001-07-24 2001 Jul 24 206 0 16 26.4
## 13 2001-07-25 2001 Jul 25 207 0 17 27.4
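Note that `date` is a character column in the earlier output and a `Date` in this one, so a conversion happened in between that isn't shown. A minimal sketch of that step, assuming the strings are in day/month/year form as in the rows printed earlier, plus a small hypothetical helper `window_around()` for pulling the days around an event:

```r
# Two of the rows printed earlier, with date still stored as text.
kigali_temp <- data.frame(
  date   = c("21/04/1982", "22/07/2001"),
  precip = c(91.9, 106),
  stringsAsFactors = FALSE
)

# Parse day/month/year strings into Date objects.
kigali_temp$date <- as.Date(kigali_temp$date, format = "%d/%m/%Y")

# Hypothetical helper: rows from `before` days ahead of `day` up to
# `after` days past it (the defaults are arbitrary).
window_around <- function(df, day, before = 3, after = 3) {
  day <- as.Date(day)
  df[df$date >= day - before & df$date <= day + after, , drop = FALSE]
}

window_around(kigali_temp, "2001-07-22")
```

Working with real `Date` objects also makes date arithmetic and `seq()` over a date range straightforward.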
There are no missing dates.
date_range <- seq(min(kigali_temp$date), max(kigali_temp$date), by = 1)
date_range[!date_range %in% kigali_temp$date]
## Date of length 0
I found a nice package that reports on missing values with some nice graphics. Have a look below:
##
## Variables sorted by number of missings:
## Variable Count
## tmp_min 0.009534750
## tmp_max 0.009075244
## precip 0.005341758
## date 0.000000000
## year 0.000000000
## month 0.000000000
## day_in_month 0.000000000
## doy 0.000000000
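The code that produced this output isn't shown. Note that the numbers under `Count` are proportions of rows rather than counts (for example, 166 / 17410 ≈ 0.0095 for `tmp_min`). The same kind of figure can be obtained in base R; a sketch on a made-up toy data frame:

```r
# Toy data with a few gaps; the values are made up for illustration.
toy <- data.frame(
  precip  = c(0, NA, 1.2, 0),
  tmp_min = c(15.1, NA, NA, 14.8),
  tmp_max = c(26.0, NA, NA, NA)
)

# is.na() gives a logical matrix; the column means are the share of
# missing values per column. Sort so the worst column comes first.
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
missing_share
```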
Ok, enough of the graphics, let’s dig a bit deeper.
## Combinations Count Percent
## 1 0:0:0:0:0:0:0:0 17237 99.00631821
## 6 0:0:0:0:0:1:1:1 91 0.52268811
## 4 0:0:0:0:0:0:1:1 62 0.35611717
## 3 0:0:0:0:0:0:1:0 13 0.07466973
## 2 0:0:0:0:0:0:0:1 5 0.02871913
## 5 0:0:0:0:0:1:0:0 2 0.01148765
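Each combination string has one digit per column, in column order, with 1 marking a missing value. So `0:0:0:0:0:1:1:1` means precip, tmp_min and tmp_max are all missing. A table of this kind can be rebuilt in base R, as in the sketch below on made-up toy data with only three columns:

```r
# Toy data; the values are made up for illustration.
toy <- data.frame(
  precip  = c(0, NA, 1.2, 0, NA, 0.4),
  tmp_min = c(15.1, NA, NA, 14.8, NA, 15.0),
  tmp_max = c(26.0, NA, NA, 27.3, NA, 26.5)
)

# 1 = missing, 0 = present, one digit per column, joined with ":".
pattern <- apply(is.na(toy) * 1L, 1, paste, collapse = ":")

# Count each pattern and express it as a percentage of all rows.
counts <- sort(table(pattern), decreasing = TRUE)
combos <- data.frame(
  Combinations = names(counts),
  Count        = as.vector(counts),
  Percent      = 100 * as.vector(counts) / nrow(toy)
)
combos
```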
There are 91 occurrences where tmin, tmax and precip are all missing, and another 62 days where both tmin and tmax are missing but precip is present. Let’s investigate this:
kigali_temp[is.na(kigali_temp$precip) & is.na(kigali_temp$tmp_min),]
## # A tibble: 91 x 8
## date year month day_in_month doy precip tmp_min tmp_max
## <date> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1994-05-19 1994 May 19 140 NA NA NA
## 2 1994-05-20 1994 May 20 141 NA NA NA
## 3 1994-05-21 1994 May 21 142 NA NA NA
## 4 1994-05-22 1994 May 22 143 NA NA NA
## 5 1994-05-23 1994 May 23 144 NA NA NA
## 6 1994-05-24 1994 May 24 145 NA NA NA
## 7 1994-05-25 1994 May 25 146 NA NA NA
## 8 1994-05-26 1994 May 26 147 NA NA NA
## 9 1994-05-27 1994 May 27 148 NA NA NA
## 10 1994-05-28 1994 May 28 149 NA NA NA
## # ... with 81 more rows
The days where precipitation, tmin, and tmax were all missing at the same time span from 19/05/1994 to 18/08/1994. The genocide against the Tutsi began on April 7, 1994 and ended in July 1994, so I don’t think anybody was in a position to record weather data around this time.
kigali_temp[is.na(kigali_temp$tmp_min) & is.na(kigali_temp$tmp_max) & !is.na(kigali_temp$precip),]
## # A tibble: 62 x 8
## date year month day_in_month doy precip tmp_min tmp_max
## <date> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1989-06-06 1989 Jun 6 158 0.4 NA NA
## 2 1994-03-01 1994 Mar 1 61 1.1 NA NA
## 3 1994-03-02 1994 Mar 2 62 0.9 NA NA
## 4 1994-03-03 1994 Mar 3 63 37.8 NA NA
## 5 1994-03-04 1994 Mar 4 64 1 NA NA
## 6 1994-03-05 1994 Mar 5 65 32.2 NA NA
## 7 1994-03-06 1994 Mar 6 66 0 NA NA
## 8 1994-03-07 1994 Mar 7 67 0 NA NA
## 9 1994-03-08 1994 Mar 8 68 30.7 NA NA
## 10 1994-03-09 1994 Mar 9 69 0 NA NA
## # ... with 52 more rows
The same scenario applies to the days where tmin and tmax were both missing while precip was recorded. They span from 01/03/1994 to 30/04/1994. There’s also one such day on 06/06/1989.
There is a small window from 01/05/1994 to 18/05/1994 during which all three variables were recorded again.
One surprising thing is that they stopped recording temperatures in March but didn’t stop recording precipitation until mid-May. Maybe this is because knowing the precipitation is more important than knowing the temperature when dealing with air traffic.
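The start and end of each gap can also be read off programmatically rather than by scrolling through the rows. A sketch, using a hypothetical helper `na_runs()` and a short made-up series (not the real data):

```r
# Hypothetical helper: summarise consecutive runs of days where `x` is NA.
na_runs <- function(date, x) {
  miss <- sort(date[is.na(x)])
  if (length(miss) == 0) {
    return(data.frame(start = as.Date(character()),
                      end   = as.Date(character()),
                      days  = integer()))
  }
  # A new run starts whenever the gap to the previous missing day exceeds 1.
  run_id <- cumsum(c(TRUE, diff(miss) > 1))
  first  <- !duplicated(run_id)
  last   <- !duplicated(run_id, fromLast = TRUE)
  data.frame(
    start = miss[first],
    end   = miss[last],
    days  = as.integer(miss[last] - miss[first]) + 1L
  )
}

# Made-up series: ten consecutive days with two gaps.
dates <- as.Date("2000-01-01") + 0:9
x     <- c(1, NA, NA, NA, 2, 3, NA, NA, 4, 5)

na_runs(dates, x)
```

On the real data this could be called once per variable, for example `na_runs(kigali_temp$date, kigali_temp$tmp_min)`, assuming `date` has already been converted to a `Date`.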
In conclusion, I think we should probably leave the year 1994 out of the modelling phase.
There’s no real evidence in the graph of a change in the rainfall pattern, except maybe in February, where the amount has been low.
There’s a slight decrease in January and an increase in February, which might explain the recent flooding. July has also been extremely dry.
## Warning: Removed 158 rows containing non-finite values (stat_boxplot).
The weather in Kigali can be categorized into four seasons:
Short rainy season: October - December.
Short dry season: January - February.
Long rainy season: March - mid-June.
Long sunny season: mid-June - September.
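These seasons can be attached to each observation for later grouping. A sketch with a hypothetical `kigali_season()` function; treating "mid-June" as 15 June is my assumption, since the exact cut-off isn't defined above.

```r
# Hypothetical helper: map dates to the four seasons listed above.
# Assumption: "mid-June" is taken to mean 15 June.
kigali_season <- function(date) {
  date <- as.Date(date)
  m <- as.integer(format(date, "%m"))  # month number, 1-12
  d <- as.integer(format(date, "%d"))  # day of month

  season <- rep(NA_character_, length(date))
  season[m >= 10]                         <- "short rainy"
  season[m <= 2]                          <- "short dry"
  season[m %in% 3:5 | (m == 6 & d < 15)]  <- "long rainy"
  season[m %in% 7:9 | (m == 6 & d >= 15)] <- "long sunny"
  season
}

kigali_season(c("1982-04-21", "2001-07-22", "1994-06-20", "2018-12-31"))
```

With a `season` column in place, rainfall and temperature could be summarised per season instead of per month.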