This is a quick data analysis I did on the Kigali weather data. Any comments and recommendations are welcome.

Quick look through

I first dropped some unnecessary columns such as station_id, latitude, and wind direction, since they are essentially static and won’t be useful. Then I ran a summary of the data.

summary(kigali_temp)
##      date                year         month            day_in_month  
##  Length:17410       Min.   :1971   Length:17410       Min.   : 1.00  
##  Class :character   1st Qu.:1982   Class :character   1st Qu.: 8.00  
##  Mode  :character   Median :1994   Mode  :character   Median :16.00  
##                     Mean   :1994                      Mean   :15.73  
##                     3rd Qu.:2006                      3rd Qu.:23.00  
##                     Max.   :2018                      Max.   :31.00  
##                                                                      
##       doy            precip           tmp_min         tmp_max     
##  Min.   :  1.0   Min.   :  0.000   Min.   : 7.70   Min.   :17.00  
##  1st Qu.: 92.0   1st Qu.:  0.000   1st Qu.:14.80   1st Qu.:25.50  
##  Median :183.0   Median :  0.000   Median :15.70   Median :27.00  
##  Mean   :182.9   Mean   :  2.653   Mean   :15.67   Mean   :26.81  
##  3rd Qu.:274.0   3rd Qu.:  1.800   3rd Qu.:16.60   3rd Qu.:28.20  
##  Max.   :366.0   Max.   :106.000   Max.   :20.80   Max.   :35.40  
##                  NA's   :93        NA's   :166     NA's   :158

Nothing really stands out and there are no abnormal values, except for the maximum precipitation: 106 mm would indicate a very heavy rainfall day, so I will look into it later. There are also some missing values, which will be looked into too.

Let’s check for abnormalities

#checking for duplicates
anyDuplicated(kigali_temp$date)#none
## [1] 0
#do we have instances where tmin was bigger than tmax? Nope
filter(kigali_temp, tmp_min>=tmp_max)
## # A tibble: 0 x 8
## # ... with 8 variables: date <chr>, year <dbl>, month <chr>,
## #   day_in_month <dbl>, doy <dbl>, precip <dbl>, tmp_min <dbl>,
## #   tmp_max <dbl>

Outliers

filter(kigali_temp, precip>90)
## # A tibble: 2 x 8
##   date        year month day_in_month   doy precip tmp_min tmp_max
##   <chr>      <dbl> <chr>        <dbl> <dbl>  <dbl>   <dbl>   <dbl>
## 1 21/04/1982  1982 Apr             21   112   91.9    14.4    27.2
## 2 22/07/2001  2001 Jul             22   204  106      15.4    27.6

The first high value (91.9 mm) might make sense since it was in April, which is one of the wettest months in Rwanda.

As for the second value, it looks like it might be a mistake, since July is when the dry season is in full swing.
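One way to put numbers behind this seasonal argument is to compare each extreme against the usual daily rainfall for its calendar month. Below is a minimal sketch on a toy data frame that reuses the `month` and `precip` column names. The values are made up for illustration, and `monthly_precip_quantile` is a hypothetical helper, not something from the original analysis.

```r
# Hypothetical helper: a chosen quantile of daily rainfall per calendar month.
monthly_precip_quantile <- function(df, prob = 0.99) {
  sapply(split(df$precip, df$month), quantile,
         probs = prob, na.rm = TRUE, names = FALSE)
}

# Toy data, NOT the Kigali records.
toy <- data.frame(
  month  = c("Apr", "Apr", "Apr", "Jul", "Jul", "Jul"),
  precip = c(12.0, 30.5, NA, 0.0, 0.2, 1.1)
)

# Median daily rainfall per month in the toy data.
q <- monthly_precip_quantile(toy, prob = 0.5)
q
```

On the real data, a July value far above the July quantile would support the suspicion that 106 mm is a recording error, while an April value close to the April quantile would look less alarming.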

I went further and looked at the days that preceded and followed both of these high values.

with(kigali_temp, kigali_temp[(date >= "1982-04-19" & date <= "1982-04-24") | 
              (date>="2001-07-19" & date<="2001-07-25"), ])
## # A tibble: 13 x 8
##    date        year month day_in_month   doy precip tmp_min tmp_max
##    <date>     <dbl> <chr>        <dbl> <dbl>  <dbl>   <dbl>   <dbl>
##  1 1982-04-19  1982 Apr             19   110    0      15.6    23.9
##  2 1982-04-20  1982 Apr             20   111    0      15.8    25.2
##  3 1982-04-21  1982 Apr             21   112   91.9    14.4    27.2
##  4 1982-04-22  1982 Apr             22   113    3.7    14.2    20.7
##  5 1982-04-23  1982 Apr             23   114   14.6    12.8    26.2
##  6 1982-04-24  1982 Apr             24   115    2.1    12.6    24.5
##  7 2001-07-19  2001 Jul             19   201    1.1    16      26.2
##  8 2001-07-20  2001 Jul             20   202    9.6    16      27.8
##  9 2001-07-21  2001 Jul             21   203    0.1    15.8    27.2
## 10 2001-07-22  2001 Jul             22   204  106      15.4    27.6
## 11 2001-07-23  2001 Jul             23   205    0      15.2    29.2
## 12 2001-07-24  2001 Jul             24   206    0      16      26.4
## 13 2001-07-25  2001 Jul             25   207    0      17      27.4
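Note that the `date` column is printed as `<chr>` in the earlier output and as `<date>` here, so a conversion happened in between that isn't shown. It was presumably something along these lines (an assumption on my part, based on the day/month/year strings printed earlier):

```r
# Day/month/year strings, as printed in the earlier output.
raw_dates <- c("21/04/1982", "22/07/2001")

# Parse with an explicit format; strings that do not match become NA.
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

# Comparisons now follow the calendar rather than alphabetical order.
in_window <- parsed >= as.Date("1982-04-19") & parsed <= as.Date("1982-04-24")
in_window
```

Without this step, comparing strings like "21/04/1982" against "1982-04-19" would be done character by character and the window filter would not select the intended days.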

Investigating missing values

There are no missing dates.

date_range <- seq(min(kigali_temp$date), max(kigali_temp$date), by = 1) 
date_range[!date_range %in% kigali_temp$date] 
## Date of length 0

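I found a nice package that reports on missing values with some nice graphics. Have a look below. Note that the Count column in the output is the proportion of rows that are missing (for example 166/17410 for tmp_min), not a raw count.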

## 
##  Variables sorted by number of missings: 
##      Variable       Count
##       tmp_min 0.009534750
##       tmp_max 0.009075244
##        precip 0.005341758
##          date 0.000000000
##          year 0.000000000
##         month 0.000000000
##  day_in_month 0.000000000
##           doy 0.000000000
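The shares in the table above can also be reproduced in base R without any extra package. Here is a minimal sketch on a toy data frame (made-up values, same column names as the three weather variables):

```r
# Toy data, NOT the Kigali records.
toy <- data.frame(
  precip  = c(0.0, NA, 1.2, 3.4),
  tmp_min = c(15.1, NA, NA, 14.8),
  tmp_max = c(27.0, NA, 26.5, 28.1)
)

# is.na() gives a TRUE/FALSE matrix; column means are the share of NAs.
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
na_share
```

Swapping `colMeans` for `colSums` gives the raw counts instead, which should match the NA's row of `summary()`.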

Ok, enough of the graphics, let’s dig a bit deeper.

##      Combinations Count     Percent
## 1 0:0:0:0:0:0:0:0 17237 99.00631821
## 6 0:0:0:0:0:1:1:1    91  0.52268811
## 4 0:0:0:0:0:0:1:1    62  0.35611717
## 3 0:0:0:0:0:0:1:0    13  0.07466973
## 2 0:0:0:0:0:0:0:1     5  0.02871913
## 5 0:0:0:0:0:1:0:0     2  0.01148765
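Each combination string has one position per column, in column order, with 1 meaning missing; the last three positions are therefore precip, tmp_min, and tmp_max. A rough base R equivalent, sketched on toy data (made-up values, only the three weather columns):

```r
# Toy data, NOT the Kigali records.
toy <- data.frame(
  precip  = c(0.0, NA, 1.2, NA),
  tmp_min = c(15.1, NA, NA, NA),
  tmp_max = c(27.0, NA, NA, NA)
)

# 1 marks a missing cell, 0 an observed one; build one string per row.
pattern <- apply(is.na(toy) * 1L, 1, paste, collapse = ":")

# Count how often each missingness pattern occurs.
pattern_counts <- sort(table(pattern), decreasing = TRUE)
pattern_counts
```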

There are 91 occurrences where tmin, tmax, and precip are all missing, and another 62 days where only tmin and tmax are missing. Let’s investigate:

kigali_temp[is.na(kigali_temp$precip) & is.na(kigali_temp$tmp_min),]
## # A tibble: 91 x 8
##    date        year month day_in_month   doy precip tmp_min tmp_max
##    <date>     <dbl> <chr>        <dbl> <dbl>  <dbl>   <dbl>   <dbl>
##  1 1994-05-19  1994 May             19   140     NA      NA      NA
##  2 1994-05-20  1994 May             20   141     NA      NA      NA
##  3 1994-05-21  1994 May             21   142     NA      NA      NA
##  4 1994-05-22  1994 May             22   143     NA      NA      NA
##  5 1994-05-23  1994 May             23   144     NA      NA      NA
##  6 1994-05-24  1994 May             24   145     NA      NA      NA
##  7 1994-05-25  1994 May             25   146     NA      NA      NA
##  8 1994-05-26  1994 May             26   147     NA      NA      NA
##  9 1994-05-27  1994 May             27   148     NA      NA      NA
## 10 1994-05-28  1994 May             28   149     NA      NA      NA
## # ... with 81 more rows

The days on which precipitation, tmin, and tmax are all missing span from 19/05/1994 to 18/08/1994. The Tutsi genocide began on April 7, 1994 and ended in July 1994, so I don’t think anybody had the time to record weather data around this time.

kigali_temp[is.na(kigali_temp$tmp_min) & is.na(kigali_temp$tmp_max) & !is.na(kigali_temp$precip),]
## # A tibble: 62 x 8
##    date        year month day_in_month   doy precip tmp_min tmp_max
##    <date>     <dbl> <chr>        <dbl> <dbl>  <dbl>   <dbl>   <dbl>
##  1 1989-06-06  1989 Jun              6   158    0.4      NA      NA
##  2 1994-03-01  1994 Mar              1    61    1.1      NA      NA
##  3 1994-03-02  1994 Mar              2    62    0.9      NA      NA
##  4 1994-03-03  1994 Mar              3    63   37.8      NA      NA
##  5 1994-03-04  1994 Mar              4    64    1        NA      NA
##  6 1994-03-05  1994 Mar              5    65   32.2      NA      NA
##  7 1994-03-06  1994 Mar              6    66    0        NA      NA
##  8 1994-03-07  1994 Mar              7    67    0        NA      NA
##  9 1994-03-08  1994 Mar              8    68   30.7      NA      NA
## 10 1994-03-09  1994 Mar              9    69    0        NA      NA
## # ... with 52 more rows

The same scenario seems to apply to the days where only tmin and tmax are missing. They span from 01/03/1994 to 30/04/1994, plus one isolated missing day on 06/06/1989.
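Rather than reading the spans off the printed rows, the missing days can be grouped into runs of consecutive dates. A minimal sketch follows, where `consecutive_runs` is a hypothetical helper and the dates are a small sample in the spirit of the output above:

```r
# Hypothetical helper: group dates into runs of consecutive days.
consecutive_runs <- function(dates) {
  dates <- sort(unique(dates))
  gaps <- diff(as.numeric(dates)) != 1   # TRUE where a run breaks
  is_start <- c(TRUE, gaps)
  is_end <- c(gaps, TRUE)
  data.frame(
    start = dates[is_start],
    end = dates[is_end],
    days = as.integer(dates[is_end] - dates[is_start]) + 1L
  )
}

# Small illustrative sample of days with missing temperatures.
missing_days <- as.Date(c("1989-06-06", "1994-03-01", "1994-03-02", "1994-03-03"))
runs <- consecutive_runs(missing_days)
runs
```

On the real data this could be called as `consecutive_runs(kigali_temp$date[is.na(kigali_temp$tmp_min)])`, assuming `date` has already been converted to class Date.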

There is a small window from 01/05/1994 to 18/05/1994 where they resumed recording the weather.

One surprising thing is that they stopped recording temperatures in March but didn’t stop recording precipitation until mid-May. Maybe that’s because knowing the precipitation is more important than knowing the temperature when dealing with air traffic.

As a conclusion, I think we should probably leave the year 1994 out of the modelling phase.
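Dropping the year is a one-liner. Here is a sketch on toy data (made-up values; `model_data` is just an illustrative name):

```r
# Toy data, NOT the Kigali records.
toy <- data.frame(
  year    = c(1993, 1994, 1994, 1995),
  tmp_max = c(26.1, NA, NA, 27.3)
)

# Keep every row except those from 1994.
model_data <- toy[toy$year != 1994, ]
model_data
```

Whether to drop the whole year or only the months with gaps is a judgment call; this only shows the simplest option.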

Data visualization

This graph indicates that there has been an increase in hot days over the years.
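For anyone who wants to reproduce this kind of count: a hot day has to be defined first. The sketch below assumes a hot day is one where tmp_max exceeds a threshold, and the 30 °C default is an arbitrary value chosen for illustration, not necessarily the one behind the graph. `hot_days_per_year` is a hypothetical helper and the data are made up.

```r
# Hypothetical helper: number of days per year with tmp_max above a threshold.
hot_days_per_year <- function(df, threshold = 30) {
  # Days with a missing tmp_max are not counted as hot.
  hot <- !is.na(df$tmp_max) & df$tmp_max > threshold
  sapply(split(hot, df$year), sum)
}

# Toy data, NOT the Kigali records.
toy <- data.frame(
  year    = c(1971, 1971, 2018, 2018, 2018),
  tmp_max = c(29.5, 30.2, 31.0, NA, 30.4)
)

counts <- hot_days_per_year(toy)
counts
```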

There’s no real evidence in the graph of a change in the rainfall pattern, except maybe in February, where the amount has been low.

There’s a slight decrease in January and an increase in February, which might explain the recent flooding. Also, July has been extremely dry.
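Month-by-month comparisons like these rest on rainfall totals per year and month. A minimal sketch of that aggregation on toy data (made-up values; the plotting step is left out):

```r
# Toy data, NOT the Kigali records.
toy <- data.frame(
  year   = c(2017, 2017, 2017, 2018, 2018),
  month  = c("Jan", "Jan", "Feb", "Jan", "Feb"),
  precip = c(1.0, NA, 2.5, 4.0, 0.5)
)

# Total rainfall per year and month; rows with missing precip are dropped
# by the formula interface before summing.
monthly_total <- aggregate(precip ~ year + month, data = toy, FUN = sum)
monthly_total
```

One caveat: a month with many missing days gets an artificially low total, which matters for 1994 in particular.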

Kigali temperatures

## Warning: Removed 158 rows containing non-finite values (stat_boxplot).

The weather in Kigali can be categorized into four seasons:

Short rainy season: October - December.
Short dry season: January - February.
Long rainy season: March - mid-June.
Long sunny season: Mid-June - September.
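These seasons can be turned into a label for each date, which is handy for grouping later. A minimal sketch, where `season_of` is a hypothetical helper and "mid-June" is assumed to mean 15 June:

```r
# Hypothetical helper: map a Date vector to the four seasons listed above.
season_of <- function(date) {
  m <- as.integer(format(date, "%m"))
  d <- as.integer(format(date, "%d"))
  # Assumption: the long rainy season ends on 15 June.
  ifelse(m >= 10, "short rainy",
  ifelse(m <= 2, "short dry",
  ifelse(m < 6 | (m == 6 & d <= 15), "long rainy", "long sunny")))
}

# The two heavy rainfall days from the outlier check.
seasons <- season_of(as.Date(c("1982-04-21", "2001-07-22")))
seasons
```

With this labelling, the 91.9 mm day falls in the long rainy season and the 106 mm day in the long sunny season, which matches the earlier reasoning about the two outliers.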