In the following report I plan to walk through the numerous problems inherent in both the New York Times GitHub repository (https://github.com/nytimes/covid-19-data) and the Johns Hopkins GitHub repository (https://github.com/CSSEGISandData/COVID-19). These problems have persisted since the onset of the COVID-19 pandemic, when both repositories were established. I want to make clear from the outset that this is not at all a critique of what either The New York Times or Johns Hopkins has done in an effort to provide a service to the scientific community. They have done an incredible job pulling together the many disparate datasets from across the U.S., and it is quite an achievement. Unfortunately, those who have dedicated their time and effort to maintaining the repositories cannot do everything; some of the work of improving the accuracy of these data must fall on the shoulders of others, so that those in the scientific, public health, public policy, and media communities can rely on the most accurate data possible. The intent of these reports is to inform, correct, and make suggestions about how to resolve some of these artifacts.
Discussing the problems inherent in developing models and forecasts from less-than-optimal data, and in evaluating their potential accuracy, is beyond the scope of this series of reports. Importantly, some of these challenges are covered extensively in many available texts. For a gentle introduction, I highly recommend Hyndman and Athanasopoulos' Forecasting: Principles and Practice, which is freely available (https://otexts.com/fpp2/); have a peek at Chapter 12, 'Some practical forecasting issues'. For what is, in my opinion, one of the best papers on the anomalous processes one must consider in relation to one's data and model, and on precisely why these problems must be dealt with properly before actually trying to model, see this oldie but goodie by Tsay (https://onlinelibrary.wiley.com/doi/abs/10.1002/for.3980070102).
I have read a great number of papers (https://www.pnas.org/content/117/29/16732), blogs (https://blog.rstudio.com/2020/12/23/exploring-us-covid-19-cases/), and even some of the code from the Kaggle COVID-19 forecasting competition (https://www.kaggle.com/c/covid19-global-forecasting-week-5/code), but I have not seen these very problematic recording artifacts dealt with explicitly anywhere. Admittedly, I have not read everything and may have missed this being addressed somewhere. Still, we are now more than a year into a global pandemic, and the two main repositories being used for decision-making by scientists, public health officials, policy-makers, and top-tier scientific publications across the U.S. appear not to address this problem explicitly; that is, for lack of a better term, disconcerting. Again, I want to reiterate that this is not in any way meant as an attack on either The New York Times or Johns Hopkins, who have done remarkable work in an effort to assist the scientific community. They are to be applauded for their contributions.
Over the next week (if I manage to find the time), I will be showing each of the problems I came across as I began developing a methodology to answer the very serious question surrounding the impact of fan-attended National Football League (NFL) games, now in The Lancet's preprint repository, which can be found here: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3805754. I will begin with the simplest and arguably most understandable problem, which pertains to holidays. Keep in mind that while I am selecting counties I came across for the above-mentioned paper, further exploratory analysis has shown these problems to be widespread, at least within the U.S.-specific repositories.
Here we will have a quick look at Hamilton County, Ohio, home of the NFL's Cincinnati Bengals. The figure below shows a low/high-pass filter (±95% CIs) based on a 7-day simple moving average, plotted against the observed (actual) recorded data for this county over the entire period of the pandemic (thus far). I have colored the actuals for regular days (blue), holidays that were not properly recorded (red), and the day following these holidays (green), which should make the emerging pattern easy to see.
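For readers who want to reproduce this kind of view, below is a minimal sketch in R of one plausible construction of such a filter: a centered 7-day simple moving average with approximate ±95% bands built from the rolling standard deviation. This is an illustrative assumption, not necessarily the exact filter behind the figures, and the function and column names are my own.

```r
library(zoo)

# One plausible construction of the filter described above: a centered 7-day
# simple moving average with approximate ±95% bands from the rolling standard
# deviation. (An assumption for illustration, not necessarily the filter used
# for the figures in this report.)
add_sma_bands <- function(new_cases, k = 7) {
  sma <- rollmean(new_cases, k, fill = NA, align = "center")
  sd_k <- rollapply(new_cases, k, sd, fill = NA, align = "center")
  half_width <- 1.96 * sd_k / sqrt(k)
  data.frame(sma = sma, lower = sma - half_width, upper = sma + half_width)
}
```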
The pattern here is really clear: Thanksgiving, Christmas, and New Year's were all days where 0 cases were recorded. Any data scientist or statistician would (hopefully!) immediately recognize this as a problem: first taking a close look at the underlying data to see when these 0s occurred, and then coming up with a reasonable "solution" for how to remedy them. There are many situations where real 0s occur in the underlying process you are trying to understand/model, but this is not one of them. Making this problem more pernicious is the fact that the day immediately following each of these holidays is roughly twice the volume of the day prior to the holiday. What makes dealing with this challenging is that there are no existing imputation, or more specifically interpolation, methods (or R libraries, at least currently!) that can properly address such a recording artifact directly, so it must be resolved programmatically. Of course, bear in mind that this is a conditional problem. Put differently, not all counties across the U.S. experienced the identical recording artifact, so your approach must only correct the counties where this was a problem; a detection sketch follows below.
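Since the correction must be conditional, the first step is simply identifying which counties recorded a 0 on a holiday at all. Here is a minimal sketch in R, assuming the NYT county file (where `cases` is cumulative, so daily counts have to be differenced first); the `holidays` vector and the function name are my own assumptions.

```r
library(dplyr)

# Holidays to check; extend as needed (this vector is my assumption, not
# something defined in either repository).
holidays <- as.Date(c("2020-07-04", "2020-11-26", "2020-12-25", "2021-01-01"))

# Flag counties whose daily new-case series is exactly 0 on a holiday.
# The NYT `cases` column is cumulative, so difference it to get daily counts.
flag_holiday_zero_counties <- function(nyt) {
  nyt %>%
    group_by(fips) %>%
    arrange(date, .by_group = TRUE) %>%
    mutate(new_cases = cases - lag(cases, default = 0)) %>%
    ungroup() %>%
    filter(date %in% holidays, new_cases == 0) %>%
    distinct(fips, state, county)
}
```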
I strongly suspect that one reason some of these artifacts have not been addressed is that the scientist, researcher, or public health official is often examining State-level data (perhaps even cumulative distributions of these data). The problem with doing so without first examining the counties that make up the State-level series is that the aggregation obscures these artifacts. What I mean is that if you add up all the counties in a State, the variation inherent in the recording of COVID-19 cases washes out your ability to visualize the problem: the 0 that appears on a holiday in one county is not a 0 in other counties that continued to record properly on a daily basis throughout the pandemic. The result is an aggregated State-level time series that hides the artifact from view. This is why exploratory data analysis is so important to the modeling process. If you have not teased apart your data to examine it for possible problems, then in my estimation you have missed the most critical step in the entire process. I realize this is more work. Taking your data apart, identifying problems, correcting them, and putting it all back together can be an arduous process; without doing it, however, you will ultimately end up with a misspecified model, or worse!
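To see why the State-level view hides the problem, consider what aggregation does to the county zeros. A short sketch, assuming a hypothetical per-county frame `county_series` with the `new_cases` column from the previous snippet:

```r
library(dplyr)

# Summing counties to the State level: the 0 recorded by one county on a
# holiday is swamped by the counties that kept reporting, so the artifact
# all but disappears from the aggregated series.
state_series <- county_series %>%
  group_by(state, date) %>%
  summarise(new_cases = sum(new_cases), .groups = "drop")
```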
For this particular problem one must develop an interpolation approach that reflects how the artifact became nested in the data in the first place. In this case, it is fairly obvious that the county official responsible for recording COVID-19 cases took a day off (or a few!). Cases then built up across two days, and rather than being forward filled (which may not even have been possible!), the two days' data were lumped into the day after the holiday. A decent bit of code is necessary to flag all U.S. holidays with 0 counts, fill each holiday with half of the following day's recorded count, and leave the remaining half in place; a sketch follows below. You will note how both the holiday and the day after it have now been cleaned for purpose: the 7-day simple moving average is more accurately reflective of the underlying process, and the associated filter (±95% CIs) has less volatility than it did before being corrected with a conditional holiday interpolation algorithm developed explicitly to address this particular artifact.
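Below is a minimal sketch of that conditional interpolation, assuming a per-county data frame with `date` and daily `new_cases` columns (the column and function names are my own). Integer division keeps the two corrected days summing to the originally recorded total.

```r
library(dplyr)

# Split a lumped post-holiday count across the holiday and the day after.
# A holiday qualifies only if it was recorded as 0 and the next day has a
# positive count, so counties that recorded properly are left untouched.
fix_holiday_zeros <- function(df, holidays) {
  df %>%
    arrange(date) %>%
    mutate(
      nxt = lead(new_cases),
      artifact = date %in% holidays & new_cases == 0 &
        !is.na(nxt) & nxt > 0,
      new_cases = case_when(
        artifact                       ~ nxt %/% 2,          # holiday gets half
        lag(artifact, default = FALSE) ~ new_cases - new_cases %/% 2,  # day after keeps the rest
        TRUE                           ~ new_cases
      )
    ) %>%
    select(-nxt, -artifact)
}
```

Running this per county (e.g., inside `group_by(fips)`) only changes rows matching the artifact pattern, which is what makes the correction conditional.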
The purpose of this first example was to gently introduce you to the problems inherent within the U.S.-specific COVID-19 repositories that are currently being utilized, well, everywhere! In fact, if you Google this exact county in Ohio right now to check its COVID-19 cases, you will see this artifact very clearly in the interactive plot (Google is utilizing the NY Times GitHub repo). Come on Google!
I was hopeful that someone who read the first iteration of this report would have caught one of the other errors that I had left in this particular series. Unfortunately, no one has zinged me, which of course may be a function of the very small number of people who are interested in or have come across this first report, or perhaps no one has actually had the time to read through what I am attempting to illustrate. My hope is that those in the data science community who have expressed an interest in public health, and in helping to more accurately forecast both incidence and mortality during the pandemic, will find this series of reports useful.
This next artifact is perhaps less obvious, as there is only a single instance of it in this particular series. Some readers may be thinking that having two inaccurate recordings is not so problematic; I ask you to reserve judgment about how problematic all of these artifacts truly are until the end of these reports. As before, I have chosen to highlight the artifact by coloring it red and green, with the rest of the actuals in blue. This simply makes it easier to visualize and have a closer look in the figure. In this instance we have an artifact that occurs almost right before Independence Day. Of course, in 2020 the 4th of July fell on a Saturday, which meant many employers likely gave the Friday prior to the holiday off. In this case, however, our artifact fell one day sooner. There are all kinds of reasons this may have happened. Perhaps the person responsible for recording for this county called out sick to get an extra-long holiday. This person likely inherited new responsibilities with the pandemic and had, up until July 2nd, done what appears to be an admirable job recording for the county. Have a look for yourself at the artifact below.
As with the holiday example above, the day that follows July 2nd is roughly twice the magnitude of the day before it, so a similar interpolation approach can be adopted: take half of the case count recorded for July 3rd and use it to fill July 2nd. Have a look at the corrected series now, and at where these corrected case values sit relative to the counts immediately before and after them. One way to find such non-holiday artifacts automatically is sketched below.
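Non-holiday gaps like this one can be found the same way: scan each county's series for a day recorded as 0 whose following day sits well above the local level. A sketch, where the 7-day rolling median and the 1.5x threshold are my own assumptions to tune:

```r
library(dplyr)
library(zoo)

# Flag any day recorded as 0 whose following day sits well above the local
# 7-day rolling median. Dates returned here can be handed to the
# fix_holiday_zeros() sketch above to split the lumped count.
find_zero_spikes <- function(df, threshold = 1.5) {
  df %>%
    arrange(date) %>%
    mutate(
      med7 = rollmedian(new_cases, 7, fill = NA, align = "center"),
      nxt  = lead(new_cases)
    ) %>%
    filter(new_cases == 0, !is.na(nxt), !is.na(med7),
           nxt > threshold * med7) %>%
    pull(date)
}
```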
All data scientists and statisticians prefer to have ground truth whenever possible, not only because it is helpful for training, cross-validation, and hyperparameter tuning, but because it enables us to effectively evaluate model performance. Without accurate actuals, when it comes time to evaluate performance you will effectively be measuring how a model performs relative to something that does not reflect the underlying process. It does not matter whether you are using RMSE, MAE, MAPE, MASE, or whatever KPI you feel is most suitable; no KPI in this scenario will properly capture how well your model has performed, because you simply do not have the ground truth. So what does the ground truth look like now that I have corrected for the various artifacts covered in this first report? Let's visualize it!
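As a toy illustration of the point, the same forecast can score very differently against uncorrected versus corrected actuals. The three vectors below are hypothetical and assumed to be aligned by date:

```r
# Simple error metrics; scoring against artifact-laden actuals misstates how
# well a model tracked the real process. `raw_actuals`, `clean_actuals`, and
# `forecast_vals` are hypothetical aligned numeric vectors.
mae  <- function(actual, forecast) mean(abs(actual - forecast))
rmse <- function(actual, forecast) sqrt(mean((actual - forecast)^2))

mae(raw_actuals,   forecast_vals)  # inflated by the 0/spike artifact days
mae(clean_actuals, forecast_vals)  # closer to true performance
```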