Examination of the Statistical Accuracy of COVID-19 Models
Forecasting models have been influential in shaping decision-making in the COVID-19 pandemic. However, there is concern that their predictions may have been misleading. Here, we dissect the predictions made by four models for the daily COVID-19 death counts between March 25 and June 5 in New York state, as well as the predictions of ICU bed utilisation made by the influential IHME model. We evaluated the accuracy of both the point estimates and the uncertainty estimates of the model predictions. First, we compared the “ground truth” data sources on daily deaths against which these models were trained. The models used three different data sources, which differed substantially in their recorded daily death counts. Two additional data sources that we examined also gave different death counts per day. In terms of predictive accuracy, all models fared very poorly: only 10.2% of the predictions fell within 10% of their training ground truth, irrespective of distance into the future. In terms of uncertainty assessment, only one model matched the nominal 95% coverage relatively well, but that model did not start predictions until April 16 and thus had no impact on early, major decisions. For ICU bed utilisation, the IHME model was highly inaccurate; the point estimates only started to match ground truth after the pandemic wave had started to wane. We conclude that trustworthy models require trustworthy input data to be trained upon. Moreover, models need to be subjected to prespecified real-time performance tests before their results are provided to policy makers and public health officials.
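The two accuracy measures named in this abstract (the share of point predictions within 10% of ground truth, and the empirical coverage of 95% prediction intervals) can be illustrated with a minimal Python sketch. The function names, the choice to skip days with zero observed deaths, and the toy numbers are assumptions made for illustration only; they are not the authors' code or data.

```python
def within_tolerance_share(predicted, observed, tol=0.10):
    """Share of predictions whose relative error is at most `tol`."""
    hits = 0
    total = 0
    for p, o in zip(predicted, observed):
        if o == 0:
            # Relative error is undefined here; skipping is an assumption.
            continue
        total += 1
        if abs(p - o) <= tol * abs(o):
            hits += 1
    return hits / total if total else float("nan")


def interval_coverage(lower, upper, observed):
    """Share of observed values lying inside [lower, upper] (inclusive)."""
    if not observed:
        return float("nan")
    inside = sum(1 for lo, hi, o in zip(lower, upper, observed) if lo <= o <= hi)
    return inside / len(observed)


# Invented toy values, not taken from any of the evaluated models.
predicted = [100, 120, 90, 200]
observed = [105, 100, 92, 150]
lower = [80, 100, 85, 160]
upper = [120, 140, 95, 240]

print(within_tolerance_share(predicted, observed))  # 2 of 4 predictions within 10%
print(interval_coverage(lower, upper, observed))    # 3 of 4 observations covered
```

A well-calibrated 95% interval would give a coverage close to 0.95 over many predictions; a value far below that suggests the stated uncertainty is too narrow.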
Professor Sally Cripps, firstname.lastname@example.org, +61 425-276-967
*This paper has been accepted by the European Journal of Epidemiology.
Professor Sally Cripps
Forecasting models have been influential in shaping decision-making in the COVID-19 pandemic in two main areas. The first is the use of prediction models for decisions about resource allocation, and the second is the use of models for decisions about the impact of non-pharmaceutical interventions (NPIs). However, there is concern that predictions and inference from these models may have been misleading. In this talk I will discuss findings regarding the accuracy of four prediction models for daily death counts in the state of New York in the early stages of the pandemic, as well as inference on the impact of NPIs on the effective reproduction number for COVID-19 in various European countries. Our conclusions are that models performed poorly in the prediction of daily deaths and need to be subjected to pre-specified real-time performance tests before their results are provided to policy makers and public health officials. In addition, we conclude that different trajectories of the effective reproduction number give rise to the same trajectory of daily deaths, calling into question the effectiveness of NPIs such as lockdown in reducing the spread of COVID-19.
[11 June 2020, International Institute of Forecasters]
John P.A. Ioannidis, Sally Cripps, Martin A. Tanner
Epidemic forecasting has a dubious track record, and its failures became more prominent with COVID-19. Poor data input, wrong modeling assumptions, high sensitivity of estimates, lack of incorporation of epidemiological features, poor past evidence on effects of available interventions, lack of transparency, errors, lack of determinacy, looking at only one or a few dimensions of the problem at hand, lack of expertise in crucial disciplines, groupthink and bandwagon effects, and selective reporting are some of the causes of these failures. Nevertheless, epidemic forecasting is unlikely to be abandoned. Some (but not all) of these problems can be fixed. Careful modeling of predictive distributions rather than focusing on point estimates, considering multiple dimensions of impact, and continuously reappraising models based on their validated performance may help. If extreme values are considered, extremes should be considered for the consequences of multiple dimensions of impact so as to continuously calibrate predictive insights and decision-making.
John P.A. Ioannidis, MD, DSc, Stanford Prevention Research Center, 1265 Welch Road, Medical School Office Building, Room X306, USA. E-mail: email@example.com
This paper provides a formal evaluation of the predictive performance of a model (and its various updates) developed by the Institute for Health Metrics and Evaluation (IHME) for predicting daily deaths attributed to COVID-19 for each state in the United States. The IHME models have received extensive attention in social and mass media, and have influenced policy makers at the highest levels of the United States government. For effective policy making, accurate assessment of uncertainty is as necessary as accurate point predictions, because the risks inherent in a decision must be taken into account, especially in the present setting of a novel disease affecting millions of lives. To assess the accuracy of the IHME models, we examine both the forecast accuracy and the predictive performance of the 95% prediction intervals (PIs) provided by the IHME models. We find that the initial IHME model substantially underestimates the uncertainty surrounding the number of daily deaths. Specifically, the true number of next-day deaths fell outside the IHME prediction intervals as much as 70% of the time, compared with the 5% expected under nominal 95% coverage. In addition, we note that the performance of the initial model does not improve with shorter forecast horizons. Regarding the updated models, our analyses indicate that the later models do not show any improvement in the accuracy of the point estimate predictions. In fact, there is some evidence that this accuracy has actually decreased relative to the initial models. Moreover, while the updated models show a larger percentage of states with actual values lying inside the 95% PIs, our analysis suggests that this may be attributed to the widening of the PIs. The width of these intervals calls into question the usefulness of the predictions for driving policy making and resource allocation.
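The kind of check described here, how often observed deaths fall outside a 95% prediction interval at each forecast horizon and how wide those intervals are, can be sketched in Python as follows. The record layout (horizon in days, lower bound, upper bound, observed value), the function name, and the numbers are hypothetical and chosen only for illustration; they do not reproduce the paper's analysis or any IHME output.

```python
from collections import defaultdict


def miss_rate_by_horizon(records):
    """Summarise interval misses and mean interval width per forecast horizon.

    `records` is an iterable of (horizon_days, lower, upper, observed) tuples;
    this layout is an assumption made for the sketch.
    """
    # Per horizon: [number of misses, number of forecasts, sum of widths]
    counts = defaultdict(lambda: [0, 0, 0.0])
    for horizon, lower, upper, observed in records:
        c = counts[horizon]
        c[0] += not (lower <= observed <= upper)  # True counts as 1
        c[1] += 1
        c[2] += upper - lower
    return {
        h: {"miss_rate": m / n, "mean_width": w / n}
        for h, (m, n, w) in sorted(counts.items())
    }


# Invented toy records, not real forecasts.
records = [
    (1, 10, 20, 25),   # 1-day-ahead, observed value above the interval
    (1, 10, 20, 15),   # 1-day-ahead, covered
    (2, 0, 100, 50),   # 2-day-ahead, covered by a much wider interval
    (2, 0, 100, 90),
]

for horizon, summary in miss_rate_by_horizon(records).items():
    print(horizon, summary)
```

For a calibrated 95% interval the miss rate should sit near 0.05 at every horizon. Reporting the mean width alongside it matters because, as the abstract notes, a lower miss rate can be obtained simply by widening the intervals, which makes them less useful for decisions.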