AI4D Predict the Global Spread of COVID-19 challenge


The epic world tour of COVID-19 has reached South Africa. Fortunately, the international data science community has reacted swiftly. Excellent people are tracking the epidemic and have already offered predictive models to assist policy makers and the rest of us in deciding how to contain the disease.

In this blog post, I prepare for the Zindi competition for predicting the global spread of COVID-19 disease.

How to predict the spread of COVID-19?

My approach is first to model the epidemic, and then to figure out a way to predict the spread.

For modelling epidemics, it seems that epidemiologists commonly use the SIR model. This model tracks three groups of people: the Susceptible, the Infected, and the Recovered. Some variants of the model also include an Exposed group, such as this SEIR model. The SIR epidemic model page from scipython has some useful code and a terser explanation of the model too.
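To make the three groups concrete, here is a minimal sketch of the basic SIR equations, stepped forward with a simple Euler scheme in plain Python. The function name `simulate_sir` and the values chosen for `beta` (infection rate) and `gamma` (recovery rate) are my own assumptions for illustration; they are not fitted to any COVID-19 data and this is not the scipython code.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, steps_per_day=10):
    """Integrate the basic SIR equations with a forward-Euler scheme.

    dS/dt = -beta * S * I / N
    dI/dt =  beta * S * I / N - gamma * I
    dR/dt =  gamma * I
    """
    dt = 1.0 / steps_per_day
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]  # one (S, I, R) tuple per day, starting at day 0
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


# Illustrative values only: beta and gamma are NOT fitted to real data.
history = simulate_sir(population=1_000_000, initial_infected=10,
                       beta=0.3, gamma=0.1, days=160)
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print(f"Infections peak on day {peak_day} "
      f"with {history[peak_day][1]:,.0f} people infected at once")
```

In this sketch, beta divided by gamma is the basic reproduction number, so the made-up values above correspond to each infected person infecting about three others early in the outbreak.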

The Zindi competition uses casualties (deaths) as the value to predict. Casualties are not explicitly included in the SIR model, but the model's output can be compared with crude mortality rates and the case fatality rate. All these Greek letters like betas and gammas are technical Chinese to me, so I'm going to have to get grokking.
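The two rates are easy to mix up, so here is a small sketch of how I understand them. The function names and all the numbers are made up for a hypothetical region; they are not taken from the competition data.

```python
def case_fatality_rate(deaths, confirmed_cases):
    """Share of confirmed cases that ended in death."""
    return deaths / confirmed_cases


def crude_mortality_rate(deaths, population):
    """Share of the whole population that died of the disease."""
    return deaths / population


# Made-up numbers for a hypothetical region, purely for illustration.
deaths, confirmed_cases, population = 30, 1_000, 2_000_000

cfr = case_fatality_rate(deaths, confirmed_cases)
cmr = crude_mortality_rate(deaths, population)
print(f"Case fatality rate:   {cfr:.2%}")
print(f"Crude mortality rate: {cmr:.4%}")
```

The case fatality rate depends heavily on how many cases are actually detected, which is one reason the figures reported for different regions can differ so much.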

Getting the cases and casualties data

Zindi's data is obtained from Johns Hopkins University. In particular, they offer the number of cases and the number of casualties per week.

Preliminary EDA

I have utilised the starter notebook to plot these two graphs:
Corona COVID-19 cases as of 16/03/2020
This is the global number of cases per region on 16/03/2020.

Corona COVID-19 casualties as of 16/03/2020
This is the global number of casualties per region on 16/03/2020.

Feature engineering

[Coronavirus: Why you must act now](https://medium.com/@tomaspueyo/coronavirus-act-today-or-people-will-die-f4d3d9cd99ca) contains some interesting ideas for a few more metrics I may use:

  1. The mortality rate, which is around 1%, but I may want to use a higher mortality rate of 3.5% (their terms, not mine).
  2. Time from infection to death, which is around 20 days.
  3. The amount of time it takes for cases to double, which is around 5 days.
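
As a rough sketch of how these three numbers could combine into features, the snippet below back-calculates the current number of true cases from today's deaths. The function names are my own, the defaults are the approximate figures listed above, and the whole thing assumes steady exponential growth, which is a strong simplification.

```python
import math


def estimate_current_cases(deaths_today, mortality_rate=0.01,
                           days_infection_to_death=20, doubling_time_days=5):
    """Back-of-the-envelope estimate of true cases today from deaths today."""
    # People dying today were infected roughly `days_infection_to_death` days ago,
    # so deaths / mortality rate approximates the number of cases back then.
    cases_back_then = deaths_today / mortality_rate
    # Since then, cases have doubled this many times (assuming steady growth).
    doublings_since = days_infection_to_death / doubling_time_days
    return cases_back_then * 2 ** doublings_since


def daily_growth_rate(doubling_time_days):
    """Continuous daily growth rate implied by a given doubling time."""
    return math.log(2) / doubling_time_days


print(f"Estimated cases per death today (1% mortality):   "
      f"{estimate_current_cases(1):,.0f}")
print(f"Estimated cases per death today (3.5% mortality): "
      f"{estimate_current_cases(1, mortality_rate=0.035):,.0f}")
print(f"Daily growth rate for a 5-day doubling time: {daily_growth_rate(5):.3f}")
```

With the 1% figure, one death today suggests about 100 cases twenty days ago and four doublings since, so roughly 1,600 cases now.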

Baseline modelling

For my baseline model, I will use Linear Regression.
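
Here is a minimal sketch of what such a baseline could look like, written in plain Python so it runs without extra libraries. Fitting the line to the logarithm of cumulative deaths is my own assumption (it treats early growth as exponential) and is not necessarily what the starter notebook does. The weekly numbers are made up for a hypothetical region, and `fit_line` and `predict` are hypothetical helper names.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up cumulative deaths for one hypothetical region, one value per week.
weeks = [0, 1, 2, 3, 4]
cumulative_deaths = [2, 5, 11, 24, 50]

# Fit a straight line to log(deaths), i.e. assume exponential growth.
slope, intercept = fit_line(weeks, [math.log(d) for d in cumulative_deaths])


def predict(week):
    """Predicted cumulative deaths for a given week number."""
    return math.exp(slope * week + intercept)


print(f"Weekly growth factor: {math.exp(slope):.2f}")
print(f"Predicted cumulative deaths in week 5: {predict(5):.0f}")
```

A library implementation such as scikit-learn's `LinearRegression` would give the same line; the point of the baseline is only to have a number that any later model must beat.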

Conclusion

In this post, I outlined a plan to tackle Zindi's Predict the Global Spread of COVID-19 challenge. The purpose of the competition is to predict the number of weekly casualties of the disease in each region. I found some references to teach myself the SIR model that is used to model epidemics. Then I discussed some potential metrics that may help the predictive model, and decided to use a Linear Regression model as a baseline.

Stay safe, everyone! Wash your hands!