Official data for the COVID epidemic in Italy are available at https://github.com/pcm-dpc. The number of known active cases over time shows a first wave of infections from March to June and a second, much larger one that started in October 2020 and is still ongoing:

By contrast, if we look at the daily number of COVID-related deaths over time, the two waves are very similar in size:

My first thought was that the mortality rate due to COVID was much higher during the first wave, perhaps because the health care system was caught off guard. Then I was reminded that, during the first wave of COVID infections, Italy was short of testing kits, so the known size of the first wave is likely an underestimate.

I wanted to estimate the real size of the first COVID wave in Italy based on the available data. Assuming that the daily number of deaths due to COVID was recorded accurately throughout the year, one can use it to estimate the number of active cases at any given time. One way to do this would be to build a model that accounts for the mortality rate and the time it takes for the illness to run its course. Under such a model, if the mortality rate was 2% and the illness took one week to end in either recovery or death, 100 deaths on day *x* would mean that 5000 people contracted the illness on day *x-7*. To derive the number of active cases on any given day, one would have to add the number of patients newly infected on that day to the number of patients who contracted the illness on previous days and have not yet recovered or died. The challenges of such a model are many, including, but not limited to:

- uncertainty regarding the exact mortality rate due to COVID, which can also vary over time and can be affected by factors such as patients’ age;
- variability in the duration of the illness.
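The back-calculation described above can be sketched in a few lines of Python. This is only an illustration under the simplifying assumptions from the example (a fixed 2% mortality rate and a fixed 7-day illness); the function names and the toy death series are hypothetical and not part of the original analysis.

```python
def estimate_new_infections(daily_deaths, mortality_rate=0.02, illness_days=7):
    """Back-calculate new infections from deaths.

    Deaths on day x imply deaths / mortality_rate infections on day
    x - illness_days, so element i of the result is the estimated number
    of infections on day i. The last `illness_days` days of the input
    period cannot be estimated and are not returned.
    """
    if not 0 < mortality_rate <= 1:
        raise ValueError("mortality_rate must be in (0, 1]")
    if illness_days < 1:
        raise ValueError("illness_days must be at least 1")
    return [deaths / mortality_rate for deaths in daily_deaths[illness_days:]]


def estimate_active_cases(new_infections, illness_days=7):
    """Sum the infections that are still unresolved on each day.

    A patient infected on day i is counted as active on days
    i .. i + illness_days - 1. Infections before day 0 are unknown,
    so the first few values are underestimates.
    """
    active = []
    for day in range(len(new_infections)):
        start = max(0, day - illness_days + 1)
        active.append(sum(new_infections[start:day + 1]))
    return active


# Toy series (made up): 100 deaths on day 7, none on the other days.
daily_deaths = [0] * 7 + [100] + [0] * 7
new_infections = estimate_new_infections(daily_deaths)
active_cases = estimate_active_cases(new_infections)
print(new_infections[0])  # estimated infections on day 0
print(active_cases)       # estimated active cases on days 0..7
```

With real data, both parameters would have to vary over time and across age groups, which is exactly where the challenges listed above come in.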

A less refined but simpler way of estimating the number of active COVID cases during the first wave of infections is to quantify the relationship between the number of known active cases and the daily number of deaths due to COVID at a time when testing was reliable and widespread. This relationship can then be used to estimate the number of active cases during the first wave based on the number of recorded deaths due to COVID.
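The calibrate-then-apply idea can be illustrated with a minimal Python sketch. Note that it swaps in an ordinary least-squares straight line as the simplest possible stand-in for the fitted curve, and that all numbers are made up for illustration; none of them come from the official data.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_active_cases(daily_deaths, slope, intercept):
    """Apply the calibrated line to a series of daily deaths."""
    return [max(0.0, slope * deaths + intercept) for deaths in daily_deaths]


# Calibration period with reliable testing (made-up numbers).
calib_deaths = [10, 50, 100, 200, 400]
calib_active = [12000, 52000, 102000, 202000, 402000]
slope, intercept = fit_line(calib_deaths, calib_active)

# Apply the calibration to first-wave death counts (also made up).
first_wave_deaths = [100, 300]
estimated_active = predict_active_cases(first_wave_deaths, slope, intercept)
print(slope, intercept)
print(estimated_active)
```

The approach stands or falls with the assumption that the relationship between deaths and active cases was the same in both waves.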

Between September 25, 2020 and January 2, 2021, the relationship between known active COVID cases and the daily number of COVID-related deaths looks like this:

The data in Figure 3 are well described by a logistic regression (red line in the figure) with the formula:

I can use this relationship to estimate the number of active cases throughout 2020:

The model underestimates the number of active cases on the days when the number of COVID-related deaths is highest, but otherwise it tracks the trend of the second wave fairly well. According to the model, the actual number of active COVID cases at the peak of the first wave was about seven times higher than recorded.
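A logistic curve flattens toward an upper asymptote, which may be one reason the estimates fall short on the days with the most deaths. The Python sketch below illustrates that behavior. It assumes that active cases are modeled as a three-parameter logistic function of daily deaths; the functional form, the direction of the fit, and all parameter values are illustrative placeholders, not the fitted values.

```python
import math


def logistic(deaths, upper, rate, midpoint):
    """Three-parameter logistic curve: approaches `upper` as deaths grow."""
    return upper / (1.0 + math.exp(-rate * (deaths - midpoint)))


# Placeholder parameters (hypothetical, chosen only for illustration).
UPPER = 800_000    # asymptote: largest number of active cases the curve can return
RATE = 0.01        # steepness of the curve
MIDPOINT = 400     # daily deaths at which the curve reaches UPPER / 2

for deaths in (100, 400, 800, 1600):
    estimate = logistic(deaths, UPPER, RATE, MIDPOINT)
    print(deaths, round(estimate))
```

Whatever the number of deaths, a curve of this shape cannot return more than its asymptote, so estimates near the peak are compressed.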

All data-handling, graphics and statistics were performed with the statistical software R.

**Other useful/interesting links:**