Intro
I thought it would be fun to turn the daily coronavirus predictions I've been making into a statistics class, complete with code.
There are two parts to this: (1) the details of the simple, or naive, model, how it is fit, and how its performance is measured, and (2) questions about the data. For the first section, I assume the data is what it is, without question.
The second section is for discussion about the data: is it real? Chinese lies? Is the rest of the world going to get sick? Or is everybody panicking?
This is a follow-up to the original coronavirus post.
Code
As always, the code is written for clarity not conciseness. You can download the data here. Warning! It’s up to you to update this; it’s current only as of 12 February, Wednesday night stateside, or Thursday morning, Wuhan, China time. Original data source here: I update my sheet from this every evening about 8 PM. If that’s wrong, I’m wrong.
library(ggplot2)

x = read.csv('corona.csv', na.strings='')
x$date = as.POSIXct(strptime(x$date, format="%m/%d/%Y"), tz='EST')
x$day = 1:dim(x)[1]   # nls() doesn't seem to work with dates
summary(x)
Note the tz='EST', which is dead necessary. Without it, when we later add to the days using as.Date(), the dates shift so that everything reads as one day early! The as.Date() is needed because R adds (days) to dates, not to POSIXct objects (ggplot requires POSIXct). I learned this the hard way. The other note is that the model code doesn't work with dates, hence the creation of the day number column.
# predict how far in advance?
days.ahead = 21

fit <- nls(actual.cases ~ SSlogis(day, phi1, phi2, phi3), data = x)
p.cases = predict(fit, data.frame(day=1:(dim(x)[1]+days.ahead)))

fit <- nls(actual.deaths ~ SSlogis(day, phi1, phi2, phi3), data = x)
p.dead = predict(fit, data.frame(day=1:(dim(x)[1]+days.ahead)))

l = length( 1:(dim(x)[1]+days.ahead) )
d = seq.Date(as.Date(x[1,1]), length = l, by='day')

x[1:l, 1] = d
x$p.cases = as.numeric(p.cases)
x$p.dead  = as.numeric(p.dead)
Three weeks, at this point in time, is reasonable. Play with this to see what it does. The nls() function computes "the nonlinear (weighted) least-squares estimates of the parameters of a nonlinear model." The nonlinear model here is the standard logistic function, i.e. the 'S' shape: slow ramp up, quickening mid-range, then gradual slow down. This model type is commonly used to model outbreaks.

You can fill in the mandatory phiX parameters with initial guesses. Leaving them without values forces the model to find its own estimates. Strangely, you can't leave them out.
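For instance, here's a sketch (the starting values are rough guesses of my own, not fitted numbers) showing both the self-starting call with explicit start values and the same logistic written out by hand; phi1 is the asymptote, phi2 the inflection day, phi3 the scale:

# Sketch: supplying your own starting values (the numbers are illustrative guesses)
fit.guess <- nls(actual.cases ~ SSlogis(day, phi1, phi2, phi3), data = x,
                 start = list(phi1 = 80000, phi2 = 15, phi3 = 4))

# The same model written out by hand, skipping the self-starting machinery
fit.manual <- nls(actual.cases ~ phi1 / (1 + exp((phi2 - day)/phi3)), data = x,
                  start = list(phi1 = 80000, phi2 = 15, phi3 = 4))
coef(fit.manual)

Either way you should land on the same fitted curve.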
There are other ways of adding the predictions to the data.frame x, but this is plain. I just add dates from today out to days.ahead, and then I add to x the predicted cases and deaths. The as.numeric() keeps only the predictions and discards the (not so interesting to us) phiX estimates. The predict() method does not, at this date, support prediction intervals! The help file gives details on this sad state. Meaning our model has no direct way to show uncertainty in the predictions. This is bad, but we can fix that below.

Now there are other packages and ways to fit logistic models to get prediction intervals. I did not do them, but you can, to compare. I didn't do them because I am lazy.
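One rough do-it-yourself possibility, which I sketch here but did not run or use below, is a residual bootstrap: resample the residuals, refit, and take quantiles of the refitted curves as a crude envelope. The refit can fail on some resamples, hence the try().

# Sketch of a residual bootstrap envelope for the cases curve (not run on the
# real file). Refit the cases model first, using only observed rows, since fit
# was overwritten above by the deaths model.
set.seed(42)
obs   <- x[!is.na(x$actual.cases), ]
fit.c <- nls(actual.cases ~ SSlogis(day, phi1, phi2, phi3), data = obs)

B      <- 200
newday <- 1:(dim(obs)[1] + days.ahead)
res    <- residuals(fit.c)
f.hat  <- fitted(fit.c)
boot.pred <- matrix(NA, nrow = B, ncol = length(newday))

for (b in 1:B) {
  ob <- obs
  ob$actual.cases <- f.hat + sample(res, replace = TRUE)
  fb <- try(nls(actual.cases ~ SSlogis(day, phi1, phi2, phi3), data = ob),
            silent = TRUE)   # the refit can fail to converge; skip those
  if (!inherits(fb, 'try-error'))
    boot.pred[b, ] <- predict(fb, data.frame(day = newday))
}

ok <- !is.na(boot.pred[, 1])   # drop resamples where the refit failed
# Crude 90% envelope around the predicted curve
apply(boot.pred[ok, ], 2, quantile, probs = c(0.05, 0.95))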
# Cases and deaths actual and predicted
g <- ggplot(x, aes(date, actual.cases)) +
  ylab('Bodies') + xlab("") +
  geom_line() +
  geom_line(aes(date, p.cases), x, col='black', linetype='dashed') +
  geom_line(aes(date, actual.deaths), x, col='red') +
  geom_line(aes(date, p.dead), x, col='red', linetype='dashed') +
  coord_trans(y = "log") +
  annotate('text', x=x$date[2], y = Inf,
           label = paste0('\n\n\nCases (black)\nDeaths (red)'), size=5, hjust = 0)
g

png('corona.png', width=900, height=500)
print(g)
dev.off()
ggplot2 does allow a legend, but it puts it outside the plot, which eats up screen space; hence my crude legend inside the plot using annotate(). Sometimes it's nice to see the raw and not logged plot; comment out the coord_trans line to get it.
Judging from this plot, the model does a reasonable job, and better once the data reporting settled down. Deaths are slightly under-predicted, which is investigated below.
# Mortality rate approximation
g <- ggplot(x, aes(date, actual.deaths/actual.cases)) +
  ylab('Mortality Rate') + xlab("") +
  geom_line()
g

png('corona2.png', width=900, height=500)
print(g)
dev.off()
This is of course only an approximation of the mortality rate, since we don't know what's going to happen to the current cases. It's still instructive: the mortality rate (or its approximation) is initially high because only the sickest early cases go to hospital; it rapidly falls as publicity and mandatory checks bring in milder cases; and it ticks up again at the end when the initial flurry is over and we're again left only with the sickest patients.
# New cases
g <- ggplot(x, aes(date, diff(c(1200, actual.cases)))) +
  geom_line(aes(date, diff(c(1200, p.cases))), x, col='black', linetype='dashed') +
  ggtitle('Daily New Reported Cases') + ylab('New Cases') + xlab("") +
  geom_line()
g

png('corona3.png', width=900, height=500)
print(g)
dev.off()
What a spike! That's because, as of today, by a great coincidence, they're putting even "clinically diagnosed cases" into the official total case numbers, at this site anyway. Other sites are confirming it. The breakdown of infected, suspected, mild, critical, etc. has been shifted. Quoting:
Hubei's health commission said in its daily statement that it had changed the diagnostic criteria used to confirm cases, effective Thursday, meaning that doctors have broader discretion to determine which patients are infected.
"From today on, we will include the number of clinically diagnosed cases into the number of confirmed cases so that patients could receive timely treatment," the health authority said in a statement, which did not provide further details about the new criteria.
The model obviously does a lousy job capturing this kind of systematic change. Indeed, even a more complex model would have a difficult time, unless it deliberately and in advance allowed for the possibility of such a change.
The only trick here is the diff(c(1200, actual.cases)), in which I MADE UP THE 1200, because I have no data from 20 January. This only affects the first plotted point. Feel free to change it.
Looksa notta so bad, eh? The predicted "end" date we can read off the chart as somewhere around the first week of March. Or if you don't want to pick it off the chart, you can look at this table. Mixed code and output:
tail(x)
                  date actual.cases actual.deaths day  p.cases   p.dead
45 2020-03-05 19:00:00           NA            NA  NA 78536.88 2916.419
46 2020-03-06 19:00:00           NA            NA  NA 78598.30 2927.004
47 2020-03-07 19:00:00           NA            NA  NA 78648.40 2935.942
48 2020-03-08 19:00:00           NA            NA  NA 78689.25 2943.481
49 2020-03-09 19:00:00           NA            NA  NA 78722.56 2949.835
50 2020-03-10 19:00:00           NA            NA  NA 78749.71 2955.186
The last line is thus a prediction of the total number of cases and dead. Or "end" date, which I've been defining as when new deaths fall to less than 1 per day. That is, it was until today's spike -- which should settle down by tomorrow.
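If you'd rather compute that "end" date than eyeball it, here's a one-liner sketch using the p.dead column built above; it finds the first day the predicted new deaths drop below 1, and returns NA if that never happens inside the prediction window (in which case push days.ahead out further).

# First predicted day with fewer than 1 new death: a rough "end" date
new.dead <- diff(x$p.dead)
x$date[which(new.dead < 1)[1] + 1]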
g <- ggplot(x, aes(date, p.cases - actual.cases)) +
  ggtitle('Model Error: Cases') + ylab('Predicted Cases - New Cases') + xlab("") +
  geom_line()
g

png('corona3b.png', width=900, height=500)
print(g)
dev.off()
Above 0, the model predicted too many cases; below, too few. However, this is only a current snapshot of the error. I mean, this is using all the data up to this point, making predictions of the past data, and using those to gauge the error. This will under-estimate the genuine, future prediction error.
Anyway, the mean departure (before the change in reporting) is

mean(abs(x$p.cases-x$actual.cases),na.rm=TRUE)

which is just under 500. Meaning we can take the one-day-ahead prediction as accurate to about +/- 500 cases. This will increase as the number of days ahead increases, and decrease as time goes on.
It would be more instructive to look at the actual prediction errors. Meaning, fit the model using data from 1 to i, predict point i + 1, then compare the actual i + 1 value with that prediction. We could get the +1 day error, +2 day, and so on, but given the limited data we can only go so far. I will do this, but I'm going to leave it as a homework problem for you.
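If you want a head start on that homework, here's a sketch of the expanding-window version for the +1 day error. I haven't run it against the current file; the early fits may refuse to converge, hence the try().

# Genuine one-day-ahead errors: fit on days 1..i, predict day i+1, compare
obs  <- x[!is.na(x$actual.cases), ]      # observed rows only
n    <- dim(obs)[1]
err1 <- rep(NA, n)
for (i in 10:(n - 1)) {                  # need a handful of points before nls settles
  fi <- try(nls(actual.cases ~ SSlogis(day, phi1, phi2, phi3),
                data = obs[1:i, ]), silent = TRUE)
  if (inherits(fi, 'try-error')) next
  err1[i + 1] <- predict(fi, data.frame(day = obs$day[i + 1])) -
                 obs$actual.cases[i + 1]
}
mean(abs(err1), na.rm = TRUE)            # honest +1 day prediction error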
# New deaths
g <- ggplot(x, aes(date, diff(c(35, actual.deaths))), col='red') +
  geom_line(aes(date, diff(c(35, p.dead))), x, col='red', linetype='dashed') +
  ggtitle('Daily New Reported Deaths') + ylab('New Deaths') + xlab("") +
  geom_line()
g

png('corona4.png', width=900, height=500)
print(g)
dev.off()
The model had been doing well until the spike -- earlier on Wednesday, the deaths were only 4 or 5 more than the day before. Then by 8 PM the change occurred. It goes to show how simple models just can't capture these kinds of systematic and abrupt changes in the nature of the data.
Again, I made up the first number, the 35. It only affects the first plotted point. Change it if you like.
This one was fitting well enough, though the naive model is not catching what's happening with the uptick. Now for the next plot.
# Deaths prediction error
g <- ggplot(x, aes(date, p.dead - actual.deaths)) +
  ggtitle('Model Error: Deaths') + ylab('Predicted Deaths - New Deaths') + xlab("") +
  geom_line()
g

png('corona4b.png', width=900, height=500)
print(g)
dev.off()
Clearly the deaths prediction is more variable, even before the spike. The mean departure is

mean(abs(x$p.dead-x$actual.deaths),na.rm=TRUE)

which is about 10, before the spike. So while this is worse, it's not terrible.
Gist: the naive model fits easily, does well enough in one-day-ahead predictions, and seems to capture most major features of the cases, and some of the deaths. But it just can't capture large systematic changes.
Interpretation
All probability, where all means all and all means without exception, is conditional on the assumptions used. Part of those assumptions are the data themselves. Thus these predictions assume the veracity of the data (and the model form etc.). If the data is wrong, the model is probably wrong. It doesn't follow the model is necessarily wrong, in the sense it could still make good predictions, where as always "good" depends on the decisions you make with the model.
One theory is that the data are more-or-less accurate, honestly reported, but of course with all the problems of reporting in situations as complex as a virus outbreak. There are probably errors, but not in any preferred direction.
A second theory is that China, or rather the Chinese government, long known to be willing to lie, is lying here. The data are false; the real numbers, rumor says, are one to two orders of magnitude higher. Look at all of the scary videos "smuggled" out, and so on.
The naive model used here is consistent with both theories. Indeed, it is consistent with any number of other theories, limited only by your imagination.
I've told this story before, but here's how I summarized it recently in an email to others:
An anecdote about the Chinese propensity to (let's call it) heightened reaction. I was in San Francisco Chinatown when Obama's Surgeon General warned that radiation from Japan's tsunami-breached Fukushima power plant would reach the States and that maybe people should buy iodine as treatment (remember that safe advice?).
I with my own eyes saw a run on boxes of Morton Salt. Women with armloads of salt, boxes grabbed off a truck, empty salt boxes outside of shops, a canister of salt rolling down Stockton. I'll never forget it.
Think of this: We can assume China has to report some numbers, even if they are lying about those numbers. Given that, what numbers can they report that will convince people, and more importantly convince experts, that the numbers are right? If they are reporting accurately, they can report the actual numbers.
But if they are lying, they might use a model like our naive one, well known to fit virus outbreaks. The naive model, or something like it with added "noise", can generate plausible looking data that could fool unwary statisticians and medicos. Or they might simply take the true numbers and divide them by a constant. Or things like last night's spike may indicate their bringing numbers in line with reality. Or see below about the flu.
And don't forget that all reported deaths but two have been in China. One in the Philippines (to a Chinese national, I believe) and one in Hong Kong.
If they are telling the truth, then we'll know in about two to three weeks. If people don't start dropping dead in the States, in Europe, and in other non-Chinese places, it will be an excellent indication the Chinese numbers were close or even accurate. Think: if the outbreak was as virulent as rumor has it, why hasn't it spread as rapidly or produced as many deaths outside China?
Some people have answers to that question, but they, too, are only guesses.
What about the flu? The CDC, as of 1 February (latest data), said this:
CDC estimates that so far this season there have been at least 22 million flu illnesses, 210,000 hospitalizations and 12,000 deaths from flu.
Lot of dead, no? And that's only in the States. China is about three times larger, so at a rough guess they've had ~36,000 deaths from the flu. How many of these deaths, and even cases, especially now with the new way of reporting, may have been wrongly ascribed to coronavirus, especially since both produce the pneumonia which is the real killer?
This is a distinct possibility.
Hubei authorities said the increases were because they had broadened their definition for infection to include people "clinically diagnosed" via lung imaging.
Up until now, they had been documenting cases using a more sophisticated laboratory test.
Health officials said they looked into past suspected cases and revised their diagnoses, suggesting older cases were also included in Thursday's numbers.
Not too long a shot to blame a flu pneumonia on coronavirus.
Addendum
Two papers of interest: (1) Clinical characteristics of 2019 novel coronavirus infection in China, and (2) Bat Coronaviruses in China (this was from last year!). Stop eating bats!
Hack
In order to deal with the new reporting method and marry it to the old data, here's a quick hack, which assumes the proportion of additional cases and deaths in the past would have been the same as yesterday's. Plots as above.

Run this code right after reading in the data, i.e. after summary(x). The stuff inside the if (FALSE) block, of course. This is a quick hack and has to be done better, but you get the idea.
# fudge factor loop
if (FALSE){  # quick hack!

  y = x[22,]
  x = x[-22,]

  days.ahead = 1

  fit <- nls(actual.cases ~ SSlogis(day, phi1, phi2, phi3), data = x)
  p.cases = predict(fit, data.frame(day=1:(dim(x)[1]+days.ahead)))

  fit <- nls(actual.deaths ~ SSlogis(day, phi1, phi2, phi3), data = x)
  p.dead = predict(fit, data.frame(day=1:(dim(x)[1]+days.ahead)))

  l = length(1:(dim(x)[1]+days.ahead))
  d = seq.Date(as.Date(x[1,1]), length = l, by='day')

  x[1:l, 1] = d
  x$p.cases = as.numeric(p.cases)
  x$p.dead  = as.numeric(p.dead)

  f.cases = y$actual.cases/p.cases[22]
  f.dead  = y$actual.deaths/p.dead[22]

  x$actual.cases  = x$actual.cases*f.cases
  x$actual.deaths = x$actual.deaths*f.dead

  x[22, ] = y
}
Fudge factors are 1.29 and 1.13 for cases and deaths.
Selected plots:
Looks much less scary!
tail(x)
                  date actual.cases actual.deaths day  p.cases   p.dead
38 2020-02-27 19:00:00           NA            NA  NA 69032.77 2092.952
39 2020-02-28 19:00:00           NA            NA  NA 69070.49 2101.701
40 2020-02-29 19:00:00           NA            NA  NA 69099.61 2108.914
41 2020-03-01 19:00:00           NA            NA  NA 69122.08 2114.852
42 2020-03-02 19:00:00           NA            NA  NA 69139.42 2119.736
43 2020-03-03 19:00:00           NA            NA  NA 69152.80 2123.749
Comments

Matt,
flu ammonia should be pneumonia, I’m guessing.
Also: that goes both ways….
Bill,
Another vicious typo inserted by my enemies!
I see you used plenty of parameters here despite writing against parameters many times. Can you use your predictive probability logical approach instead here to show how that would work?
Justin
Justin,
This is prediction. Those dashed lines, I mean.
Did you ever have a chance to review the class where I did dozens of examples of how to “integrate out” the parameters to form predictive distributions? Here we’re only doing it by eye, informally. Which is good enough for the purposes to which we are putting this model.
For homework, try doing it formally, and finding the predictive distributions.
Possibly the simple logistic is not the best "model"; Gompertz should do better. Actually, the "carrying capacity" in a flu (or other human) epidemic should decrease with time (not stay constant, as postulated in the simple logistic): doctors treat patients and save a lot; infected who recovered become (in many diseases) immunized and thus unavailable for infection; and (most of all) quarantine measures reduce enormously the available "infectable" persons.
“Not too long a shot to blame a flu pneumonia on coronavirus.”
Couldn't agree more; this is starting to look more like a global exercise to extol the virtues of totalitarianism. In the past the regular old garden-variety annual flu has killed sixty million and no one felt compelled to lock down tens of millions, much less board them up in their homes. It certainly has everyone's attention and is producing a phenomenal effect; I am prepared to admit this is completely wrong, but if the official numbers are correct this lends new meaning to the term 'tempest in a teapot'. We may be witnessing the greatest psychological crowd control methodology of recorded history.

just saying…
One programming note, might be more as I work through it. Thanks so much for posting your code.
You say:
R actually is very happy doing any kind of math operation on POSIXct or POSIXlt objects. But it does them in seconds, viz:
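Something along these lines, for instance (the dates are just for illustration, and secsperday is nothing fancier than 24*60*60):

secsperday = 24*60*60
startdate  = as.POSIXct("2020-01-22 00:00:00", tz="EST")
startdate + 1            # adds one second, not one day
startdate + secsperday   # adds a full day's worth of seconds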
w.
One other note about POSIX objects. Internally, they are stored as the number of seconds since the first day of 1970. We can calculate this using the internal variable "offset" in the POSIX objects. I first calculate the POSIX value of some random day:
Then I use that offset from the random day chosen:
Note that the origin is always in GMT unless specified otherwise as follows:
Best to you as always,
w.
Thanks, Willis! It was the seq.Date() that was giving grief.

Matt, try this:
> startdate=as.POSIXct("1984-02-01 00:00:00", tz="EST")
> zootime=startdate+c(1:20)*secsperday
> zootime
[1] “1984-02-02 EST” “1984-02-03 EST” “1984-02-04 EST” “1984-02-05 EST”
[5] “1984-02-06 EST” “1984-02-07 EST” “1984-02-08 EST” “1984-02-09 EST”
[9] “1984-02-10 EST” “1984-02-11 EST” “1984-02-12 EST” “1984-02-13 EST”
[13] “1984-02-14 EST” “1984-02-15 EST” “1984-02-16 EST” “1984-02-17 EST”
[17] “1984-02-18 EST” “1984-02-19 EST” “1984-02-20 EST” “1984-02-21 EST”
> thezoo=zoo(c(11:30),zootime)
> apply.weekly(thezoo,mean)
1984-02-05 1984-02-12 1984-02-19 1984-02-21
12.5 18.0 25.0 29.5
The function "apply.weekly" is part of the package "xts", which also has "apply.monthly", "apply.yearly", etc. They allow the application of functions by the week, month, year, and so on.
Best to you, fascinating post.
w.
Hardly understood a bit of this post, but oddly enough, enjoyed reading it. Maybe there is a statistician lurking undercover in me who will understand it all in Heaven!
God bless, C-Marie
—
Justin,
…
Did you ever have a chance to review the class where I did dozens of examples of how to “integrate out” the parameters to form predictive distributions?
…
For homework, try doing it formally, and finding the predictive distributions.
—
Yes, I see https://www.wmbriggs.com/post/24934/ for example, which while talking about supposed evils of parameters and simulation, in fact relies on them 100% to get results. I’m asking can you do it as you philosophically want to, without relying on parameters and simulation?
Justin
Evidently I suck at R because I can’t get the plots to display, even if I copy-paste the code chunks (maybe Briggs’ enemies have seeded the code with spelling or syntax errors). I’m getting different errors at nearly every step, but I don’t know enough about R’s syntax or ggplot2 to figure out how to fix them. Bummer. Guess I should read a book on R one of these days instead of attempting to hack my way to logical probability glory. . .
Sussibar,
Try it one line at a time. Then report your first error.
Hi Briggs. In the first chunk the interpreter stumbles over:
> x$date = as.POSIXct(strptime(x$date, format="%m/%d/%Y"), tz='EST')
Error in `$...

summary(x)
ï..date actual.cases actual.deaths day
01/22/2020: 1 Min. : 654 Min. : 25.0 Min. : 1.00
01/23/2020: 1 1st Qu.: 7373 1st Qu.: 160.5 1st Qu.: 6.75
01/24/2020: 1 Median :19006 Median : 394.0 Median :12.50
01/25/2020: 1 Mean :23879 Mean : 528.4 Mean :12.50
01/26/2020: 1 3rd Qu.:37972 3rd Qu.: 832.0 3rd Qu.:18.25
01/27/2020: 1 Max. :63859 Max. :1381.0 Max. :24.00
(Other) :18
Wow, and my post removed the actual text of the error. It should be:
Error in `$<-.data.frame`(`*tmp*`, date, value = numeric(0)) :
  replacement has 0 rows, data has 22
sussibar,
It looks like you’re pasting output. And not the code you used.
Try just this:
x = read.csv('corona.csv',na.strings='')
summary(x)
Mama always called me “special.”
x = read.csv('corona.csv', na.strings='')
summary(x)
works fine for me and gives me the output I accidentally pasted into my first comment. (In that former comment I was trying to say that summary(x) gives me output even when R gives me the error I posted in my latter comment, but somehow the formatting of the code screwed up what posted to the comments section that first time around.)
So I wasn't supposed to include the two lines in the first formatted chunk, "x$date = as.POSIXct(strptime(x$date,format="%m/%d/%Y"),tz='EST')" and "x$day = 1:dim(x)[1]"?

I thought those were some sort of object property assignment statements (but like I said, R is definitely not a language I've spent much time with).
Am I mistaken trying to whole-cloth copy-paste each formatted code chunk in the article and either execute each one sequentially in the console interpreter or pasting it all together in one file and trying to run it?
If I copy and execute line-by-line in the console I’m good until I get to:
> d = seq.Date(as.Date(x[1,1]), length = l, by='day')
and the interpreter chokes and yells at me:
Error in charToDate(x) :
character string is not in a standard unambiguous format
And calling to check the contents of d gives me “Error: object ‘d’ not found”, so for whatever reason d isn’t getting instantiated in memory.
I know this isn’t a workshop on R, and by this point you can probably tell my expertise with R is on par with a drunkard trying to safely stumble through a dark hallway littered with hot wheels. I think I can follow along with the article well enough by following your explanation and looking up functions and other syntactical structures that are new to me by using R’s help and Google.
sussibar,

Point of line by line is to just see what you're getting. I'm particularly interested in the summary(x) right after just reading the data in and nothing more.

It appears you're getting the error in the as.POSIXct line. The seq.Date() appears to think x[1,1] is some character string and not a date. It should be a POSIXct object from the previous line.

So this all leads me to wonder if the csv file isn't corrupted. Hence looking at the summaries to make sure all is right. Also just type x after reading in. You should see all the data. Maybe there's a problem there.

Thanks Briggs! Lo and behold, what does x show me?
The column headings were:
ï..date actual.cases actual.deaths
So obviously x$date didn’t exist, but x$ï..date did. Hence my problems. ROOKIE MISTAKE.
It turns out that when I saved your .csv locally in Excel with a CSV UTF-8 file encoding, Windows, logically, inserts a UTF-8 byte-order mark at the start, which R then tries to parse when you call read.csv (thus creating the garbage string "ï.."). The solution was to explicitly pass fileEncoding='UTF-8-BOM' as an arg when invoking read.csv.

Man I love Windows…
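That is, something along these lines:

x = read.csv('corona.csv', na.strings='', fileEncoding='UTF-8-BOM')
summary(x)   # the first column now comes through as "date", not "ï..date"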
It’s always the simple things, isn’t it? Sorry for wasting your time but thanks for all your help! I’m now getting plots to display with magnificient data in all their glory.
Thank you for this interesting tutorial! I am having problems with nls, having tried a number of different values for phi1, phi2, and phi3. Using some guidance from other websites, I set phi1 to 378856, phi2 to 31.5, and phi3 to 1. Regardless of using these values or other values, I receive the same error:
Error in nls(y ~ 1/(1 + exp((xmid - x)/scal)), data = xy, start = list(xmid = aux[[1L]], :
  step factor 0.000488281 reduced below 'minFactor' of 0.000976562
Might you have some suggestions as to what values of phi1, phi2, and phi3 to use? (I assume these values correspond to Asym, xmid, and scal?) Many thanks!