
Category: Class – Applied Statistics

December 7, 2018

Woke Nature

It has been deemed unscientific to say that a person’s genitals determine their sex. Genes, too, have been disallowed as scientific evidence. Nature says so, and they are the preeminent science authority in the world.

How long before we see a geneticist, with black eyes and swollen face, of his own “free will”, announce on the slave-control box that he denounces the unscientific belief in sexual reproduction?

Well, if neither genitals nor genes count, what does determine sex, or gender, then?

Does it matter? Once we disallow the truth, what difference does the lie that replaces it make? None, if the goal is getting people to accept the lie not because it’s a lie, but because it isn’t the truth.

We know that’s definitely the goal here when they say, “The idea that science can make definitive conclusions about a person’s sex or gender is fundamentally flawed.”

Now, brothers and sisters, I ask you how you came to be here today. Not to this site, which is the sure choice of all wise men and women. No: how did you yourself come to be? If you don’t know, ask your mother and father.

The answer they supply you with would, until last week, have been deemed a definite scientific conclusion about your parents’ sex.

No longer, bigots.

Nature, true to their new post-scientific selves, did not stop with calling sexual reproduction an affront to equality, and a denial of those who cannot use the “lavatory of their choice”. No, sir. They also want all scientific research teams to increase their sexual diversity.

Diversity in gender expression, which is to say, in variations of non-reproductive simulated sexual activities, “helps to produce stronger research”.

How?

You bigot. That’s how. (There’s a paper behind this, naturally, which, if I can read it through without retching, we’ll analyze on another day.)

They say “Scientists must also analyse how sex and gender shape the questions they ask and the methods they use to reach scientific insights”.

Borrowing from Wikipedia’s open physics problems page, is it possible to construct, in the mathematically rigorous framework of algebraic QFT, a theory in 4-dimensional spacetime that includes interactions and does not resort to perturbative methods, from a team of penisless scientists identifying as male?

What relationship is there between lesbians who occasionally like men and Standard Model bosons and the set of all gauge superpartners?

Can we say why gravity is such a weak force? Does it have anything to do with all men being potential rapists?

Can the acceptance of necrophilia as a sexual orientation answer whether neutrinos are Dirac or Majorana particles?

“The question is no longer what the benefits of diversity are, but how we can best support the potential benefits of diversity,” says one of the Perspective’s authors, Mathias Wullum Nielsen, who studies gender in science at Aarhus University in Denmark. For example, he says, newcomers to a scientific field often ask original research questions. And, although friction can arise from different perspectives and ideas in research teams, those same approaches can spark innovative endeavours.

The logical mistake, one common in our Current Year, is equating original with good. How adding those with gender dysphoria to a team of quantum chemists is supposed to increase the rate of good questions about reaction rates we are not told. We are never told. It is accepted as true. It is insisted upon.

Yet it’s a better bet that injecting mentally confused individuals onto research teams will only increase unique questions, not good ones.

Nature closes—the very last sentence!—with one purported benefit: “almost half of the proposals submitted to the Canadian Institutes of Health Research in 2011 included gender and sex analysis.”

Did you really think science could hold out forever against the forces of insanity and doom (to coin a phrase) that have gripped Western progressives?

November 27, 2018

How To Do Predictive Statistics: Part X: Survival Analysis

Query: now that I’m a video master, would people like videos of these lessons? I don’t see how they’d help much, but who knows. Let me know below or via email.

Next up is survival analysis, a.k.a. time-to-event analysis. We could treat times to events as regular numbers, and use regression, or even tobit regression, or the like, except for a twist. And that twist is called censoring.

Suppose we’re studying when people hand in their dinner pails for the final time after shooting them up with some new Ask-Your-Doctor-About drug. Some will, during our study, gladden the hearts of undertakers; others will frustratingly remain above ground. Yet we know they have an unbreakable appointment with the Lord. We know the keeling-over times of the dead, but only the so-far times of the living. These so-far times are said to be censored.

There are other kinds of censoring, but today all we want to do is this so-called “right” censoring. (Wrong censoring happens in your more mature Western democracies.)
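
Before the software, it helps to see what right-censored data look like on disk. Here is a made-up toy example (not the kidney data we use below, though that data uses the same 0/1 convention):

toy = data.frame(
  time     = c(8, 23, 15, 40),  # event time, or last time seen event-free
  censored = c(0, 0, 1, 1)      # 0 = event observed, 1 = right censored (a so-far time)
)
toy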

The first problem is finding useful software. Predictive methods are not yet so common that every package contains them. And the rare packages that do have them rely on MCMC-type methods, about which more below. But if you don’t recall why these creatures are not what most think, then you must review: this and this at the least, and this too for good measure.

We’ve been using rstanarm, and it has a method that sorta kinda doesn’t not work, called stan_jm, for joint longitudinal-survival models. Longitudinal models measure things over time, like time-series. There is a prediction method for this model, but it only produces predictions for the longitudinal part. There are some laborious workarounds, but our point here is not software per se, but understanding the philosophy of models. So we’ll leave it behind.

There is also spBayesSurv, which works, and which allows spatial processes to join the fun. The difficulty with it is that you have to work directly with design matrices, which aren’t especially hard to grasp, but again the code requirements will become a distraction for us.

So we’re going to use brms. The weakness here is resources. It is a memory hog (which is why I’ve been avoiding it up to now). It externally compiles models before running them. The first time I tried this for the model below, I had several other instances of R running, along with a video editor, and it locked up my system. Close extraneous programs before beginning.

Obligatory anti-MCMC (mini) rant. Since you reviewed, or you remembered the cautions, you recall MCMC doesn’t do what it says, not exactly. The problem with it is not that useful answers can’t be extracted from MCMC methods—of course they can. The problem is that people find it so useful I fear not enough thinking is going into finding better analytical approximations to complex integrals, solutions which would bypass all these fictional “random” simulations. For one, we could learn to embrace discrete finite models, which are exact, and not approximations as all continuous models are (and which everybody forgets because of the Deadly Sin of Reification). Bonus: discrete finite models don’t need integrals, thus don’t need MCMC. End of rant.

The Code


install.packages('brms', dependencies=TRUE)

library(survival)
library(brms)

We’ll use the built-in kidney data. Type ?kidney to learn about it. It has a time to event (infection), a censoring indicator, age, sex, and disease type. The package authors already wrote the model code for us, to which I make only one change: assigning the data to x (for consistency).


x = kidney

fit = brm(time | cens(censored) ~ age + sex + disease, data = x,
family = weibull, inits = "0")

The code form is mostly familiar from other models, except for the addition of time | cens(censored) to indicate this is time given censoring. The “weibull” is to characterize uncertainty in the time. ?brmsfamily shows the other options. When run, this will first show “Compiling the C++ model”. It could take considerable time here, too; minutes, maybe, depending on your resources. Then the MCMC bits begin.

As before, we could take time to examine all the MCMC diagnostics which give information about the parameters. This should be done. Do plot(fit). Look for “convergence”. I won’t do that here, because this example works fine.

In the end, we do not give a rat’s kidney about the parameters. They do not exist. They are mathematical struts that make the model work. That and nothing more. As always, we care about this:

     Pr( time in t | New age, sex, disease, D, M) (1),

where t is some time of interest, where we make guesses of new values of the measures, where D is the old data, and M the model. Where we remember that the priors and the MCMC suppositions all form the model M. Change the model—change any part of the model—change the probability! This is not a bug, it’s a feature. Change anything on the right hand side, change the probability. All probability is conditional.

We want predictions. Which is to say, we want equation (1). Which means we must supply guesses of age, sex, and disease type. If we didn’t want to specify guesses of age, sex, and disease type, we shouldn’t have put them into the model. Luckily, we have a ready supply of such guesses: the old data. The probabilities produced by (1) will not be for these old patients, though (unlike the supposition of classical hypothesis testing). We already know all about them. They will be for new patients who are “like” the old ones, where “like” is defined by us: an implicit right-hand-side assumption. It is there even though it doesn’t appear visibly.


p = predict(fit, probs = c(0.10, 0.90))

The probs = c(0.10, 0.90) is not the default, which instead is the old familiar number. But why on earth do we want 95% prediction intervals? That’s a mighty and harsh requirement for time predictions. It produces great uncertainty; why shouldn’t it? Such an interval says “Conditional on the model and so on, there is a 95% chance this patient will have an event in this time interval.” Considering a 100% chance would be the interval (0, infinity), you can see a 95% interval would be wide, too. The default is there only because old habits die hard.
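
If you doubt it, compute the old familiar interval too and compare widths. A quick sketch (not part of the original run):

p95 = predict(fit, probs = c(0.025, 0.975))  # the default 95% interval
head((p95[,4] - p95[,3]) / (p[,4] - p[,3]))  # how much wider is 95% than 80%?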

First few rows of p:

> p
       Estimate  Est.Error       Q10        Q90
 [1,]  44.91669   54.83210  3.636403  105.41077
 [2,] 197.94483  223.35119 17.389400  478.37263
 [3,]  42.76444   51.70631  3.689502   98.58186
 [4,] 211.71675  233.77522 18.804205  494.05413
 [5,]  46.70474   58.45127  3.842455  111.61812
 [6,] 221.68699  263.71488 19.417692  519.37713
 [7,]  40.13264   52.69928  3.114666   95.80451
 [8,] 196.17201  237.87493 16.869121  470.98349
 [9,] 123.36390  146.84011 10.551705  294.04778

And the first few rows of x (which are matched to these p):

> x
   time censored patient recur age    sex disease
1     8        0       1     1  28   male   other
2    23        0       2     1  48 female      GN
3    22        0       3     1  32   male   other
4   447        0       4     1  31 female   other
5    30        0       5     1  10   male   other
6    24        0       6     1  16 female   other
7     7        0       7     1  51   male      GN
8   511        0       8     1  55 female      GN
9    53        0       9     1  69 female      AN

Doesn’t look so hot, this model. We have to be careful how we interpret its performance, though, because of the censoring (none of the first nine were censored, meaning all had the event).

Let’s look at the empirical cumulative distribution functions for the data, and for the point predictions, busted out by censoring.

plot(ecdf(x$time[x$censored==0]), main = 'ECDF', xlab='Time' )
   lines(ecdf(x$time[x$censored==1]),col=2)
   lines(ecdf(p[x$censored==0,1]),col=3)
   lines(ecdf(p[x$censored==1,1]),col=4)
   grid()
   legend('bottomright', c('X not-cens', 'X cens', 'P not-cens', 'P cens'), col=1:4, lwd=3, bty='n')

There is a clear difference in distributions of times for censored and uncensored data. What is assumed is that the times for the censored patients will be larger than what is seen (obviously). There is no censoring in the predictions, of course; the breaking out by censoring is only to show the matching points with the data. That is, the model is not predicting whether a new patient will be censored, for that concept has no place in guessing a person’s eventual time to event—which may be “infinite”, i.e. they never get an infection. Censoring only happens in limited-time studies.

This all means the predicted times must be larger than what was seen. The censored points “push out” the ECDFs to higher numbers. Thus (at least with a Weibull) the model tends to over predict in a way.

And what all that means is that we can’t really compare the model’s predictions with the observed data. The only way to verify this model is to test it on new times.

Wait. Build a model, make predictions, then test how well the model performs in real life? Where have we heard that before? You know it, baby. Rightched here. And in this gorgeous award-eligible book.

So let’s examine the predictions themselves, knowing (as we knew for all our past efforts) that we’re limited to making statements only about the predictions and not their quality.

Here we run back into the screwiness of MCMC. Ideally, we’d specify a new age, sex, disease and compute (1), which would produce the same number (same prediction) for every duplicate entry of age, sex, and disease. Not so with MCMC, which produces hazy numbers. Run this:


i = order(x[,5], x[,6],x[,7]) # order by age, sex, disease
x = x[i,]
p = p[i,]

Then look at the first few rows of x:

> x
   time censored patient recur age    sex disease
5    30        0       5     1  10   male   other
36   12        0       5     1  10   male   other
24  132        0      27     1  10 female      GN
50  156        0      27     2  10 female      GN
6    24        0       6     1  16 female   other
14  536        0      15     1  17 female   other
37  245        0       6     1  17 female   other
68   25        1      15     2  17 female   other
31  119        0      35     1  22 female   other
57    8        0      35     2  22 female   other
1     8        0       1     1  28   male   other
33   16        0       1     1  28   male   other
20  402        0      22     1  30 female   other
72   24        1      22     2  30 female   other

The first two patients are the same! As are many of the others. At this point somebody will chirp up “But those data are correlated! This is the wrong model!” Which, I have to tell you, is empty of meaning, or ambiguous. Correlated to us means that, when conditioned on, a probability changes. It does not mean cause. Since probability is conditional on the sort of model we choose, and on everything else on the right hand side, it is not clear how multiple measures on patients would change the probability. Other models are easy to explore; the package authors even thought of some. ?kidney will show them to you (scroll to the bottom).

We’re going to ignore the multiple measures aspect (we’re not in this lesson trying to build the world’s most predictive model of kidney infections). Instead we’ll suppose, as happens, we have some rows of data that are the same. The first two rows of data are identical, as far as (1) goes. They have predictions:

> p
       Estimate  Est.Error       Q10        Q90
 [1,]  46.70474   58.45127  3.842455  111.61812
 [2,]  46.26424   56.88700  3.454595  111.20644

Again, if these were analytical results, or non “simulated” results, these rows would be identical. They are not. They’re close, and whether “close” is close enough depends on the decisions that would be made—and on nothing else. I’m going to say close enough. But you might not. If not, you have to find a way to merge them, either by some kind of averaging, say, or by working through the parent code and hacking the simulation to group like rows, or whatever. Far-apartness would then be an indication the model did not “converge”.
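
If you do decide they must be merged, one crude way (a sketch only; averaging Est.Error like this is rough) is to average the prediction summaries within each unique combination of the measures:

key = interaction(x$age, x$sex, x$disease, drop = TRUE)  # identical measure combinations
p.merged = aggregate(p, by = list(scenario = key), FUN = mean)
head(p.merged)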

Relevance

Okay! So hypothesis testing is out. How do we test measures? Class? Class?

If you said relevance, you’re right! Good memory. Let’s first look at all the predictions in some useful way. (The reordering of x and p won’t matter.) We could do something like this.

jit = rnorm(nrow(p),0,.2)
plot(x$age+jit,p[,1],pch=as.numeric(x$sex), col=as.numeric(x$disease), 
    cex=2, ylab='Times',xlab='Age', ylim=c(0, max(p)))
    grid()
    for(i in 1:nrow(p)){
      lines(c(x$age[i]+jit[i],x$age[i]+jit[i]),c(p[i,3],p[i,4]), col=as.numeric(x$disease[i]))
    }
   legend('topleft', c('male', 'female', 'other', 'GN','AN','PKD'), col=c('#999999','#999999',1:4), pch =c(1,2,NA,NA,NA,NA),lty=c(NA,NA,1,1,1,1), lwd=3, bty='n')

The jit adds a bit of jitter (which needs to be saved) to separate points. Remember: we are looking for differences in probability and not just point predictions.

The changes in probabilities for sex are obvious, and so are they for diseases AN and PKD versus the other two. The changes in probabilities are not so great for age, except for two females with PKD (it’s the same two patients measured twice each). These are the only females with PKD, and the suspicion is age doesn’t matter too much, but the combination of female and PKD does. I’m not a kidneyologist so I don’t know what this means.

We can test these easily.


y = data.frame(age = seq(10,70,10), sex=rep('female',7), disease=rep('PKD',7))
p = predict(fit, newdata=y, probs = c(0.10, 0.90))

> cbind(round(p),y)
  Estimate Est.Error Q10  Q90 age    sex disease
1     1084      1884  56 2509  10 female     PKD
2     1022      1539  63 2481  20 female     PKD
3      951      1375  67 2353  30 female     PKD
4      914      1340  64 2176  40 female     PKD
5      942      1468  63 2279  50 female     PKD
6      876      1279  68 2066  60 female     PKD
7      901      1309  58 2140  70 female     PKD

So age does change the probability (and in a way that makes sense to naive readers like me). Is this change enough to make a difference? I have no idea, and unless you are a kidney guy, neither do you. I don’t know what kind of decisions are pertinent. These kinds of decisions are not up to the statistician to make. There are no magic numbers in the predictive way. P-values presume to give probabilities and make decisions simultaneously. We won’t make that enormous mistake.

Homework

You can repeat the same thing but for sex and disease. If you know something about kidneys, let us know below.
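
A sketch of one way to start (the scenario choices are mine, not gospel): hold age fixed, then vary sex and disease.

y = expand.grid(age = 40, sex = c('male','female'),
     disease = c('other','GN','AN','PKD'), stringsAsFactors = FALSE)
p.y = predict(fit, newdata = y, probs = c(0.10, 0.90))
cbind(round(p.y), y)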

Then fit the second model, where it says (from ?kidney) “adding random intercepts over patients”. That’s a misnomer. Everything not known in a Bayesian analysis is “random”, which is nothing but a synonym for unknown.

Compare directly the predictions (don’t forget you sorted p above) from both. We cannot say which of these models is better in a predictive sense per se: not until we get new data in. But what can you say?
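
Here is a sketch of how that comparison might begin, assuming the second formula on the ?kidney page is the one with the per-patient intercept term (check your version). Since x was reordered above, predictions from a fit on x line up row for row with the reordered p.

fit2 = brm(time | cens(censored) ~ age + sex + disease + (1 | patient),
    data = x, family = weibull, inits = "0")
p2 = predict(fit2, probs = c(0.10, 0.90))
plot(p[,1], p2[,1], xlab='First model', ylab='With per-patient intercepts')
   abline(0,1)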

Advanced readers should try this. brms is limited, unlike rstanarm, because its prediction method only spits out a point prediction and prediction bounds. In rstanarm you get the whole distribution.

First pick a combination of the measures, and then a time you think is interesting. Suppose it’s 300. Now find the probability of exceeding that time with your given combination. This is trivial in rstanarm. Here you need to optimize. Next take a systematic series of measures (age, sex, disease) and plot these exceedance probabilities for this sequence. Different times will lead, of course, to different curves. The differences in those curves may be big or small depending on the decisions to be made conditional on the model. Proving, yet again, that the same model may be useful to one man, and useless to the next.
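
One possible shortcut, a sketch only, and assuming your version of brms accepts summary = FALSE in predict() (check ?predict.brmsfit): pull the simulated predictive draws and count.

y = data.frame(age = 40, sex = 'female', disease = 'PKD')  # a made-up combination
draws = predict(fit, newdata = y, summary = FALSE)         # simulated times for that scenario
mean(draws > 300)                                          # estimated Pr(time > 300 | y, D, M)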

November 6, 2018

The Controversy Over Randomization And Balance In Clinical Trials

There was a paper a short while back, “Why all randomised controlled trials produce biased results”, by Alexander Krauss, in the Annals of Medicine. It raised some academic eyebrows.

Krauss says, “RCTs face a range of strong assumptions, biases and limitations that have not yet all been thoroughly discussed in the literature.”

His critics say, “Oh yes they have.”

Krauss says that examining the “10 most cited RCTs worldwide” “shows that [RCT] trials inevitably produce bias.”

His critics say, “Oh no they don’t.”

Krauss says, “Trials involve complex processes — from randomising, blinding and controlling, to implementing treatments, monitoring participants etc. — that require many decisions and steps at different levels that bring their own assumptions and degree of bias to results.”

His critics say, “No kidding, genius.”

Those critics—Andrew Althouse, Kaleab Abebe, Gary Collins, and Frank E Harrell—were none too happy with Krauss, charging him with not doing his homework.

The critics have the upper hand here. But I disagree with them on a point or two, about which more below.

The piece states that the simple-treatment-at-the-individual-level limitation is a constraint of RCTs not yet thoroughly discussed and notes that randomization is infeasible for many scientific questions. This, however, is not relevant to the claim that all RCTs produce biased results; it merely suggests that we should not use randomized controlled trials for questions where they are not applicable. Furthermore, the piece states that randomized trials cannot generally be conducted in cases with multiple and complex treatments or outcomes simultaneously that often reflect the reality of medical situations. This statement ignores a great deal of innovation in trial designs, including some very agile and adaptable designs capable of evaluating multiple complex treatments and/or outcomes across variable populations.

They go on to note some of these wonders. Then they come to one of the two key points: “there is no requirement for baseline balance in all covariates to have a valid statistical inference from [a statistical] trial”, calling such a belief a “myth”, meaning (as moderns do) a falsity.

It is false, too. Balance is not necessary. Who cares if the patients in group A used to own just as many marbles as the patients in group B when they were all six? And, of course, you can go on and on like that practically ad infinitum, which brings the realization that “randomization” never brings balance.

Control is what counts. But control is, like probability itself, conditional on the premises we bring to the problem. The goal of all experimentation is to discover, to the closest possible extent, the cause of the thing studied. If we knew the cause, we would not need to do the study. If we do not know the cause, a study may enlighten us, as long as we are measuring things in what I call the “causal path” of the item of interest. Also see this, for the hippest most modern-year analysis ever!

We don’t control for past marble ownership in most clinical trials, nor do we wish to, because we cannot bring ourselves to believe the premise that marble ownership is in the causal path of the thing under study. If we knew, really knew, the exact cause, we could run an experiment with perfect controls, since what should be controlled is a known part of the cause.

That we know so little, except in the grossest sense, about the right and proper controls, is why we have to do these trials, which are correlational. We extract, in our minds, the usable, and sometimes false, essences in these correlations and improve our understanding of cause.

Another reason balance isn’t needed: probability conditions on the totality of our beliefs about the proposition of interest (models, on the other hand, condition on a tiny formal fraction). Balance doesn’t provide any special insight, unless the proposition of interest itself involves balance.

Notice that medical trials are not run like physics experiments, even though the goals of both are the same, and the nature of evidence is identical in both setups, too. Both control, and physics controls better, because physical knowledge is of vastly simpler systems, so knowledge of cause is greater.

The differences are “randomization” and, sometimes, “blinding”.

Krauss’s critics say, “It is important to remember that the fundamental goal of randomization in clinical trials is preventing selection bias”.

Indeed, it is not just the fundamental, but the only goal. The reason “randomization” is used is the same reason referees flip the coin at the start of ballgames and not a player or coach or fan from one of the sides. “Randomization” provides the exact same control—yes, the word is control—that blinding performs. Both make it harder to cheat.

There is nothing so dishonest as a human being. The simplest and most frequent victim of his mendacity is himself. Every scientist believes in confirmation bias, just as every scientist believes it happens to the other guy.

“Randomization” and “blinding” move the control from the interested scientist to a disinterested device. It is the disinterestedness that counts here, not the “randomness”. If we had a panel of angelic judges watching over our experiment and control assignments, and angels (the good kind) find it impossible to lie, well, we would not need “randomness” nor blinding.

The problem some (not all) have with “randomization” is that they believe it induces a kind of mystical condition where certain measurements “take on” or are imbued with probability, which things can do because (to them) things “have” probability. And that if it weren’t for “randomization”, the things wouldn’t have the proper probability. Randomization, then, is alchemy.

Probability doesn’t exist, and “random” only means unknown (or unknown cause), so adding “unknownness” to an experiment does nothing for you, epistemologically speaking.

There are some interesting technical details about complex experiments in the critics’ response that are also worth reading, incidentally.

October 3, 2018

How To Do Predictive Statistics: Part IX Stan — Logistic & Beta Regression

Review!

We’re doing logistic and beta regression this time. These aren’t far apart, because the observable for both lives between 0 and 1; for logistic it is 0 or 1; for beta, any fraction or ratio—but not probability—that is on (0,1) works. We don’t model probability; we use probability to model.

That brings up another blunt point. In these demonstrations I do not care much about the models themselves. I’m skipping over all the nice adjustments, tweaks, careful considerations, and other in-depth details about modeling. Most of that stuff is certainly of use, if tainted by the false belief, shared by both frequentists and Bayesians, that probability exists.

I am not here to teach you how to create the best model for this or that kind of observable. I am not here to teach you best coding practices. I am here to teach you the philosophy of uncertainty. Everything takes an economy class nonrefundable ticket seat to that. Because that’s true, it’s more than likely missed code shortcuts and model crudities will be readily apparent to experts in these areas. You’re welcome to put corrective tips in the comments.

On with the show!

Logistic

Let’s use, as we did before, the MASS package dataset birthwt.


library(MCMCpack)
library(rstanarm)

x=MASS::birthwt # we always store in x for downstream ease
x$race = as.factor(x$race)

fit.m =  MCMClogit(low ~ smoke + age + race + ptl + ht + ftv, data=x)
fit.s = stan_glm (low ~ smoke + age + race + ptl + ht + ftv, family=binomial, data=x)

Last time we didn’t put all those other measures in; this time we do. Notice that we specify the data in both methods: rstanarm needs this, and it’s good practice anyway.



# This is a copy-paste from the MCMClogit lesson; only changing to p.m
p.m = NA
for(i in 1:nrow(x)){
  p.m[i] = MCMClogit.pred(fit.m,x[i,])
}

plot(x$age,p.m,col=x$race,ylab='Pr(low wt|old data,M)', pch=x$smoke+1,main='MCMC')
grid()
legend('topright',c('r1','r2','r3','s0','s1'), col=c(1,2,3,1,1), pch = c(3,3,3,1,2), bty='n')

Save that plot. Recall that the three races and two smoking states are given different colors and plotting characters. There is more to each scenario than just these measures, as the model statements show. But this is a start.



# This is new
p.s = posterior_predict(fit.s)
plot(x$age,colMeans(p.s),col=x$race,ylab='Pr(low wt|old data,M)', pch=x$smoke+1,main='Stan')
grid()
legend('topright',c('r1','r2','r3','s0','s1'), col=c(1,2,3,1,1), pch = c(3,3,3,1,2), bty='n')


Notice in the plot we have to do colMeans(p.s) to get the probability estimates—this is the tweak I mentioned last time. That’s because p.s contains nothing but 189 columns (same as original data length) of 0s and 1s. Remember these are predictions! We take the average of the predictions, at each scenario, to get the probability estimate.

Stare and see the differences in the plots. We can’t put them all on one plot so easily here. While there are some differences, my guess is that no one would make any different decision based on them.
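
If staring isn’t enough, a direct comparison is only a couple of lines (a sketch; both objects are already in memory):

plot(p.m, colMeans(p.s), xlab='MCMClogit Pr(low wt)', ylab='Stan Pr(low wt)')
   abline(0,1)
   grid()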

For homework, you can use rstanarm to check for measure relevancy and importance. Recalling, as you do so, that these are conditional concepts! As all probability is.

What’s that? You don’t remember how to do that? You did review, right? Sigh. Here’s one way of checking the measure ptl, the number of previous premature labours.



# The commented out lines we already ran; they're in memory
#fit.s = stan_glm (low ~ smoke + age + race + ptl + ht + ftv, family=binomial, data=x)
fit.s.2 = stan_glm (low ~ smoke + age + race +  ht + ftv, family=binomial, data=x)

p.s = posterior_predict(fit.s)
p.s.2 = posterior_predict(fit.s.2)

a.1 = colMeans(p.s) # this was not stored earlier, so compute it now
a.2 = colMeans(p.s.2)

plot(a.1,a.2, xlab='Full model', ylab='Sans ptl', main='Pr(low wt|old data, Ms)')
  abline(0,1)

Obviously ptl is relevant in the face of all these other measures. Would it be if we excluded others? Or with new observations? You check. That’s an order: you check. Would you say, as a decision maker interested in predicting low birth weight, that the probabilities change enough such that different decisions would be made using the different models? If so, then ptl is important, and should be kept in this model; i.e. in this form of the model with all the other measures in, too. If not, then it should be chucked.
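
To put rough numbers on how far the probabilities move when ptl is dropped, something like this works (a sketch; a.1 and a.2 were computed above):

summary(abs(a.1 - a.2))  # how much do the old-data probabilities shift?
max(abs(a.1 - a.2))      # the single largest shift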

There is no longer any such thing as hypothesis testing, or model building without reference to the decisions to be made using the model. Beyond raw relevancy, which is trivial to see, it is impossible for model building to be independent of decision making.

Beta regression

We can’t do comparisons here, because only rstanarm has this kind of model. It’s for observables living on (0,1), things like ratios, fractions, and the like. The idea (which you can look up elsewhere) is that uncertainty in the observable y is characterized with a beta distribution. These have two parameters—normal distributions do, too. Unlike with normals, here we can model linear functions of both (suitably transformed) parameters. This is done with normals, too, in things like GARCH time series. But it’s not usual for regular regressions.

For beta regression, one or both parameters is transformed (by logging or identity, usually), and this is equated to a linear function of measures. The linear function does not have to be the same for both transformed parameters.

In the end, although there are all kinds of considerations about the kinds of transforms and linear functions, we are really interested in predictions of the observable, and its uncertainty. Meaning, I’m going to use the defaults on all these models, and will leave you to investigate how to make changes. Why? Because no matter what changes you make to the parameterizations, the predictions about observables remain the same. And that is really all that counts. We are predictivists!

We last time loaded the betareg package. We’re not going to use it except to steal some of its data (the rstanarm package doesn’t have any suitable).

library(betareg)
data('GasolineYield', package='betareg')
x = GasolineYield

attr(x$batch, "contrasts") <- NULL # odd line

?GasolineYield # to investigate the data

We're interested in yield: "proportion of crude oil converted to gasoline after distillation and fractionation" and its uncertainty relating to temperature and experimental batch. There are other measures, and you can play with these on your own.

The "odd line" removes the "contrasts" put there by the data collectors, and which are used to create contrasts in the parameters; say, checking whether the parameter for batch 2 was different than for batch 10, or whatever. We never care about these. If we want to know the difference in uncertainty in the observable for different batches, we just look. Be cautious in using canned examples, because these kinds of things hide. I didn't notice it at first and was flummoxed by some screwy results in some generated scenarios. People aren't coding these things with predictions in mind.


fit = stan_betareg(yield ~ batch + temp | temp, data=x)
p = predictive_interval(fit)


The first transformed parameter is a linear function of batch and temp. The second—everything to the right of "|"—is a linear function of temperature. This was added because, the example makers say, the lone-parameter model wasn't adequate.
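
In symbols, and using my own rough notation (the logit and log links are the usual defaults; check ?stan_betareg):

# logit(mu_i) = b0 + b1*batch_i + b2*temp_i   # left of "|": the beta mean
# log(phi_i)  = g0 + g1*temp_i                # right of "|": the beta precision
# yield_i is then beta-distributed with mean mu_i and precision phi_i.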

How do we check adequacy? Right: only one way. Checking the model's predictions against new observables never before used in any way. Do we have those in this case? No, we do not. So can we adequately check this model? No, we cannot. Then how can we know the model we design will work well in practice? We do not know.

And does everything just said apply to every model everywhere for all time? Even to those models released by experts who are loaded to their uvulas with grant money? Yes: yes, it does. Is vast, vast, inconceivable over-certainty produced when people just fit models and release notes on the fit as if these notes (parameter estimates etc.) are adequate for judging model goodness? Yes, it is. Then why do people do this?

Because it is easy and hardly anybody knows better.

With those true, sobering, and probably to-be-ignored important words, let's continue.


plot(x$temp,p[,2],type='n',ylim=c(0,.6),xlab='Temperature',ylab='Yield')
for(i in 1:nrow(p)){
   lines(c(x$temp[i],x$temp[i]),c(p[i,1],p[i,2]))
   text(x$temp[i],mean(c(p[i,1],p[i,2])), as.character(x$batch[i]))
}
grid()


This picture heads up today's post.

We went right for the (90%) predictive intervals, because these are easy to see, and plotted up each batch at each temperature. It depends on the batch, but it looks like, as temperature increases, we have some confidence (and I do not mean this word in its frequentist sense) that yield increases.

Let's do our own scenario, batch 1 at increasing temperatures.


y = data.frame(batch="1",temp=seq(200,450,20))
p.y = predictive_interval(fit,newdata=y)
plot(y$temp,p.y[,2],type='n',ylim=c(0,.6),xlab='Temperature',ylab='Yield')
for(i in 1:nrow(p.y)){
   lines(c(y$temp[i],y$temp[i]),c(p.y[i,1],p.y[i,2]))
}


This is where I noticed the screwiness. With the contrasts I wasn't getting results that matched the original data, when, of course, if I make up a scenario that is identical to the original data, the predictions should be the same. This took me a good hour to track down, because I failed to even (at first) think about contrasts. Nobody bats a thousand.

Let's do our own contrast. Would a decision maker do anything different regarding batches 7 and 8?


y = data.frame(batch="7",temp=seq(200,450,20))
  p.y.7 = predictive_interval(fit,newdata=y)
y = data.frame(batch="8",temp=seq(200,450,20))
  p.y.8 = predictive_interval(fit,newdata=y)

plot(p.y.7[,1],p.y.8[,1],type='b')
  lines(p.y.7[,2],p.y.8[,2],type='b',col=2)
  abline(0,1)


This only checks the 90% interval. If the decision maker has different important points (say, yields greater than 0.4, or whatever), we'd use those. Different decision makers would do different things. A good model to one decision maker can be a lousy one to a second!
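
For instance, a decision maker fixated on yields greater than 0.4 could go to the full predictive distribution, which rstanarm exposes through posterior_predict() (a sketch, with a scenario of my own choosing):

y = data.frame(batch = "7", temp = seq(200,450,20))
draws = posterior_predict(fit, newdata = y)    # predictive draws, one column per temperature
cbind(y, pr.gt.0.4 = colMeans(draws > 0.4))    # estimated Pr(yield > 0.4 | y, D, M)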

Keep repeating these things to yourself.

Batch 7 gives slightly higher upper bounds on the yields. How much? Mixing code and output:


> p.y.7/p.y.8
         5%      95%
1  1.131608 1.083384
2  1.034032 1.067870
3  1.062169 1.054514
4  1.129055 1.101601
5  1.081880 1.034637
6  1.062632 1.061596
7  1.068189 1.065227
8  1.063752 1.048760
9  1.052784 1.048021
10 1.036181 1.033333
11 1.054342 1.028127
12 1.026944 1.042885
13 1.062791 1.037957


Say 3%-8% higher. That difference enough to make a difference? Not to me. But what the heck do I know about yields like this. Answer: not much. I am a statistician and am incompetent to answer the question—as is each statistician who attempts to answer it with, God help us, a p-value.

At this point I'd ask the client: keep these batches separate? If he says "Nah; I need 20% or more of a difference to make a difference", then relevel the batch measure:


levels(x$batch) = c('1','2','3','4','5','6','7-8','7-8','9','10')

Then rerun the model and check everything again.