Uncertainty & Probability Theory: The Logic of Science
Link to all Classes. Jaynes’s book (first part): https://bayes.wustl.edu/etj/prob/book.pdf
Permanent class page: https://www.wmbriggs.com/class/
Video
Links:
Bitchute (often a day or so behind, for whatever reason)
HOMEWORK: Play with the code, if you like, or just read.
Lecture
Stuck with the shiny whiteboard for the moment. Working on it.
There were some questions about “randomization” which I thought best to clear up in an entire lecture. We’ll soon come to “simulations”, so we badly need to understand what “randomization” provides, and, more so, what it does not.
Random, you will recall, means unpredictable, and unpredictable means a proposition with a probability that is not extreme, i.e. 0 or 1. Propositions about Reality are unpredictable because you don’t know the full cause and condition of that proposition. So sometimes I’ll say casually (a good joke) that random means unknown cause.
A proposition is only “random” on the evidence assumed. Change the evidence, change the probability: all (as in all) probability is conditional on the evidence assumed.
You have to choose which side gets the kickoff in a football game. The choice must be made. The choice has a cause. The referee makes the choice. But if he were just to announce the choice, many would suspect him of cheating, of favoring one team or another. So he removes known causes of the choice, and makes the choice unpredictable. He flips a coin. The coin has removed the predictability of the choice. This is seen as fair, since the causes of the flip are both unknown and assumed to be non-manipulated (but they can be!).
Randomization, then, removes knowledge of cause. Our goal for the propositions of interest is to get as close to knowledge of cause as possible, so any process that removes knowledge of cause does not help us.
Let’s fix an example. We want to test the drug Profital against a placebo, or some older treatment. We’ll test a certain number of people, some getting Profital and some the placebo. We could form the Profital group by taking the first people that came in the door to sign up, and the placebo group by taking the others. It need not be a fifty-fifty split (why, we’ll learn another day), but for ease, and because it makes no difference to this discussion, suppose it is.
The people signing up will have some attributes that we can measure, like sex, and a host of others we either could but don’t, or just plain can’t. We can obviously control—physically control—the attributes we measure. For instance, we ensure (say) that the two groups have a roughly even split between males and females. No one disputes that this is a good idea, nor that it is a possibility.
Randomization obviously cannot be used as this kind of control. If we’re, say, flipping coins to decide who goes in what group, it’s possible, if our two drug groups are small, that we could end up with all or most males in one of the groups. You have the idea.
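The arithmetic is easy to sketch. Assuming, say, 10 subjects, 5 of them male, split by coin flip into two groups of 5 (numbers invented for illustration), the hypergeometric distribution gives the chance of a badly lopsided split:

```r
# Chance a fair random split of 10 subjects (5 males, 5 females)
# into two groups of 5 puts 4 or all 5 males into Group 1.
# dhyper(x, m, n, k): prob of x males when m males, n females, k drawn.
p_lopsided = sum(dhyper(c(0, 1, 4, 5), 5, 5, 5))
round(p_lopsided, 2) # 0.21: about one split in five is badly unbalanced
```

With small groups, randomization plainly guarantees nothing about balance, even for a single visible attribute.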
Yet somehow it is thought that randomization, such as by a kind of coin flipping, does ensure balance of all the attributes we do not, or cannot, measure. This is false. Which I shall now prove to you.
For ease, suppose there is some total number of possible attributes, and each person either has a given attribute or does not. A given person may have fewer than the total, which only means he lacks the attributes from some point up to that total. In reality, it’s more than just yes/no, 1/0: an attribute may be a number, a range, a quality. But, as you’ll see, considering those makes the case for randomization even worse.
Remember, we cannot or do not measure these attributes. They are hidden. We do not know what they are, except that a person can have them or not. Consider attribute 1. Now in the sample we eventually collect, of whatever size, there will be a proportion p_1 of people with the attribute. This proportion is any number between 0 and 1, inclusive. The same is true for the other attributes: there will be some proportion of people who have each. It does not matter whether any grouping of these attributes is “correlated”. We don’t even know what this word means yet. But, once you do learn, you’ll agree.
Using the statistical syllogism, the probability a person in our entire sample has attribute 1 is p_1. In notation, Pr(A_1 | D_1) = p_1, where D_1 is the “distribution”. Not some fictional infinite frequentist or Bayesian entity (which, alas, goes by this same name), but the description of the actual sample of people we will have. We proved long ago that this is the correct probability.
Obviously, Pr(A_i|D_i) = p_i, for i = 1, 2, …, m. Do not think beyond this correct true equation, to some persiflage about “correlation”. We already know that if we change the information, we change the probability, so that Pr(A_i|D_i) will not equal Pr(A_i|D_i OtherInfo), if that “OtherInfo” (after considering D_i) is probative of A_i.
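Both facts can be sketched in R on an invented sample (the prevalences and the tie to sex here are assumptions for illustration only): the probability is the proportion in the sample description, and adding probative information changes it.

```r
# Invented sample description D_1: 200 subjects, attribute 1 tied to sex.
set.seed(1)
sex = rbinom(200, 1, 0.5)                        # 1 = male (hypothetical)
a1  = rbinom(200, 1, ifelse(sex == 1, 0.4, 0.2)) # attribute 1 indicator
mean(a1)            # Pr(A_1 | D_1): the proportion in the sample description
mean(a1[sex == 1])  # Pr(A_1 | D_1, male): probative info, new probability
```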
Now introduce some “randomization mechanism”, such as a coin flip where the causes are not controlled (physically controlled). Call the information about this mechanism (not the actual physical procedure as done on something or someone) R, for, ta da, randomization information. Given the physical act of randomization, a person can end up in Group 1 or Group 2. Again, these need not be equal chances. In the video, I discuss a paper which did 2 to 1. But we’ll suppose 1 to 1 for ease.
Obviously, the probability a person lands in G_1 or G_2 depends on R. Thus Pr(G_1|R) = 1/2. R says there are two groups, and that any person can equally well land in either. We don’t care so much about that, but we are interested in this: Pr(A_1 | G_1D_1R). (And for all i, of course.) This is the probability any person “randomized” to Group 1 has attribute 1.
I claim this is identical, mathematically and logically equivalent, I mean, to Pr(A_1 | D_1). If that’s so, the randomization does nothing to change our state of knowledge. Here’s the proof.
Let (stripping off the subscripts to make it easier to read) Pr(AG|DR) = Pr(A|GDR)Pr(G|DR), by the product rule. We want Pr(A|GDR), which equals Pr(AG|DR)/Pr(G|DR). We also have, again by the product rule, Pr(AG|DR) = Pr(G|ADR)Pr(A|DR), so that
Pr(A|GDR) = Pr(G|ADR)Pr(A|DR)/Pr(G|DR).
Knowing a person’s attribute does not affect the information about the randomization, which is done in ignorance of any attribute. Thus Pr(G|ADR) = Pr(G|DR), and so these terms cancel in numerator and denominator. (You can now see why the probability of being in a group doesn’t matter.) We are left with:
Pr(A|GDR) = Pr(A|DR).
We know that the knowledge of the randomization process, which is what R means (and is not the assignment itself, recall, which is G), has no bearing on whether a person has an attribute or not. Think: you can imagine any kind of randomization scheme, and that knowledge does not change, by magic, a person’s attributes. The question we wondered about was whether knowing a person was in a randomization group changed what we knew about their attributes. The knowledge of R is not the assignment; G is. Thus Pr(A|DR) = Pr(A|D), and so
Pr(A|GDR) = Pr(A|D).
QED. Randomization does not change our knowledge of hidden attributes.
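The identity can also be checked numerically; here is a sketch under invented numbers (one hidden attribute with prevalence 0.3, 200 subjects, half randomized to Group 1):

```r
# Check that Pr(A | G, D, R) matches Pr(A | D): the proportion with the
# attribute among those randomized to Group 1, averaged over many splits,
# equals the proportion in the whole sample.
set.seed(2)
s  = 200
a  = rbinom(s, 1, 0.3)   # hypothetical hidden attribute indicator
pA = mean(a)             # Pr(A | D): the sample proportion
pAG = replicate(5000, mean(a[sample(s, s/2)])) # Pr(A | G, D, R), per split
c(pA, mean(pAG))         # the two agree, up to simulation noise
```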
How could it! We don’t even know what the attributes are. Nor how many any person has.
Nor, and this is the real key, whether those attributes are important in the causal path of the outcome. Which is all we really care about in science. A person can have a million attributes, a billion, but are all of them involved in the causes and conditions that bring about the outcome? One thing is sure: you won’t know. You can’t know. How could you! The attributes are hidden.
Hello, quantum mechanics. Goodbye, locality.
This is probability, which is not in reality but our mind. In reality, in any real group the balance of people with attribute 1 might not be equally proportioned. Suppose we’re taking 100 in each Group, and that in total, 30 people have the hidden attribute. It could be that all 30 are in Group 1, or that only 29 are, and so on, down to none. We don’t know, we can’t know.
If we could know we could control—physically control. Not the word “control” as abused by regression fanatics, which we’ll cover another time. I can hear some objections. “We can’t take the first half and give them the drug and the second half the placebo. It could be that the early group are old men who get out of bed earlier, and have different medical characteristics than the younger women who arrive late in the day.”
True. But if you suspect this controllable, measurable attribute is important, then you could (a) measure it and (b) control it. And should, because this is where knowledge of a person’s group changes the probability of the causal-path attribute! It adds a whole new piece of information to the right hand side: Pr(A|GDRC), where C is for control.
It remains that “mixing up” subjects in the hope that this provides “balance” to hidden, unmeasured or unmeasurable attributes does nothing to change our state of knowledge — whatever it might do to the state of the world. Which is different.
Again, using the 100 in each group example, with 30 total having attribute 1, it could be that any number from 0 to 30 will be in either group. Ideally, if this attribute is in fact important in the causal path of the outcome/proposition-of-interest (which we do not and cannot know), we’d have 15 people in each group with attribute 1. The actual proportion we get could be anything. And the same for every other attribute.
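The spread of possibilities can be computed exactly: with 200 subjects, 30 carrying the attribute, and 100 randomized to each group, the count landing in Group 1 is hypergeometric.

```r
# Number of the 30 attribute-carriers landing in Group 1 of 100,
# out of 200 subjects total: hypergeometric with m = 30, n = 170, k = 100.
k = 0:30
probs = dhyper(k, 30, 170, 100)
k[which.max(probs)]      # 15, the ideal even split, is the single most likely...
round(probs[k == 15], 2) # ...yet it occurs only about 16% of the time
```

So even the most probable outcome, perfect balance, is the exception, not the rule.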
There will be some “distance” between the actual proportions and the evenly split proportions (I should say randomization proportions, since, recall, we don’t have to do 50-50, but that still doesn’t matter). If this is a simple absolute difference in proportions, it can never be more than 100%; in practice, it will be some absolute number between 0 and 100 percent. If the attribute is in the causal path, then the larger this difference, the more one group will differ from the other, not because of the items we controlled, but because of differences in the dispersion of attributes.
Yet we still remain ignorant of all these attributes. And anyway, it should be easy to see (without resorting to math) that for any difference we might pick in actual-ideal proportions, the more attributes a person has, the greater the probability that at least one attribute will exceed this difference. That’s the last graph of the lecture. For any difference, even all the way up to 100, the probability goes to 1 as the number of attributes increases.
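That claim can be simulated. A sketch, with prevalences invented by drawing them uniformly, counting how often at least one hidden attribute differs between the two randomized groups by more than 10 percentage points:

```r
# Probability at least one of n hidden attributes differs between the
# two randomized groups by more than 10 points, as n grows.
set.seed(3)
p_exceed = function(n, s = 200, reps = 500) {
  mean(replicate(reps, {
    p = runif(n)                                  # invented prevalences
    x = matrix(runif(s * n) <= rep(p, each = s), s, n)
    g = sample(s, s / 2)                          # randomize to Group 1
    d = abs(colMeans(x[g, , drop = FALSE]) - colMeans(x[-g, , drop = FALSE]))
    any(d > 0.10)                                 # any attribute off by > 10?
  }))
}
sapply(c(1, 10, 100), p_exceed)  # climbs toward 1 as attributes multiply
```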
Put it this way: there are going to be large differences in hidden attributes between groups no matter what you do, as long as the number of attributes is itself large. And you’ll never know.
The story is not bleak, however, because, at least with people, most attributes (like sock color, say, or favorite lollipop flavor) have no importance to the outcome/proposition-of-interest most of the time. Unless, of course, you are doing marketing studies on fashion or candy. And for things that aren’t people, like electrons, the number of attributes doesn’t (seem to) keep increasing.
If everywhere these hidden attributes were seriously important, we’d never get anything done in science. Things would be happening, it would seem, for no reason. No reason we could see or measure, that is. That is true in quantum mechanics, where we cannot see what’s “under” the world. But about that, and about causality in general, we will wait for another day.
Meanwhile, remember this: the only good randomization does is to reduce the suspicion of cheating or bias. Nothing more. If you have been using it and thinking it has been doing some good, it is because you knew of a causal-path important attribute that you were controlling for indirectly without admitting it to yourself outright (like mixing up timing above).
Code
Now, because some people don’t believe math, but do believe simulations, I have created some code which does this, to prove to you via example that randomization does nothing. (Later, I will teach you not to outright believe simulations.)
This code is written pedantically, so you can figure it out easily, and tinker with it.
Roughly, you pick a number of hidden attributes, and number of subjects in each randomization group. Then the code “simulates” (I drop the scare quotes after this) a proportion of those having each attribute. These proportions run anywhere from 0 to 1; the more attributes you specify, the more proportions in 0 to 1 there are.
Next it generates the actual “distributions” (see above!). Then it either “randomizes” the sample into two parts, or just takes the first and then the second parts.
It does this b times (a number you pick) and gives you the average across all repetitions. This stands in for the math we did above.
We next compare the proportion of attributes in each group (averaged over the b repetitions). If randomization gave you anything, these proportions should differ between the randomized groups and the groups formed by taking subjects in order of arrival. They don’t. Randomization does nothing.
#randomization does nothing R code
n = 20 # number of hidden attributes
s = 200 # number of subjects; make it even for ease
b = 1000 # number of resamples
x = array(NA,c(s,n,b)) # matrix of s subjects with n attributes, b times
# a loop since i'm too lazy to vectorize
# generate b resamples of s x n matrix of hidden attributes; this is the math part
p = runif(n) # chance each subject has each attribute
for(i in 1:b){
q = runif(n*s) # numbers between 0 and 1 'uniformly'
x[,,i] = matrix(q<=p, s, n, byrow=TRUE) # simulate attribute distribution
}
# Now these s individuals must come to us.
# We can split them 'randomly' to groups A and B, or the first half
# to A, then second to B.
# We then look at the proportion of hidden attributes in A and B,
# which should be roughly the same proportion in each.
# 'randomly'
j = sample(s) # this is the randomization of numbers 1 to s
Ar = x[j[1:(s/2)],,] # Group 'A'; takes first half of randomization
Br = x[j[(s/2+1):s],,] # Group 'B'; takes the second
# first one half, then the other in Order
Ao = x[1:(s/2),,] # Group 'A'; takes the first half in the door
Bo = x[(s/2+1):s,,] # Group 'B'; takes the second
# This compares the proportion of attributes in the randomization group
# versus the proportion in the no-randomization group, for Group A
# these only look exactly identical. Look at the individual rowMeans
# and you'll see small differences
plot(rowMeans(apply(Ar,3,colMeans)),
rowMeans(apply(Ao,3,colMeans)))
abline(0,1) # 1 to 1 line
plot(rowMeans(apply(Br,3,colMeans)),
rowMeans(apply(Bo,3,colMeans)))
abline(0,1) # 1 to 1 line
# the actual proportions overall, i.e. our "distribution"
rowMeans(apply(x,3,colMeans))
I happen to know a certain medical/research oncologist (Robert A. Nagourney Md.) who, well over 30 years ago, developed an assay using a cancer patient’s own living tumor cells to determine the most effective treatment.
After observing very dramatic improvement in clinical response rates, he was chagrined at the unwillingness of the medical community to even examine his approach. Thus, he proposed a study which would compare response rates of patients whose treatment was “assay directed” versus patients whose treatment had been selected via response probability models. He felt confident that such a comparison would establish the validity of his technique, as indeed it would have.
Lamentably, that study was never carried out. Why was the study denied? one might ask… It wasn’t random enough.