Every Result Of Unsupervised Learning Is Correct; Or, All Learning Is Supervised

The real point I wish to make is that there is no such thing as unsupervised learning; or, stated another way, Truth exists; or, stated another way, every solution to an unsupervised learning problem is conditionally correct.

To explain. In machine learning, artificial intelligence, and statistics, too, there are, it is said, two regimes: supervised and unsupervised learning. Let’s state these in terms of classification, since that’s the simplest and always discrete, a form of problem which regular readers will know I love.

In so-called supervised learning, a set of data is observed which have specified classes (more than one, of course). These classes are known: A, B, C, and so on. Male and female are a good example. Days of the week are another. And so on endlessly.

Not only are the classes known, but other measures thought probative of the classes are also measured; these additional measures are a requirement. If all (as in all) we had were labels, we would be done, because there would be nothing left to relate the classes to. Given that warning, if we’re interested in M/F, then perhaps we have “calories consumed today” and “weight bench pressed”, or whatever.

A model then relates the probative measures to the classes, and in the end we form the prediction

     Pr( class = i | old data, new probative measures, M),

where M are the assumptions that led to our model. Notice we have “integrated out” any parameters of M, since they are of no interest to man nor beast. This form is so familiar that nothing more need be said about it. Except that I dislike the term “learning” applied to it. It doesn’t matter if M is a standard statistical model, or machine learning algorithm, or neural net, or whatever. It’s just a model, unless we are in the rare situation where we know the cause of the class given the new probative measure, or have otherwise deduced M from known or assumed true premises.
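A minimal sketch of that predictive form may help. Assume, purely for illustration, that M is a categorical model with a uniform Dirichlet prior on the class probabilities, fitted separately for each value of a single discrete probative measure. That choice of M, the function name `predictive_probability`, and the toy data are all assumptions of this sketch. Under them, integrating out the parameters leaves nothing but counts:

```python
from collections import Counter

def predictive_probability(old_data, new_measure, classes, prior_count=1.0):
    """Pr(class = i | old data, new measure, M) under the assumed M.

    M here is a categorical model with a symmetric Dirichlet prior;
    the class-probability parameters have been integrated out, so only
    counts from the old data appear in the answer.
    """
    counts = Counter(label for measure, label in old_data if measure == new_measure)
    total = sum(counts.values()) + prior_count * len(classes)
    return {c: (counts[c] + prior_count) / total for c in classes}

# Hypothetical old data: (weight bench pressed, class)
old_data = [("heavy", "M"), ("heavy", "M"), ("heavy", "F"),
            ("light", "F"), ("light", "F"), ("light", "M")]

probs = predictive_probability(old_data, "heavy", ["M", "F"])
print(probs)  # {'M': 0.6, 'F': 0.4}
```

Change the prior or the model family and the numbers change; each answer is correct conditional on its own M.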

What about unsupervised learning? All we have are some measures and no classes. For instance, somebody supplies us with a spreadsheet having just “calories consumed today” and “weight bench pressed”. Obviously—and most importantly—these are labels we assign, that we supply meaning to. This must not be forgotten. To the computer which reads these numbers, they are just numbers; and they’re not even numbers, just electrical impulses.

Now we might think to ourselves, “Say, I wonder if these two measures belong directly or indirectly, together or perhaps separately, to different classes? All I have are the measures and no indication of class. How many classes might there be? If there are two or three or whatever, what are their characteristics? It might be that there are k classes, and that it is true that these two measures vary by some set way inside each class. Indeed, if the measures do not vary in some set way between two or more classes, I will not be able to tell these classes apart.”

And so we come to an algorithm that identifies classes. There are many such algorithms. K-means clustering is one of the most popular, ably and succinctly described here (I’ll assume everybody reads this, or already knows about these algorithms).

We start with the assumption this algorithm will find the clusters, or classes, that are there. And we end with that assumption, too!

Put it another way. We started by assuming the algorithm will do its job. We run it and it does its job. Therefore, the algorithm has done its job. Assuming no mechanical or electrical failures in the computations, the algorithm has done what it promised to do. How could it not?

The clusters/classes the algorithm finds are correct—conditional on assuming the algorithm. It’s exactly the same situation as supervised learning. The probability above spits out correct probabilities conditional on the model. Unsupervised learning spits out correct classifications conditional on the model/algorithm.
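To make this concrete, here is a bare-bones k-means (Lloyd's algorithm) written for illustration only. It is not the implementation described at the link, and the data, the function names, and the seed are invented for this sketch. The thing to notice is that k and the distance are fixed before the data are touched.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points]

def kmeans(points, k, seed=0, max_iter=100):
    """Plain Lloyd's algorithm; k and the distance are chosen in advance."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k of the observations
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # keep the old center if a cluster happens to be empty
            new_centers.append(
                tuple(sum(col) / len(members) for col in zip(*members))
                if members else centers[j]
            )
        if new_centers == centers:  # nothing moved, so the fit has converged
            break
        centers = new_centers
    return assign(points, centers), centers

# Hypothetical rows of (calories consumed today, weight bench pressed)
measures = [(1800, 40), (1900, 45), (2000, 50), (2100, 55),
            (2900, 120), (3000, 125), (3100, 130), (3200, 135)]

results = {k: kmeans(measures, k) for k in (2, 3, 4)}
for k, (labels, centers) in results.items():
    print(k, labels)
```

Ask for two classes and it returns two; ask for four and it returns at most four. Nothing in the output says which k is "right". That judgment has to come from outside the algorithm.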

So why, then, do people look at the output of unsupervised learning algorithms, and (sometimes) say, “The number of clusters/classes is too large. The algorithm is over-fitting,” or “I think these classes are right in number, but they don’t look right”? Why do they express any dissatisfaction, since the algorithm always does what it was designed to do?

Because, as should now be obvious, these folks are using higher-order criteria to judge the algorithms. Meaning the “learning” was supervised after all, but that the supervision wasn’t done inside the computer. Which proves the boundaries of the computing machine/algorithm are artificial, if you like. Or, since part of the algorithm is sans mathematics, not all aspects of the algorithm are quantified or quantifiable, which is also not surprising to long-time readers.

It turns out the algorithms are always set up in such a way that there was prior knowledge of the targets; i.e., supervision. (And given many of these algorithms are used in computer vision applications, the pun is apt.) The picture which heads this post (taken from the clustering/k-means link above) proves the point. Supervision is always there. And it’s always there because we’re always looking toward what we either know or suspect is true.

Technical Notes: All these clustering algorithms are based on two notions: the number of clusters/classes and the concept of variation within and between classes. Something has to guess how many, using some prior criterion, and something has to say what it means to vary and how that variation is measured, also pre-specified. Changing these two notions gives a different algorithm.
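A small sketch of the second notion, under invented numbers: hold the cluster centers fixed and change only what "variation" means, here Euclidean versus Manhattan distance. The centers, the point, and the helper names are hypothetical, chosen so that the arithmetic is easy to check by hand.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_center(point, centers, distance):
    """Index of the center closest to the point under the supplied distance."""
    return min(range(len(centers)), key=lambda j: distance(point, centers[j]))

centers = [(3, 3), (5, 0)]
point = (0, 0)

by_euclid = nearest_center(point, centers, euclidean)     # 0: sqrt(18) is about 4.24, less than 5
by_manhattan = nearest_center(point, centers, manhattan)  # 1: 5 is less than 6
print(by_euclid, by_manhattan)  # 0 1
```

Same point, same centers, different class. Each assignment is correct conditional on the measure of variation that was pre-specified.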

There is no such thing as random, and so pieces of algorithms that are said to do this or that “randomly” are always deterministic after all, but with an eye closed to the determinism.
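For the pseudo-random number generators that software actually uses, this is easy to demonstrate. The sketch below uses Python's standard `random` module; the helper name `random_start` and the seed are hypothetical. It draws "random" starting points twice from the same seed.

```python
import random

def random_start(points, k, seed):
    """Pick k 'random' starting points; fully determined by the seed."""
    return random.Random(seed).sample(points, k)

points = list(range(100))
first = random_start(points, 3, seed=2024)
second = random_start(points, 3, seed=2024)

print(first == second)  # True
```

Same seed, same "random" choice, every time. The determinism was there all along; leaving the seed unspecified only hides it.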

Of course, since the “learning” (the high falutin’ term for estimating parameters in a model) is always supervised, and the problems to which these models are put are important (automatic classification, say), finding better algorithms is just the right thing to do. Obviously, the better we are at knowing the cause or measuring the determinants of classes, the better these algorithms will be. So our ultimate goal, just as in statistical modeling, is always the same: understanding cause.


  1. Now this is obviously true on a fundamental level. However, there is an actual difference between these two kinds of model fitting in the practical approach of the user. When you do supervised learning, you typically apply some generic model, and you try to make it predict new observations well without caring too much about a compact or intelligible mathematical description of the fitted model. In an unsupervised setting, you typically devise your model with a compact description in mind, let’s say a handful of principles, and see how well these principles can be leveraged to predict new observations compared to some other principles. But I agree that it’s nice to know that all modelling is modelling; you just input different kinds of information at different stages of the process.

  2. Yes, if you are attempting to classify using specific class designations, you must supervise the learning.

    Truly unsupervised learning most probably would arrive at different classifications. A 10×10 Kohonen network (self organizing map) would find 100 classes. The classes are determined by nearest neighbor association. Whether this is useful is determined by the application. These may be applied as inputs to a supervised learning algorithm.

    See the R Kohonen package and the white papers by Wehrens for examples.

    Here’s one: https://www.jstatsoft.org/article/view/v021i05/v21i05.pdf
