Chris Anderson, over at Wired magazine, has written an article called The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.
Anderson, whose thesis is that we no longer need to think because computers filled with petabytes of data will do that for us, doesn’t appear to be arguing seriously; he’s merely jerking people’s chains to see if he can get a rise out of them. It worked in my case.
Most of the paper was written, I am supposing, with the assistance of Google’s PR department. For example:
Google’s founding philosophy is that we don’t know why this page is better than that one: If the statistics of incoming links say it is, that’s good enough. No semantic or causal analysis is required.
He also quotes Peter Norvig, Google’s research director, who said, “All models are wrong, and increasingly you can succeed without them.”
Lastly,
The scientific method is built around testable hypotheses…. The models are then tested, and experiments confirm or falsify theoretical models of how the world works… But faced with massive data, this approach to science – hypothesize, model, test – is becoming obsolete.
Part of what is wrong with this argument is a simple misconception of what the word “model” means. Google’s use of page links as indicators of popularity is a model. Somebody thought of it, tested it, and found it made reasonable predictions (as judged by us visitors who repeatedly return to Google because we find its link suggestions useful), and so it became ensconced as the backbone of Google’s rating model. It did not spring into existence simply by collecting a massive amount of data. A human still had to interact with that data and make sense of it.
Norvig’s statement, which is false, is typical of the sort of hyperbole commonly found among computer scientists. Whatever they are currently working on is just what is needed to save the world. For example, probability theory was relabeled “fuzzy logic” when computer scientists discovered that some things are more certain than others, and nonlinear regression was recast as mysterious “neural networks,” which aren’t merely “fit” with data, as statistical models are; instead they “learn” (cue the spooky music).
I will admit, though, that their marketing department is the best among the sciences. “Fuzzy logic” is absolutely a cool-sounding name, one which beats the hell out of anything other fields have come up with. But maybe they do too well, because computer scientists often fall into the trap of believing their own press. They seem to believe, along with most civilians, that because a prediction is made by a computer it is somehow better than if some guy made it. They are always forgetting that some guy had to first tell the computer what to say.
Telling the computer what to say, my dear readers, is called—drum roll—modeling. In other words, you cannot mix together data to find unknown relationships without creating some sort of scheme or algorithm, which are just fancy names for models.
Very well: there will always be models, and some will be useful. But blind reliance on “sophisticated and powerful” algorithms is certain to lead to trouble. These models are based upon classical statistical methods, like correlation (not always linear), for which it is easy to show that spurious relationships become certain to appear as the size of the data grows. The number of these false signals also grows at a fast clip. In other words, the more data you have, the easier it becomes to fool yourself.
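For the skeptical, here is a rough sketch of the point about false signals. It assumes one particular reading of “size of the data,” namely the number of variables being compared; the function names, the 50 observations, and the cutoff of 0.279 (roughly the two-sided 5% level for a correlation on 50 observations) are my choices for illustration, nothing more. Every column is pure noise, so any pair that clears the cutoff is spurious by construction.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_spurious(n_vars, n_obs=50, threshold=0.279, seed=0):
    """Count pairs of independent noise columns whose |r| exceeds threshold.

    The columns are unrelated by construction, so every hit is a false signal.
    The threshold is an illustrative assumption (about the 5% level for n_obs=50).
    """
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    hits = 0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            if abs(pearson(columns[i], columns[j])) > threshold:
                hits += 1
    return hits


if __name__ == "__main__":
    for n_vars in (10, 20, 40, 80):
        pairs = n_vars * (n_vars - 1) // 2
        print(n_vars, "variables,", pairs, "pairs,",
              count_spurious(n_vars), "spurious correlations")
```

The number of pairs goes up with the square of the number of variables, so at a fixed cutoff the count of “discoveries” in pure noise should climb at about the same rate. No algorithm was fooled here, only the person who reads the output without a model of where the data came from.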
Modern statistical methods, no matter how clever the algorithm, will not bring salvation either. The simple fact is that increasing the size of the data increases the chance of making a mistake. No matter what, then, a human will always have to judge the result, not only in and of itself, but also by how it fits in with what is known in other areas.
Incidentally, Anderson begins his article with the hackneyed, and false, paraphrase of George Box: “All models are wrong, but some are useful.” It is easy to see that this statement is false. Suppose I give you only this evidence: I will throw a die which has six sides, just one of which is labeled ‘6’. Given that evidence, the probability I see a ‘6’ is 1/6. That probability is a model of the outcome. Further, it is the correct model.