Does the Data Deluge Make the Scientific Method Obsolete?

The End of Theory: The Data Deluge Makes the Scientific Method Obsolete by Chris Anderson

“All models are wrong, but some are useful.”

So proclaimed statistician George Box 30 years ago, and he was right. But what choice did we have? Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us. Until now. Today companies like Google, which have grown up in an era of massively abundant data, don’t have to settle for wrong models. Indeed, they don’t have to settle for models at all.

Speaking at the O’Reilly Emerging Technology Conference this past March, Peter Norvig, Google’s research director, offered an update to George Box’s maxim: “All models are wrong, and increasingly you can succeed without them.”

There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

See the update below: Norvig was misquoted, and he agrees with Box’s maxim.

I must say I am not at all convinced that a new method without theory is ready to supplant the existing scientific method. I can’t find Peter Norvig’s exact words online (come on Google – organize all the world’s information for me please). If he said that using massive stores of data to make discoveries in new ways is radically changing how we can learn and create useful systems, that I believe. I do enjoy the idea of trying radical new ways of viewing what is possible.

Practice Makes Perfect: How Billions of Examples Lead to Better Models (summary of his talk on the conference web site):

In this talk we will see that a computer might not learn in the same way that a person does, but it can use massive amounts of data to perform selected tasks very well. We will see that a computer can correct spelling mistakes, translate from Arabic to English, and recognize celebrity faces about as well as an average human—and can do it all by learning from examples rather than by relying on programming.
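As a rough illustration of what "learning from examples rather than by relying on programming" can mean for spelling correction, here is a minimal sketch. It is not Google's method. The tiny corpus and the names `CORPUS`, `WORD_COUNTS`, `edits1` and `correct` are all made up for this example. The only "model" is a table of word counts, and a misspelling is corrected to the most frequent known word one edit away.

```python
from collections import Counter
import re

# Tiny made-up corpus standing in for "billions of examples" (an assumption).
CORPUS = """
the scientific method uses theory and data
the data deluge changes how we use data
a model learned from data can correct spelling
theory and data together improve the model
"""

# The "model" is nothing more than word counts observed in the examples.
WORD_COUNTS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one edit (delete, swap, replace, insert) from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)


def correct(word):
    """Pick the most frequent known word within one edit, else return the input."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word
    return max(candidates, key=lambda w: WORD_COUNTS[w])


for typo in ["dta", "theroy", "modle"]:
    print(typo, "->", correct(typo))
```

With this corpus the three typos come back as "data", "theory" and "model". Nothing in the code encodes a spelling rule, so a larger corpus changes the answers without any change to the program. The choices of one-edit candidates and frequency ranking are still modeling assumptions, though.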

Related: Will the Data Deluge Makes the Scientific Method Obsolete? | Pragmatism and Management Knowledge | Data Based Decision Making at Google | Seeing Patterns Where None Exists | Manage what you can’t measure | Data Based Blathering | Understanding Data | Webcast on Google Innovation

The Google Way of Science by Kevin Kelly

My guess is that this emerging method will be one additional tool in the evolution of the scientific method. It will not replace any current methods (sorry, no end of science!) but will complement established theory-driven science.

Google lets us look into its search and translation technology

Yesterday, Google’s director of research Peter Norvig let visitors at the Emerging Technology conference in San Diego look into the technology that his firm uses in search and translation functions. As Norvig put it, a lot of the time Google does not rely on complex models and theories, but simply on large amounts of data.

One example is Google’s translation function, which allows Chinese texts to be translated into English, for example. In Chinese, multiple symbols that mean something on their own can also be combined to create a single word. Google segments Chinese texts by comparing a large number of Chinese and English versions of the same content to increase the probability that the Chinese characters will match the English words.

First sentence, September 2008:

Google’s mission is to organize the world’s information and make it universally accessible and useful.

Update: Peter Norvig has posted a correction. He did not say what was quoted:

I recently had a run-in with the fact-checkers for Wired magazine. They wrote and asked me:

Is it true that at your ETech presentation in March, you said, in a direct homage to George Box, “All models are wrong, and you don’t need them anyway”? Is that accurate?

Great, I thought–Wired is a publication with integrity and wants to get the facts right. I wrote back:

The quote I used was “essentially all models are wrong, but some are useful”.

The point I was making — and I don’t remember the exact words — was that if the model is going to be wrong anyway, why not see if you can get the computer to quickly learn a model from the data, rather than have a human laboriously derive a model from a lot of thought.

I figured they would either use the quote I gave them, paraphrase it, or drop it completely if it didn’t fit with the point of the story. But when Chris Anderson’s story The End of Theory: The Data Deluge Makes the Scientific Method Obsolete came out in June 2008, there was a fourth possibility that I hadn’t even counted upon: they attributed to me a made-up quote that actually contradicts the reply I gave them:

Peter Norvig, Google’s research director, offered an update to George Box’s maxim: “All models are wrong, and increasingly you can succeed without them.”

To set the record straight: That’s a silly statement, I didn’t say it, and I disagree with it.

The ironic thing is that even the article’s author, Chris Anderson, doesn’t believe the idea. I saw him later that summer at Google and asked him about the article, and he said “I was going for a reaction.” That is, he was being provocative, presenting a caricature of an idea, even though he knew the idea was not really true. That’s a mode I expect from other publications, but it’s not what I want from Wired, and I don’t expect Wired to make up facts to support their caricature.

7 thoughts on “Does the Data Deluge Make the Scientific Method Obsolete?”

  1. I have been asking myself this same question for several years. I am of the opinion that the data deluge will not supplant the scientific method. I have three reasons for this.

    First, increased computing power does not mean that you can accurately model all events; you can just look at a larger subset of the possible data. For instance, we still could not use Google’s computing power to model chemical reactions from quantum physics. What we can do is use the greater processing and data retrieval capabilities to look for low-probability discrepancies between data and theory.

    Second, the scientific method is more than just a data-analysis method. It is a way of carefully answering questions with the right data. Simply having more data does not make for better answers. Indeed, more data, while having the potential to provide better estimates, also offers the potential to lead researchers off into the weeds.

    Lastly, theory is a means to predict future events. More generally, science is a tool to improve understanding. Simply being able to model past events given some data set does not provide greater understanding. This is, of course, part of the contention: if our data set is big enough, what need do we have of understanding? My experience is that theory leads to new questions, which in turn lead to a deeper and more accurate understanding. More data can provide better models, but not deeper understanding, and generally does not lead to new questions. It’s the unexpected–the differences between expectations driven by theory and models derived from data–that leads to advances in knowledge.

  2. If data alone were sufficient, then our financial markets would not be crashing right now. If you don’t have an explicit model, then the data you decide to collect is your implicit model. There are no guarantees of sufficiency, nor of predictability in the collected data.
