Out for a run the other day along a stretch of busy road, I saw a "for sale" sign on a house. That got me wondering about the downsides of living on a busy road, and then wondering further: do people who live on busy roads stay in their houses for shorter periods than people on quiet streets? Is real estate turnover higher on busy roads?
That should be an easy question to answer in today's big data world. You would just get a computer to sort houses by the amount of traffic on the streets where they sit (Google Maps surely has that data), and then correlate that with the average time between sales of each house.
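To make the idea concrete, here is a minimal sketch in Python of what such an analysis might look like. Everything in it is an assumption for illustration: the `SALES` records are invented, the field layout is hypothetical, and a plain Pearson correlation stands in for whatever method a real analyst would choose. It shows the shape of the computation, not a finding about real houses.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical, made-up records: (house_id, cars_per_day_on_street, sale_year).
# Real traffic and sales data would have to come from actual sources.
SALES = [
    ("a", 12000, 2001), ("a", 12000, 2005), ("a", 12000, 2010),
    ("b", 300, 1995), ("b", 300, 2012),
    ("c", 8000, 2000), ("c", 8000, 2006),
    ("d", 900, 1998), ("d", 900, 2011),
]

def mean_years_between_sales(sales):
    """Return {house_id: (traffic, mean gap in years)} for houses sold at least twice."""
    years = defaultdict(list)
    traffic = {}
    for house, cars_per_day, year in sales:
        years[house].append(year)
        traffic[house] = cars_per_day
    result = {}
    for house, sale_years in years.items():
        sale_years.sort()
        if len(sale_years) < 2:
            continue  # a single sale tells us nothing about turnover
        gaps = [later - earlier for earlier, later in zip(sale_years, sale_years[1:])]
        result[house] = (traffic[house], sum(gaps) / len(gaps))
    return result

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

per_house = mean_years_between_sales(SALES)
traffic_levels = [t for t, _ in per_house.values()]
tenures = [gap for _, gap in per_house.values()]
r = pearson(traffic_levels, tenures)
print(f"correlation between traffic and years between sales: {r:.2f}")
```

Because the toy data was constructed so that busier streets have shorter gaps between sales, the correlation comes out negative. That says nothing about the real world; it only shows that the computation itself is the easy part.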
But then it occurred to me: who would actually perform the analysis to answer that question?
After all, if the answer to my question were yes, the real estate industry probably wouldn't particularly want that fact known, so it's not in their interest to explore the question. And I'm not sure who else would care to look into it.
This line of thinking is just a reminder of an important point: the exploration of data is not a neutral or unguided process. There's a lot of talk about big data these days and its potential to help humanity, but the seeking of insight and meaning and advantage from large-scale analytics will be directed at answers to the questions certain parties have an interest in asking. Where an especially interesting data set is open, we can imagine that many people will explore it, that the “million eyeballs” principle will hold, and that the data space will be scrutinized evenly in every possible way. That may be true for some data sets, but others may only be investigated by those with a particular interest in looking for something, and perhaps only by those with the means of hiring data scientists to look for answers. And in many other cases, where data is proprietary, the parties that have access to or control over the data will be the only ones asking the questions. In some cases, no doubt, even where questions are asked, the answers will never see the light of day if those who paid for the research don’t like what they get.
This dynamic will have implications not only for privacy, but for how big data is used across the board.
Let me be clear that I know little about the real estate industry, and my assumptions may be completely wrong in the specific line of thinking above that sparked this post. But I think the larger point is valid.
Others have sought to puncture utopian views of how big data will provide new knowledge and insight. Tim Harford, for example, cites problems including the difference between causation and correlation, the fact that correlations can shift, sample bias in the data, high levels of false positives, and something called the “multiple-comparisons problem,” which essentially says that if you explore a large enough data set looking for enough correlations, you will eventually find one that is a spurious statistical quirk. I think what we might call “differential exploration” should be on any such list.
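The multiple-comparisons problem is easy to see in a small simulation. The sketch below is my own illustration, not anything taken from Harford: it generates one column of pure random noise, compares it against a thousand other columns of pure random noise, and reports the strongest correlation it stumbles on. The function name, sample sizes, and seed are all arbitrary assumptions.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def strongest_chance_correlation(n_points=20, n_candidates=1000, seed=0):
    """Correlate one noise column against many others; return the strongest r found."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        r = pearson(target, candidate)
        if abs(r) > abs(best):
            best = r
    return best

best = strongest_chance_correlation()
print(f"strongest correlation found among pure noise: {best:.2f}")
```

None of these columns has any real relationship to the target, yet with enough candidates the best match looks impressively strong. Someone who reported only that one match, without mentioning the other 999 comparisons, would be presenting a statistical quirk as a discovery.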
In some ways the problem I’m thinking about resembles problems we already see in other areas of science. Some have complained, for example, that our for-profit pharmaceutical industry tends to research medical conditions not in proportion to their importance, but instead according to their profitability. That can mean they target their research budgets towards treatments for minor but lucrative cosmetic problems rather than devastating diseases. It’s likely that the mining of big data sets for insights, like medical research, will not be carried out evenly, but will proceed on some fronts much more vigorously than others, and not necessarily according to criteria that are best for everyone.