In Data, Precise Is Not Always Accurate

I believe that Big Data has yet to fulfill its promise of providing a clear and consistent business advantage and that to achieve this, the industry must evolve from Big Data to Smart Data -- data that drives business value. To ensure that data drives business value, the data being integrated must be accurate.

How do you measure accuracy?

Accuracy, when measured, is often confused with precision. Some, for example, might argue that a consensus approach that polls data providers is the best way to determine the accuracy of data. This approach essentially claims that the attributes with the highest consensus across data providers are the most accurate. We categorically reject this view, as it equates agreement -- or precision -- with accuracy. In fact, checking multiple data sources against one another may create a kind of confirmation bias (the tendency for people to believe data that supports what they already believe to be true).



This can happen when multiple vendors use the same technique to collect data. Take income data, for example. If several data vendors each assign the mean income for a ZIP code to every household in it, they will all agree on the figure for a given household, yet that figure will be wrong for most households. And interestingly, if five data vendors applied the ZIP code mean to a specific household and one supplied that household's actual income, the actual income would be discounted as inaccurate.
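The income example can be made concrete with a small sketch. Everything in it is an assumption for illustration: the household incomes are invented, "consensus" is modeled as simply taking the most frequently reported value, and the 10% tolerance is arbitrary. It is not a description of how any real vendor or consensus scoring system works.

```python
from collections import Counter
from statistics import mean

# Hypothetical household incomes for one ZIP code (invented numbers).
zip_incomes = [30_000, 40_000, 50_000, 60_000, 220_000]
zip_mean = mean(zip_incomes)

def consensus_value(reports):
    """Return the value reported by the largest number of vendors."""
    value, _count = Counter(reports).most_common(1)[0]
    return value

# Five vendors assign the ZIP mean to one household; a sixth reports
# that household's actual income.
actual_income = 40_000
reports = [zip_mean] * 5 + [actual_income]

chosen = consensus_value(reports)
consensus_error = abs(chosen - actual_income)

# Share of households whose true income is within 10% of the ZIP mean.
share_close = sum(
    abs(income - zip_mean) <= 0.10 * zip_mean for income in zip_incomes
) / len(zip_incomes)

print("consensus value:", chosen)
print("error of consensus value for this household:", consensus_error)
print("share of households within 10% of the ZIP mean:", share_close)
```

With these made-up figures, the five vendors agree perfectly with one another, the one correct report is outvoted, and the agreed value is not close to the true income of any household in the ZIP code.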

Correlated error across data sources is often misinterpreted as accuracy. That is, agreement among sources doesn't necessarily make them right. It may just be that many data sources are wrong for the same reason -- and that is why the consensus approach breaks down.

As you can see, precision is not accuracy.

This point was illustrated a few weeks ago in the Canadian provincial election in British Columbia, when all of the polls incorrectly called the final result.

For weeks, pollsters predicted that Adrian Dix and the British Columbia New Democratic Party had a lead of up to 20 points, and still 8 points on election eve, yet Christy Clark and the British Columbia Liberals won on Election Day. What happened?

According to the fine print of one leading pollster’s final poll: “These data were statistically weighted to ensure the sample’s regional and age/sex composition reflects that of the actual B.C. (British Columbia) population according to 2011 Census data.”

Although the weightings ensured that the sample precisely reflected the demographic breakdown of the population of British Columbia, they did not reflect the population that actually voted.
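A rough sketch shows how weighting to the wrong population can move an estimate. The group names, support levels, and population shares below are all invented, the race is reduced to support for a single hypothetical "Party A," and the turnout skew toward older voters is an assumption made for illustration. None of it is taken from the British Columbia polls.

```python
# Hypothetical poll result: share supporting Party A within each age group.
support_a = {"18-34": 0.65, "35-54": 0.52, "55+": 0.42}

# Each group's share of the general population (the census weighting target).
census_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

# Each group's share of people who actually vote (assumed turnout skew).
voter_share = {"18-34": 0.15, "35-54": 0.35, "55+": 0.50}

def weighted_support(support, shares):
    """Overall support when each group counts in proportion to its share."""
    return sum(support[group] * shares[group] for group in support)

census_estimate = weighted_support(support_a, census_share)
voter_estimate = weighted_support(support_a, voter_share)

print(f"weighted to census population: {census_estimate:.1%}")
print(f"weighted to actual voters:     {voter_estimate:.1%}")
```

Under these assumptions the same raw responses put Party A above 50% when weighted to the census population and below 50% when weighted to the people who turn out. The weighting is carried out precisely in both cases; only one of the two targets matches the population being predicted.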

The pollsters' predictions were very precise, to the extent that they all predicted a New Democratic Party victory, but, much to the NDP's chagrin, they were inaccurate.

So to summarize, accuracy matters! Everyone in our business needs to understand what it is and what it is not. And those who fail to give it the attention it deserves do so at their own risk!
