
Big Data Adds Complexity, Nuance to the Data Quality Equation

There’s an old saying in the data management world: garbage in, garbage out, or GIGO. It means that the results of any data analysis project are only as good as the quality of the data being analyzed. Data quality is of critical importance when data sets are relatively small and structured. If you only have a small sample of data on which to perform your analysis, it better be good data. Otherwise, the resulting insights aren’t insights at all.

But does GIGO apply in Big Data scenarios? From a purely practical standpoint, is it realistic to cleanse and scrub data sets that reach the hundreds of terabytes to petabytes level to the same degree as smaller data sets? And since most Big Data lacks traditional structure, just what does data quality look like?

Consider that many data quality issues in smaller, structured data sets are man-made. A call center representative inputs the wrong digits. A customer selects the wrong option from a drop-down menu. These errors can be fixed fairly easily, if they’re caught. But most data in Big Data scenarios is machine-generated, such as log files, GPS data or click-through data. If a piece of industrial equipment starts stamping streaming log files with incorrect dates and times, for example, the problem will quickly multiply. And retroactively applying the correct date and time to each log file – little more than a string of digits and dashes – will be a daunting, if not futile, task.
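
To make that concrete, here is a minimal sketch of the kind of sanity check that can flag a drifting or stuck device clock as records are ingested, before bad timestamps propagate downstream. The log format, field names and one-day tolerance window are all assumptions for illustration, not a reference to any particular product:

```python
from datetime import datetime, timedelta

# Hypothetical pipe-delimited log line: "2013-07-16T14:02:11|device-42|temp=71.3"
EXPECTED_WINDOW = timedelta(days=1)  # how far a timestamp may stray from ingest time

def flag_bad_timestamps(lines, ingest_time):
    """Yield (line_number, line, reason) for records whose timestamps look wrong."""
    prev_ts = None
    for i, line in enumerate(lines, start=1):
        stamp = line.split("|", 1)[0]
        try:
            ts = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            yield i, line, "unparseable timestamp"
            continue
        if abs(ingest_time - ts) > EXPECTED_WINDOW:
            yield i, line, "timestamp far outside ingest window"
        elif prev_ts is not None and ts < prev_ts:
            yield i, line, "timestamp went backwards"
        prev_ts = ts

# Example: a device whose clock reset to its factory default mid-stream.
logs = [
    "2013-07-16T14:02:11|device-42|temp=71.3",
    "2000-01-01T00:00:00|device-42|temp=71.4",  # clock reset -- gets flagged
]
for lineno, line, reason in flag_bad_timestamps(logs, datetime(2013, 7, 16, 14, 5)):
    print(lineno, reason)
```

Catching the problem as records stream in is far cheaper than trying to repair timestamps across petabytes of history after the fact.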

Further, Big Data evangelists maintain that the sheer volume of data in Big Data scenarios mitigates the effects of occasional poor data quality. If you’re exploring petabytes of data to identify historical trends, a few data input errors will barely register as a blip on a dashboard or report. Is it even worth the time and effort, then, to apply data quality measures in such a scenario? Probably not.
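
Some back-of-the-envelope arithmetic illustrates the dilution effect. The figures below are purely hypothetical, but they show how a handful of corrupt values vanishes inside a large aggregate:

```python
# Toy arithmetic (assumed numbers): averaging 100 million sensor readings with a
# true mean of 50.0, after 25 wildly wrong values (9,999.0 each) slip through.
n_good = 100_000_000
true_mean = 50.0
bad_values = [9_999.0] * 25

skewed_mean = (true_mean * n_good + sum(bad_values)) / (n_good + len(bad_values))
print(f"clean mean: {true_mean:.4f}   with bad rows: {skewed_mean:.4f}")
# Prints: clean mean: 50.0000   with bad rows: 50.0025 -- barely a blip on a dashboard.
```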

But that doesn’t mean data quality isn’t important to Big Data. This is particularly true in real-time transactional scenarios. Big Data applications that recommend medicines and doses for critically ill patients, for one, better be relying on good data. Same goes for Big Data operational applications that support commercial aviation, the power grid and other Industrial Internet use cases.

There are no easy answers to these questions, but it’s clearly important for practitioners to understand the source and structure of the data in question, as well as the data quality requirements of a given Big Data use case, in order to determine the level and type of data quality tools and measures to apply. There’s also the human element to consider. Someone needs to “own” data quality for Big Data projects; otherwise it can easily be overlooked.

We will be exploring these and other topics, including the role of the Chief Data Officer in the enterprise, all day tomorrow at the 2013 MIT CDOIQ Symposium in Cambridge, Mass. Tune in at Wikibon.org/blog or SiliconANGLE.com starting at 10:30 am EST, and join the conversation on Twitter using the hashtags #theCUBE and #MITIQ.
