
Getting Ready for Big Data 3.0 and the IoT

Premise

  • The analytic data pipeline was stable for decades, providing business intelligence in the form of historical performance reporting from enterprise applications
  • We are in the midst of a transition toward a near real-time “convergence” of analytics within Systems of Intelligence
  • The emerging analytic pipeline, which we call Big Data 3.0, will collect data from smart, connected products (IoT) and optimize their performance as part of a larger system
  • There is a major intermediate step in this transition that can provide “training wheels” for IT leaders and practitioners: managing the operation of their modern apps through the analysis of their log data

 

The emerging analytic data pipeline

Sometimes the best way to look forward is to look back and see if there are hints about the future by connecting the dots between past and present.  If we look at the evolution of the analytic data pipeline, three key directions are emerging.  Harnessing them fully in the future requires aligning investment in human and financial capital now.  They are:

  • Fully leverage the decline in the cost of capturing and storing data from $700M/TB 30 years ago to roughly $50/TB today

Covered in Part 2:

  • Deliver near real-time responsiveness between capturing data and driving an action
  • Build towards “converged” analytics, which enables any type of analytics on any type of data

 

Fully leveraging the decline in the cost of storing (and processing) data from $700M/TB 30 years ago to roughly $50/TB today

If the data volume for any workload can grow by drawing on an ever-growing number of sources, the default database choice should be one that runs on commodity clusters

The widely cited study claiming that the world’s digital data is growing 40% per year is wrong: the number conveys neither the real growth nor the urgency of adopting the infrastructure and skills needed to build a new generation of analytic data pipelines.

Unlike the cost of storing it, the supply of data is essentially limitless.  Traditional applications captured all their data through human data entry, a cost that has stayed roughly constant at about $1bn/TB.  But almost all information from all sources is now generated in digital form, at essentially zero marginal cost.
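To put those figures in perspective (taking the $700M/TB, $50/TB, and roughly $1bn/TB numbers above at face value over a 30-year interval), the implied compound decline in storage cost works out to roughly 40% per year, while the cost of human data entry has barely moved.  A quick back-of-the-envelope check:

```python
# Back-of-the-envelope arithmetic using the figures cited above, taken at face value.
storage_cost_then = 700_000_000    # $/TB, roughly 30 years ago
storage_cost_now = 50              # $/TB today
human_entry_cost = 1_000_000_000   # $/TB, human data entry (roughly constant)
years = 30

total_decline = storage_cost_then / storage_cost_now                  # ~14,000,000x
annual_decline = 1 - (storage_cost_now / storage_cost_then) ** (1 / years)

print(f"Storage cost decline: {total_decline:,.0f}x over {years} years")
print(f"Implied compound annual decline: {annual_decline:.0%}")       # roughly 42%/year
print(f"Human entry vs. storage today: {human_entry_cost / storage_cost_now:,.0f}x gap")
```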

Harnessing as much of that new data as possible starts with capturing the log data from applications.  Mainstream database technology has traditionally bottlenecked on expensive shared storage in the form of SAN or NAS appliances.  Learning how to capture and process data on commodity clusters is therefore critical, and the new event log data being collected makes that much easier than traditional business application data does.
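As a concrete, simplified illustration of what capturing application log data involves, the sketch below parses a hypothetical web-service log line into a structured event.  The log format, field names, and values are assumptions chosen for illustration, not a prescribed standard:

```python
import json
import re

# Hypothetical log line format, invented for illustration only.
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<host>\S+) (?P<service>\S+) "
    r"(?P<level>\S+) (?P<message>.*)"
)

def log_line_to_event(line: str) -> dict:
    """Turn one raw application log line into a structured event."""
    match = LOG_PATTERN.match(line)
    if match is None:
        # Keep unparseable lines as raw text rather than dropping data.
        return {"raw": line}
    return match.groupdict()

line = "2016-03-01T12:00:05Z web-07 checkout ERROR payment gateway timeout"
print(json.dumps(log_line_to_event(line)))
```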

Traditional business transactions frequently updated existing records, and supporting those updates was much easier when all the database processing nodes could share the same (scale-up, and expensive) storage.  Log events, by contrast, are all unique: each component or sensor emits one event at a time, each with a unique timestamp.  Events are therefore only ever appended or inserted into the database, which makes commodity clusters a much better fit; there is no need for different database nodes to update the same data at the same time.
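A minimal sketch of that append-only pattern, assuming a simple time-partitioned store of JSON-lines files (the directory layout and field names are illustrative): each writer only ever appends new events to the partition for the event’s hour, so nodes in a commodity cluster never contend to update the same record.

```python
import json
from datetime import datetime
from pathlib import Path

def append_event(base_dir: Path, event: dict) -> Path:
    """Append one event to its hourly partition; never update in place."""
    ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
    partition = base_dir / f"dt={ts:%Y-%m-%d}" / f"hour={ts:%H}"
    partition.mkdir(parents=True, exist_ok=True)
    out_file = partition / "events.jsonl"
    with out_file.open("a") as f:           # append-only: inserts, never updates
        f.write(json.dumps(event) + "\n")
    return out_file

append_event(Path("event_store"), {
    "timestamp": "2016-03-01T12:00:05Z",
    "sensor_id": "pump-17",                 # e.g. a smart, connected product
    "temperature_c": 81.4,
})
```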

The variety of “things” emitting events requires new ways of storing information: almost like ingesting everything into a Data Lake and then reorganizing it into a Teradata data warehouse

So many applications, and the services within them, are emitting events whose structure evolves along with the software that new database storage techniques are necessary.  JSON has emerged as the preferred way of representing this data: it has the flexibility to handle the variety of machine-generated information, and it is easy for developers to read when they are working with it.
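For example, two events from the same service can legitimately carry different fields as the software evolves, and JSON accommodates both without a schema change.  The events below are invented purely to illustrate that flexibility:

```python
import json

# Two machine-generated events whose structures have drifted apart over time;
# field names and values are invented for illustration.
events = [
    {"ts": "2016-03-01T12:00:05Z", "service": "checkout", "status": 500,
     "latency_ms": 2310},
    {"ts": "2016-03-01T12:00:06Z", "service": "checkout", "status": 200,
     "latency_ms": 87, "experiment": {"name": "new-cart", "bucket": "B"}},
]

for event in events:
    print(json.dumps(event))   # both serialize cleanly despite their different shapes
```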

Traditional business transactions come from the same forms, so common transactions can all be stored together in common tables.  JSON “documents” carry no such guarantee that they will all be alike.  So for a database to be as accessible as a traditional SQL database, it has to be much cleverer about organizing the data: under the covers it must take on more of the administrative work of tuning the physical layout of the data to meet the performance expectations of end users.
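A toy sketch of what that “cleverer organization” can look like, assuming the simplest possible approach: infer a union schema from a batch of heterogeneous JSON documents and flatten them into rows so they can be queried like a SQL table.  Real document databases do far more (indexing, statistics, physical layout tuning); this only illustrates the basic idea.

```python
# Illustrative only: infer a union "schema" from heterogeneous JSON documents
# and flatten them into rows so they can be queried like a SQL table.
docs = [
    {"ts": "2016-03-01T12:00:05Z", "service": "checkout", "status": 500},
    {"ts": "2016-03-01T12:00:06Z", "service": "search", "latency_ms": 44},
]

# The union of all fields seen across the documents becomes the column set.
columns = sorted({key for doc in docs for key in doc})

# Each document becomes a row; missing fields are padded with None (i.e. NULL).
rows = [[doc.get(col) for col in columns] for doc in docs]

print(columns)
for row in rows:
    print(row)
```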

(To be covered in Part 2 of two posts on this topic):

  • Deliver near real-time responsiveness between capturing data and driving an action
  • Build towards “converged” analytics, which enables any type of analytics on any type of data

 

Action Items

  • To get ready for the future of smart, connected products, practitioners must deal with the current equivalent of IoT: the portfolio of modern applications.
  • Properly collecting and analyzing their log data in order to manage them requires putting in place many of the skills, infrastructure, and processes necessary to support the IoT of tomorrow.
  • Running Hadoop on a public cloud is likely the quickest way to acquire the skills to manage elastic big data applications on shared infrastructure.
