Introduction to streaming data platforms

Streaming data platforms combine low-latency analysis of information with the equally important ability to integrate data across different sources

Data-driven strategies are becoming a greater part of an organization's DNA. Executive management is embracing how data can be used to create, sustain, and strengthen competitive advantage. Disruptive companies are building business models based on data that other organizations leave behind. Employees are growing into the role of data analysts as part of their day-to-day responsibilities, and companies are introducing new data sources to continue this trend.

Specifically, data-driven strategies can be seen in the way that organizations are taking advantage of modern technical architectures in mobility, cloud, and device sensors, and integrating that information into new ways of doing business. The use of location-based mobile apps, optimized supply chains for online retail applications, and the introduction of the Internet of Things have increased the focus on low-latency data collection, transformation, and analytics.

With the rising importance of data-driven organizations and the focus on low-latency decision making, the speed of analytics has increased almost as rapidly as the ability to collect information. This is where the world of streaming data platforms comes into play. These modern data management platforms bring together not just the low-latency analysis of information, but also the ability to integrate information from operational systems, mobile applications, databases, and the Internet of Things in real time or near real time.

The true key to streaming data platforms, and the applications they support, is the integration of technical and business data sources in real time. Without this level of streaming data acquisition and integration, the analytics behind data-driven strategies, and the business models built on those strategies, cannot deliver what business stakeholders expect: new business value and increased competitive advantage.

As the business models and strategies of data-driven organizations drive real-time applications, many technologists may ask the following question:

We can already perform analytics in near real-time, so why do we need a streaming data platform?

The simple answer is that the low latency that many streaming applications require, and that streaming data platforms provide, applies to the acquisition and integration of data sources, not simply to the processing of the data into analytics.

While some organizations can perform analytics in real time, their inability to collect and integrate that information at the same speed undermines the low-latency concept. For example, imagine you are an organization with a predictive maintenance application for your connected fleet vehicles. As part of your predictive maintenance app, you collect both vehicle usage information and geolocation information from the vehicles.

While it can be relatively easy to perform near real-time analytics once all the information from your connected fleet has landed in your data warehouse or data lake of choice, the real issue is getting the information to the point where you can do the analysis. A delay in data acquisition for a predictive maintenance application means that you find out about a problem only after your fleet vehicle has broken down by the side of the road (in a best-case scenario) or along a single-vehicle access "lane" such as a rail line, airport runway, or mining access road (in a worst-case scenario).

You need the information from and about your connected fleet as the data is created on the sensor-equipped vehicle, in time to take corrective action before a vehicle is disabled or before that disabled vehicle takes a greater toll on your business operations than the loss of a single truck, train, plane, or ship.
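To make the point concrete, here is a minimal sketch in Python that compares when an overheating warning becomes actionable under two acquisition styles: a periodic bulk load into a warehouse, and streaming acquisition. Everything in it is a hypothetical illustration rather than a measurement: the `Reading` fields, the one-hour load interval, the two-second transport delay, and the temperature threshold are all assumptions chosen for the example.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Reading:
    vehicle_id: str
    created_at: float      # seconds since the start of the shift (hypothetical clock)
    engine_temp_c: float   # hypothetical sensor field


def batch_arrival(created_at: float, load_interval_s: float) -> float:
    """A reading becomes visible at the next scheduled bulk load."""
    return math.ceil(created_at / load_interval_s) * load_interval_s


def streaming_arrival(created_at: float, transport_delay_s: float) -> float:
    """A reading becomes visible shortly after the sensor emits it."""
    return created_at + transport_delay_s


def detection_time(
    readings: Iterable[Reading],
    arrival: Callable[[float], float],
    threshold_c: float,
    analytics_delay_s: float,
) -> Optional[float]:
    """Time at which the first over-threshold reading has been acquired and analyzed."""
    for reading in sorted(readings, key=lambda r: r.created_at):
        if reading.engine_temp_c >= threshold_c:
            return arrival(reading.created_at) + analytics_delay_s
    return None  # no reading crossed the threshold


# Hypothetical readings from one truck; the third crosses the 110 C threshold.
readings = [
    Reading("truck-7", 60.0, 92.0),
    Reading("truck-7", 120.0, 104.0),
    Reading("truck-7", 180.0, 117.0),
]

# The analytics step is equally fast (1 second) in both cases.
batch = detection_time(readings, lambda t: batch_arrival(t, 3600.0), 110.0, 1.0)
stream = detection_time(readings, lambda t: streaming_arrival(t, 2.0), 110.0, 1.0)
print(batch, stream)  # 3601.0 183.0
```

The analytics delay is identical in both runs. Under these assumed numbers, the only thing that changes the outcome is how long the reading takes to arrive.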

The information needed to place this streaming data in context comes from various sources, including:

  • Event, product, and process stage reference data
  • Operational support applications (e.g., ERP, supply chain, customer care)
  • Curated business information in data warehouses and data marts

Much of this contextual information resides in traditional data management platforms, which operate at a different operational tempo and a slower access speed than the real-time applications and strategies of data-driven organizations require. This information can be accessed and matched with streaming event information using a technique referred to as change data capture (CDC).

CDC techniques have been used for years. CDC reads the transaction and event logs associated with relational database management platforms, or with the operational support applications themselves, and uses that information to either replicate or access data without disturbing the underlying data or interrupting the core operations of the operational support application. Streaming data platforms use new technical approaches to CDC to improve the speed and availability of the contextual information from those operational platforms in a streaming data platform environment.
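The idea can be sketched in a few lines of Python. This is an illustrative toy, not the mechanism of any particular CDC product: the `ChangeEvent` shape, the `lsn` log position, and the `ReferenceReplica` class are hypothetical names invented for the example. The sketch applies an ordered list of change records to a local replica, so the source system is never queried, and then uses that replica to add context to a streaming event.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One record read from a (hypothetical) transaction log."""
    lsn: int                               # position in the log, strictly increasing
    op: str                                # "insert", "update", or "delete"
    key: str                               # primary key of the changed row
    row: Optional[Dict[str, Any]] = None   # new row image; None for deletes


class ReferenceReplica:
    """Local copy of reference data, maintained only from change events."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, Any]] = {}
        self.last_lsn = 0

    def apply(self, change: ChangeEvent) -> None:
        if change.lsn <= self.last_lsn:
            return  # already applied, so replaying the log is harmless
        if change.op in ("insert", "update"):
            self.rows[change.key] = dict(change.row or {})
        elif change.op == "delete":
            self.rows.pop(change.key, None)
        else:
            raise ValueError(f"unknown operation: {change.op}")
        self.last_lsn = change.lsn

    def enrich(self, event: Dict[str, Any]) -> Dict[str, Any]:
        """Merge reference data into a streaming event; event fields win on conflict."""
        context = self.rows.get(event["vehicle_id"], {})
        return {**context, **event}


# Hypothetical change log captured from a maintenance system.
log = [
    ChangeEvent(1, "insert", "truck-7", {"depot": "Reno", "last_service": "2016-01-10"}),
    ChangeEvent(2, "update", "truck-7", {"depot": "Reno", "last_service": "2016-03-02"}),
    ChangeEvent(3, "insert", "truck-9", {"depot": "Elko", "last_service": "2016-02-01"}),
    ChangeEvent(4, "delete", "truck-9"),
]

replica = ReferenceReplica()
for change in log:
    replica.apply(change)

print(replica.enrich({"vehicle_id": "truck-7", "engine_temp_c": 118.0}))
```

Tracking the last applied log position is one simple way to make replays safe. A production system would also need to handle schema changes, initial snapshots, and durable storage of that position, all of which this sketch leaves out.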

Copyright © 2016 IDG Communications, Inc.