How to achieve large-scale IoT with relational databases

With these key optimizations, your favorite relational database can support the high connection volumes and high-speed ingestion rates of IoT workloads

Conventional wisdom states that relational databases are not scalable or robust enough to handle the huge numbers of connections, the massive throughput, and all the cool tricks required to master IoT (Internet of Things) applications. But I know this isn’t true, as Oracle customers have been doing it for years. How, exactly, is this done? 

First, let’s make sure we agree on the meaning of IoT. The official definition is the interconnection, via the Internet, of everyday objects, enabling them to send and receive data. IoT involves all sorts of devices, from watches, TVs, and phones to ovens, refrigerators, and gas meters made “smart” by being connected and sharing data. And at some point comes the need to gather all of this data for mining.

From a database point of view, organizations need to capture the huge volume of data sent from IoT devices, ingest it, and then mine it as the business requires. Ingestion and rapid analysis of this data are critical for a business to maintain its competitive edge in today’s 24/7, 100-percent-uptime business world. This can take many different forms, most of which will work just fine with an optimized relational database and hardware.

To SQL or NoSQL?

A common misconception (perpetuated by marketing campaigns) is that NoSQL databases are required for IoT workloads because the ingestion rate required exceeds the capabilities of traditional relational databases. This is not true, as evidenced by many of Oracle’s customers. We have found that an optimized relational database can ingest more in a single node of an Oracle RAC cluster than some entire NoSQL environments. Note that I said an optimized relational database. 

Overall, an IoT workload consists of a high volume of small, lightweight transactions, often including complex data types such as geospatial data and unstructured data (video or audio recordings). For example, companies that collect fitness data from wristbands use geospatial data (to record the route walked) along with the number of steps, heart rate, and so on.

Despite the recent hype, high-ingest workloads are not a new concept. For example, one of our telecom customers processes millions of call detail records per second, adding up to 18 terabytes of data per day.

Stock exchanges running on relational databases record 180,000 messages per second, adding up to 15 terabytes of data per day.

So how do they do this? The key requirements are scalability, flexibility, and the speed to provide real-time analytics. Using optimized hardware when sifting through unstructured data is also an important component for success.

IoT key 1: Database scalability

Scalability is the ability of a system to provide throughput in proportion to and limited only by available hardware resources. There are two different approaches to scaling out an Oracle database system: Real Application Clusters and sharding.

Real Application Clusters (RAC) enable any packaged or custom application to run, unchanged, across a server pool connected to shared storage. If a server in the pool fails, the database continues to run on the remaining servers. When more processing power is needed, you simply add another server to the pool—without incurring any downtime. 

Sharding allows a massive database to be partitioned into smaller databases, or shards, each of which can be stored on a different piece of hardware. Imagine a global retail company whose massive relational database, containing all of the available merchandise, is sharded: each shard of the database will contain a part of the catalog.

When a query comes in asking for a specific widget, the database server knows which shard those widget types are stored on and retrieves the data directly from there. An added benefit in this global economy is that it is possible to create a shard for each country and apply the local regulations to that specific shard. For example, Germany requires that all German data be stored physically within the country’s borders. This is easy to do with a shard.
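To make that concrete, here is a minimal sketch of what user-defined sharding might look like in Oracle SQL. The catalog_items table, its columns, and the tablespace names are hypothetical; the point is that each list partition (and therefore each country’s data) maps to its own shard, so the German partition can live on hardware physically located in Germany.

```sql
-- Hedged sketch of user-defined sharding: one list partition per country,
-- each mapped to a tablespace on a shard in that country. All names are
-- hypothetical.
CREATE SHARDED TABLE catalog_items
( item_id      NUMBER       NOT NULL,
  country_code VARCHAR2(2)  NOT NULL,
  name         VARCHAR2(200),
  price        NUMBER(10,2),
  CONSTRAINT catalog_items_pk PRIMARY KEY (item_id, country_code)
)
PARTITION BY LIST (country_code)
( PARTITION p_de VALUES ('DE') TABLESPACE ts_de,  -- German data stays in Germany
  PARTITION p_us VALUES ('US') TABLESPACE ts_us
);
```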

By now, most databases that operate at large scale provide sharding, including MongoDB, Microsoft SQL Server, and PostgreSQL. But it’s not enough to shard the data. Equally important is the ability to run cross-shard queries efficiently, which significantly reduces the time required to get a complete view of the business.

One key requirement of IoT is the ability to quickly ingest huge volumes of data. So, wherever the data is staged, the bandwidth to that storage will control the speed of the load much more than the database itself. After all, data can never be loaded faster than it can be read. If the network can’t read any faster than 200 megabytes per second, then the data can’t be loaded any faster than 200 megabytes per second.
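To put that in perspective, the telecom example above averages out to roughly the same figure: 18 terabytes per day is 18,000,000 megabytes spread over 86,400 seconds, or about 208 megabytes per second of sustained ingest. A 200MB/s pipe would already be the bottleneck before the database ever sees the data.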

Other important components of scalability are the use of array inserts (as opposed to single row inserts), parallel execution, and the use of external tables. 
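A hedged sketch of those three techniques in Oracle SQL follows. The readings table, the readings_ext external table, the iot_stage directory object, and readings.csv are all hypothetical, and a real file would likely need date format masks in the access parameters.

```sql
-- Target table for the ingested sensor data (hypothetical shape).
CREATE TABLE readings
( device_id  NUMBER,
  reading_ts TIMESTAMP,
  reading    NUMBER );

-- 1. External table: expose a staged file to SQL without loading it first.
CREATE TABLE readings_ext
( device_id  NUMBER,
  reading_ts TIMESTAMP,
  reading    NUMBER )
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY iot_stage   -- assumes this directory object exists
  ACCESS PARAMETERS ( RECORDS DELIMITED BY NEWLINE
                      FIELDS TERMINATED BY ',' )
  LOCATION ('readings.csv') );

-- 2. Parallel, direct-path load from the external table.
ALTER SESSION ENABLE PARALLEL DML;
INSERT /*+ APPEND PARALLEL(8) */ INTO readings
SELECT * FROM readings_ext;
COMMIT;

-- 3. Array insert from PL/SQL: many rows per round trip,
--    instead of one network round trip per single-row INSERT.
DECLARE
  TYPE reading_tab IS TABLE OF readings%ROWTYPE;
  buf reading_tab := reading_tab();
BEGIN
  buf.EXTEND;                          -- sample row; a real loader would
  buf(1).device_id  := 42;             -- fill the collection from the feed
  buf(1).reading_ts := SYSTIMESTAMP;
  buf(1).reading    := 98.6;
  FORALL i IN 1 .. buf.COUNT
    INSERT INTO readings VALUES buf(i);
  COMMIT;
END;
/
```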

IoT key 2: Database flexibility

Flexibility is the ability to quickly and easily adapt to changes. IoT is still in its infancy and new use cases come with each device. Being able to easily adapt to changes in data formats is critical. When faced with this requirement, developers often turn to JSON. JSON greatly simplifies app development because it allows for the same schema-less representation to be used in both applications and the database.
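For instance, here is a minimal sketch of the JSON approach in Oracle SQL; the events table and the payload fields are hypothetical. New attributes can show up in the payload at any time without an ALTER TABLE.

```sql
-- Store raw device payloads as JSON, validated by an IS JSON constraint.
CREATE TABLE events
( event_id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  payload  VARCHAR2(4000) CONSTRAINT events_json CHECK (payload IS JSON) );

INSERT INTO events (payload)
VALUES ('{"device":"wristband-17","steps":9214,"heartRate":72}');

-- JSON_VALUE extracts fields at query time, so the schema can evolve
-- in the application without changing the table definition.
SELECT JSON_VALUE(payload, '$.heartRate') AS heart_rate
FROM   events
WHERE  JSON_VALUE(payload, '$.device') = 'wristband-17';
```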

Partitioning is also key to flexibility at scale. Partitioning makes large volumes of data easier to manage, improves the overall performance, and is completely transparent to the applications. Another benefit of partitioning is the ability to store historical data in a separate partition, enabling the data to be compressed and moved to a lower tier of storage. That way, the data is available for data mining, but doesn’t slow down real-time queries. 
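Here is a hedged sketch of that pattern: a table interval-partitioned by day, with an aged partition compressed and moved to cheaper storage. Table, partition, and tablespace names are hypothetical, and the compression clause assumes the Advanced Compression option is licensed.

```sql
-- Interval (range) partitioning: a new partition per day, created
-- automatically as data arrives.
CREATE TABLE sensor_data
( device_id  NUMBER,
  reading_ts TIMESTAMP NOT NULL,
  reading    NUMBER )
PARTITION BY RANGE (reading_ts)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
( PARTITION p_start VALUES LESS THAN (TIMESTAMP '2018-01-01 00:00:00') );

-- Compress a historical partition and move it to a lower storage tier;
-- queries against current partitions are untouched.
ALTER TABLE sensor_data MOVE PARTITION p_start
  ROW STORE COMPRESS ADVANCED TABLESPACE cold_ts;
```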

IoT key 3: Real-time analytics

All of these tricks for speeding up data ingestion and data storage would be useless without timely analysis of the data, with the goal of affecting business outcomes. At the end of the day, the business must make use of the data, which means getting the data back out quickly.

For a global retail organization, one goal might be aggregating the products sold across all shards to determine the top 10 most popular products globally at any point in time.
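Against an Oracle sharded database, such a query is submitted through the shard catalog, which coordinates execution across all shards and combines the results. A hedged sketch, with a hypothetical orders table:

```sql
-- Cross-shard aggregation: the coordinator runs this on every shard
-- and merges the partial results.
SELECT product_id,
       SUM(quantity) AS units_sold
FROM   orders
GROUP  BY product_id
ORDER  BY units_sold DESC
FETCH FIRST 10 ROWS ONLY;   -- top 10 products across all shards
```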

For an international telecom company, this same database architecture might help in a couple of different ways. Most of their queries come from their customers, so they can use both sharding and partitioning to eliminate a large portion of the data store from query execution, achieving the speeds necessary to meet their customers’ expectations.

Internally, the business managers want to cut down churn and increase profitability, so they mine the data across shards and partitions, looking for patterns that allow them to determine which customers might be about to cancel their service or which additional products (e.g., an international data plan) they could proactively offer a customer.

For example, they might pull the billing history of customers who canceled and run analytics on it. What was happening just before they left? Are there any common patterns? Did a significant percentage exceed their data allowance and get three big bills in a row right before they switched to another provider?
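In SQL, that kind of pattern hunt might look like the following sketch, which flags customers whose three most recent bills each ran well over their plan. The billing_history table, its columns, and the 1.5x threshold are all hypothetical.

```sql
-- Find customers whose last three bills each exceeded 1.5x their plan:
-- likely churn candidates worth a proactive offer.
SELECT customer_id
FROM  ( SELECT customer_id,
               amount,
               plan_amount,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY bill_month DESC) AS rn
        FROM   billing_history )
WHERE  rn <= 3
GROUP  BY customer_id
HAVING COUNT(*) = 3     -- customer has at least three bills
   AND MIN(CASE WHEN amount > plan_amount * 1.5 THEN 1 ELSE 0 END) = 1;
```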

Once these patterns are identified, proactive steps can be taken to keep the customer’s business. For example, after a customer experiences two consecutive spikes in data overage charges, a representative could reach out to offer a different plan with more data and head off a cancellation.

To learn more about other ways to streamline databases, and about hardware optimized for IoT workloads, check out this white paper.

Maria Colgan is a master product manager at Oracle. Maria’s core responsibility is creating material and lectures on the Oracle Database and the best practices for incorporating it into your environments. She is also responsible for getting the feedback from our customers and partners incorporated into future releases of the product. Prior to this role, Maria was the product manager for Oracle Database In-Memory and the Oracle Database query optimizer. Maria is the primary author of the SQLMaria blog and a contributing author to the Oracle Optimizer blog.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2018 IDG Communications, Inc.