Data lakes 101: Come on in, the water's fine

How to plan for and build a central hub for data analytics with the ever-evolving Hadoop ecosystem

There is no denying that data is everywhere and there is a lot of it. Smartphones and other mobile devices are enabling companies to communicate directly with their customers and provide highly customized insights and services. Not only are the amounts of data increasing dramatically, but new types of data are being created at an accelerating rate. Much of this data is different both in structure and in substance, and it cannot easily be arranged into tables or transformed in a reasonable amount of time.

Meanwhile, data-driven business analytics often requires processing so intensive that even a few years ago it would have seemed inconceivable. Systems designed specifically for this type of processing do exist, but they are typically closed, proprietary architectures priced well out of reach of small and medium-sized organizations.

These challenges have given rise to a new open, distributed computing architecture typified by implementations such as Hadoop and Apache Spark. Increasingly, these technologies are being combined into a hub-and-spoke style of integration called a data lake. In this article, we will outline what a data lake is, as well as the tools and considerations involved in implementing one.

For a closer look at business use cases where companies have implemented Hadoop ecosystem tools to enable new kinds of analysis and improve their analytics workflows, please see our data lake whitepaper.

In a world that is becoming more data-driven and globally competitive, the success of a company will increasingly depend on investing in technologies that can create a 360-degree view of its business, designing a scalable infrastructure, and using analytics to drive better decision-making throughout the organization.

What is a data lake?

A data lake is an enterprise-wide system for storing and analyzing disparate sources of data in their native formats. A data lake might combine sensor data, social media data, click-streams, location data, log files, and much more with traditional data from existing RDBMSes. The goal is to break down the information silos in an enterprise by bringing all the data into a single place for analysis, without the up-front restrictions of schema, security, or authorization. Data lakes are designed to store vast amounts of data, even petabytes, in local or cloud-based clusters of commodity hardware.

In a more technical sense, a data lake is a set of tools for ingesting, transforming, storing, securing, recovering, accessing, and analyzing all the relevant data of a company. Which set of tools depends in part on the specific needs and requirements of your business.

Creating a plan

The choice of technologies to install will vary depending on the use cases and the data streams that need to be processed. Another consideration is whether to host the solution internally or to rely on a platform-as-a-service provider such as Amazon Web Services.

Regardless of the physical location of the cluster, Hadoop provides a broad range of tools with the flexibility of including some or all at any stage of the data lake’s development. Hadoop also allows the system to be horizontally scaled by adding or removing nodes without considerable time or effort. As a result, organizations can start small and grow the cluster as usage increases.

Use cases can be broadly categorized into different streams of processing like batch, streaming, and queries with fast read/write times. An appropriate tool set can be configured depending on the kind of processing that needs to take place.

Use case discovery

Defining use cases early helps immensely when implementing a data lake. Analyzing data without enough information about it leads to unwanted results and wasted engineering cycles. Not knowing how the data will be used, or who the target audience is, will likewise hurt the performance of the analytical processes built on top of the data lake, and those shortcomings become abundantly evident as usage of the data lake increases over time.

How do you identify a big data challenge? A good measure is to characterize the data in terms of the three V's: volume, variety, and velocity. Volume refers to the amount of data, usually on the scale of petabytes. Variety refers to the structure and sources of the data. Velocity indicates the speed at which the data is generated and needs to be processed.

Identifying the problem and standing up a big data architecture will not produce results by itself. The system must be properly implemented and optimized to provide repeatable results.

The goal of identifying a use case should not be to create an exhaustive list, but to define a small number of business-critical use cases so that the data lake can be designed and built to meet those needs. This will assist with cluster design and sizing, planning for future growth, and decisions around ingestion, as well as a comparative analysis with existing systems.

Building the data lake

For all the advantages of a central data lake, there are challenges in the building process that every enterprise should address, the most important being data management and security for sensitive data. Fortunately, Hadoop ecosystem tools built on top of the core frameworks make data lakes an enterprise-ready solution. With these common challenges accounted for, a data lake can be thought of as an enterprise data warehouse that does not require you to predefine metadata or a specific schema before ingesting data into the system.

The tools

Hadoop is not any one system or tool; it is an ever-evolving collection of tools. At the core of Hadoop are HDFS, a reliable, fault-tolerant, distributed file system, and MapReduce, its original processing engine. Newer releases of Hadoop are progressively moving away from MapReduce, a batch-oriented processing model, toward more responsive engines such as Apache Spark.
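To make that shift concrete, here is a minimal sketch of a Spark job reading a file stored in HDFS and computing word counts in parallel across the cluster. It assumes Spark is installed alongside HDFS; the log file path is hypothetical, and the job illustrates the programming model rather than any specific deployment.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, split

    spark = SparkSession.builder.appName("wordcount-on-hdfs").getOrCreate()

    # HDFS stores the file in distributed, replicated blocks; Spark processes
    # those blocks in parallel across the nodes of the cluster.
    lines = spark.read.text("hdfs:///data/raw/logs/app.log")  # hypothetical path

    counts = (lines
              .select(explode(split(col("value"), r"\s+")).alias("word"))
              .groupBy("word")
              .count()
              .orderBy(col("count").desc()))

    counts.show(20)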

The Hadoop ecosystem includes far more tools than can be covered here, but several of those most commonly found in production deployments, including Spark, Hive, Sqoop, and Flume, are discussed in the sections that follow.

In addition, production deployments of Hadoop usually include tools for data management, data security, and cluster management. Tools for these tasks are what differentiate the Hadoop offerings available on the market. For example, Cloudera offers a proprietary solution that includes Cloudera Manager, Navigator, and Sentry for cluster administration and security. Hortonworks offers open source tools like Ambari, Ranger, Falcon, and Atlas.

Building a data pipeline

Constructing a data lake requires transferring data into Hadoop clusters, then running the necessary processes to parse, cleanse, and prepare the data for analytics. Depending on the data producers, data can be either “pushed” or “pulled” into the cluster. To push data into the cluster, you could use command-line tools from a client machine, REST APIs, or a Hadoop ingestion tool such as Flume. Data can also be pulled from relational sources with tools like Sqoop. The choice of tools and strategies for data ingestion is determined by the nature of the data entering the cluster.
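As an illustration of the pull pattern, the sketch below uses Spark's JDBC reader to copy a relational table into HDFS. Sqoop is the more common tool for this job, but the flow is the same; the connection URL, credentials, table name, and target path are all placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pull-orders-into-lake").getOrCreate()

    # Pull one table from a relational source. In many deployments this step
    # is done with Sqoop; Spark's JDBC reader follows the same pull pattern.
    # The URL, credentials, and table name below are placeholders.
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:postgresql://dbhost:5432/sales")
              .option("dbtable", "public.orders")
              .option("user", "etl_user")
              .option("password", "etl_password")
              .load())

    # Land the raw copy in HDFS in a columnar format so downstream parsing
    # and cleansing steps can read it efficiently.
    orders.write.mode("overwrite").parquet("hdfs:///data/raw/orders/")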

Ensuring data quality

The old adage “garbage in, garbage out” still applies, perhaps even more so given the massive amounts of data typically held in a Hadoop cluster. Before any ingested data is analyzed, a minimum degree of data sanitization should be performed to ensure its validity.

Hadoop is not a source of record, and most of the data is transferred to the data lake from disparate sources. As data is transferred into the data lake, there is a chance of the data becoming corrupted or arriving in an unexpected form. Therefore, any data lake implementation should include workflows that check the quality of the ingested data as early as possible.

Consider a simple scenario in which the weather needs to be analyzed for Durham, N.C. Unless the input data is validated to ensure that it really is from Durham, the analytical process is bound to produce incorrect results. Analysts should be able to focus on the quality of the model, not on the quality of the data the model runs on.
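The sketch below shows what such an ingest-time quality gate might look like, assuming weather readings land in HDFS as CSV files with hypothetical station_city and temp_f columns; records that fail the checks are quarantined for inspection rather than silently dropped.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("weather-quality-check").getOrCreate()

    # Raw readings as ingested; the landing path and column names are placeholders.
    raw = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("hdfs:///data/raw/weather/"))

    # Keep only records that are plausibly from Durham and within a sane
    # temperature range; everything else goes to quarantine for review.
    valid = raw.filter((col("station_city") == "Durham") &
                       col("temp_f").between(-30, 130))
    rejected = raw.subtract(valid)

    valid.write.mode("overwrite").parquet("hdfs:///data/clean/weather/")
    rejected.write.mode("overwrite").parquet("hdfs:///data/quarantine/weather/")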

CDC processing

Change data capture (CDC) data can be managed in Hadoop, as long as it is handled carefully. Production implementations of Hadoop are currently not systems of record, so the data in the lake does not automatically reflect the current state of the source systems. The data must be periodically refreshed in Hadoop and updated to the latest state.

A major limitation here is that a record in Hadoop can be updated only in its entirety; subrecord or columnar updates are not possible with existing implementations of Hadoop and Hive. Another limitation is that data in HDFS exists in the form of files, which means that updating even a single record requires rewriting the entire file that contains it. When implementing CDC on Hadoop/Hive data, therefore, the data should be organized into a small number of large partitions to achieve efficient processing. Performing frequent CDC updates on a large number of small partitions is inefficient in Hive, as Hadoop does not perform well with small data sets.
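A minimal sketch of that rewrite-the-whole-partition pattern follows, assuming a Hive-style date-partitioned layout and hypothetical order_id and updated_at columns: the day's change records are merged with the existing partition, the latest version of each key is kept, and the entire partition is written out again.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, row_number
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("cdc-partition-rewrite").getOrCreate()

    # Existing data for one partition plus the day's change records.
    # Paths, key column, and timestamp column are placeholders.
    current = spark.read.parquet("hdfs:///data/clean/orders/dt=2016-03-01/")
    changes = spark.read.parquet("hdfs:///data/cdc/orders/dt=2016-03-01/")

    # Records can only be replaced whole, so merge old and new rows and keep
    # the most recent version of each key (both inputs share the same schema).
    merged = current.union(changes)
    latest = (merged
              .withColumn("rn", row_number().over(
                  Window.partitionBy("order_id").orderBy(col("updated_at").desc())))
              .filter(col("rn") == 1)
              .drop("rn"))

    # Write to a staging path first; overwriting the input path in the same job
    # is unsafe because Spark reads lazily. The staged output then replaces the
    # old partition, for example via an HDFS rename.
    latest.write.mode("overwrite").parquet("hdfs:///data/staging/orders/dt=2016-03-01/")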

Security

For many organizations, data security is one of the most critical considerations in designing a data lake. Security is a lengthy subject, but ultimately each company and use case requires a tailored solution. Many open source tools are available to address data governance and data security.

For example, Apache Atlas is a data governance initiative led by Hortonworks. Apache Falcon (which automates data pipelines and data replication for recovery and other emergencies), Apache Knox Gateway (which provides edge protection), and Apache Ranger (which provides centralized authorization and auditing for Hadoop) are all provided in the Hortonworks Data Platform. Cloudera Navigator is Cloudera’s data governance solution for Hadoop. It’s provided in the Cloudera Data Hub.

Management

Data lake clusters can easily contain hundreds of nodes, making the tools that monitor, control, and resize the cluster a critical resource. Currently, there are two main options: Cloudera Manager from Cloudera and the open source Ambari from Hortonworks. With these tools, you can easily grow or shrink your data lake while finding and addressing issues such as bad hard drives, node failures, and stopped services. They also make it easy to add new services, ensure compatible tool versions, and upgrade the data lake.

The massive amounts of data now available to organizations present both challenges and opportunities. The data lake is the first step in addressing those challenges. While Hadoop is not exactly a “one size fits all” solution, it does provide the tools and the flexibility to address issues across a multitude of industries.

Naturally, the data lake in your organization should evolve with the technologies that enable the most advanced data science available. Today, a data lake is created and defined by the tools available in the Hadoop ecosystem.

Elizabeth Edmiston is vice president of engineering at Mammoth Data (formerly Open Software Integrators), a big data consulting firm based in Durham, N.C. She received her Ph.D. in computer science from Duke University.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2016 IDG Communications, Inc.