Home Data Management How different SQL-on-Hadoop engines satisfy BI workloads

by Thor Olavsrud

How different SQL-on-Hadoop engines satisfy BI workloads

Analysis

Feb 24, 2016

AnalyticsBusiness IntelligenceHadoop

A new benchmark of SQL-on-Hadoop engines Impala, Spark and Hive finds they each have their own strengths and weaknesses when it comes to Business Intelligence (BI) workloads.

Credit: Thinkstock

According to a new benchmark, the three leading SQL-on-Hadoop engines — Apache Impala 2.3, Apache Spark 1.6 and Apache Hive 1.2 — all have unique strengths and weaknesses that make them well-suited to some Business Intelligence (BI) use cases and less suited to others.

“The conclusions really are that one engine does not meet all requirements,” says Dave Mariani, CEO and founder of AtScale, a startup specializing in enabling BI on Hadoop. “What we have done in our deployments, for our customers, is plug in multiple engines.”

For the Business Intelligence on Hadoop benchmark, AtScale set out to help technology evaluators select the best SQL-on-Hadoop technology for their BI use cases. AtScale’s testing team used the Star Schema Benchmark (SSB) data set, based on widely used TPCH data, modified to more accurately represent a typical BI-oriented data layout. The data set allowed the test team to test queries across large tables: The lineorder table contains close to 6 billion rows and the large customer table contains over a billion rows.

SQL-on-Hadoop engines have different ‘sweet spot’ workloads

Mariani explains that AtScale looked at three key requirements to evaluate the SQL-on-Hadoop engines and their fitness to satisfy BI workloads:

Performs on big data. SQL-on-Hadoop engines must be able to consistently analyze billions or trillions of rows of data without generating errors and with response times on the orders of 10s or 100s of seconds.
Fast on small data. The engine needs to deliver interactive performance on known query patterns and, as such, it is important that the SQL-on-Hadoop engine return results in no greater than a few seconds on small data sets (on the order of thousands or millions of rows).
Stable for many users. Enterprise BI user bases consist of hundreds or even thousands of data workers. The underlying SQL-on-Hadoop engine must perform reliably under highly concurrent analysis workloads.

Mariani, who led the effort to build what may have been the world’s largest OLAP cube for BI at Yahoo!, says he believes these three criteria are representative of the primary requirements the average enterprise doing BI on Hadoop will have to meet. The criteria were drawn from the test team’s experience working with a large number of companies in financial services, healthcare, retail, telecommunications and other industries.

“We used real-world enterprise experience to produce a document that every technical evaluator can use as part of their evaluation process,” adds Josh Klahr, vice president of Product Management at AtScale.

The test team found that all three engines passed the tests and are stable enough to support BI workloads, but one engine does not fit all needs. Each has its own “sweet spot,” and enterprises are likely to find that blended usage of all engines might fit their goals best.

[ Related: Tableau partners for BI on Hadoop ]

While Hive is generally considered the default for SQL-on-Hadoop, it was far and away the slowest of the engines in the benchmark, making it poorly suited to interactive queries.

“If you want to use Hive Tez as your interactive query engine exclusively, the best you’re going to do is 2.4 seconds,” Mariani says.

But while it may be slow, Hive is also the most stable of the three engines, with the best consistency across multiple query types.

“Hive Tez is the tortoise,” Mariani adds. “It will always finish the race, but not in a spectacular, speedy fashion. It’s the most reliable.”

Impala and Spark, on the other hand, were at their best when it came to smaller data sets. Impala topped Spark across a gamut of workloads, but Mariani notes that Spark 1.6 was a vast performance improvement over Spark 1.5 and he expects that trend to continue as Spark has drawn a large open source community focused on its development. Cloudera recently proposed donating Impala to the Apache Software Foundation, which could also lend additional momentum to its development.

For now, Impala is the king for use cases that require large numbers of users.

“Impala kicks butt when it comes to concurrency,” Mariani says. “If you’re going to have a whole bunch of users running small, fast queries, Impala is a much better choice than Spark would be.”

“If speed is not a priority, but stability and reliability is, I would choose to Use Hive Tez as my data pipeline engine,” he adds. “For those big batch workloads I would choose Hive Tez. If I wanted my BI users to get access to my warehouse, I would choose to use Spark or Impala.”

Mariani notes that while the team didn’t benchmark other engines like Apache Drill or Apache Presto, they will next time.

“You never know between release and release who’s going to be the better horse to bet on,” he says.

Show me more

Camunda simplifies process automation with new AI-powered natural language features

By John Dunn

May 16, 20245 mins

Generative AIBusiness Process ManagementProcess Improvement

What’s holding CTOs back?

By Esther Shein

May 16, 20248 mins

CTOIT Leadership

Baldor’s first-ever CIO sets the transformation agenda

By Paula Rooney

May 16, 20246 mins

Transportation and Logistics IndustryDigital TransformationIT Leadership

CIO Leadership Live with Satya Jayadev, Vice President & CIO, Skyworks Solutions

May 17, 202417 mins

CIO Leadership Live

Principal Financial CIO Kathy Kay on balancing traditional AI and genAI

May 16, 202456 mins

CIO Leadership Live

CIO Leadership Live Middle East with Ramadan Mohamad, Digital infrastructure specialist at Public Transport Corp.

May 14, 202412 mins

CIO Leadership Live

CIO Leadership Live with Satya Jayadev, Vice President & CIO, Skyworks Solutions

May 17, 202417 mins

CIO Leadership Live

Principal Financial CIO Kathy Kay on balancing traditional AI and genAI

May 16, 202456 mins

CIO Leadership Live

Alteryx adds genAI to enable analytics creation via no-code platform

May 15, 20249 mins

Generative AIArtificial Intelligence

Africa

Americas

Asia

Europe

Oceania

Topics

About

Policies

Our Network

More

How different SQL-on-Hadoop engines satisfy BI workloads

A new benchmark of SQL-on-Hadoop engines Impala, Spark and Hive finds they each have their own strengths and weaknesses when it comes to Business Intelligence (BI) workloads.

SQL-on-Hadoop engines have different ‘sweet spot’ workloads

Show me more

Camunda simplifies process automation with new AI-powered natural language features

What’s holding CTOs back?

Baldor’s first-ever CIO sets the transformation agenda

CIO Leadership Live with Satya Jayadev, Vice President & CIO, Skyworks Solutions

Principal Financial CIO Kathy Kay on balancing traditional AI and genAI

CIO Leadership Live Middle East with Ramadan Mohamad, Digital infrastructure specialist at Public Transport Corp.

CIO Leadership Live with Satya Jayadev, Vice President & CIO, Skyworks Solutions

Principal Financial CIO Kathy Kay on balancing traditional AI and genAI

Alteryx adds genAI to enable analytics creation via no-code platform

How different SQL-on-Hadoop engines satisfy BI workloads

A new benchmark of SQL-on-Hadoop engines Impala, Spark and Hive finds they each have their own strengths and weaknesses when it comes to Business Intelligence (BI) workloads.

SQL-on-Hadoop engines have different ‘sweet spot’ workloads

Related content

State of IT jobs: Mixed signals, changes ahead

Project manager salary: 5 key tips to earn more

Cyber resilience: A business imperative CIOs must get right

Shine a Spotlight on Your Team’s IT Excellence with CIO Awards Canada

From our editors straight to your inbox

Show me more

Camunda simplifies process automation with new AI-powered natural language features

What’s holding CTOs back?

Baldor’s first-ever CIO sets the transformation agenda

CIO Leadership Live with Satya Jayadev, Vice President & CIO, Skyworks Solutions

Principal Financial CIO Kathy Kay on balancing traditional AI and genAI

CIO Leadership Live Middle East with Ramadan Mohamad, Digital infrastructure specialist at Public Transport Corp.

CIO Leadership Live with Satya Jayadev, Vice President & CIO, Skyworks Solutions

Principal Financial CIO Kathy Kay on balancing traditional AI and genAI

Alteryx adds genAI to enable analytics creation via no-code platform