If you have ambition, open source at scale is essential

“No proprietary software can solve all the problems of companies that operate at the scale of Didi,” says Li Luo, technical director of big data at Didi Chuxing, the Uber of China

Contributor, InfoWorld

When your job is to provide the cloud infrastructure to run analytics and workloads across three datacenters that are more than 100 miles apart, sucking 100-plus petabytes from each daily, buying that infrastructure from Megavendor X is no longer even a remotely credible option. These days, the only place to find such software is in an open source repository somewhere.

Which is exactly what Didi Chuxing, the Uber of China, did.

“No proprietary software can solve all the problems of companies that operate at the scale of Didi,” said Li Luo, technical director of big data at Didi. “We need access to the source code so we can update it and contribute back frequently to meet changing requirements. For companies like ours, open source software is the right choice.”

In reality, it’s the only choice.

Five years ago, Cloudera cofounder Mike Olson wrote, “No dominant platform-level software infrastructure has emerged in the last ten years in closed-source, proprietary form.” In significant measure, this stems from the realities of operating at web-scale: The financial costs, never mind the technical costs, of trying to scale proprietary hardware and software systems are simply too high. Companies like Google and Facebook keep gifting genius creations to the open source community, driving innovation faster, well beyond the realm of proprietary firms’ ability to compete in data infrastructure.

Which is why Didi quickly abandoned the idea of paying a big vendor like Oracle for scale.

Scale that money can’t buy

In the global race to dominate the ride-hailing business, people usually think first of Uber. But Didi, with a private valuation of $56 billion, has raised more than $20 billion from investors to date, just $2 billion behind Uber. The China-based mobile transportation platform serves more than 500 million users and reaches 80 percent of the world’s population in more than 1,000 cities. That is scale.

What’s under the hood of that cloud engine? Open source.

Before Li and his team rearchitected their data platform to accommodate exploding machine learning workloads, the company relied on expensive, slow ETL tools to pool data in HDFS for the big-data applications that match drivers with passengers and handle other tasks. With the new architecture, Didi runs standard open source big-data applications like Apache Spark, Presto, Hive, Flink, and Druid for general analytics and queries.

Cost? $0.00.

Of course, the autonomous technology giant pays in other ways. Nothing is free, and while Didi’s open source arsenal may come for free, the people developing it do not. Didi recognizes, however, that skilled developers are critical to competitive differentiation. Not every enterprise could pull this off, as Li suggests. But for companies serious about data, there’s simply no off-the-shelf solution that gifts big-data prowess. To achieve this, you need developers—and they need open source.

A hub for open source innovation

Alluxio, an open source platform originally developed at UC Berkeley’s AMPLab, was the bridge to solving critical problems in Didi’s legacy ETL solution. Alluxio manages all the data from the datacenters in shared memory so jobs run in real time, without ETL. It allows many different jobs and applications to run simultaneously across the giant shared data pool.

Now, new applications are plug and play, eliminating issues with multiple data formats and file systems. Whereas Li used to run a huge HDFS cluster to get a single name node, now he can have distributed data sources and still have a single namespace.
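To make the "single namespace over distributed data sources" idea concrete, here is a minimal sketch of a mount table that maps one logical path tree onto several backing stores. It is a toy illustration, not Alluxio's API or Didi's implementation: the `UnifiedNamespace` class, the datacenter host names, and the bucket name are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class UnifiedNamespace:
    """Toy mount table: maps logical path prefixes to backing-store URIs."""

    mounts: Dict[str, str] = field(default_factory=dict)

    def mount(self, logical_prefix: str, backing_uri: str) -> None:
        # Normalize trailing slashes so "/rides/" and "/rides" are one mount point.
        self.mounts[logical_prefix.rstrip("/") or "/"] = backing_uri.rstrip("/")

    def resolve(self, logical_path: str) -> str:
        # Collect every mount whose prefix covers the path on a directory boundary.
        candidates = [
            p for p in self.mounts
            if p == "/" or logical_path == p or logical_path.startswith(p + "/")
        ]
        if not candidates:
            raise KeyError(f"no mount covers {logical_path}")
        # The longest matching prefix wins, so nested mounts override their parents.
        prefix = max(candidates, key=len)
        remainder = logical_path if prefix == "/" else logical_path[len(prefix):]
        return self.mounts[prefix] + remainder


# Hypothetical layout: two HDFS clusters and one object store behind one tree.
ns = UnifiedNamespace()
ns.mount("/rides", "hdfs://dc-a/warehouse/rides")
ns.mount("/drivers", "hdfs://dc-b/warehouse/drivers")
ns.mount("/maps", "s3://maps-bucket")

print(ns.resolve("/rides/2018/trips.parquet"))
# hdfs://dc-a/warehouse/rides/2018/trips.parquet
print(ns.resolve("/maps/tiles/1.png"))
# s3://maps-bucket/tiles/1.png
```

An application written against the logical paths does not need to know which datacenter or storage system holds the bytes, which is the property that lets new applications plug in without a per-source integration.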

How does Li choose where to place his open source bets? “Most of the time, we don’t have to keep up with the pace of innovation, because in the projects we use we’ll often also be leading contributors of that innovation because of our specific requirements,” Li said. “So we don’t have to worry about open source projects being displaced by new ones. Our job is to solve the business problems of our company and open source allows us to solve those problems more efficiently. That’s why we use it.”

Catch that? While the company tends to push the boundaries of innovation, by sharing this innovation it ensures that it won’t be the sole maintainer of it. Open source gives the company a way to collaborate on both innovation and maintenance. It’s genius.

Li points out that, in this and related ways, open source also helps Didi avoid writing and maintaining a lot of custom code while reducing integration complexity. For example, the Alluxio project can take data from almost any source, from HDFS and object stores to traditional storage systems. Before Alluxio, relying on ETL meant building custom integrations for each individual application so that it could run on HDFS. Aggregating all that data was difficult.

Li said the most important advice he gives to his peers building out massive-scale architectures is to first really understand the problem you are trying to solve. Open source helps him future-proof those decisions. “In the end, it’s all about the expectations of your customers on that data,” Li said. “Find the best way to meet those expectations, and make sure you run an infrastructure that is easy to update to meet changing business requirements.”

Oh, and then contribute it to an open source community.

Matt Asay runs developer relations at MongoDB. The views expressed herein are Matt’s and do not reflect those of his employer.