How to choose the right data-integration tools

For integrating, pipelining, and wrangling data, here are the seven types of tools to build your data tool set from.

Data doesn’t sit in one database, file system, data lake, or repository. To service many business needs, data must be integrated from a system of record with other data sources and then used for analytics, customer-facing applications, or internal workflows. Examples include:

  • Data from an e-commerce application integrated with user analytics, customer data in a CRM, and other master data sources to establish customer segments and tailor marketing messages.
  • IoT sensor data linked to data in operational and financial data stores to control throughput and report on quality of a manufacturing process.
  • An employee workflow application that connects data and tools across multiple SaaS platforms and internal data sources into one easy-to-use mobile interface.

Many organizations also have data requirements from data scientists, data analysts, and innovation teams that have growing needs around integrating internal and external data sources:

  • Data scientists developing predictive models often load multiple external data sources such as econometrics, weather, census, and other public data and then blend them with internal sources.
  • Innovation teams experimenting with artificial intelligence need to aggregate large and often complex data sources that can be used to train and test their algorithms.
  • Business and data analysts, especially in data-driven marketing departments, who once performed their analyses in spreadsheets may now require more sophisticated tools to load, join, and process multiple data feeds.

The big market of data technologies and capabilities

The question is: What tools and practices should you use to integrate data sources? What platforms should automate operational data flows? What tools should be provided to data scientists and data analysts who want to be more hands-on when working with new data sources? And when developing applications that transact across multiple data sources and APIs, what development and devops tools enable more rapid application development?

Because many organizations have different types, volumes, and velocities of data, and because business needs develop over time, it's likely that different methodologies and tools are already in use to integrate data. It's tempting to stick with those and extend them to new use cases. But although anyone working with data tools may be more familiar with one approach than with others, a one-size-fits-all data-integration approach may not be optimal for organizations with multiple business and user needs.

In addition, there is a healthy market for big data technologies as more organizations invest in data solutions. The result is that there are now many new types of platforms and tools to support data integration and processing.

With so many tools available, organizations looking to make data processing a core capability should consider various tool types that can be applied depending on the business and technical needs. Technologists working with or responsible for data technologies should be well versed on the types of tools available. Here, I survey seven major types of tools:

  • Programming and scripting data integrations
  • Traditional extract, transform, and load (ETL) tools
  • Data highways for SaaS platforms
  • Data-prep tools for users and data scientists
  • API and data integrations for application development
  • Big data enterprise platforms with data-integration capabilities
  • AI-infused data-integration platforms

Programming and scripting data integrations

For anyone with even basic programming skills, the most common way to move data from a source to a destination is to develop a short script. This can be done inside a database with stored procedures, as a script that runs as a scheduled job, or as a small data-processing code snippet deployed to a serverless architecture.

These scripts typically run in one of several modes. They can run on a predefined schedule, or run as services that are either triggered by an event or respond when defined conditions are met. They can pull data from multiple sources, then join, filter, cleanse, validate, and transform the data before shipping it to destination data stores.

Scripting is a quick and easy way to move data, but it is not considered a professional-grade data-processing method. To be production-class, a data-processing script needs to automate the steps required to process and transport data and handle several operational needs. For example, if the script processes large volumes of data or fast-moving data, it may need Apache Spark or another parallel-processing engine to run multithreaded jobs. If the input data isn't clean, programmers should add exception handling that kicks out bad records without halting the data flow. Programmers should also implement robust logging of the compute steps to facilitate debugging.
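As a rough illustration of what those operational needs add to even a simple job, here is a minimal Python sketch. The file names, job name, and required fields are hypothetical; a real script would read from and write to your own data stores.

```python
import csv
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("nightly_customer_feed")  # hypothetical job name

REQUIRED_FIELDS = ("customer_id", "email")  # assumed schema, for illustration only

def transform(row):
    """Apply simple cleansing: trim whitespace and normalize email casing."""
    cleaned = {k: (v or "").strip() for k, v in row.items()}
    cleaned["email"] = cleaned["email"].lower()
    return cleaned

def run(source_path, dest_path, errors_path):
    loaded, rejected = 0, 0
    with open(source_path, newline="") as src, \
         open(dest_path, "w", newline="") as dst, \
         open(errors_path, "w", newline="") as err:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        errlog = csv.DictWriter(err, fieldnames=reader.fieldnames)
        writer.writeheader()
        errlog.writeheader()
        for row in reader:
            # Exception handling: set bad records aside without stopping the flow.
            if any(not row.get(f) for f in REQUIRED_FIELDS):
                errlog.writerow(row)
                rejected += 1
                continue
            writer.writerow(transform(row))
            loaded += 1
    # Logging of the compute steps to facilitate debugging and monitoring.
    log.info("loaded=%d rejected=%d", loaded, rejected)

if __name__ == "__main__":
    run("customers_raw.csv", "customers_clean.csv", "customers_rejected.csv")
```

Even this toy version needs validation, an exceptions file, and logging; scaling it to large or fast-moving data adds parallelism, retries, and scheduling on top.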

The programming required to support these operational needs is not trivial. It requires the developer to anticipate what can go wrong with the data integration and code for it. In addition, developing custom scripts may not be cost-effective when working with many experimental data sources. Finally, knowledge of data-integration scripts is often difficult to transfer and maintain across multiple developers.

For these reasons, organizations that have larger data-integration needs often look beyond programming and scripting data integrations.

Traditional extract, transform, and load (ETL) tools

Extract, transform, and load (ETL) technologies have been around since the 1970s, and platforms from IBM, Informatica, Microsoft, Oracle, Talend, and others are mature in capability, performance, and stability. These platforms offer visual programming tools that let developers break down and automate the steps to extract data from sources, perform transformations, and push data to destination repositories. Because they are visual and break down data flows into atomic steps, the pipelines are easier to manage and enhance compared to scripts that may not be as easy to decode. In addition, ETL platforms typically offer operational interfaces to show where data pipelines break down and provide steps to restart them.

Over the years, ETL platforms have added many capabilities. Most can handle data from databases, flat files, and web services whether they are on premises, in the cloud, or in SaaS data stores. They support a wide variety of data formats, including relational data, semistructured formats such as XML and JSON, and unstructured data and documents. Many have tools to parallelize jobs using Spark or another parallel-processing engine. Enterprise ETL platforms often include data-quality capabilities so data can be validated with rules or patterns and exceptions sent to data stewards to resolve.

A common ETL example is an organization loading new files of sales prospects into a CRM. Before loading, these data sources often need cleansing of physical and email addresses, which can be done as a transformation using rules and standard data sources. Cleansed records are then matched to records already in the CRM, so existing records are enhanced with attributes they didn't previously have and genuinely new records are added. If the ETL has a hard time deciding whether a row is a match or a new record, it can flag the row as an exception for a data steward to review.
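The cleanse-and-match step in that example is the kind of logic an ETL platform encodes visually. A rough Python sketch of the same idea follows; the field names and the match thresholds are assumptions for illustration, not settings from any particular platform.

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.92   # assumed cutoff; real ETL tools make this configurable
REVIEW_THRESHOLD = 0.75  # scores in between go to a data steward as exceptions

def cleanse(prospect):
    """Standardize the fields used for matching."""
    return {
        "email": prospect["email"].strip().lower(),
        "name": " ".join(prospect["name"].split()).title(),
    }

def classify(prospect, crm_records):
    """Return ('match', record), ('exception', candidate), or ('new', None)."""
    prospect = cleanse(prospect)
    best, best_score = None, 0.0
    for record in crm_records:
        if record["email"] == prospect["email"]:
            return "match", record  # exact key match: enhance the existing record
        score = SequenceMatcher(None, record["name"], prospect["name"]).ratio()
        if score > best_score:
            best, best_score = record, score
    if best_score >= MATCH_THRESHOLD:
        return "match", best
    if best_score >= REVIEW_THRESHOLD:
        return "exception", best  # ambiguous: route to a data steward for review
    return "new", None
```

An ETL tool wraps this kind of rule in a visual step with operational monitoring, which is exactly what makes the pipeline easier to manage than a hand-rolled script.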

ETL platforms are often used when data sources are supplying new data on an ongoing basis and the destination data store’s data structures don’t change frequently. These platforms are designed for developers to program the ETLs and so are most effective to operationalize data flows against a mix of proprietary, commercial, and open data stores.

Data highways for SaaS platforms

But is there a more efficient way to pull data from common data sources? Maybe the primary data objective is to pull account or customer contacts from Salesforce, Microsoft Dynamics, or another common CRM program. Or maybe marketers want to extract web analytics data from tools like Google Analytics or are trying to push customer data into marketing tools such as Mailchimp. How should you prevent your SaaS platforms from becoming data silos in the cloud and easily enable bidirectional data flows?

If you already have an ETL platform in place, look to see if the vendor provides standard connectors to common SaaS platforms or has a marketplace from which you can purchase one from a development partner.

If you don’t have an investment in an ETL platform and your data integration need is largely to connect common platforms, you might need an easy-to-use tool to construct simple data highways. Data-highway tools like Scribe, Snaplogic, and Stitch provide simple web interfaces to connect to common data sources, select fields of interest, perform basic transformations, and push data to common destinations.

Another form of data highway helps integrate data closer to real time. These tools operate from triggers: When data changes in a source system, the change can be manipulated and pushed to a secondary system. IFTTT, Workato, and Zapier are examples of such tools. They are particularly useful for moving single records from one SaaS platform to another using "if this then that" logic. When evaluating them, consider the number of platforms they integrate with, the power and simplicity of their processing logic, and price, as well as any factors specific to your needs.
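To make the "if this then that" pattern concrete, here is a minimal Python sketch of the single-record, trigger-driven logic these tools let you configure without code. The event payload shape and the push_to_mailing_list function are hypothetical stand-ins for a source webhook and a destination SaaS API.

```python
def handle_crm_event(event):
    """'If this then that': when a new contact is created, push it to a mailing list."""
    # "This": the trigger condition evaluated on the incoming event.
    if event.get("type") == "contact.created" and event.get("contact", {}).get("email"):
        contact = event["contact"]
        # "That": the single-record action performed in the destination system.
        push_to_mailing_list(
            email=contact["email"].lower(),
            first_name=contact.get("first_name", ""),
            tags=["crm-import"],
        )

def push_to_mailing_list(email, first_name, tags):
    # Placeholder for a call to the destination SaaS API (e.g., a marketing tool).
    print(f"subscribing {email} ({first_name}) with tags {tags}")
```

The value of the data-highway tools is that they provide the connectors, authentication, and retry handling around this logic so that nontechnical users can configure it in a web interface.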

Data-prep tools for users and data scientists

When nontechnical and less-technical users want to load and cleanse new data sources, there’s a new breed of data-preparation tools targeting these users to help them perform data integrations.

To implement basic data integrations and blending, consider what data-integration capabilities any business intelligence system in your organization already provides. Tools like Looker, QlikView, and Tableau provide basic data integration, modeling, and transformation. You can also use them to publish and share virtual data sources.

If data scientists and data analysts are working with a large number of data sources or spend a lot of time wrangling data, a data-prep tool that integrates with your BI tools may be a smart investment.

What is data wrangling? It's all the work business users and data scientists need to do on a data source to make it usable in an analysis. It often starts with profiling the data: a field-by-field analysis showing what percentage of records have values versus nulls and what distinct values exist in the data set. Using the profile, analysts clean up fields where different values mean the same thing (like "New York" and "NY"), merge and remove fields, or use formulas to create calculated fields. They can also perform row-level operations such as removing duplicates and consolidating records.
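Those wrangling steps map directly onto a few lines of code. Here is a minimal pandas sketch of profiling, standardizing values, adding a calculated field, and deduplicating; the columns and values are invented purely for illustration.

```python
import pandas as pd

# A tiny sample data set with the kinds of problems wrangling addresses.
df = pd.DataFrame({
    "customer": ["Acme", "Acme", "Globex", "Initech"],
    "city": ["New York", "NY", "Springfield", None],
    "revenue": [1200, 1200, 850, 400],
})

# Profile: percentage of populated values and count of distinct values per field.
profile = pd.DataFrame({
    "pct_populated": df.notna().mean() * 100,
    "distinct_values": df.nunique(),
})
print(profile)

# Clean up fields where different values mean the same thing.
df["city"] = df["city"].replace({"NY": "New York"})

# Create a calculated field, then remove duplicate rows.
df["revenue_k"] = df["revenue"] / 1000
df = df.drop_duplicates(subset=["customer", "city"])
```

Data-prep tools perform the same operations through a visual interface, which is what makes them accessible to analysts who don't write code.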

Data-prep tools are often designed around spreadsheet-like user interfaces that let users visualize data profiles and blend data sources. But unlike traditional spreadsheets, these tools capture the data-processing steps the user performs and let users visualize and edit those operations. Most of these tools can use the captured steps to automate the data flow for feeds that have ongoing operational needs.

There are standalone data-prep tools like Alteryx, Paxata, and Trifacta. In addition, the traditional ETL vendors like IBM and Talend have developed data-prep tools that target business users and data scientists.

API and data integrations for application development

If your objective is to develop web or mobile applications that require connecting to multiple data sources and APIs, there are API and application development tools to simplify those integrations. Instead of integrating the data into a central repository, these tools offer various options to enable more rapid application development when working with many APIs and data sources.

Application integration has several different platform types and tool providers. Platforms like Dell Boomi, Jitterbit, and Mulesoft aim to simplify API and data access, as well as act as a data bus to centralize interactions. Low-code and mobile development platforms like Built.io, OutSystems, and Pow Wow Mobile enable the integration and provide development and devops environments to rapidly build and run applications.
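What these platforms abstract away is the glue code of calling several APIs and merging the results into the shape an application needs. A rough sketch of that aggregation follows, using the Python requests library; the endpoints, parameters, and token are entirely hypothetical.

```python
import requests

def get_customer_view(customer_id, crm_base_url, support_base_url, api_token):
    """Assemble one response for the app from two hypothetical backend APIs."""
    headers = {"Authorization": f"Bearer {api_token}"}

    # Pull the customer record from the CRM's API.
    crm = requests.get(f"{crm_base_url}/customers/{customer_id}",
                       headers=headers, timeout=10)
    crm.raise_for_status()

    # Pull open tickets from the support system's API.
    tickets = requests.get(f"{support_base_url}/tickets",
                           params={"customer_id": customer_id, "status": "open"},
                           headers=headers, timeout=10)
    tickets.raise_for_status()

    # Merge into the single shape the mobile or web front end expects.
    return {
        "customer": crm.json(),
        "open_tickets": tickets.json(),
    }
```

Integration platforms and low-code tools take on the connection management, authentication, and error handling around calls like these so that application teams can focus on the user experience.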

Big data enterprise platforms with data-integration capabilities

If you are developing capabilities on top of Hadoop or other big data platforms, you have a couple of options to integrate data into these data stores:

  • You can develop scripts or use ETL tools that support big data platforms as endpoints.
  • You can select an end-to-end data management platform that has ETL, data governance, data quality, data prep, and master data capabilities.
  • You can do both.

Many vendors that provide ETL tools also sell enterprise platforms with these added big data capabilities. There are also emerging platforms like Datameer and Unifi that are designed to enable self-service (like data-prep tools) but that work on top of Hadoop distributions from supporting vendors.
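As a concrete example of the first option in the list above (scripts or tools that treat the big data platform as the endpoint), here is a minimal PySpark sketch; the landing path, column names, and Hive table are assumptions, not a prescribed layout.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("load_sales_to_hive")   # hypothetical job name
         .enableHiveSupport()
         .getOrCreate())

# Extract: read raw CSV files landed on the cluster's distributed file system.
raw = spark.read.option("header", True).csv("/landing/sales/2024-06-01/")

# Transform: basic typing and cleansing before the data reaches analysts.
clean = (raw
         .withColumn("amount", F.col("amount").cast("double"))
         .withColumn("sale_date", F.to_date("sale_date"))
         .dropDuplicates(["order_id"])
         .filter(F.col("amount").isNotNull()))

# Load: append into a Hive table that downstream BI and data-prep tools can query.
clean.write.mode("append").saveAsTable("analytics.sales_daily")
```

An end-to-end data management platform layers governance, data quality, and master data capabilities on top of pipelines like this one.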

AI-infused data-integration platforms
