10 New Requirements for Modern Data Integration

Feb 24, 2016

By Viakom Krishnan

Cloud-based software and computing services are now accepted as the norm. This change has profound implications for how software applications are architected, delivered, and consumed.

This change has ushered in a new generation of technology and an entirely new category in data integration. Today, there are 10 new requirements for an enterprise data integration technology:

1. Application integration is done primarily through REST and SOAP services.

Software applications are increasingly delivered as cloud-based services that expose SOAP/REST APIs for data and metadata management based on business services and business objects. Unlike the previous generation of on-premises applications, today’s SaaS applications do not allow direct access to the database behind their services. As a result, these applications lack the relational client-server interface that the previous generation of application integration tooling relied on.

To be effective, modern data integration platforms must provide easy and robust ways to consume REST and SOAP. They need to provide an easy way to abstract the complexities of these APIs into business actions and objects to enable an application administrator to integrate these services with the rest of the enterprise.
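As a rough illustration of that kind of abstraction, the sketch below wraps a REST endpoint in a business object and a business action. The `Customer` type, the `/customers` path, and the injected `transport` callable are all assumptions made for this example and do not correspond to any particular vendor API. The transport is a stand-in so the sketch runs without a network.

```python
import json
from dataclasses import dataclass
from typing import Callable

# A transport takes (method, path, body) and returns the response body as text.
# It is injected so that the sketch runs without a network connection.
Transport = Callable[[str, str, str], str]


@dataclass
class Customer:
    """Business object that hides the shape of the raw JSON payload."""
    customer_id: str
    name: str
    email: str


class CustomerService:
    """Business actions layered over a hypothetical REST API."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def add_customer(self, name: str, email: str) -> Customer:
        # The caller invokes a business action; the HTTP verb, path, and
        # payload encoding stay inside this wrapper.
        body = json.dumps({"name": name, "email": email})
        raw = self._transport("POST", "/customers", body)
        doc = json.loads(raw)
        return Customer(customer_id=doc["id"], name=doc["name"], email=doc["email"])


def fake_transport(method: str, path: str, body: str) -> str:
    """Stand-in for a real HTTP client, used only to exercise the sketch."""
    payload = json.loads(body)
    payload["id"] = "c-001"  # hypothetical identifier assigned by the service
    return json.dumps(payload)


service = CustomerService(fake_transport)
customer = service.add_customer("Ada", "ada@example.com")
```

An application administrator working at this level deals with `add_customer` and `Customer`, not with verbs, paths, or JSON keys.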

2. Large-volume data integration targets a Hadoop-based data lake or a cloud-based data warehouse.

Increasingly, enterprise IT organizations (and lines of business/departments) are moving away from bespoke data warehouses to data lakes: repositories of all data, built on a Hadoop cluster. MapReduce and, more recently, Spark are used as the compute frameworks for transforming large amounts of data in this environment. Cloud data warehouse technologies such as Amazon Redshift and Microsoft Azure SQL Data Warehouse provide low-cost, low-administration alternatives to expensive specialized data warehouse appliances. Data integration tooling needs a native understanding of these newer large-scale distributed storage and compute frameworks, such as HDFS and Spark. This is difficult for client-server tooling that has relied on row sets as the primary commodity to be efficiently managed.

3. Integration has to support the continuum of data velocities starting from batch all the way to continuous streams.

A change in data velocity or data size should not require you to change engines, as it did with the previous generation of tooling. Last-generation data integration engines were optimized either for batch processing of large-volume data or for low-latency handling of small messages.

Modern integration platforms should be able to provide the necessary velocity regardless of size of data. This means that the engine has to be able to stream large data such as sensor data from the Internet of Things just as easily as it can consume and deliver responses to discrete business events such as the addition of a new product or a new customer.
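One way to picture a single engine serving both ends of that continuum is a transform written against an iterator, so the same code handles a finite batch and an unbounded stream. This is a minimal sketch: the `enrich` function, the `sensor_stream` source, and the record fields are invented for illustration.

```python
from typing import Dict, Iterable, Iterator, List


def enrich(records: Iterable[Dict]) -> Iterator[Dict]:
    """Process one record at a time, so the input may be finite or unbounded."""
    for record in records:
        yield {**record, "fahrenheit": record["celsius"] * 9 / 5 + 32}


def sensor_stream() -> Iterator[Dict]:
    """Hypothetical unbounded source, standing in for device readings."""
    reading = 0
    while True:
        yield {"sensor": "s1", "celsius": float(reading)}
        reading += 10


# Batch: a finite collection, fully materialized after the transform.
batch: List[Dict] = [
    {"sensor": "s1", "celsius": 0.0},
    {"sensor": "s1", "celsius": 100.0},
]
batch_result = list(enrich(batch))

# Stream: the same transform, consumed lazily; only three results are taken.
stream = enrich(sensor_stream())
stream_result = [next(stream) for _ in range(3)]
```

The transform never asks how much data is coming, which is the property that lets velocity and volume change without swapping engines.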

4. Integration is event-based rather than clock-driven.

Responding to a business event as it happens is now expected. Examples include increasing the stock of an item based on sentiment expressed in social media, or opening a support case automatically when a failure is detected at a device. In either case, polling for these conditions after the fact means a frustrated or lost customer and an inefficient process in today’s real-time enterprise.
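The difference from clock-driven polling can be sketched with a small in-process publish/subscribe dispatcher, where the handler runs the moment the event is published. The `EventBus` class, the `device.failure` event name, and the payload fields are assumptions made for this example.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Minimal in-process publish/subscribe dispatcher."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict) -> int:
        # Handlers run at publish time; nothing waits for a scheduled poll.
        handlers = self._handlers.get(event_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


support_cases: List[Dict] = []


def open_support_case(event: Dict) -> None:
    """Hypothetical integration step: record a support case for the device."""
    support_cases.append({"device": event["device_id"], "reason": event["error"]})


bus = EventBus()
bus.subscribe("device.failure", open_support_case)
delivered = bus.publish("device.failure", {"device_id": "pump-7", "error": "overheat"})
```

A production system would typically use a message broker or webhooks rather than an in-process bus, but the control flow is the same: the event triggers the integration, not the clock.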

5. Integration is primarily document-centric.

This is a corollary to the fact that integration is based on SOAP/REST APIs that send and receive hierarchical documents, rather than the row sets or compressed message payloads of the previous generation of client-server technologies.

Transforming hierarchical documents into row sets or into compressed payloads at the edges to make the internal engines run efficiently is the biggest impediment to streamlined repurposing of the previous generation of data integration tooling.
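To make that edge conversion concrete, the sketch below flattens one hierarchical order document into rows and then rebuilds it, which is roughly the work a row-set engine would have to do on the way in and on the way out. The order structure and the function names are invented for illustration.

```python
from typing import Dict, List

# A hierarchical document of the kind a REST API might return.
order = {
    "order_id": "o-1",
    "customer": {"id": "c-001", "name": "Ada"},
    "lines": [
        {"sku": "A", "qty": 2},
        {"sku": "B", "qty": 1},
    ],
}


def flatten_order(doc: Dict) -> List[Dict]:
    """Turn one hierarchical order into one row per order line."""
    rows = []
    for line in doc["lines"]:
        rows.append({
            # Parent fields are repeated on every row.
            "order_id": doc["order_id"],
            "customer_id": doc["customer"]["id"],
            "customer_name": doc["customer"]["name"],
            "sku": line["sku"],
            "qty": line["qty"],
        })
    return rows


def rebuild_order(rows: List[Dict]) -> Dict:
    """Regroup rows into a document, as an edge adapter must do on the way out."""
    first = rows[0]
    return {
        "order_id": first["order_id"],
        "customer": {"id": first["customer_id"], "name": first["customer_name"]},
        "lines": [{"sku": r["sku"], "qty": r["qty"]} for r in rows],
    }


rows = flatten_order(order)
round_trip = rebuild_order(rows)
```

Even in this small case the parent fields are duplicated per row and the nesting has to be reassembled by hand; a document-centric engine would pass `order` through unchanged.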

6. Integration is hybrid and spans cloud-cloud and cloud-ground scenarios.

We are in a transitional period. While the newer software purchases are almost exclusively cloud-based, there is still a lot of investment in legacy on-premises enterprise applications that will take time to migrate. Some applications may never migrate to the cloud.

In today’s hybrid, multi-cloud environment, modern data integration technology has to be able to handle both on-premises and cloud-based applications with the same efficiency and ease.

7. Integration itself has to be accessible through SOAP/REST APIs.

Integration has to interoperate with other services in the enterprise such as monitoring, provisioning, and security. For example, enterprises might want to monitor the success or failure of integration flows through their own monitoring tools, and they might want to add new users automatically as they get added to the enterprise integration group. And most enterprises require single sign-on with their identity provider.

8. Integration is all about connectivity, connectivity, connectivity.

Just as real estate is all about location, location, location, integration is all about connectivity. By definition, integration is about connecting disparate systems, each with its own API set, and an integration toolset needs an effective framework for adapting these APIs so that data can be processed efficiently. In addition, a large set of pre-built connectors speeds up implementation and increases agility in responding to new integration scenarios.
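A connector framework of this kind can be sketched as a common interface plus a registry of pre-built implementations. Everything named here (`Connector`, `register`, `MemoryConnector`, `run_integration`) is hypothetical and stands in for whatever a real platform provides.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Type


class Connector(ABC):
    """Common interface that adapts one system's API to the integration engine."""

    @abstractmethod
    def read(self) -> Iterator[Dict]:
        ...

    @abstractmethod
    def write(self, records: Iterable[Dict]) -> int:
        ...


# Registry of pre-built connectors, looked up by name.
REGISTRY: Dict[str, Type[Connector]] = {}


def register(name: str) -> Callable[[Type[Connector]], Type[Connector]]:
    def decorator(cls: Type[Connector]) -> Type[Connector]:
        REGISTRY[name] = cls
        return cls
    return decorator


@register("memory")
class MemoryConnector(Connector):
    """Toy connector backed by a list; a real one would call a remote API."""

    def __init__(self, records: Optional[List[Dict]] = None) -> None:
        self.records: List[Dict] = list(records or [])

    def read(self) -> Iterator[Dict]:
        return iter(self.records)

    def write(self, records: Iterable[Dict]) -> int:
        before = len(self.records)
        self.records.extend(records)
        return len(self.records) - before


def run_integration(source: Connector, target: Connector) -> int:
    """The engine only sees the Connector interface, never a system's own API."""
    return target.write(source.read())


source = REGISTRY["memory"]([{"id": 1}, {"id": 2}])
target = REGISTRY["memory"]()
copied = run_integration(source, target)
```

Supporting a new system then means adding one registered class, while `run_integration` and any existing flows stay untouched.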

9. Integration has to be elastic.

The integration demands of a modern real-time enterprise can vary widely from one day to the next, depending on the business events taking place. One day could see hundreds of integrations triggered by a scenario that a data scientist is exploring, and the next day could be back to the normal load of a few integrations. Reserving capacity for worst-case compute and storage needs is costly, and not having sufficient capacity when it is needed is even more so. This means that the integration framework has to be able to scale resources up and down on demand.
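A simple scaling rule shows the idea: size the worker pool from the current backlog, within a floor and a ceiling. This is a minimal sketch, and the function name, the ten-jobs-per-worker ratio, and the limits are assumptions chosen for the example rather than recommended values.

```python
import math


def desired_workers(
    queued_jobs: int,
    jobs_per_worker: int = 10,
    min_workers: int = 1,
    max_workers: int = 50,
) -> int:
    """Return a worker count proportional to the backlog, clamped to limits."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    needed = math.ceil(queued_jobs / jobs_per_worker)
    # Never drop below the floor, never exceed the ceiling.
    return max(min_workers, min(max_workers, needed))


quiet_day = desired_workers(3)       # normal load of a few integrations
busy_day = desired_workers(400)      # hundreds of integrations triggered at once
spike = desired_workers(10_000)      # demand beyond the configured ceiling
```

Real autoscalers usually add smoothing or cool-down periods to avoid thrashing, but the core decision is the same comparison of demand against capacity.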

10. Integration has to be delivered as a service.

In a world that is increasingly cloud-based and data-driven, data access and integration technology has to be delivered as a service that is accessible to anyone who needs it, rather than only to the few practitioners who toil away in the back room. The service has to be always on and web-scale to handle the elastic integration demands of the modern enterprise. On-premises data integration technology, with its long release cycles, complex and costly upgrades, and general administration burden, cannot deliver the agility and speed that the modern enterprise needs. A new class of users, sometimes referred to as “citizen integrators,” has made self-service essential, and only a SaaS-based approach with simplified design, management, and monitoring interfaces can serve this broad spectrum of users and requirements.

These new requirements have given rise to a new category of integration called integration platform as a service (iPaaS), which should be built from the ground up to address both new and legacy enterprise application and data integration needs.

Viakom Krishnan is vice president of engineering at SnapLogic.