Page 2 of 2

Technologies and Skills that Build the Foundation for Data Management

Ultimately, for the success of analytics and real-time solutions, data needs to be trusted. “Most analytics technologies fall down in this area,” said Pasqua. He noted that “the goal of many real-time analytic processes is to determine as much as possible about an individual entity as opposed to a population. While many people think about analytics in terms of statistics over large groups of data, in real-time analytics, you often want to be able to scope your analysis to a very fine target.”

Microservices and Containers

Containers and microservices play a key role in helping to achieve agility in hybrid cloud or on-premises environments, industry observers agree. “Containers and microservices were born out of the cloud environment and are critical components to help developers be more agile,” said Jason McGee, IBM fellow, vice president, and CTO for IBM Cloud Platform. “It’s all about enabling developers to progress and iterate quickly. Developers have to spend a lot of time setting up the environments that support their application, installing and configuring software, setting up infrastructure, and moving applications between development, test, and production systems. Containers solve this challenge by standardizing how developers package their applications and dependencies, making it super simple to create, move, and maintain applications and allowing more time for what developers really want to do, which is create.” Keep agreed that containers provide much-needed application portability, “making it simpler to move services between on-prem and cloud environments, facilitated increasingly by the public cloud vendors rolling out container services.”

For their part, microservices contribute to agility “by enabling the formation of smaller teams that do not have to coordinate as much with the larger organization,” McGee continued. Keep added that “the large, monolithic code bases that traditionally power enterprise applications make it difficult to quickly launch new services. In the last few years, microservices—often enabled by containers—have come to the forefront of the conversation. Containers work very well in a microservices environment as they isolate services to an individual container. Updating a service becomes a simple process to automate and manage, and changing one service will not impact other services.”

Containers and microservices may go together, but are not joined at the hip. “Just to be clear, containers are not required for microservices, nor are microservices required for containers,” said Mayuram. “While it’s correct that both containers and microservices are frequently used together in today’s modern web, mobile, and IoT applications, they are not a requirement for each other.”

Flexibility and adaptability are critical to container and microservices success. “Choose a database that meets the requirements of microservices and continuous delivery,” Keep said. “When you build a new service that changes the data model, you shouldn’t have to update all of the existing records, something that can take weeks for a relational database.” Instead, Keep noted, it is important “to ensure that you can quickly iterate and model data against an ever-changing microservices landscape, resulting in faster time to market and greater agility.”
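Keep's point about schema flexibility can be sketched in plain Python (illustrative only, not any particular database's API): records written by older and newer service versions coexist in the same collection, and readers supply defaults rather than migrating every existing record.

```python
# Illustrative sketch of a document-style collection: records written by
# different versions of a service live side by side with different shapes.
orders = [
    {"id": 1, "total": 30.0},                     # written before the change
    {"id": 2, "total": 45.0, "currency": "EUR"},  # written by the new service
]

def order_currency(order):
    # Old records lack the new field; default instead of migrating them all.
    return order.get("currency", "USD")

print([order_currency(o) for o in orders])  # ['USD', 'EUR']
```

The new service ships immediately; older records are interpreted on read, which is the "iterate against an ever-changing microservices landscape" property Keep describes.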

One risk is the distributed nature of microservices, Keep said. “There are more potential failure points. Microservices should be designed with redundancy in mind.”
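The redundancy Keep recommends can be sketched as a simple client-side failover loop; the replica functions below are stand-ins for real service endpoints, and the whole thing is a minimal illustration rather than a production pattern.

```python
# Hypothetical sketch: try redundant replicas of a service until one responds.
def call_with_failover(replicas, request):
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            last_error = exc  # this replica is down; try the next one
    raise RuntimeError("all replicas failed") from last_error

def down(_req):
    raise ConnectionError("replica unreachable")

def up(req):
    return {"status": "ok", "echo": req}

print(call_with_failover([down, up], "ping"))  # {'status': 'ok', 'echo': 'ping'}
```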

Automation is also essential to these environments, he added. “With a small number of services, it is not difficult to manage tasks manually. As the number of services grows, productivity can stall if there is not an automated process in place to handle the growing complexity.” Finally, he advises, “learn from the experiences of others.”

Spark Versus Hadoop

While Hadoop has emerged as a popular open source framework in recent years, another contender, Apache Spark, is stealing its thunder. “Our customers, especially those who are building newer big data projects, tend to choose Spark over Hadoop for big data processing,” said Couchbase’s Mayuram. “Spark performs better, is easier to manage, and provides additional functionality like machine learning, which tends to make it much more attractive than Hadoop for big data processing.”

Hadoop is dying in the enterprise, Glickman agreed. “Hadoop-based projects are slowly failing and will eventually be replaced with cloud-based services that are better suited to the tasks Hadoop tried to solve on-premises. Apache Spark, on the other hand, is thriving. By being data-source agnostic by design, Spark never had a tight coupling to Hadoop, or more precisely, HDFS.”
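Part of Spark's appeal is its concise functional pipeline style (transformations such as map and filter followed by an aggregating action). That style can be mimicked in plain Python for illustration; no Spark cluster is assumed here, and real Spark would distribute the same logic across many machines.

```python
from functools import reduce

# Plain-Python sketch of the filter/reduce pipeline style that Spark's
# RDD and DataFrame APIs popularized (no Spark installation assumed).
events = [("click", 1), ("view", 3), ("click", 2), ("view", 1)]

clicks = [count for kind, count in events if kind == "click"]  # filter step
total_clicks = reduce(lambda a, b: a + b, clicks, 0)           # reduce step

print(total_clicks)  # 3
```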

Some industry observers, however, believe Spark and Hadoop can coexist and deliver impressive synergies. “We don’t view this as a Spark-versus-Hadoop debate,” said Mahmood. “We believe that analysts and data scientists require a centralized platform to develop predictive applications. Apache Hadoop provides this foundational platform for big data processing with HDFS for storage and YARN for compute management. We believe that Apache Spark is more effective when it operates as part of a Hadoop platform. With the burden of the platform being taken care of by Hadoop, data scientists can be more productive by simply focusing on building predictive applications.”

Open Source

Open source is also gaining traction and, in particular, a number of key Apache projects are getting a foothold in the enterprise. “We often see different technologies being brought in to address application development, data management, and operational challenges,” said Mayuram. Some of the more common Apache projects that Couchbase sees within enterprise customers are Spark, Kafka, ActiveMQ, Flume, Arrow, TomEE, Web Server, Cordova, Axis, ZooKeeper, Mesos, Groovy, Commons, OpenJPA, ServiceMix, Zeppelin, and Lucene.

Mahmood sees another solution, Apache Ranger, also gaining traction among enterprises that “are increasingly concerned about providing secure and authorized access to data such that it can be widely used across the organization, while also keeping sensitive information safe. Apache Ranger is being used by some of the largest companies across industries to provide a framework for authorization, auditing, and encryption and key management capabilities across big data infrastructure.” Other open source tools include Apache Atlas, which addresses data management and governance, and Apache Zeppelin, which ensures that “access to data is democratized and citizen data scientists can use a web-based tool to explore data, create models, and interact with machine learning models,” Mahmood stated.
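The authorization model Mahmood describes (policies mapping roles to permitted actions on resources) can be illustrated with a minimal check. This is a generic sketch of the idea, not the actual Apache Ranger API; the policy entries and names are invented for illustration.

```python
# Illustrative role-based policy check in the spirit of centralized
# authorization frameworks (not the real Apache Ranger API).
POLICIES = [
    {"role": "analyst", "resource": "sales_db", "actions": {"read"}},
    {"role": "admin",   "resource": "sales_db", "actions": {"read", "write"}},
]

def is_authorized(role, resource, action):
    # Allow only if some policy grants this role this action on this resource.
    return any(
        p["role"] == role and p["resource"] == resource and action in p["actions"]
        for p in POLICIES
    )

print(is_authorized("analyst", "sales_db", "read"))   # True
print(is_authorized("analyst", "sales_db", "write"))  # False
```

Centralizing checks like this (rather than scattering them through application code) is what makes uniform auditing across a big data platform feasible.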


Blockchain

And, finally, there is an increasing role for blockchain—the global, distributed database—in today’s enterprise environments. While the direction and impact of this technology are not yet clear, blockchain promises to disrupt many data management approaches. “Blockchain technology excels at building trust between groups of inherently untrusting legal entities,” said Jerry Cuomo, IBM fellow and vice president of blockchain technologies. “If everyone trusts each other like in a private enterprise, we really don’t need a blockchain. However, every enterprise has business-to-business relationships where value is exchanged.” For example, he noted, in a supply chain, partners, suppliers, and shippers manage the exchange of goods across enterprises. “This is where blockchain shines.”
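The tamper-evidence at the heart of the trust Cuomo describes can be sketched as a hash-chained ledger: each block records the hash of the previous block, so rewriting history invalidates everything that follows. This is a minimal illustration, not a production blockchain (no consensus, signatures, or distribution).

```python
import hashlib
import json

def block_hash(block):
    # Deterministic hash of a block's contents.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain, payload):
    # Each new block commits to the hash of the block before it.
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev": prev, "payload": payload})
    return chain

def verify(chain):
    # Valid only if every block still matches the hash its successor recorded.
    return all(
        chain[i]["prev"] == block_hash(chain[i - 1]) for i in range(1, len(chain))
    )

ledger = []
append_block(ledger, {"shipment": "A-100", "from": "supplier", "to": "partner"})
append_block(ledger, {"shipment": "A-100", "from": "partner", "to": "shipper"})
print(verify(ledger))               # True
ledger[0]["payload"]["to"] = "X"    # tamper with history
print(verify(ledger))               # False
```

In the supply chain example, no single party can quietly rewrite a past handoff, which is why mutually untrusting partners can share one ledger.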
