As companies grow increasingly data-centric in their decision making, product and service development, and their overall understanding of the world they work in, speed and agility are becoming critical capabilities. A common theme in big data and analytics today is “Industry 4.0,” representing a new wave of technology that enables the automation necessary for scaling. There’s compelling justification for this as companies seek to unlock business value from big data with two broad approaches: the democratization of data, with greater access for more users, and the enablement of automation everywhere possible.
There is a need to connect the dots at scale—using IoT sensors for real-time predictions, tapping into social sentiment for deeper understanding, and mining customer touchpoints for personalized marketing in CRM. Manual processes simply aren’t sufficient. Given the need for customer-centricity and the timely delivery of relevant services and products, data science and artificial intelligence (AI) are the only ways to tackle the vast quantity of data that is constantly changing and evolving in context. Cloud platforms and data science programs also continue to grow rapidly as new use cases drive capabilities and adoption.
AI, Data Integration, Big Data Engines, and Cloud
Today, many new technologies and capabilities are in play. They can be loosely broken into four broad categories. Arguably, the leading trend in 2017 is the adoption of AI, both in the enterprise and within consumer products, in order to achieve greater efficiency and move to probabilistic decision making and recommendations. Data integration, which has been the mainstay of the data warehousing and business intelligence era, is evolving too, driven by the need for data to be understood and combined rapidly from disparate sources to support new business models, including those yet to be developed. In addition, modern data platforms require data engines that can work with data in elastic, scalable, and fault-tolerant ways—as well as leverage data lakes—because big and real-time data can’t practically be duplicated. And, finally, from an architectural perspective, public cloud platforms today are abstracting infrastructure and architecting ecosystems of managed services and data pipelines.
The Democratization of AI
The overall category of data science tools and platforms has been gaining traction for several years, but there was a major breakout this year in machine learning (ML). While data science and advanced analytics have made up a broad field of algorithms and statistical routines, ML has emerged as a focal point for significant business impact by specializing in one approach.
AI can be considered an umbrella that includes spokes such as ML, natural language processing, and machine vision. These components share the intersection of statistics, computer science, and predictive analytics. However, it’s the ML branch—including deep learning and predictive analytics—that has quickly caught the attention of companies by providing a new approach to solving complex business problems.
For complex business questions, it is challenging to build comprehensive rule-based systems: they require deep domain knowledge and expert logic, many variables must be weighed and factored in, and the rules change over time. A supervised machine learning approach instead takes records of input data paired with their known outcomes. With enough of the right input data, the ML algorithm builds its own logic and begins to predict the desired output for new records. Applied to business problems and questions, this approach produces more reliable results. The rapid rise in ML’s popularity is due in part to the availability of tools and the business necessity of solving these complex problems.
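As a minimal sketch of the supervised approach described above—assuming Python with the widely used pandas and scikit-learn libraries, and a hypothetical customer churn file with illustrative column names—the workflow looks roughly like this:

```python
# Minimal sketch of a supervised ML workflow (hypothetical churn data).
# The CSV file and column names are illustrative, not from a real dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Historical records: inputs (features) paired with known outcomes (labels).
df = pd.read_csv("customer_history.csv")  # hypothetical file
X = df[["tenure_months", "monthly_spend", "support_tickets"]]
y = df["churned"]  # 1 = churned, 0 = retained

# Hold out a test set to check how well the learned "logic" generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The algorithm builds its own rules from examples rather than hand-coded logic.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict the desired output for unseen records and measure reliability.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

The key point is that no rules are written by hand; the model's behavior is determined by the examples it is trained on, which is why the quantity and quality of input data matter so much.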
Because AI and ML need training and feedback to learn and evolve, these technologies have also boosted the requirement for human-to-machine interfaces. When keyboards are not fast enough, human speech becomes the fastest way for machines to learn. AI also requires better computing horsepower, as well as better microphones and speakers on smart devices, enabling personal assistants, such as Alexa and Siri, to learn from us.
While massive data centers exist with elastically available CPU cores, a different kind of chip—the graphics processing unit (GPU)—is much faster and more efficient at executing the specialized tasks of AI processing. We are also seeing the rise of databases and data science tools specifically built to take advantage of these GPUs, along with GPUs becoming available in public cloud platforms such as Amazon and Google to enable more widespread use of these applications.
The Evolution of Data Integration
Data integration has matured from the days of hand-coded data transformations to component-based graphical developer environments and server-side processing and scheduling. We’ve seen extensions in the form of data abstraction tools and in-database processing capabilities, but many of these still work within the same paradigm: You need to understand and define your transformations before you can execute them. The need to go faster has been manifested over the decades as architected data marts, agile BI development, and data visualization and analytics tools that extract and integrate data locally.
With digital disruption resulting from big data technologies touching upon every industry, the need to go faster is compounded by the need to perform data discovery, determine what data is useful and what it means, and figure out what innovative business models can be developed to stay competitive. Accordingly, data integration is giving way to self-service data discovery and preparation capabilities that enable the people who know the business environment to work with data and realize insights. These people are proving that the “You know it when you see it” adage does hold true.
As we relate this back to our original business drivers for scalability and automation, it is clear that new integration technologies are being delivered to help handle the “unknowns” that accompany a self-service environment in which everyone in the enterprise—not just the power users, IT developers, and business analysts—works with data in their everyday business life. This is a big part of being truly data-centric. It requires tools for finding and understanding (profiling) data, integrating it with existing enterprise datasets as well as external data, and producing reusable resultant datasets to drive actions in the business. All of this needs to be turned loose in the enterprise within established working processes such as enterprise search, collaboration, governance, and security. Additionally, AI and ML are now often embedded in many of these technologies to assist with assessing data quality, recommending integrations, and learning from user behavior patterns at scale.
Another trend in working with big data has been to stream data into modern data platforms with open source tools such as Apache Kafka, NiFi, and Flink. With the “Touch it once” mantra in big data, data pipelines facilitate inline processing for the delivery of data to multiple consumers, including repositories and data science routines. Furthermore, Apache Spark clusters are becoming a very popular integration hub for transforming data with programming languages such as Java, Python, and Scala.
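A rough illustration of this kind of “touch it once” pipeline—a sketch only, assuming PySpark with the Kafka connector on the Spark classpath, and with broker address, topic name, and storage paths that are purely hypothetical—reads a stream once and delivers it to a downstream consumer:

```python
# Sketch: read a Kafka topic with Spark Structured Streaming and land it in a
# data lake sink. Broker, topic, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("ingest-events").getOrCreate()

# Subscribe once to the source stream.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical
          .option("subscribe", "clickstream")                 # hypothetical topic
          .load()
          .select(col("key").cast("string"), col("value").cast("string")))

# Deliver the stream inline to a downstream consumer (a Parquet data lake here).
query = (events.writeStream
         .format("parquet")
         .option("path", "/data/lake/clickstream")            # hypothetical path
         .option("checkpointLocation", "/data/checkpoints/clickstream")
         .start())

query.awaitTermination()
```

The same stream could feed additional sinks—an analytics database, a data science routine—without the data being re-extracted from the source.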
Driving Analytics and Big Data Engines
Hadoop has been practically synonymous with big data and last year celebrated its 10th anniversary. Since its inception, Hadoop has matured with help from a community of developers and the support of Cloudera, MapR, and Hortonworks to deliver resource management, SQL, and many other features for analytics. However, despite using massively parallel commodity servers to meet performance requirements, big data platforms have struggled to deliver both big and fast data. SQL and NoSQL databases aside, even on high-performance solid state drives, distributed in-memory data access is still required for high performance. Apache Spark is now the popular distributed platform for in-memory processing thanks to its resilient distributed datasets (RDDs) and DataFrames. Moreover, Spark has become an ideal environment for processing streams of data or fetching data from Hadoop and other databases with Spark SQL.
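For example, a minimal sketch of fetching Hadoop-resident data into cluster memory with Spark SQL might look like the following; the database, table, and column names are hypothetical, and enableHiveSupport() assumes a configured Hive metastore:

```python
# Sketch: pull data from Hadoop (HDFS/Hive) into an in-memory DataFrame for
# fast, distributed analysis. Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("in-memory-analytics")
         .enableHiveSupport()
         .getOrCreate())

# Query an existing Hive table stored on HDFS.
orders = spark.sql("SELECT customer_id, SUM(amount) AS total "
                   "FROM sales.orders GROUP BY customer_id")

# Cache the result in cluster memory so repeated analytics avoid re-reading disk.
orders.cache()
orders.show(10)
```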
While in-memory databases have historically suffered from the stigma of being price-prohibitive and therefore hard to justify, their time has come as memory prices have dropped. And, with the additional capability of easily running an in-memory database on a distributed shared-nothing or sharded architecture of servers in a cluster, large in-memory databases can now scale smoothly and accept data streams. Some vendors offer MPP in-memory databases both on-premises and in the cloud, but they typically differ in how data is stored in memory (in rows or columns), how data is persisted on SSD drives, and how the systems handle transactional processing.
Taking performance one step further are the new GPU-accelerated databases built for AI processing. Not only does this specialized chip lend itself to the intensive processing that AI routines execute, but it has also been helpful in some of the complex areas of BI, such as geospatial analytics, predictive analytics, and data science. GPU-optimized databases support distributed in-memory clusters and streaming data ingestion from several data integration vendors and open source tools such as Apache Storm, Kafka, NiFi, and Spark Streaming. Of course, these databases are hardware-dependent (they must have GPUs), but all three major public cloud vendors (Amazon Web Services [AWS], Microsoft Azure, and Google Cloud) have GPU options available.
Faster at Scale With Cloud Platforms
While the most mature public cloud platforms continue to be AWS, Azure, and Google Cloud, there has been plenty of continued innovation in the platforms. AWS also recently celebrated its 10th birthday, and its momentum continues to grow, as does the rate of new services being released. AWS is not alone in this regard, as both Azure and Google Cloud are rapidly adding innovation to their offerings as well. Many of these enhancements deal with AI and ML services that leverage GPUs and make development faster and easier when there are massive amounts of data to process. Cloud platforms have become business analytics enablers by allowing any company to scale effortlessly and affordably.
In particular, a significant cloud trend is the growth of cloud-based data lakes. A few years ago, the embattled data lake concept was deployed mostly in Hadoop clusters and the Hadoop Distributed File System (HDFS). Now, cloud platforms offer very low-cost, distributed storage without data ingestion network costs so that companies no longer have to worry about growing their on-premises Hadoop clusters—or even managing a cluster at all. The security features and lower barrier to adoption in the cloud have accelerated data lake adoption overall. Working with data files in the data lake is easy with schema-on-read SQL engines, compute-on-demand processing, or applications built in Spark.
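To give a sense of how lightweight schema-on-read can be, here is a minimal PySpark sketch; the object storage bucket and field names are hypothetical, and the appropriate cloud storage connector is assumed to be available on the cluster:

```python
# Sketch: schema-on-read against a cloud data lake. Files are stored cheaply
# as-is, and structure is applied only at query time. Bucket name and fields
# are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-query").getOrCreate()

# Infer a schema from raw JSON files at read time (no upfront ETL or load step).
events = spark.read.json("s3a://example-data-lake/iot/events/")  # hypothetical

events.createOrReplaceTempView("events")
spark.sql("SELECT device_id, AVG(temperature) AS avg_temp "
          "FROM events GROUP BY device_id").show()
```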
However, after years of new services development, the architectures around cloud services have become more complex. Cloud advances have also given rise to a new job role that many are referring to as the “ecosystem architect.” Similar to the “enterprise architect,” this new EA focuses on evaluating a cloud platform’s services and products for compatibility and integration. Today, there are architects who select cloud services whenever possible, leverage on-demand elasticity, and, therefore, are able to focus on the data work at hand.
Some companies are even tackling multi-cloud architectures that leverage the best features of a particular cloud platform, accommodate the preferences of different internal company initiatives, or simply satisfy their concerns about single-cloud-platform lock-in. The multi-cloud architecture also provides an approach to what many have referred to as “data gravity.” With data gravity, architects account for corporate SaaS business applications such as Salesforce, Workday, and Adobe that hold large amounts of corporate data and offer their own analytic solutions. Interestingly, corporate territorialism still exists in some companies, and data lakes have become the new data silos. To handle this, the concept of a logical data lake has emerged to stitch together portions of a data lake, whether on-premises or spread across multiple cloud platforms.
With the need for portability across on-premises and cloud environments, container technologies are gaining traction to enable applications to be easily deployed in any environment. Another emerging technology that is expected to make an impact in the coming years is the ability to shift on-premises workloads to and from cloud platforms as needed, driving further abstraction of the computing infrastructure. Moreover, as IoT data processing increases big data by another order of magnitude, emerging technologies that enable processing “at the edge” and avoid the streaming and centralizing of big data entirely will also become important.
What Does It All Mean?
With the transformative potential of Industry 4.0 technologies, companies are putting a sharp focus on how to engage people and generate the data they need rather than looking for it. In the process, they are demanding faster development and more agility when it comes to leveraging big data at scale and are increasingly tapping into automation and AI. These capabilities enable data science efforts to reveal customer behaviors and preferences based on data from more engagements and touchpoints, for example. Better understanding of customers leads to better and more relevant products, and therefore business growth and opportunities for true innovation. Ultimately, utilizing big data technologies effectively will drive operational efficiencies and enable the insights necessary for new business models.