Leveraging Containers Within the Big Data Space

Virtualization is not the hottest topic in the industry, but it has been a popular approach to deploying and managing enterprise applications. It allows a business to carve up its data center in a granular way to meet the needs of the organization. One downside is that traditional virtual machines are heavyweight: each one needs a dedicated amount of CPU and memory to function properly. In exchange, they do deliver on the ability to isolate applications from one another.

Containerization, on the other hand, is one of the hottest topics in the industry right now. It is difficult to find someone who is not talking about, or considering, using containers to deploy and manage their enterprise applications. This alternative takes a much lighter approach. A container just looks like another process running on a system; a dedicated CPU and pre-allocated memory aren’t required in order to run a container. The simplicity of building, deploying, and managing containers is among the reasons they are growing rapidly in popularity.

However, there are still details that need to be considered when putting applications inside a container. The one I want to focus on is persistence of data. Within the realm of containers, it is a generally understood best practice never to store inside a container any data that must persist beyond the life of that container.

One may ask: if the concern is writing data in the container, where the application lives, then where does the data get written? Consider a database, for instance. Most software depends on a database, which usually sits on a server separate from the application and is often shared by multiple applications. This allows the data to outlive the servers running the business applications. The same practice applies to applications running in containers: if you want to keep the data, write it outside the container. Most business applications write log files, typically to a path in the file system. Containers can mount external storage, so they can write their log files outside of the container.
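As a minimal sketch of that practice, the Python snippet below sends an application's log file to a directory chosen at deploy time rather than to a fixed path inside the container. The environment variable APP_LOG_DIR, the function build_logger, and the logger name "webapp" are hypothetical names chosen for illustration; the assumption is that whoever runs the container mounts external storage at that directory.

```python
import logging
import os
import tempfile
from pathlib import Path


def build_logger(log_dir: str, name: str = "webapp") -> logging.Logger:
    """Return a logger that appends to <log_dir>/<name>.log."""
    path = Path(log_dir)
    path.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger

    # Attach the file handler only once per logger name.
    if not logger.handlers:
        handler = logging.FileHandler(path / f"{name}.log")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


# APP_LOG_DIR is a hypothetical variable: it is assumed to point at a
# directory backed by storage mounted from outside the container.
# The fallback to the system temp directory is only for local runs.
LOG_DIR = os.environ.get(
    "APP_LOG_DIR", os.path.join(tempfile.gettempdir(), "app-logs")
)

if __name__ == "__main__":
    log = build_logger(LOG_DIR)
    log.info("request served")
```

The application code stays the same wherever it runs; only the value of the variable, and what is mounted behind it, changes between environments.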

This matters in the big data space because all of these business applications running in containers are fully capable of writing their data directly to a company’s converged data platform via the standard file system. Once the data is written to the file system, it is immediately available on the same platform where the analytics are performed. No data movement or transformations are required.

Imagine for a moment that the web servers running your business could, without any changes to the software, write their logs directly to the platform used for your big data analytics. Now think about being able to run a SQL query on those logs in place, on demand, with no additional work. There is no longer any need to figure out how to move the logs off the web servers. In this situation, there is no latency between the time the logs are created and the time they can be consumed. There is no need to wait for log collectors to move the data and make it available for queries.

Let’s expand on this further: If there are applications using a database for persistence and the database runs on a converged data platform, then all of the data in the database, along with all the other data in the platform, can be combined on the fly for any analytics needs without moving or transforming the data from one database to another. Scaling a database application becomes as easy as spinning up more instances of the container with the application and ensuring that those instances are load-balanced.

Further, a message-driven architecture that decouples communication between components is a powerful option for complex data workflows. Multiple containers can together deliver the capabilities an application needs. The communication between the containers can be decoupled via a message stream, and each container can then be scaled independently of the others to meet the required service level. The greatest benefit is that the software does not need to be re-architected in order to attain high service levels.
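The decoupling idea can be sketched in a few lines of Python. This is only an in-process illustration under stated assumptions: a queue.Queue stands in for the message stream and threads stand in for containers, whereas a real deployment would use the platform's own streaming service and separate container instances. The function name run_pipeline and the upper-casing "work" are hypothetical.

```python
import queue
import threading


def run_pipeline(messages, worker_count):
    """Fan messages out to `worker_count` consumers through one shared stream."""
    stream = queue.Queue()  # stand-in for a message stream
    results = []
    lock = threading.Lock()

    def consumer():
        # Each consumer knows only about the stream, not about the producer.
        while True:
            item = stream.get()
            if item is None:  # sentinel: no more work for this consumer
                break
            with lock:
                results.append(item.upper())  # placeholder for real work

    workers = [threading.Thread(target=consumer) for _ in range(worker_count)]
    for w in workers:
        w.start()

    # The producer publishes without knowing how many consumers exist.
    for m in messages:
        stream.put(m)
    for _ in workers:
        stream.put(None)  # one sentinel per consumer
    for w in workers:
        w.join()
    return sorted(results)
```

Calling run_pipeline with a worker_count of 1 or of 3 yields the same set of results; only the number of consumers draining the stream changes, which is the sense in which the consuming side can be scaled without touching the producer.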

Combining containers with a converged data platform enables a breadth of high-scale, low-latency applications. This combination enables containers to serve as a key building block for the digital transformation.

