DevOps for Big Data

Despite the large investments that organizations are making in big data applications, difficulties still persist for developers and operators who need to find efficient ways to adjust and correct their applications’ code. To address these challenges, Pepperdata has introduced a new product based on the Dr. Elephant project that gives developers an understanding of bottlenecks and provides suggestions on how to fix them throughout the big data DevOps lifecycle.

Pepperdata CEO Ash Munshi recently discussed the need for DevOps for big data, and the role of Dr. Elephant, which was open sourced in 2016 by LinkedIn and is available under the Apache v2 License.

What is happening in the big data space now?

We are seeing more and more customers going to production with big data. Our company tripled its growth last year and we have a nice trajectory planned for this year as well. Our customers are proof that the technology and the solutions are leaving the lab and actually becoming business-critical.

What is changing as customers go into production with big data?

When we think about big data going into production, there are three big components. The first is making sure that things are reliable; the second is that they scale; and the third is that they perform. It is the performance aspect that we focus on as a company. The reason that performance is so hard for big data is that you are dealing with hundreds or thousands of computers, you are dealing with datasets that are usually two orders of magnitude larger than what classic IT had to deal with, and you are dealing with data that changes rapidly. And then, you are dealing with a lot of people who are doing things simultaneously. They are doing interactive work and decision support all on the same machines. That combination of variables is very hard to get your hands around, and the performance implications of that are even more difficult to understand. That is really why performance is such a big deal for big data. We like to say that performance can mean the difference between business-critical and “business-useless” for big data systems.

How is DevOps for big data different?

Classical DevOps was all about creating velocity between the business requirements and needs, the developers writing the code, and the systems that actually embody the code and solve the business problems. It was all built around processes like Agile, continuous integration, and continuous delivery. That is all well and good, but for big data there is another big component, which is this performance aspect. We believe that performance needs to be a first-level player in DevOps for big data.

What does this involve?

Taking information about how things are actually performing and providing feedback to earlier parts of the DevOps chain is vital for big data. In particular, you can collect information about resource utilization, contention for the resources, the applications, and whether the places where they are deployed match or don’t match. Giving that back to the people who do the release part of it, and being able to say that the developers made a set of changes but they might be detrimental to performance, is an important aspect of feedback. In addition, you can go back to the developers and say that they might want to change their algorithm because it is not using the cluster efficiently or it is taking up too many resources. And you can go back to the people who actually provisioned the cluster and say that the assumptions they made around the number of users, data volume, and the workload are not actually resulting in the response times they are expecting. These are all important feedback loops back into the DevOps chain, and they are vital for big data. That is really our fundamental thesis.

How do you address these requirements?

The products that we have today—the Cluster Analyzer, the Capacity Optimizer, and the Policy Enforcer—provide that type of feedback to the operators.

The Cluster Analyzer gathers all of the data and answers questions about which resources are being used for what and how they are correlated. The Capacity Optimizer takes automated action: it determines that additional resources are available now to run more jobs, or to run them quicker if more resources are allocated to them, so it does automated analysis using machine learning to use the resources better. And then the Policy Enforcer guarantees that the important jobs are never starved for resources. They are all focused on providing performance feedback to the operators.

What are you adding?

With a new product, which is the Application Profiler, we are providing that performance feedback all the way up to the developers. We heard from operators that they wanted feedback to be provided to the developers because if the developers can make changes in their code, it will make the operators’ jobs much easier. That is why we embraced it and why we are going into that direction.

What does it do?

Strategically, the more we can provide to the developers, the more issues we can catch at an earlier stage, which in turn means that there are fewer problems in production later on. The Application Profiler takes the data that was gathered and provides recommendations to the developers to make changes in their code so the code will run more efficiently.
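
To make the idea of turning gathered data into a recommendation concrete, here is a minimal sketch in Python. It is not the Application Profiler's or Dr. Elephant's actual code or API: the names `TaskSample`, `Recommendation`, and `check_memory_allocation`, the 50 percent waste threshold, and the 20 percent headroom are all assumptions chosen for illustration. It shows one plausible kind of check, comparing the memory a job's tasks requested with the peak memory they actually used.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskSample:
    """Hypothetical per-task measurement collected from a finished job."""
    requested_mb: int   # memory the task asked the cluster for
    peak_used_mb: int   # highest memory the task was observed using


@dataclass
class Recommendation:
    """Hypothetical result handed back to the developer."""
    flagged: bool
    message: str
    suggested_mb: Optional[int] = None


def check_memory_allocation(tasks: List[TaskSample],
                            waste_threshold: float = 0.5) -> Recommendation:
    """Flag a job whose tasks left most of their requested memory unused.

    The threshold and headroom are illustrative assumptions, not values
    taken from any real product.
    """
    requested = sum(t.requested_mb for t in tasks)
    if requested == 0:
        return Recommendation(False, "no memory data collected")

    used = sum(t.peak_used_mb for t in tasks)
    wasted_fraction = 1 - used / requested

    if wasted_fraction >= waste_threshold:
        # Suggest the largest observed peak plus 20% headroom (integer math).
        suggested = max(t.peak_used_mb for t in tasks) * 120 // 100
        return Recommendation(
            True,
            f"{wasted_fraction:.0%} of requested memory went unused; "
            f"consider requesting about {suggested} MB per task",
            suggested,
        )
    return Recommendation(False, "memory request looks reasonable")


if __name__ == "__main__":
    job = [TaskSample(4096, 1024), TaskSample(4096, 1500)]
    print(check_memory_allocation(job).message)
```

A check like this only looks at the job in isolation. A developer would still need to know what else was running on the cluster at the time in order to judge how seriously to take the result.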

How is it deployed?

The Application Profiler is built on an open source project called Dr. Elephant that was originally started by LinkedIn. We are now actively contributing to that project and we have integrated that code into our suite of products. That means that our customers who buy Application Profiler from us don’t have to go and install Dr. Elephant on a separate cluster with a separate user interface. It is provided as a software-as-a-service solution that is integrated into our suite.

The importance of integrating this into our suite is that in addition to providing recommendations to the developer, it is critical for the developers to understand the context that the jobs ran in. It is necessary for the developer to understand what was happening on the cluster at the time that the job was running in order for them to be able to determine how seriously to take some of these recommendations.

Dr. Elephant by itself doesn’t provide that, but by integrating it into our dashboard and our Cluster Analyzer, we are able to provide that context, which makes Dr. Elephant much more powerful. On top of that, we take the headache of deploying and supporting it away from mere mortals, so to speak. It is the integration, plus the hosted solution, that is the power of the Application Profiler.

What’s next?

We are contributing everything we are doing back to the community and we will continue to do that. The heuristics are going to be contributed back to the main Dr. Elephant code base. We think that is a really important step to take: the community will benefit and, as the community makes changes, we will also benefit. LinkedIn has embraced us, we have embraced them, and others have started to join as well.

Application Profiler early access is going on now, and it will be generally available in Q2 2017.

