The rise of big data and the growing popularity of cloud is a combination that presents valuable new opportunities to leverage data with greater efficiency. But organizations also need to be aware of some key differences between on-premise and cloud deployments, says Charles Zedlewski, senior vice president, products, at Cloudera, which provides a data management and analytics platform built on Hadoop and open source technologies. Key things organizations should focus on to ensure success in the cloud are a greater degree of convenience to users and a smaller cost footprint, he advises.
How do you define Cloudera’s role?
Our mission is to help organizations profit from all of their data, and that has been our mission since the start of the company. If you go back to when Cloudera was founded, at that time, in the typical large enterprise, the fraction of data that ever got analyzed or put to productive use was incredibly small. Very little of it could conform nicely to traditional database technologies, and the cost of those technologies was too prohibitive to consider. Fast forward to today and the number of our customers that have multiple petabytes of data under active management for processing, analysis, and predictive models has grown. It is very much consistent with the mission that we have always had.
How big a part of the data management picture is cloud today?
It is the fastest-growing kind of infrastructure that we run on, and that is mostly public cloud infrastructure. We have customers that use Cloudera on all three of the big cloud providers: Amazon, Azure, and Google. And at this point, a very high fraction of our customers run Cloudera partly or entirely on the public cloud, and I think that proportion is going to grow significantly.
One of the reasons is simply the rising popularity of public cloud as an infrastructure. The second reason cloud is growing much faster for big data platforms such as Cloudera’s than for other data management technologies is that platforms such as ours were elastic from the get-go. If you think about what you want to capitalize on using cloud as your infrastructure, the big advantage is supposed to be that the infrastructure is much more flexible and much more elastic. To take advantage of that, you need your platform or your applications to be elastic. The Cloudera platform can scale to many, many thousands of instances, which is something that a lot of traditional database platforms can’t do. We can grow and shrink without taking any downtime or interrupting workloads, and this, again, is not true of more traditional database products. We are also very at home with high volumes of commodity infrastructure that fails frequently, and that is a pretty good description of what you find in a lot of cloud infrastructure. So, our platform winds up being a strong technical fit for the unique properties and advantages of the public cloud. If you look at analyst statistics, today, roughly 8% of all the compute in the world is running in the public cloud, and roughly 12% of all the analytic data management is running in the public cloud, and I can tell you that for Cloudera, we are tracking well ahead of that.
Does the cloud provide particular advantages to companies that want to leverage big data technologies such as Hadoop?
What we have tried to do with our products in the cloud where possible is further reduce the required expertise to run Cloudera. We can leverage features that the cloud providers give us to create much more standardized and simplified experiences. For example, we can pre-select and pre-optimize for a particular configuration. In the data center, we have to accept a much greater diversity of configurations since organizations buy different servers from various vendors with a mix of configurations—and we work with all of them. In the public cloud, we can give customers an assortment of options, but we can also standardize a lot more in the process to make things a lot simpler for them.
What are the challenges that customers face as they move to the public cloud?
There are many things that stay the same in the cloud and many that are different. The big aspect that is different is that in order to run any workload in the public cloud effectively you need to think about how you are going to capitalize on the fact that cloud infrastructure is elastic in order to shrink your cost footprint. If you just treat cloud like a collection of fixed servers, you will find that running in the cloud can get pretty expensive. The way you can make the cloud equal to, or less expensive than, what you are doing in the data center is by looking for opportunities to shrink your footprint in terms of your infrastructure cost.
For example, if a customer is running our platform and only needs to run it for half of the day—let’s say, there is a data mart that needs to be available to users for 10 hours out of the day—we can shut down the cluster for the off-hours, and then regenerate the cluster whenever the users need it again. In the process, we can reclaim a significant amount of infrastructure cost.
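The arithmetic behind that data mart example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node count and hourly rate are hypothetical figures, not Cloudera or cloud-provider pricing, and the sketch assumes simple per-node-hour billing while ignoring storage costs and the time needed to regenerate the cluster.

```python
HOURS_PER_DAY = 24


def daily_cluster_cost(nodes, hourly_rate_per_node, hours_running):
    """Cost of one day for a cluster billed per node-hour (assumed billing model)."""
    if not 0 <= hours_running <= HOURS_PER_DAY:
        raise ValueError("hours_running must be between 0 and 24")
    return nodes * hourly_rate_per_node * hours_running


def savings_fraction(hours_needed):
    """Share of compute cost avoided by running only when users need the cluster."""
    return (HOURS_PER_DAY - hours_needed) / HOURS_PER_DAY


# Hypothetical figures, chosen only for illustration.
nodes = 20
hourly_rate_per_node = 0.50  # assumed price per node-hour
hours_needed = 10            # the 10-hour data mart window from the example

always_on = daily_cluster_cost(nodes, hourly_rate_per_node, HOURS_PER_DAY)
on_demand = daily_cluster_cost(nodes, hourly_rate_per_node, hours_needed)

print(f"Always on: {always_on:.2f} per day")
print(f"On demand: {on_demand:.2f} per day")
print(f"Compute cost avoided: {savings_fraction(hours_needed):.0%}")
```

Under these assumptions, a cluster that runs 10 hours instead of 24 avoids 14 of 24 node-hours, or roughly 58% of the daily compute cost, whatever the actual rate and cluster size are.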
What remains the same?
Data security is still enormously important in the cloud, just as it was in the data center, and we have spent a lot of time integrating our data management security framework with the infrastructure security framework of the cloud providers. Data governance is also still hugely important because there are more policies that companies are applying to their data and they need a mechanism to track and enforce them and stay compliant with all kinds of regulations such as GDPR. And then the other aspect of data management in the cloud that stays the same is the business problem that’s being solved with data, and the techniques such as BI, predictive analytics, and real-time applications that are built to solve problems. All of that is largely the same, whether you talk about cloud or on-premise. The apps, the workloads, the users, and the requirements for security, strong governance, and tight operations are the same.
How is the hybrid approach taking shape?
We have customers today that run Cloudera in two different clouds, with some mix of Amazon, Azure, or Google, and they are asking us to support these different clouds. We have customers that have a mix of on-premise dedicated servers and public cloud, and still other customers that predominantly have private cloud, such as OpenStack. All of those different permutations are quite prevalent, and I think they are going to be increasingly common over time.
With this diversity, is there a greater recognition of the need for standards?
That is something that is going to rise in importance. When companies were 100% in their own data centers, it was a very highly standardized and highly governed world. Now, with public cloud and lots of folks that are just funding their projects in the public cloud out of their line-of-business budgets, there is a much greater emphasis on diversity of choice for different applications and teams inside of an enterprise, and so there is a lot more flexibility and a lot more freedom for these teams. I suspect, as a result, you are going to find a lot of sprawl and a lot of misallocated resources, and that will drive companies to apply more governance to some of those technology choices. But I think it is early for that and, right now, the rise of cloud adoption has been driven largely by convenience and speed.
If there were one piece of advice that you could offer to an organization that is trying to make cloud a bigger part of its mix, what would it be?
When you think about doing big data in the cloud and the transition from on-premise, the main thing to fixate on is that your wins are going to come from delivering a greater degree of convenience to users or a smaller cost footprint. If you are not able to get that, people are going to question the value of the move. A related bit of advice is that the way you find greater convenience or lower cost is by better understanding the nature of each workload. Now more than ever, it is important to really understand your workloads before you move to the cloud, and then, as you are moving to the cloud, to think about how you can run those workloads less expensively or with a greater degree of convenience for your internal users.
Interview has been condensed and edited.