Big Data at a Turning Point: Q&A with Joe Caserta


The rise of big data technologies in enterprise IT is now seen as an inevitability, but adoption has occurred at a slower pace than expected, according to Joe Caserta, president and CEO of Caserta Concepts, a firm focused on big data strategy and technology implementation. Caserta recently discussed the trends in big data projects, the technologies that offer key advantages now, and why he thinks big data is reaching a turning point.

How fast is adoption of big data technologies taking place?

I am seeing big data technologies proliferating throughout corporate America, although not as fast as I would have predicted 2 years ago. Adoption has gained a lot of momentum in recent months, and I think 2016 is going to bring a big growth spurt for the big data paradigm—and that is because we are starting to figure out a few things.

What is changing?

The adoption rate hasn't risen as quickly as I thought it would for a couple of reasons. One is that governing data within the big data paradigm has been challenging; getting the skills, resources, and staffing to actually work with it has been challenging as well. Now that big data technologies have been on the market for essentially 5 or 6 years, people are starting to learn how to use big data, which is very different from using traditional data. And I think we are going to see great growth in the coming year.

Why do you think 2016 is going to represent a turning point?

One reason is data governance. I think everyone is finally starting to understand how to govern big data, and they are also starting to realize that, when it comes to data governance, big data is not the problem. Big data is actually the change agent for making data governance better. Since the 1970s, IT has imposed very heavy-handed processes and procedures for onboarding, interrogating, and structuring data before users can analyze it.

And now?

Data has become more and more important to business growth. For an organization to understand its customers, its products, or any aspect of its business today, it needs data—so if latency keeps users from getting access to the data, they will come up with alternative solutions. What has been happening is the growth of what we call "shadow IT" departments. This has resulted in copies of data proliferating throughout the enterprise, all of it ungoverned.

How do big data technologies help?

My view is that big data—and the inception of the data lake concept—is really going to alleviate shadow IT. The data lake introduces rigorous processes for making data available as quickly as possible. For years, you had to structure your data, run your data modeling and requirements processes, know what you were looking for, write ETL processes, and load your data warehouse, and then finally, months later, the users could have access to it. That is being turned on its head with the introduction of the data lake.

Why?

The data lake allows you to ingest your data in any state, format, and structure, and then it allows trusted users—usually in the data science division or just power business users or data analysts—to analyze the data, figure out what is useful and what is not, and start getting value out of it immediately. Once they have done that, they can start taking what they have created and build structures around it so that it is available to the masses. Eventually, it does wind up in the data warehouse, but the data lake is kind of the precursor to the warehouse. It allows trusted users to have access to the data immediately. That is how the data lake is going to stop the proliferation of shadow IT and it is going to allow us to put more rigor around governing unstructured data.

What are the key technologies you see in the big data space now?

The biggest thing is cloud. The best way to take full advantage of these new technologies is by moving off in-house, on-premises iron—monolithic boxes that you may never grow into, and that are dated by the time you do. The cloud offers elasticity: you can start very small, build infrastructure only for what you need at that moment, and then grow instantaneously as you get more data. Cloud—and specifically Amazon Web Services, which is the trend I am seeing—is the hottest growth area in my business, and even outside of my business it is probably the most enabling set of technologies for big data. If you want to spin up some infrastructure, you can do that in minutes, rather than taking months to procure servers, put them in a data center, and configure and install them.
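
To make that elasticity concrete, here is a minimal sketch using boto3, the AWS SDK for Python, to launch a server on demand and shut it down when the work is done. The AMI ID is a placeholder and the instance type is only an example; nothing here is a specific recommendation from the interview.

```python
# A minimal sketch of cloud elasticity using boto3, the AWS SDK for Python.
# The AMI ID and instance type below are placeholders, not recommendations.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Spin up a single server in minutes instead of procuring hardware for months.
response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",   # placeholder machine image
    InstanceType="m4.large",  # example size; start small, grow as data grows
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]
print("Launched", instance_id)

# When the workload is done, shut it back down so you pay only for what you used.
ec2.terminate_instances(InstanceIds=[instance_id])
```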

What other technologies are becoming important?

There are two others, both related to the cloud. One is AWS Lambda. It essentially removes the need to even know about the infrastructure. Think of it as baby steps: we went from buying servers on-premises, to moving to the cloud and manually spinning up servers before processing data. Lambda removes even that step: as you process your data, it spins up the servers it needs, and when the job is over, it shuts them back down. It has been coined an "infrastructure-less environment," which is kind of a misnomer; the infrastructure is still there, but the burden of knowing about it is removed from corporate IT.
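
As a rough illustration of that model, here is a minimal sketch of an AWS Lambda handler, assuming a function subscribed to S3 upload events; the bucket contents and the processing step are hypothetical. Compute exists only while the handler runs.

```python
# A minimal sketch of an AWS Lambda handler triggered by an S3 upload event.
# AWS provisions compute only while this function runs, then releases it.
# The processing step is hypothetical.
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Each record describes an object that just landed in the bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Process the data here; there are no servers to provision or shut down.
        print(f"Processed {len(body)} bytes from s3://{bucket}/{key}")
```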

What is the other one?

Spark. Spark is gaining ground because it is incredibly flexible. It is a much faster data processor than Hadoop MapReduce or any of the other big data solutions. And you don't have to throw away anything you have already invested in: if you have a Hadoop solution, Spark fits very comfortably in that ecosystem and runs in the Hadoop environment. If you don't, it will still run. You can run Spark purely on AWS and, instead of HDFS, the Hadoop Distributed File System, you can just use S3. Spark is very portable. With all of the technologies changing, people are hesitant about being locked into a specific vendor or technology, so something portable is very appealing. I think that is one of the reasons Spark is gaining ground.
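
A minimal sketch of that portability in PySpark: the same job reads from HDFS or from S3 by changing only the path. It assumes a cluster with the Hadoop S3 connector and credentials already configured; the bucket name is a placeholder.

```python
# A minimal sketch of Spark's storage portability: the same code reads from
# HDFS or from S3 just by changing the path. The bucket name is a placeholder,
# and the cluster is assumed to have S3 credentials configured.
from pyspark import SparkContext

sc = SparkContext(appName="portability-sketch")

# On a Hadoop cluster, you might read from HDFS...
# lines = sc.textFile("hdfs:///data/events/2016/")

# ...on AWS, point the same job at S3 instead.
lines = sc.textFile("s3a://example-bucket/data/events/2016/")

print(lines.count())  # the same transformations and actions work either way
sc.stop()
```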

What else makes Spark important?

Over the decades, but more importantly with the introduction of Hadoop, there has been a great divide between the technologists and the data users. The data preparers had one set of languages and platforms for ETL and database management, and the users had different languages, primarily for business intelligence. Spark is one of the only platforms out there where it doesn't really matter whether you are an engineer or a data scientist; they are all equally happy, because Spark speaks Python, which is becoming the new language for data, and it also speaks SQL, which will always be the king of data. That is another reason that Spark is important.
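
To illustrate that dual audience, here is a minimal PySpark sketch in which the same question is answered once through the Python DataFrame API and once in SQL; the table and columns are invented for the example.

```python
# A minimal sketch of Spark serving both camps: the same data queried through
# Python's DataFrame API and through plain SQL. The table and column names
# are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("two-dialects").getOrCreate()

df = spark.createDataFrame(
    [("books", 120.0), ("music", 75.5), ("books", 30.0)],
    ["category", "amount"],
)

# The engineer's view: the Python DataFrame API.
df.groupBy("category").sum("amount").show()

# The analyst's view: the same question in SQL.
df.createOrReplaceTempView("sales")
spark.sql("SELECT category, SUM(amount) FROM sales GROUP BY category").show()

spark.stop()
```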

If you could give someone embarking on a big data project one piece of advice, what would that be?

Approach with caution, but approach nonetheless. Don't be afraid of the governance aspect; it is doable at this point, and the staffing and skills required for these new technologies are becoming more common. Training is becoming more widely available, and the open source community is sharing knowledge.


Interview conducted, condensed, and edited by Joyce Wells.



