Getting Started with Big Data Projects


Q&A with Think Big Analytics' Ron Bodkin

Ron Bodkin founded Think Big Analytics to help organizations gain value from big data. Before that, he was vice president of engineering at Quantcast, where he led the data science and engineering teams deploying Hadoop and NoSQL for batch and real-time decision making. In this interview, Bodkin, who is CEO of Think Big, now a Teradata company, discusses the challenges organizations face in getting big data projects off the ground and what they need to consider when they embark on projects to leverage data from the Internet of Things and social media.

How far along is the industry with regard to big data maturity?

Ron Bodkin: The first adopters, the most mature users of the open source big data platforms, are web-scale companies - whether it is Quantcast, where I was VP of engineering before starting Think Big, or LinkedIn, Google, or Facebook. Those are the kinds of companies that have gotten to be quite mature and sophisticated.

Over the last couple of years, we have also started to see the broader adoption and success in more traditional enterprises with a greater range of use cases.

Such as?

RB: We have been able to help customers that are not startups - in high-tech manufacturing, as well as in online and media. These are companies that have been around for a while and are starting to get their first meaningful business applications into the hands of larger numbers of users, which is one big threshold. There are a lot of organizations that are doing starter projects to build a data lake and load data into a big data environment, but they aren’t really achieving a business result yet. That is a big threshold - is the organization really getting new insights, driving new analytics, and doing new things with the data? And then, how broad is that access?

Are midsize companies getting on board?

RB: It tends to be larger companies that have embraced big data, but you do see midsize companies that are specialists in digital data - whether it is an advertising, online, or e-commerce business model - and they have a lot of data. Those are some of the smaller businesses that have enough at stake that they need to move quickly and embrace these technologies. That tends to be where you see smaller organizations and the Global 2000 investing in these technologies.

How well-understood is the data lake approach?

RB: If it is not universally cited as a term, it is certainly well on its way as a concept for storing a variety of data in a common place, where the data is less governed. That to me would be the high-level definition. There is a lot of excitement about it. When you relax constraints - the time and complexity of managing data and the metadata management of a traditional database environment - the benefits are clear, but there is also a cost. A lot of organizations are excited about building data lakes, but then they don’t have a plan for how they can take that data and start to refine it and govern it so they can do useful things once they have proven value.

But it is generally agreed that it involves Hadoop?

RB: Yes, I would say that Hadoop is the only technology that is really being used to build data lakes. Anything else that could have been considered is not generally being embraced; I don’t think there is anybody trying to build a data lake on any other environment - or, at least, not many.

Are most big data efforts actually integration projects?

RB: The vast majority of our customers are coming to us with existing assets and capabilities around data storage and analytics. There are cases where we have had greenfield opportunities. For example, we are dealing with a large publishing company that is starting up a new in-house agency. The company wants a strategy and roadmap to build the technology support for that new group the right way. But even there, they will have to integrate with existing assets from other parts of the company. In most cases, people are saying, let’s layer this in and do some net-new analytics to create value on top of what we already have.

Is most existing technology relational?

RB: The majority of companies have existing relational technology; they may well have more traditional analytic grids, things like SAS or SPSS, and they typically will look to have us guide them on a strategy and roadmap.

Where do organizations typically start?

RB: Often, organizations will start with Hadoop because it is easier to do some offline analytics and processing before they get into using NoSQL for real time, which tends to be for more advanced use cases.

Some of our customers, though, have already started down a path; they have already done something with Hadoop, such as a technology proof of concept, or they may have done experiments from a business standpoint in siloed groups. In those cases, the customers have already made some level of effort and investment in the space, and they are looking to kick it up to production level and start to get real business value. In a few cases, we have even done work with pioneers who started working with Hadoop and had a meaningful business use but wanted to update and extend it to new use cases.

Is there a common characteristic for success?

RB: There are a number of important elements. One is always having the right level of executive sponsorship, and having business and technology sponsorship that is aligned and working together is critical. Big data is not a “technology” problem; you need to have a business group that invests in getting analytics output for the business.

Which approaches are less successful?

RB: We have seen technology organizations try to tackle big data projects the way they would have tackled projects in the last decade - as if it is a relatively small change in technology with not a lot of difference in business process, and the goal is just to minimize cost. They want to ship work offshore, or try to handle it like a technology that has been in the market for 20 years. We see those attempts flounder. Afterwards, they conclude that the technology is not mature and usable. But the reality is that they just followed the wrong approach to making the technology successful.

What else?

RB: Those two are probably the most significant scenarios that we see around adoption of new technologies in general. The last element is to crawl, walk, run: start off with some low-hanging-fruit use cases to get value. Start with projects that are relatively easy to achieve, and then scale up to more ambitious goals.

How do they do that?

RB: Most organizations start by getting value out of some basic analytics on their big data datasets before they get to advanced data science models and can think about how to change their business model and add new business lines. That will often mean taking customer behavior data or product behavior data and just starting to get insight into what is going on in a way that was never feasible with traditional technologies. They do that before they start building advanced models on those datasets to predict success or failure of those products, or to try to develop more sophisticated next-best offers for customers.

What would be next?

RB: The next phase might be efforts like creating a revenue stream by selling benchmark data and recommendations to customers, or selling value-added data about consumers to third parties. Those are examples of even more sophisticated use cases.

So, bottom line, to be successful, companies should start small?

RB: Start with a strategy that covers what you are trying to achieve over the medium term, with a roadmap of capabilities from low-hanging fruit to more advanced capabilities. We have seen that approach successfully guide investment. With that broad plan in place, there tends to be an agile approach, and the plan can be adjusted. It is that notion of planning, then course-correcting and learning quickly as you go.

Related Articles

Teradata is introducing new big data applications that incorporate recently acquired capabilities and technologies from Revelytix and Think Big Analytics. The new offerings include solutions targeted to specific verticals such as retail and healthcare, new Teradata Loom 2.4 capabilities to expand the depth and breadth of metadata in the data lake, and a new fixed-price/fixed-time-frame data lake optimization service offering. The new products and services are targeted at extending big data competency to more non-data-scientist users and helping companies gain additional value from their data lake projects, said Chris Twogood, vice president of product and services marketing at Teradata.

Posted February 11, 2015