<< back Page 2 of 2

Hadoop and Its Accompanying Ecosystem Are Here to Stay

By Jim Scott

Jan 19, 2016

JSON data models will come to the forefront within applications that leverage these big data technologies. A good example of this is Yelp, which has been managing its data in JSON format for years. While JSON has clearly been gaining momentum, the new advances in SQL query engines such as Drill enable both sides of the house to leverage the same data source. The applications being written to handle transactional workloads will utilize JSON for simplified data serialization, with as few as two lines of code in a language such as Java. The data science side of the business will be able to query the data in place, with zero latency, meaning that there’s no delay in data preparation. They will have real-time access to the data in its native form, allowing for unrivaled convenience and agility.
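As a rough illustration of the point about simplified serialization and querying data in its native form, here is a minimal Python sketch. The record, its field names, and the `high_rated` helper are hypothetical, and the filter only imitates the idea of querying JSON in place; it is not how Drill itself works.

```python
import json

# Hypothetical review record; the field names are illustrative only.
review = {
    "business": {"name": "Example Cafe", "city": "Las Vegas"},
    "stars": 4,
    "tags": ["coffee", "wifi"],
}

# Serializing and deserializing a nested structure takes one line each.
line = json.dumps(review)
restored = json.loads(line)


def high_rated(lines, min_stars):
    """Yield business names from newline-delimited JSON records,
    reading each record as-is with no separate preparation step."""
    for raw in lines:
        record = json.loads(raw)
        if record["stars"] >= min_stars:
            yield record["business"]["name"]


# The same serialized records the application wrote are read directly.
other = {"business": {"name": "Other Diner", "city": "Reno"}, "stars": 2, "tags": []}
records = [line, json.dumps(other)]
top = list(high_rated(records, 3))
```

The application side and the analysis side both work from the same serialized records, which is the convenience the paragraph above describes.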

Real Time and Near Real Time

Competition breeds innovation, as evidenced by the Apache Spark and Apache Flink technologies. Spark Streaming has undergone sweeping changes in the last year, but Apache Flink is gaining a strong following. The two are used in many of the same ways, but their underlying architectures and implementations couldn’t be more different. There is plenty of room for both of these technologies to flourish, and they will push each other to improve.

Apache Spark has gained significant momentum in replacing Hadoop MapReduce within the Hadoop stack, and there is no sign that this momentum will slow down any time soon—especially as users find new and creative ways to exploit it.

Apache Storm is probably the most heavily deployed open source streaming engine in production, but all indications point to a decline in the use of Storm in favor of the two aforementioned streaming engines. By most accounts, these two newer engines are easier to program, scale, and manage.

In any case, the move to real time will happen at a much larger scale in 2016, and it will depend heavily on reliable and scalable messaging platforms. Apache Kafka, for example, currently enjoys the benefits brought by the Hadoop Distributed File System. Kafka delivers publish/subscribe functionality at a significant scale, but it lacks many of the enterprise features needed to manage it adequately across data centers, as well as certain reliability features that guarantee against data loss. We can look forward to this space heating up as much as that of the streaming engines, which depend on a solid messaging platform. The messaging platform is, after all, the most critical link in the chain of a real-time processing platform.
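To make the publish/subscribe idea concrete, the following is a toy in-memory sketch in Python. It assumes a simplified model in which each topic is an append-only log and each consumer tracks its own read position. The `TinyBroker` class and its method names are hypothetical and are not Kafka's API; the sketch also has none of the durability or cross-data-center features discussed above.

```python
from collections import defaultdict


class TinyBroker:
    """Toy broker: one append-only log per topic, one offset per consumer."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> ordered list of messages
        self._offsets = defaultdict(int)  # (topic, consumer) -> next index to read

    def publish(self, topic, message):
        """Append a message to the topic's log and return its position."""
        self._logs[topic].append(message)
        return len(self._logs[topic]) - 1

    def poll(self, topic, consumer):
        """Return every message this consumer has not yet seen on the topic."""
        start = self._offsets[(topic, consumer)]
        batch = self._logs[topic][start:]
        self._offsets[(topic, consumer)] = start + len(batch)
        return batch


broker = TinyBroker()
broker.publish("clicks", {"user": 1})
broker.publish("clicks", {"user": 2})

first = broker.poll("clicks", "analytics")   # both messages published so far
broker.publish("clicks", {"user": 3})
second = broker.poll("clicks", "analytics")  # only the newly published message
late = broker.poll("clicks", "billing")      # a new subscriber reads the whole log
```

Because publishers only append and each subscriber keeps its own position, consumers can read at their own pace without affecting one another. Because this toy keeps everything in memory, a crash would lose every message, which is exactly the kind of reliability gap a production messaging platform has to close.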

The Rise of Non-Java Languages in Big Data

Java and other languages that run on the Java Virtual Machine (JVM), such as Scala, have been the main beneficiaries of the fact that most big data technologies were created in Java. This has greatly reduced the time it takes developers who use JVM-based languages to adopt big data technologies.

Until now, alternative programming languages have been left out in the cold. Those who use them have only been able to work with big data technologies via REST interfaces. However, they will soon be able to benefit from the same advantages that Java developers have enjoyed for the past six years.

More enterprises will be able to move to big data technologies as more APIs are supported for languages such as Python and JavaScript, the latter through runtimes such as Node.js. This trend will open up the Hadoop ecosystem to a new group of engineers who don’t use Java.

Python and Node.js are two of the most popular platforms for developing enterprise applications outside of Java. These platforms will see radical growth in terms of their capabilities to access big data technologies. The Web 2.0 crowd will have access to scalable backends that play in the Hadoop ecosystem, and they will actively begin to connect directly to the big data platforms for their storage tier. They’ll also be able to bypass REST interfaces and get access to the same native performance as experienced in Java.

These advances will open the door to an entirely new group of software developers, which will further solidify these big data technologies as the platform for effortlessly building new, linearly scalable applications. The pace at which these applications are built will greatly overshadow applications built on top of traditional RDBMSs, as the effort to serialize and deserialize complex data structures will become a thing of the past.

Backing Up Big Data and Multi-Data Centers

More and more applications will be integrated into big data platforms this coming year, exposing new capabilities that enable even more diverse tasks. Apache Hadoop users will begin looking more heavily at data backup scenarios, as well as running their applications in multiple data centers. Those users will continue to feel the pain of backing up their big data between data centers and keeping them in sync as data volumes continue to increase.

Users of more advanced distributions that support multiple data centers and have mirroring capabilities will not feel such pain. They will advance rapidly into scenarios where they run their applications in multiple data centers. Their applications will not be concerned with keeping databases in sync between data centers, as their distributed data platform will handle this for them automatically and in real time. They will begin to improve customer satisfaction through speed and responsiveness, because they will be running applications in data centers closer to their customers. A greater mix of on-premises and cloud data centers will be leveraged. Customers will reap the benefits of a simplified deployment architecture. Businesses will benefit by not requiring application developers to handle complex syncing between data centers, because the underlying platform delivers these capabilities out of the box.

Digital advertisers have been reaping the benefits of these technologies to deliver advertisements within a few hundred milliseconds anywhere in the world, but that leading edge will take on an entirely new shape as others begin adopting similar practices.

What’s Ahead

The proof-of-concept phase is already over as 2016 begins. Of course, engineers will continue to learn how to use these technologies, but they will no longer need to prove that these technologies can work for most use cases. It is known and documented that these technologies work, and they scale like no other technology stack ever created. The Hadoop distributions that do the best job of easing the burden of deploying and managing Hadoop clusters, and of making it easy for engineers to build their own real-time applications, will see tremendous growth.
