Big Data Notes

“Big data” represents a paradigm shift in the technologies and techniques for storing, analyzing and leveraging information assets. In this column, we track the progress of technologies such as Hadoop, NoSQL and data science and see how they are revolutionizing database management, business practice, and our everyday lives.

Apache Drill Brings SQL to NoSQL

It's been amusing to watch the NoSQL movement transition from a "We don't need no stinking SQL" attitude to a "Can I please have some SQL with that?" philosophy. The nonrelational databases that emerged over the past 8 years initially offered no SQL capabilities. However, today we have an embarrassment of SQL options for "NoSQL." Hive offers SQL for Hadoop systems, Spark has SparkSQL, MongoDB has a SQL-based BI connector, and so on.

Posted December 01, 2016

DataStax Extends Cassandra Capabilities

For many years now, Cassandra has been renowned for its ability to handle massive scaling and global availability. Based on Amazon's Dynamo, Cassandra implements a masterless architecture which allows database transactions to continue even when the database is subjected to massive network or data center disruption. Even in the circumstance in which two geographically separate data centers are completely isolated through a network outage, a Cassandra database may continue to operate in both geographies, reconciling conflicting transactions—albeit possibly imperfectly—when the outage is resolved.

Posted October 07, 2016

MongoDB Continues to Advance

With 20 million downloads to date, MongoDB is arguably today's fastest-growing database technology. MongoDB's rapid growth has been driven primarily by its attractiveness to developers. By using JavaScript Object Notation (JSON) documents as the native database format, MongoDB reduces the impedance mismatch between program code and database, allowing more agile and rapid application development.

Posted August 04, 2016

Dealing With Big Data’s Trough of Disillusionment

For those who haven't encountered the term, the "trough of disillusionment" is a standard phase within the Gartner hype cycle. New technologies are expected to pass from a "peak of inflated expectations" through the trough of disillusionment before eventually reaching the "plateau of productivity." Most new technologies are expected to go through this trough, so it's hardly surprising to find big data entering this phase.

Posted June 09, 2016

Kafka Is the New Standard in Big Data Messaging

It's become almost a standard career path in Silicon Valley: A talented engineer creates a valuable open source software commodity inside of a larger organization, then leaves that company to create a new startup to commercialize the open source product. Indeed, this is virtually the plot line for the hilarious HBO comedy series, Silicon Valley. Jay Krepes, a well-known engineer at LinkedIn and creator of the NoSQL database system, Voldemort, has such a story.

Posted March 31, 2016

What Oracle’s NoSQL SQL Database Reveals

Say what you will about Oracle, it certainly can't be accused of failing to move with the times. Typically, Oracle comes late to a technology party but arrives dressed to kill.

Posted February 10, 2016

Apache Kudu Bridges NoSQL Analytics and OLTP

It's commonly asserted—and generally accepted—that the era of the "one-size-fits-all" database is over. We expect that enterprises will use a combination of database technologies to meet the distinct needs created by various application architectures.

Posted December 02, 2015

Redis Distinguishes Itself Among In-Memory NoSQL Databases

There are quite a few databases competing to be "king" of NoSQL. MongoDB claims to have the fastest-growing NoSQL database ecosystem, MarkLogic claims to be the only Enterprise NoSQL database, while other databases claim to be the fastest or most scalable system.

Posted October 07, 2015

NoSQL Means Yes SQL! With Couchbase’s N1QL

Shortly after the explosion of non-relational databases, around 2009, it became apparent that rather than being part of the problem, SQL would instead continue to be part of the solution. If the new wave of database systems excluded the vast population of SQL-literate professionals, then their uptake in the business world would be impeded. Furthermore, a whole generation of business intelligence tools use SQL as the common way of translating user information requests into database queries. Nowhere was the drive toward SQL adoption more clear than in the case of Hadoop.

Posted August 10, 2015

The Lessons of Database History

There's no doubt that the new wave of nonrelational systems represents an important and necessary revolution in database technology. But while we need to avoid being wedded to the technologies of the past and continuously innovate, ignoring the lessons of history is never a good idea.

Posted June 09, 2015

The Increasing Complexity of Enterprise Big Data

While the new data stores and other software components are generally open source and incur little or no licensing costs, the architecture of the new stacks grows ever more complex, and this complexity is creating a barrier to adoption for more modestly sized organizations.

Posted April 06, 2015

Decoding the Mixed Messages of Hadoop

Someone new to big data and Hadoop might be forgiven for feeling a bit confused after reading some of the recent press coverage on Hadoop. On one hand, Hadoop has achieved very bullish coverage in mainstream media. However, counter to this positive coverage, there have been a number of claims that Hadoop is overhyped. What's a person to make of all these mixed messages?

Posted February 11, 2015

Will Transactions Return to NoSQL?

The introduction of increased transactional capability into non-relational databases makes sense—in the same way that providing SQL layers on top of Hadoop and many other non-relational stores makes sense. But it does raise the possibility of convergence of relational and non-relational systems. After all, if I take a non-relational database and add SQL and ACID transactions, have I still got a non-relational database, or have I come full circle back to the relational model?

Posted December 03, 2014

Real-Time Big Data With the Lambda Architecture

One feature of the big data revolution is the acknowledgement that a single database management system architecture cannot meet all needs. However, the Lambda Architecture provides a useful pattern for combining multiple big data technologies to achieve multiple enterprise objectives. First proposed by Nathan Marz, it attempts to provide a combination of technologies that together can provide the characteristics of a web-scale system that can satisfy requirements for availability, maintainability, and fault-tolerance.

Posted October 08, 2014

Data Marketplaces Have Yet to Deliver on Early Promise

The pioneers of big data, such as Google, Amazon, and eBay, generated a "data exhaust" from their core operations that was more than sufficient to allow them to create data-driven process automation. But, for smaller enterprises, data might be the scarcest commodity. Hence, the emergence of data marketplaces.

Posted August 05, 2014

The Fundamentals of Big Data Analytics

Big data analytics is a complex field, but if you understand the basic concepts—such as the difference between supervised and unsupervised learning—you are sure to be ahead of the person who wants to talk data science at your next cocktail party!

Posted June 11, 2014

Is the Berkeley Data Analytics Stack the Future of Hadoop?

About 3 years ago, the AMP (Algorithms, Machines, People) lab was established at U.C. Berkeley to attack the emerging challenges of advanced analytics and machine learning on big data. The resulting Berkeley Data Analytics Stack—particularly the Spark processing engine—has shown rapid uptake and tremendous promise.

Posted April 04, 2014

Aerospike Merges Flash and NoSQL Technologies

Solid State Disk (SSD)—particularly flash SSD—promised to revolutionize database performance by providing a storage media that was orders of magnitude faster than magnetic disk, offering the first significant improvement in disk I/O latency for decades. Aerospike is a NoSQL database that attempts to provide a database architecture that can fully exploit the I/O characteristics of flash SSD.

Posted February 10, 2014