Hadoop Day  
The Elephant is coming to Boston ...


Since its beginnings as a project aimed at building a better web search engine for Yahoo, inspired by Google's well-known MapReduce paper, Hadoop has grown to occupy the center of the big data marketplace. From data offloading to preprocessing, Hadoop is not only enabling the analysis of new data sources among a growing legion of enterprise users; it is changing the economics of data. Alongside this momentum is a budding ecosystem of Hadoop-related solutions, from open source projects such as Spark, Hive, and Drill to commercial products offered on-premises and in the cloud. These new technologies are solving real-world big data challenges today.

Tuesday, May 22



Welcome & Keynote - Once We Know Everything

Tuesday, May 22: 9:00 a.m. - 9:45 a.m.

We, of course, will never know everything. But with the arrival of Big Data, machine learning, data interoperability, and all-to-all connections, our machines are changing the long-settled basics of what we know, how we know, and what we do with what we know. Our old—ancient—strategy was to find ways to narrow knowledge down to what our 3-pound brains could manage. Now it’s cheaper to include it all than to try to filter it on the way in. But in connecting all those tiny datapoints, we are finding that the world is far more complex, delicately balanced, and unruly than we’d imagined. This is leading us to switch our fundamental strategies from preparing to unanticipating, from explaining to optimizing, from looking for causality to increasing interoperability. The risks are legion, as we have all been told over and over. But the change is epochal, and the opportunities are transformative.


David Weinberger, Senior Researcher, Harvard's Berkman Center for Internet & Society


Sponsored Keynote - Oracle

Tuesday, May 22: 9:45 a.m. - 10:00 a.m.


Tuesday, May 22

Track H: Hadoop Day

Joe McKendrick, Principal Researcher, Unisphere Research, A Division of Information Today, Inc.

H101. The Big Data Ecosystem Today

Tuesday, May 22: 10:45 a.m. - 11:45 a.m.

The expanding array of data, data types, and data management systems is making the enterprise data landscape more complicated. The challenge is finding the right balance between data access and data management.

SQL's Sequel: Hadoop & the Post-Relational Revolution

We are now in the Big Data era, thanks to an explosion in the volume, velocity, and variety of data. We are also now in the post-relational era, thanks to a proliferation of options for handling Big Data more naturally and efficiently than relational database management systems (RDBMS). That’s not to say that we’re done with RDBMS; rather, that Big Data is better handled by technologies such as Hadoop, HBase, Cassandra, and MongoDB, which provide scale-out, massively parallel processing (MPP) architectures. This presentation discusses the rise of Hadoop and other MPP technologies and where they fit into an enterprise architecture in the Big Data era.


, Founder & CEO, Integra Technology Consulting
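
For readers new to the processing model behind these systems, here is a minimal sketch, in PySpark, of the scale-out, MapReduce-style computation the talk contrasts with a traditional RDBMS. The input path is hypothetical; the same job runs unchanged on one machine or on hundreds.

# Minimal MapReduce-style word count in PySpark. Each partition of the
# input is processed in parallel across the cluster (or locally).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/weblogs/*.log")  # hypothetical path
counts = (lines.flatMap(lambda line: line.split())   # map: emit one record per word
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))     # reduce: aggregate per key in parallel

counts.saveAsTextFile("hdfs:///data/weblogs_wordcount")
spark.stop()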

SQL on Big Data—Technology, Architecture, & Innovations

This comprehensive overview of SQL engines on Big Data focuses on low latency. SQL has been with us for more than 40 years and Big Data technologies for about 10 years. Both are here to stay. Pal covers how SQL engines are architected for processing structured, unstructured, and streaming data and the concepts behind them. He also covers the rapidly evolving landscape and innovations happening in the space—with products such as OLAP on Big Data, probabilistic SQL engines such as BlinkDB, HTAP-based solutions such as NuoDB, exciting solutions using GPUs with 40,000 cores to build massively parallel SQL engines for large-scale datasets with low latency, and the TPC-Benchmark 2.0 for evaluating the performance of SQL engines on Big Data.


Sumit Pal, Big Data and Data Science Architect, Independent Consultant, and Author of "SQL on Big Data: Technology, Architecture and Innovation" (Apress)
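
As a rough illustration of the territory this talk surveys (not an excerpt from it), the following sketch runs SQL directly over files in a lake with Spark SQL, then loosely approximates the BlinkDB idea of trading accuracy for latency via sampling. The paths, table, and columns are hypothetical, and unlike BlinkDB, plain sampling gives no error bounds.

# SQL over data lake files with Spark SQL; no load into an RDBMS required.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-bigdata-sketch").getOrCreate()

# Register a Parquet dataset as a queryable table (hypothetical path).
spark.read.parquet("s3a://lake/events/").createOrReplaceTempView("events")

# Exact aggregate, executed as a distributed, massively parallel plan.
spark.sql("""
    SELECT country, COUNT(*) AS n
    FROM events
    GROUP BY country
    ORDER BY n DESC
""").show()

# BlinkDB-style idea, loosely approximated: scan only a 1% sample and
# scale the counts, trading accuracy for latency.
spark.sql("""
    SELECT country, COUNT(*) * 100 AS estimated_n
    FROM events TABLESAMPLE (1 PERCENT)
    GROUP BY country
""").show()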


H102. Data Lake Best Practices

Tuesday, May 22: 12:00 p.m. - 12:45 p.m.

The concept of a data lake that encompasses data of all types is highly appealing. Before diving in, it is important to consider the key attributes of a successful data lake and the products and processes that make it possible.

The Data Lake Toolkit

Collaboration and support for diverse analytical workloads are the two key goals when designing a data lake. Modern data lakes contain an incredible variety of datasets, varying in size, format, quality, and update frequency. The only way to manage this complexity is to enable collaboration, which not only promotes reuse but also creates the network effect that helps solve some of the most vexing problems of quality and reusability. Given the scale and complexity of the data, moving it outside of the lake is not only impractical but also expensive, so the data lake needs to support diverse needs and the resulting diverse workloads.


, VP, Data Analytics, Accelerite
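
A minimal sketch of the "process it where it lives" argument above, using Spark as one example engine: two datasets in different formats are read and joined in place, with no copy out of the lake. All paths, formats, and column names are hypothetical.

# One engine, two formats, no data movement: query lake files in place.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datalake-sketch").getOrCreate()

# Curated columnar data alongside raw CSV drops in the same lake.
orders = spark.read.parquet("hdfs:///lake/curated/orders/")
customers = (spark.read
             .option("header", True)
             .csv("hdfs:///lake/raw/crm/customers.csv"))

# Join and aggregate in place, without exporting data from the lake.
top = (orders.join(customers, "customer_id")
             .groupBy("region")
             .count()
             .orderBy("count", ascending=False))
top.show()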


H103. The Hybrid Future of Big Data

Tuesday, May 22: 2:00 p.m. - 2:45 p.m.

Cutting-edge Big Data technologies are easily accessible in the cloud today. However, overcoming integration challenges and operationalizing, securing, governing, and enabling self-service usage in the cloud can still be vexing concerns, just as they are on-premises.

Architecting & Operationalizing a Hybrid Data Lake

Standardization, automation, and deep integration technologies allow enterprises to transform their business by building and operating successful, self-service data lakes with IT guardrails on-premises and in the cloud, while avoiding the undesirable complexities, inefficiencies, and risks resulting from the messy and diverse nature of Big Data. Gray covers the many benefits of implementing a data lake, both on-premises and in the cloud, and addresses the associated challenges, including data integration, sanitization, security, governance, and more. Cloud technologies, such as Azure WASB, Azure Data Lake Storage (ADLS), GCS, S3, and Redshift, help enterprises overcome their migration challenges.


Jonathan Gray, CEO & Founder, Cask
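
To make the hybrid access pattern concrete, here is a hedged sketch of one Spark job reading the same logical dataset from on-premises HDFS and from cloud object stores. All bucket and account names are hypothetical, and the relevant connectors and credentials (s3a, abfss) are assumed to already be configured on the cluster.

# Hybrid read: one job, three storage tiers, identical downstream logic.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hybrid-lake-sketch").getOrCreate()

on_prem = spark.read.parquet("hdfs:///lake/events/2018/")          # on-premises HDFS
in_s3   = spark.read.parquet("s3a://corp-lake/events/2018/")       # AWS S3
in_adls = spark.read.parquet(
    "abfss://lake@corpaccount.dfs.core.windows.net/events/2018/")  # Azure ADLS

# Union the tiers into one view; downstream jobs need not care where bytes live.
events = on_prem.unionByName(in_s3).unionByName(in_adls)
print(events.count())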


H104. Unleashing Your Big Data With Spark

Tuesday, May 22: 3:15 p.m. - 4:00 p.m.

Big Data requires processing on a massive scale. Newer open source technologies such as Spark can help to enable Big Data processing for use cases that were previously unimaginable.

Building a Recommender System With Machine-Learning & Spark

Outbrain is the world’s largest discovery platform, bringing personalized and relevant content to audiences while helping publishers understand their audiences through data. Outbrain uses a multiple-stage machine learning workflow over Spark to deliver personalized content recommendations to hundreds of millions of monthly users. This talk covers Outbrain's journey toward solutions that compromise on neither scale nor model complexity, and the design of a dynamic framework that shortens the cycle between research and production. It also covers the different stages of the framework, including important takeaway lessons for data scientists as well as software engineers.


, Tech Lead & Algorithm Engineer, Outbrain
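
As a generic illustration of one stage such a workflow might include (not Outbrain's actual pipeline), here is a minimal collaborative-filtering sketch with Spark ML's ALS. The ratings dataset and its columns are hypothetical.

# Train a collaborative-filtering recommender with Spark ML's ALS.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recommender-sketch").getOrCreate()

ratings = spark.read.parquet("hdfs:///lake/ratings/")  # hypothetical path and schema

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=10, maxIter=10, regParam=0.1,
          implicitPrefs=True,          # clicks/views rather than explicit stars
          coldStartStrategy="drop")    # skip users/items unseen at training time
model = als.fit(ratings)

# Produce the top 5 recommendations per user, ready to serve downstream.
model.recommendForAllUsers(5).show(truncate=False)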


H105. Panel Discussion

Tuesday, May 22: 4:15 p.m. - 5:00 p.m.
