Data Summit 2018

May 22 - 23, 2018 // Boston, MA

The Elephant is coming to Boston ...

Since its beginning as a project aimed at building a better web search engine for Yahoo – inspired by Google’s well-known MapReduce paper – Hadoop has grown to occupy the center of the big data marketplace. From data offloading to preprocessing, Hadoop is not only enabling the analysis of new data sources amongst a growing legion of enterprise users; it is changing the economics of data. Alongside this momentum is a budding ecosystem of Hadoop-related solutions, from open source projects like Spark, Hive and Drill, to commercial products offered on-premises and in the cloud. These new technologies are solving real-world big data challenges today.

Tuesday, May 22

Keynotes

Welcome & Keynote - Once We Know Everything

Tuesday, May 22: 9:00 a.m. - 9:45 a.m.

We, of course, will never know everything. But with the arrival of Big Data, machine learning, data interoperability, and all-to-all connections, our machines are changing the long-settled basics of what we know, how we know, and what we do with what we know. Our old—ancient—strategy was to find ways to narrow knowledge down to what our 3-pound brains could manage. Now it’s cheaper to include it all than to try to filter it on the way in. But in connecting all those tiny datapoints, we are finding that the world is far more complex, delicately balanced, and unruly than we’d imagined. This is leading us to switch our fundamental strategies from preparing to unanticipating, from explaining to optimizing, from looking for causality to increasing interoperability. The risks are legion, as we have all been told over and over. But the change is epochal, and the opportunities are transformative.

Speaker:

David Weinberger, Harvard metaLAB and Harvard Berkman Klein Center

Track H: Hadoop Day

Moderator:

Joe McKendrick, Principal Researcher, Unisphere Research

H101. The Big Data Ecosystem Today

Tuesday, May 22: 10:45 a.m. - 11:45 a.m.

The expanding array of data, data types, and data management systems is making the enterprise data landscape more complicated. It is all about finding the right balance for data access and management.

SQL's Sequel: Hadoop & the Post-Relational Revolution

10:45 a.m. - 11:45 a.m.

We are now in the Big Data era, thanks to an explosion in the volume, velocity, and variety of data. We are also now in the post-relational era, thanks to a proliferation of options for handling Big Data more naturally and efficiently than relational database management systems (RDBMS). That’s not to say that we’re done with RDBMS; rather, that Big Data is better handled by technologies such as Hadoop, HBase, Cassandra, and MongoDB, which provide scale-out, massively parallel processing (MPP) architectures. This presentation discusses the rise of Hadoop and other MPP technologies and where they fit into an enterprise architecture in the Big Data era.

Speaker:

David Teplow, Founder & CEO, Integra Technology Consulting

SQL on Big Data—Technology, Architecture, & Innovations

10:45 a.m. - 11:45 a.m.

This comprehensive overview of SQL engines on Big Data focuses on low latency. SQL has been with us for more than 40 years and Big Data technologies for about 10 years. Both are here to stay. Pal covers how SQL engines are architected for processing structured, unstructured, and streaming data and the concepts behind them. He also covers the rapidly evolving landscape and the innovations happening in the space, including OLAP on Big Data, probabilistic SQL engines such as BlinkDB, HTAP-based solutions such as NuoDB, solutions that use GPUs with 40,000 cores to build massively parallel, low-latency SQL engines for large-scale datasets, and the TPC-Benchmark 2.0 for evaluating the performance of SQL engines on Big Data.

H102. Data Lake Best Practices

Tuesday, May 22: 12:00 p.m. - 12:45 p.m.

The concept of a data lake that encompasses data of all types is highly appealing. Before diving in, it is important to consider the key attributes of a successful data lake and the products and processes that make it possible.

The Data Lake Toolkit

12:00 p.m. - 12:45 p.m.

Enabling collaboration and supporting diverse analytical workloads are the two key goals when designing a data lake. Modern data lakes contain an incredible variety of datasets, varying in size, format, quality, and update frequency. The only way to manage this complexity is to enable collaboration, which not only promotes reuse, but also enables the network effect that helps solve some of the vexing problems of quality and reusability. Given the scale and complexity of data, moving it outside of the lake is not only impractical but also expensive, so the data lake needs to support diverse needs and the resulting diverse workloads.

Speaker:

Mukund Deshpande, VP, Data Analytics, Accelerite

Smart Data Lakes, Knowledge Graphs, and the Semantic Layer: An Enterprise Information Fabric Road Map

12:00 p.m. - 12:45 p.m.

Only with a rich interactive semantic layer, based on knowledge graph technology and situated at the heart of the data lake, can organizations hope to deliver true on-demand access to all of the data, answers, and insights, woven together as an enterprise information fabric.

Speaker:

Sean Martin, CTO, Cambridge Semantics

H103. The Hybrid Future of Big Data

Tuesday, May 22: 2:00 p.m. - 2:45 p.m.

Cutting-edge Big Data technologies are easily accessible in the cloud today. However, overcoming integration challenges and operationalizing, securing, governing, and enabling self-service usage in the cloud can still be vexing concerns, just as they are on-premises.

Accelerating Analytic Database Performance

2:00 p.m. - 2:45 p.m.

Database characteristics that impact query performance for BI and analytic use cases include the use of columnar structures, parallelization of operations, memory optimizations, and scaling to high numbers of concurrent users. Maguire also covers the requirements for handling updates for real-time analytics.

Speaker:

Walt Maguire, VP Systems Engineering, Actian

Three Questions You Aren't Asking That Will Make Your Data Strategy Hum

2:00 p.m. - 2:45 p.m.

What three important questions should business leaders consider asking the next time they need to make a technology decision for a data monetization project? Get your guidance from Joseph deBuzna.

Speaker:

Joseph deBuzna, VP Field Engineering, HVR

H104. Unleashing Your Big Data With Spark

Tuesday, May 22: 3:15 p.m. - 4:00 p.m.

Big Data requires processing on a massive scale. Newer open source technologies such as Spark can help to enable Big Data processing for use cases that were previously unimaginable.

Building a Recommender System With Machine-Learning & Spark

3:15 p.m. - 4:00 p.m.

Outbrain is the world’s largest discovery platform, bringing personalized and relevant content to audiences while helping publishers understand their audiences through data. Outbrain uses a multiple-stage machine learning workflow over Spark to deliver personalized content recommendations to hundreds of millions of monthly users. This talk covers its journey toward solutions that would not compromise on scale or on model complexity, and the design of a dynamic framework that shortens the cycle between research and production. It also covers the different stages of the framework, including important takeaway lessons for data scientists as well as software engineers.

Speaker:

Shaked Bar, Tech Lead & Algorithm Engineer, Outbrain

H105. The Challenges & Tactics for Improving Analytic Performance

Tuesday, May 22: 4:15 p.m. - 5:00 p.m.

Walt Maguire introduces analytic case studies, including one from Craig Strong, chief technology and product officer at Hubble, who describes how Hubble is able to provide real-time corporate performance management (CPM) through high-speed analytics dashboards. Hubble’s dynamic dashboards draw from hybrid data sources and allow for ad hoc query and analysis of near real-time corporate performance. Hubble ran performance tests comparing Actian Vector with a selection of databases, including SQL Server, MemSQL, SAP, Presto, Spark, and Redshift. Those results are presented along with recent results for scaled, cloud-based databases and the factors to consider in such performance tests, including configurations, query complexity, database size, and concurrency.