Tips for Designing a Data Warehouse for Agility and Scale at Data Summit Connect Fall 2020

At Data Summit Connect Fall 2020, John O'Brien, principal advisor and CEO, Radiant Advisors, provided guidance about the agile methodology and templates that project delivery teams can follow to build modern data infrastructures (on-prem, hybrid, and multi-cloud).

He was followed by Waqas Dhillon, product manager for machine learning, Vertica, whose presentation focused on best practices for scaling analytics workloads.

Videos of presentations from Data Summit Connect Fall 2020, a free series of data management and analytics webinars presented by DBTA and Big Data Quarterly, are available for viewing on the DBTA YouTube channel.

In his presentation, titled "Delivering Cloud Data Architectures with Agile Processes," O'Brien shared how Radiant Advisors' approach allows delivery teams to follow, initiate, and leverage data and integration design patterns while working to build an enterprise data and analytics platform. This approach also defines the teams and individual roles, along with expectations for working in a prioritized and governed manner to evolve the data platform in alignment with business priorities.

O'Brien also revealed lessons learned from real-world implementations his consultancy has been involved in. According to O'Brien, a critical issue to consider is the analytics capability you want to deliver, such as for BI and reporting purposes, enterprise self-service and data analytics, or data science and AI. The first two areas are more focused on analyzing historical data, while the third is more predictive and prescriptive. However, the foundation of data ingestion, data pipelines, and an enterprise data lake with raw, curated, and processed data can be the same for all.

Key components of a modern data architecture include:

  • Data architecture: a data lake, data hubs, data labs, a data marketplace
  • Data integration: streaming data hub, database replication, data pipelines platform, orchestration
  • Cloud architecture: SaaS and PaaS, scalable elasticity, hybrid and multi-cloud, serverless functions
  • Data technologies: RDBMS/MPP/columnar, durable storage, in-memory databases, NoSQL databases
  • Data management: DataOps, data governance, data quality
  • Data science platforms: Apache Spark, R and Python, ML/DL libraries, analytic asset management

An architecture is always evolving, said O'Brien, and an enterprise architecture is an environment that should continually improve to enable better, faster, more trustworthy business decisions.

Data and analytics project priorities generally have three main considerations, he said:

1. Speed: the shortest distance from one point to another.
2. Cost: typically fixed in terms of team resources and budget.
3. Quality: in terms of the architecture and maintainability.

When building a data architecture, it is important to remember:

Do not over-engineer the solution and do not try to have the architecture solve future problems that don't exist yet.

Do evaluate the project delivered on a regular basis and look for ways to improve and evolve the architecture.

Architecture projects should:

  • Have MVPs—"minimum viable patterns"—to establish end-to-end delivery on reusable architecture patterns.
  • Follow a "nail it then scale it" model because you will learn the majority of what's needed on an end-to-end small scale first.
  • Be lean initially to save cost, then very scalable for ROI.

Dhillon's presentation was titled "Scaling Analytics Workloads Using Distributed Analytical Data Warehouse Vertica." According to Dhillon, most companies face challenges in storing and managing their increasing volumes of data, let alone trying to perform analytics to learn patterns and trends in that data. Vertica offers an advanced unified analytical warehouse that enables organizations to keep up with the size and complexity of enormous data volumes.

Vertica supports machine learning at scale to transform the way data scientists and analysts interact with data, while removing barriers and accelerating time to value on predictive analytics projects.

Key considerations for machine learning at scale include:

  • the need for speed at a reasonable cost;
  • the difficulty of moving big datasets around, which makes it necessary to bring models to the data; and
  • the reality that subsampling can compromise accuracy, which makes it important to have an approach that allows you to work with all of your data.
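The subsampling point can be sketched with a small generic illustration (not code from the presentation): on a skewed dataset, a statistic estimated from a 1% sample drifts away from the full-data answer, while a full scan computes it exactly.

```python
import random
import statistics

# A generic illustration (not from the presentation): a skewed dataset
# where a small subsample can misestimate the full-data statistic.
random.seed(42)
data = [random.expovariate(1.0) ** 3 for _ in range(100_000)]

true_mean = statistics.mean(data)      # the exact full-scan answer
sample = random.sample(data, 1_000)    # a 1% subsample
sample_mean = statistics.mean(sample)

# The full scan has zero estimation error by definition; the subsample
# almost always lands some distance away from the true mean.
sample_error = abs(sample_mean - true_mean)
```

The more skewed the data, the larger that gap tends to be, which is the argument for platforms that can train on all the data rather than a sample.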

Whether you are starting out on your journey of building an enterprise data and analytics platform or looking to switch from a legacy system that has speed and scalability limitations, Vertica can offer advantages such as built-in analytics and machine learning functions, linear scaling, native high availability, and hybrid deployment options, said Dhillon.

According to Dhillon, Vertica's value in the machine learning space is its enterprise-grade architecture, support for concurrent sessions, the ability to leverage an MPP infrastructure and scale-out architecture, and the ability to manage and deploy machine learning models using simple SQL calls and integrate machine learning functions with other tools using the same SQL interface.
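The underlying "analytics through SQL calls where the data lives" pattern can be sketched generically (using Python's built-in sqlite3 purely as a stand-in, not Vertica itself): pushing an aggregation into the database returns the same answer as pulling every raw row into the client, while moving far less data.

```python
import sqlite3

# Toy table standing in for a large fact table (sqlite3 is used only
# to illustrate the pattern; Vertica exposes it through its own SQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(i % 10, float(i)) for i in range(10_000)],
)

# In-database: one SQL call; only 10 aggregate rows leave the engine.
in_db = conn.execute(
    "SELECT sensor_id, AVG(value) FROM readings "
    "GROUP BY sensor_id ORDER BY sensor_id"
).fetchall()

# Client-side: pull all 10,000 raw rows out, then aggregate in Python.
rows = conn.execute("SELECT sensor_id, value FROM readings").fetchall()
totals = {}
for sid, val in rows:
    s, n = totals.get(sid, (0.0, 0))
    totals[sid] = (s + val, n + 1)
client = [(sid, s / n) for sid, (s, n) in sorted(totals.items())]

same_answer = in_db == client  # identical results, far less data moved
```

The same shape applies when the SQL call invokes a model-scoring function instead of AVG: the computation travels to the data rather than the data to the computation.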