Trends and Best Practices for the Modern Data Engineer

While popular advancements in data science, machine learning (ML), and AI have been at the forefront of data-centric business, data engineering is the metaphorical fuel that keeps these exciting technologies aflame.

Supporting and enabling these solutions requires extensive and complex work on the part of data engineers, ranging from building data warehouses and data lakes to designing data models and automating data pipelines. The role of data engineers continues to grow in importance as technology advances, requiring enterprises to support these personas in their often quiet—yet crucial—workflows.

Leading engineering experts joined DBTA’s roundtable webinar, What's Ahead in Data Engineering: Top Trends and Emerging Best Practices, to offer viewers a comprehensive discussion regarding the ways in which data engineering can be best positioned for present and future success.

Ty Alevizos, solution engineer at Satori, highlighted a myriad of trends and case studies that drive and define the data engineering experience.

The first trend Alevizos dove into was the evolution of DataSecOps, using the history of web applications as an analogy. He explained that, in the beginning, a single “webmaster” oversaw development with no tools. As time went on, teams and tools grew; then, as the tools advanced, teams shrank thanks to the efficiencies new technology offered.

DataSecOps, Alevizos argued, has followed the same arc as web application development: it began with a sole person and no tools and grew to incorporate a variety of tools that fostered a small-team mentality.

“After a period of time…you now have data security operations, or DataSecOps, as a formal reemergence of what used to be this very vague, ‘data wrangler’ role,” said Alevizos. “Tooling has been a strong part of that. Across the industry, data security tools have emerged not as a ‘nice to have’ afterthought, but a core foundational requirement to presenting information to the audience.”

“The data engineering part of the process has a really strong security aspect to it,” Alevizos continued.

With Satori, tools have finally caught up to the operator, according to Alevizos. Satori is a data security platform that provides everyone in a company with immediate, secure access to data, a capability that is particularly crucial to data engineers in this context.

Alevizos then walked webinar attendees through two more trends—multi-cloud data meshes and the commodification of data security—as well as how Satori is positioned to help data engineers accommodate them.

Shiyi Gu, senior product marketing manager at Snowflake, pointed to the breaking down of silos as a top trend for data engineers. These silos, which have been “the bane of our existence since the early days of computing,” are even more critical when examining the disconnect between streaming and batch systems.

“Streaming, as a term…is often misunderstood,” said Gu. “It’s often pigeonholed to only instantaneous data applications where latency is sub-second. The problem with this approach is that it alienates a long list of use cases that are not instantaneous, for one reason or another, but are still better thought of as a stream.”

Examples include an updated view of retail inventory refreshed every ten minutes, or predictive manufacturing quality information delivered to quality engineers every minute; while neither is “instantaneous,” both still fall under the streaming data umbrella.

When these use cases are considered anything other than streaming data, it forces data practitioners to pick a technology to address this data as opposed to focusing on their business needs, ultimately incurring an unnecessarily hefty price tag.

This is compounded by the fact that streaming data gets expensive very quickly, preventing organizations from economically scaling up their operations, according to Gu. Thinking of streaming as a continuum, with an architecture to match, allows you “to balance latency and cost in a way that yields untapped return on your investments,” said Gu.

She further argued that by expanding your vocabulary around streaming, you can deliver greater business value. Simplifying enterprise architectures by merging streaming and batch pipelines in one system is a key component of optimizing data engineering while reducing silos, according to Gu.
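Gu's continuum framing can be sketched in a few lines of code. The example below is an illustrative assumption on our part (it does not depict Snowflake's implementation): the same aggregation logic serves both "batch" and "streaming" workloads when the micro-batch window size is treated as a tunable latency/cost knob, rather than as a choice between two separate systems.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical event shape: {"sku": str, "qty": int}.

def micro_batch(events: Iterable[dict], window_size: int) -> Iterator[List[dict]]:
    """Group an event stream into fixed-size micro-batches.

    A window_size of 1 approximates sub-second streaming; a very large
    window_size degenerates into a classic batch job. The processing
    code downstream never changes -- only the latency/cost trade-off does.
    """
    batch: List[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) >= window_size:
            yield batch
            batch = []
    if batch:  # flush the final partial window
        yield batch

def inventory_totals(batch: List[dict]) -> Dict[str, int]:
    """Same aggregation regardless of how the data arrived."""
    totals: Dict[str, int] = {}
    for event in batch:
        totals[event["sku"]] = totals.get(event["sku"], 0) + event["qty"]
    return totals

events = [
    {"sku": "A", "qty": 3}, {"sku": "B", "qty": 1},
    {"sku": "A", "qty": 2}, {"sku": "B", "qty": 4},
]

# "Streaming-ish": small windows, lower latency, more frequent invocations.
for b in micro_batch(events, window_size=2):
    print(inventory_totals(b))

# "Batch-ish": one big window, higher latency, fewer invocations.
for b in micro_batch(events, window_size=4):
    print(inventory_totals(b))
```

In this sketch, merging streaming and batch pipelines means maintaining one codebase and dialing a single parameter per use case, rather than operating two silos with duplicated logic.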

David Talaga, product marketing director at Dataiku, argued that the main challenges facing data engineers today include a lack of proactive remediation for data issues, the wild proliferation of data, and increasingly siloed teams, which is further complicated by technological barriers.

“Data engineers should no longer be tasked with playing solo parts,” said Talaga. “Instead, they are expected to evolve into a conductor's role.”

Talaga explained that there is a single solution to these challenges: data trust. To build data trust for the whole organization, data must be easily discoverable, made reliable together, and controllable and shareable at speed.

Diving into the first of these practices, discoverability, Dataiku offers processes ranging from data discovery to model monitoring in a complete cycle. Additionally, Dataiku enables in-depth analysis of the discovered data, unveiling hidden patterns with Exploratory Data Analysis. Finally, users can automatically generate features from the discovered data, helping avoid future data leakage.

To uncover more trends, remediations, and best practices for the modern data engineer, you can view an archived version of the webinar here.