Dremio has announced its launch in the data analytics market with the availability of the Dremio Self-Service Data Platform.
Despite industry promises of software unlocking the value of data, the company contends, analysts and data scientists continue to struggle to harness data for business intelligence and data science.
The company’s mission is to eliminate the need for traditional ETL, data warehouses, cubes, and aggregation tables, along with the infrastructure that supports them, thereby accelerating time to insight.
Dremio aims to make users independent and self-directed in their use of data, while accessing data from a variety of sources at scale. In this way, it “liberates them to use data themselves, instead of being dependent on IT,” said Kelly Stirman, Dremio vice president of strategy and CMO.
The idea is that if analysts and data scientists can access the data they need through a single platform, organizations can avoid the proliferation of duplicate copies of data and improve data governance.
New Challenges
According to Tomer Shiran, co-founder and CEO of Dremio, there has been a dearth of innovation in the ETL and data warehousing space over the last 20 years, and that is impacting companies’ ability to leverage their data.
There are two conspicuous challenges today: the size and complexity of data under management, and the increasingly high expectations of users accustomed to fast response from consumer devices and apps, said Shiran. Together, they create an almost unwinnable situation for companies.
Self-Service
Dremio set out to address these issues by providing a self-service approach that is more consumer-like, and incorporates execution and caching technologies that accelerate analytical processing to achieve the response that users have become accustomed to. “We hired UI engineers from Apple and Twitter to focus on building an extremely high-quality UI that business analysts, and Excel and line-of-business users would love to use,” said Shiran.
The UI allows users to discover, curate, accelerate, and share data for specific needs, without being dependent on IT, and they can also launch their favorite tools from Dremio directly, including Tableau, Qlik, Power BI, and Jupyter Notebooks.
Performance
A key differentiator of the new solution, noted Shiran, is that Dremio is the first Apache Arrow-based distributed query execution engine. This, he said, represents a breakthrough in performance for analytical workloads as it enables hardware efficiency and minimizes serialization and deserialization of in-memory data buffers between Dremio and client technologies like Python, R, Spark, and other analytical tools. Arrow is also designed for GPU and FPGA hardware acceleration.
The company has also pioneered a new technology called Reflections that isolates operational systems from analytical workloads by physically optimizing data for specific query patterns, so queries can be satisfied faster. Dremio’s query planner selects the best Reflections to provide maximum efficiency, accelerating processing by up to a factor of 1000.
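The mechanism described above can be illustrated conceptually. The following sketch is hypothetical (the names are not Dremio's API); it shows how a pre-aggregated copy of the data, built once for a known query pattern, lets an aggregate query be answered without re-scanning raw rows:

```python
# Conceptual sketch of a "reflection": a physically optimized copy of the
# data, built once for a known query pattern. Names here are hypothetical,
# not Dremio's actual API.
raw_rows = [("us", 10), ("us", 30), ("eu", 5)]

# Build the reflection once: totals pre-aggregated by region.
reflection = {}
for region, amount in raw_rows:
    reflection[region] = reflection.get(region, 0) + amount

def total_by_region(region):
    # The planner satisfies the query from the reflection (a lookup)
    # instead of scanning and summing raw_rows on every query.
    return reflection[region]

print(total_by_region("us"))  # → 40
```

The real system's planner chooses among many such structures automatically; the point is that the expensive scan-and-aggregate work is paid once, ahead of query time.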
In addition, Dremio optimizes processing into underlying data sources, maximizing efficiency and minimizing demands on operational systems. It rewrites SQL in the native query language of each data source, such as Elasticsearch, MongoDB, and HBase, and optimizes processing for file systems such as Amazon S3 and HDFS.
“We know how to translate query plans to the MongoDB language or the Elasticsearch [query language], or Oracle’s SQL dialect,” said Shiran. “The way to get the best performance is to fundamentally understand what each system supports from an execution standpoint, being able to push down queries using the language of that data source.”
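As a rough illustration of the pushdown idea (this is not Dremio's implementation), a simple SQL comparison can be rewritten into a MongoDB filter document so the filtering runs inside the source system rather than after fetching every row:

```python
# Hypothetical illustration of query pushdown: rewrite a simple SQL
# comparison predicate into MongoDB's native filter syntax, so the
# source database does the filtering itself.

def pushdown_to_mongo(column, op, value):
    """Map a simple SQL comparison to a MongoDB filter document."""
    mongo_ops = {"=": "$eq", ">": "$gt", "<": "$lt", ">=": "$gte", "<=": "$lte"}
    return {column: {mongo_ops[op]: value}}

# SELECT * FROM orders WHERE amount > 100
# becomes a find() filter executed inside MongoDB itself:
print(pushdown_to_mongo("amount", ">", 100))
# → {'amount': {'$gt': 100}}
```

A real planner handles far more than single comparisons, but the principle is the same: only work the source cannot perform natively is done in the engine.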
Governance
While greater data access for more users has been the holy grail of big data analytics, the parallel concern is that opening up more access to enterprise data may result in weaker governance and more risk. Addressing these concerns, Shiran said that Dremio's Data Graph preserves a complete view of the end-to-end flow of data for analytical processing. This allows companies to have visibility into how data is accessed, transformed, joined, and shared across sources and analytical environments to support data governance, security, knowledge management, and remediation activities.
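The lineage tracking described above can be sketched in miniature. This is a conceptual example, not Dremio's Data Graph API: each derived dataset records its immediate parents, so any dataset's full upstream ancestry can be traced for governance or remediation:

```python
# Hypothetical sketch of dataset lineage tracking (not Dremio's actual
# Data Graph API): map each derived dataset to its immediate sources.
lineage = {
    "sales_curated": ["sales_raw"],
    "sales_by_region": ["sales_curated", "regions"],
}

def ancestors(dataset):
    """Return every upstream source that feeds a dataset, transitively."""
    found = set()
    for parent in lineage.get(dataset, []):
        found.add(parent)
        found |= ancestors(parent)
    return found

print(sorted(ancestors("sales_by_region")))
# → ['regions', 'sales_curated', 'sales_raw']
```

With such a graph, a governance team can answer questions like "which downstream datasets are affected if this source contains bad data?" by walking the edges in either direction.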
“One of the unique advantages that Dremio has—because it is one system and is sitting in that middle layer—is that we see everything that is happening from data curation to all the queries coming from Tableau, R, and Power BI,” said Shiran. “We know exactly who is doing what with the data and the relationships between the datasets.”
Deployment
Dremio can be run in the cloud, on premises, or as a service provisioned and managed in a Hadoop cluster.
Dremio has been in beta testing over the past few months, with about 50% of customers running exclusively in the public cloud and the other half in their own data centers. In the cloud, Shiran noted, Dremio can take advantage of elastic compute resources as well as object storage such as Amazon S3 for its Reflection Store. In addition, Dremio can analyze data from a wide variety of cloud-native and cloud-deployed data sources.
The software is distributed as a free, open source community edition, in addition to an enterprise edition.
Dremio has been released as an open source project under the Apache License and is available for download.