New Data Loading Technology from Greenplum Offers High Speeds For Large-Scale Data Warehousing

Greenplum, a leading provider of database software for the next generation of data warehousing and analytics, has announced new technology designed to accelerate data loading for companies dealing with exponential data growth. Greenplum's new "MPP Scatter/Gather Streaming" (SG Streaming) technology eliminates the bottlenecks associated with other approaches to data loading, enabling high-speed flows of data into the Greenplum Database for large-scale analytics and data warehousing. According to Greenplum, some of its customers are achieving production loading speeds of over four terabytes per hour with negligible impact on concurrent database operations. The SG Streaming technology is available immediately and is a standard feature of the Greenplum Database.

Greenplum utilizes a "parallel-everywhere" approach to loading in which data flows from one or more source systems to every node of the database without any sequential choke points. This differs from traditional "bulk loading" technologies that push data from a single source, often over a single channel or a small number of parallel channels, which can result in bottlenecks and longer load times. Greenplum's approach also avoids the need for a "loader" tier of servers that can add significant complexity and cost while effectively bottlenecking the bandwidth and parallelism of communication into the database.

Greenplum's SG Streaming technology ensures parallelism by "scattering" data from all source systems across hundreds or thousands of parallel streams that simultaneously flow to all nodes of the Greenplum Database. Performance scales with the number of Greenplum Database nodes, and the technology supports both large batch and continuous near-real-time loading patterns with negligible impact on concurrent database operations. Data can be transformed and processed in-flight, utilizing all nodes of the database in parallel, for extremely high-performance ELT (extract-load-transform) and ETLT (extract-transform-load-transform) loading pipelines. Final "gathering" and storage of data to disk takes place on all nodes simultaneously, with data automatically partitioned across nodes and optionally compressed. This technology is exposed to the DBA via a flexible and programmable "external table" interface and a traditional command-line loading interface.
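
The scatter and gather steps described above can be illustrated with a small sketch. This is not Greenplum's implementation: the function names, the CRC32 hash used to pick a node, the four-node count, and the in-memory "streams" are all assumptions chosen for illustration. It only shows the general pattern of splitting every source into per-node streams and then merging, and optionally compressing, those streams on all nodes at once.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 4  # hypothetical number of database nodes


def node_for(key, num_nodes=NUM_NODES):
    """Choose a target node by hashing the row key (CRC32 is stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def scatter(source_rows, num_nodes=NUM_NODES):
    """Split one source's (key, value) rows into one stream per node."""
    streams = [[] for _ in range(num_nodes)]
    for key, value in source_rows:
        streams[node_for(key, num_nodes)].append((key, value))
    return streams


def gather(node_id, streams_for_node, compress=False):
    """Merge every stream sent to one node and optionally compress the stored bytes."""
    rows = [row for stream in streams_for_node for row in stream]
    payload = "\n".join(f"{k},{v}" for k, v in rows).encode("utf-8")
    stored = zlib.compress(payload) if compress else payload
    return node_id, rows, stored


def load(sources, num_nodes=NUM_NODES, compress=False):
    """Scatter all sources in parallel, then gather on all nodes in parallel."""
    with ThreadPoolExecutor() as pool:
        scattered = list(pool.map(lambda s: scatter(s, num_nodes), sources))
        # Regroup: for each node, collect the stream that each source produced for it.
        per_node = [[s[n] for s in scattered] for n in range(num_nodes)]
        results = list(
            pool.map(lambda n: gather(n, per_node[n], compress), range(num_nodes))
        )
    return {node_id: rows for node_id, rows, _ in results}


# Two hypothetical source systems feeding the same load.
sources = [[("a", 1), ("b", 2)], [("c", 3), ("a", 4)]]
placement = load(sources, compress=True)
```

In this sketch no single process ever handles all of the data: each source scatters independently and each node gathers independently, which is the property the article attributes to the "parallel-everywhere" approach.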

Ben Werther, director of product management for Greenplum, tells 5 Minute Briefing that "a standard configuration would include two master nodes and numerous segment nodes. Users create query plans on the master nodes, and use them to coordinate work among the segment nodes. The segment nodes are where all of the actual processing work occurs and data loading goes straight to these nodes, bypassing the master nodes so that there are no bottlenecks." For more information about the Greenplum database, go here.
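
The division of labor Werther describes can be sketched in a few lines. Everything here is hypothetical: the `Master` and `Segment` classes, the `load_direct` function, and the round-robin placement are illustrative assumptions, not Greenplum's API or its actual data distribution, and the sketch uses a single master rather than the two mentioned in the quote. It shows only the shape of the design: the master coordinates queries, while loaded rows go straight to the segments.

```python
class Segment:
    """Holds a slice of the data and does the actual processing."""

    def __init__(self, segment_id):
        self.segment_id = segment_id
        self.rows = []

    def load(self, rows):
        self.rows.extend(rows)

    def execute(self, step):
        # Run one step of a plan against this segment's local rows only.
        return step(self.rows)


class Master:
    """Coordinates work among segments; stores no table data itself."""

    def __init__(self, segments):
        self.segments = segments

    def query(self, per_segment, combine):
        # Send the same step to every segment, then combine the partial results.
        partials = [seg.execute(per_segment) for seg in self.segments]
        return combine(partials)


def load_direct(segments, rows):
    """Send rows straight to the segments (round-robin), never via the master."""
    for i, row in enumerate(rows):
        segments[i % len(segments)].load([row])


segments = [Segment(i) for i in range(3)]
master = Master(segments)
load_direct(segments, list(range(10)))
total_rows = master.query(len, sum)  # count locally on each segment, sum on the master
```

Because `load_direct` takes the segments and not the master, adding segments adds load capacity without putting more traffic through the coordinating node.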