Storage Architecture for the Petabyte Era

IBM reports that nearly 15 petabytes of data are created every day, eight times the amount of information stored in all of the libraries in the U.S. The explosion of big data and cloud services is driving the development of the new storage architectures required to store unprecedented quantities of information. It is becoming increasingly clear that even a linear growth trajectory for storage is insufficient to deliver the capacity needed for data produced by the Internet of Things. Current architectures have bottlenecks that, while merely inconvenient for legacy data today, are simply untenable at the scale of storage needed tomorrow.

To keep up, enterprises are deploying web-scale architectures that enable virtualization, compute and storage functionality on a vast scale.

Enabling the Data to Flow

A significant part of the allure of web-scale design is that it is built to remove bottlenecks from the storage architecture. A bottleneck that functions as a single point of entry can become a single point of failure, especially with the demands of cloud computing on big data storage. Adding redundant, expensive, high-performance components to alleviate the bottleneck, as most service providers presently do, quickly adds cost and complexity to a system. By contrast, a horizontally scalable web-scale system designed to distribute data among all nodes makes it possible to choose cheaper, lower-energy hardware.
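To make the idea of distributing data among all nodes concrete, here is a minimal sketch in Python. It uses rendezvous (highest-random-weight) hashing, which is one common placement technique and not necessarily what any particular web-scale product uses. The node names, object keys and the node_for function are all hypothetical.

```python
import hashlib
from collections import Counter


def node_for(key: str, nodes: list[str]) -> str:
    """Pick the owner of a key with rendezvous hashing (an illustrative choice)."""
    def weight(node: str) -> int:
        # Hash the (node, key) pair; the node with the largest hash owns the key.
        digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(nodes, key=weight)


# Hypothetical cluster of four commodity nodes and 10,000 stored objects.
nodes = [f"node-{i}" for i in range(4)]
keys = [f"object-{i}" for i in range(10_000)]

# Count how many objects land on each node.
placement = Counter(node_for(k, nodes) for k in keys)
print(placement)
```

Because every client can compute the owner of a key on its own, no single machine has to sit in front of the data as a point of entry. With this scheme, adding a fifth node relocates only the objects that the new node takes over, which is what keeps scaling out cheap.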

A PDF of the Fall 2015 issue of Big Data Quarterly magazine is available for download now.

Cloud providers, which must manage far more users and greater performance demands than enterprises do, are keen to solve performance problems like data bottlenecks. While the average user of an enterprise system demands high performance, these systems typically have fewer users, and those users can access their files directly through the local network. Furthermore, enterprise users typically access, send and save relatively small files such as documents and spreadsheets, which use less storage capacity and put less load on the system.

Outside the enterprise, though, cloud users face a different scenario. The system is accessed simultaneously over the Internet by an order of magnitude more users, which itself becomes a performance bottleneck. The cloud provider’s storage system not only has to scale with each additional user, but must also maintain performance across the aggregate of all users. Significantly, the average cloud user accesses and stores far larger files – music, photo and video files – than the average enterprise user does. Web-scale architectures are designed to prevent the bottlenecks that this volume of usage causes in legacy storage setups.

Ending Hardware Reliance

For organizations to get the performance they want while being able to scale as needed, web-scale architecture must be built on software exclusively, with no reliance on any particular hardware. Since hardware inevitably fails (at a number of points within the machine), traditional appliances – storage hardware that has proprietary software built in – typically include multiple copies of expensive components to anticipate and prevent failure. These extra layers of identical hardware exact higher costs in energy usage and add layers of complication to a single appliance. Because the actual cost per appliance is quite high compared with commodity servers, cost estimates often skyrocket when companies begin examining how to scale out their data centers. One way to avoid this is to use software-defined vNAS or vSAN in a hypervisor environment, both of which offer a way to build out servers at a web-scale rate.

Going Global

Because there are now ways to improve performance at the software level that neutralize the performance advantage of a centralized data storage approach, distributed storage presents the best way to build at web-scale levels.

Because customers need to access cloud services from anywhere in the world, service providers must be able to offer data centers located across the globe to minimize load time. With global availability, however, come a number of challenges. Load is handled by the data center in a company’s own region, so new data first lands in one location. This creates a problem, since all data stored in all locations must be in sync. From an architecture point of view, it’s important to solve these problems at the storage layer instead of up at the application layer, where they become more difficult and complicated to solve.
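As a rough illustration of keeping locations in sync at the storage layer, the Python sketch below gives each region a replica and reconciles two replicas by letting the higher version number win. This is a deliberately simple assumption for the sake of the example: real systems need a more careful rule for writes made to the same object in two regions at once. The Replica class, the sync function, and the region and file names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One region's copy of the data, stored as key -> (version, value)."""
    region: str
    data: dict = field(default_factory=dict)

    def put(self, key, value):
        # Each local write bumps the version of that key by one.
        version = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (version, value)

    def get(self, key):
        return self.data[key][1]


def sync(a: Replica, b: Replica) -> None:
    """Bring two replicas into agreement; the higher version wins (ties favor a)."""
    for key in set(a.data) | set(b.data):
        entry_a = a.data.get(key, (0, None))
        entry_b = b.data.get(key, (0, None))
        newest = max(entry_a, entry_b, key=lambda entry: entry[0])
        a.data[key] = newest
        b.data[key] = newest


eu, us = Replica("eu"), Replica("us")
eu.put("report.doc", "draft 1")   # written in the European data center
sync(eu, us)                      # the storage layer propagates it
us.put("report.doc", "draft 2")   # later edited in the U.S. data center
sync(eu, us)
print(eu.get("report.doc"))       # prints "draft 2"
```

The application only ever calls put and get. Reconciliation happens underneath it, which is the point of solving the problem at the storage layer.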

In addition, international data centers need to plan for contingencies, such as local power outages that would put a local server farm offline. If a local data center or server goes down, global data centers must reroute data quickly to available servers to minimize downtime. While there are certainly solutions today that address these problems, they do so at the application layer. Attempting to solve these issues that high up in the hierarchy of data center infrastructure – instead of solving them at the storage level – presents significant cost and complexity disadvantages. Solving these issues directly at the storage level through web-scale architectures delivers significant savings in time and cost, along with greater efficiency.
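The rerouting described above can be sketched in a few lines of Python. The example assumes every data center already holds a copy of the object and simply tries them in order of preference, skipping any that are offline. The DataCenter class, the read_with_failover function, and the site and file names are hypothetical.

```python
class DataCenter:
    """A hypothetical data center holding a copy of each stored object."""

    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.objects = {}

    def read(self, key):
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        return self.objects[key]


def read_with_failover(key, centers):
    """Try data centers in preference order, skipping any that are offline."""
    for center in centers:
        try:
            return center.read(key), center.name
        except ConnectionError:
            continue
    raise RuntimeError("no data center available")


local = DataCenter("stockholm")
remote = DataCenter("virginia")
for dc in (local, remote):
    dc.objects["video.mp4"] = b"payload"

local.online = False   # simulate a local power outage
value, served_by = read_with_failover("video.mp4", [local, remote])
print(served_by)       # prints "virginia"
```

When this logic lives in the storage layer, applications do not each need their own retry and rerouting code.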

Flexibility to Move with the Market

The trend in storage is toward more cost-effective, more flexible options. Having an expansive, rigid network environment locked into configurations determined by an outside vendor severely curtails an organization’s ability to react nimbly to market demands, much less anticipate them. Web-scale storage philosophies enable major enterprises to “future proof” their data centers. Since the hardware and the software are separate investments, either may be switched out for a better, more appropriate option as the market dictates, at minimal cost.

Storage for the Future

Enterprise data is doubling every 3 years. This level of accelerated storage demand is turning web-scale architecture into a business necessity. Freed from the hardware that causes bottlenecks and unacceptable costs, organizations can capitalize on new approaches including software-defined storage and hyper-converged infrastructures to scale as needed, integrating virtualization components as well. This flexibility increases performance and provides the storage approach that will enable enterprises to meet data-heavy needs both today and tomorrow.
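To see what doubling every three years means for capacity planning, the short Python sketch below projects storage needs forward. The 100 TB starting point is an arbitrary figure chosen for illustration, and projected_capacity is a hypothetical helper.

```python
def projected_capacity(current_tb: float, years: float, doubling_years: float = 3.0) -> float:
    """Capacity after `years` if demand doubles every `doubling_years`."""
    return current_tb * 2 ** (years / doubling_years)


# Doubling every three years is equivalent to roughly 26% growth per year.
annual_growth = 2 ** (1 / 3) - 1

# A hypothetical 100 TB estate becomes 800 TB after nine years (three doublings).
nine_year_need = projected_capacity(100, 9)
print(f"{annual_growth:.1%} per year, {nine_year_need:.0f} TB after nine years")
```

Growth of this kind compounds, so a storage budget that assumes a fixed amount of added capacity each year falls further behind with every cycle.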


