Scaling the Web (2.0)

We often see that while software consortiums and commercial giants plug away at expensive, architecture-heavy solutions that promise to deliver the future of Web applications, the open source community beats them to it with more pragmatic, cost-effective solutions. The success and prevalence of Web 2.0 mashup applications compared to the relative paucity of standards-based Web services applications is just one example of this.

Web 2.0 applications rely heavily on open source software (OSS). MySQL is more common than Oracle, Apache more common than Microsoft Internet Information Server and Linux more common than Windows for Web 2.0 applications such as Facebook and Digg. It’s not hard to see why: If OSS can deliver adequate performance and manageability, then the reduction in software licensing fees is almost irresistible, especially if you are a cash-starved startup with dreams of scaling to infinity and beyond.

However, scale-up is the critical issue for these applications. Most Web 2.0 startups will never get beyond the load that can be satisfied by a single server, but the ones that succeed will need to scale to potentially millions of users within a few months. So these applications have to be able to both start off small and grow without limit.

Commercial vendors such as Microsoft and Oracle have invested heavily in scaling and offer sophisticated solutions, many of which require additional software licensing fees. For instance, Oracle’s RAC database and clustered application server can in theory offer near-linear scaling as new servers are added. However, the implementation complexity and licensing overheads remain prohibitive for most start-ups.

Consequently, OSS has developed software solutions and deployment patterns that allow scaling and performance optimization without requiring these commercial solutions. Two of the most significant technologies and techniques are memcached and sharding.

Memcached is an OSS utility that provides a distributed object cache. Modern applications typically manipulate application data as objects which are derived from the underlying database tables. Memcached allows these objects to be cached across many servers, reducing the load on the database. Objects are mapped to specific memcached servers using a hashing function, so that the key for the object effectively determines the memcached server that should have that object. If the server has already cached the object, then a database read is avoided. Database reads are typically an order of magnitude slower than a network reads, so this improves performance as well as reducing the load on the database.

Memcached can reduce the load on the database and–at smaller load levels–might allow a single database server to satisfy application demand. However, as the application scales up, eventually a single database will become the limiting factor. When this happens, we need a way of distributing the load across multiple database servers.

The most common OSS solution to database scaling is sharding. In a sharded application, the largest datasets are partitioned across multiple nodes. This partitioning is again based on some key value, such as a user ID. When seeking a particular record, the application must calculate the shard that will contain the data and send the query to the appropriate server. Sharding is usually coupled with replication, so that each shard has a single master but potentially multiple read-only copies. Sharding works well with memcached, since a data abstraction layer can handle both the shard access logic and the memcached management.

The sharding/memcached solution is used by many of the most high-profile Web 2.0 applications such as Flickr, YouTube, Facebook and Digg. It requires some significant setup, but of course there’s no easy way to scale up to the levels achieved by these sites. Sharding and memcached aren’t suitable for all applications, but they are established as a best-practice architecture for OSS-based Web 2.0 applications.