Spark and the Fine Art of Caching

If there is a “bottom line” to measuring the effectiveness of your big-data applications, it’s arguably performance, or how quickly those apps can finish the jobs they run. 

Let’s consider Spark.  Spark is designed for in-memory processing in a vast range of data processing scenarios. Data scientists use Spark to build and verify models. Data engineers use Spark to build data pipelines. In both of these scenarios, Spark achieves performance gains by caching the results of operations that repeat over and over again, then discards these caches once it is done with the computation.

We will use Spark as the example in this article because it offers robust support for caching:

  • Spark has made application-level caching a first-class construct and exposed it to application programmers.  That’s not the case with popular languages such as SQL, which do not have first-class constructs for caching.
  • Application programmers can express complex computations in Spark, and can also mix-and-match SQL with functions. Adding caching to the mix enables a powerful combination of expressiveness and performance. 

Caching: One Technique for Speeding up Spark Applications

Just like other big-data applications, Spark applications are sensitive to a number of factors that affect your ability to achieve the best performance.  And those factors can lie all over the stack. Performance is sensitive to application code, configuration settings, data layout and storage, multi-tenancy, resource allocation and elasticity in cloud deployments like Amazon EMR, Microsoft Azure, Google Dataproc, Qubole, etc.

Therefore, understanding the behavior of an application requires a full understanding of all the factors that could affect its performance – and realizing that you may not have rights to change the application code, or the resource allocation for the application. 

One technique, caching, gets a lot of attention and discussion among big-data practitioners, namely because the decision of what data and how much data to cache, given finite memory available, can make a world of difference in the performance you achieve. 

Caching is an optimization technique for iterative and interactive computations. Caching helps in saving interim, partial results so they can be reused in subsequent stages of computation. These interim results are stored as RDDs (Resilient Distributed Datasets) and are kept either in memory (by default) or on disk.  Data cached in memory is faster to access, of course, but cache space is always limited. 

Say, for example, you have a Spark job running on 10 nodes and it finishes in one hour today. But your boss needs it to finish in under 5 minutes. How would you go about that?  If you cache all of your RDDs, you will soon run out of memory.  Complicating matters, not every time you cache a dataset will you improve performance. So, you need to consider where you can get the most bang for the buck before you commit to caching an RDD.

Caching can provide a 10X to 100X improvement in performance.  It might be your best option, especially when the alternative – such as adding five more nodes that could each handle a number of RDDs – is entirely out of the question.  

Caching is an Art

Because caching requires a consideration for number of nodes available, the relative priority of each RDD, and the amount of memory available for caching, it often requires a good deal of trial-and-error.  And the more RDDs you have to consider, all the more complex this method becomes. 

That’s where automation plays a role.  A tool that provides full-stack performance intelligence can tell you if caching is a viable option in any particular case, and, if it would, how you could know what to cache to get maximum performance.  It can tell you the performance you can achieve if you had five additional servers available, and whether adding those five servers to the mix makes sense.  

There are generally three factors to consider in tuning memory usage: the amount of memory used by your objects, the cost of accessing those objects, and the overhead of “garbage collection” (if you have high turnover in terms of objects).  Automating performance management across the stack takes these factors into consideration, relieving you of the burden of manually tuning your system.

Applications for Caching in Spark

Caching is recommended in the following situations:

  • For RDD re-use in iterative machine learning applications
  • For RDD re-use in standalone Spark applications
  • When RDD computation is expensive, caching can help in reducing the cost of recovery in the case one executor fails

Caching in Spark is usually performed for derived (or computed) data as opposed to raw data that exists as-is on disk. For example, many machine-learning programs run in multiple iterations where some computed dataset is reused in each iteration (while other data is refined in each iteration). In such a case, understanding what data is best to cache may require understanding something about the application rather than simply tracking how data on storage is accessed.

As one example, machine learning, deep learning and other AI techniques, and graph algorithms require data to be accessed iteratively, often many times. The greater number of iterations used in an algorithm, the more times we need to access the data. In some traditional systems, data needs to be read for each iteration. When you “persist” the RDD in Spark, a machine learning algorithm runs faster because it avoids an excessive number of re-computations.

It’s All About Speed

What kind of speedups you can achieve with caching? It's easy to get 10 or even 100-fold speedup, depending on your application. Because of course, instead of re-reading everything from HDFS or S3 and doing your processing again, here you're just reading out of memory, allowing you to skip some extra processing steps.  For complicated pipelines and iterative machine-learning algorithms, there are very big speedups.

It's also important to understand that caching is gradual. Even if your dataset doesn't fit completely in memory – even if just half of that dataset fits in memory – you can get half of the expected speedup.  Here’s why:  Caching is done at “block level.”  An RDD is composed of multiple blocks. If certain RDD blocks are found in the cache, they won’t be re-evaluated.  And so you will gain the time and the resources that would otherwise be required to evaluate an RDD block that is found in the cache.  And, in Spark, the cache is fault-tolerant, as all the rest of Spark. 

However, be aware of one limitation of caching:  Caching also reduces the memory available for the execution engine itself. There may be situations when caching will degrade performance of the application compared with the run when caching was disabled.  This would happen, for example, when the executors are using all the memory resources available and garbage collection kicks in. If the application is not reusing the RDD blocks from the cache, caching may hurt performance, and in certain cases, lead to application failures. In other words, the cost of caching would exceed the benefits.

Spark Functions

In Spark, there are two function calls for caching an RDD: cache() and persist(level: StorageLevel).

The difference among them is that cache() will cache the RDD into memory, whereas persist(level) can cache in memory, on disk, or off-heap memory according to the caching strategy specified by level.  persist() without an argument is equivalent with cache().  Freeing up space from the Storage memory is performed by unpersist().


Consider the situation that some of the block partitions are so large that they will quickly fill up the storage memory used for caching.   This is where an eviction policy is used.  But implementing such a policy can be tricky.  There are various algorithms for determining which data to evict.  One of those is called Least Recently Used (LRU).  With LRU, the catch is that cached partitions may be evicted before actually being re-used.  One strategy, then, is to cache only RDDs that are expensive to re-evaluate and have a modest size, allowing them to fit entirely into memory.

Overall, a caching strategy that caches blocks both in memory and on disk is preferred. For this case, cached blocks evicted from memory are written to disk. Reading the data from disk is relatively fast compared with re-evaluating the RDD.

Driving Towards Full-Stack Performance Intelligence

Earlier in this article I touched on the trial-and-error nature of caching, which results from the many interdependencies to be considered in determining what and how much data to cache.  In a coming article I will address methods for automating performance intelligence across the full stack, removing that trial-and-error element.