Second Order Parallelism in Spark-based Data Pipelines

Zachary Ennenga
May 24, 2021

The entire purpose of Spark is to efficiently distribute and parallelize work, and because of this, it can be easy to miss places where applying additional parallelism on top of Spark can increase the efficiency of your application.

Spark operations can be broken into Actions and Transformations. Transformations, like a map or a filter, are lazy: they simply help Spark build an execution plan for when you eventually execute an Action.

Actions are blocking operations that actually perform distributed computation. These include operations like count or repartition, as well as any sort of saving/serialization operation.
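For instance, a chain of transformations does nothing on its own until an action forces it to run. A minimal sketch, assuming `spark` is an active SparkSession with its implicits imported:

```scala
// map and filter are lazy transformations: nothing is computed yet,
// Spark only records them in the execution plan
val evens = spark.range(1000000)
  .map(_ * 2)
  .filter(_ % 4 == 0)

// count() is an action: it triggers the distributed job and blocks
// the driver until the result comes back
val total = evens.count()
```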

In some cases, your job may contain only one action, or all your actions may depend on one another. Often, however, jobs contain multiple independent actions.

Let’s look at some common patterns:

First, perhaps you need to split your data into multiple subsets and save them to unique tables:

val data = load().map(...).cache()
val subset1 = data.filter(...)
val subset2 = data.filter(...)
subset1.write.saveAsTable(...) // <- Blocks
subset2.write.saveAsTable(...) // <- Also Blocks

Or perhaps you’re not splitting a single dataset, but your job just results in multiple unique datasets:

val (data1, data2, data3) = processData(load())
data1.write.saveAsTable(...) // <- Blocks
data2.write.saveAsTable(...) // <- Blocks Again
data3.write.saveAsTable(...) // <- Still Blocks

Maybe you’re calculating some metrics, as well as saving the underlying data:

val data = load().map(...).cache()
val count = data.count()       // <- Blocks
data.write.saveAsTable(...)    // <- Blocks

The point is that with a traditional application structure, Spark will process only one of these actions at a time, even though there may be few, or no, direct dependencies between them.
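One way to submit such independent actions concurrently is to wrap each blocking action in a Scala Future, so the driver issues them to Spark's scheduler at the same time. This is a sketch, not the article's exact approach: `load()` and the filter predicates stand in for your own logic, the table names are hypothetical, and it assumes the actions really are independent of one another:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val data = load().map(...).cache()

// Each Future calls a blocking action from its own driver thread,
// so both Spark jobs can be scheduled concurrently
val save1 = Future { data.filter(...).write.saveAsTable("subset1") }
val save2 = Future { data.filter(...).write.saveAsTable("subset2") }

// Wait for both actions to finish before the driver exits
Await.result(Future.sequence(Seq(save1, save2)), Duration.Inf)
```

Note that caching matters here: without cache(), each concurrent action would recompute the shared upstream transformations independently.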

Why does this matter?

Quantifying Efficiency

When considering the performance characteristics of a data pipeline, two quantities come to mind: first, runtime, and second, efficiency.

While runtime is usually pretty straightforward to measure and understand, efficiency…