Second Order Parallelism in Spark-based Data Pipelines
The entire purpose of Spark is to efficiently distribute and parallelize work, and because of this, it can be easy to miss places where applying additional parallelism on top of Spark can increase the efficiency of your application.
Spark operations can be broken into Actions, and Transformations. Spark Transformations — like a map
or a filter
— are lazy, and simply help spark build and execution plan for when you eventually execute an Action.
Actions are blocking operations that actual perform distributed computation. These include things like count
or repartition
, as well as any sort of saving/serialization operation.
In some cases, your job may only contain one action, or all your actions might be dependent on one another, however, often, jobs contain multiple, independent actions.
Let’s look at some common patterns:
First, perhaps you need to split your data into multiple subsets, and save them to unique tables
val data = load().map(...).cache()val subset1 = data.filter(...)
val subset2 = data.filter(...)Table1.save(subset1) # <- Blocks
Table2.save(subset2) # <- Also Blocks
Or perhaps you’re not splitting a single dataset, but your job just results in multiple unique datasets:
val (data1, data2, data3) = processData(load())Table1.save(data1) # <- Blocks…