The entire purpose of Spark is to efficiently distribute and parallelize work, and because of this, it can be easy to miss places where applying additional parallelism on top of Spark can increase the efficiency of your application.
Spark operations can be broken into actions and transformations. Spark transformations — like a
map or a
filter — are lazy, and simply help Spark build an execution plan for when you eventually execute an action.
Actions are blocking operations that actually perform distributed computation. These include things like
count and collect , as well as any sort of saving/serialization operation.
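The laziness of transformations can be illustrated without a Spark cluster at all. Plain Scala's LazyList behaves analogously (this is an analogy, not Spark itself): defining a chain of map and filter computes nothing, and work only happens when a terminal operation forces it.

```scala
// Track how many elements the "pipeline" actually evaluates.
var evaluated = 0

// Building the chain is free: like Spark transformations, map and
// filter on a LazyList only describe work, they don't perform it.
val plan = LazyList.from(1).map { x => evaluated += 1; x * 2 }.filter(_ > 4)

assert(evaluated == 0) // nothing has run yet: we only have a "plan"

// The terminal operation plays the role of an action: it forces
// evaluation, and only of as many elements as are needed.
val result = plan.take(3).toList
```

The same shape holds in Spark: `load().map(...).filter(...)` returns instantly, and the cluster does no work until an action such as a count or a save is invoked.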
In some cases your job may contain only a single action, or all of its actions may depend on one another; often, though, jobs contain multiple independent actions.
Let’s look at some common patterns:
First, perhaps you need to split your data into multiple subsets, and save them to unique tables
val data = load().map(...).cache()
val subset1 = data.filter(...)
val subset2 = data.filter(...)
Table1.save(subset1) // <- Blocks
Table2.save(subset2) // <- Also Blocks
Or perhaps you’re not splitting a single dataset, but your job just results in multiple unique datasets:
val (data1, data2, data3) = processData(load())
Table1.save(data1) // <- Blocks
Table2.save(data2) // <- Blocks Again
Table3.save(data3) // <- Still Blocks
Maybe you’re calculating some metrics, as well as saving the underlying data:
val data = load().map(...).cache()
val count = data.count() // <- Blocks
MetricsTable.save(count) // <- Blocks
Table.save(data) // <- Blocks
The point is that with a traditional application structure, Spark will only process one of these actions at a time, even though there may be few or no direct dependencies between them.
Why does this matter?
When considering the performance characteristics of a data pipeline, two quantities come to mind: first, runtime, and second, efficiency.
While runtime is usually pretty straightforward to measure and understand, efficiency…