Second Order Parallelism in Spark-based Data Pipelines
The entire purpose of Spark is to distribute and parallelize work efficiently. Because of this, it is easy to miss places where adding parallelism on top of Spark can make your application more efficient.
Spark operations can be divided into Actions and Transformations. Transformations, like a map or a filter, are lazy: they simply help Spark build an execution plan for when you eventually execute an Action.
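To make the lazy-versus-eager distinction concrete, here is a small plain-Python analogy. It is not Spark code: it relies on the fact that Python's built-in map and filter return lazy iterators, which behave roughly like Transformations, while list() plays the role of an Action. The names double, calls, and data are made up for illustration.

```python
# Plain-Python analogy (not Spark): map() and filter() return lazy iterators,
# so nothing runs until something consumes them.
calls = []  # records each time the mapping function actually runs


def double(x):
    calls.append(x)
    return x * 2


data = [1, 2, 3, 4, 5]

# "Transformations": these only describe the work to be done.
doubled = map(double, data)
large = filter(lambda x: x > 4, doubled)

calls_before_action = len(calls)  # nothing has executed yet

# "Action": consuming the iterator forces the whole chain to run.
result = list(large)
calls_after_action = len(calls)

print(calls_before_action, result, calls_after_action)
```

In Spark the same shape would be a chain such as rdd.map(...).filter(...) followed by an Action like collect(). The key difference is that Spark uses the deferred chain to build a distributed execution plan, whereas this sketch runs in a single process.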