The entire purpose of Spark is to efficiently distribute and parallelize work, and because of this, it can be easy to miss places where applying additional parallelism on top of Spark can increase the efficiency of your application.
Spark operations can be broken into Actions and Transformations. Spark Transformations, like a map or a filter, are lazy, and simply help Spark build an execution plan for when you eventually execute an Action.
Actions are blocking operations that actually perform distributed computation. These include things like count or collect, as well as any sort of saving/serialization operation.
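To make the lazy/blocking split concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the LazyPipeline class and its methods are hypothetical stand-ins that only mimic how transformations record a plan while actions execute it. The last few lines illustrate, under the same assumptions, why blocking actions leave room for extra parallelism: independent actions can be issued from separate threads.

```python
from concurrent.futures import ThreadPoolExecutor


class LazyPipeline:
    """Toy stand-in for a Spark dataset (hypothetical, not a Spark API)."""

    def __init__(self, source, plan=()):
        self._source = source
        self._plan = plan  # recorded steps; nothing has run yet

    # "Transformations": only extend the plan and return a new pipeline.
    def map(self, fn):
        return LazyPipeline(self._source, self._plan + (("map", fn),))

    def filter(self, pred):
        return LazyPipeline(self._source, self._plan + (("filter", pred),))

    def _run(self):
        # Apply the recorded steps, in order, to the source items.
        items = iter(self._source)
        for kind, fn in self._plan:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # "Actions": block until the plan has actually been executed.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def double(x):
    calls.append(x)  # side effect lets us observe when work really happens
    return x * 2


pipeline = LazyPipeline(range(5)).map(double).filter(lambda x: x > 2)
work_done_before_action = len(calls)  # still 0: only a plan exists so far
result = pipeline.collect()           # the action triggers the computation

# Because actions block the caller, independent ones can be submitted
# from separate threads instead of being run one after another.
with ThreadPoolExecutor(max_workers=2) as pool:
    evens = pool.submit(LazyPipeline(range(10)).filter(lambda x: x % 2 == 0).count)
    squares = pool.submit(LazyPipeline(range(4)).map(lambda x: x * x).collect)
    concurrent_results = (evens.result(), squares.result())
```

In a real Spark application the same pattern would wrap genuine actions on real datasets; whether that helps depends on how much spare cluster capacity each individual action leaves idle.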
Author: Zachary Ennenga
At Airbnb, our offline data processing ecosystem contains many mission-critical, time-sensitive jobs — it is essential for us to maximize the stability and efficiency of our data pipeline infrastructure.
So when, a few months back, we encountered a recurring issue that caused significant outages of our data warehouse, it quickly became imperative that we understand and solve the root cause. We traced the outage back to a single job that, unintentionally and unexpectedly, wrote millions of files to HDFS.
Thus, we began to investigate the various strategies that can be used to manage our Spark…
Author: Zachary Ennenga
There is often a natural evolution in the tooling, organization, and technical underpinnings of data pipelines. Most data teams and data pipelines are born from a monolithic collection of queries. As the pipeline grows in complexity, it becomes sensible to leverage the Java or Python Spark libraries and implement your MapReduce logic in code, rather than in raw queries. The monolith is broken down, and you trade complexity in orchestration for simplicity in logic. Your one monolithic job becomes a dozen beautiful, tightly scoped steps structured into some sort of dependency graph.
However, orchestration complexity…
Once upon a time, in my youth, I was enamored with a game called Scorched Earth.
The premise of the game was simple: given two variables, angle and power, you would launch an increasingly destructive arsenal of missiles and weapons at your opponents in an attempt to destroy them and, if you played anything like me, yourself as collateral damage.
The game, in its final v1.5 release, had two sorts of terrain: randomly generated single-color hills and valleys, and what were called “scanned mountains”.