The entire purpose of Spark is to efficiently distribute and parallelize work, and because of this, it can be easy to miss places where applying additional parallelism on top of Spark can increase the efficiency of your application.

Spark operations can be broken into Actions and Transformations. Spark Transformations — like a map or a filter — are lazy, and simply help Spark build an execution plan for when you eventually execute an Action.
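To make the laziness concrete, here is a minimal PySpark sketch (the data and lambdas are purely illustrative): each Transformation returns immediately without submitting any work to the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rdd = spark.sparkContext.parallelize(range(1_000_000))
doubled = rdd.map(lambda x: x * 2)            # lazy: only extends the plan
evens = doubled.filter(lambda x: x % 4 == 0)  # lazy: still no job submitted
```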

Actions are blocking operations that actually perform distributed computation. These include things like count or collect, as well as any sort of saving/serialization operation.
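Continuing the sketch above, calling an Action like count finally submits a job and blocks the driver until it finishes. Because Actions block, one common way to apply additional parallelism on top of Spark is to issue independent Actions from separate driver threads; this is a pattern rather than an official API, and the table names below are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

total = evens.count()  # an Action: a Spark job actually runs here

# Independent Actions issued concurrently keep the cluster busy while
# each one blocks only its own driver thread.
def count_rows(table_name):
    return spark.table(table_name).count()

with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(count_rows, ["db.events", "db.listings"]))
```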

In…


One of the most common ways to store results from a Spark job is by writing the results to a Hive table stored on HDFS. While in theory managing the output file count from your jobs should be simple, in reality it can be one of the more complex parts of your pipeline.
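As a preview of the kind of strategy discussed below, one blunt but common way to bound the output file count is to coalesce the DataFrame to a fixed number of partitions just before the write. This is a hedged sketch, and the table names and partition count are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.table("db.raw_events")

# Spark writes (at least) one file per partition of the final stage, so
# bounding the partition count bounds the file count, at the cost of
# write parallelism.
df.coalesce(64).write.mode("overwrite").saveAsTable("db.events_compacted")
```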

Author: Zachary Ennenga

Airbnb’s new office building, 650 Townsend

Background

At Airbnb, our offline data processing ecosystem contains many mission-critical, time-sensitive jobs — it is essential for us to maximize the stability and efficiency of our data pipeline infrastructure.

So, when we encountered a recurring issue a few months back that caused significant outages of our data warehouse, it quickly became imperative that we understand and solve the root cause. We traced the outages back to a single job that, unintentionally and unexpectedly, wrote millions of files to HDFS.

Thus, we began to investigate the various strategies that can be used to manage our Spark…


There is often a hidden performance cost tied to the complexity of data pipelines — overhead. In this post, we will introduce the concept and examine the techniques we use to avoid it in our data pipelines.

Author: Zachary Ennenga

The view from the third floor at Airbnb HQ!

Background

There is often a natural evolution in the tooling, organization, and technical underpinning of data pipelines. Most data teams and data pipelines are born from a monolithic collection of queries. As the pipeline grows in complexity, it becomes sensible to leverage the Java or Python Spark libraries and implement your MapReduce logic in code, rather than in raw queries (see the sketch below). The monolith is broken down, and you trade complexity in orchestration for simplicity in logic. Your one monolithic job becomes a dozen beautiful, tightly scoped steps structured into some sort of dependency graph.
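As a hypothetical illustration of that shift, here is the same filter-and-aggregate step expressed first as a raw query and then as one tightly scoped step in code; the table, column, and date values are made up.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Inside the monolith: one fragment of a large raw query.
nightly = spark.sql("""
    SELECT dim_market, COUNT(*) AS bookings
    FROM db.bookings
    WHERE ds = '2019-01-01'
    GROUP BY dim_market
""")

# As a tightly scoped step in code: easier to test, reuse, and wire
# into a dependency graph.
nightly = (
    spark.table("db.bookings")
         .where(F.col("ds") == "2019-01-01")
         .groupBy("dim_market")
         .agg(F.count("*").alias("bookings"))
)
```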

However, orchestration complexity…


And you thought I was joking.

Humble Beginnings

Once upon a time, in my youth, I was enamored with a game called Scorched Earth.

The premise of the game was simple: given two variables — angle and power — you would launch an increasingly destructive arsenal of missiles and weapons at your opponents in an attempt to destroy them and, if you played anything like me, yourself, as collateral damage.

The game, in its final v1.5 release, had two sorts of terrain: randomly generated single-color hills and valleys, and what were called “scanned mountains”.

Author: Zachary Ennenga
