One of the most common ways to store the results of a Spark job is to write them to a Hive table stored on HDFS. In theory, managing the output file count from your jobs should be simple; in practice, it can be one of the more complex parts of your pipeline.

Author: Zachary Ennenga

Background

At Airbnb, our offline data processing ecosystem contains many mission-critical, time-sensitive jobs, so it is essential for us to maximize the stability and efficiency of our data pipeline infrastructure.
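For concreteness, writing a job's results to a Hive table usually amounts to a single saveAsTable call on a DataFrame, and the number of files produced is driven by the partitioning of the final stage of that write. The sketch below illustrates this, assuming a Spark session with Hive support enabled; the database and table names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object WriteToHiveExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical session setup; in a real deployment this would point
    // at your cluster's Hive metastore and HDFS warehouse location.
    val spark = SparkSession.builder()
      .appName("write-to-hive-example")
      .enableHiveSupport()
      .getOrCreate()

    // Stand-in for the real output of a job.
    val results = spark.range(0, 1000).toDF("id")

    // Each task in the final stage writes one file, so the partitioning
    // of `results` at this point determines the output file count.
    results.write
      .mode("overwrite")
      .saveAsTable("example_db.example_results")

    spark.stop()
  }
}
```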