Apache Spark is a cluster computing engine, essentially an alternative computation model to MapReduce for executing jobs across large clusters. Spark's scheduler stores pending work in a number of arrays, to keep track of which work is available to be executed where in the cluster. Spark 1.6 introduced a bug that meant that any time Spark added a new task to one of these arrays, it would first exhaustively search the array to ensure it wasn't re-adding a duplicate. Since the arra...
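To make the cost concrete, here is a minimal sketch of the pattern described above. This is plain Scala, not Spark's actual scheduler code, and the names (`pendingTasks`, `addPendingTask`) are illustrative: the point is simply that a linear duplicate scan performed before every append turns enqueuing n tasks into an O(n²) operation.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for one of the scheduler's pending-task arrays.
// The names here are illustrative, not Spark's real identifiers.
object PendingTasksSketch {
  val pendingTasks = ArrayBuffer.empty[Int]

  // Each add first scans the whole buffer looking for a duplicate.
  // The scan is O(n), so adding n tasks costs O(n^2) in total.
  def addPendingTask(taskIndex: Int): Unit = {
    if (!pendingTasks.contains(taskIndex)) { // linear search over the buffer
      pendingTasks += taskIndex
    }
  }

  def main(args: Array[String]): Unit = {
    (0 until 100000).foreach(addPendingTask) // quadratic in the number of tasks
    println(s"queued ${pendingTasks.length} tasks")
  }
}
```

With a handful of tasks the scan is invisible, but at the scale Spark schedules work (hundreds of thousands of tasks per stage) the repeated `contains` calls come to dominate the scheduler's time.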