Testing strategy for Apache Spark jobs – Part 1 of 2
Like any other application, Apache Spark jobs deserve good testing practices and coverage.
Indeed, the cost of running jobs against production data makes unit testing a must-do: it gives you a fast feedback loop and lets you discover errors earlier.
But because of its distributed nature and the RDD abstraction on top of the data, Spark requires special care for testing.
In this post, we’ll explore how to design your code for testing, how to set up a simple unit test for your job logic, and how the spark-testing-base library can help.
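As a taste of the "design for testing" idea, one common approach is to keep the per-record transformation logic in plain functions, separate from the Spark plumbing, so it can be unit-tested without spinning up a Spark context at all. The sketch below assumes a hypothetical job parsing "user,action,count" lines; the function names are illustrative, not from any library:

```python
def parse_event(line):
    """Parse a comma-separated "user,action,count" line into a tuple."""
    user, action, count = line.split(",")
    return (user, action, int(count))

def is_valid(event):
    """Keep only events with a strictly positive count."""
    return event[2] > 0

# In the real job these functions would be handed to Spark, e.g.:
#   rdd.map(parse_event).filter(is_valid)
# but because they are plain Python, they can be tested directly
# on in-memory data, with no cluster and no SparkContext:
events = [parse_event(line) for line in ["alice,click,3", "bob,view,0"]]
valid = [e for e in events if is_valid(e)]
```

Testing the driver-side wiring itself (the `map`/`filter` calls, joins, aggregations) still requires a real `SparkContext`, which is where the techniques covered next come in.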