- Separate data reading from data processing so each can be tested individually without passing in a SparkSession.
- Split processing into smaller transform steps that can be tested individually.
- Prefer built-in column functions to user-defined functions (UDFs), as the former can be optimised by the query planner.
- It is easier to test a native Scala function than a UDF.
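As a minimal sketch of the last two points (function and test names here are illustrative, not from the repo): keep the row-level logic as a plain Scala function so it can be unit-tested without a SparkSession, and only wrap it in a UDF or column expression at the edge.

```scala
// Plain Scala function holding the business logic.
// No Spark types involved, so no SparkSession is needed to test it.
def normalizeCountry(raw: String): String =
  Option(raw).map(_.trim.toUpperCase).getOrElse("UNKNOWN")

// Ordinary unit-test assertions, no Spark required:
assert(normalizeCountry(" us ") == "US")
assert(normalizeCountry(null) == "UNKNOWN")
```

Only the thin wiring of such a function into a DataFrame (as a UDF or, preferably, as equivalent column operations) still needs Spark, and that wiring is small enough to cover with a handful of integration tests.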
Run the tests with `sbt test`.
- A single SparkSession is reused across tests to avoid its startup cost, but this can cause issues if SQL tables or views are registered in the tests. (See `SparkSessionTestWrapper`.)
- The JVM and test-runner parameters need to be tweaked for Spark. (See `build.sbt`.)
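A typical set of `build.sbt` tweaks for Spark tests looks like the following sketch (the exact memory settings are assumptions to tune for your workload, not values from this repo):

```scala
// Fork a separate JVM for tests so Spark's classpath and shutdown
// hooks don't interfere with the sbt process itself.
Test / fork := true

// Give the test JVM enough heap for Spark (size is an assumption).
Test / javaOptions ++= Seq("-Xmx2g")

// Run suites serially so they don't race on a shared SparkSession.
Test / parallelExecution := false
```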
We can use the loan pattern to set up and tear down a SparkSession per test. This might be needed if tests register the same views. (See `LoanedSparkSession`.)
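The shape of the loan pattern can be sketched with a generic resource (this is a hedged sketch, not the repo's `LoanedSparkSession`; with Spark, `acquire` would build a local SparkSession and `release` would call `spark.stop()`):

```scala
// Loan pattern: the helper owns the resource's lifecycle; the test body
// only borrows it. Each call gets a fresh resource, so state (such as
// registered views) never leaks between tests.
def withResource[R, A](acquire: => R)(release: R => Unit)(body: R => A): A = {
  val resource = acquire
  try body(resource)
  finally release(resource)
}

// Usage with a stand-in resource (a SparkSession would slot in the same way):
val result = withResource(new StringBuilder)(_.clear()) { sb =>
  sb.append("hello")
  sb.toString
}
```

The trade-off versus a shared session is isolation for startup time: each test pays the cost of creating its own session, but no test can observe another's registered views.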