
Writing tests for your Spark code using FunSuite

Published Jul 30, 2021

One of the questions frequently asked on Stack Overflow and other forums by data engineers who build their data pipelines with Apache Spark is how to write test cases.

In this write-up, I would like to share my knowledge on writing Apache Spark unit tests using the FunSuite trait provided by ScalaTest.

Here are the Apache Spark and ScalaTest dependencies and versions:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0",
  "org.apache.spark" %% "spark-sql" % "2.4.0",
  "org.scalatest" %% "scalatest" % "3.0.5" % "test"
)

For the sake of keeping the code simple and the concept straight, I have created a small utility that wraps textFile. The function takes a SparkSession and the path of the file as a String.

import org.apache.spark.sql.{Dataset, SparkSession}

object Utilities {

  // Reads a text file into a Dataset[String], one element per line.
  def readFile(spark: SparkSession,
               locationPath: String): Dataset[String] = {
    spark.read
      .textFile(locationPath)
  }

}

Now let's test this function with the following test cases (the sample data they read is shown after the list):

  • Creating a DataFrame from a text file.
  • Counts should match the number of records in the text file.
  • Data should match the sample records in the text file.
  • Reading a file of a different format using readFile should throw an exception.
  • Reading an invalid file location using readFile should throw an exception.
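
These tests read a small fixture file at src/test/resources/people.txt. Its contents are not shown in the post, but based on the assertions below (three records, the first named Michael) it is presumably the classic Spark sample file:

Michael, 29
Andy, 30
Justin, 19

Here is the complete test suite: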
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FunSuite}

class UtilitiesTestSpec extends FunSuite with BeforeAndAfterEach {

  private val master = "local"
  private val appName = "ReadFileTest"

  var spark: SparkSession = _

  override def beforeEach(): Unit = {
    spark = SparkSession.builder().appName(appName).master(master).getOrCreate()
  }

 test("creating data frame from text file") {
   val sparkSession = spark
   import sparkSession.implicits._
   val peopleDF = ReadAndWrite.readFile(sparkSession,"src/test/resources/people.txt").map(_.split(",")).map(attributes => Person(attributes(0), attributes(1).trim.toInt)).toDF()
   peopleDF.printSchema()
 }

 test("counts should match with number of records in a text file") {
   val sparkSession = spark
   import sparkSession.implicits._
   val peopleDF = ReadAndWrite.readFile(sparkSession,"src/test/resources/people.txt").map(_.split(",")).map(attributes => Person(attributes(0), attributes(1).trim.toInt)).toDF()
   
   peopleDF.printSchema()
   assert(peopleDF.count() == 3)
 }

 test("data should match with sample records in a text file") {
    val sparkSession = spark
    import sparkSession.implicits._
    val peopleDF = ReadAndWrite.readFile(sparkSession,"src/test/resources/people.txt").map(_.split(",")).map(attributes => Person(attributes(0), attributes(1).trim.toInt)).toDF()
    
     peopleDF.printSchema()
     assert(peopleDF.take(1)(0)(0).equals("Michael"))
 }

 test("Reading files of different format using readTextfileToDataSet should throw an exception") {
     
     intercept[org.apache.spark.sql.AnalysisException] {
     val sparkSession = spark
     import org.apache.spark.sql.functions.col
      
     val df = ReadAndWrite.readFile(sparkSession,"src/test/resources/people.parquet")
     df.select(col("name"))

      }
 }

test("Reading an invalid file location using readTextfileToDataSet should throw an exception") {
        
      intercept[Exception] {
      val sparkSession = spark
      import org.apache.spark.sql.functions.col
      val df = ReadAndWrite.readFile(sparkSession,"src/test/resources/invalid.txt")
      
      df.show()

      }
 }

  override def afterEach(): Unit = {
    spark.stop()
  }
}

case class Person(name: String, age: Int)
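
The format-mismatch test above reads src/test/resources/people.parquet, which the repo is expected to provide. If you need to generate that fixture yourself, a one-off sketch like the following would do; CreateParquetFixture is a hypothetical helper of mine (reusing the Person case class above), not part of the original post:

import org.apache.spark.sql.{SaveMode, SparkSession}

// One-off fixture generator (hypothetical helper, not from the original repo).
object CreateParquetFixture extends App {
  val spark = SparkSession.builder().appName("fixture").master("local").getOrCreate()
  import spark.implicits._

  // The same three sample people as in people.txt, written as parquet.
  Seq(Person("Michael", 29), Person("Andy", 30), Person("Justin", 19))
    .toDF()
    .write
    .mode(SaveMode.Overwrite)
    .parquet("src/test/resources/people.parquet")

  spark.stop()
}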

We use local as the Spark master, so the tests run in a single in-process JVM. The beforeEach() and afterEach() hooks create a SparkSession before each test case and stop it once the test finishes.
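
Creating a fresh session per test keeps the cases isolated but adds startup cost. A common alternative (my sketch, not the approach used in this post) is to share one session across the whole suite with ScalaTest's BeforeAndAfterAll:

import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class SharedSessionSpec extends FunSuite with BeforeAndAfterAll {

  var spark: SparkSession = _

  // Build the session once for the whole suite instead of once per test.
  override def beforeAll(): Unit = {
    spark = SparkSession.builder().appName("ReadFileTest").master("local").getOrCreate()
  }

  // Stop it after all tests in the suite have run.
  override def afterAll(): Unit = {
    spark.stop()
  }
}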

The intercept[Exception] block asserts that the enclosed code throws the expected exception when it is given invalid arguments:

intercept[Exception]{ }
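
Note that intercept also returns the caught exception, so you can assert on its details too. For example (a sketch using the readFile utility above, and assuming Spark reports the missing path with its usual "Path does not exist" message):

val thrown = intercept[org.apache.spark.sql.AnalysisException] {
  Utilities.readFile(spark, "src/test/resources/invalid.txt").show()
}
assert(thrown.getMessage.contains("Path does not exist"))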

You can find the entire code in the GitHub repo.
