Mahangu Weerasinghe

Sri Lankan, Automattician, WordPress user since 0.70

Books for Getting Started with Scala and Spark

This page outlines some of the books I found useful as I skilled up to begin writing Spark Transformations. Though I had previously taken the Udacity Data Engineering Nanodegree and was therefore somewhat familiar with PySpark, the jump to Scala + Spark felt quite big at first, likely also because I do not have a formal background in CS or data science and had never seriously used a compiled language before.

Along with Datacamp’s Introduction to Scala course, the following books have helped me learn enough basic Scala to use Spark. With the encouragement of my colleagues Igor and Leo, I am documenting them here for future reference:

Learning Spark, 2nd Edition

Learning Spark: Lightning-Fast Data Analytics, by Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee (ISBN 9781492050049)
  • The whole of Chapter 1, Introduction to Apache Spark: A Unified Analytics Engine, was really useful to read. While I had already covered some of it on the Udacity course, the chapter was a really good refresher on the origins and history of Spark and its underlying philosophy.
  • It also helped me better understand some of the differences between Spark 1.x and 2.x and thereby reconcile some of the different Stack Overflow answers I had seen (and tried!) while googling.
  • Since I had a little experience with PySpark, this book was also a great bridge to Scala + Spark for me, particularly because it provides side-by-side examples in both Python and Scala, as it does here when showing how to register a UDF in each:
// In Scala
// Create cubed function
val cubed = (s: Long) => {
  s * s * s
}

// Register UDF
spark.udf.register("cubed", cubed)

// Create temporary view
spark.range(1, 9).createOrReplaceTempView("udf_test")
# In Python
from pyspark.sql.types import LongType

# Create cubed function
def cubed(s):
  return s * s * s

# Register UDF
spark.udf.register("cubed", cubed, LongType())

# Generate temporary view
spark.range(1, 9).createOrReplaceTempView("udf_test")
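Once registered, a UDF can be called from Spark SQL like a built-in function, regardless of which language registered it. Here is a self-contained sketch of the Scala version end to end (the local master and explicit session builder are my additions for experimenting outside a cluster, not part of the book snippet):

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation; in the book, `spark` already exists
val spark = SparkSession.builder
  .appName("udf-demo")
  .master("local[*]")
  .getOrCreate()

val cubed = (s: Long) => s * s * s
spark.udf.register("cubed", cubed)
spark.range(1, 9).createOrReplaceTempView("udf_test")

// The registered UDF is now usable from SQL like a built-in function
spark.sql("SELECT id, cubed(id) AS id_cubed FROM udf_test").show()
```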
  • This made it very easy for me to translate the PySpark knowledge I had into Scala/Spark.
  • Further, despite being a very practical book, it also has really solid (but brief) descriptions of the inner workings of Spark, and these helped fill in gaps the Udacity DEND had left for me. Here is an example from a section titled Transformations, Actions, and Lazy Evaluation:

Spark operations on distributed data can be classified into two types: transformations and actions. Transformations, as the name suggests, transform a Spark DataFrame into a new DataFrame without altering the original data, giving it the property of immutability.


All transformations are evaluated lazily. That is, their results are not computed immediately, but they are recorded or remembered as a lineage. A recorded lineage allows Spark, at a later time in its execution plan, to rearrange certain transformations, coalesce them, or optimize transformations into stages for more efficient execution. Lazy evaluation is Spark’s strategy for delaying execution until an action is invoked or data is “touched” (read from or written to disk).

(The chapter illustrates this with Figure 2-6, Lazy transformations and eager actions, and summarizes it in Table 2-1, Transformations and actions as Spark operations.)
  • Sections like this really developed my basic knowledge of how Spark works under the hood, and I really appreciated them.
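To make that lazy/eager split concrete (my own sketch, not from the book): a filter transformation only records lineage, and nothing actually executes until an action like count is invoked.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder
  .appName("lazy-demo")
  .master("local[*]")
  .getOrCreate()

// Transformation: recorded as lineage, nothing is computed yet
val evens = spark.range(1, 1001).filter(col("id") % 2 === 0)

// Action: triggers execution of the whole recorded plan
val n = evens.count() // 500
```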

Scala Cookbook, 2nd Edition

Scala Cookbook: Recipes for Object-Oriented and Functional Programming, 2nd Edition, by Alvin Alexander
  • Like most O’Reilly cookbooks, this is separated into recipes, i.e. short examples of how to do X or Y, with code snippets provided.
  • This was a great companion to Datacamp’s Introduction to Scala course as it allowed me to jump back and forth between recipes/chapters based on where I was in the course at the time.
  • Also, given that I was primarily learning Scala for use with Spark and not just looking for a general introduction to the language, this book was perfect for me in that it did not require me to read it chronologically to make sense of it.
  • In particular, recipes like 8.4. Using Traits as Simple Mixins helped me understand traits and using them as mixins for classes.
trait Tail {
	def wagTail { println("tail is wagging") }
	def stopTail { println("tail is stopped") }
}

abstract class Pet (var name: String) {
	def speak // abstract
	def ownerIsHome { println("excited") }
	def jumpForJoy { println("jumping for joy") }
}

class Dog (name: String) extends Pet (name) with Tail {
	def speak { println("woof") }
	override def ownerIsHome {
		wagTail
		speak
	}
}
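To see the mixin in action, a Dog picks up the Tail behaviour without Pet knowing anything about tails. This is a small sketch of my own, with the recipe's definitions trimmed down so it runs standalone:

```scala
// Trimmed-down versions of the recipe's trait and classes
trait Tail {
  def wagTail(): Unit = println("tail is wagging")
}

abstract class Pet(var name: String) {
  def speak(): Unit // abstract
}

class Dog(name: String) extends Pet(name) with Tail {
  def speak(): Unit = println("woof")
}

val zeus = new Dog("Zeus")
zeus.wagTail() // mixed in from Tail: prints "tail is wagging"
zeus.speak()   // defined in Dog: prints "woof"
println( // constructor arg passed up to Pet: prints "Zeus"
```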

Functional Programming, Simplified 🌟

Introduction · Functional Programming, Simplified
  • This is likely the best programming book I have read so far. As someone who had only ever really used PHP and Python, I felt a lot like the author as I tried to grok FP in Scala:

As I tried to learn about FP in Scala, I found that there weren’t any FP books or blogs that I liked — certainly nothing that catered to my “I’ve never heard of FP until recently” background. Everything I read was either (a) dry and theoretical, or (b) quickly jumped into topics I couldn’t understand. It seemed like people enjoyed writing the words “monad” and “functor” and then watching me break out in a cold sweat.

  • I had tried several other resources before getting this book and this is the first I found that really, truly builds up to teaching FP from absolutely zero.
  • In it, the author starts by detailing Imperative vs Functional programming and then goes into immutability, pure functions, statements vs expressions, OOP classes vs FP data structures and so on.
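As a tiny illustration of that imperative vs functional contrast (my own example, not the book's): even summing a list can be written as a sequence of statements mutating a var, or as a single expression over immutable data.

```scala
// Imperative: a mutable accumulator is updated statement by statement
def sumImperative(xs: List[Int]): Int = {
  var total = 0
  for (x <- xs) total += x
  total
}

// Functional: one expression over immutable data; the same input
// always produces the same output, and nothing outside is touched
def sumFunctional(xs: List[Int]): Int = xs.foldLeft(0)(_ + _)

sumImperative(List(1, 2, 3)) // 6
sumFunctional(List(1, 2, 3)) // 6
```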
  • I am still working through this book as there is a lot in there, but it did quickly give me the know-how to attempt a refactor of one of our team’s data quality check functions:
 def assertUnique(df: DataFrame, column: String): Unit = {
    assert(df.select(column).distinct().count() == df.count(),
      s"Column `$column` contains duplicate values.")
  }

— so that it would be pure, by implementing Functional Error Handling:

  def assertUnique(df: DataFrame, column: String): Try[Boolean] = Try {
    df.select(column).distinct().count() == df.count()
  }
  • As it turns out, this kind of change is not necessary for our checks at the moment, because a) we don’t need any extra logic here to wrap the failure, and b) we just need to break the transformation execution flow so we are alerted to the breakage and can look into it.
  • Still, thanks to this book, I am at least ready to attempt to be more functional in my approach to solving problems in Scala going forward, if the need arises.
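For completeness, here is how a caller could handle a Try-wrapped check, pattern matching on Success and Failure instead of letting an assertion blow up the flow. This is my own sketch, using a plain Seq in place of a DataFrame so it runs without Spark:

```scala
import scala.util.{Failure, Success, Try}

// Stand-in for the DataFrame check: Try captures any thrown
// exception as a Failure value instead of crashing the caller
def assertUnique(xs: Seq[Int]): Try[Boolean] = Try {
  require(xs.distinct.size == xs.size, "contains duplicate values")
  true
}

assertUnique(Seq(1, 2, 3)) match {
  case Success(_) => println("check passed")
  case Failure(e) => println(s"check failed: ${e.getMessage}")
}
```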

Other Books I Looked At

  • Learning Scala is an extremely detailed guide to Scala and the only reason I passed on it for now was that it was a little too detailed for the quick onboarding I was looking for. I will very likely come back to it though, and hope to buy a hard copy of it at some point when I come over to North America / Europe for a GM or meetup!
  • Similarly, Spark: The Definitive Guide seems to be a one-stop shop for everything Spark-related. As a beginner to both Scala and Spark, I did not find it as accessible as Learning Spark, 2nd Edition, but I will likely revisit it in the future as well.

Other Suggested Books

When I made this post internally at Automattic, colleagues also suggested Scala for the Impatient and High Performance Spark, both of which I hope to check out in the future.

About Me

I’m Mahangu Weerasinghe, a Data Engineer at Automattic, the company behind, Jetpack, WooCommerce and Tumblr. Our team is responsible for maintaining our primary Hadoop cluster and providing support to datums across the company.
