Lambda serialization
Lambda serialization is one of the more confusion issues in distributed data processing in Scala. No matter which framework you choose, whether it’s Scalding, Spark, Flink or Scio, sooner or later you’ll be hit by the dreaded NotSerializableException
. In this post we’ll take a closer look at the common causes and solutions to this problem.
Setup
To demonstrate the problem, first we need a minimal setup that minics the behavior of a distributed data processing system. We start with a utility method that roundtrips an object throguh Java serialization. Anonymous functions, or lambdas, in such systems are serialized so that they can be distributed to workers for parallel processing.
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
object SerDeUtil {
def serDe[T](obj: T): T = {
val buffer = new ByteArrayOutputStream()
val out = new ObjectOutputStream(buffer)
out.writeObject(obj)
out.close()
val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
in.readObject().asInstanceOf[T]
}
}
Next we create a bare minimal Collection[T]
type that mimics an abstract distributed data set, akin to TypedPipe
, RDD
, or SCollection
in Scalding, Spark or Scio respectively. Our implementation is backed by a local in-memory Seq[T]
but does pass the function f
through serialization like …