How many copies
One topic that came up a lot when optimizing Scala data applications is the performance of standard collections, or the hidden cost of temporary copies. The collections API is easy to learn and maps well to many Python concepts where a lot of data engineers are familiar with. But the performance penalty can be pretty big when it’s repeated over millions of records in a JVM with limited heap.
Mapping values
Let’s take a look at one most naive example first, mapping the values of a Map
.
val m = Map("A" -> 1, "B" -> 2, "C" -> 3)
m.toList.map(t => (t._1, t._2 + 1)).toMap
Looks simple enough but obviously not optimal. Two temporary List[(String, Int)]
were created, one from toList
and one from map
. map
also creates 3 copies of (String, Int)
.
There are a few commonly seen variations. These don’t create temporary collections but still key-value tuples.
for ((k, v) <- m) yield k -> (v + 1)
m.map { case (k, v) => k -> (v + 1) }
If one reads the ScalaDoc closely, there’s a mapValues
method already and it probably is the shortest and most performant.
m.mapValues(_ + 1)
Java conversion
Similar problem exists …
more ...