I recently had some fun building parquet-avro-extra, an add-on module for parquet-avro using Scala macros. I did it mainly to learn Scala macros but also to make it easier to use Parquet with Avro in a data pipeline.
Parquet and Avro
Parquet is a columnar storage format designed for HDFS. It offers some nice improvements over row-major formats, including better compression and less I/O via column projection and predicate pushdown. Avro is a data serialization system that enables type-safe access to structured data with complex schemas. The parquet-avro module makes it possible to store data in Parquet format on disk and process it as Avro objects inside a JVM data pipeline like Scalding or Spark.
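Outside of any particular framework, parquet-avro can also read and write Avro records directly. Here is a minimal sketch, assuming Signal is an Avro-generated class (the same one used in the projection examples below) and using the pre-Apache parquet.avro package with its old constructor-style API:

import org.apache.hadoop.fs.Path
import parquet.avro.{AvroParquetReader, AvroParquetWriter}

val path = new Path("signals.parquet")
val signals: Seq[Signal] = Seq.empty  // stand-in for real records

// write Avro records into a columnar Parquet file
val writer = new AvroParquetWriter[Signal](path, Signal.getClassSchema)
signals.foreach(writer.write)
writer.close()

// read them back as Avro objects
val reader = new AvroParquetReader[Signal](path)
Iterator.continually(reader.read()).takeWhile(_ != null).foreach(println)
reader.close()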
Projection
Parquet allows reading only a subset of columns via projection. Here's a Scalding example from Tapad.
Projection[Signal]("field1", "field2.field2a")
Note that field specifications are strings, even though the API has access to the Avro type Signal, which has strongly typed getter methods.
This is slightly counter-intuitive, since most Scala developers are used to transformations like pipe.map(_.getField). It can, however, be easily solved with a macro, since the syntax tree of the getter lambda is accessible at compile time. A modified version has the signature def apply[T](getters: (T => Any)*): Schema
and can be used like this:
Projection[Signal](_.getField1, _.getField2.getField2a)
The macro version looks more natural, plus you get auto-complete support from the IDE and avoid typos.
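Either way, the returned Schema is what parquet-avro's read support expects. A minimal sketch of wiring it in, assuming a plain Hadoop Configuration and the same pre-Apache parquet.avro package:

import org.apache.hadoop.conf.Configuration
import parquet.avro.AvroReadSupport

val conf = new Configuration()
// only field1 and field2a (within field2) are read from disk
AvroReadSupport.setRequestedProjection(conf, Projection[Signal](_.getField1, _.getField2.getField2a))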
Predicate
Predicate is an even better use case for macros. The Parquet predicate API supports a fixed set of column types and operators, and the user must use the correct factory methods to construct an expression tree. A simple lambda like (i: Item) => i.getPrice < 100 && i.getReviews >= 10
becomes this:
import parquet.filter2.predicate.FilterApi;
import parquet.filter2.predicate.FilterPredicate;
FilterPredicate p = FilterApi.and(
    FilterApi.lt(FilterApi.floatColumn("price"), 100f),
    FilterApi.gteq(FilterApi.intColumn("reviews"), 10));
Obviously this is very cumbersome, and even worse than the projection case. But with macros it feels almost like writing a regular Scala predicate lambda:
Predicate[Item](i => i.getPrice < 100 && i.getReviews >= 10)
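The resulting FilterPredicate can then be pushed down to the reader. A minimal sketch, assuming parquet-hadoop's ParquetInputFormat and that the macro returns a plain FilterPredicate:

import org.apache.hadoop.conf.Configuration
import parquet.hadoop.ParquetInputFormat

val conf = new Configuration()
// row groups and records failing the predicate are skipped by the reader
ParquetInputFormat.setFilterPredicate(conf,
  Predicate[Item](i => i.getPrice < 100 && i.getReviews >= 10))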
Lessons learned
I have some experience with C++ templates and Clojure macros, but Scala macros were still pretty challenging to get started with. A few notes:
- Since one can write pretty complex code inside a macro, I had to remind myself that macros don't compute data at runtime; they merely transform syntax trees at compile time.
- Pattern matching and deconstruction are handy for extracting tree elements (see the sketch after this list).
- Quasiquotes can be chained and returned recursively for complex transformations; just be aware of the different tree types involved.
- It’s great exercise for your recursion skills.
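To make the last two points concrete, here is a heavily simplified sketch of the idea (not the actual parquet-avro-extra code): reducing a getter chain like _.getField2.getField2a to the string "field2.field2a" by pattern matching on the lambda's syntax tree.

import scala.language.experimental.macros
import scala.reflect.macros.blackbox

object FieldPath {
  def apply[T](getter: T => Any): String = macro applyImpl[T]

  def applyImpl[T: c.WeakTypeTag](c: blackbox.Context)(getter: c.Tree): c.Tree = {
    import c.universe._
    // turn getFoo into foo
    def decap(s: String): String = s.head.toLower + s.tail
    // recursively peel Select/Apply nodes off the getter chain
    def path(tree: Tree): List[String] = tree match {
      case Apply(fun, Nil)    => path(fun)
      case Select(qual, name) => path(qual) :+ decap(name.toString.stripPrefix("get"))
      case Ident(_)           => Nil
      case _ => c.abort(c.enclosingPosition, "unsupported getter expression")
    }
    getter match {
      case Function(_, body) => q"${path(body).mkString(".")}"
      case _ => c.abort(c.enclosingPosition, "expected a getter lambda")
    }
  }
}

Even this stripped-down version captures the flavor of the real thing: match on the tree, recurse, and build the result back up with quasiquotes.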