NEScala 2015 talk
I gave a lightning talk on Macros in Data Pipelines at the Northeast Scala Symposium last week, and here are the slides.
I recently had some fun building parquet-avro-extra, an add-on module for parquet-avro using Scala macros. I did it mainly to learn Scala macros but also to make it easier to use Parquet with Avro in a data pipeline.
Parquet is a columnar storage system designed for HDFS. It offers some nice improvements over row-major systems, including better compression and less I/O through column projection and predicate pushdown. Avro is a data serialization system that enables type-safe access to structured data with complex schemas. The parquet-avro module makes it possible to store data in Parquet format on disk and process it as Avro objects inside a JVM data pipeline like Scalding or Spark.
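As a quick, hedged illustration (not from the original post: the User schema and file name are invented, and the builder API assumes a reasonably recent parquet-avro), here is roughly how Avro records round-trip through Parquet:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetReader, AvroParquetWriter}

object ParquetAvroRoundTrip {
  // A made-up Avro schema for illustration.
  val schema: Schema = new Schema.Parser().parse(
    """{"type": "record", "name": "User", "fields": [
      |  {"name": "name", "type": "string"},
      |  {"name": "age",  "type": "int"}
      |]}""".stripMargin)

  def main(args: Array[String]): Unit = {
    val path = new Path("users.parquet")

    // Write Avro records to a Parquet file.
    val writer = AvroParquetWriter.builder[GenericRecord](path).withSchema(schema).build()
    val user = new GenericData.Record(schema)
    user.put("name", "jane")
    user.put("age", 30)
    writer.write(user)
    writer.close()

    // Read them back as Avro objects.
    val reader = AvroParquetReader.builder[GenericRecord](path).build()
    Iterator.continually(reader.read()).takeWhile(_ != null).foreach(println)
    reader.close()
  }
}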
Parquet allows reading only a subset of columns via projection. Here's a Scalding example from Tapad:
Projection[Signal]("field1", "field2.field2a")
Note that field specifications are strings, even though the API has access to the Avro type Signal, which has strongly typed getter methods.
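The same string-typed flavor shows up in predicate pushdown. As a hedged sketch against Parquet's FilterApi (the column name and threshold are invented for the example):

import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}

// Keep only rows where field1 > 42; the column is, again, named by a string.
val predicate: FilterPredicate =
  FilterApi.gt(FilterApi.intColumn("field1"), java.lang.Integer.valueOf(42))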
This is slightly counter-intuitive, since most Scala developers are used to transformations like pipe.map(_.getField). It can, however, be easily solved with a macro, since the syntax tree of the projection function is accessible. A modified version has a signature of …