It’s been another 6 months since my talk about Scio at Scala by the Bay. We’ve seen huge adoption and improvements since then. The number of production Scio pipelines has grown from ~70 to 400+ within Spotify. A lot of other companies are using and contributing to it as well. In the most recent edition of the Spotify data university, an internal week-long big data training camp for non-data engineers, we revamped the curriculum to cover Scio, BigQuery and other Google Cloud Big Data products instead of Hadoop, Scalding and Hive.
Spotify data university round 3 & 1st time covering Scio, @ApacheBeam & @GCPBigData 👋 Hadoop, HDFS, M/R, YARN 🍾 batch + streaming pic.twitter.com/1gWIEbN0mW
— Neville Li (@sinisa_lyh) March 28, 2017
And here’s a list of some notable improvements in Scio.
- Master branch is now based on Apache Beam
- Graduate type-safe BigQuery API from experimental to stable
- Sparkey side input support
- TensorFlow TFRecord file IO
- Cloud Pub/Sub attributes support
- Named transformations for streaming update
- Safeguard against malformed tests and better error messages
- Flexible custom IO wiring
- KryoRegistrar for custom Kryo serialization
- Table description for type-safe BigQuery
- Lots of performance improvements and bug fixes
I talked about Scio at Philly ETE last week and here are the slides.