A Little Teaser
PGroupedTable<K,V>::combineValues(CombineFn<K,V> combineFn,
CombineFn<K,V> reduceFn)
Crunch: CombineFns are used to represent the associative operations...
KeyedList[K, T]::reduce(fn: (T, T) => T)
Scalding: reduce with fn which must be associative and commutative
PairRDDFunctions[K, V]::reduceByKey(fn: (V, V) => V)
Spark: Merge the values for each key using an associative reduce function
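All three signatures express the same contract: merge values per key with an associative function, so partial results can be combined in any order. A minimal Spark sketch of that pattern (hypothetical data; assumes a SparkContext named sc):
val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
.reduceByKey(_ + _) // associative merge per key
// counts.collect() => Array((a,2), (b,1))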
Before
Now
rec-sys-scalding.git
Word count in Python
lyrics = ["We all live in Amerika", "Amerika ist wunderbar"]
wc = defaultdict(int)
for l in lyrics:
for w in l.split():
wc[w] += 1
Screen too small for the Java version
Map and reduce are key concepts in FP
val lyrics = List("We all live in Amerika", "Amerika ist wunderbar")
lyrics.flatMap(_.split(" ")) // map
.groupBy(identity) // shuffle
.map { case (k, g) => (k, g.size) } // reduce
(def lyrics ["We all live in Amerika" "Amerika ist wunderbar"])
(->> lyrics (mapcat #(clojure.string/split % #"\s"))
(group-by identity)
(map (fn [[k g]] [k (count g)])))
import Control.Arrow
import Data.List
let lyrics = ["We all live in Amerika", "Amerika ist wunderbar"]
map words >>> concat
>>> sort >>> group
>>> map (\x -> (head x, length x)) $ lyrics
vectors.map { case (id, vec) => (id, vec * vec.T) } // YtY
.map(_._2).reduce(_ + _)
ratings.keyBy(fixedKey).join(outerProducts) // YtCuIY
.map { case (_, (r, op)) => (solveKey(r), op * (r.rating * alpha)) }
.reduceByKey(_ + _)
ratings.keyBy(fixedKey).join(vectors) // YtCupu
.map { case (_, (r, vec)) =>
  val Cui = r.rating * alpha + 1        // confidence c_ui = 1 + alpha * r_ui
  val pui = if (Cui > 0.0) 1.0 else 0.0 // binary preference p_ui
  (solveKey(r), vec * (Cui * pui))
}.reduceByKey(_ + _)
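These are the three terms of the implicit-feedback ALS normal equation (Hu, Koren & Volinsky). A sketch of the per-user solve they feed into, assuming breeze linear algebra and hypothetical names yty, ytCuIY, ytCupu, lambda, rank:
import breeze.linalg._
// x_u = (YtY + Yt(Cu - I)Y + lambda * I)^-1 * YtCupu
val xu = inv(yty + ytCuIY + DenseMatrix.eye[Double](rank) * lambda) * ytCupu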
Performance vs. agility
http://nicholassterling.wordpress.com/2012/11/16/scala-performance/
Type inference
class ComplexDecorationService {
  public List<ListenableFuture<Map<String, Metadata>>>
      lookupMetadata(List<String> keys) { /* ... */ }
}
val data = service.lookupMetadata(keys)
type DF = List[ListenableFuture[Map[String, Metadata]]]
def process(data: DF) = { /* ... */ }
Higher order functions
import com.google.common.base.Function;
import com.google.common.collect.Lists;

List<Integer> list = Lists.newArrayList(1, 2, 3);
Lists.transform(list, new Function<Integer, Integer>() {
  @Override
  public Integer apply(Integer input) {
    return input + 1;
  }
});
val list = List(1, 2, 3)
list.map(_ + 1) // List(2, 3, 4)
Collections API
val l = List(1, 2, 3, 4, 5)
l.map(_ + 1) // List(2, 3, 4, 5, 6)
l.filter(_ > 3) // List(4, 5)
l.zip(List("a", "b", "c")).toMap // Map(1 -> a, 2 -> b, 3 -> c)
l.partition(_ % 2 == 0) // (List(2, 4),List(1, 3, 5))
List(l, l.map(_ * 2)).flatten // List(1, 2, 3, 4, 5, 2, 4, 6, 8, 10)
l.reduce(_ + _) // 15
l.fold(100)(_ + _) // 115
"We all live in Amerika".split(" ").groupBy(_.size)
// Map(2 -> Array(We, in), 4 -> Array(live),
// 7 -> Array(Amerika), 3 -> Array(all))
Scalding field-based word count
TextLine(path)
.flatMap('line -> 'word) { line: String => line.split("""\W+""") }
.groupBy('word) { _.size }
Scalding type-safe word count
TextLine(path).read.toTypedPipe[String](Fields.ALL)
.flatMap(_.split(""\W+""))
.groupBy(identity).size
Scrunch word count
read(from.textFile(file))
.flatMap(_.split("""\W+""")
.count
Summingbird word count
source
.flatMap { line: String => line.split("""\W+""").map((_, 1)) }
.sumByKey(store)
Spark word count
sc.textFile(path)
.flatMap(_.split("""\W+"""))
.map(word => (word, 1))
.reduceByKey(_ + _)
Stratosphere word count
TextFile(textInput)
.flatMap(_.split("""\W+"""))
.map(word => (word, 1))
.groupBy(_._1)
.reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
Many patterns also common in Java
It's complex
It's slow
I don't want to learn a new language