Stackdiver as a Service

Scala Workshop

While there are many Scala tutorials and books available, very few of them focus on big data. I ran a couple of workshops at Spotify focusing on this area, and here are the slides.

more ...

Using CQL with legacy column families

We use Cassandra extensively at work, and until recently we’ve mostly been using Cassandra 1.2 with Astyanax and the Thrift protocol in Java applications. Very recently we started adopting Cassandra 2.0 with CQL, the DataStax Java Driver and the binary protocol.

While one should move to a CQL schema to take full advantage of the new protocol and storage engine, it’s still possible to use CQL and the new driver on existing clusters. Say we have a legacy column family with UTF8Type for row/column keys and BytesType for values. It would look like this in cassandra-cli:

create column family data
  with column_type = 'Standard'
  and comparator = 'UTF8Type'
  and default_validation_class = 'BytesType'
  and key_validation_class = 'UTF8Type';

And this in cqlsh after setting start_native_transport: true in cassandra.yaml:

CREATE TABLE data (
  key text,
  column1 text,
  value blob,
  PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE;

In this table, key and column1 correspond to the row and column keys in the legacy column family, and value corresponds to the column value.
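To make the mapping concrete, here is a minimal Scala sketch that models the legacy column family as a nested map and flattens it into one (key, column1, value) row per column, which is roughly how the compact-storage table exposes it. It does not talk to Cassandra at all: LegacyToCql, Row, CqlRow and flatten are hypothetical names made up for illustration, and the sample data is invented.

```scala
import scala.collection.immutable.SortedMap

object LegacyToCql {
  // Hypothetical in-memory stand-in for the legacy column family:
  // row key -> (column name -> value), with columns kept sorted by name.
  type Row = SortedMap[String, Array[Byte]]

  val legacy: Map[String, Row] = Map(
    "rowkey" -> SortedMap(
      "col1" -> "a".getBytes("UTF-8"),
      "col2" -> "b".getBytes("UTF-8")))

  // One CQL row per (row key, column name) pair.
  final case class CqlRow(key: String, column1: String, value: Array[Byte])

  // Flatten the nested structure into CQL-style rows.
  def flatten(cf: Map[String, Row]): Seq[CqlRow] =
    cf.toSeq.flatMap { case (key, columns) =>
      columns.toSeq.map { case (name, value) => CqlRow(key, name, value) }
    }

  def main(args: Array[String]): Unit =
    flatten(legacy).foreach { r =>
      println(s"${r.key} | ${r.column1} | ${new String(r.value, "UTF-8")}")
    }
}
```

Seen this way, a wide legacy row with N columns shows up as N CQL rows that share the same key, which is why column1 is part of the primary key.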

Queries to look up a column value, an entire row, and selected columns in a row would look like this:

SELECT value FROM mykeyspace.data WHERE key = 'rowkey' AND column1 = 'colkey';
SELECT column1, value FROM mykeyspace …

more ...

dotfiles update

I’ve been using my current dotfiles setup for a while and felt it was time to freshen it up. I focused on updating the look and feel of Vim and tmux this round.

First I switched to the molokai color theme for Vim, TextMate (monokai) and IntelliJ IDEA (using this). I guess I grew tired of the old trusted solarized, plus on my new MacBook Pro 13” at the highest resolution, it just doesn’t feel sharp enough.

The vim-powerline plugin I was using is being deprecated and replaced by powerline, which supports vim, tmux, zsh, and many others. However, it requires Python and I had trouble using it with some really old Vim versions at work. So instead I switched to a pure VimL plugin, vim-airline. Not surprisingly there’s a companion plugin for tmux as well, tmuxline. Neither has any extra dependencies, which is a big plus for me since I use the same dotfiles on Mac, my Ubuntu Trusty desktop at work, and many Debian Squeeze servers.

I also updated a couple of other Vim plugins along the way, replacing vim-snipmate with ultisnips and vim-bad-whitespace with vim-better-whitespace (no pun intended), and adding vim-gutter. The biggest discovery is vim-easymotion though, perfect …

more ...

On being a polyglot

I’m kind of known as a polyglot among coworkers. We would often argue that instead of hiring great Java/Python/C++ developers, we should rather strive to hire great engineers with strong CS fundamentals who can pick up any language easily. I came from a scientific computing background, doing mostly C/C++/Python many years ago. Over the course of the last three years at my current job I have coded in seven languages professionally, some out of interest and some out of necessity. I enjoyed learning all these different things and want to share my experience here: what I learned from each one of them and how it helped me become a better engineer.

C

The first language I used seriously, apart from LOGO & BASIC when I was a kid of course. It’s probably the closest thing one can get to the operating system and bare metal without dropping down to assembly (which you still can do from C). It’s a simple language whose syntax served as the basis of many successors like C++ & Java. It doesn’t offer any fancy features like OOP or namespaces, but rather depends on the developer’s skill for organizing a large code base (think …

more ...

How many copies

One topic that comes up a lot when optimizing Scala data applications is the performance of standard collections, or the hidden cost of temporary copies. The collections API is easy to learn and maps well to many Python concepts that a lot of data engineers are familiar with. But the performance penalty can be pretty big when it’s repeated over millions of records in a JVM with limited heap.

Mapping values

Let’s take a look at the most naive example first, mapping the values of a Map.

val m = Map("A" -> 1, "B" -> 2, "C" -> 3)
m.toList.map(t => (t._1, t._2 + 1)).toMap

It looks simple enough but is obviously not optimal. Two temporary List[(String, Int)] are created, one from toList and one from map. map also allocates 3 new (String, Int) tuples, one per entry.

There are a few commonly seen variations. These don’t create temporary collections but still allocate new key-value tuples.

for ((k, v) <- m) yield k -> (v + 1)
m.map { case (k, v) => k -> (v + 1) }

If one reads the ScalaDoc closely, there’s already a mapValues method, and it is probably the shortest and most performant.

m.mapValues(_ + 1)
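One caveat worth checking against your Scala version: as far as I know, in Scala 2.x mapValues does not copy anything up front but returns a lazy view over the original map, so the function is re-applied on every lookup (and in 2.13 the method is deprecated in favor of m.view.mapValues). Here is a small sketch that counts the calls to show the difference from a strict map. MapValuesDemo, calls and inc are hypothetical names used only for this illustration.

```scala
object MapValuesDemo {
  val m = Map("A" -> 1, "B" -> 2, "C" -> 3)

  // Counts how many times the mapping function runs.
  var calls = 0
  def inc(v: Int): Int = { calls += 1; v + 1 }

  def main(args: Array[String]): Unit = {
    // mapValues only wraps m; nothing is computed at this point.
    val view = m.mapValues(inc)
    println(s"after mapValues: $calls calls")

    // Each lookup goes through inc again.
    view("A")
    view("A")
    println(s"after two lookups: $calls calls")

    // map builds a new Map once, calling inc once per entry;
    // later lookups on the result do not call inc at all.
    val strict = m.map { case (k, v) => k -> inc(v) }
    strict("A")
    println(s"after strict map and a lookup: $calls calls")
  }
}
```

So mapValues is cheap when each value is read at most once, but if the result is read many times, or the function is expensive, a strict map (or forcing the view into a new Map) may end up being the better choice.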

Java conversion

Why Functional? Why Scala?

I recently did an internal talk at Spotify on why every data engineer should know something about functional programming languages and Scala. And here are the slides.

more ...

Light Table

I recently picked up Light Table for Clojure development and liked it. Form evaluation works out of the box and indentation is better than in the La Clojure plugin for IntelliJ IDEA.

I particularly like the command bar, which allows you to search for Light Table commands by name and execute them quickly. I was already used to IDEA’s key map though (the Mac OS X 10.5+ one, which feels more natural to Mac users than the default Mac OS X one), and wanted something similar. The settings files are in Clojure so it’s easy to customize. This is what I have so far for user.keymap:

{:+ {:app {"alt-space" [:show-commandbar-transient]}

     :editor {"alt-w" [:editor.watch.watch-selection]
              "alt-shift-w" [:editor.watch.unwatch]
              "ctrl-alt-i" [:smart-indent-selection]
              "ctrl-alt-c" [:toggle-console]
              "ctrl-shift-j" [:editor.sublime.joinLines]
              "pmeta-d" [:editor.sublime.duplicateLine]
              "pmeta-shift-up" [:editor.sublime.swapLineUp]
              "pmeta-shift-down" [:editor.sublime.swapLineDown]
              "pmeta-/" [:toggle-comment-selection :editor.line-down]}}}

Apart from these, I found myself using "pmeta-enter" [:eval-editor-form] and "ctrl-d" [:editor.doc.toggle] most when writing Clojure code. After all they are probably the most essential ones no matter what editor you use :)

more ...

dotfiles

My dotfiles repo is probably the most copied code among my coworkers, and today I will give a little breakdown of the code base.

zsh

I switched to zsh 3 years ago and never looked back. There’s also oh-my-zsh, a framework for managing ZSH configuration. The features I found most useful are:

Tab completion, including hostnames and arguments
History across multiple sessions
Plugins and themes

My .zshrc is mostly out of the box with some aliases and a few plugins thrown in, but I decided to create my own theme. I use colors for the hostname in the prompt, green for local and red for remote. I also tweaked git status a bit to show untracked files (red dots), unstaged (yellow) & staged (green) changes, plus the number of stashed changes, since stashing is the one thing I keep doing and forgetting about.

git

I use git both for work and personal projects, plus contributing to open source projects on GitHub. My .gitconfig includes both a global gitignore file and a templatedir, which includes hooks for ctags and Gerrit. The hooks are installed automatically for every repo.

Since I type hundreds of git commands on a daily basis, I aliased git to simply g …

more ...

First (real) post with Pelican

Finally decided to jump (back) on the blogging bandwagon. This time I decided to use a static site generator, since that seems to be the cool thing to do these days, and found this site. I want something in a language I know well, so Ruby and JavaScript are out. It should also be actively maintained, so Scala is out since monkeyman, the only entry there, seems abandoned. I eventually settled on Pelican, the top ranked Python framework.

I set up a new virtualenv with virtualenvwrapper and also discovered autoenv along the way. It was easy to get started with the pelican-quickstart script, and in a few minutes I already had a working site. Next I went shopping for themes in pelican-themes and picked pelican-bootstrap3. Turns out it doesn’t work with the Spotify icon yet, so I forked the repo and made a quick PR.

After some further tweaking with the settings I was pretty happy with the results. I went on to set up Disqus and Google Analytics for the site, and published it to my Linode with make ssh_upload.

more ...