
TODO: choose a title

TODO: tl;dr: ... (3 sentences max)

TODO: introduction (20 lines max)

Plan

TODO: Remove the detailed plan

  • Start with the end
    • show a pulse
    • explain what is simple / hard
  • The situation before
    • problems with volume (MongoDB / PHP, etc.)
    • security issues
    • problems with abilities
    • angular complexity
    • refactoring issues
    • deployment issues
  • The choices
    • why clojure?
    • why Haskell?
    • why not full Haskell?
    • why reagent?
    • why Kafka?
    • why Mesos / Marathon?
    • why Druid?
    • why still MongoDB?
  • The first weeks
    • first impressions
    • what was harder?
    • what was easier?
  • Once used to clojure
    • how does it feel?
    • was it a mistake?
    • do we have any doubts?
  • One year later (maintenance and impressions)

The Elephant Graveyard

Imagine you can get all Twitter data in realtime.

Imagine you need to count tweets. Imagine you need to filter them by keywords. Imagine you have to answer complex questions about all this data in realtime. For instance, how many tweets from women, containing the word "clojure" and expressing a positive sentiment, were submitted during the last hour? Now, imagine the same question, but over a full year of data.

How would you do it?

First you'll need to absorb tweets in realtime. The Twitter streaming API is there for that. However, you are limited to 1% of the full Twitter volume. If you don't want to be limited, you either need to deal directly with Twitter or use GNIP.

Next, you'll need to keep only the tweets of interest and discard the rest. For instance, you'll have to filter them by keyword.

Right after that, you have to add extra data to each received tweet. For instance, Twitter doesn't provide gender information, so you have to guess it somehow. The same applies to the sentiment expressed in the tweet.
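
To give a feel for it, here is a minimal Clojure sketch of those two steps, filtering by keyword and enriching each tweet; the guess-gender and guess-sentiment helpers are hypothetical stand-ins for real classifiers.

(require '[clojure.string :as str])

;; Hypothetical enrichers: real ones would call proper classifiers.
(defn guess-gender    [tweet] :unknown)
(defn guess-sentiment [tweet]
  (if (str/includes? (:text tweet) ":)") :positive :neutral))

(defn keep-tweet? [keywords tweet]
  (some #(str/includes? (str/lower-case (:text tweet)) %) keywords))

(defn process [keywords tweets]
  (->> tweets
       (filter (partial keep-tweet? keywords))
       (map #(assoc % :gender    (guess-gender %)
                      :sentiment (guess-sentiment %)))))

(process ["clojure"] [{:text "I love Clojure :)"} {:text "lunch time"}])
;; => ({:text "I love Clojure :)", :gender :unknown, :sentiment :positive})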

In the end you'll need to display all this information in realtime. By realtime we mean with very low latency.

Under a minute is generally acceptable. Our processing pipeline usually processes tweets with less than half a second of latency.

However, most of the total latency is due to the Twitter streaming API (about 2s) or the GNIP endpoints (about 15s).

The Elephant

When we got the job, we inherited prototypes built by external contractors. Back then, everything was done in PHP. Yes, in PHP! From retrieving tweets to generating and displaying aggregates. Let's not even talk about code quality.

At the beginning nothing was saved over time; the system could only display data since midnight. If you only needed to count how many tweets were received and you were the only client looking at the website, this "architecture" could handle about 200 tweets/s max. That was achieved on very large machines, which, knowing how everything was coded, wasn't so bad.

But if you needed to add enrichments to achieve complex drilldowns, for instance how many men or women, how many positive, negative or informative tweets, then the system couldn't handle more than 7 tweets per second.

MongoDB the Destroyer!


Behind the scenes, data was saved in MongoDB. "Saved" is a big word when talking about MongoDB.

More seriously, Mongo was a great DB to start with: really easy, really fast and nice to use.

Nice until you reach its hard limits. At that time it was Mongo 2.6, so there was a database-level lock.

Yes, I repeat: a database-level lock. Each time you read or write, nobody else could read or write at the same time.

And even very expensive clusters couldn't work around that hard limit.

As a result, when we asked Mongo to read and write a lot at the same time, we started to witness data loss... If you can't write them, just drop them. Furthermore, the code dealing with tweet insertion in MongoDB was really hard to work with: there was no proper error handling. In the end, data loss...

There is a lot to say about MongoDB, and a lot has already been written. But the main point stands: MongoDB couldn't be trusted nor used for intensive data manipulation.

Now, the situation might have changed. But there are better tools for the same job.

When we got the job, many customers were already on board and many new products were planned.

So what we did first is:

  • create an incremental map-reduce system for MongoDB with Node.js,
  • use HHVM to somewhat boost PHP performance,
  • create a real data API to be called from AngularJS,
  • apply lots of code optimizations.

In the end our system was able to deal with far more volume than before. It could display all the information we talked about before with a volume of about 120 tweets/min, which was about a 17x improvement. But as with optimizing for an old Nintendo, we had reached the maximum power we could extract from this old legacy system.

Let's not even speak about our frontend. The code was a mess; to deal with it, we had to convert a single file containing 10,000 lines of JS into an AngularJS application. Even so, we quickly started to hit complexity limits with our AngularJS architecture.

It was clear that each new component we created added more than linear complexity to the whole project; it felt closer to quadratic. We started to experience weird bugs that were very hard to reproduce. The lack of explicitness was also a real problem.

9 months of gestation

From there it was clear that nothing could keep working correctly and that everything had to be rewritten.

The startup had, at that time, only two people to rewrite everything from scratch. The chosen language was Clojure for multiple reasons.

Something that made this evolution possible was the time taken to explore new technologies. About half a day per week was devoted to experimenting with new technologies.

For instance, we created a Haskell tweet absorber. It was clear that it could handle thousands of tweets per second. In fact, at that time the system was able to absorb about 6000 tweets/s. That is roughly the full firehose on a single machine.

We thought it was certainly a good idea to use it instead of yet another Java client.

Next, we experimented with Clojure projects intended to create a RESTful API. The experience was great: it was really fast to develop and create new things.

Also, the many Clojure success stories we noticed throughout the internet made us confident we could use it in production.

At that time, Haskell wasn't really suitable for production. Cabal hell was really problematic; we had to use cabal freeze often. There were other problems too: it was hard to install, compile and deploy.

Now, thanks to stack, this is no longer the case[1].

Furthermore, dealing with realtime processing at that time was all about the Java ecosystem: Storm, Kafka, Zookeeper, etc.

So using a language that could use all the Java libraries seemed very important to us. With that in mind, we simply had to choose between Scala and Clojure. Looking at Scala, it was clear that it would be to Java what C++ is to C. Clojure, on the other hand, being a descendant of the LISP family, felt simple, easier to read and understand. And the Clojure community sounded great.

Furthermore, Storm was first written in Clojure. So go go go!

It was during summer that most technical choices were made.

  1. We wanted to deploy using Mesos / Marathon,
  2. use Kafka and Storm,
  3. use Druid and ElasticSearch for the tweets DB,
  4. still use MongoDB for resources (users, projects, keywords by project, etc.),
  5. use compojure-api for the API server (a small sketch follows this list),
  6. go on with Haskell to absorb tweets,
  7. and use reagent instead of Angular for the frontend.
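
To give an idea of why compojure-api felt so productive, here is a minimal sketch close in spirit to the library's own hello-world example; the exact namespaces and options may differ slightly between versions.

(ns example.api
  (:require [compojure.api.sweet :refer :all]
            [ring.util.http-response :refer [ok]]))

;; One route: query parameters are coerced and validated against
;; the declared schema, and the response shape is declared as well.
(def app
  (api
    (GET "/plus" []
      :query-params [x :- Long, y :- Long]
      :return {:result Long}
      (ok {:result (+ x y)}))))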

Each choice was weighed carefully. In the end, some of those choices were changed through practice.

For instance, we discarded Storm. The power of core.async was more than enough to efficiently exploit all the juice of our machines; Storm added unnecessary latency and complexity.
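
As an illustration, here is a rough sketch (not our actual topology) of how core.async channels can replace a Storm pipeline; the enrich step below is just a placeholder.

(require '[clojure.core.async :as async])

;; Placeholder enrichment step; the real one calls our classifiers.
(defn enrich [tweet] (assoc tweet :sentiment :positive))

(def raw      (async/chan 1024))   ;; tweets coming from the absorber
(def enriched (async/chan 1024))   ;; tweets ready to be indexed

;; Four parallel workers apply `enrich`; back-pressure is handled
;; by the buffered channels, no cluster required.
(async/pipeline 4 enriched (map enrich) raw)

(async/>!! raw {:text "hello clojure"})
(async/<!! enriched)
;; => {:text "hello clojure", :sentiment :positive}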

Today you can see a result here:


Long Live the new Flesh

Difficulties with the new mindset: as with everything new, there is a period of adaptation. Typically, the most difficult part was dealing with reversed arrays.

In JavaScript one would write:

foo["a"]="value-for-a"
foo["b"]="value-for-b"
foo["c"]="value-for-c"

foreach (i in foo) {v[foo[i]]=i;}

Or doing things like:

var foo = [[1,2,3],[4,5,6]];
var tmp = 0;
for (var i in foo) {
  for (var j in foo[i]) {
    tmp += foo[i][j] + 2;
  }
}
return tmp;

Now that I am used to reduce and filter, this feels like second nature. And the new solution is far better.

For example, the second snippet above becomes:

(def foo [[1 2 3] [4 5 6]])
;; the reducing function takes the accumulator first, then the element
(defn plus2 [acc x] (+ acc x 2))
(defn sum [l] (reduce + 0 l))
(sum (map (fn [l] (reduce plus2 0 l)) foo))
;; => 33

;; or

(->> foo
     (map #(reduce plus2 0 %))
     (sum))
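
For completeness, the first example, building the reversed map, is just as direct:

(require '[clojure.set :as set])

(def m {"a" "value-for-a", "b" "value-for-b", "c" "value-for-c"})

(set/map-invert m)
;; => {"value-for-a" "a", "value-for-b" "b", "value-for-c" "c"}

;; or, spelled out with into:
(into {} (map (fn [[k v]] [v k]) m))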

The resulting code is much better: modular, easier to read and to modify.

Still, some pain points remained:

  • Java null pointer exceptions!
  • Unreadable stacktraces

What were the immediate wins?

Deep access

For the brave and true there is the Haskell lens library. But for Clojurists, the built-in access functions are good enough.

Let's compare JavaScript with Clojure:

var foo = {"db": [{"name": "John Doe", "age": 30},
                  {"name": "Rich",     "age": 40},
                  {"age": 20}]
           // other stuff, ...
          };

var val = (function() {
             var x = foo["db"];
             if (x) {
               var y = x[1];
               if (y) {
                 return y.age;
               } else { return null; }
             } else { return null; }
           })();

Yes, you have to manually check at each level if the value is null or not. Without this manual check, your code is going to crash at runtime!

Now let's compare the situation with Clojure:

(-> foo :db second :age)

Yes, that's all. The default value in case of a problem is nil.
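
And when nil is not the default you want, get-in gives deep access with an explicit fallback (assuming the same foo as a Clojure map):

(get-in foo [:db 1 :age])        ;; => 40
(get-in foo [:db 10 :age] :n/a)  ;; => :n/a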

Merges

Seriously!!!!!

(into map1 map2)
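
A quick worked example; merge does the same for plain maps, with the right-most value winning on duplicate keys:

(into {:a 1 :b 2} {:b 3 :c 4})
;; => {:a 1, :b 3, :c 4}

(merge {:a 1 :b 2} {:b 3 :c 4})
;; => {:a 1, :b 3, :c 4}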

I don't even want to compare with JavaScript as it would be ridiculous. Mainly, you can't[2], or you need jQuery and it's ugly.

Syntax

Learning Clojure syntax takes about 3 minutes, thanks to homoiconicity. It is clean: no fucking commas, no semicolons, etc.

  • Vectors (arrays): [a b c]; in JavaScript: [a,b,c] (why the commas?)
  • Hash maps (associative arrays): {:key1 value1 :key2 value2}; in JavaScript you need to define an Object and keys are generally strings: {"key1": value1, "key2": value2}. Multiline object declarations always end up with the wrong number of commas.
  • Sets: #{:a :b :c}; in JavaScript sets don't even exist, you have to simulate them with Objects: {"a": true, "b": true, "c": true}
  • Inline function declarations: compare #(* % 2) in Clojure with function(x){return x * 2;} in JavaScript

TODO: I need help adding tons more of these, with better examples.