TODO: choose a title

TODO: tl;dr: … (3 sentences max)

TODO: introduction (20 lines max)


The Elephant Graveyard

Imagine you could get all tweets in real time.

Imagine you need to count them. Imagine you need to filter them by keywords. Imagine you need to answer complex questions about them in real time; for example, how many tweets from women, containing the word “clojure”, expressed a positive sentiment during the last hour. Now imagine the same question about the last year.

How would you do it?

First, you’ll need to absorb tweets in real time. The Twitter streaming API is there for that, but it limits you to 1% of the full Twitter volume. If you don’t want to be limited, you must either deal directly with Twitter or use GNIP.

Next, you’ll need to keep only the tweets of interest; for example, you’ll need to filter them by keyword.

Right after that, you need to add information to each received tweet: you need to enrich it with information it doesn’t already possess. For example, the gender of the author of a tweet must be guessed, and the same goes for the sentiment expressed by the tweet.
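To make this concrete, here is a minimal sketch of such an enrichment step in Clojure (the language this story ends with); guess-gender and guess-sentiment are hypothetical placeholders for real classifiers:

(defn guess-gender    [tweet] :female)    ; placeholder classifier
(defn guess-sentiment [tweet] :positive)  ; placeholder classifier

;; enrich a tweet map with the information it doesn't already possess
(defn enrich [tweet]
  (assoc tweet
         :gender    (guess-gender tweet)
         :sentiment (guess-sentiment tweet)))

(enrich {:text "clojure is wonderful" :author "jane"})
;; => {:text "clojure is wonderful", :author "jane",
;;     :gender :female, :sentiment :positive}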

In the end, you’ll need to display all this information in real time; by real time, we mean with very low latency.

Under a minute is generally acceptable. But under the hood, we generally have less than half a second of latency; most of the latency is due to Twitter (about 2 s) or GNIP (about 15 s).

The Elephant

Before, everything was done in PHP. Yes, in PHP: from retrieving tweets, to counting them, to displaying them. Let’s not even talk about the quality of the code.

At the beginning nothing was saved; the system could only display the data received since midnight. If you only needed to count how many tweets had been received, and you were the only client looking at the website, it could handle about 200 tweets/s on very large machines. Which, knowing how everything was coded, wasn’t so bad.

But if you also wanted to display the number of men and women, and the number of tweets for each of the three sentiments (positive, negative, or informative), then the system couldn’t handle more than 7 tweets per second.

MongoDB the Destroyer!


Behind the scenes, data was saved in MongoDB. “Saved” is a big word when talking about MongoDB.

More seriously, Mongo was a great DB to start with: really easy, really fast, and nice to use.

Nice, that is, until you reach its hard limits. At that time it was Mongo 2.6, so there was a database-level lock.

Yes, I repeat: a database-level lock. Each time anyone read or wrote, nobody else could read or write at the same time.

And even very expensive clusters couldn’t work around that hard limit.

The result: when MongoDB was asked to read and write a lot (even using batches), it started to lose data. If we can’t write them, let’s just destroy them! Furthermore, the code dealing with tweet insertion in MongoDB was really hard to work with, with no proper error handling. In the end: data loss…

There is a lot to say about MongoDB, and a lot has already been written. But the main point remains: MongoDB couldn’t be trusted, nor used, for intensive data manipulation.

Now, the situation might have changed. But there are better tools for the same job.

When we arrived, many clients had already paid, and many products were due to come to life.

So, what was done?

In the end, the system was able to deal with far more volume than before. It could display all the information at a volume of about 120 tweets/s, which was about a 17× improvement. But as with optimizing for an old Nintendo, we had reached the maximum power we could extract from this aging system.

I haven’t spoken about the frontend yet. Even moving from server-generated pages in PHP to AngularJS wasn’t enough. We started to experience the limits of the complexity we could reach using the AngularJS architecture.

It was clear that each new component added more than linear complexity to the project as a whole; it was closer to quadratic. We started to experience strange bugs that were very hard to reproduce. The lack of explicitness was also a problem.

9 months of gestation

It was clear from there that nothing could work correctly and that everything had to be rewritten.

The startup had, at that time, only two people to rewrite everything from scratch. The language chosen was Clojure, for multiple reasons.

Something that made this evolution possible was the time taken to explore new technologies: about half a day per week was devoted to experimenting with possible new architectures.

For example, we created a Haskell tweet absorber. It was clear that it could handle thousands of tweets per second. In fact, at that time the system was able to absorb about 6000 tweets/s, which means mostly the full firehose on a single machine.

It was certainly a better idea to use it than yet another Java client.

Next, we experimented with Clojure for building RESTful APIs. The experience was great: it was really fast to develop and create new things.
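To give a flavor of it, here is a minimal endpoint sketch with compojure-api (the library we eventually picked, see the list below); the route and the response shape are purely illustrative:

(ns api.sketch
  (:require [compojure.api.sweet :refer [api GET]]
            [ring.util.http-response :refer [ok]]))

(def app
  (api
    (GET "/count" []
      :query-params [q :- String]
      :summary "number of tweets matching a keyword (illustrative)"
      (ok {:keyword q :count 42}))))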

Also, the many success stories with Clojure made us confident we could use it in production.

At that time, Haskell wasn’t suitable for production. Cabal hell was really problematic; we had to use cabal freeze often. And there were other problems: it was hard to install, compile, and deploy.

Thanks to stack, this is no longer the case [1].

Furthermore, dealing with real-time processing at that time was all about the Java ecosystem: Storm, Kafka, Zookeeper, etc.

So using a language that could tap into all the Java libraries was very important. Narrowed down by that constraint, we simply had the choice between Scala and Clojure. Looking at Scala, it was clear that it would be to Java what C++ is to C. Clojure, on the other hand, is a LISP: everything was easier to read and to understand. And the community seemed great.

Furthermore, Storm itself was first written in Clojure. So go, go, go!

During the summer, most of the technical choices were made:

  1. Deploy using Mesos/Marathon,
  2. Use Kafka and Storm,
  3. Use Druid and ElasticSearch for the tweets DB,
  4. Still use MongoDB for resources (users, projects, keywords by project, etc.),
  5. Use compojure-api for the API server,
  6. Continue to use Haskell to absorb tweets,
  7. Use reagent instead of Angular for the frontend.

Each choice was carefully weighed. In the end, some of those choices changed.

For example, we don’t use Storm at all anymore: plain core.async turned out to be powerful enough to make full use of our machines’ resources.
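To give a rough idea of what such a pipeline looks like, here is a minimal core.async sketch (not our actual code); keep-tweet? and enrich are hypothetical stand-ins for the keyword filter and the enrichment step:

(require '[clojure.core.async :as async :refer [chan]])

(defn keep-tweet? [tweet]                        ; hypothetical keyword filter
  (re-find #"clojure" (:text tweet "")))

(defn enrich [tweet]                             ; hypothetical enrichment
  (assoc tweet :gender :unknown :sentiment :informative))

;; tweets flow through buffered channels; a transducer does the
;; filtering and the enrichment, without any external framework
(def in  (chan 1024))
(def out (chan 1024 (comp (filter keep-tweet?) (map enrich))))
(async/pipe in out)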

Today you can see a result here:

Long live the new flesh


There were difficulties with the new mindset. As with everything new, there is a period of adaptation. Typically, the most difficult part was dealing with things like inverting a hash map.

In JavaScript, one would write:

foo["a"]="value-for-a"
foo["b"]="value-for-b"
foo["c"]="value-for-c"

foreach (i in foo) {v[foo[i]]=i;}

Or doing things like:

var foo = [[1, 2, 3], [4, 5, 6]];
var tmp = 0;
for (var i = 0; i < foo.length; i++) {
  for (var j = 0; j < foo[i].length; j++) {
    tmp += foo[i][j] + 2;   // add 2 to each element while summing
  }
}
return tmp;   // => 33 (when run inside a function)

Now that I am used to map, reduce, and filter, this is second nature. And the new solution is far better.

For example, the preceding code is written:

(def foo [[1 2 3] [4 5 6]])
(defn plus2 [acc x] (+ acc x 2))   ; accumulate x, adding 2 per element
(defn sum [l] (reduce + 0 l))
(sum (map (fn [l] (reduce plus2 0 l)) foo))
;; => 33

;; or, using the threading macro:

(->> foo
     (map #(reduce plus2 0 %))
     (sum))

The code is more modular, easier to read, and easier to modify.

So what were the immediate wins?

Deep access

For the brave and true, there is the Haskell lens library. But for Clojurists, the basic access functions should be good enough.

Let’s compare JavaScript with Clojure:

foo={"db": [{"name":"John Doe","age":30},{"name":"Rich","age":40},{"age":20}]
    // other stuff , ....
    }

var val = function() {
            x = foo[db];
            if (x) {
              let y = x[1];
              if (y) {
                return y.age;
              } else return nil;
            } else return nil;
          }();

Yes, you have to manually check, at each level, whether the value is null. Without these manual checks, your code will crash at runtime!

Now let’s compare the situation with Clojure:

(-> foo :db second :age)

Yes, that’s all. If anything is missing along the way, the result is simply nil.
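For deeper paths, the core function get-in does the same job and even accepts a default value. With the same data as a Clojure map:

(def foo {:db [{:name "John Doe" :age 30}
               {:name "Rich" :age 40}
               {:age 20}]})

(-> foo :db second :age)              ;; => 40
(get-in foo [:db 1 :age])             ;; => 40
(get-in foo [:db 5 :age] :not-found)  ;; => :not-found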

Merges

Seriously!!!!!

(into map1 map2)
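For example, at the REPL (entries from the second map win):

(into {:a 1 :b 2} {:b 3 :c 4})
;; => {:a 1, :b 3, :c 4}

;; merge reads even more directly:
(merge {:a 1 :b 2} {:b 3 :c 4})
;; => {:a 1, :b 3, :c 4}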

I don’t even want to compare this to JavaScript, as it would be ridiculous. Mainly, you can’t [2], or you need jQuery, and it’s ugly.

Syntax

Learning Clojure syntax takes about 3 minutes. It is clean: no fucking commas, semicolons, etc…
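To give an idea, practically the entire syntax fits in a few lines:

(+ 1 2 3)                 ;; a call: (function arg1 arg2 …) => 6
{:a 1 :b 2}               ;; map literal
[1 2 3]                   ;; vector literal
#{:x :y}                  ;; set literal
(defn greet [name]        ;; defining a function
  (str "Hello, " name "!"))
(greet "Clojure")         ;; => "Hello, Clojure!"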

TODO: I could use some help adding a few more tons of these, with better examples.


  1. Just a great thank you to FPComplete and in particular Michael Snoyman!

  2. http://stackoverflow.com/questions/171251/how-can-i-merge-properties-of-two-javascript-objects-dynamically