diff --git a/articles/Two_years_with_clojure.md b/articles/Two_years_with_clojure.md
index f0fe10a..f56b9ec 100644
--- a/articles/Two_years_with_clojure.md
+++ b/articles/Two_years_with_clojure.md
@@ -39,51 +39,51 @@ TODO: Remove the detailled plan
 
 ## The Elephant Graveyard
 
-Imagine you could get all tweets in realtime.
+Imagine you can get all twitter data in realtime.
 
-Imagine you need to count them.
+Imagine you need to count tweets.
 
 Imagine you need to filter them by keywords.
 
-Imagine you need to answer complex questions about them in realtime.
-For example, how many tweets from women, containing the word `clojure` expressing a positive sentiment during the last hour.
-Imagine the same question about the last year.
+Imagine you have to answer complex questions about all this data in realtime.
+For instance, how many tweets from women, containing the word `clojure`, expressing a positive sentiment, were submitted during the last hour.
+Now, imagine the same question, but over a full year of data.
 
 How would you do it?
 
 First you'll need to absorb tweets in realtime.
 The twitter streaming API is here for that.
-But you are limited to 1% of all twitter volume.
-If you want not to be limited, you need either deal directly with twitter or use GNIP.
+However, you are limited to 1% of the full twitter volume.
+If you don't want to be limited, you either need to deal directly with twitter or use GNIP.
 
-Next, you'll need to keep only tweet of interest.
-By example, you'll need to filter by keyword.
+Next, you'll need to keep only tweets of interest and discard the rest.
+For instance, you'll have to filter them by keyword.
 
-Just after that, you need to add informations for each received tweet.
-You need to enrich them by information it doesn't already possess.
-For example, the gender of the author of a tweet must be guessed.
-The same for the sentiment expressed by the tweet.
+Right after that, you have to add extra data to each received tweet.
+For instance, Twitter doesn't provide gender information, so you have to guess it somehow.
+The same applies to the sentiment expressed in the tweet.
 
 In the end you'll need to display all these informations in real-time.
-By real-time we mean with a very low latency.
+By real-time we mean with very low latency.
 Under the minute is generally acceptable.
-But under the hood, we generally have less than half a second of latency.
+Our processing pipeline usually processes tweets with less than half a second of latency.
 
-Most of the lantency is due to twitter (about 2s) or gnip (about 15s).
+However, most of the total latency comes from the twitter streaming API (about 2s) or the GNIP endpoints (about 15s).
 
 ### The Elephant
 
+When we got the job, we inherited prototypes built by external contractors.
+Back then, everything was done in PHP.
+Yes, in PHP! From retrieving tweets to aggregate generation and display.
+Let's not even talk about code quality.
-Before everything was done in PHP.
-Yes, in PHP. From retreiving tweets, to count them and display them.
-Let's not even talk about the quality of the code.
-
-At the beginning nothing was saved, the system could only display the data
-since midninght.
+At the beginning nothing was saved over time, the system could only display data
+since midnight.
-If you only need to count how many tweets was received and you were the
-only client looking at the website, it could handle about 200 tweets/s.
-On very large machines and which, knowing how everything was coded, wasn't so bad.
+If you only need to count how many tweets were received and you were the
+only client looking at the website, this "architecture" could handle about 200 tweets/s max.
+That was achieved on very large machines, which, knowing how everything was coded, wasn't so bad.
 
-But if you wanted to display also the number of men and women,
-the number of tweets for each of the three sentiments (positive, negative or informative), then the system couldn't handle more than 7 tweets by second.
+But if you needed to add enrichments to achieve complex drilldowns,
+for instance how many men or women,
+how many positive, negative or informative tweets, then the system couldn't handle more than 7 tweets per second.
 
 ### MongoDB the Destroyer!
 
@@ -102,10 +102,10 @@ So there was a **Database Level Lock**.
 Yes, I repeat: **Database Level Lock**.
 Each time you read or write, nobody could read or write at the same time.
 
-And even using very expensive clusters, these can't handle the hard limits.
+And even using very expensive clusters, they couldn't handle these hard limits.
 
-The result, when the MongoDB was asked to write and read a lot (even using batches), you start to lose datas.
-If you can write them, let's destroy them.
+As a result, when we asked Mongo to read and write a lot at the same time, we started to witness data loss...
+If you can't write them, let's destroy them.
 
 Furthermore the code dealing with tweet insertion in MongoDB was really hard to manipulate.
 No correct error handling. In the end, data loss...
 There is a lot to say about MongoDB and a lot was already written.
@@ -115,51 +115,52 @@ MongoDB couldn't be trusted nor used for intensive data manipulation.
 Now, the situation might have changed.
 But there are better tools for the same job.
 
-When we arrived, many client had already paid.
-And many product should come to life.
+When we got the job, many customers were already on board,
+and many new products were planned.
 
-So what was done:
+So here is what we did first:
 
-- create an incremental map reduce system for mongoDB in node.js.
-- use HHVM to boost PHP performances
-- added a real API called with Angular JS
-- lot of optimizations
+- create an incremental map-reduce system for MongoDB with node.js.
+- use HHVM to somewhat boost PHP performance
+- create a real data API to be consumed by Angular JS
+- lots of code optimizations
 
-In the end the system was able to deal with _far_ more volume than before.
-It could display all informations with a volume of about 120 tweets/min.
+In the end our system was able to deal with _far_ more volume than before.
+It could display all the information we talked about before with a volume of about 120 tweets/min.
 Which was about x17 progress.
 But as in the optimisation of old nitendo.
-We reached the maximal of power we could from this old age system.
+We reached the maximum power we could get out of this old legacy system.
 
-I didn't spoke about the frontend.
-Even passing from generated from the server page in PHP to AngularJS wasn't enough.
-We started to experience the limit of the complexity we could reach using Angular JS architecture.
+Let's not even speak about our frontend.
+The code was a mess; to deal with it, we had to convert a single file with 10 000 lines of JS code into an Angular JS application.
+Anyway, we quickly started to experience complexity limits with our Angular JS architecture.
 
-It was clear that each new component added more than a linear complexity to all the project. It was more about a quadratic complexity.
-We started to experience stranges bug very hard to reproduce.
-The lack of expliciteness was also a problem.
+It was clear that each new component we created added more than linear complexity to the whole project.
+It was more about quadratic complexity.
+We started to experience weird bugs that were very hard to reproduce.
+The lack of explicitness was also a real problem.
 
 ### 9 months of gestation
 
-It was clear from here that nothing could work correctly and everything should ber rewritten.
+It was clear from there that nothing could work correctly and everything had to be rewritten.
 The startup had, at that time, only two people to rewrite everything from scratch.
 The chosen language was Clojure for multiple reasons.
 
-Something that made the evolution possible was the time taken to explore new technologies.
-About half a day per week was focused toward experimentation of new possible architecture.
+Something that made this evolution possible was the time we took to explore new technologies.
+About half a day per week was dedicated to experimenting with new technologies.
 
-For example, we created a Haskell tweet absorber.
+For instance, we created a Haskell tweet absorber.
 It was clear that it could handle thousands of tweets per seconds.
 In fact, at that time the system was able to absorb about 6000 tweets/s.
-That mean mostly the full firhose on a single machine.
+That means roughly the full firehose on a single machine.
 
-Certainly it will be a good idea to use it instead of another java client.
+We thought it was certainly a good idea to use it instead of another java client.
 
-Next we experimented clojure clients for making restful API.
+Next we experimented with clojure projects intended to create a restful API.
 And the experience was great.
 It was really fast to develop and create new things.
-Also lot of success stories with clojure made us confident we could use it
+Also, the many clojure success stories we noticed throughout the internet made us confident we could use it
 in production.
 
 At that time, Haskell wasn't suitable for production.
@@ -168,40 +169,40 @@ We had to use `cabal freeze` often.
 There were other problems.
 It was hard to install, compile and deploy.
 
-Thank to `stack` this is no more the case[^1].
+Now, thanks to `stack`, this is no longer the case[^1].
 
 [^1]: Just a great thank you to FPComplete and in particular Michael Snoyman!
 
 Further more, dealing with realtime processing at that time was all about java ecosystem.
-There were Storm, Kafka, Zookeeper, etc...
+There was Storm, Kafka, Zookeeper, etc...
 
-So using a language which could use all the java libraries was very important.
-Narrowed by that we simply had the choice between Scala and Clojure.
+So using a language which could use all the java libraries seemed very important to us.
+With that in mind, we simply had to choose between Scala and Clojure.
 
 Looking at scala, it was clear that it will be to Java what C++ is to C.
-While Clojure as a LISP. Everything was easier to read, to understand.
-The community seemed great.
+Clojure, on the other hand, being a descendant of the LISP family, made everything feel simple, easier to read and understand.
+The Clojure community sounded great.
 
 Furthermore Storm was first written in Clojure.
 So go go go!
 
-During the summer most technical choices was made.
+It was during the summer that most technical choices were made.
 
-1. Deploy using Mesos / Marathon,
+1. We wanted to deploy using Mesos / Marathon,
 2. Use Kafka and Storm,
 3. Use Druid and ElasticSearch for tweets DB,
 4. Still use MongoDB for resources (users, projects, keywords by projects, etc...).
 5. Use `compojure-api` for the API server
-6. Continue to use Haskell to absorb tweets
+6. Keep using Haskell to absorb tweets
 7. Use reagent instead of Angular for the frontend
 
 Each choice was balanced.
-In the end some of those choices were changed.
+In the end some of those choices were changed through practice.
 
-For example, we don't use Storm at all now.
-The power of core.async was far from enough to deal with taking all
-resources of our machines.
+For instance, we discarded Storm.
+The power of core.async was more than enough to efficiently exploit all the juice of our machines;
+Storm added some unnecessary latency and complexity.
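+
+To make this concrete, below is a minimal core.async sketch of the kind of pipeline we are talking about. It is only an illustration: `keep-tweet?` and `enrich` are placeholder functions standing in for the real keyword filter, gender guesser and sentiment classifier, not our production code.
+
+~~~
+(require '[clojure.core.async :refer [chan go-loop <! >! close!]])
+
+;; Placeholder predicate and enricher.
+(defn keep-tweet? [tweet]
+  (re-find #"clojure" (:text tweet "")))
+
+(defn enrich [tweet]
+  (assoc tweet :gender :unknown :sentiment :informative))
+
+;; One pipeline stage: read tweets from `in`, keep the interesting ones,
+;; enrich them, and push them downstream on a bounded channel.
+(defn process-tweets [in]
+  (let [out (chan 1024)]
+    (go-loop []
+      (if-let [tweet (<! in)]
+        (do (when (keep-tweet? tweet)
+              (>! out (enrich tweet)))
+            (recur))
+        (close! out)))
+    out))
+~~~
+
+Because the channels are bounded, backpressure comes for free: when `out` is full the stage simply parks, with no Storm topology, no Zookeeper, and no extra serialization hops.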
 
-Today you could see a result here:
+Today you can see a result here:
@@ -257,7 +258,7 @@ For example the preceeding example write:
    (sum))
 ~~~
 
-The code is more modulable, easier to read and to modify.
+The resulting code is much better: modular, easier to read and to modify.
 
 - Java null pointer exception!
 - Unreadable stacktrace
@@ -313,7 +314,7 @@ Mainly, you can't[^2], or you need jQuery and its ugly.
 
 ### Syntax
 
-Learning Clojure syntax take about 3 minutes.
+Learning Clojure syntax takes about 3 minutes, thanks to homoiconicity.
 It is clean, no _fucking_ comma, semicolons, etc...
 
 - Arrays: `[a b c]` in javascript `[a,b,c]` (why the commas?)