better english, small improvements
This commit is contained in:
parent
3c21335521
commit
2a9d8741a8
1 changed file with 70 additions and 69 deletions

@@ -39,51 +39,51 @@ TODO: Remove the detailed plan

## The Elephant Graveyard

Imagine you could get all Twitter data in realtime.

Imagine you need to count tweets.
Imagine you need to filter them by keywords.
Imagine you have to answer complex questions about all this data in realtime.
For instance, how many tweets from women, containing the word `clojure` and expressing a positive sentiment, were submitted during the last hour?
Now, imagine the same question, but over a full year of data.

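Expressed over a collection of tweets that have already been enriched with gender and sentiment (more on that below), the question itself is just a filter and a count. A minimal sketch with illustrative field names and millisecond timestamps, not our real code:

~~~
(require '[clojure.string :as str])

(defn matches?
  "Illustrative predicate: a tweet from a woman, mentioning clojure,
   with a positive sentiment, posted within the last hour."
  [{:keys [gender sentiment text timestamp]} now]
  (and (= gender :female)
       (= sentiment :positive)
       (str/includes? (str/lower-case text) "clojure")
       (> timestamp (- now (* 1000 60 60)))))

(defn how-many [tweets now]
  (count (filter #(matches? % now) tweets)))
~~~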

How would you do it?

First you'll need to absorb tweets in realtime.
The Twitter streaming API is here for that.
However, you are limited to 1% of the total Twitter volume.
If you don't want to be limited, you either need to deal directly with Twitter or use GNIP.

Next, you'll need to keep only the tweets of interest and discard the rest.
For instance, you'll have to filter them by keyword.

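To give a rough idea, here is a minimal Clojure sketch of that filtering step; `tracked-keywords` and `keep-tweet?` are illustrative names, not the actual production code:

~~~
(ns example.filtering
  (:require [clojure.string :as str]))

(def tracked-keywords ["clojure" "haskell"])

(defn keep-tweet?
  "True when the tweet text mentions at least one tracked keyword."
  [tweet]
  (let [text (str/lower-case (:text tweet ""))]
    (boolean (some #(str/includes? text %) tracked-keywords))))

;; (filter keep-tweet? [{:text "I love Clojure"} {:text "hello"}])
;; => ({:text "I love Clojure"})
~~~
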
Right after that, you have to enrich each received tweet with extra data it doesn't already contain.
For instance, Twitter doesn't provide gender information, so you have to guess it somehow.
The same applies to the sentiment expressed in the tweet.

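The enrichment step boils down to adding new keys to each tweet map. A hedged sketch, where `guess-gender` and `guess-sentiment` are placeholders for the real classifiers:

~~~
(defn guess-gender
  "Placeholder: the real system infers gender from the author profile."
  [tweet]
  :unknown)

(defn guess-sentiment
  "Placeholder: the real system runs a sentiment classifier on the text."
  [tweet]
  :informative)

(defn enrich-tweet
  "Attach the guessed gender and sentiment to the tweet map."
  [tweet]
  (assoc tweet
         :gender    (guess-gender tweet)
         :sentiment (guess-sentiment tweet)))
~~~
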
In the end you'll need to display all this information in real-time.
By real-time we mean with very low latency.

Under a minute is generally acceptable.
But under the hood, our processing pipeline usually handles a tweet in less than half a second.

Most of the total latency comes from the Twitter streaming API (about 2s) or the GNIP endpoints (about 15s).

### The Elephant

When we got the job, we inherited prototypes built by external contractors.
Back then, everything was done in PHP.
Yes, in PHP! From retrieving tweets to generating aggregates and displaying them.
Let's not even talk about the code quality.

At the beginning nothing was saved over time; the system could only display data since midnight.
If you only needed to count how many tweets were received and you were the only client looking at the website, this "architecture" could handle about 200 tweets/s max.
That was achieved on very large machines, which, knowing how everything was coded, wasn't so bad.

But if you needed enrichments to answer more complex drilldowns, for instance how many men or women, or how many positive, negative or informative tweets, then the system couldn't handle more than 7 tweets per second.

### MongoDB the Destroyer!

@@ -102,10 +102,10 @@ So there was a **Database Level Lock**.

Yes, I repeat: **Database Level Lock**.
Each time you read or wrote, nobody else could read or write at the same time.

And even using very expensive clusters, they couldn't handle these hard limits.

As a result, when we asked MongoDB to read and write a lot at the same time (even using batches), we started to witness data loss...
If you can't write them, let's destroy them.
Furthermore, the code dealing with tweet insertion in MongoDB was really hard to maintain, with no proper error handling. In the end: data loss...

There is a lot to say about MongoDB, and a lot has already been written.

@@ -115,51 +115,52 @@ MongoDB couldn't be trusted nor used for intensive data manipulation.

Now, the situation might have changed.
But there are better tools for the same job.

When we got the job, many customers were already on board,
and many new products were planned.

So here is what we did first:

- create an incremental map reduce system for MongoDB with node.js
- use HHVM to somewhat boost PHP performance
- create a real data API to be called with Angular JS
- lots of code optimizations

In the end our system was able to deal with _far_ more volume than before.
It could display all the information we talked about before with a volume of about 120 tweets/min.
That was about a 17x improvement. But, much like optimizing for an old Nintendo, we had reached the maximum power we could squeeze out of this legacy system.

Let's not even speak about our frontend.
The code was a mess; to deal with it, we had to convert a single file containing 10,000 lines of JS code into an Angular JS application.
Anyway, we quickly started to hit the limits of complexity we could handle with our Angular JS architecture.

It was clear that each new component we created added more than linear complexity to the whole project; it was closer to quadratic.
We started to experience weird bugs that were very hard to reproduce.
The lack of explicitness was also a real problem.

### 9 months of gestation

It was clear from that point that nothing could work correctly and everything had to be rewritten.

The startup had, at that time, only two people to rewrite everything from scratch.
The chosen language was Clojure, for multiple reasons.

Something that made this evolution possible was the time we took to explore new technologies.
About half a day per week was dedicated to experimenting with possible new architectures.

For instance, we created a Haskell tweet absorber.
It was clear that it could handle thousands of tweets per second.
In fact, at that time the system was able to absorb about 6000 tweets/s.
That is just about the full firehose on a single machine.

We thought it was certainly a good idea to use it instead of yet another Java client.

Next, we experimented with Clojure libraries intended for building a RESTful API.
And the experience was great.
It was really fast to develop and create new things.

Also, the many Clojure success stories we noticed throughout the internet made us confident we could use it in production.

At that time, Haskell wasn't suitable for production.

@@ -168,40 +169,40 @@ We had to use `cabal freeze` often.

There were other problems.
It was hard to install, compile and deploy.

Now, thanks to `stack`, this is no longer the case[^1].

[^1]: Just a great thank you to FPComplete and in particular Michael Snoyman!

Furthermore, dealing with realtime processing at that time was all about the Java ecosystem.
There was Storm, Kafka, Zookeeper, etc...

So using a language which could use all the Java libraries seemed very important to us.
With that in mind we simply had to choose between Scala and Clojure.
Looking at Scala, it was clear that it would be to Java what C++ is to C.
Clojure, on the other hand, being a descendant of the LISP family, felt simpler: everything was easier to read and understand.
And the Clojure community seemed great.

Furthermore, Storm was originally written in Clojure.
So go go go!

It was during the summer that most of the technical choices were made.

1. We wanted to deploy using Mesos / Marathon,
2. Use Kafka and Storm,
3. Use Druid and ElasticSearch for the tweets DB,
4. Still use MongoDB for resources (users, projects, keywords by projects, etc...),
5. Use `compojure-api` for the API server (see the sketch after this list),
6. Go on with Haskell to absorb tweets,
7. Use reagent instead of Angular for the frontend.

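As an illustration of choice 5, a minimal `compojure-api` route could look like the following sketch; `count-tweets` is a made-up stub, not our real code:

~~~
(ns example.api
  (:require [compojure.api.sweet :refer [api GET]]
            [ring.util.http-response :refer [ok]]))

(defn count-tweets
  "Stub standing in for the real query against the tweets DB."
  [kw]
  42)

(def app
  (api
    (GET "/tweets/count" []
      :query-params [kw :- String]
      (ok {:keyword kw :count (count-tweets kw)}))))
~~~
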
Each choice was carefully weighed.
In the end some of those choices were changed through practice.

For instance, we discarded Storm.
The power of core.async was more than enough to efficiently exploit all the juice of our machines, and Storm added unnecessary latency and complexity.

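To give an idea of the kind of construct that replaced Storm for us, here is a rough `core.async` sketch; the channel sizes, the worker count and `enrich-tweet` are illustrative assumptions:

~~~
(ns example.pipeline
  (:require [clojure.core.async :refer [chan pipeline >!! <!!]]))

(def raw-tweets (chan 1024))       ;; tweets coming from the absorber
(def enriched-tweets (chan 1024))  ;; tweets ready to be indexed

(defn enrich-tweet
  "Placeholder for keyword filtering plus gender/sentiment enrichment."
  [tweet]
  (assoc tweet :enriched? true))

;; Run the enrichment on 8 parallel workers, no Storm cluster involved.
(pipeline 8 enriched-tweets (map enrich-tweet) raw-tweets)

;; (>!! raw-tweets {:text "core.async instead of Storm"})
;; (<!! enriched-tweets)
;; => {:text "core.async instead of Storm", :enriched? true}
~~~
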
Today you can see a result here:

<div class="wrap" style="height: 630px; width: 100%;">
<iframe src="https://dev.pulse.vigiglo.be/#/vgteam/TV_Shows/dashboard" style="width:200%; border:solid 2px #DDD; padding: none; margin: 20px 0; height: 1200px; -ms-zoom:0.5; -moz-transform: scale(0.5); -moz-transform-origin: 0 0; -o-transform: scale(0.5); -o-transform-origin: 0 0; -webkit-transform: scale(0.5); -webkit-transform-origin: 0 0"></iframe>

@@ -257,7 +258,7 @@ For example, the preceding example writes:
(sum))
~~~

The resulting code is much better: modular, easier to read and to modify.

- Java null pointer exception!
- Unreadable stacktraces

@@ -313,7 +314,7 @@ Mainly, you can't[^2], or you need jQuery and it's ugly.

### Syntax

Learning Clojure syntax takes about 3 minutes, thanks to homoiconicity.
It is clean: no _fucking_ commas, semicolons, etc...

- Arrays: `[a b c]` in javascript `[a,b,c]` (why the commas?)