This commit is contained in:
Yann Esposito 2015-10-26 15:41:19 +01:00
parent 85aa98b5e2
commit 015a1a0417
5 changed files with 167 additions and 7 deletions

@@ -19,6 +19,7 @@
<li><a href="#the-elephant-graveyard">The Elephant Graveyard</a><ul>
<li><a href="#the-elephant">The Elephant</a></li>
<li><a href="#mongodb-the-destroyer">MongoDB the Destroyer!</a></li>
<li><a href="#months-of-gestation">9 months of gestation</a></li>
</ul></li>
</ul></li>
</ul>
@@ -79,15 +80,65 @@
<p>Under a minute is generally acceptable. But under the hood, we generally have less than half a second of latency.</p>
<p>Most of the latency is due to Twitter (about 2s) or GNIP (about 15s).</p>
<h3 id="the-elephant">The Elephant</h3>
<p>Before, everything was done in PHP. Yes, in PHP. From retrieving tweets, to counting them, to displaying them. Let's not even talk about the quality of the code.</p>
<p>At the beginning nothing was saved; the system could only display the data since midnight. If you only needed to count how many tweets were received, and you were the only client looking at the website, it could handle about 200 tweets/s on very large machines. Which, knowing how everything was coded, wasn't so bad.</p>
<p>But if you also wanted to display the number of men and women, and the number of tweets for each of the three sentiments (positive, negative or informative), then the system couldn't handle more than 7 tweets per second.</p>
<h3 id="mongodb-the-destroyer">MongoDB the Destroyer!</h3>
<figure>
<img src="img/mongodb-the-destroyer.png" alt="MongoDB the destroyer" /><figcaption>MongoDB the destroyer</figcaption>
</figure>
<p>Behind the scenes, data was saved in MongoDB. "Saved" is a big word where MongoDB is concerned.</p>
<p>More seriously, Mongo was a great DB, really easy to start with. Really fast and nice to use.</p>
<p>Nice, that is, until you reach the hard limits. At that time it was Mongo 2.6, so there was a <strong>Database Level Lock</strong>.</p>
<p>Yes, I repeat: <strong>Database Level Lock</strong>. Each time anyone read or wrote, nobody else could read or write at the same time.</p>
<p>And even very expensive clusters couldn't get around that hard limit.</p>
<p>There is a lot to say about MongoDB, and a lot has already been written. But the main point stands: MongoDB couldn't be trusted nor used for intensive data manipulation.</p>
<p>Now the situation might have changed. But there are better tools for the job.</p>
<p>When we arrived, many clients had already paid, and many products were supposed to come to life.</p>
<p>So here is what was done:</p>
<ul>
<li>create an incremental map-reduce system for MongoDB in node.js,</li>
<li>use HHVM to boost PHP performance,</li>
<li>add a real API, called from AngularJS,</li>
<li>lots of optimizations.</li>
</ul>
<p>In the end the system was able to deal with <em>far</em> more volume than before. It could display all the information with a volume of about 120 tweets/min, roughly a 17× improvement. But as with squeezing performance out of an old Nintendo, we had reached the maximum power we could extract from this aging system.</p>
<p>I haven't yet spoken about the frontend. Even moving from server-generated PHP pages to AngularJS wasn't enough. We started to hit the limits of the complexity we could manage with the AngularJS architecture.</p>
<p>It was clear that each new component added more than linear complexity to the project; it was closer to quadratic. We started to experience strange bugs that were very hard to reproduce. The lack of explicitness was also a problem.</p>
<h3 id="months-of-gestation">9 months of gestation</h3>
<p>It was clear from there that nothing could work correctly and everything should be rewritten.</p>
<p>The startup had, at that time, only two people to rewrite everything from scratch. The chosen language was Clojure, for multiple reasons.</p>
<p>Something that made the evolution possible was the time taken to explore new technologies. About half a day per week was devoted to experimenting with possible new architectures.</p>
<p>For example, we created a Haskell tweet absorber. It was clear that it could handle thousands of tweets per second. In fact, at that time the system was able to absorb about 6000 tweets/s. That means mostly the full firehose on a single machine.</p>
<p>Certainly it would be a good idea to use it instead of yet another Java client.</p>
<p>Next we experimented with Clojure clients for building RESTful APIs. And the experience was great: it was really fast to develop and create new things.</p>
<p>Also, a lot of success stories with Clojure made us confident we could use it in production.</p>
<p>At that time, Haskell wasn't suitable for production. Cabal hell was really problematic; we had to use <code>cabal freeze</code> often. There were other problems too: it was hard to install, compile and deploy.</p>
<p>Thanks to <code>stack</code>, this is no longer the case<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>.</p>
<p>Furthermore, dealing with realtime processing at that time was all about the Java ecosystem. There were Storm, Kafka, Zookeeper, etc…</p>
<p>So using a language that could use all the Java libraries was very important. That narrowed the choice down to Scala and Clojure. Looking at Scala, it was clear that it would be to Java what C++ is to C, while Clojure is a LISP: everything was easier to read and understand, and the community seemed great.</p>
<p>Furthermore, Storm itself was first written in Clojure. So go go go!</p>
<p>During the summer most technical choices were made:</p>
<ol type="1">
<li>Deploy using Mesos / Marathon,</li>
<li>Use Kafka and Storm,</li>
<li>Use Druid and ElasticSearch for the tweets DB,</li>
<li>Still use MongoDB for resources (users, projects, keywords by projects, etc.),</li>
<li>Use <code>compojure-api</code> for the API server,</li>
<li>Continue to use Haskell to absorb tweets,</li>
<li>Use reagent instead of Angular for the frontend.</li>
</ol>
<p>Each choice was weighed. In the end some of those choices were changed.</p>
<p>For example, we don't use Storm at all now. The power of core.async was more than enough to take full advantage of our machines' resources.</p>
<p>Today you can see a result here:</p>
<iframe src="https://dev.pulse.vigiglo.be/#/vgteam/TV_Shows" style="width:100%; border:solid 2px #DDD; padding: none; margin: 20px 0; height: 500px;">
</iframe>
<section class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Just a great thank you to FPComplete and in particular Michael Snoyman!<a href="#fnref1"></a></p></li>
</ol>
</section>
<hr/>
<div id="footer">
<a href="http://about.me/gbuisson">G</a> -


@@ -74,6 +74,7 @@ Most of the latency is due to Twitter (about 2s) or GNIP (about 15s).

Before, everything was done in PHP.
Yes, in PHP. From retrieving tweets, to counting them, to displaying them.
Let's not even talk about the quality of the code.

At the beginning nothing was saved;
the system could only display the data since midnight.
@@ -86,8 +87,116 @@ the number of tweets for each of the three sentiments (positive, negative or informative),

### MongoDB the Destroyer!

![MongoDB the destroyer](img/mongodb-the-destroyer.png)

Behind the scenes, data was saved in MongoDB.
"Saved" is a big word where MongoDB is concerned.

More seriously, Mongo was a great DB, really easy to start with.
Really fast and nice to use.
Nice, that is, until you reach the hard limits.
At that time it was Mongo 2.6,
so there was a **Database Level Lock**.

Yes, I repeat: **Database Level Lock**.
Each time anyone read or wrote, nobody else could read or write at the same time.

And even very expensive clusters couldn't get around that hard limit.

There is a lot to say about MongoDB, and a lot has already been written.
But the main point stands:
MongoDB couldn't be trusted nor used for intensive data manipulation.

Now the situation might have changed.
But there are better tools for the job.

When we arrived, many clients had already paid.
And many products were supposed to come to life.
So here is what was done:

- create an incremental map-reduce system for MongoDB in node.js,
- use HHVM to boost PHP performance,
- add a real API, called from AngularJS,
- lots of optimizations.
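The incremental map-reduce idea can be sketched in plain node.js. This is a minimal, illustrative sketch, not the original code: it assumes each tweet carries a `sentiment` field and a timestamp `ts`, and it simulates in memory what MongoDB's `mapReduce` does with the `out: { reduce: ... }` output mode, where each run processes only new documents and merges its partial counts into the running totals.

```javascript
// map: emit one count per sentiment key for each tweet.
function map(tweet, emit) {
  emit(tweet.sentiment, 1);
}

// reduce: sum the counts emitted for a given key.
function reduce(key, values) {
  return values.reduce((a, b) => a + b, 0);
}

// One incremental pass: only process tweets newer than `lastSeen`,
// then re-reduce the partial counts into the running `totals` object
// (the in-memory analogue of `out: { reduce: "totals" }`).
function incrementalPass(tweets, lastSeen, totals) {
  const partial = {};
  for (const t of tweets) {
    if (t.ts <= lastSeen) continue; // already counted in a previous pass
    map(t, (key, value) => {
      (partial[key] = partial[key] || []).push(value);
    });
  }
  for (const key of Object.keys(partial)) {
    const merged = totals[key] === undefined
      ? partial[key]
      : [totals[key], ...partial[key]];
    totals[key] = reduce(key, merged);
  }
  // New checkpoint: the most recent timestamp seen so far.
  return Math.max(lastSeen, ...tweets.map((t) => t.ts));
}

// Usage: two passes over a growing stream; the second pass only
// touches the new tweet.
const totals = {};
let last = 0;
last = incrementalPass(
  [{ ts: 1, sentiment: "positive" }, { ts: 2, sentiment: "negative" }],
  last, totals);
last = incrementalPass(
  [{ ts: 1, sentiment: "positive" },  // old, skipped
   { ts: 3, sentiment: "positive" }], // new, counted
  last, totals);
console.log(totals); // { positive: 2, negative: 1 }
```

The point of the incremental shape is that each pass costs only as much as the new tweets, instead of recounting the whole collection on every page load.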
In the end the system was able to deal with _far_ more volume than before.
It could display all the information with a volume of about 120 tweets/min.
That was roughly a 17× improvement. But as with squeezing performance out of an old Nintendo,
we had reached the maximum power we could extract from this aging system.

I haven't yet spoken about the frontend.
Even moving from server-generated PHP pages to AngularJS wasn't enough.
We started to hit the limits of the complexity we could manage with the AngularJS architecture.

It was clear that each new component added more than linear complexity to the project; it was closer to quadratic.
We started to experience strange bugs that were very hard to reproduce.
The lack of explicitness was also a problem.
### 9 months of gestation
It was clear from there that nothing could work correctly and everything should be rewritten.

The startup had, at that time, only two people to rewrite everything from scratch.
The chosen language was Clojure, for multiple reasons.

Something that made the evolution possible was the time taken to explore new technologies.
About half a day per week was devoted to experimenting with possible new architectures.

For example, we created a Haskell tweet absorber.
It was clear that it could handle thousands of tweets per second.
In fact, at that time the system was able to absorb about 6000 tweets/s.
That means mostly the full firehose on a single machine.

Certainly it would be a good idea to use it instead of yet another Java client.

Next we experimented with Clojure clients for building RESTful APIs.
And the experience was great.
It was really fast to develop and create new things.

Also, a lot of success stories with Clojure made us confident we could use it
in production.

At that time, Haskell wasn't suitable for production.
Cabal hell was really problematic;
we had to use `cabal freeze` often.
There were other problems too:
it was hard to install, compile and deploy.

Thanks to `stack`, this is no longer the case[^1].
[^1]: Just a great thank you to FPComplete and in particular Michael Snoyman!
Furthermore, dealing with realtime processing at that time was all about the Java ecosystem.
There were Storm, Kafka, Zookeeper, etc...

So using a language that could use all the Java libraries was very important.
That narrowed the choice down to Scala and Clojure.
Looking at Scala, it was clear that it would be to Java what C++ is to C,
while Clojure is a LISP. Everything was easier to read and understand.
The community seemed great.

Furthermore, Storm itself was first written in Clojure.
So go go go!
During the summer most technical choices were made:

1. Deploy using Mesos / Marathon,
2. Use Kafka and Storm,
3. Use Druid and ElasticSearch for the tweets DB,
4. Still use MongoDB for resources (users, projects, keywords by projects, etc.),
5. Use `compojure-api` for the API server,
6. Continue to use Haskell to absorb tweets,
7. Use reagent instead of Angular for the frontend.
Each choice was weighed.
In the end some of those choices were changed.

For example, we don't use Storm at all now.
The power of core.async was more than enough to take
full advantage of our machines' resources.

Today you can see a result here:
<iframe src="https://dev.pulse.vigiglo.be/#/vgteam/TV_Shows" style="width:100%; border:solid 2px #DDD; padding: none; margin: 20px 0; height: 500px;"></iframe>

Binary file not shown (new image, 692 KiB).

BIN articles/img/mongodb.jpg (new file, 3.6 KiB; binary not shown)

BIN articles/img/mongodb.png (new file, 14 KiB; binary not shown)