dimisit/articles/Two_years_with_clojure.html
2015-10-26 18:04:45 +01:00

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<title></title>
<style type="text/css">code{white-space: pre;}</style>
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<link rel="stylesheet" href="../styling.css">
</head>
<body>
<nav id="TOC">
<ul>
<li><a href="#todo-choose-a-title">TODO: choose a title</a><ul>
<li><a href="#plan">Plan</a></li>
<li><a href="#the-elephant-graveyard">The Elephant Graveyard</a><ul>
<li><a href="#the-elephant">The Elephant</a></li>
<li><a href="#mongodb-the-destroyer">MongoDB the Destroyer!</a></li>
<li><a href="#months-of-gestation">9 months of gestation</a></li>
</ul></li>
<li><a href="#long-live-the-new-flesh">Long live the new flesh</a><ul>
<li><a href="#deep-access">Deep access</a></li>
<li><a href="#merges">Merges</a></li>
<li><a href="#syntax">Syntax</a></li>
</ul></li>
</ul></li>
</ul>
</nav>
<h1 id="todo-choose-a-title">TODO: choose a title</h1>
<p>TODO: tl;dr: … (3 sentences max)</p>
<p>TODO: introduction (20 lines max)</p>
<h2 id="plan">Plan</h2>
<p>TODO: Remove the detailed plan</p>
<ul>
<li>Start with the end
<ul>
<li>show a pulse</li>
<li>explain what is simple / hard</li>
</ul></li>
<li>The situation before
<ul>
<li>problems with volume (MongoDB / PHP, etc…)</li>
<li>security issues</li>
<li>problems with abilities</li>
<li>angular complexity</li>
<li>refactoring issues</li>
<li>deployment issues</li>
</ul></li>
<li>The choices
<ul>
<li>why clojure?</li>
<li>why Haskell?</li>
<li>why not full Haskell?</li>
<li>why reagent?</li>
<li>why Kafka?</li>
<li>why Mesos / Marathon?</li>
<li>why Druid?</li>
<li>why still MongoDB?</li>
</ul></li>
<li>The first weeks
<ul>
<li>first impressions</li>
<li>what was harder?</li>
<li>what was easier?</li>
</ul></li>
<li>Once used to clojure
<ul>
<li>how does it feel?</li>
<li>was it a mistake?</li>
<li>Do we have any doubts?</li>
</ul></li>
<li>One year later (maintenance and impressions)</li>
</ul>
<h2 id="the-elephant-graveyard">The Elephant Graveyard</h2>
<p>Imagine you could get all tweets in realtime.</p>
<p>Imagine you need to count them. Imagine you need to filter them by keywords. Imagine you need to answer complex questions about them in realtime. For example: how many tweets written by women, containing the word <code>clojure</code> and expressing a positive sentiment, were posted during the last hour? Now imagine the same question about the last year.</p>
<p>How would you do it?</p>
<p>First, you'll need to absorb tweets in realtime. The Twitter streaming API is there for that, but it limits you to 1% of the total Twitter volume. To get past that limit, you need to either deal directly with Twitter or use GNIP.</p>
<p>Next, you'll need to keep only the tweets of interest. For example, you'll need to filter by keyword.</p>
<p>Right after that, you need to enrich each received tweet with information it doesn't already carry. For example, the gender of the author must be guessed, and so must the sentiment expressed by the tweet.</p>
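<p>The filter and enrich steps can be sketched in a few lines of Clojure. This is only an illustration, not our production code: the tweet shape (<code>:text</code>), the function names and the two placeholder guessers are assumptions made up for the example.</p>
<pre class="sourceCode clojure"><code class="sourceCode clojure">(require '[clojure.string :as str])

;; Hypothetical tweet shape: {:text &quot;...&quot;}
(defn matches-keyword?
  &quot;True when the tweet text contains at least one keyword (case-insensitive).&quot;
  [keywords tweet]
  (let [text (str/lower-case (:text tweet &quot;&quot;))]
    (boolean (some #(str/includes? text (str/lower-case %)) keywords))))

;; Placeholders (assumptions): real gender and sentiment guessing needs real models.
(defn guess-gender [tweet] :unknown)
(defn guess-sentiment [tweet] :informative)

(defn enrich
  &quot;Adds the fields the raw tweet does not carry.&quot;
  [tweet]
  (assoc tweet
         :gender (guess-gender tweet)
         :sentiment (guess-sentiment tweet)))

(defn process
  &quot;Keeps the tweets of interest, then enriches them.&quot;
  [keywords tweets]
  (-&gt;&gt; tweets
       (filter #(matches-keyword? keywords %))
       (map enrich)))

(process [&quot;clojure&quot;]
         [{:text &quot;Learning Clojure today&quot;}
          {:text &quot;Nothing to see here&quot;}])
;; =&gt; ({:text &quot;Learning Clojure today&quot;, :gender :unknown, :sentiment :informative})</code></pre>
<p>Each step is a plain function over a sequence of maps, so a new enrichment is just one more <code>map</code> in the chain.</p>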
<p>In the end, you'll need to display all this information in real-time. By real-time we mean with very low latency.</p>
<p>Under a minute is generally acceptable. But under the hood, our own system generally adds less than half a second of latency.</p>
<p>Most of the latency comes from Twitter (about 2s) or GNIP (about 15s).</p>
<h3 id="the-elephant">The Elephant</h3>
<p>Before, everything was done in PHP. Yes, in PHP. From retrieving tweets to counting and displaying them. Let's not even talk about the quality of the code.</p>
<p>At the beginning nothing was saved; the system could only display data since midnight. If you only needed to count how many tweets were received, and you were the only client looking at the website, it could handle about 200 tweets/s. That was on very large machines and, knowing how everything was coded, it wasn't so bad.</p>
<p>But if you also wanted to display the number of men and women and the number of tweets for each of the three sentiments (positive, negative or informative), then the system couldn't handle more than 7 tweets per second.</p>
<h3 id="mongodb-the-destroyer">MongoDB the Destroyer!</h3>
<figure>
<img src="img/mongodb-the-destroyer.png" alt="MongoDB the destroyer" /><figcaption>MongoDB the destroyer</figcaption>
</figure>
<p>Behind the scenes, data was saved in MongoDB. "Saved" is a big word when talking about MongoDB.</p>
<p>More seriously, Mongo was a great DB: really easy to start with, really fast and nice to use.</p>
<p>Nice, until you reach its hard limits. At that time it was Mongo 2.6, so there was a <strong>Database Level Lock</strong>.</p>
<p>Yes, I repeat: <strong>Database Level Lock</strong>. Each time someone wrote to a database, nobody else could read from it or write to it at the same time.</p>
<p>And even very expensive clusters can't get past these hard limits.</p>
<p>The result: when MongoDB was asked to write and read a lot (even using batches), you started to lose data. If you can't write them, let's destroy them. Furthermore, the code handling tweet insertion in MongoDB was really hard to work with. There was no correct error handling. In the end, data loss…</p>
<p>There is a lot to say about MongoDB and a lot has already been written. But the main point stands: MongoDB could neither be trusted nor used for intensive data manipulation.</p>
<p>Now, the situation might have changed. But there are better tools for the same job.</p>
<p>When we arrived, many clients had already paid, and many products were supposed to come to life.</p>
<p>So here is what was done:</p>
<ul>
<li>created an incremental map reduce system for MongoDB in node.js</li>
<li>used HHVM to boost PHP performance</li>
<li>added a real API, called from AngularJS</li>
<li>lots of optimizations</li>
</ul>
<p>In the end the system was able to deal with <em>far</em> more volume than before. It could display all the information with a volume of about 120 tweets/s, roughly a 17x improvement over the earlier 7 tweets/s. But, as with optimizing for an old Nintendo, we had reached the maximum power we could get from this aging system.</p>
<p>I haven't talked about the frontend yet. Even moving from server-generated PHP pages to AngularJS wasn't enough. We started to hit the limit of the complexity we could manage with the AngularJS architecture.</p>
<p>It was clear that each new component added more than linear complexity to the whole project; it was closer to quadratic. We started to see strange bugs that were very hard to reproduce. The lack of explicitness was also a problem.</p>
<h3 id="months-of-gestation">9 months of gestation</h3>
<p>It was clear from there that nothing could work correctly and everything had to be rewritten.</p>
<p>The startup had, at that time, only two people to rewrite everything from scratch. The chosen language was Clojure, for multiple reasons.</p>
<p>What made this evolution possible was the time taken to explore new technologies. About half a day per week was dedicated to experimenting with possible new architectures.</p>
<p>For example, we created a Haskell tweet absorber. It was clear that it could handle thousands of tweets per second. In fact, at that time the system was able to absorb about 6000 tweets/s. That means almost the full firehose on a single machine.</p>
<p>It certainly looked like a good idea to use it instead of yet another Java client.</p>
<p>Next we experimented with Clojure for building RESTful APIs, and the experience was great. It was really fast to develop and create new things.</p>
<p>Also, a lot of success stories with Clojure made us confident we could use it in production.</p>
<p>At that time, Haskell wasn't suitable for production. Cabal hell was really problematic. We had to use <code>cabal freeze</code> often. There were other problems: it was hard to install, compile and deploy.</p>
<p>Thanks to <code>stack</code> this is no longer the case<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>.</p>
<p>Furthermore, realtime processing at that time was all about the Java ecosystem: Storm, Kafka, Zookeeper, etc…</p>
<p>So using a language that could use all the Java libraries was very important. That narrowed the choice down to Scala or Clojure. Looking at Scala, it was clear that it would be to Java what C++ is to C. Clojure, on the other hand, is a LISP: everything was easier to read and to understand. The community seemed great.</p>
<p>Furthermore, Storm was first written in Clojure. So go go go!</p>
<p>During the summer most of the technical choices were made.</p>
<ol type="1">
<li>Deploy using Mesos / Marathon</li>
<li>Use Kafka and Storm</li>
<li>Use Druid and ElasticSearch for the tweets DB</li>
<li>Still use MongoDB for resources (users, projects, keywords by project, etc…)</li>
<li>Use <code>compojure-api</code> for the API server</li>
<li>Continue to use Haskell to absorb tweets</li>
<li>Use reagent instead of Angular for the frontend</li>
</ol>
<p>Each choice was weighed carefully. In the end some of those choices were changed.</p>
<p>For example, we don't use Storm at all now. The power of core.async turned out to be more than enough to make use of all the resources of our machines.</p>
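<p>To give an idea of what this looks like, here is a minimal sketch of a parallel processing step with core.async. It assumes the <code>org.clojure/core.async</code> library is on the classpath; the <code>enrich</code> function, the buffer sizes and the number of workers are made up for the example and are not our production values.</p>
<pre class="sourceCode clojure"><code class="sourceCode clojure">;; Assumption: org.clojure/core.async is available as a dependency.
(require '[clojure.core.async :as a])

;; Hypothetical enrichment step, a stand-in for the real work.
(defn enrich [tweet]
  (assoc tweet :sentiment :informative))

(defn run-pipeline
  &quot;Pushes tweets through 4 parallel workers and collects the results.&quot;
  [tweets]
  (let [in  (a/chan 100)
        out (a/chan 100)]
    ;; 4 workers apply the transducer to everything read from `in`;
    ;; `out` is closed automatically once `in` is closed and drained.
    (a/pipeline 4 out (map enrich) in)
    ;; producer thread: feed the input channel, then close it
    (a/thread
      (doseq [t tweets] (a/&gt;!! in t))
      (a/close! in))
    ;; block until every result has been collected into a vector
    (a/&lt;!! (a/into [] out))))

(run-pipeline [{:text &quot;a&quot;} {:text &quot;b&quot;}])
;; =&gt; [{:text &quot;a&quot;, :sentiment :informative} {:text &quot;b&quot;, :sentiment :informative}]</code></pre>
<p><code>pipeline</code> keeps the output in the same order as the input, so the parallelism stays invisible to whatever consumes <code>out</code>.</p>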
<p>Today you can see the result here:</p>
<div class="wrap" style="height: 630px; width: 100%;">
<iframe src="https://dev.pulse.vigiglo.be/#/vgteam/TV_Shows/dashboard" style="width:200%; border:solid 2px #DDD; padding: none; margin: 20px 0; height: 1200px; -ms-zoom:0.5; -moz-transform: scale(0.5); -moz-transform-origin: 0 0; -o-transform: scale(0.5); -o-transform-origin: 0 0; -webkit-transform: scale(0.5); -webkit-transform-origin: 0 0">
</iframe>
</div>
<h2 id="long-live-the-new-flesh">Long live the new flesh</h2>
<figure>
<img src="img/videodrome.jpg" alt="Long Live the new Flesh" /><figcaption>Long Live the new Flesh</figcaption>
</figure>
<p>A new mindset comes with its own difficulties. As with everything new, there is a period of adaptation. Typically, the most difficult part was dealing with things like inverting an associative array.</p>
<p>In JavaScript one would write:</p>
<pre class="sourceCode javascript"><code class="sourceCode javascript"><span class="kw">var</span> foo = {}, v = {};
foo[<span class="st">&quot;a&quot;</span>]=<span class="st">&quot;value-for-a&quot;</span>;
foo[<span class="st">&quot;b&quot;</span>]=<span class="st">&quot;value-for-b&quot;</span>;
foo[<span class="st">&quot;c&quot;</span>]=<span class="st">&quot;value-for-c&quot;</span>;
<span class="co">// invert the map: values become keys</span>
<span class="kw">for</span> (<span class="kw">var</span> k <span class="kw">in</span> foo) {v[foo[k]]=k;}</code></pre>
<p>Or do things like:</p>
<pre class="sourceCode javascript"><code class="sourceCode javascript"><span class="kw">var</span> foo = [[<span class="dv">1</span>,<span class="dv">2</span>,<span class="dv">3</span>],[<span class="dv">4</span>,<span class="dv">5</span>,<span class="dv">6</span>]];
<span class="kw">var</span> tmp = <span class="dv">0</span>;
<span class="kw">for</span> (<span class="kw">var</span> i = <span class="dv">0</span>; i &lt; <span class="ot">foo</span>.<span class="fu">length</span>; i++) {
  <span class="kw">for</span> (<span class="kw">var</span> j = <span class="dv">0</span>; j &lt; foo[i].<span class="fu">length</span>; j++) {
    tmp += foo[i][j] + <span class="dv">2</span>;
  }
}
<span class="co">// tmp is now 33</span></code></pre>
<p>Now that I am used to <code>reduce</code> and <code>filter</code>, this is like second nature. And the new solution is far better.</p>
<p>For example, the preceding example becomes:</p>
<pre class="sourceCode clojure"><code class="sourceCode clojure">(<span class="kw">def</span><span class="fu"> foo </span>[[<span class="dv">1</span> <span class="dv">2</span> <span class="dv">3</span>] [<span class="dv">4</span> <span class="dv">5</span> <span class="dv">6</span>]])
<span class="co">;; reducing function: adds the element plus 2 to the accumulator</span>
(<span class="kw">defn</span><span class="fu"> plus2 </span>[acc x] (<span class="kw">+</span> acc x <span class="dv">2</span>))
(<span class="kw">defn</span><span class="fu"> sum </span>[l] (<span class="kw">reduce</span> <span class="kw">+</span> <span class="dv">0</span> l))
(sum (<span class="kw">map</span> (<span class="kw">fn</span> [l] (<span class="kw">reduce</span> plus2 <span class="dv">0</span> l)) foo))
<span class="co">;; or</span>
(<span class="kw">-&gt;&gt;</span> foo
     (<span class="kw">map</span> #(<span class="kw">reduce</span> plus2 <span class="dv">0</span> %))
     (sum))
<span class="co">;; both forms evaluate to 33</span></code></pre>
<p>The code is more modular, easier to read and to modify.</p>
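<p>The key/value inversion from the first JavaScript snippet can be sketched in the same style. <code>clojure.set/map-invert</code> is part of the standard library; the <code>invert</code> helper is only there to show the same idea spelled out with <code>reduce</code>.</p>
<pre class="sourceCode clojure"><code class="sourceCode clojure">(require '[clojure.set :as set])

(def foo {&quot;a&quot; &quot;value-for-a&quot;
          &quot;b&quot; &quot;value-for-b&quot;
          &quot;c&quot; &quot;value-for-c&quot;})

;; standard library version
(set/map-invert foo)
;; =&gt; {&quot;value-for-a&quot; &quot;a&quot;, &quot;value-for-b&quot; &quot;b&quot;, &quot;value-for-c&quot; &quot;c&quot;}

;; hand-written version (hypothetical helper), one assoc per entry
(defn invert [m]
  (reduce (fn [acc [k v]] (assoc acc v k)) {} m))

(= (invert foo) (set/map-invert foo))
;; =&gt; true</code></pre>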
<p>Not everything was pleasant, though:</p>
<ul>
<li>Java null pointer exceptions!</li>
<li>Unreadable stacktraces</li>
</ul>
<p>So what were the immediate wins?</p>
<h3 id="deep-access">Deep access</h3>
<p>For the brave and true there is the Haskell lens library. But for Clojurists, the basic access functions should be good enough.</p>
<p>Let's compare JavaScript with Clojure:</p>
<pre class="sourceCode javascript"><code class="sourceCode javascript"><span class="kw">var</span> foo = {<span class="st">&quot;db&quot;</span>: [{<span class="st">&quot;name&quot;</span>:<span class="st">&quot;John Doe&quot;</span>,<span class="st">&quot;age&quot;</span>:<span class="dv">30</span>},{<span class="st">&quot;name&quot;</span>:<span class="st">&quot;Rich&quot;</span>,<span class="st">&quot;age&quot;</span>:<span class="dv">40</span>},{<span class="st">&quot;age&quot;</span>:<span class="dv">20</span>}]
  <span class="co">// other stuff , ....</span>
};
<span class="kw">var</span> val = <span class="kw">function</span>() {
  <span class="kw">var</span> x = foo[<span class="st">&quot;db&quot;</span>];
  <span class="kw">if</span> (x) {
    <span class="kw">var</span> y = x[<span class="dv">1</span>];
    <span class="kw">if</span> (y) {
      <span class="kw">return</span> <span class="ot">y</span>.<span class="fu">age</span>;
    } <span class="kw">else</span> <span class="kw">return</span> <span class="kw">null</span>;
  } <span class="kw">else</span> <span class="kw">return</span> <span class="kw">null</span>;
}();</code></pre>
<p>Yes, you have to manually check at each level whether the value is null or not. Without this manual check, your code is going to crash at runtime!</p>
<p>Now let's compare the situation with Clojure:</p>
<pre class="sourceCode clojure"><code class="sourceCode clojure">(<span class="kw">-&gt;</span> foo <span class="kw">:db</span> <span class="kw">second</span> <span class="kw">:age</span>)</code></pre>
<p>Yes, that's all. If anything is missing along the way, the result is simply <code>nil</code>.</p>
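<p>As a small sketch of how this behaves, here is the same data as a Clojure map (with keyword keys, which is an assumption about how <code>foo</code> would be represented). <code>get-in</code> is the standard function for the same job, and it also accepts an explicit default value.</p>
<pre class="sourceCode clojure"><code class="sourceCode clojure">(def foo {:db [{:name &quot;John Doe&quot; :age 30}
               {:name &quot;Rich&quot; :age 40}
               {:age 20}]})

(-&gt; foo :db second :age)            ;; =&gt; 40
(get-in foo [:db 1 :age])           ;; =&gt; 40

;; missing keys or indexes never throw, they give nil
(get-in foo [:db 2 :name])          ;; =&gt; nil
(-&gt; foo :missing second :age)       ;; =&gt; nil

;; an explicit default instead of nil
(get-in foo [:db 7 :age] :no-age)   ;; =&gt; :no-age</code></pre>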
<h3 id="merges">Merges</h3>
<p><strong>Seriously!!!!!</strong></p>
<pre class="sourceCode clojure"><code class="sourceCode clojure">(<span class="kw">into</span> map1 map2)</code></pre>
<p>I don't even want to compare with JavaScript, as it would be ridiculous. Mainly, you can't<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a>, or you need jQuery and it's ugly.</p>
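<p>For illustration, here is what merging looks like on two made-up maps. <code>merge</code> and <code>merge-with</code> are the standard functions usually used for this; <code>into</code> gives the same result for plain maps.</p>
<pre class="sourceCode clojure"><code class="sourceCode clojure">(def map1 {:a 1 :b 2})
(def map2 {:b 3 :c 4})

;; on a key conflict, the value from the rightmost map wins
(into map1 map2)           ;; =&gt; {:a 1, :b 3, :c 4}
(merge map1 map2)          ;; =&gt; {:a 1, :b 3, :c 4}

;; merge-with combines conflicting values with a function
(merge-with + map1 map2)   ;; =&gt; {:a 1, :b 5, :c 4}

;; merging with nil is harmless
(merge map1 nil)           ;; =&gt; {:a 1, :b 2}</code></pre>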
<h3 id="syntax">Syntax</h3>
<p>Learning Clojure syntax takes about 3 minutes. It is clean: no <em>fucking</em> commas, semicolons, etc…</p>
<ul>
<li>Arrays: <code>[a b c]</code>; in JavaScript, <code>[a,b,c]</code> (why the commas?)</li>
<li>Hash maps (associative arrays): <code>{:key1 value1 :key2 value2}</code>; in JavaScript you need to define an Object, and keys are generally strings: <code>{&quot;key1&quot;:value1, &quot;key2&quot;:value2}</code>. Multiline object declarations always end up with the wrong number of commas.</li>
<li>Sets: <code>#{:a :b :c}</code>; in JavaScript (before ES2015) sets don't even exist, so you have to simulate them with Objects: <code>{&quot;a&quot;:true, &quot;b&quot;:true, &quot;c&quot;:true}</code></li>
<li>Inline function declarations: compare <code>#(* % 2)</code> in Clojure with <code>function(x){return x * 2;}</code> in JavaScript</li>
</ul>
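<p>Here is a short sketch of these literals in use. The names <code>v</code>, <code>m</code> and <code>s</code> are just for the example.</p>
<pre class="sourceCode clojure"><code class="sourceCode clojure">(def v [1 2 3])                        ; vector
(def m {:key1 &quot;value1&quot; :key2 2})       ; hash map with keyword keys
(def s #{:a :b :c})                    ; set

;; commas are allowed, but they are just whitespace
(= [1 2 3] [1, 2, 3])    ;; =&gt; true

;; inline function
(map #(* % 2) v)         ;; =&gt; (2 4 6)

;; keywords work as lookup functions on maps
(:key1 m)                ;; =&gt; &quot;value1&quot;

;; set membership, and no duplicates
(contains? s :a)         ;; =&gt; true
(s :z)                   ;; =&gt; nil
(count (conj s :a))      ;; =&gt; 3</code></pre>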
<p>TODO: someone should help me add a few tons more of these, with better examples.</p>
<section class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Just a great thank you to FPComplete and in particular Michael Snoyman!<a href="#fnref1"></a></p></li>
<li id="fn2"><p><a href="http://stackoverflow.com/questions/171251/how-can-i-merge-properties-of-two-javascript-objects-dynamically" class="uri">http://stackoverflow.com/questions/171251/how-can-i-merge-properties-of-two-javascript-objects-dynamically</a><a href="#fnref2"></a></p></li>
</ol>
</section>
<hr/>
<div id="footer">
<a href="http://about.me/gbuisson">G</a> -
<a href="http://sprunck.com">S</a> -
<a href="http://yannesposito.com">Y</a>
</div>
</body>
</html>