=== Title ===
Real-Time Social Media Analytics with Clojure
Hi, my name is Guillaume Buisson, and this is Yann Esposito. We are here to talk about social media analytics with Clojure.
By real time we mean that when you tweet, your data is ingested, processed, and aggregated by our systems with minimal latency, and can be displayed right away to our customers.
By analytics we mean gender detection, sentiment analysis, entity detection, and n-gram tokenization.
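As a taste of that last step, here is a minimal sketch of n-gram tokenization in Clojure. The whitespace tokenizer is deliberately naive and illustrative, not our production code:

```clojure
(require '[clojure.string :as str])

;; Naive tokenizer: lower-case the text and split on whitespace.
(defn tokenize [text]
  (str/split (str/lower-case text) #"\s+"))

;; Slide a window of n tokens over the text to extract n-grams.
(defn ngrams [n text]
  (map #(str/join " " %)
       (partition n 1 (tokenize text))))

(ngrams 2 "Real time social media analytics")
;; => ("real time" "time social" "social media" "media analytics")
```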
To display our data we created what we call a "pulse": a single-page application providing a dashboard that shows all our indicators in real time.
Here you can see it displaying real-time analytics about Twitter's current worldwide trending topics.
- on the top left, you can see the pulse chart, which indicates the current volume of messages down to one-second granularity
- on the top right, the current mood and gender distribution charts, the unique author count, and the retweet and engagement ratios
- below them, a word cloud of n-grams, as well as a table of top subjects by volume
- and right at the bottom, a timeline showing the evolution of all the main indicators above
You will see a sample of it right at the bottom of the screen; during our talk, tweet with #clojurex or #clojureX2015 to see the indicators evolve.
=== The Big Picture ===
To produce these analytics, data passes through several phases (a minimal Clojure sketch follows this list):
1 - Twitter provides us a stream of tweets; Facebook POSTs its data to us as XML payloads
2 - That's where we come into play: the first element of our architecture ingests these streams as fast as possible and hands the messages over to a data store
3 - A stream-processing application reads from this data store, processes the messages, adds enrichments, and hands them off to another data store
4 - An aggregator reads the messages, aggregates the data, and transforms it into time series
5 - An API server serves these time series and enables our customers to query them
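To make the flow concrete, here is a minimal, hypothetical sketch of these stages wired together with core.async channels. Every function and name in it is illustrative, not our actual code:

```clojure
(ns pulse.pipeline
  (:require [clojure.core.async :as a]))

;; Stage 3 stand-in: real enrichment adds gender, sentiment, entities...
(defn enrich [msg]
  (assoc msg :sentiment :neutral))

;; Stage 4 stand-in: count messages per second to build a time series.
(defn aggregate [acc {:keys [timestamp]}]
  (update acc timestamp (fnil inc 0)))

(defn run-pipeline [tweets]
  (let [raw      (a/chan 1024)   ; stage 2: ingested messages
        enriched (a/chan 1024)]  ; stage 3: enriched messages
    ;; enrich in parallel on all available cores
    (a/pipeline (.availableProcessors (Runtime/getRuntime))
                enriched (map enrich) raw)
    (a/onto-chan raw tweets)     ; stage 1: the incoming stream
    ;; stage 4: fold the enriched stream into a time series;
    ;; stage 5 would expose the result through an API server
    (a/<!! (a/reduce aggregate {} enriched))))

(run-pipeline [{:timestamp 0 :text "hello #clojurex"}
               {:timestamp 0 :text "realtime!"}
               {:timestamp 1 :text "more tweets"}])
;; => {0 2, 1 1}
```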
=== Legacy ===
Times were not always so comfortable at Vigiglobe. When we started, we inherited a legacy prototype coded by external contractors; it was as follows:
- The ingestor was coded in PHP
- The stream-processing application was a mixture of PHP, Perl, and Makefiles running in batches with the help of a cron task
- Our datastore was MongoDB
- Our time-series generator was a Node.js application doing intensive incremental map-reduce jobs on MongoDB collections
- The API server was also a Node.js application, coded with the ActionHero framework
The goal was clear: we had to evolve this prototype into a full-fledged production product. However, we had many issues with it:
1 - Performance
PHP being single-threaded, the ingestor could not use all of our machine's cores and couldn't handle high volumes of messages; what's more, the ingestor was doing too much work at its level.
MongoDB, with its database-level locking, was a real headache: we couldn't ingest high volumes and serve our dashboards to high-traffic websites at the same time.
Our ActionHero API server used Redis queues to dispatch queries across workers and servers, and would eventually fail under heavy load.
2 - Stability, correctness, and data safety
As Mongo locked, our incremental map-reduce jobs were failing, and we started to suffer massive data losses.
In PHP we had a lot of trouble safely accessing deeply nested data structures; jobs would fail on slight changes to the data shape.
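For contrast, this is what safe nested access looks like in Clojure: get-in simply yields nil, or a default you choose, when a key is missing. The tweet shape below is illustrative:

```clojure
;; Illustrative tweet shape.
(def tweet {:user {:profile {:location "Lyon"}}})

(get-in tweet [:user :profile :location])        ;; => "Lyon"
(get-in tweet [:user :settings :lang] "unknown") ;; => "unknown", no error
```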
Maintaining the API was a real burden, as we also had to handle query validation and documentation by hand.
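One way a Clojure stack can relieve this is with declarative schemas, where validation and documentation come from the same declaration. A hypothetical sketch, assuming Prismatic's schema library is on the classpath (this section does not name the tool we actually chose):

```clojure
(require '[schema.core :as s])

;; Hypothetical query schema: one declaration both validates incoming
;; queries and documents the expected shape of the API.
(s/defschema Query
  {:keyword s/Str
   :from    s/Inst
   :to      s/Inst})

(s/validate Query {:keyword "clojure"
                   :from    #inst "2015-11-12T00:00:00Z"
                   :to      #inst "2015-11-13T00:00:00Z"})
;; an invalid query throws with a precise description of what failed
```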
3 - Productivity
With such a polyglot architecture, progress was slow: we had to switch mindsets with every language, we couldn't share code or libraries across projects, and we ended up coding the same thing multiple times with different results.
These languages were clearly not suited to our goals, and there were few community projects or libraries to help us.