=== Title ===
Real-Time Social Media Analytics with Clojure
Hi, my name is Guillaume Buisson, and this is Yann Esposito. We are here to talk about social media analytics with Clojure.
By real time we mean that when you tweet, your data is ingested, processed, and aggregated by our systems with minimal latency, and can be displayed right away to our customers.
By analytics we mean gender detection, sentiment analysis, entity detection, and n-gram tokenization.
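As a taste of that last step, here is a minimal sketch of n-gram tokenization in Clojure. The whitespace tokenizer is deliberately naive and illustrative, not our production code:

```clojure
(require '[clojure.string :as str])

;; Naive tokenizer: lower-case the text and split on whitespace.
(defn tokenize [text]
  (str/split (str/lower-case text) #"\s+"))

;; Slide a window of n tokens over the text to extract n-grams.
(defn ngrams [n text]
  (map #(str/join " " %)
       (partition n 1 (tokenize text))))

(ngrams 2 "Real time social media analytics")
;; => ("real time" "time social" "social media" "media analytics")
```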
To display our data we created what we call a "pulse": a single-page application providing a dashboard that shows all our indicators in real time.
Here you can see it displaying real-time analytics about Twitter's current worldwide trending topics.
- on the top left, you can see the pulse chart, which indicates the current volume of messages down to one-second granularity
- on the top right, the current mood and gender distribution charts, the unique author count, and the retweet and engagement ratios
- below them, a word cloud of n-grams, as well as a table of top subjects by volume
- and right at the bottom, a timeline showing the evolution of all the main indicators above
You will see a sample of it right at the bottom of the screen; during our talk, tweet with #clojurex or #clojureX2015 to see the indicators evolve.
=== The Big Picture ===
To produce these analytics, data passes through several phases (a minimal Clojure sketch follows this list):
1 - Twitter provides us a stream of tweets; Facebook POSTs its data to us as XML payloads
2 - That's where we come into play: the first element of our architecture ingests these streams as fast as possible and hands the messages over to a data store
3 - A stream-processing application reads from this data store, processes the messages, adds enrichments, and hands them off to another data store
4 - An aggregator reads the messages, aggregates the data, and transforms it into time series
5 - An API server serves these time series and enables our customers to query them
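To make the flow concrete, here is a minimal, hypothetical sketch of these stages wired together with core.async channels. Every function and name in it is illustrative, not our actual code:

```clojure
(ns pulse.pipeline
  (:require [clojure.core.async :as a]))

;; Stage 3 stand-in: real enrichment adds gender, sentiment, entities...
(defn enrich [msg]
  (assoc msg :sentiment :neutral))

;; Stage 4 stand-in: count messages per second to build a time series.
(defn aggregate [acc {:keys [timestamp]}]
  (update acc timestamp (fnil inc 0)))

(defn run-pipeline [tweets]
  (let [raw      (a/chan 1024)   ; stage 2: ingested messages
        enriched (a/chan 1024)]  ; stage 3: enriched messages
    ;; enrich in parallel on all available cores
    (a/pipeline (.availableProcessors (Runtime/getRuntime))
                enriched (map enrich) raw)
    (a/onto-chan raw tweets)     ; stage 1: the incoming stream
    ;; stage 4: fold the enriched stream into a time series;
    ;; stage 5 would expose the result through an API server
    (a/<!! (a/reduce aggregate {} enriched))))

(run-pipeline [{:timestamp 0 :text "hello #clojurex"}
               {:timestamp 0 :text "realtime!"}
               {:timestamp 1 :text "more tweets"}])
;; => {0 2, 1 1}
```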
=== Legacy ===
Times were not always so comfortable at Vigiglobe. When we started, we inherited a legacy prototype coded by external contractors; it was as follows:
- The ingestor was coded in PHP
- The stream-processing application was a mixture of PHP, Perl, and Makefiles running in batches with the help of a cron task
- Our datastore was MongoDB
- Our time-series generator was a Node.js application doing intensive incremental map-reduce jobs on MongoDB collections
- The API server was also a Node.js application, coded with the ActionHero framework
The goal was clear: we had to evolve this prototype into a full-fledged production product. However, we had many issues with it:
1 - Performance
PHP being single-threaded, the ingestor could not use all of our machine's cores and couldn't handle high volumes of messages; what's more, the ingestor was doing too much work at its level.
MongoDB, with its database-level locking, was a real headache: we couldn't ingest high volumes and serve our dashboards to high-traffic websites at the same time.
Our ActionHero API server used Redis queues to dispatch queries across workers and servers, and would eventually fail under heavy load.
2 - Stability, correctness, and data safety
As Mongo locked, our incremental map-reduce jobs were failing, and we started to suffer massive data losses.
In PHP we had a lot of trouble safely accessing deeply nested data structures; jobs would fail on slight changes to the data shape.
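For contrast, this is what safe nested access looks like in Clojure: get-in simply yields nil, or a default you choose, when a key is missing. The tweet shape below is illustrative:

```clojure
;; Illustrative tweet shape.
(def tweet {:user {:profile {:location "Lyon"}}})

(get-in tweet [:user :profile :location])        ;; => "Lyon"
(get-in tweet [:user :settings :lang] "unknown") ;; => "unknown", no error
```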
Maintaining the API was a real burden, as we also had to handle query validation and documentation by hand.
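One way a Clojure stack can relieve this is with declarative schemas, where validation and documentation come from the same declaration. A hypothetical sketch, assuming Prismatic's schema library is on the classpath (this section does not name the tool we actually chose):

```clojure
(require '[schema.core :as s])

;; Hypothetical query schema: one declaration both validates incoming
;; queries and documents the expected shape of the API.
(s/defschema Query
  {:keyword s/Str
   :from    s/Inst
   :to      s/Inst})

(s/validate Query {:keyword "clojure"
                   :from    #inst "2015-11-12T00:00:00Z"
                   :to      #inst "2015-11-13T00:00:00Z"})
;; an invalid query throws with a precise description of what failed
```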
3 - Productivity
With such a polyglot architecture, progress was slow: we had to switch mindsets with every language, we couldn't share code or libraries across projects, and we ended up coding the same thing multiple times with different results.
These languages were clearly not suited to our goals, and there were few community projects or libraries to help us.