mkdocs/druid/druid.reveal.html
Yann Esposito (Yogsototh) 585d84325c
latest commit
2016-04-06 00:45:18 +02:00

616 lines
19 KiB
HTML
Raw Permalink Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Druid for real-time analysis</title>
<meta name="description" content="Druid for real-time analysis">
<meta name="author" content="Yann Esposito" />
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<link rel="stylesheet" href="../.reveal.js-3.2.0/css/reveal.css">
<link rel="stylesheet" href="../.reveal.js-3.2.0/css/theme/solarized-dark.css" id="theme">
<!-- For syntax highlighting -->
<link rel="stylesheet" href="../.reveal.js-3.2.0/lib/css/solarized-dark.css">
<!-- If the query includes 'print-pdf', use the PDF print sheet -->
<script>
document.write( '<link rel="stylesheet" href="../.reveal.js-3.2.0/css/print/' +
( window.location.search.match( /print-pdf/gi ) ? 'pdf' : 'paper' ) +
'.css" type="text/css" media="print">' );
</script>
<!--[if lt IE 9]>
<script src="../.reveal.js-3.2.0/lib/js/html5shiv.js"></script>
<![endif]-->
</head>
<body>
<div class="reveal">
<!-- Any section element inside of this container is displayed as a slide -->
<div class="slides">
<section>
<h1>Druid for real-time analysis</h1>
<h3>Yann Esposito</h3>
<p>
<h4>7 Avril 2016</h4>
</p>
</section>
<section id="druid-the-sales-pitch" class="level1">
<h1>Druid the Sales Pitch</h1>
<ul>
<li>Sub-Second Queries</li>
<li>Real-time Streams</li>
<li>Scalable to Petabytes</li>
<li>Deploy Anywhere</li>
<li>Vibrant Community (Open Source)</li>
</ul>
<aside class="notes">
<ul>
<li>Ideal for powering user-facing analytic applications</li>
<li>Deploy anywhere: cloud, on-premise, integrate with Haddop, Spark, Kafka, Storm, Samza</li>
</ul>
</aside>
</section>
<section id="intro" class="level1">
<h1>Intro</h1>
<section id="experience" class="level2">
<h2>Experience</h2>
<ul>
<li>Real Time Social Media Analytics</li>
</ul>
</section>
<section id="real-time" class="level2">
<h2>Real Time?</h2>
<ul>
<li>Ingestion Latency: seconds</li>
<li>Query Latency: seconds</li>
</ul>
</section>
<section id="demand" class="level2">
<h2>Demand</h2>
<ul>
<li>Twitter: <code>20k msg/s</code>, <code>1msg = 10ko</code> during 24h</li>
<li>Facebook public: 1000 to 2000 msg/s continuously</li>
<li>Low Latency</li>
</ul>
</section>
<section id="reality" class="level2">
<h2>Reality</h2>
<ul>
<li>Twitter: 400 msg/s continuously, burst to 1500</li>
<li>Facebook: 1000 to 2000 msg/s</li>
</ul>
</section>
</section>
<section id="origin-php" class="level1">
<h1>Origin (PHP)</h1>
<p><img src="img/bad_php.jpg" alt="OH NOES PHP!" /> </p>
</section>
<section id="st-refactoring-node.js" class="level1">
<h1>1st Refactoring (Node.js)</h1>
<ul>
<li>Ingestion still in PHP</li>
<li>Node.js, Perl, Java &amp; R for sentiment analysis</li>
<li>MongoDB</li>
<li>Manually made time series (Incremental Map/Reduce)</li>
<li>Manually coded HyperLogLog in js</li>
</ul>
</section>
<section id="return-of-experience" class="level1">
<h1>Return of Experience</h1>
<p><img src="img/MongoDB.png" alt="MongoDB the destroyer" /> </p>
</section>
<section id="return-of-experience-1" class="level1">
<h1>Return of Experience</h1>
<ul>
<li>Ingestion still in PHP (600 msg/s max)</li>
<li>Node.js, Perl, Java (10 msg/s max)</li>
</ul>
<figure>
<img src="img/bored.gif" alt="Too Slow, Bored" /><figcaption>Too Slow, Bored</figcaption>
</figure>
</section>
<section id="nd-refactoring" class="level1">
<h1>2nd Refactoring</h1>
<ul>
<li>Haskell</li>
<li>Clojure / Clojurescript</li>
<li>Kafka / Zookeeper</li>
<li>Mesos / Marathon</li>
<li>Elasticsearch</li>
<li><strong>Druid</strong></li>
</ul>
</section>
<section id="nd-refactoring-ftw" class="level1">
<h1>2nd Refactoring (FTW!)</h1>
<p><img src="img/talking.jpg" alt="Now we&#39;re talking" /> </p>
</section>
<section id="nd-refactoring-return-of-experience" class="level1">
<h1>2nd Refactoring return of experience</h1>
<ul>
<li>No limit, everything is scalable</li>
<li>High availability</li>
<li>Low latency: Ingestion &amp; User faced querying</li>
<li>Cheap if done correctly</li>
</ul>
<p><strong>Thanks Druid!</strong></p>
</section>
<section id="demo" class="level1">
<h1>Demo</h1>
<ul>
<li>Low Latency High Volume of Data Analysis</li>
<li>Typically <code>pulse</code></li>
</ul>
<p><a href="http://pulse.vigiglo.be/#/vigiglobe/Earthquake/dashboard" target="_blank"> DEMO Time </a></p>
</section>
<section id="pre-considerations" class="level1">
<h1>Pre Considerations</h1>
<section id="discovered-vs-invented" class="level2">
<h2>Discovered vs Invented</h2>
<p>Try to conceptualize a s.t.</p>
<ul>
<li>Ingest Events</li>
<li>Real-Time Queries</li>
<li>Scalable</li>
<li>Highly Available</li>
</ul>
<p>Analytics: timeseries, alerting system, top N, etc...</p>
</section>
<section id="in-the-end" class="level2">
<h2>In the End</h2>
<p>Druid concepts are always emerging naturally</p>
</section>
</section>
<section id="druid" class="level1">
<h1>Druid</h1>
<section id="who" class="level2">
<h2>Who?</h2>
<p>Metamarkets</p>
<p><a href="http://druid.io/druid-powered.html"
target="_blank">Powered by Druid</a></p>
<ul>
<li>Alibaba, Cisco, Criteo, eBay, Hulu, Netflix, Paypal...</li>
</ul>
</section>
<section id="goal" class="level2">
<h2>Goal</h2>
<blockquote>
<p>Druid is an open source store designed for real-time exploratory analytics on large data sets.</p>
</blockquote>
<blockquote>
<p>hosted dashboard that would allow users to arbitrarily explore and visualize event streams.</p>
</blockquote>
</section>
<section id="concepts" class="level2">
<h2>Concepts</h2>
<ul>
<li>Column-oriented storage layout</li>
<li>distributed, shared-nothing architecture</li>
<li>advanced indexing structure</li>
</ul>
</section>
<section id="key-features" class="level2">
<h2>Key Features</h2>
<ul>
<li>Sub-second OLAP Queries</li>
<li>Real-time Streaming Ingestion</li>
<li>Power Analytic Applications</li>
<li>Cost Effective</li>
<li>High Available</li>
<li>Scalable</li>
</ul>
</section>
<section id="right-for-me" class="level2">
<h2>Right for me?</h2>
<ul>
<li>require fast aggregations</li>
<li>exploratory analytics</li>
<li>analysis in real-time</li>
<li>lots of data (trillions of events, petabytes of data)</li>
<li>no single point of failure</li>
</ul>
</section>
</section>
<section id="high-level-architecture" class="level1">
<h1>High Level Architecture</h1>
<section id="inspiration" class="level2">
<h2>Inspiration</h2>
<ul>
<li>Google's <a href="http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36632.pdf">BigQuery/Dremel</a></li>
<li>Google's <a href="http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf">PowerDrill</a></li>
</ul>
</section>
<section id="index-immutability" class="level2">
<h2>Index / Immutability</h2>
<p>Druid indexes data to create mostly immutable views.</p>
</section>
<section id="storage" class="level2">
<h2>Storage</h2>
<p>Store data in custom column format highly optimized for aggregation &amp; filter.</p>
</section>
<section id="specialized-nodes" class="level2">
<h2>Specialized Nodes</h2>
<ul>
<li>A Druid cluster is composed of various type of nodes</li>
<li>Each designed to do a small set of things very well</li>
<li>Nodes don't need to be deployed on individual hardware</li>
<li>Many node types can be colocated in production</li>
</ul>
</section>
</section>
<section id="druid-vs-x" class="level1">
<h1>Druid vs X</h1>
<section id="elasticsearch" class="level2">
<h2>Elasticsearch</h2>
<ul>
<li>resource requirement much higher for ingestion &amp; aggregation</li>
<li>No data summarization (100x in real world data)</li>
</ul>
</section>
<section id="keyvalue-stores-hbasecassandraopentsdb" class="level2">
<h2>Key/Value Stores (HBase/Cassandra/OpenTSDB)</h2>
<ul>
<li>Must Pre-compute Result
<ul>
<li>Exponential storage</li>
<li>Hours of pre-processing time</li>
</ul></li>
<li>Use the dimensions as key (like in OpenTSDB)
<ul>
<li>No filter index other than range</li>
<li>Hard for complex predicates</li>
</ul></li>
</ul>
</section>
<section id="spark" class="level2">
<h2>Spark</h2>
<ul>
<li>Druid can be used to accelerate OLAP queries in Spark</li>
<li>Druid focuses on the latencies to ingest and serve queries</li>
<li>Too long for end user to arbitrarily explore data</li>
</ul>
</section>
<section id="sql-on-hadoop-impaladrillspark-sqlpresto" class="level2">
<h2>SQL-on-Hadoop (Impala/Drill/Spark SQL/Presto)</h2>
<ul>
<li>Queries: more data transfer between nodes</li>
<li>Data Ingestion: bottleneck by backing store</li>
<li>Query Flexibility: more flexible (full joins)</li>
</ul>
</section>
</section>
<section id="data" class="level1">
<h1>Data</h1>
<section id="concepts-1" class="level2">
<h2>Concepts</h2>
<ul>
<li><strong>Timestamp column</strong>: query centered on time axis</li>
<li><strong>Dimension columns</strong>: strings (used to filter or to group)</li>
<li><strong>Metric columns</strong>: used for aggregations (count, sum, mean, etc...)</li>
</ul>
</section>
<section id="indexing" class="level2">
<h2>Indexing</h2>
<ul>
<li>Immutable snapshots of data</li>
<li>data structure highly optimized for analytic queries</li>
<li>Each column is stored separately</li>
<li>Indexes data on a per shard (segment) level</li>
</ul>
</section>
<section id="loading" class="level2">
<h2>Loading</h2>
<ul>
<li>Real-Time</li>
<li>Batch</li>
</ul>
</section>
<section id="querying" class="level2">
<h2>Querying</h2>
<ul>
<li>JSON over HTTP</li>
<li>Single Table Operations, no joins.</li>
</ul>
</section>
<section id="segments" class="level2">
<h2>Segments</h2>
<ul>
<li>Per time interval
<ul>
<li>skip segments when querying</li>
</ul></li>
<li>Immutable
<ul>
<li>Cache friendly</li>
<li>No locking</li>
</ul></li>
<li>Versioned
<ul>
<li>No locking</li>
<li>Read-write concurrency</li>
</ul></li>
</ul>
</section>
</section>
<section id="roll-up" class="level1">
<h1>Roll-up</h1>
<section id="example" class="level2">
<h2>Example</h2>
<pre><code>timestamp page ... added deleted
2011-01-01T00:01:35Z Cthulhu 10 65
2011-01-01T00:03:63Z Cthulhu 15 62
2011-01-01T01:04:51Z Cthulhu 32 45
2011-01-01T01:01:00Z Azatoth 17 87
2011-01-01T01:02:00Z Azatoth 43 99
2011-01-01T02:03:00Z Azatoth 12 53</code></pre>
<pre><code>timestamp page ... nb added deleted
2011-01-01T00:00:00Z Cthulhu 2 25 127
2011-01-01T01:00:00Z Cthulhu 1 32 45
2011-01-01T01:00:00Z Azatoth 2 60 186
2011-01-01T02:00:00Z Azatoth 1 12 53</code></pre>
</section>
<section id="as-sql" class="level2">
<h2>as SQL</h2>
<pre><code>GROUP BY timestamp, page, nb, added, deleted
:: nb = COUNT(1)
, added = SUM(added)
, deleted = SUM(deleted)</code></pre>
<p>In practice can dramatically reduce the size (up to x100)</p>
</section>
</section>
<section id="segments-1" class="level1">
<h1>Segments</h1>
<section id="sharding" class="level2">
<h2>Sharding</h2>
<p><small><code>sampleData_2011-01-01T01:00:00:00Z_2011-01-01T02:00:00:00Z_v1_0</code></small></p>
<pre><code>timestamp page ... nb added deleted
2011-01-01T01:00:00Z Cthulhu 1 20 45
2011-01-01T01:00:00Z Azatoth 1 30 106</code></pre>
<p><small><code>sampleData_2011-01-01T01:00:00:00Z_2011-01-01T02:00:00:00Z_v1_0</code></small></p>
<pre><code>timestamp page ... nb added deleted
2011-01-01T01:00:00Z Cthulhu 1 12 45
2011-01-01T01:00:00Z Azatoth 2 30 80</code></pre>
</section>
<section id="core-data-structure" class="level2">
<h2>Core Data Structure</h2>
<p><img src="img/druid-column-types.png" alt="Segment" /> </p>
<ul>
<li>dictionary</li>
<li>a bitmap for each value</li>
<li>a list of the columns values encoded using the dictionary</li>
</ul>
</section>
<section id="example-1" class="level2">
<h2>Example</h2>
<pre><code>dictionary: { &quot;Cthulhu&quot;: 0
, &quot;Azatoth&quot;: 1 }
column data: [0, 0, 1, 1]
bitmaps (one for each value of the column):
value=&quot;Cthulhu&quot;: [1,1,0,0]
value=&quot;Azatoth&quot;: [0,0,1,1]</code></pre>
</section>
<section id="example-multiple-matches" class="level2">
<h2>Example (multiple matches)</h2>
<pre><code>dictionary: { &quot;Cthulhu&quot;: 0
, &quot;Azatoth&quot;: 1 }
column data: [0, [0,1], 1, 1]
bitmaps (one for each value of the column):
value=&quot;Cthulhu&quot;: [1,1,0,0]
value=&quot;Azatoth&quot;: [0,1,1,1]</code></pre>
</section>
<section id="real-time-ingestion" class="level2">
<h2>Real-time ingestion</h2>
<ul>
<li>Via Real-Time Node and Firehose
<ul>
<li>No redundancy or HA, thus not recommended</li>
</ul></li>
<li>Via Indexing Service and Tranquility API
<ul>
<li>Core API</li>
<li>Integration with Streaming Frameworks</li>
<li>HTTP Server</li>
<li><strong>Kafka Consumer</strong></li>
</ul></li>
</ul>
</section>
<section id="batch-ingestion" class="level2">
<h2>Batch Ingestion</h2>
<ul>
<li>File based (HDFS, S3, ...)</li>
</ul>
</section>
<section id="real-time-ingestion-1" class="level2">
<h2>Real-time Ingestion</h2>
<pre><code>Task 1: [ Interval ][ Window ]
Task 2: [ ]
-----------------------------------------------------&gt;
time</code></pre>
</section>
</section>
<section id="querying-1" class="level1">
<h1>Querying</h1>
<section id="query-types" class="level2">
<h2>Query types</h2>
<ul>
<li>Group by: group by multiple dimensions</li>
<li>Top N: like grouping by a single dimension</li>
<li>Timeseries: without grouping over dimensions</li>
<li>Search: Dimensions lookup</li>
<li>Time Boundary: Find available data timeframe</li>
<li>Metadata queries</li>
</ul>
</section>
<section id="examples" class="level2">
<h2>Example(s)</h2>
<pre><code>{&quot;queryType&quot;: &quot;groupBy&quot;,
&quot;dataSource&quot;: &quot;druidtest&quot;,
&quot;granularity&quot;: &quot;all&quot;,
&quot;dimensions&quot;: [],
&quot;aggregations&quot;: [
{&quot;type&quot;: &quot;count&quot;, &quot;name&quot;: &quot;rows&quot;},
{&quot;type&quot;: &quot;longSum&quot;, &quot;name&quot;: &quot;imps&quot;, &quot;fieldName&quot;: &quot;impressions&quot;},
{&quot;type&quot;: &quot;doubleSum&quot;, &quot;name&quot;: &quot;wp&quot;, &quot;fieldName&quot;: &quot;wp&quot;}
],
&quot;intervals&quot;: [&quot;2010-01-01T00:00/2020-01-01T00&quot;]}</code></pre>
</section>
<section id="result" class="level2">
<h2>Result</h2>
<pre><code>[ {
&quot;version&quot; : &quot;v1&quot;,
&quot;timestamp&quot; : &quot;2010-01-01T00:00:00.000Z&quot;,
&quot;event&quot; : {
&quot;imps&quot; : 5,
&quot;wp&quot; : 15000.0,
&quot;rows&quot; : 5
}
} ]</code></pre>
</section>
<section id="caching" class="level2">
<h2>Caching</h2>
<ul>
<li>Historical node level
<ul>
<li>By segment</li>
</ul></li>
<li>Broker Level
<ul>
<li>By segment and query</li>
<li><code>groupBy</code> is disabled on purpose!</li>
</ul></li>
<li>By default: local caching</li>
</ul>
</section>
</section>
<section id="druid-components" class="level1">
<h1>Druid Components</h1>
<section id="druid-1" class="level2">
<h2>Druid</h2>
<ul>
<li>Real-time Nodes</li>
<li>Historical Nodes</li>
<li>Broker Nodes</li>
<li>Coordinator</li>
<li>For indexing:
<ul>
<li>Overlord</li>
<li>Middle Manager</li>
</ul></li>
</ul>
</section>
<section id="also" class="level2">
<h2>Also</h2>
<ul>
<li>Deep Storage (S3, HDFS, ...)</li>
<li>Metadata Storage (SQL)</li>
<li>Load Balancer</li>
<li>Cache</li>
</ul>
</section>
<section id="coordinator" class="level2">
<h2>Coordinator</h2>
<ul>
<li>Real-time Nodes (pull data, index it)</li>
<li>Historical Nodes (keep old segments)</li>
<li>Broker Nodes (route queries to RT &amp; Hist. nodes, merge)</li>
<li>Coordinator (manage segemnts)</li>
<li>For indexing:
<ul>
<li>Overlord (distribute task to the middle manager)</li>
<li>Middle Manager (execute tasks via Peons)</li>
</ul></li>
</ul>
</section>
</section>
<section id="when-not-to-choose-druid" class="level1">
<h1>When <em>not</em> to choose Druid</h1>
<ul>
<li>Data is not time-series</li>
<li>Cardinality is <em>very</em> high</li>
<li>Number of dimensions is high</li>
<li>Setup cost must be avoided</li>
</ul>
</section>
<section id="graphite-metrics" class="level1">
<h1>Graphite (metrics)</h1>
<p><img src="img/graphite.png" alt="Graphite" />__</p>
<p><a href="http://graphite.wikidot.com">Graphite</a></p>
</section>
<section id="pivot-exploring-data" class="level1">
<h1>Pivot (exploring data)</h1>
<p><img src="img/pivot.gif" alt="Pivot" /> </p>
<p><a href="https://github.com/implydata/pivot">Pivot</a></p>
</section>
<section id="caravel" class="level1">
<h1>Caravel</h1>
<p><img src="img/caravel.png" alt="caravel" /> </p>
<p><a href="https://github.com/airbnb/caravel">Caravel</a></p>
</section>
<section id="conclusions" class="level1">
<h1>Conclusions</h1>
<section id="precompute-your-time-series" class="level2">
<h2>Precompute your time series?</h2>
<p><img src="img/wrong.jpg" alt="You&#39;re doing it wrong" /> </p>
</section>
<section id="dont-reinvent-it" class="level2">
<h2>Don't reinvent it</h2>
<ul>
<li>need a user facing API</li>
<li>need time series on many dimensions</li>
<li>need real-time</li>
<li>big volume of data</li>
</ul>
</section>
<section id="druid-way-is-the-right-way" class="level2">
<h2>Druid way is the right way!</h2>
<ol type="1">
<li>Push in kafka</li>
<li>Add the right dimensions</li>
<li>Push in druid</li>
<li>???</li>
<li>Profit!</li>
</ol>
</section>
</section>
</div>
<script src="../.reveal.js-3.2.0/lib/js/head.min.js"></script>
<script src="../.reveal.js-3.2.0/js/reveal.js"></script>
<script>
// Full list of configuration options available here:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
controls: true,
progress: true,
history: true,
center: false,
// available themes are in /css/theme
theme: Reveal.getQueryHash().theme || 'solarized-dark',
// default/cube/page/concave/zoom/linear/fade/none
transition: Reveal.getQueryHash().transition || 'linear',
// Optional libraries used to extend on reveal.js
dependencies: [
{ src: '..//.reveal.js-3.2.0/lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: '..//.reveal.js-3.2.0/plugin/markdown/showdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '..//.reveal.js-3.2.0/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '..//.reveal.js-3.2.0/plugin/highlight/highlight.js', async: true, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: '..//.reveal.js-3.2.0/plugin/zoom-js/zoom.js', async: true, condition: function() { return !!document.body.classList; } },
{ src: '..//.reveal.js-3.2.0/plugin/notes/notes.js', async: true, condition: function() { return !!document.body.classList; } }
]
});
</script>
</body>
</html>