<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<meta name="author" content="Yann Esposito">
<title>Druid for Real-Time Data Analysis</title>
<style type="text/css">code{white-space: pre;}</style>
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<link rel="stylesheet" href="../styling.css">
</head>
<body>
<header>
<h1 class="title">Druid for Real-Time Data Analysis</h1>
<h2 class="author">Yann Esposito</h2>
<h3 class="date">April 7, 2016</h3>
</header>
<nav id="TOC">
<ul>
<li><a href="#intro">Intro</a><ul>
<li><a href="#plan">Plan</a></li>
<li><a href="#experience">Experience</a></li>
<li><a href="#real-time">Real Time?</a></li>
<li><a href="#demande">Demand</a></li>
<li><a href="#en-pratique">In Practice</a></li>
<li><a href="#origine-php">Origin (PHP)</a></li>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#pre-considerations">Pre Considerations</a></li>
<li><a href="#try-to-conceptualize-events">Try to conceptualize (events)</a></li>
<li><a href="#in-the-end">In the End</a></li>
</ul></li>
<li><a href="#druid">Druid</a><ul>
<li><a href="#who">Who</a></li>
<li><a href="#goal">Goal</a></li>
<li><a href="#concepts">Concepts</a></li>
<li><a href="#features">Features</a></li>
<li><a href="#storage">Storage</a></li>
<li><a href="#columnar-storage">Columnar Storage</a></li>
<li><a href="#index">Index</a></li>
<li><a href="#data-segments">Data Segments</a></li>
<li><a href="#real-time-ingestion">Real-time ingestion</a></li>
<li><a href="#batch-ingestion">Batch Ingestion</a></li>
<li><a href="#real-time-ingestion-1">Real-time Ingestion</a></li>
</ul></li>
<li><a href="#querying">Querying</a><ul>
<li><a href="#query-types">Query types</a></li>
<li><a href="#tip">Tip</a></li>
<li><a href="#query-spec">Query Spec</a></li>
<li><a href="#examples">Example(s)</a></li>
<li><a href="#caching">Caching</a></li>
<li><a href="#load-rules">Load Rules</a></li>
</ul></li>
<li><a href="#components">Components</a><ul>
<li><a href="#druid-components">Druid Components</a></li>
<li><a href="#coordinator">Coordinator</a></li>
<li><a href="#real-time-nodes">Real-time Nodes</a></li>
<li><a href="#historical-nodes">Historical Nodes</a></li>
<li><a href="#overlord">Overlord</a></li>
<li><a href="#middle-manager">Middle Manager</a></li>
<li><a href="#broker-nodes">Broker Nodes</a></li>
<li><a href="#deep-storage">Deep Storage</a></li>
</ul></li>
<li><a href="#considerations-tools">Considerations &amp; Tools</a><ul>
<li><a href="#when-not-to-choose-druid">When <em>not</em> to choose Druid</a></li>
<li><a href="#graphite-metrics">Graphite (metrics)</a></li>
<li><a href="#pivot-exploring-data">Pivot (exploring data)</a></li>
<li><a href="#caravel-exploring-data">Caravel (exploring data)</a></li>
</ul></li>
</ul>
</nav>
<h1 id="intro">Intro</h1>
<h2 id="plan">Plan</h2>
<ul>
<li>Introduction; why?</li>
<li>How?</li>
</ul>
<h2 id="experience">Experience</h2>
<ul>
<li>Real Time Social Media Analytics</li>
</ul>
<h2 id="real-time">Real Time?</h2>
<ul>
<li>Ingestion Latency: seconds</li>
<li>Query Latency: seconds</li>
</ul>
<h2 id="demande">Demand</h2>
<ul>
<li>Twitter: <code>20k msg/s</code>, <code>1 msg = 10 kB</code>, sustained for 24 h</li>
<li>Public Facebook: 1,000 to 2,000 msg/s continuously</li>
</ul>
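<p>To get a feel for the Twitter figure above, here is a back-of-the-envelope calculation. It is only a sketch: it assumes that 10 kB means 10,000 bytes and that the 20k msg/s rate holds for the full 24 hours.</p>

```python
# Back-of-the-envelope sizing for the Twitter scenario above.
# Assumption: "10 kB" means 10,000 bytes (decimal units).
MSG_PER_SECOND = 20_000
BYTES_PER_MSG = 10_000
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_second = MSG_PER_SECOND * BYTES_PER_MSG
messages_per_day = MSG_PER_SECOND * SECONDS_PER_DAY
bytes_per_day = bytes_per_second * SECONDS_PER_DAY

print(f"{bytes_per_second / 1e6:.0f} MB/s")           # 200 MB/s
print(f"{messages_per_day / 1e9:.3f} billion msgs")  # 1.728 billion msgs
print(f"{bytes_per_day / 1e12:.2f} TB per day")       # 17.28 TB per day
```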
<h2 id="en-pratique">In Practice</h2>
<ul>
<li>Twitter: 400 msg/s sustained, peaks at 1,500</li>
</ul>
<h2 id="origine-php">Origin (PHP)</h2>
<p><img src="img/bad_php.jpg" alt="History" /> </p>
<h2 id="introduction">Introduction</h2>
<ul>
<li>High-volume data processing + low latency</li>
<li>Typically <code>pulse</code></li>
</ul>
<p><a href="http://pulse.vigiglo.be/#/vigiglobe/Earthquake/dashboard" target="_blank"> DEMO </a></p>
<h2 id="pre-considerations">Pre Considerations</h2>
<p>Discovered vs Invented</p>
<h2 id="try-to-conceptualize-events">Try to conceptualize (events)</h2>
<p>Scalable + Real Time + Fail safe</p>
<ul>
<li>timeseries</li>
<li>alerting system</li>
<li>top N</li>
<li>etc…</li>
</ul>
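<p>As an illustration of what "conceptualizing events" can mean, the two aggregates named above (timeseries and top N) can be computed naively from a list of events. This is a single-machine sketch with made-up events, not a scalable or fail-safe implementation.</p>

```python
from collections import Counter

# Hypothetical events: (unix timestamp in seconds, hashtag).
events = [
    (0, "druid"), (10, "kafka"), (65, "druid"),
    (70, "druid"), (130, "storm"), (135, "kafka"),
]

def timeseries(events, bucket_seconds):
    """Count events per time bucket, keyed by bucket start."""
    counts = Counter(ts - ts % bucket_seconds for ts, _ in events)
    return dict(sorted(counts.items()))

def top_n(events, n):
    """Return the n most frequent dimension values."""
    return Counter(value for _, value in events).most_common(n)

print(timeseries(events, 60))  # {0: 2, 60: 2, 120: 2}
print(top_n(events, 2))        # [('druid', 3), ('kafka', 2)]
```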
<h2 id="in-the-end">In the End</h2>
<p>Druid's concepts always emerge naturally</p>
<h1 id="druid">Druid</h1>
<h2 id="who">Who</h2>
<p>Metamarkets</p>
<h2 id="goal">Goal</h2>
<blockquote>
<p>Druid is an open source store designed for real-time exploratory analytics on large data sets.</p>
</blockquote>
<blockquote>
<p>hosted dashboard that would allow users to arbitrarily explore and visualize event streams.</p>
</blockquote>
<h2 id="concepts">Concepts</h2>
<ul>
<li>Column-oriented storage layout</li>
<li>distributed, shared-nothing architecture</li>
<li>advanced indexing structure</li>
</ul>
<h2 id="features">Features</h2>
<ul>
<li>fast aggregations</li>
<li>flexible filters</li>
<li>low latency data ingestion</li>
</ul>
<p><strong>arbitrary exploration of billion-row tables with sub-second latencies</strong></p>
<h2 id="storage">Storage</h2>
<ul>
<li>Columnar</li>
<li>Inverted Index</li>
<li>Immutable Segments</li>
</ul>
<h2 id="columnar-storage">Columnar Storage</h2>
<h2 id="index">Index</h2>
<ul>
<li>Values are dictionary encoded</li>
</ul>
<p><code>{&quot;USA&quot; 1, &quot;Canada&quot; 2, &quot;Mexico&quot; 3, ...}</code></p>
<ul>
<li>Bitmap for every dimension value (used by filters)</li>
</ul>
<p><code>&quot;USA&quot; -&gt; [0 1 0 0 1 1 0 0 0]</code></p>
<ul>
<li>Column values (used by aggregation queries)</li>
</ul>
<p><code>[2,1,3,15,1,1,2,8,7]</code></p>
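<p>The three structures above can be sketched in a few lines. The row data is invented so that the <code>"USA"</code> bitmap and the metric column match the slide; real Druid segments use compressed bitmaps, which this sketch does not attempt to reproduce.</p>

```python
# Hypothetical "country" dimension and a metric column for 9 rows.
countries = ["Canada", "USA", "Mexico", "Mexico", "USA",
             "USA", "Canada", "Mexico", "Canada"]
metric = [2, 1, 3, 15, 1, 1, 2, 8, 7]

# 1. Dictionary encoding: each distinct value gets a small integer id.
dictionary = {"USA": 1, "Canada": 2, "Mexico": 3}
encoded = [dictionary[c] for c in countries]

# 2. One bitmap per dimension value: bit i is 1 when row i has the value.
bitmaps = {
    value: [1 if c == value else 0 for c in countries]
    for value in dictionary
}

# 3. A filtered aggregation reads the bitmap, then sums matching rows.
def filtered_sum(value):
    return sum(m for bit, m in zip(bitmaps[value], metric) if bit)

print(bitmaps["USA"])       # [0, 1, 0, 0, 1, 1, 0, 0, 0]
print(filtered_sum("USA"))  # 3
```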
<h2 id="data-segments">Data Segments</h2>
<ul>
<li>Per time interval<ul>
<li>Skip segments when querying</li>
</ul></li>
<li>Immutable<ul>
<li>Cache friendly</li>
<li>No locking</li>
</ul></li>
<li>Versioned<ul>
<li>No locking</li>
<li>Read-write concurrency</li>
</ul></li>
</ul>
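<p>A minimal sketch of how time-partitioned, versioned segments let a query skip data. The <code>Segment</code> class and the integer intervals are assumptions made for illustration; they do not mirror Druid's actual segment metadata.</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen mirrors the immutability of segments
class Segment:
    start: int    # inclusive, e.g. hour of day
    end: int      # exclusive
    version: int

def overlapping(segments, query_start, query_end):
    """Keep only segments whose interval intersects the query interval."""
    return [s for s in segments
            if s.end > query_start and query_end > s.start]

def latest_versions(segments):
    """For each interval, keep the segment with the highest version."""
    best = {}
    for s in segments:
        key = (s.start, s.end)
        if key not in best or s.version > best[key].version:
            best[key] = s
    return sorted(best.values(), key=lambda s: s.start)

segments = [Segment(0, 1, 1), Segment(1, 2, 1),
            Segment(1, 2, 2), Segment(2, 3, 1)]
hits = latest_versions(overlapping(segments, 1, 2))
print(hits)  # [Segment(start=1, end=2, version=2)]
```

<p>Because a new version replaces an interval wholesale instead of mutating it, readers of the old version never need a lock.</p>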
<h2 id="real-time-ingestion">Real-time ingestion</h2>
<ul>
<li>Via Real-Time Node and Firehose<ul>
<li>No redundancy or HA, thus not recommended</li>
</ul></li>
<li>Via Indexing Service and Tranquility API<ul>
<li>Core API</li>
<li>Integration with Streaming Frameworks</li>
<li>HTTP Server</li>
<li><strong>Kafka Consumer</strong></li>
</ul></li>
</ul>
<h2 id="batch-ingestion">Batch Ingestion</h2>
<ul>
<li>File based (HDFS, S3, …)</li>
</ul>
<h2 id="real-time-ingestion-1">Real-time Ingestion</h2>
<pre><code>Task 1: [ Interval ][ Window ]
Task 2:             [                    ]
---------------------------------------&gt;
                                   time</code></pre>
<p>Minimum indexing slots =<br />
Data Sources × Partitions × Replicas × 2</p>
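<p>The formula above as a small helper. The reading of the factor 2 (a task still in its window period overlapping with the task for the next interval) is an interpretation of the diagram, and the input numbers are made up.</p>

```python
def min_indexing_slots(data_sources, partitions, replicas):
    """Data Sources x Partitions x Replicas x 2.

    The factor 2 is read here as: while one task is still in its window
    period, the task for the next interval is already running.
    """
    return data_sources * partitions * replicas * 2

print(min_indexing_slots(data_sources=3, partitions=2, replicas=2))  # 24
```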
<h1 id="querying">Querying</h1>
<h2 id="query-types">Query types</h2>
<ul>
<li>Group by: group by multiple dimensions</li>
<li>Top N: like grouping by a single dimension</li>
<li>Timeseries: without grouping over dimensions</li>
<li>Search: Dimensions lookup</li>
<li>Time Boundary: Find available data timeframe</li>
<li>Metadata queries</li>
</ul>
<h2 id="tip">Tip</h2>
<ul>
<li>Prefer <code>topN</code> over <code>groupBy</code></li>
<li>Prefer <code>timeseries</code> over <code>topN</code></li>
<li>Use limits (and priorities)</li>
</ul>
<h2 id="query-spec">Query Spec</h2>
<ul>
<li>Data source</li>
<li>Dimensions</li>
<li>Interval</li>
<li>Filters</li>
<li>Aggregations</li>
<li>Post Aggregations</li>
<li>Granularity</li>
<li>Context (query configuration)</li>
<li>Limit</li>
</ul>
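<p>To make the list above concrete, here is a sketch of a native <code>topN</code> query built as a Python dictionary. The data source, dimension, and metric names (<code>tweets</code>, <code>hashtag</code>, <code>lang</code>, <code>retweet_count</code>) are hypothetical; in a <code>topN</code> query the limit is expressed by <code>threshold</code>.</p>

```python
import json

# Hypothetical data source, dimension, and metric names.
query = {
    "queryType": "topN",
    "dataSource": "tweets",
    "dimension": "hashtag",
    "intervals": ["2016-04-01T00:00:00Z/2016-04-02T00:00:00Z"],
    "granularity": "all",
    "filter": {"type": "selector", "dimension": "lang", "value": "fr"},
    "aggregations": [
        {"type": "count", "name": "rows"},
        {"type": "longSum", "name": "retweets", "fieldName": "retweet_count"},
    ],
    "postAggregations": [{
        "type": "arithmetic", "name": "retweets_per_row", "fn": "/",
        "fields": [
            {"type": "fieldAccess", "name": "retweets", "fieldName": "retweets"},
            {"type": "fieldAccess", "name": "rows", "fieldName": "rows"},
        ],
    }],
    "metric": "retweets",   # rank hashtags by this aggregation
    "threshold": 10,        # the "limit" of a topN query
    "context": {"timeout": 5000, "priority": 1},
}

body = json.dumps(query, indent=2)
print(body)
```

<p>The resulting JSON would typically be sent with an HTTP POST to the broker's <code>/druid/v2/</code> endpoint; host and port depend on the deployment.</p>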
<h2 id="examples">Example(s)</h2>
<p>TODO</p>
<h2 id="caching">Caching</h2>
<ul>
<li>Historical node level<ul>
<li>By segment</li>
</ul></li>
<li>Broker level<ul>
<li>By segment and query</li>
</ul></li>
<li><code>groupBy</code> is disabled on purpose!</li>
<li>By default: local caching</li>
</ul>
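<p>A toy illustration of caching keyed by segment and query, with <code>groupBy</code> results bypassing the cache. The class and its method names are invented for this sketch and are not Druid's cache implementation.</p>

```python
import hashlib
import json

class SegmentQueryCache:
    """Toy local cache keyed by (segment id, query fingerprint)."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def _fingerprint(query):
        # Canonical JSON so key order does not change the fingerprint.
        canonical = json.dumps(query, sort_keys=True).encode()
        return hashlib.sha1(canonical).hexdigest()

    def get_or_compute(self, segment_id, query, compute):
        # groupBy results are never cached, matching the note above.
        if query.get("queryType") == "groupBy":
            return compute()
        key = (segment_id, self._fingerprint(query))
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = compute()
        return self._store[key]

cache = SegmentQueryCache()
q = {"queryType": "timeseries", "granularity": "hour"}
cache.get_or_compute("seg-2016-04-07T10", q, lambda: [42])
cache.get_or_compute("seg-2016-04-07T10", q, lambda: [42])
print(cache.hits)  # 1
```

<p>Since segments are immutable, a cached per-segment result stays valid for as long as that segment version is served.</p>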
<h2 id="load-rules">Load Rules</h2>
<ul>
<li>Can be defined</li>
<li>What can be set</li>
</ul>
<h1 id="components">Components</h1>
<h2 id="druid-components">Druid Components</h2>
<ul>
<li>Real-time Nodes</li>
<li>Historical Nodes</li>
<li>Broker Nodes</li>
<li>Coordinator</li>
<li>For indexing:<ul>
<li>Overlord</li>
<li>Middle Manager</li>
</ul></li>
<li>Deep Storage</li>
<li>Metadata Storage</li>
<li>Load Balancer</li>
<li>Cache</li>
</ul>
<h2 id="coordinator">Coordinator</h2>
<p>Manages segments</p>
<h2 id="real-time-nodes">Real-time Nodes</h2>
<ul>
<li>Pulling data in real-time</li>
<li>Indexing it</li>
</ul>
<h2 id="historical-nodes">Historical Nodes</h2>
<ul>
<li>Keep historical segments</li>
</ul>
<h2 id="overlord">Overlord</h2>
<ul>
<li>Accepts tasks and distributes them to Middle Managers</li>
</ul>
<h2 id="middle-manager">Middle Manager</h2>
<ul>
<li>Executes submitted tasks via Peons</li>
</ul>
<h2 id="broker-nodes">Broker Nodes</h2>
<ul>
<li>Route queries to Real-time and Historical nodes</li>
<li>Merge results</li>
</ul>
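<p>The route-then-merge role of the broker can be sketched as a simple scatter-gather. The two node functions and their partial counts are invented stand-ins for real Historical and Real-time nodes.</p>

```python
from collections import Counter

# Hypothetical partial results: per-node counts by hashtag.
def historical_node(query):
    return {"druid": 120, "kafka": 80}

def realtime_node(query):
    return {"druid": 5, "storm": 7}

def broker(query, nodes):
    """Send the query to every node, then merge the partial counts."""
    merged = Counter()
    for node in nodes:
        merged.update(node(query))  # Counter.update adds counts
    return merged.most_common(query["threshold"])

print(broker({"threshold": 2}, [historical_node, realtime_node]))
# [('druid', 125), ('kafka', 80)]
```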
<h2 id="deep-storage">Deep Storage</h2>
<ul>
<li>Segments backup (HDFS, S3, …)</li>
</ul>
<h1 id="considerations-tools">Considerations &amp; Tools</h1>
<h2 id="when-not-to-choose-druid">When <em>not</em> to choose Druid</h2>
<ul>
<li>Data is not time-series</li>
<li>Cardinality is <em>very</em> high</li>
<li>Number of dimensions is high</li>
<li>Setup cost must be avoided</li>
</ul>
<h2 id="graphite-metrics">Graphite (metrics)</h2>
<p><img src="img/graphite.png" alt="Graphite" /></p>
<p><a href="http://graphite.wikidot.com">Graphite</a></p>
<h2 id="pivot-exploring-data">Pivot (exploring data)</h2>
<p><img src="img/pivot.gif" alt="Pivot" /> </p>
<p><a href="https://github.com/implydata/pivot">Pivot</a></p>
<h2 id="caravel-exploring-data">Caravel (exploring data)</h2>
<p><img src="img/caravel.png" alt="caravel" /> </p>
<p><a href="https://github.com/airbnb/caravel">Caravel</a></p>
<div id="footer">
<a href="http://yannesposito.com">Y</a>
</div>
</body>
</html>