Yann Esposito 2016-03-29 11:08:27 +02:00
parent 52c9112e33
commit af539989e2
4 changed files with 204 additions and 4 deletions


@@ -11,13 +11,18 @@ date: 7 Avril 2016
## Plan
- Introduction; why?
- How?
## Experience
- Real Time Social Media Analytics
## Real Time?
- Ingestion Latency: seconds
- Query Latency: seconds
## Demand
- Twitter: `20k msg/s`, `1 msg = 10 KB` for 24h
@@ -40,6 +45,22 @@ date: 7 Avril 2016
DEMO
</a>
## Pre Considerations
Discovered vs Invented
## Try to conceptualize (events)
Scalable + Real Time + Fail safe
- timeseries
- alerting system
- top N
- etc...
## In the End
Druid's concepts always emerge naturally
# Druid
@@ -69,4 +90,183 @@ Metamarkets
**arbitrary exploration of billion-row tables with sub-second latencies**
## Proof
## Storage
- Columnar
- Inverted Index
- Immutable Segments
## Columnar Storage
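A toy sketch (plain Python, not Druid code) of the idea: each column lives in its own array, so aggregating one metric never touches the other columns.

~~~
# Row-oriented: reading "clicks" drags every row along.
rows = [
    {"country": "USA",    "clicks": 10},
    {"country": "Canada", "clicks": 3},
    {"country": "USA",    "clicks": 7},
]

# Column-oriented: one contiguous array per column.
columns = {
    "country": ["USA", "Canada", "USA"],
    "clicks":  [10, 3, 7],
}

# Aggregation scans only the "clicks" array.
total = sum(columns["clicks"])  # 20
~~~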
## Index
- Values are dictionary encoded
`{"USA" 1, "Canada" 2, "Mexico" 3, ...}`
- Bitmap for every dimension value (used by filters)
`"USA" -> [0 1 0 0 1 1 0 0 0]`
- Column values (used by aggregation queries)
`[2,1,3,15,1,1,2,8,7]`
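A toy sketch (illustrative Python, not Druid's implementation) of the three structures above: the value dictionary, the encoded column, and one bitmap per value; filters become bitwise operations on the bitmaps.

~~~
values = ["Canada", "USA", "Mexico", "USA", "USA", "Canada"]

# 1. Dictionary-encode each distinct value to an integer id.
dictionary = {v: i + 1 for i, v in enumerate(dict.fromkeys(values))}
# {"Canada": 1, "USA": 2, "Mexico": 3}

# 2. The column stores ids instead of strings.
column = [dictionary[v] for v in values]  # [1, 2, 3, 2, 2, 1]

# 3. One bitmap per distinct value, used to evaluate filters.
bitmaps = {v: [1 if x == v else 0 for x in values] for v in dictionary}
# bitmaps["USA"] == [0, 1, 0, 1, 1, 0]

# Filter `country = "USA" OR country = "Mexico"`: bitwise OR.
matches = [a | b for a, b in zip(bitmaps["USA"], bitmaps["Mexico"])]
~~~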
## Data Segments
- Per time interval
    - skip segments when querying (sketch below)
- Immutable
    - Cache friendly
    - No locking
- Versioned
    - No locking
    - Read-write concurrency
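A toy sketch (made-up names, not Druid code) of why per-interval segments pay off: a query only scans the segments whose interval overlaps its own.

~~~
segments = {
    ("2016-04-01", "2016-04-02"): "tweets_seg_a",
    ("2016-04-02", "2016-04-03"): "tweets_seg_b",
}

def segments_for(start, end):
    # Keep segments whose interval overlaps [start, end);
    # every other segment is skipped entirely.
    return [seg for (lo, hi), seg in segments.items()
            if lo < end and hi > start]

print(segments_for("2016-04-01", "2016-04-02"))  # ['tweets_seg_a']
~~~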
## Real-time ingestion
- Via Real-Time Node and Firehose
    - No redundancy or HA, thus not recommended
- Via Indexing Service and Tranquility API
    - Core API
    - Integration with Streaming Frameworks
    - HTTP Server (sketch below)
    - **Kafka Consumer**
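A sketch of the HTTP Server option (the port, endpoint path, and datasource name are assumptions to adapt to your own Tranquility configuration): events are POSTed as JSON.

~~~
import json
import urllib.request

# Hypothetical Tranquility HTTP endpoint and datasource.
url = "http://localhost:8200/v1/post/tweets"
event = {"timestamp": "2016-04-07T10:00:00Z",
         "user": "alice", "length": 140}

req = urllib.request.Request(
    url,
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json"})
print(urllib.request.urlopen(req).read().decode())
~~~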
## Batch Ingestion
- File based (HDFS, S3, ...)
## Real-time Ingestion
~~~
Task 1: [ Interval ][ Window ]
Task 2:             [ Interval ][ Window ]
------------------------------------------>
                                       time
~~~
Minimum indexing slots =
Data Sources × Partitions × Replicas × 2
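For example: with 2 data sources, 2 partitions each, and 2 replicas, you need 2 × 2 × 2 × 2 = 16 indexing slots.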
# Querying
## Query types
- Group by: group by multiple dimensions
- Top N: like grouping by a single dimension
- Timeseries: without grouping over dimensions
- Search: Dimensions lookup
- Time Boundary: Find available data timeframe
- Metadata queries
## Tip
- Prefer `topN` over `groupBy`
- Prefer `timeseries` over `topN`
- Use limits (and priorities)
## Query Spec
- Data source
- Dimensions
- Interval
- Filters
- Aggregations
- Post Aggregations
- Granularity
- Context (query configuration)
- Limit
## Example(s)
TODO
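A minimal sketch of a `timeseries` query (datasource, dimension, aggregator, and broker address are illustrative assumptions), touching most fields of the query spec above:

~~~
import json
import urllib.request

# Illustrative query: hourly message counts from the USA.
query = {
    "queryType": "timeseries",
    "dataSource": "tweets",
    "granularity": "hour",
    "intervals": ["2016-04-01/2016-04-02"],
    "filter": {"type": "selector",
               "dimension": "country", "value": "USA"},
    "aggregations": [{"type": "longSum",
                      "name": "msgs", "fieldName": "count"}],
    "context": {"priority": 1, "timeout": 60000},
}

# Assumed broker address; 8082 is a common default port.
req = urllib.request.Request(
    "http://localhost:8082/druid/v2/?pretty",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"})
print(urllib.request.urlopen(req).read().decode())
~~~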
## Caching
- Historical node level
    - By segment
- Broker level
    - By segment and query
    - `groupBy` is disabled on purpose!
- By default: local caching
## Load Rules
- Can be defined per datasource; applied in order by the Coordinator
- Control which tier loads which segments, how many replicas, and when segments are dropped (sketch below)
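A sketch of what load rules can look like (tier names, periods, and rule order are illustrative): keep the last month on a "hot" tier, then drop older segments.

~~~
# Illustrative rule chain, evaluated in order by the Coordinator:
rules = [
    {"type": "loadByPeriod",           # recent data...
     "period": "P1M",                  # ...from the last month...
     "tieredReplicants": {"hot": 2}},  # ...2 replicas on "hot" tier
    {"type": "dropForever"},           # everything older is dropped
]
~~~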
# Components
## Druid Components
- Real-time Nodes
- Historical Nodes
- Broker Nodes
- Coordinator
- For indexing:
    - Overlord
    - Middle Manager
+ Deep Storage
+ Metadata Storage
+ Load Balancer
+ Cache
## Coordinator
Manage Segments
## Real-time Nodes
- Pulling data in real-time
- Indexing it
## Historical Nodes
- Keep historical segments
## Overlord
- Accepts tasks and distributes them to Middle Managers
## Middle Manager
- Execute submitted tasks via Peons
## Broker Nodes
- Route query to Real-time and Historical nodes
- Merge results
## Deep Storage
- Segments backup (HDFS, S3, ...)
# Considerations & Tools
## When *not* to choose Druid
- Data is not time-series
- Cardinality is _very_ high
- Number of dimensions is high
- Setup cost must be avoided
## Graphite (metrics)
![Graphite](img/graphite.png)\
[Graphite](http://graphite.wikidot.com)
## Pivot (exploring data)
![Pivot](img/pivot.gif)\
[Pivot](https://github.com/implydata/pivot)
## Caravel (exploring data)
![caravel](img/caravel.png)\
[Caravel](https://github.com/airbnb/caravel)

BIN druid/img/caravel.png (new file, 300 KiB)

BIN druid/img/graphite.png (new file, 166 KiB)

BIN druid/img/pivot.gif (new file, 360 KiB)