diff --git a/druid/druid.md b/druid/druid.md
index fa38769..105d207 100644
--- a/druid/druid.md
+++ b/druid/druid.md
@@ -11,13 +11,18 @@ date: 7 Avril 2016
 
 ## Plan
 
-- Introduction ; pourquoi ?
-- Comment ?
+- Introduction; why?
+- How?
 
-## Expérience
+## Experience
 
 - Real Time Social Media Analytics
 
+## Real Time?
+
+- Ingestion Latency: seconds
+- Query Latency: seconds
+
 ## Demande
 
 - Twitter: `20k msg/s`, `1msg = 10ko` pendant 24h
@@ -40,6 +45,22 @@ date: 7 Avril 2016
 
 DEMO
 
+## Pre Considerations
+
+Discovered vs Invented
+
+## Try to conceptualize (events)
+
+Scalable + Real Time + Fail safe
+
+- timeseries
+- alerting system
+- top N
+- etc...
+
+## In the End
+
+Druid concepts are always emerging naturally
 
 # Druid
 
@@ -69,4 +90,183 @@ Metamarkets
 
 **arbitrary exploration of billion-row tables tables with sub-second latencies**
 
-## Proof
+## Storage
+
+- Columnar
+- Inverted Index
+- Immutable Segments
+
+## Columnar Storage
+
+## Index
+
+- Values are dictionary encoded
+
+`{"USA" 1, "Canada" 2, "Mexico" 3, ...}`
+
+- Bitmap for every dimension value (used by filters)
+
+`"USA" -> [0 1 0 0 1 1 0 0 0]`
+
+- Column values (used by aggregation queries)
+
+`[2,1,3,15,1,1,2,8,7]`
+
+## Data Segments
+
+- Per time interval
+    - skip segments when querying
+- Immutable
+    - Cache friendly
+    - No locking
+- Versioned
+    - No locking
+    - Read-write concurrency
+
+## Real-time ingestion
+
+- Via Real-Time Node and Firehose
+    - No redundancy or HA, thus not recommended
+- Via Indexing Service and Tranquility API
+    - Core API
+    - Integration with Streaming Frameworks
+    - HTTP Server
+    - **Kafka Consumer**
+
+## Batch Ingestion
+
+- File based (HDFS, S3, ...)
+
+## Real-time Ingestion
+
+~~~
+Task 1: [ Interval ][ Window ]
+Task 2: [ ]
+--------------------------------------->
+ time
+~~~
+
+Minimum indexing slots =
+    Data Sources × Partitions × Replicas × 2
+
+# Querying
+
+## Query types
+
+- Group by: group by multiple dimensions
+- Top N: like grouping by a single dimension
+- Timeseries: without grouping over dimensions
+- Search: Dimensions lookup
+- Time Boundary: Find available data timeframe
+- Metadata queries
+
+## Tip
+
+- Prefer `topN` over `groupBy`
+- Prefer `timeseries` over `topN`
+- Use limits (and priorities)
+
+## Query Spec
+
+- Data source
+- Dimensions
+- Interval
+- Filters
+- Aggregations
+- Post Aggregations
+- Granularity
+- Context (query configuration)
+- Limit
+
+## Example(s)
+
+TODO
+
+## Caching
+
+- Historical node level
+    - By segment
+- Broker Level
+    - By segment and query
+    - `groupBy` is disabled on purpose!
+- By default - local caching
+
+## Load Rules
+
+- Can be defined
+- What can be set
+
+# Components
+
+## Druid Components
+
+- Real-time Nodes
+- Historical Nodes
+- Broker Nodes
+- Coordinator
+- For indexing:
+    - Overlord
+    - Middle Manager
+
++ Deep Storage
++ Metadata Storage
+
++ Load Balancer
++ Cache
+
+## Coordinator
+
+Manage Segments
+
+## Real-time Nodes
+
+- Pulling data in real-time
+- Indexing it
+
+## Historical Nodes
+
+- Keep historical segments
+
+## Overlord
+
+- Accepts tasks and distributes them to middle manager
+
+## Middle Manager
+
+- Execute submitted tasks via Peons
+
+## Broker Nodes
+
+- Route query to Real-time and Historical nodes
+- Merge results
+
+## Deep Storage
+
+- Segments backup (HDFS, S3, ...)
+
+# Considerations & Tools
+
+## When *not* to choose Druid
+
+- Data is not time-series
+- Cardinality is _very_ high
+- Number of dimensions is high
+- Setup cost must be avoided
+
+## Graphite (metrics)
+
+![Graphite](img/graphite.png)\__
+
+[Graphite](http://graphite.wikidot.com)
+
+## Pivot (exploring data)
+
+![Pivot](img/pivot.gif)\
+
+[Pivot](https://github.com/implydata/pivot)
+
+## Caravel (exploring data)
+
+![caravel](img/caravel.png)\
+
+[Caravel](https://github.com/airbnb/caravel)
diff --git a/druid/img/caravel.png b/druid/img/caravel.png
new file mode 100644
index 0000000..c0b3f41
Binary files /dev/null and b/druid/img/caravel.png differ
diff --git a/druid/img/graphite.png b/druid/img/graphite.png
new file mode 100644
index 0000000..dd54437
Binary files /dev/null and b/druid/img/graphite.png differ
diff --git a/druid/img/pivot.gif b/druid/img/pivot.gif
new file mode 100644
index 0000000..1b0e3cd
Binary files /dev/null and b/druid/img/pivot.gif differ