Yann Esposito 2016-03-29 11:08:27 +02:00
parent 52c9112e33
commit af539989e2
4 changed files with 204 additions and 4 deletions


@@ -11,13 +11,18 @@ date: 7 Avril 2016
## Plan
- Introduction; why?
- How?
## Experience
- Real Time Social Media Analytics
## Real Time?
- Ingestion Latency: seconds
- Query Latency: seconds
## Demand
- Twitter: `20k msg/s`, `1 msg = 10 KB` for 24h
@@ -40,6 +45,22 @@ date: 7 Avril 2016
DEMO
</a>
## Pre Considerations
Discovered vs Invented
## Try to conceptualize (events)
Scalable + Real Time + Fail safe
- timeseries
- alerting system
- top N
- etc...
## In the End
Druid's concepts always emerge naturally
# Druid
@@ -69,4 +90,183 @@ Metamarkets
**arbitrary exploration of billion-row tables with sub-second latencies**
## Proof
## Storage
- Columnar
- Inverted Index
- Immutable Segments
## Columnar Storage
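A toy sketch (plain Python, not Druid code) of the idea: each column lives in its own array, so aggregating one metric never touches the other columns.

~~~
# Row-oriented: reading "clicks" drags every row along.
rows = [
    {"country": "USA",    "clicks": 10},
    {"country": "Canada", "clicks": 3},
    {"country": "USA",    "clicks": 7},
]

# Column-oriented: one contiguous array per column.
columns = {
    "country": ["USA", "Canada", "USA"],
    "clicks":  [10, 3, 7],
}

# Aggregation scans only the "clicks" array.
total = sum(columns["clicks"])  # 20
~~~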
## Index
- Values are dictionary encoded
`{"USA" 1, "Canada" 2, "Mexico" 3, ...}`
- Bitmap for every dimension value (used by filters)
`"USA" -> [0 1 0 0 1 1 0 0 0]`
- Column values (used by aggregation queries)
`[2,1,3,15,1,1,2,8,7]`
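A toy sketch (illustrative Python, not Druid's implementation) of the three structures above: the value dictionary, the encoded column, and one bitmap per value; filters become bitwise operations on the bitmaps.

~~~
values = ["Canada", "USA", "Mexico", "USA", "USA", "Canada"]

# 1. Dictionary-encode each distinct value to an integer id.
dictionary = {v: i + 1 for i, v in enumerate(dict.fromkeys(values))}
# {"Canada": 1, "USA": 2, "Mexico": 3}

# 2. The column stores ids instead of strings.
column = [dictionary[v] for v in values]  # [1, 2, 3, 2, 2, 1]

# 3. One bitmap per distinct value, used to evaluate filters.
bitmaps = {v: [1 if x == v else 0 for x in values] for v in dictionary}
# bitmaps["USA"] == [0, 1, 0, 1, 1, 0]

# Filter `country = "USA" OR country = "Mexico"`: bitwise OR.
matches = [a | b for a, b in zip(bitmaps["USA"], bitmaps["Mexico"])]
~~~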
## Data Segments
- Per time interval
    - skip segments when querying (sketch below)
- Immutable
    - Cache friendly
    - No locking
- Versioned
    - No locking
    - Read-write concurrency
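A toy sketch (made-up names, not Druid code) of why per-interval segments pay off: a query only scans the segments whose interval overlaps its own.

~~~
segments = {
    ("2016-04-01", "2016-04-02"): "tweets_seg_a",
    ("2016-04-02", "2016-04-03"): "tweets_seg_b",
}

def segments_for(start, end):
    # Keep segments whose interval overlaps [start, end);
    # every other segment is skipped entirely.
    return [seg for (lo, hi), seg in segments.items()
            if lo < end and hi > start]

print(segments_for("2016-04-01", "2016-04-02"))  # ['tweets_seg_a']
~~~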
## Real-time ingestion
- Via Real-Time Node and Firehose
    - No redundancy or HA, thus not recommended
- Via Indexing Service and Tranquility API
    - Core API
    - Integration with Streaming Frameworks
    - HTTP Server (sketch below)
    - **Kafka Consumer**
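A sketch of the HTTP Server option (the port, endpoint path, and datasource name are assumptions to adapt to your own Tranquility configuration): events are POSTed as JSON.

~~~
import json
import urllib.request

# Hypothetical Tranquility HTTP endpoint and datasource.
url = "http://localhost:8200/v1/post/tweets"
event = {"timestamp": "2016-04-07T10:00:00Z",
         "user": "alice", "length": 140}

req = urllib.request.Request(
    url,
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json"})
print(urllib.request.urlopen(req).read().decode())
~~~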
## Batch Ingestion
- File based (HDFS, S3, ...)
## Real-time Ingestion
~~~
Task 1: [ Interval ][ Window ]
Task 2:             [ Interval ][ Window ]
------------------------------------------>
                                       time
~~~
Minimum indexing slots =
Data Sources × Partitions × Replicas × 2
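For example: with 2 data sources, 2 partitions each, and 2 replicas, you need 2 × 2 × 2 × 2 = 16 indexing slots.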
# Querying
## Query types
- Group by: group by multiple dimensions
- Top N: like grouping by a single dimension
- Timeseries: without grouping over dimensions
- Search: Dimensions lookup
- Time Boundary: Find available data timeframe
- Metadata queries
## Tip
- Prefer `topN` over `groupBy`
- Prefer `timeseries` over `topN`
- Use limits (and priorities)
## Query Spec
- Data source
- Dimensions
- Interval
- Filters
- Aggregations
- Post Aggregations
- Granularity
- Context (query configuration)
- Limit
## Example(s)
TODO
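A minimal sketch of a `timeseries` query (datasource, dimension, aggregator, and broker address are illustrative assumptions), touching most fields of the query spec above:

~~~
import json
import urllib.request

# Illustrative query: hourly message counts from the USA.
query = {
    "queryType": "timeseries",
    "dataSource": "tweets",
    "granularity": "hour",
    "intervals": ["2016-04-01/2016-04-02"],
    "filter": {"type": "selector",
               "dimension": "country", "value": "USA"},
    "aggregations": [{"type": "longSum",
                      "name": "msgs", "fieldName": "count"}],
    "context": {"priority": 1, "timeout": 60000},
}

# Assumed broker address; 8082 is a common default port.
req = urllib.request.Request(
    "http://localhost:8082/druid/v2/?pretty",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"})
print(urllib.request.urlopen(req).read().decode())
~~~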
## Caching
- Historical node level
    - By segment
- Broker level
    - By segment and query
    - `groupBy` is disabled on purpose!
- By default: local caching
## Load Rules
- Can be defined per datasource; applied in order by the Coordinator
- Control which tier loads which segments, how many replicas, and when segments are dropped (sketch below)
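A sketch of what load rules can look like (tier names, periods, and rule order are illustrative): keep the last month on a "hot" tier, then drop older segments.

~~~
# Illustrative rule chain, evaluated in order by the Coordinator:
rules = [
    {"type": "loadByPeriod",           # recent data...
     "period": "P1M",                  # ...from the last month...
     "tieredReplicants": {"hot": 2}},  # ...2 replicas on "hot" tier
    {"type": "dropForever"},           # everything older is dropped
]
~~~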
# Components
## Druid Components
- Real-time Nodes
- Historical Nodes
- Broker Nodes
- Coordinator
- For indexing:
    - Overlord
    - Middle Manager
+ Deep Storage
+ Metadata Storage
+ Load Balancer
+ Cache
## Coordinator
Manage Segments
## Real-time Nodes
- Pulling data in real-time
- Indexing it
## Historical Nodes
- Keep historical segments
## Overlord
- Accepts tasks and distributes them to Middle Managers
## Middle Manager
- Execute submitted tasks via Peons
## Broker Nodes
- Route query to Real-time and Historical nodes
- Merge results
## Deep Storage
- Segments backup (HDFS, S3, ...)
# Considerations & Tools
## When *not* to choose Druid
- Data is not time-series
- Cardinality is _very_ high
- Number of dimensions is high
- Setup cost must be avoided
## Graphite (metrics)
![Graphite](img/graphite.png)\
[Graphite](http://graphite.wikidot.com)
## Pivot (exploring data)
![Pivot](img/pivot.gif)\
[Pivot](https://github.com/implydata/pivot)
## Caravel (exploring data)
![caravel](img/caravel.png)\
[Caravel](https://github.com/airbnb/caravel)

BIN druid/img/caravel.png (new file, 300 KiB)

BIN druid/img/graphite.png (new file, 166 KiB)

BIN druid/img/pivot.gif (new file, 360 KiB)