14 KiB
14 KiB
title |
---|
Whirlwind Tour of Core Haskell Libraries |
Whirlwind Tour of Core Haskell Libraries
- Michael Snoyman
- LambdaConf Winter Retreat 2018
Today's format
- Very informal
- Interrupt!
- Ask questions!
- References to lots of external learning material
- Can go into depth on any of that if desired
- Happy to discuss the topics here at length
- Mini hackathon as well? https://github.com/snoyberg/codename-karka
Haskell's standard library
- Standard library is
base
- Includes standard prelude,
Prelude
- They both suck :(
- Missing lots of functionality
- Dangerous functions
- Need to call out to other libraries for almost any program
Downsides to weak base
- Which library to use for this functionality?
- Dependency fear! I want my package to be lightweight
- Mismatches in core types across ecosystem
Bonus problem: patterns
- Other languages have "design patterns"
- We don't need that in Haskell because types
- Except: how do you handle effects like possible failure, or HTTP
calls?
- Throw it all in
IO
! - Concrete monad transformers
mtl
-style typeclasses- Effect libraries...
- Throw it all in
- Weak standard library => non-standard types => many different patterns
End result
- Difficult for people new to the language to get started
- Lack of standardization across team makes code bases difficult to maintain
- Fear of dependencies ultimately leads to lots of reinvented
functionality
- Code bloat
- More bugs
Today's topic
- Cover a number of recommended libraries
- Recommended by whom? Me :)
- Discuss some best practices for putting projects together
- Describe a new initiative to help bring this all together
- Finally: how to help Haskell take over the world!
Features to cover
- Data structures
- I/O
- Concurrency
- Mutable data
- Exception handling
- External processes
Doesn't cover all needs, but most real programs will need almost all of these.
Data structures
Three categories
- Sequential data
- Map/Dictionary
- Set
Sequential data the most complicated, let's knock out the other two
Maps
- Three core datatypes
data Map key value
data IntMap value
data HashMap key value
IntMap
is a specialized, optimizedMap Int
Map
is a binary tree,HashMap
is (surprised) hash mapMap
requiresOrd
on keys,HashMap
requiresHashable
andEq
- Generally:
HashMap
performs better
Strict or lazy values
- Maps are always strict in their keys
- Forcing a
Map
requires forcing all of its keys
- Forcing a
- You can be lazy in the values if you want...
- Usually: don't do that, use
Data.Map.Strict
et al
Mutability
- Unlike other languages,
Map
s are immutable - Less used hashtables library provides in place mutation
- Immutable is nice: don't worry about data races
- Stick it inside a
TVar
,IORef
, etc - Downside: performance is not as good
Map API Overview
import qualified Data.Map.Strict as Map
import qualified Data.IntMap.Strict as IntMap
import qualified Data.HashMap.Strict as HashMap
singleton :: k -> v -> Map k v
fromList :: [(k, v)] -> Map k v
toList :: Map k v -> [(k, v)]
lookup :: k -> Map k v -> Maybe v
insert :: k -> v -> Map k v -> Map k v
insertWith :: (v -> v -> v) -> k -> v -> Map k v -> Map k v
union :: Map k v -> Map k v -> Map k v
unionWith :: (v -> v -> v) -> Map k v -> Map k v -> Map k v
What about duplicates?
Sets
- Just like
Map
s, but no values (or()
is the value...) - No strict vs lazy difference... no values!
- No worry about duplicate keys... no values!
Set API Overview
import qualified Data.Set as Set
import qualified Data.IntSet as IntSet
import qualified Data.HashSet as HashSet
singleton :: k -> Set k
fromList :: [k] -> Set k
toList :: Set k -> [k]
member :: k -> Set k -> Bool
insert :: k -> Set k -> Set k
union :: Set k -> Set k -> Set k
Calculating frequency
import qualified Data.ByteString.Lazy as BL
import qualified Data.Map.Strict as Map
main :: IO ()
main = do
lbs <- BL.getContents
let add m w = Map.insertWith (+) w 1 m
mapM_ print $ Map.toList $ BL.foldl' add Map.empty lbs
More information
Sequential data
Everyone knows lists, right?
(++) :: [a] -> [a] -> [a]
concat :: [[a]] -> [a]
map :: (a -> b) -> [a] -> [b]
break :: (a -> Bool) -> [a] -> ([a], [a])
splitAt :: Int -> [a] -> ([a], [a])
null :: [a] -> Bool
length :: [a] -> Int
reverse :: [a] -> [a]
intercalate :: [a] -> [[a]] -> [a]
foldl' :: (b -> a -> b) -> b -> [a] -> b
and :: [Bool] -> Bool
sum :: Num a => [a] -> a
replicate :: Int -> a -> [a]
Lists: the good
- Polymorphic on any contained value
- Lazy/infinite
- Cheap prepend (singly linked list)
- Pure data structure
- Built in syntactic sugar
- Easy to pattern match
Lists: the bad
- Lots of memory overhead
- Data constructor per cons
- Pointer to value (1 word)
- Pointer to rest of list (1 word)
- O(n) indexing
- Can hide bottom values (
1:2:undefined
) - Consider overhead of
[Word8]
and[Char]
What do?
Other languages
Most languages have a few sequential data types
- Linked list/doubly linked lists
- Queue/double-ended queue
- Array/vector
Haskell's plethora
- Lists/difference lists
- Seq
- Arrays (don't bother, use vector)
- Vector: boxed, storable, unboxed
- ByteString: strict, lazy
- Text: strict, lazy
- ShortByteString (seriously?)
Why so many?
Haskell's memory model
We have four ways of storing sequential data
- Entirely as normal heap objects (list, Seq, diff lists)
- Primitive boxed arrays (boxed vector)
- Unpinned memory (Text, ShortByteString, unboxed vector)
- Pinned memory (ByteString, storable vector)
Heap objects
- Pointers, pointers everywhere
- Memory overhead for the allocations/GC
- CPU overhead for following pointers
Primitive boxed arrays
- Packed representation of pointers
- Still follow pointers to the values
- Pointers can point to thunks, which is why they're value lazy
- Less pointer overhead, but still some
- Allows any heap object to be stored
Unpinned memory
- Byte array managed by garbage collector
- GC can move it around
- Reducing fragmentation
- Can't pass it over FFI
- Values stored as bytes
- Must be representable as bytes
- Must represent in fixed size
- Cannot be lazy
Pinned memory
- Standard
malloc
-style buffers - GC can't move it around
- Can fragment memory (don't hold for too long)
- Can pass it over FFI
- Values stored as bytes, same as unpinned
Haskell's laziness
Three levels of laziness in these data structures
- Fully lazy, both the values and the structure itself are lazy (e.g.,
list
oo:bar:undefined
) - Spine strict: values can be lazy, but not the structure (e.g., boxed
vector,
fromList [undefined]
) - Fully strict: nothing lazy (e.g.,
ByteString
, this failsfromList [undefined]
) - Semi-strict: lazy list of strict chunks (lazy
ByteString
andText
)
Overlaps
That still leaves us with some overlaps
- List vs diff list vs
Seq
: different time complexity for some operations ShortByteString
vs unboxedVector Word8
: same thingByteString
vs storableVector Word8
: also same thingText
does not overlap: it's aShortByteString
containing UTF-16 codepoints with aChar
-based API
What to use?
ShortByteString
: smaller and long livedByteString
interacting with FFI (I/O)Text
for storing textual values- If it works and not FFI: unpinned vector
- If it works and you need FFI: storable vector
- Need spine laziness (e.g., infinite), use lists
- Unusual optimizations
- Cheap append and inspection:
Seq
- Cheap append: difference lists
- Cheap append and inspection:
- Otherwise: boxed vector
The string problem
- Lots of theoretical ways to represent string-like stuff
[Char]
, strict/lazyByteString
/Text
,ShortByteString
,Vector
ofWord8
orChar
...Vector
ofChar
is always a bad idea: too much memory- Good that we have bytes vs text difference
- Need to use
Text
instead ofString
everywhere... not there yet - Conclusion: use strict
ByteString
orText
almost everywhere, convert when necessary
Why not use...
- Lazy
ByteString
orText
- Useful sometimes, like lazily generating large data
- Mostly used for lazy I/O, which I advise against (use conduit instead)
- Unboxed/storable vectors instead of
ByteString
/ShortByteString
- Hysterical raisins, probably the right thing
- Lots of Rust envy here :(
- Lists everywhere
- Performance
The good news
- You know lists, right?
- You basically know all of these data structures
- Don't get overwhelmed with the choices, just follow the advice above
Qualified imports (the bad news?)
- Since these all have similar APIs, names conflict
- Use qualified imports
- Recommended naming
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Vector as V -- boxed
import qualified Data.Vector.Unboxed as VU
import qualified Data.Vector.Storable as VS
import Data.ByteString.Short
(ShortByteString, toShort, fromShort)
Further reading
I/O
- Not the monad, actual input and output
- Console
- Files
- Network
- Streaming and in memory
The bad: character I/O
- Implicit character decoding
- Newline handling
- Environment variables affect things
- For console: probably right
- File I/O: probably wrong
- Network I/O: lol nope
https://www.snoyman.com/blog/2016/12/beware-of-readfile
The bad: lazy I/O
- Lazy
readFile
(et al)- Hides exceptions till later
- Keeps file descriptors open longer than expected
- Answer: use strict I/O operations
- Dealing with large data? Use conduit
- Exception to the rule: lazy
writeFile
is fine (not actually lazy I/O)
Reading a file
- Wants bytes?
Data.ByteString.readFile
- Want text? Choose an encoding!
decodeUtf8With lenientDecode <$> B.readFile fp
- Let's get crazy
readFileUtf8 :: MonadIO m => FilePath -> m Text
readFileUtf8 fp = do
bs <- readFileBinary fp
case decodeUtf8' bs of
Left e -> throwIO $ ReadFileUtf8Exception fp e
Right text -> return text
Writing a file
- No need to worry about char enc problems
Data.ByteString.writeFile
- Have text? Choose an encoding!
B.writeFile fp $ encodeUtf8 text
Copy a file
What's wrong with this code?
bs <- B.readFile inputFile
B.writeFile outputFile bs
Streaming
Here's a conduit solution
runConduitRes $ sourceFile inputFile
.| sinkFile outputFile
Or without ResourceT
:
withSourceFile inputFile $ \src ->
withSinkFile outputFile $ \sink ->
runConduit $ src .| sink
Why with pattern? We'll talk exceptions later
Generating large output
What's wrong with this code?
odds :: [Int]
odds = [1, 3..]
toLine :: Int -> String
toLine i = show i ++ "\n"
toLines :: [Int] -> String
toLines = foldr (\i rest -> toLine i ++ rest) ""
main :: IO ()
main = putStr $ toLines $ take 1000 odds
Strings!
Strict ByteString
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as B8
import Data.Monoid ((<>))
odds = [1, 3..]
toLine :: Int -> ByteString
toLine i = B8.pack (show i) <> "\n"
toLines :: [Int] -> ByteString
toLines = foldr (\i rest -> toLine i <> rest) B8.empty
main = B8.putStr $ toLines $ take 1000 odds
Problem? Quadratic complexity
Lazy ByteString
Avoid the buffer copies
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString.Lazy (ByteString)
import qualified Data.ByteString.Lazy.Char8 as BL8
import Data.Monoid ((<>))
odds = [1, 3..]
toLine :: Int -> ByteString
toLine i = BL8.pack (show i) <> "\n"
toLines :: [Int] -> ByteString
toLines = foldr (\i rest -> toLine i <> rest) BL8.empty
main = BL8.putStr $ toLines $ take 1000 odds
Still quadratic :(
Builders
Single-copy data structure, efficient Handle
interaction
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString.Builder (Builder, intDec, hPutBuilder)
import System.IO (stdout)
import Data.Monoid ((<>))
odds = [1, 3..]
toLine :: Int -> Builder
toLine i = intDec i <> "\n"
toLines :: [Int] -> Builder
toLines = foldr (\i rest -> toLine i <> rest) mempty
main = hPutBuilder stdout $ toLines $ take 1000 odds
Text builders
- Text also has a builder
- Much less useful overall, no direct output capabilities
- Downside to ByteString builders: have to assume a character encoding
- For console, may be a problem, but great for network or file I/O
Networking
- Low level libraries like network
- Use the
Network.Socket
API!
- Use the
- conduit-based helper functions on top of that
- WAI and Warp for web servers
- http-conduit for web clients
- Many other libraries out there too
Web server
- Lots of details here: https://github.com/fpco/applied-haskell/blob/master/web-services.md
- Who wants to go down the rabbit hole?
Web client
- https://haskell-lang.org/library/http-client
- Optional rabbit hole again
Network server with conduit
- FIXME include some code samples
Side adventure: unliftio
- Many more details on the motivation tomorrow
- Two packages
- unliftio-core provides a typeclass
- unliftio wraps a bunch of libraries with that type class
- If it's in unliftio: it's good to use, do it!
- Epic foreshadowment for tomorrow's presentation :)
Concurrency
- FIXME UnliftIO.Async
- https://haskell-lang.org/library/async
Mutable data
Exception handling
- Documentation still out of date
- We'll cover the why of things tomorrow
- Short answer: use
UnliftIO.Exception
External processes
- FIXME typed-process
- https://haskell-lang.org/library/typed-process