snoyman.com-content/reveal/whirlwind-tour-core-haskell-libraries.md
2018-01-08 06:55:50 -08:00

15 KiB

title
Whirlwind Tour of Core Haskell Libraries

Whirlwind Tour of Core Haskell Libraries

  • Michael Snoyman
  • LambdaConf Winter Retreat 2018

Today's format

  • Very informal
    • Interrupt!
    • Ask questions!
  • References to lots of external learning material
  • Can go into depth on any of that if desired
  • Happy to discuss the topics here at length
  • Mini hackathon as well? https://github.com/snoyberg/codename-karka

Haskell's standard library

  • Standard library is base
  • Includes standard prelude, Prelude
  • They both suck :(
    • Missing lots of functionality
    • Dangerous functions
  • Need to call out to other libraries for almost any program

Downsides to weak base

  • Which library to use for this functionality?
  • Dependency fear! I want my package to be lightweight
  • Mismatches in core types across ecosystem

Bonus problem: patterns

  • Other languages have "design patterns"
  • We don't need that in Haskell because types
  • Except: how do you handle effects like possible failure, or HTTP calls?
    • Throw it all in IO!
    • Concrete monad transformers
    • mtl-style typeclasses
    • Effect libraries...
  • Weak standard library => non-standard types => many different patterns

End result

  • Difficult for people new to the language to get started
  • Lack of standardization across team makes code bases difficult to maintain
  • Fear of dependencies ultimately leads to lots of reinvented functionality
    • Code bloat
    • More bugs

Today's topic

  • Cover a number of recommended libraries
    • Recommended by whom? Me :)
  • Discuss some best practices for putting projects together
  • Describe a new initiative to help bring this all together
  • Finally: how to help Haskell take over the world!

Features to cover

  • Data structures
  • I/O
  • Concurrency
  • Mutable data
  • Exception handling
  • External processes

Doesn't cover all needs, but most real programs will need almost all of these.


Libraries

  • base
  • bytestring
  • containers
  • deepseq
  • directory
  • filepath
  • hashable
  • microlens
  • mtl
  • text
  • time
  • typed-process
  • unliftio
  • unordered-containers
  • vector

Data structures

Three categories

  • Sequential data
  • Map/Dictionary
  • Set

Sequential data the most complicated, let's knock out the other two


Maps

  • Three core datatypes
    • data Map key value
    • data IntMap value
    • data HashMap key value
  • IntMap is a specialized, optimized Map Int
  • Map is a binary tree, HashMap is (surprise) hash map
  • Map requires Ord on keys, HashMap requires Hashable and Eq
  • Generally: HashMap performs better

Strict or lazy values

  • Maps are always strict in their keys
    • Forcing a Map requires forcing all of its keys
  • You can be lazy in the values if you want...
  • Usually: don't do that, use Data.Map.Strict et al

Mutability

  • Unlike other languages, Maps are immutable
  • Less used hashtables library provides in place mutation
  • Immutable is nice: don't worry about data races
  • Stick it inside a TVar, IORef, etc
  • Downside: performance is not as good

Map API Overview

import qualified Data.Map.Strict as Map
import qualified Data.IntMap.Strict as IntMap
import qualified Data.HashMap.Strict as HashMap

singleton :: k -> v -> Map k v
fromList :: [(k, v)] -> Map k v
toList :: Map k v -> [(k, v)]
lookup :: k -> Map k v -> Maybe v
insert :: k -> v -> Map k v -> Map k v
insertWith :: (v -> v -> v) -> k -> v -> Map k v -> Map k v
union :: Map k v -> Map k v -> Map k v
unionWith :: (v -> v -> v) -> Map k v -> Map k v -> Map k v

What about duplicates?


Sets

  • Just like Maps, but no values (or () is the value...)
  • No strict vs lazy difference... no values!
  • No worry about duplicate keys... no values!

Set API Overview

import qualified Data.Set as Set
import qualified Data.IntSet as IntSet
import qualified Data.HashSet as HashSet

singleton :: k -> Set k
fromList :: [k] -> Set k
toList :: Set k -> [k]
member :: k -> Set k -> Bool
insert :: k -> Set k -> Set k
union :: Set k -> Set k -> Set k

Calculating frequency

import qualified Data.ByteString.Lazy as BL
import qualified Data.Map.Strict as Map

main :: IO ()
main = do
  lbs <- BL.getContents
  let add m w = Map.insertWith (+) w 1 m
  mapM_ print $ Map.toList $ BL.foldl' add Map.empty lbs

More information


Sequential data

Everyone knows lists, right?

(++) :: [a] -> [a] -> [a]
concat :: [[a]] -> [a]
map :: (a -> b) -> [a] -> [b]
break :: (a -> Bool) -> [a] -> ([a], [a])
splitAt :: Int -> [a] -> ([a], [a])
null :: [a] -> Bool
length :: [a] -> Int
reverse :: [a] -> [a]
intercalate :: [a] -> [[a]] -> [a]
foldl' :: (b -> a -> b) -> b -> [a] -> b
and :: [Bool] -> Bool
sum :: Num a => [a] -> a
replicate :: Int -> a -> [a]

Lists: the good

  • Polymorphic on any contained value
  • Lazy/infinite
  • Cheap prepend (singly linked list)
  • Pure data structure
  • Built in syntactic sugar
  • Easy to pattern match

Lists: the bad

  • Lots of memory overhead
    • Data constructor per cons
    • Pointer to value (1 word)
    • Pointer to rest of list (1 word)
  • O(n) indexing
  • Can hide bottom values (1:2:undefined)
  • Consider overhead of [Word8] and [Char]

What do?


Other languages

Most languages have a few sequential data types

  • Linked list/doubly linked lists
  • Queue/double-ended queue
  • Array/vector

Haskell's plethora

  • Lists/difference lists
  • Seq
  • Arrays (don't bother, use vector)
  • Vector: boxed, storable, unboxed
  • ByteString: strict, lazy
  • Text: strict, lazy
  • ShortByteString (seriously?)

Why so many?


Haskell's memory model

We have four ways of storing sequential data

  • Entirely as normal heap objects (list, Seq, diff lists)
  • Primitive boxed arrays (boxed vector)
  • Unpinned memory (Text, ShortByteString, unboxed vector)
  • Pinned memory (ByteString, storable vector)

Heap objects

  • Pointers, pointers everywhere
  • Memory overhead for the allocations/GC
  • CPU overhead for following pointers

Primitive boxed arrays

  • Packed representation of pointers
  • Still follow pointers to the values
  • Pointers can point to thunks, which is why they're value lazy
  • Less pointer overhead, but still some
  • Allows any heap object to be stored

Unpinned memory

  • Byte array managed by garbage collector
  • GC can move it around
    • Reducing fragmentation
    • Can't pass it over FFI
  • Values stored as bytes
    • Must be representable as bytes
    • Must represent in fixed size
    • Cannot be lazy

Pinned memory

  • Standard malloc-style buffers
  • GC can't move it around
    • Can fragment memory (don't hold for too long)
    • Can pass it over FFI
  • Values stored as bytes, same as unpinned

Haskell's laziness

Three levels of laziness in these data structures

  • Fully lazy, both the values and the structure itself are lazy (e.g., list oo:bar:undefined)
  • Spine strict: values can be lazy, but not the structure (e.g., boxed vector, fromList [undefined])
  • Fully strict: nothing lazy (e.g., ByteString, this fails fromList [undefined])
  • Semi-strict: lazy list of strict chunks (lazy ByteString and Text)

Overlaps

That still leaves us with some overlaps

  • List vs diff list vs Seq: different time complexity for some operations
  • ShortByteString vs unboxed Vector Word8: same thing
  • ByteString vs storable Vector Word8: also same thing
  • Text does not overlap: it's a ShortByteString containing UTF-16 codepoints with a Char-based API

What to use?

  • ShortByteString: smaller and long lived
  • ByteString interacting with FFI (I/O)
  • Text for storing textual values
  • If it works and not FFI: unpinned vector
  • If it works and you need FFI: storable vector
  • Need spine laziness (e.g., infinite), use lists
  • Unusual optimizations
    • Cheap append and inspection: Seq
    • Cheap append: difference lists
  • Otherwise: boxed vector

The string problem

  • Lots of theoretical ways to represent string-like stuff
  • [Char], strict/lazy ByteString/Text, ShortByteString, Vector of Word8 or Char...
  • Vector of Char is always a bad idea: too much memory
  • Good that we have bytes vs text difference
  • Need to use Text instead of String everywhere... not there yet
  • Conclusion: use strict ByteString or Text almost everywhere, convert when necessary

Why not use...

  • Lazy ByteString or Text
    • Useful sometimes, like lazily generating large data
    • Mostly used for lazy I/O, which I advise against (use conduit instead)
  • Unboxed/storable vectors instead of ByteString/ShortByteString
    • Hysterical raisins, probably the right thing
    • Lots of Rust envy here :(
  • Lists everywhere
    • Performance

The good news

  • You know lists, right?
  • You basically know all of these data structures
  • Don't get overwhelmed with the choices, just follow the advice above

Qualified imports (the bad news?)

  • Since these all have similar APIs, names conflict
  • Use qualified imports
  • Recommended naming
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Vector as V -- boxed
import qualified Data.Vector.Unboxed as VU
import qualified Data.Vector.Storable as VS
import Data.ByteString.Short
  (ShortByteString, toShort, fromShort)

Further reading


I/O

  • Not the monad, actual input and output
  • Console
  • Files
  • Network
  • Streaming and in memory

The bad: character I/O

  • Implicit character decoding
  • Newline handling
  • Environment variables affect things
  • For console: probably right
  • File I/O: probably wrong
  • Network I/O: lol nope

https://www.snoyman.com/blog/2016/12/beware-of-readfile


The bad: lazy I/O

  • Lazy readFile (et al)
    • Hides exceptions till later
    • Keeps file descriptors open longer than expected
  • Answer: use strict I/O operations
  • Dealing with large data? Use conduit
  • Exception to the rule: lazy writeFile is fine (not actually lazy I/O)

Reading a file

  • Wants bytes? Data.ByteString.readFile
  • Want text? Choose an encoding!
    • decodeUtf8With lenientDecode <$> B.readFile fp
  • Let's get crazy
readFileUtf8 :: MonadIO m => FilePath -> m Text
readFileUtf8 fp = do
  bs <- readFileBinary fp
  case decodeUtf8' bs of
    Left e -> throwIO $ ReadFileUtf8Exception fp e
    Right text -> return text

Writing a file

  • No need to worry about char enc problems
  • Data.ByteString.writeFile
  • Have text? Choose an encoding!
    • B.writeFile fp $ encodeUtf8 text

Copy a file

What's wrong with this code?

bs <- B.readFile inputFile
B.writeFile outputFile bs

Streaming

Here's a conduit solution

runConduitRes $ sourceFile inputFile
             .| sinkFile outputFile

Or without ResourceT:

withSourceFile inputFile $ \src ->
withSinkFile outputFile $ \sink ->
  runConduit $ src .| sink

Why with pattern? We'll talk exceptions later


Generating large output

What's wrong with this code?

odds :: [Int]
odds = [1, 3..]

toLine :: Int -> String
toLine i = show i ++ "\n"

toLines :: [Int] -> String
toLines = foldr (\i rest -> toLine i ++ rest) ""

main :: IO ()
main = putStr $ toLines $ take 1000 odds

Strings!


Strict ByteString

{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as B8
import Data.Monoid ((<>))

odds = [1, 3..]

toLine :: Int -> ByteString
toLine i = B8.pack (show i) <> "\n"

toLines :: [Int] -> ByteString
toLines = foldr (\i rest -> toLine i <> rest) B8.empty

main = B8.putStr $ toLines $ take 1000 odds

Problem? Quadratic complexity


Lazy ByteString

Avoid the buffer copies

{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString.Lazy (ByteString)
import qualified Data.ByteString.Lazy.Char8 as BL8
import Data.Monoid ((<>))

odds = [1, 3..]

toLine :: Int -> ByteString
toLine i = BL8.pack (show i) <> "\n"

toLines :: [Int] -> ByteString
toLines = foldr (\i rest -> toLine i <> rest) BL8.empty

main = BL8.putStr $ toLines $ take 1000 odds

Still quadratic :(


Builders

Single-copy data structure, efficient Handle interaction

{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString.Builder (Builder, intDec, hPutBuilder)
import System.IO (stdout)
import Data.Monoid ((<>))

odds = [1, 3..]

toLine :: Int -> Builder
toLine i = intDec i <> "\n"

toLines :: [Int] -> Builder
toLines = foldr (\i rest -> toLine i <> rest) mempty

main = hPutBuilder stdout $ toLines $ take 1000 odds

Text builders

  • Text also has a builder
  • Much less useful overall, no direct output capabilities
  • Downside to ByteString builders: have to assume a character encoding
  • For console, may be a problem, but great for network or file I/O

Networking

  • Low level libraries like network
    • Use the Network.Socket API!
  • conduit-based helper functions on top of that
  • WAI and Warp for web servers
  • http-conduit for web clients
  • Many other libraries out there too

Web server


Web client


Network server with conduit

Simple echo server example

{-# LANGUAGE OverloadedStrings #-}
import Data.Conduit
import Data.Conduit.Network

main :: IO ()
main = runTCPServer (serverSettings 3001 "127.0.0.1") echo

echo :: AppData -> IO ()
echo app = runConduit $ appSource app .| appSink app

Side adventure: unliftio

  • Many more details on the motivation tomorrow
  • Two packages
    • unliftio-core provides a typeclass
    • unliftio wraps a bunch of libraries with that type class
  • If it's in unliftio: it's good to use, do it!
  • Epic foreshadowment for tomorrow's presentation :)

Mutable data


Concurrency


Exception handling

  • Documentation still out of date
  • We'll cover the why of things tomorrow
  • Short answer: use UnliftIO.Exception

External processes


Structuring applications


Random grab bag