From 711af7a90688e6007f43a22826ef01ffd62d1147 Mon Sep 17 00:00:00 2001 From: Michael Snoyman Date: Sun, 7 May 2017 08:09:10 +0300 Subject: [PATCH] Blog post: worst function in Conduit --- posts.yaml | 3 + posts/worst-function-in-conduit.md | 246 +++++++++++++++++++++++++++++ 2 files changed, 249 insertions(+) create mode 100644 posts/worst-function-in-conduit.md diff --git a/posts.yaml b/posts.yaml index 91b2496..feab494 100644 --- a/posts.yaml +++ b/posts.yaml @@ -1,3 +1,6 @@ +- file: posts/worst-function-in-conduit.md + title: The Worst Function in Conduit + time: 2017-05-07T05:10:00Z - file: posts/stackages-no-revisions-field.md title: "Stackage's no-revisions (experimental) field" time: 2017-04-27T06:00:00Z diff --git a/posts/worst-function-in-conduit.md b/posts/worst-function-in-conduit.md new file mode 100644 index 0000000..ebb3b45 --- /dev/null +++ b/posts/worst-function-in-conduit.md @@ -0,0 +1,246 @@ +This blog post addresses a long-standing FIXME in the +conduit-combinators documentation, as well as +[a question on Twitter](https://twitter.com/bitemyapp/status/860919651957710849). This +blog post will assume familiarity with the Conduit streaming data +library; if you'd like to read up on it first, please +[check out the tutorial](https://haskell-lang.org/library/conduit). The +full executable snippet is at the end of this blog post, but we'll +build up intermediate bits along the way. First, the +[Stack](https://haskell-lang.org/get-started) script header, import +statement, and some minor helper functions. + +```haskell +#!/usr/bin/env stack +--stack --resolver lts-8.12 script +import Conduit + +src10 :: Monad m => ConduitM i Int m () +src10 = yieldMany [1..10] + +remaining :: MonadIO m => ConduitM i o m () +remaining = lengthC >>= \x -> liftIO (putStrLn ("Remaining: " ++ show x)) +``` + +`src10` just provides the numbers 1 through 10 as a source, and +`remaining` tells you how many values are remaining from +upstream. Cool. + +Now let's pretend that the Conduit libraries completely forgot to +provide a `drop` function. That is, a function that will take an `Int` +and discard that many values from the upstream. We could write one +ourselves pretty easily: + +```haskell +dropSink :: Monad m => Int -> ConduitM i o m () +dropSink cnt + | cnt <= 0 = return () + | otherwise = await >> dropSink (cnt - 1) +``` + +(Bonus points to readers: this function is inefficient in the case +that upstream has less than `cnt` values, optimize it.) + +This function will drop a certain number of elements from upstream, so +the next component we monadically bind with can pick it up. Let's see +how that looks: + +```haskell +goodDropSink :: IO () +goodDropSink = runConduit + $ src10 + .| (dropSink 5 >> remaining) +``` + +All well and good. But notice two things: + +* I called this `dropSink`. Why sink? +* I stressed that we had to monadically bind. Why? + +Well, there's another formulation of this drop function. Instead of +letting the next monadically bound component pick up remaining values, +we could _pass the remaining values downstream_. Fortunately it's +really easy to implement this function in terms of `dropSink`: + +```haskell +dropTrans :: Monad m => Int -> ConduitM i i m () +dropTrans cnt = dropSink cnt >> mapC id +``` + +(For more meaningless bonus points, feel free to implement this +without `dropSink`, or for a greater challenge, implement `dropSink` +in terms of `dropTrans`.) Anyway, this function can be used easily as: + +```haskell +goodDropTrans :: IO () +goodDropTrans = runConduit + $ src10 + .| dropTrans 5 + .| remaining +``` + +Many may argue that this is more natural. To some extent, it mirrors +the behavior of `take` more closely, as `take` passes the initial +values downstream. On the other hand, `dropTrans` cannot _guarantee_ +that the values will be removed from the stream; if instead of +`dropTrans 5 .| remaining` I simply did `dropTrans 5 .| return ()`, +then the `dropTrans` would never have a chance to fire, since +execution is driven from downstream. Also, as demonstrated, it's +really easy to capture this transformer behavior from the sink +behavior; the other way is trickier. + +My point here is that we have two legitimate definitions of a +function. And from my experience, different people expect different +behavior for the function. In fact, some people (myself included) +intuitively expect different behavior _depending on the circumstance_! +This is what earns `drop` the title of worst function in conduit. + +To make it even more clear how bad this is, let's see how you can +misuse these functions unintentionally. + +```haskell +badDropSink :: IO () +badDropSink = runConduit + $ src10 + .| dropSink 5 + .| remaining +``` + +This code looks perfectly reasonable, and if we just replaced +`dropSink` with `dropTrans`, it would be correct. But instead of +saying, as expected, that we have 5 values remaining, this will +print 0. The reason: `src10` yields 10 values to +`dropSink`. `dropSink` drops 5 of those and leaves the remaining 5 +untouched. But `dropSink` never itself yields a value downstream, so +`remaining` receives nothing. + +Because of the type system, it's slightly trickier to misuse +`dropTrans`. Let's first do the naive thing of just assuming it's +`dropSink`: + +```haskell +badDropTrans :: IO () +badDropTrans = runConduit + $ src10 + .| (dropTrans 5 >> remaining) +``` + +GHC does not like this one bit: + +``` +error: + • Couldn't match type ‘Int’ with ‘Data.Void.Void’ + Expected type: ConduitM () Data.Void.Void IO () + Actual type: ConduitM () Int IO () +``` + +The problem is that `runConduit` expects a pipeline where the final +output value is `Void`. However, `dropTrans` has an output value of +type `Int`. And if it's yielding `Int`s, so must `remaining`. This is +definitely an argument in favor of `dropTrans` being the better +function: the type system helps us a bit. (It's also an argument in +favor of keeping +[the type signature of `runConduit` as-is](http://www.snoyman.com/blog/2017/04/generalizing-type-signatures).) + +However, it's still possible to accidentally screw things up in bigger +pipelines, e.g.: + +```haskell +badDropTrans :: IO () +badDropTrans = runConduit + $ src10 + .| (dropTrans 5 >> remaining) + .| (sinkList >>= liftIO . print) +``` + +This code may look a bit contrived, but in real-world Conduit code +it's not at all uncommon to deeply nest these components in such a way +that the error would not be present. You may be surprised to hear that +the output of this program is: + +``` +Remaining: 0 +[6,7,8,9,10] +``` + +The reason is that the `sinkList` is downstream from `dropTrans`, and +grabs all of its output. `dropTrans` itself will drain all output from +`src10`, leaving nothing behind for `remaining` to grab. + +The Conduit libraries use the `dropSink` variety of function. I wish +there was a better approach here that felt more intuitive to +everyone. The closest I can think of to that is deprecating `drop` and +replacing it with more explicitly named `dropSink` and `dropTrans`, +but I'm not sure how I feel about that (feedback welcome, and other +ideas _certainly_ welcome). + +* * * + +Full code + +```haskell +#!/usr/bin/env stack +--stack --resolver lts-8.12 script +import Conduit + +dropSink :: Monad m => Int -> ConduitM i o m () +dropSink cnt + | cnt <= 0 = return () + | otherwise = await >> dropSink (cnt - 1) + +dropTrans :: Monad m => Int -> ConduitM i i m () +dropTrans cnt = dropSink cnt >> mapC id + +src10 :: Monad m => ConduitM i Int m () +src10 = yieldMany [1..10] + +remaining :: MonadIO m => ConduitM i o m () +remaining = lengthC >>= \x -> liftIO (putStrLn ("Remaining: " ++ show x)) + +goodDropSink :: IO () +goodDropSink = runConduit + $ src10 + .| (dropSink 5 >> remaining) + +badDropSink :: IO () +badDropSink = runConduit + $ src10 + .| dropSink 5 + .| remaining + +goodDropTrans :: IO () +goodDropTrans = runConduit + $ src10 + .| dropTrans 5 + .| remaining + +badDropTrans :: IO () +badDropTrans = runConduit + $ src10 + .| (dropTrans 5 >> remaining) + .| (sinkList >>= liftIO . print) + +main :: IO () +main = do + putStrLn "Good drop sink" + goodDropSink + putStrLn "Bad drop sink" + badDropSink + putStrLn "Good drop trans" + goodDropTrans + putStrLn "Bad drop trans" + badDropTrans +``` + +Full output + +``` +Good drop sink +Remaining: 5 +Bad drop sink +Remaining: 0 +Good drop trans +Remaining: 5 +Bad drop trans +Remaining: 0 +[6,7,8,9,10] +```