snoyman.com-content/posts/worst-function-in-conduit.md
2017-05-07 08:09:24 +03:00

This blog post addresses a long-standing FIXME in the
conduit-combinators documentation, as well as
[a question on Twitter](https://twitter.com/bitemyapp/status/860919651957710849). This
blog post will assume familiarity with the Conduit streaming data
library; if you'd like to read up on it first, please
[check out the tutorial](https://haskell-lang.org/library/conduit). The
full executable snippet is at the end of this blog post, but we'll
build up intermediate bits along the way. First, the
[Stack](https://haskell-lang.org/get-started) script header, import
statement, and some minor helper functions.
```haskell
#!/usr/bin/env stack
--stack --resolver lts-8.12 script
import Conduit
src10 :: Monad m => ConduitM i Int m ()
src10 = yieldMany [1..10]
remaining :: MonadIO m => ConduitM i o m ()
remaining = lengthC >>= \x -> liftIO (putStrLn ("Remaining: " ++ show x))
```
`src10` just provides the numbers 1 through 10 as a source, and
`remaining` counts the values still available from upstream (consuming
them in the process) and prints that count. Cool.

Now let's pretend that the Conduit libraries completely forgot to
provide a `drop` function, that is, a function that takes an `Int`
and discards that many values from upstream. We could write one
ourselves pretty easily:
```haskell
dropSink :: Monad m => Int -> ConduitM i o m ()
dropSink cnt
    | cnt <= 0 = return ()
    | otherwise = await >> dropSink (cnt - 1)
```
(Bonus points to readers: this function is inefficient when upstream
has fewer than `cnt` values, since it keeps calling `await` after
upstream is exhausted. Optimize it.)
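
One possible answer to that exercise, as a minimal sketch: inspect the result of `await` and stop as soon as it returns `Nothing`, which signals that upstream is exhausted. The name `dropSinkEarly` is made up for this illustration and is not part of any Conduit library; the snippet assumes the same `Conduit` module used throughout this post.

```haskell
import Conduit

-- Hypothetical variant of dropSink (the name is illustrative only).
-- It stops as soon as upstream runs dry instead of calling await
-- cnt times regardless.
dropSinkEarly :: Monad m => Int -> ConduitM i o m ()
dropSinkEarly cnt
    | cnt <= 0 = return ()
    | otherwise = await >>= maybe (return ()) (\_ -> dropSinkEarly (cnt - 1))
```

With this version, `dropSinkEarly 1000000` on a three-element stream performs four `await`s (three values plus the final `Nothing`) rather than a million.
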

`dropSink` drops a certain number of elements from upstream, so
the next component we monadically bind with can pick up the rest. Let's see
how that looks:
```haskell
goodDropSink :: IO ()
goodDropSink = runConduit
    $ src10
   .| (dropSink 5 >> remaining)
```
All well and good. But notice two things:

* I called this `dropSink`. Why sink?
* I stressed that we had to monadically bind. Why?

Well, there's another formulation of this drop function. Instead of
letting the next monadically bound component pick up remaining values,
we could _pass the remaining values downstream_. Fortunately it's
really easy to implement this function in terms of `dropSink`:
```haskell
dropTrans :: Monad m => Int -> ConduitM i i m ()
dropTrans cnt = dropSink cnt >> mapC id
```
(For more meaningless bonus points, feel free to implement this
without `dropSink`, or for a greater challenge, implement `dropSink`
in terms of `dropTrans`.) Anyway, this function can be used easily as:
```haskell
goodDropTrans :: IO ()
goodDropTrans = runConduit
    $ src10
   .| dropTrans 5
   .| remaining
```
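
For the two bonus exercises mentioned just before this example, here is one possible sketch. `dropTransDirect` and `dropSinkViaTrans` are names invented for this illustration. The second one relies on execution being driven from downstream: the inner sink asks for exactly one value, which forces `dropTransDirect (cnt - 1)` to discard `cnt - 1` values and forward one more, so `cnt` values are consumed in total.

```haskell
import Conduit

-- dropTrans written without dropSink: skip cnt values, then forward
-- everything else downstream. (Hypothetical name.)
dropTransDirect :: Monad m => Int -> ConduitM i i m ()
dropTransDirect cnt
    | cnt <= 0 = awaitForever yield
    | otherwise = await >>= maybe (return ()) (\_ -> dropTransDirect (cnt - 1))

-- dropSink written in terms of the transformer. (Hypothetical name.)
-- The inner sink pulls a single value and then stops, so the transformer
-- is driven just far enough to consume cnt values from upstream.
dropSinkViaTrans :: Monad m => Int -> ConduitM i o m ()
dropSinkViaTrans cnt
    | cnt <= 0 = return ()
    | otherwise = dropTransDirect (cnt - 1) .| (await >> return ())
```

Both can be dropped into the earlier examples: `dropTransDirect 5 .| remaining` and `dropSinkViaTrans 5 >> remaining` should each report 5 remaining values for `src10`.
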
Many may argue that the `dropTrans` formulation is more natural. To some extent, it mirrors
the behavior of `take` more closely, as `take` passes the initial
values downstream. On the other hand, `dropTrans` cannot _guarantee_
that the values will be removed from the stream; if instead of
`dropTrans 5 .| remaining` I simply did `dropTrans 5 .| return ()`,
then the `dropTrans` would never have a chance to fire, since
execution is driven from downstream. Also, as demonstrated, it's
really easy to capture this transformer behavior from the sink
behavior; the other way is trickier.
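
To make the "never has a chance to fire" point concrete, here is a small pure sketch that reuses the post's `dropSink` and `dropTrans` and counts what is left afterwards. The names `neverFires` and `alwaysFires` are illustrative only.

```haskell
import Conduit

dropSink :: Monad m => Int -> ConduitM i o m ()
dropSink cnt
    | cnt <= 0 = return ()
    | otherwise = await >> dropSink (cnt - 1)

dropTrans :: Monad m => Int -> ConduitM i i m ()
dropTrans cnt = dropSink cnt >> mapC id

-- The inner pipeline's downstream finishes without ever awaiting, so
-- dropTrans is never run; lengthC then counts all ten values.
neverFires :: Int
neverFires = runConduitPure
    $ yieldMany [1 .. 10 :: Int]
   .| ((dropTrans 5 .| return ()) >> lengthC)

-- dropSink does its work as soon as it is run, so five values remain.
alwaysFires :: Int
alwaysFires = runConduitPure
    $ yieldMany [1 .. 10 :: Int]
   .| (dropSink 5 >> lengthC)
```

Here `neverFires` evaluates to 10 and `alwaysFires` to 5.
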

My point here is that we have two legitimate definitions of a
function. And from my experience, different people expect different
behavior for the function. In fact, some people (myself included)
intuitively expect different behavior _depending on the circumstance_!
This is what earns `drop` the title of worst function in conduit.

To make it even more clear how bad this is, let's see how you can
misuse these functions unintentionally.
```haskell
badDropSink :: IO ()
badDropSink = runConduit
    $ src10
   .| dropSink 5
   .| remaining
```
This code looks perfectly reasonable, and if we just replaced
`dropSink` with `dropTrans`, it would be correct. But instead of
saying, as expected, that we have 5 values remaining, this will
print 0. The reason: `src10` yields 10 values to
`dropSink`. `dropSink` drops 5 of those and leaves the remaining 5
untouched. But `dropSink` never itself yields a value downstream, so
`remaining` receives nothing.
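
The explanation above can be checked with a pure sketch that counts values in two places: downstream of `dropSink`, and in the component monadically bound after the whole inner pipeline. The name `badDropSinkCounts` is made up for this illustration.

```haskell
import Conduit

dropSink :: Monad m => Int -> ConduitM i o m ()
dropSink cnt
    | cnt <= 0 = return ()
    | otherwise = await >> dropSink (cnt - 1)

-- First component: values seen downstream of dropSink.
-- Second component: values left over for the next bound component.
badDropSinkCounts :: (Int, Int)
badDropSinkCounts = runConduitPure
    $ yieldMany [1 .. 10 :: Int]
   .| do
        downstream <- dropSink 5 .| lengthC
        leftover <- lengthC
        return (downstream, leftover)
```

`badDropSinkCounts` evaluates to `(0, 5)`: nothing ever flows downstream of `dropSink`, while the five untouched values are still waiting upstream.
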

Because of the type system, it's slightly trickier to misuse
`dropTrans`. Let's first do the naive thing of just assuming it's
`dropSink`:
```haskell
badDropTrans :: IO ()
badDropTrans = runConduit
    $ src10
   .| (dropTrans 5 >> remaining)
```
GHC does not like this one bit:
```
error:
    • Couldn't match type Int with Data.Void.Void
      Expected type: ConduitM () Data.Void.Void IO ()
        Actual type: ConduitM () Int IO ()
```
The problem is that `runConduit` expects a pipeline where the final
output value is `Void`. However, `dropTrans` has an output value of
type `Int`. And if it's yielding `Int`s, so must `remaining`. This is
definitely an argument in favor of `dropTrans` being the better
function: the type system helps us a bit. (It's also an argument in
favor of keeping
[the type signature of `runConduit` as-is](http://www.snoyman.com/blog/2017/04/generalizing-type-signatures).)

However, it's still possible to accidentally screw things up in bigger
pipelines, e.g.:
```haskell
badDropTrans :: IO ()
badDropTrans = runConduit
    $ src10
   .| (dropTrans 5 >> remaining)
   .| (sinkList >>= liftIO . print)
```
This code may look a bit contrived, but in real-world Conduit code
it's not at all uncommon to nest these components deeply enough
that no type error flags the mistake. You may be surprised to hear that
the output of this program is:
```
Remaining: 0
[6,7,8,9,10]
```
The reason is that the `sinkList` is downstream from `dropTrans`, and
grabs all of its output. `dropTrans` itself will drain all output from
`src10`, leaving nothing behind for `remaining` to grab.
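
The same behavior can be observed without any IO. The sketch below uses `fuseBoth` from `Data.Conduit`, which returns the results of both the upstream and the downstream component; `badDropTransResult` is a name invented for this illustration.

```haskell
import Conduit
import Data.Conduit (fuseBoth)

dropSink :: Monad m => Int -> ConduitM i o m ()
dropSink cnt
    | cnt <= 0 = return ()
    | otherwise = await >> dropSink (cnt - 1)

dropTrans :: Monad m => Int -> ConduitM i i m ()
dropTrans cnt = dropSink cnt >> mapC id

-- First component: what lengthC counts after dropTrans has finished.
-- Second component: what sinkList collects downstream.
badDropTransResult :: (Int, [Int])
badDropTransResult = runConduitPure
    $ yieldMany [1 .. 10 :: Int]
   .| fuseBoth (dropTrans 5 >> lengthC) sinkList
```

`badDropTransResult` evaluates to `(0, [6,7,8,9,10])`, matching the printed output above.
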

The Conduit libraries use the `dropSink` variety of function. I wish
there were a better approach here that felt more intuitive to
everyone. The closest I can think of is deprecating `drop` and
replacing it with the more explicitly named `dropSink` and `dropTrans`,
but I'm not sure how I feel about that (feedback welcome, and other
ideas _certainly_ welcome).
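
As a rough sketch of what that renaming could look like on top of the existing library function: `dropC` from the `Conduit` module has the sink behavior, and the transformer variant follows from it as before. The names `dropSink` and `dropTrans` here are the hypothetical ones from this post, not functions the library exports.

```haskell
import Conduit

-- Sink flavor: simply the library's dropC under an explicit name.
dropSink :: Monad m => Int -> ConduitM i o m ()
dropSink cnt = dropC cnt

-- Transformer flavor: drop, then pass everything else downstream.
dropTrans :: Monad m => Int -> ConduitM i i m ()
dropTrans cnt = dropC cnt >> mapC id
```

The explicit names would at least make the call site say which behavior is intended: `dropSink 5 >> next` versus `dropTrans 5 .| next`.
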
* * *
Full code
```haskell
#!/usr/bin/env stack
--stack --resolver lts-8.12 script
import Conduit

dropSink :: Monad m => Int -> ConduitM i o m ()
dropSink cnt
    | cnt <= 0 = return ()
    | otherwise = await >> dropSink (cnt - 1)

dropTrans :: Monad m => Int -> ConduitM i i m ()
dropTrans cnt = dropSink cnt >> mapC id

src10 :: Monad m => ConduitM i Int m ()
src10 = yieldMany [1..10]

remaining :: MonadIO m => ConduitM i o m ()
remaining = lengthC >>= \x -> liftIO (putStrLn ("Remaining: " ++ show x))

goodDropSink :: IO ()
goodDropSink = runConduit
    $ src10
   .| (dropSink 5 >> remaining)

badDropSink :: IO ()
badDropSink = runConduit
    $ src10
   .| dropSink 5
   .| remaining

goodDropTrans :: IO ()
goodDropTrans = runConduit
    $ src10
   .| dropTrans 5
   .| remaining

badDropTrans :: IO ()
badDropTrans = runConduit
    $ src10
   .| (dropTrans 5 >> remaining)
   .| (sinkList >>= liftIO . print)

main :: IO ()
main = do
    putStrLn "Good drop sink"
    goodDropSink
    putStrLn "Bad drop sink"
    badDropSink
    putStrLn "Good drop trans"
    goodDropTrans
    putStrLn "Bad drop trans"
    badDropTrans
```
Full output
```
Good drop sink
Remaining: 5
Bad drop sink
Remaining: 0
Good drop trans
Remaining: 5
Bad drop trans
Remaining: 0
[6,7,8,9,10]
```