parsec-presentation/parsec/content/020_intro.md
Yann Esposito (Yogsototh) 3387bb886c better git ignore + update
2013-10-07 22:34:27 +02:00

4.1 KiB

Parsing

Latin pars (ōrātiōnis), meaning part (of speech).

  • analysing a string of symbols
  • formal grammar.

Parsing in Programming Languages

Complexity:

Method Typical Example Output Data Structure


Splitting CSV Array, Map Regexp email + Fixed Layout Tree Parser Programming language + Most Data Structure

Parser & culture

In Haskell Parser are really easy to use.

Generally:

  • In most languages: split then regexp then parse
  • In Haskell: split then parse

Parsing Example

From String:

(1+3)*(1+5+9)

To data structure:

AST\

Parsec

Parsec lets you construct parsers by combining high-order Combinators to create larger expressions.

Combinator parsers are written and used within the same programming language as the rest of the program.

The parsers are first-class citizens of the languages [...]"

Parser Libraries

In reality there are many choices:


attoparsec fast Bytestring-lexing fast Parsec 3 powerful, nice error reporting


Haskell Remarks (1)

spaces are meaningful

f x   -- ⇔ f(x) in C-like languages
f x y -- ⇔ f(x,y)

Haskell Remarks (2)

Don't mind strange operators (<*>, <$>).
Consider them like separators, typically commas.
They are just here to deal with types.

Informally:

toto <$> x <*> y <*> z
    -- ⇔ toto x y z
    -- ⇔ toto(x,y,z) in C-like languages

A Parsec Example

whitespaces = many (oneOf "\t ")
number = many1 digit
symbol = oneOf "!#$%&|*+-/:<=>?@^_~"
"   \t "  -- whitespaces on "   \t "
""        -- whitespaces on "32"
"32"      -- number on "32"

-- number on "   \t 32  "
"number" (line 1, column 1):
unexpected " "
expecting digit

Comparison with Regexp (Parsec)

data IP = IP Word8 Word8 Word8 Word8
ip = IP <$>
       number <*  char '.' <*>
       number <*  char '.' <*>
       number <*  char '.' <*>
       number
number = do
    x <- read <$> many1 digit
    guard (0 <= x && x < 256)
    return (fromIntegral x)

Comparison with Regexp (Perl Regexp)

# remark: 888.999.999.999 is accepted
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b

# exact but difficult to read
\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
  (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b

Also, regexp are unityped by nature.

Monadic style

number :: Parser String
number = many1 digit

number' :: Parser Int
number' = do
    -- digitString :: String
    digitString <- many1 digit
    return (read digitString)
"32" :: [Char]  -- number on "32"
32   :: Int     -- number' on "32"

Combining Monadic style (S = aSb | ε)

s = (do
        a <- string "a"
        mid <- s
        b <- string "b"
        return (a ++ mid ++ b)
    <|> string ""

""          -- s on ""
"aaabbb"    -- s on "aaabbb"
"aabb"      -- s on "aabbb"
-- s on "aaabb"
S (line1 1, column 4):
unexpected end of input
expecting "b"

Combining Applicative style (S = aSb | ε)

s = concat3 <$> string "a" <*> s <*> char "b"
    <|> string ""
    where
        concat3 x y z = x ++ y ++ z

Applicative Style usefull with Data types

data IP = IP Int Int Int Int

parseIP = IP <$>
            number <*  char '.' <*>
            number <*  char '.' <*>
            number <*  char '.' <*>
            number

monadicParseIP = do
    d1 <- number
    char '.'
    d2 <- number
    char '.'
    d3 <- number
    char '.'
    d4 <- number
    return (IP d1 d2 d3 d4)

Write number correctly

number :: Parser Int
number = do
    x <- read <$> many1 digit
    guard (0 <= x && x < 256) <?>
        "Number between 0 and 255 (here " ++ show x ++ ")"
    return (fromIntegral x)
>>> test parseIP "parseIP" "823.32.80.113"
"parseIP" (line 1, column 4):
unexpected "."
expecting digit or Number between 0 and 255 (here 823)