layout: post
title: “Improving the parser library”
description: “Adding more informative errors”
categories: [Patterns]
seriesId: “Understanding Parser Combinators”

seriesOrder: 3

UPDATE: Slides and video from my talk on this topic

In this series, we are looking at how applicative parsers and parser combinators work.

  • In the first post, we created the foundations of a parsing library.
  • In the second post, we built out the library with many other useful combinators.
  • In this post, we’ll rework the library to provide more helpful error messages.

1. Labelling a Parser

In some of the failing code examples from earlier posts, we got confusing errors:

  1. let parseDigit = anyOf ['0'..'9']
  2. run parseDigit "|ABC" // Failure "Expecting '9'. Got '|'"

parseDigit is defined as a choice of digit characters, so when the last choice ('9') fails, that is the error message we receive.

But that message is quite confusing. What we really want is to receive is an error that mentions “digit”, something like: Failure "Expecting digit. Got '|'".

That is, what we need is a way of labeling parsers with a word like “digit” and then showing that label when a failure occurs.

As a reminder, this is how the Parser type was defined in earlier posts:

  1. type Parser<'a> = Parser of (string -> Result<'a * string>)

In order to add a label, we need to change it into a record structure:

  1. type ParserLabel = string
  2. /// A Parser structure has a parsing function & label
  3. type Parser<'a> = {
  4. parseFn : (string -> Result<'a * string>)
  5. label: ParserLabel
  6. }

The record contains two fields: the parsing function (parseFn) and the label.

One problem is that the label is in the parser itself, but not in the Result, which means that clients will not know how to display the label along with the error.

So let’s add it to the Failure case of Result as well, in addition to the error message:

  1. // Aliases
  2. type ParserLabel = string
  3. type ParserError = string
  4. type Result<'a> =
  5. | Success of 'a
  6. | Failure of ParserLabel * ParserError

And while we are at it, let’s define a helper function to display the result of a parse:

  1. let printResult result =
  2. match result with
  3. | Success (value,input) ->
  4. printfn "%A" value
  5. | Failure (label,error) ->
  6. printfn "Error parsing %s\n%s" label error

Updating the code

With this change to the definition of Parser and Result, we have to change some of the basic functions, such as bindP:

  1. /// "bindP" takes a parser-producing function f, and a parser p
  2. /// and passes the output of p into f, to create a new parser
  3. let bindP f p =
  4. let label = "unknown" // <====== "label" is new!
  5. let innerFn input =
  6. ...
  7. match result1 with
  8. | Failure (label,err) -> // <====== "label" is new!
  9. ...
  10. | Success (value1,remainingInput) ->
  11. ...
  12. {parseFn=innerFn; label=label} // <====== "parseFn" and "label" are new!

We have to make similar changes to returnP, orElse, and many. For the complete code, see the gist linked to below.

Updating the label

When we use a combinator to build a new compound parser, we will often want to assign a new label to it.
In order to do this, we replace the original parseFn with another one that returns the new label.

Here’s the code:

  1. /// Update the label in the parser
  2. let setLabel parser newLabel =
  3. // change the inner function to use the new label
  4. let newInnerFn input =
  5. let result = parser.parseFn input
  6. match result with
  7. | Success s ->
  8. // if Success, do nothing
  9. Success s
  10. | Failure (oldLabel,err) ->
  11. // if Failure, return new label
  12. Failure (newLabel,err) // <====== use newLabel here
  13. // return the Parser
  14. {parseFn=newInnerFn; label=newLabel} // <====== use newLabel here

And let’s create an infix version of this called <?>:

  1. /// infix version of setLabel
  2. let ( <?> ) = setLabel

Let’s test our new toy!

  1. let parseDigit_WithLabel =
  2. anyOf ['0'..'9']
  3. <?> "digit"
  4. run parseDigit_WithLabel "|ABC"
  5. |> printResult

And the output is:

  1. Error parsing digit
  2. Unexpected '|'

The error message is now Error parsing digit rather than Expecting '9'. Much better!

Setting default labels:

We can also set the default labels for certain combinators such as andThen and orElse based on the inputs:

  1. /// Combine two parsers as "A andThen B"
  2. let andThen p1 p2 =
  3. let label = sprintf "%s andThen %s" (getLabel p1) (getLabel p2)
  4. p1 >>= (fun p1Result ->
  5. p2 >>= (fun p2Result ->
  6. returnP (p1Result,p2Result) ))
  7. <?> label // <====== provide a custom label
  8. // combine two parsers as "A orElse B"
  9. let orElse parser1 parser2 =
  10. // construct a new label
  11. let label = // <====== provide a custom label
  12. sprintf "%s orElse %s" (getLabel parser1) (getLabel parser2)
  13. let innerFn input =
  14. ... etc ...
  15. /// choose any of a list of characters
  16. let anyOf listOfChars =
  17. let label = sprintf "any of %A" listOfChars
  18. listOfChars
  19. |> List.map pchar
  20. |> choice
  21. <?> label // <====== provide a custom label

2. Replacing “pchar” with “satisfy”

One thing that has bothered me about all the implementations so far is pchar, the basic primitive that all the other functions have built on.

I don’t like that it is so tightly coupled to the input model. What happens if we want to parse bytes from a binary format, or other kinds of input.
All the combinators other than pchar are loosely coupled. If we could decouple pchar as well,
we would be set up for parsing any stream of tokens, and that would make me happy!

At this point, I’ll repeat one of my favorite FP slogans: “parameterize all the things!” In the case of pchar, we’ll remove the charToMatch parameter and
replace it with a function — a predicate. We’ll call the new function satisfy:

  1. /// Match an input token if the predicate is satisfied
  2. let satisfy predicate label =
  3. let innerFn input =
  4. if String.IsNullOrEmpty(input) then
  5. Failure (label,"No more input")
  6. else
  7. let first = input.[0]
  8. if predicate first then // <====== use predicate here
  9. let remainingInput = input.[1..]
  10. Success (first,remainingInput)
  11. else
  12. let err = sprintf "Unexpected '%c'" first
  13. Failure (label,err)
  14. // return the parser
  15. {parseFn=innerFn;label=label}

Other than the parameters, the only thing that has changed from the pchar implementation is this one line:

  1. let satisfy predicate label =
  2. ...
  3. if predicate first then
  4. ...

With satisfy available, we can rewrite pchar:

  1. /// parse a char
  2. let pchar charToMatch =
  3. let predicate ch = (ch = charToMatch)
  4. let label = sprintf "%c" charToMatch
  5. satisfy predicate label

Note that we are setting the label to be the charToMatch. This refactoring would not have been as convenient before, because we didn’t have the concept of “labels” yet,
and so pchar would not have been able to return a useful error message.

The satisfy function also lets us write more efficient versions of other parsers. For example, parsing a digit looked like this originally:

  1. /// parse a digit
  2. let digitChar =
  3. anyOf ['0'..'9']

But now we can rewrite it using a predicate directly, making it a lot more efficient:

  1. /// parse a digit
  2. let digitChar =
  3. let predicate = Char.IsDigit
  4. let label = "digit"
  5. satisfy predicate label

Similarly, we can create a more efficient whitespace parser too:

  1. /// parse a whitespace char
  2. let whitespaceChar =
  3. let predicate = Char.IsWhiteSpace
  4. let label = "whitespace"
  5. satisfy predicate label

3. Adding position and context to error messages

Another way to improve the error messages is to show the line and column that the error occurred on.

Obviously, for simple one-liners, keeping track of the error location is not a problem, but when you are parsing a 100 line JSON file, it will be very helpful.

In order to track the line and column we are going to have to abandon the simple string input and replace it with something more complex,
so let’s start with that.

Defining a input that tracks position

First, we will need a Position type to store the line and column, with helper functions to increment one column and one line:

  1. type Position = {
  2. line : int
  3. column : int
  4. }
  5. /// define an initial position
  6. let initialPos = {line=0; column=0}
  7. /// increment the column number
  8. let incrCol pos =
  9. {pos with column=pos.column + 1}
  10. /// increment the line number and set the column to 0
  11. let incrLine pos =
  12. {line=pos.line + 1; column=0}

Next, we’ll need to combine the input string with a position into a single “input state” type. Since we are line oriented, we can make our
lives easier and store the input string as a array of lines rather than as one giant string:

  1. /// Define the current input state
  2. type InputState = {
  3. lines : string[]
  4. position : Position
  5. }

We will also need a way to convert a string into a initial InputState:

  1. /// Create a new InputState from a string
  2. let fromStr str =
  3. if String.IsNullOrEmpty(str) then
  4. {lines=[||]; position=initialPos}
  5. else
  6. let separators = [| "\r\n"; "\n" |]
  7. let lines = str.Split(separators, StringSplitOptions.None)
  8. {lines=lines; position=initialPos}

Finally, and most importantly, we need a way to read the next character from the input — let’s call it nextChar.

We know what the input for nextChar will be (an InputState) but what should the output look like?

  • If the input is at the end, we need a way to indicate that there is no next character, so in that case return None.
  • Therefore in the case when a character is available, we will return Some.
  • In addition, the input state will have changed because the column (or line) will have been incremented as well.

So, putting this together, the input for nextChar is an InputState and the output is a pair char option * InputState.

The logic for returning the next char will be as follows then:

  • If we are at the last character of the input, return EOF (None) and don’t change the state.
  • If the current column is not at the end of a line, return the character at that position and change the state by incrementing the column position.
  • If the current column is at the end of a line, return a newline character and change the state by incrementing the line position.

Here’s the code:

  1. // return the current line
  2. let currentLine inputState =
  3. let linePos = inputState.position.line
  4. if linePos < inputState.lines.Length then
  5. inputState.lines.[linePos]
  6. else
  7. "end of file"
  8. /// Get the next character from the input, if any
  9. /// else return None. Also return the updated InputState
  10. /// Signature: InputState -> InputState * char option
  11. let nextChar input =
  12. let linePos = input.position.line
  13. let colPos = input.position.column
  14. // three cases
  15. // 1) if line >= maxLine ->
  16. // return EOF
  17. // 2) if col less than line length ->
  18. // return char at colPos, increment colPos
  19. // 3) if col at line length ->
  20. // return NewLine, increment linePos
  21. if linePos >= input.lines.Length then
  22. input, None
  23. else
  24. let currentLine = currentLine input
  25. if colPos < currentLine.Length then
  26. let char = currentLine.[colPos]
  27. let newPos = incrCol input.position
  28. let newState = {input with position=newPos}
  29. newState, Some char
  30. else
  31. // end of line, so return LF and move to next line
  32. let char = '\n'
  33. let newPos = incrLine input.position
  34. let newState = {input with position=newPos}
  35. newState, Some char

Unlike the earlier string implementation, the underlying array of lines is never altered or copied — only the position is changed. This means that
making a new state each time the position changes should be reasonably efficient, because the text is shared everywhere.

Let’s quickly test that the implementation works. We’ll create a helper function readAllChars and then see what it returns
for different inputs:

  1. let rec readAllChars input =
  2. [
  3. let remainingInput,charOpt = nextChar input
  4. match charOpt with
  5. | None ->
  6. // end of input
  7. ()
  8. | Some ch ->
  9. // return first character
  10. yield ch
  11. // return the remaining characters
  12. yield! readAllChars remainingInput
  13. ]

Here it is with some example inputs:

  1. fromStr "" |> readAllChars // []
  2. fromStr "a" |> readAllChars // ['a'; '\n']
  3. fromStr "ab" |> readAllChars // ['a'; 'b'; '\n']
  4. fromStr "a\nb" |> readAllChars // ['a'; '\n'; 'b'; '\n']

Note that the implementation returns a newline at the end of the input, even if the input doesn’t have one. I think that this is a feature, not a bug!

Changing the parser to use the input

We now need to change the Parser type again.

To start with, the Failure case needs to return some kind of data that indicates the position, so we can show it in an error message.

We could just use the InputState as is, but let’s be good and define a new type specially for this use, called ParserPosition:

  1. /// Stores information about the parser position for error messages
  2. type ParserPosition = {
  3. currentLine : string
  4. line : int
  5. column : int
  6. }

We’ll need some way to convert a InputState into a ParserPosition:

  1. let parserPositionFromInputState (inputState:Input) = {
  2. currentLine = TextInput.currentLine inputState
  3. line = inputState.position.line
  4. column = inputState.position.column
  5. }

And finally, we can update the Result type to include ParserPosition:

  1. // Result type
  2. type Result<'a> =
  3. | Success of 'a
  4. | Failure of ParserLabel * ParserError * ParserPosition

In addition, the Parser type needs to change from string to InputState:

  1. type Input = TextInput.InputState // type alias
  2. /// A Parser structure has a parsing function & label
  3. type Parser<'a> = {
  4. parseFn : (Input -> Result<'a * Input>)
  5. label: ParserLabel
  6. }

With all this extra information available, the printResult function can be enhanced to print the text of the current line, along with a caret where the error is:

  1. let printResult result =
  2. match result with
  3. | Success (value,input) ->
  4. printfn "%A" value
  5. | Failure (label,error,parserPos) ->
  6. let errorLine = parserPos.currentLine
  7. let colPos = parserPos.column
  8. let linePos = parserPos.line
  9. let failureCaret = sprintf "%*s^%s" colPos "" error
  10. printfn "Line:%i Col:%i Error parsing %s\n%s\n%s" linePos colPos label errorLine failureCaret

Let’s test printResult with a dummy error value:

  1. let exampleError =
  2. Failure ("identifier", "unexpected |",
  3. {currentLine = "123 ab|cd"; line=1; column=6})
  4. printResult exampleError

The output is shown below:

  1. Line:1 Col:6 Error parsing identifier
  2. 123 ab|cd
  3. ^unexpected |

Much nicer than before!

Fixing up the run function

The run function now needs to take an InputState not a string. But we also want the convenience of running against string input,
so let’s create two run functions, one that takes an InputState and one that takes a string:

  1. /// Run the parser on a InputState
  2. let runOnInput parser input =
  3. // call inner function with input
  4. parser.parseFn input
  5. /// Run the parser on a string
  6. let run parser inputStr =
  7. // call inner function with input
  8. runOnInput parser (TextInput.fromStr inputStr)

Fixing up the combinators

We now have three items in the Failure case rather than two. This breaks some code but is easy to fix. I’m tempted to create a special ParserError type
so that it never happens again, but for now, I’ll just fix up the errors.

Here’s a new version of satisfy:

  1. /// Match an input token if the predicate is satisfied
  2. let satisfy predicate label =
  3. let innerFn input =
  4. let remainingInput,charOpt = TextInput.nextChar input
  5. match charOpt with
  6. | None ->
  7. let err = "No more input"
  8. let pos = parserPositionFromInputState input
  9. //Failure (label,err) // <====== old version
  10. Failure (label,err,pos) // <====== new version
  11. | Some first ->
  12. if predicate first then
  13. Success (first,remainingInput)
  14. else
  15. let err = sprintf "Unexpected '%c'" first
  16. let pos = parserPositionFromInputState input
  17. //Failure (label,err) // <====== old version
  18. Failure (label,err,pos) // <====== new version
  19. // return the parser
  20. {parseFn=innerFn;label=label}

Note that the failure case code is now Failure (label,err,pos) where the parser position is built from the input state.

And here is bindP:

  1. /// "bindP" takes a parser-producing function f, and a parser p
  2. /// and passes the output of p into f, to create a new parser
  3. let bindP f p =
  4. let label = "unknown"
  5. let innerFn input =
  6. let result1 = runOnInput p input
  7. match result1 with
  8. | Failure (label,err,pos) -> // <====== new with pos
  9. // return error from parser1
  10. Failure (label,err,pos)
  11. | Success (value1,remainingInput) ->
  12. // apply f to get a new parser
  13. let p2 = f value1
  14. // run parser with remaining input
  15. runOnInput p2 remainingInput
  16. {parseFn=innerFn; label=label}

We can fix up the other functions in the same way.

Testing the positional errors

Let’s test with a real parser now:

  1. let parseAB =
  2. pchar 'A' .>>. pchar 'B'
  3. <?> "AB"
  4. run parseAB "A|C"
  5. |> printResult

And the output is:

  1. // Line:0 Col:1 Error parsing AB
  2. // A|C
  3. // ^Unexpected '|'

Excellent! I think we can stop now.

4. Adding some standard parsers to the library

In the previous posts, we’ve built parsers for strings and ints in passing, but now let’s add them to the core library, so that clients don’t have to reinvent the wheel.

These parsers are based on those in the the FParsec library.

Let’s start with some string-related parsers. I will present them without comment — I hope that the code is self-explanatory by now.

  1. /// parse a char
  2. let pchar charToMatch =
  3. // label is just the character
  4. let label = sprintf "%c" charToMatch
  5. let predicate ch = (ch = charToMatch)
  6. satisfy predicate label
  7. /// Choose any of a list of characters
  8. let anyOf listOfChars =
  9. let label = sprintf "anyOf %A" listOfChars
  10. listOfChars
  11. |> List.map pchar // convert into parsers
  12. |> choice
  13. <?> label
  14. /// Convert a list of chars to a string
  15. let charListToStr charList =
  16. String(List.toArray charList)
  17. /// Parses a sequence of zero or more chars with the char parser cp.
  18. /// It returns the parsed chars as a string.
  19. let manyChars cp =
  20. many cp
  21. |>> charListToStr
  22. /// Parses a sequence of one or more chars with the char parser cp.
  23. /// It returns the parsed chars as a string.
  24. let manyChars1 cp =
  25. many1 cp
  26. |>> charListToStr
  27. /// parse a specific string
  28. let pstring str =
  29. // label is just the string
  30. let label = str
  31. str
  32. // convert to list of char
  33. |> List.ofSeq
  34. // map each char to a pchar
  35. |> List.map pchar
  36. // convert to Parser<char list>
  37. |> sequence
  38. // convert Parser<char list> to Parser<string>
  39. |> mapP charListToStr
  40. <?> label

Let’s test pstring, for example:

  1. run (pstring "AB") "ABC"
  2. |> printResult
  3. // Success
  4. // "AB"
  5. run (pstring "AB") "A|C"
  6. |> printResult
  7. // Line:0 Col:1 Error parsing AB
  8. // A|C
  9. // ^Unexpected '|'

Whitespace parsers

Whitespace is important in parsing, even if we do end up mostly throwing it away!

  1. /// parse a whitespace char
  2. let whitespaceChar =
  3. let predicate = Char.IsWhiteSpace
  4. let label = "whitespace"
  5. satisfy predicate label
  6. /// parse zero or more whitespace char
  7. let spaces = many whitespaceChar
  8. /// parse one or more whitespace char
  9. let spaces1 = many1 whitespaceChar

And here’s some whitespace tests:

  1. run spaces " ABC"
  2. |> printResult
  3. // [' ']
  4. run spaces "A"
  5. |> printResult
  6. // []
  7. run spaces1 " ABC"
  8. |> printResult
  9. // [' ']
  10. run spaces1 "A"
  11. |> printResult
  12. // Line:0 Col:0 Error parsing many1 whitespace
  13. // A
  14. // ^Unexpected 'A'

Numeric parsers

Finally, we need a parser for ints and floats.

  1. /// parse a digit
  2. let digitChar =
  3. let predicate = Char.IsDigit
  4. let label = "digit"
  5. satisfy predicate label
  6. // parse an integer
  7. let pint =
  8. let label = "integer"
  9. // helper
  10. let resultToInt (sign,digits) =
  11. let i = digits |> int // ignore int overflow for now
  12. match sign with
  13. | Some ch -> -i // negate the int
  14. | None -> i
  15. // define parser for one or more digits
  16. let digits = manyChars1 digitChar
  17. // an "int" is optional sign + one or more digits
  18. opt (pchar '-') .>>. digits
  19. |> mapP resultToInt
  20. <?> label
  21. // parse a float
  22. let pfloat =
  23. let label = "float"
  24. // helper
  25. let resultToFloat (((sign,digits1),point),digits2) =
  26. let fl = sprintf "%s.%s" digits1 digits2 |> float
  27. match sign with
  28. | Some ch -> -fl // negate the float
  29. | None -> fl
  30. // define parser for one or more digits
  31. let digits = manyChars1 digitChar
  32. // a float is sign, digits, point, digits (ignore exponents for now)
  33. opt (pchar '-') .>>. digits .>>. pchar '.' .>>. digits
  34. |> mapP resultToFloat
  35. <?> label

And some tests:

  1. run pint "-123Z"
  2. |> printResult
  3. // -123
  4. run pint "-Z123"
  5. |> printResult
  6. // Line:0 Col:1 Error parsing integer
  7. // -Z123
  8. // ^Unexpected 'Z'
  9. run pfloat "-123.45Z"
  10. |> printResult
  11. // -123.45
  12. run pfloat "-123Z45"
  13. |> printResult
  14. // Line:0 Col:4 Error parsing float
  15. // -123Z45
  16. // ^Unexpected 'Z'

5. Backtracking

One more topic that we should discuss is “backtracking”.

Let’s say that you have two parsers: one to match the string A-1 and and another to match the string A-2. If the input is
A-2 then the first parser will fail at the third character and the second parser will be attempted.

Now the second parser must start at the beginning of the original sequence of characters, not at the third character. That is, we
need to undo the current position in the input stream and go back to the first position.

If we were using a mutable input stream then this might be a tricky problem, but thankfully we are using immutable data, and so
“undoing” the position just means using the original input value. And of course, this is exactly what combinators such as orElse (<|>) do.

In other words, we get backtracking “for free” when we use immutable input state. Yay!

Sometimes however, we don’t want to backtrack. For example, let’s say we have these parsers:

  • let forExpression = the “for” keyword, then an identifier, then the “in” keyword, etc.
  • let ifExpression = the “if” keyword, then an identifier, then the “then” keyword, etc.

and we then create a combined expression parser that chooses between them:

  • let expression = forExpression <|> ifExpression

Now, if the input stream is for &&& in something then the forExpression parser will error when it hits the sequence &&&, because it is expecting
a valid identifier. At this point we don’t want to backtrack and try the ifExpression — we want to show an error such as “identifier expected after ‘for’”.

The rule then is that: if input has been consumed successfully (in this case, the for keyword was matched successfully) then do not backtrack.

We’re not going to implement this rule in our simple library, but a proper library like FParsec does implement this and also has support
for bypassing it when needed.

Listing of the final parser library

The parsing library is up to 500 lines of code now, so I won’t show it here. You can see it at this gist.

Summary

In this post, we added better error handling and some more parsers.

Now we have everything we need to build a JSON parser!
That will be the topic of the next post.

The source code for this post is available at this gist.