Chapter 13 Lexer and parser generators (ocamllex, ocamlyacc)

This chapter describes two program generators: ocamllex, which produces a lexical analyzer from a set of regular expressions with associated semantic actions, and ocamlyacc, which produces a parser from a grammar with associated semantic actions.

These program generators are very close to the well-known lex and yacc commands that can be found in most C programming environments. This chapter assumes a working knowledge of lex and yacc: while it describes the input syntax for ocamllex and ocamlyacc and the main differences with lex and yacc, it does not explain the basics of writing a lexer or parser description in lex and yacc. Readers unfamiliar with lex and yacc are referred to “Compilers: principles, techniques, and tools” by Aho, Sethi and Ullman (Addison-Wesley, 1986), or “Lex & Yacc” by Levine, Mason and Brown (O’Reilly, 1992).

13.1 Overview of ocamllex

The ocamllex command produces a lexical analyzer from a set of regular expressions with attached semantic actions, in the style of lex. Assuming the input file is lexer.mll, executing

  ocamllex lexer.mll

produces OCaml code for a lexical analyzer in file lexer.ml. This file defines one lexing function per entry point in the lexer definition. These functions have the same names as the entry points. Lexing functions take as argument a lexer buffer, and return the semantic attribute of the corresponding entry point.

Lexer buffers are an abstract data type implemented in the standard library module Lexing. The functions Lexing.from_channel, Lexing.from_string and Lexing.from_function create lexer buffers that read from an input channel, a character string, or any reading function, respectively. (See the description of module Lexing in chapter ??.)

When used in conjunction with a parser generated by ocamlyacc, the semantic actions compute a value belonging to the type token defined by the generated parsing module. (See the description of ocamlyacc below.)
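As a sketch of the calling convention, the driver below assumes a lexer definition lexer.mll with a single entry point token; the module name Lexer and the type of its result are hypothetical, standing in for whatever the actions return.

```ocaml
(* Hypothetical driver (untested sketch).  Lexing.from_string builds a
   lexer buffer over an in-memory string; Lexing.from_channel would read
   from an input channel instead. *)
let () =
  let lexbuf = Lexing.from_string "input text" in
  (* Each call to the entry point consumes one lexeme from the buffer
     and returns the value computed by the matching semantic action. *)
  let _first_token = Lexer.token lexbuf in
  ()
```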

13.1.1 Options

The following command-line options are recognized by ocamllex.

  • -ml
  • Output code that does not use OCaml’s built-in automata interpreter. Instead, the automaton is encoded by OCaml functions. This option improves performance when using the native compiler, but decreases it when using the bytecode compiler.
  • -o output-file
  • Specify the name of the output file produced by ocamllex. The default is the input file name with its extension replaced by .ml.
  • -q
  • Quiet mode. ocamllex normally outputs informational messages to standard output. They are suppressed if option -q is used.
  • -v or -version
  • Print version string and exit.
  • -vnum
  • Print short version number and exit.
  • -help or --help
  • Display a short usage summary and exit.

13.2 Syntax of lexer definitions

The format of lexer definitions is as follows:

  { header }
  let ident = regexp …
  [refill { refill-handler }]
  rule entrypoint [arg1 … argn] =
    parse regexp { action }
      | …
      | regexp { action }
  and entrypoint [arg1 … argn] =
    parse …
  and …
  { trailer }

Comments are delimited by (* and *), as in OCaml. The parse keyword can be replaced by the shortest keyword, with the semantic consequences explained below.

Refill handlers are an optional feature introduced in OCaml 4.02, documented below in subsection 13.2.7.

13.2.1 Header and trailer

The header and trailer sections are arbitrary OCaml text enclosed in curly braces. Either or both can be omitted. If present, the header text is copied as is at the beginning of the output file and the trailer text at the end. Typically, the header section contains the open directives required by the actions, and possibly some auxiliary functions used in the actions.

13.2.2 Naming regular expressions

Between the header and the entry points, one can give names to frequently-occurring regular expressions. This is written let ident = regexp. In regular expressions that follow this declaration, the identifier ident can be used as shorthand for regexp.
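For instance, a hypothetical fragment of a lexer definition (the names digit, alpha and ident are illustrative, not part of the manual's examples):

```ocaml
(* ocamllex fragment (untested sketch): name common character classes
   once, then reuse them in later regular expressions and rules. *)
let digit = ['0'-'9']
let alpha = ['a'-'z' 'A'-'Z']

rule ident = parse
  | alpha (alpha | digit | '_')* { Lexing.lexeme lexbuf }
```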

13.2.3 Entry points

The names of the entry points must be valid identifiers for OCaml values (starting with a lowercase letter). Similarly, the arguments arg1…argn must be valid OCaml identifiers. Each entry point becomes an OCaml function that takes n+1 arguments, the extra implicit last argument being of type Lexing.lexbuf. Characters are read from the Lexing.lexbuf argument and matched against the regular expressions provided in the rule, until a prefix of the input matches one of the rules. The corresponding action is then evaluated and returned as the result of the function.

If several regular expressions match a prefix of the input, the “longest match” rule applies: the regular expression that matches the longest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is selected.

However, if lexer rules are introduced with the shortest keyword in place of the parse keyword, then the “shortest match” rule applies: the shortest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is still selected. This feature is not intended for use in ordinary lexical analyzers; rather, it may facilitate the use of ocamllex as a simple text processing tool.
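To illustrate the difference, consider this hypothetical entry point (the name run is illustrative):

```ocaml
(* ocamllex fragment (untested sketch): with [shortest], 'a'* matches
   the empty prefix of any input, so [run] returns "" immediately;
   with [parse] it would instead consume the longest run of 'a'
   characters before returning. *)
rule run = shortest
  | 'a'* as s { s }
```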

13.2.4 Regular expressions

The regular expressions are in the style of lex, with a more OCaml-like syntax.

regexp ::=
  • 'regular-char ∣ escape-sequence'
  • A character constant, with the same syntax as OCaml character constants. Match the denoted character.
  • _
  • (underscore) Match any character.
  • eof
  • Match the end of the lexer input. Note: On some systems, with interactive input, an end-of-file may be followed by more characters. However, ocamllex will not correctly handle regular expressions that contain eof followed by something else.
  • " { string-character } "
  • A string constant, with the same syntax as OCaml string constants. Match the corresponding sequence of characters.
  • [character-set]
  • Match any single character belonging to the given character set. Valid character sets are: single character constants 'c'; ranges of characters 'c1'-'c2' (all characters between c1 and c2, inclusive); and the union of two or more character sets, denoted by concatenation.
  • [^character-set]
  • Match any single character not belonging to the given character set.
  • regexp1 # regexp2
  • (difference of character sets) Regular expressions regexp1 and regexp2 must be character sets defined with [… ] (or a single character expression or underscore _). Match the difference of the two specified character sets.
  • regexp*
  • (repetition) Match the concatenation of zero or more strings that match regexp.
  • regexp+
  • (strict repetition) Match the concatenation of one or more strings that match regexp.
  • regexp?
  • (option) Match the empty string, or a string matching regexp.
  • regexp1 | regexp2
  • (alternative) Match any string that matches regexp1 or regexp2.
  • regexp1 regexp2
  • (concatenation) Match the concatenation of two strings, the first matching regexp1, the second matching regexp2.
  • (regexp)
  • Match the same strings as regexp.
  • ident
  • Reference the regular expression bound to ident by an earlier let ident = regexp definition.
  • regexp as ident
  • Bind the substring matched by regexp to identifier ident.

Concerning the precedences of operators, # has the highest precedence, followed by *, + and ?, then concatenation, then | (alternation), then as.
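As an illustration of these precedences, consider the following hypothetical named expression. Since concatenation binds tighter than |, the parentheses around the alternative are required.

```ocaml
(* ocamllex fragment (untested sketch): # binds tighter than *, so the
   second factor parses as (_ # ['\n'])*, i.e. zero or more non-newline
   characters.  Without the parentheses, 'a' | 'b' followed by the rest
   would instead parse as an alternative whose second branch is the
   whole concatenation starting at 'b'. *)
let tagged_line = ('a' | 'b') (_ # ['\n'])*
```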

13.2.5 Actions

The actions are arbitrary OCaml expressions. They are evaluated in a context where the identifiers defined by using the as construct are bound to subparts of the matched string. Additionally, lexbuf is bound to the current lexer buffer. Some typical uses for lexbuf, in conjunction with the operations on lexer buffers provided by the Lexing standard library module, are listed below.

  • Lexing.lexeme lexbuf
  • Return the matched string.
  • Lexing.lexeme_char lexbuf n
  • Return the nth character in the matched string. The first character corresponds to n = 0.
  • Lexing.lexeme_start lexbuf
  • Return the absolute position in the input text of the beginning of the matched string (i.e. the offset of the first character of the matched string). The first character read from the input text has offset 0.
  • Lexing.lexeme_end lexbuf
  • Return the absolute position in the input text of the end of the matched string (i.e. the offset of the first character after the matched string). The first character read from the input text has offset 0.
  • entrypoint [exp1 … expn] lexbuf
  • (Where entrypoint is the name of another entry point in the same lexer definition.) Recursively call the lexer on the given entry point. Notice that lexbuf is the last argument. Useful for lexing nested comments, for example.
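For example, nested comments can be skipped by an entry point that calls itself recursively. The entry-point name and the explicit depth argument below are hypothetical, not taken from the manual:

```ocaml
(* ocamllex fragment (untested sketch): [comment] tracks the nesting
   depth as an explicit argument and returns unit once the outermost
   comment is closed.  Note that lexbuf is always the last argument
   of a recursive call to an entry point. *)
rule comment depth = parse
  | "(*" { comment (depth + 1) lexbuf }
  | "*)" { if depth = 1 then () else comment (depth - 1) lexbuf }
  | _    { comment depth lexbuf }
```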

13.2.6 Variables in regular expressions

The as construct is similar to “groups” as provided by numerous regular expression packages. The type of these variables can be string, char, string option or char option.

We first consider the case of linear patterns, that is the case when all as bound variables are distinct. In regexp as ident, the type of ident normally is string (or string option) except when regexp is a character constant, an underscore, a string constant of length one, a character set specification, or an alternation of those. Then, the type of ident is char (or char option). Option types are introduced when overall rule matching does not imply matching of the bound sub-pattern. This is in particular the case of (regexp as ident)? and of regexp1 | (regexp2 as ident).

There is no linearity restriction over as bound variables. When a variable is bound more than once, the previous rules are to be extended as follows:

  • A variable is a char variable when all its occurrences bind char occurrences in the previous sense.
  • A variable is an option variable when the overall expression can be matched without binding this variable. For instance, in ('a' as x) | ( 'a' (_ as x) ) the variable x is of type char, whereas in ("ab" as x) | ( 'a' (_ as x) ? ) the variable x is of type string option.
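A small sketch of the resulting types, using a hypothetical rule:

```ocaml
(* ocamllex fragment (untested sketch): c has type char (it binds a
   single-character pattern), s has type string, and o has type
   string option, because the sub-pattern under ? need not participate
   in a successful match. *)
rule example = parse
  | ['a'-'z'] as c          { `Char c }
  | ("ab" ['0'-'9']*) as s  { `String s }
  | 'c' ("d"* as o)?        { `Opt (o : string option) }
```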

In some cases, a successful match may not yield a unique set of bindings. For instance the matching of aba by the regular expression (('a'|"ab") as x) (("ba"|'a') as y) may result in binding either x to "ab" and y to "a", or x to "a" and y to "ba". The automata produced by ocamllex on such ambiguous regular expressions will select one of the possible resulting sets of bindings. The selected set of bindings is purposely left unspecified.

13.2.7 Refill handlers

By default, when ocamllex reaches the end of its lexing buffer, it will silently call the refill_buff function of the lexbuf structure and continue lexing. It is sometimes useful to be able to take control of the refilling action; typically, if you use a library for asynchronous computation, you may want to wrap the refilling action in a delaying function to avoid blocking synchronous operations.

Since OCaml 4.02, it is possible to specify a refill-handler, a function that will be called when refill happens. It is passed the continuation of the lexing, on which it has total control. The OCaml expression used as refill action should have a type that is an instance of

  (Lexing.lexbuf -> 'a) -> Lexing.lexbuf -> 'a

where the first argument is the continuation which captures the processing ocamllex would usually perform (refilling the buffer, then calling the lexing function again), and the result type that instantiates 'a should unify with the result type of all lexing rules.

As an example, consider the following lexer that is parametrized over an arbitrary monad:

  {
  type token = EOL | INT of int | PLUS

  module Make (M : sig
    type 'a t
    val return: 'a -> 'a t
    val bind: 'a t -> ('a -> 'b t) -> 'b t
    val fail : string -> 'a t

    (* Set up lexbuf *)
    val on_refill : Lexing.lexbuf -> unit t
  end)
  = struct

  let refill_handler k lexbuf =
    M.bind (M.on_refill lexbuf) (fun () -> k lexbuf)

  }

  refill {refill_handler}

  rule token = parse
  | [' ' '\t']      { token lexbuf }
  | '\n'            { M.return EOL }
  | ['0'-'9']+ as i { M.return (INT (int_of_string i)) }
  | '+'             { M.return PLUS }
  | _               { M.fail "unexpected character" }
  {
  end
  }

13.2.8 Reserved identifiers

All identifiers starting with __ocaml_lex are reserved for use by ocamllex; do not use any such identifier in your programs.

13.3 Overview of ocamlyacc

The ocamlyacc command produces a parser from a context-free grammar specification with attached semantic actions, in the style of yacc. Assuming the input file is grammar.mly, executing

  ocamlyacc options grammar.mly

produces OCaml code for a parser in the file grammar.ml, and its interface in file grammar.mli.

The generated module defines one parsing function per entry point in the grammar. These functions have the same names as the entry points. Parsing functions take as arguments a lexical analyzer (a function from lexer buffers to tokens) and a lexer buffer, and return the semantic attribute of the corresponding entry point. Lexical analyzer functions are usually generated from a lexer specification by the ocamllex program. Lexer buffers are an abstract data type implemented in the standard library module Lexing. Tokens are values from the concrete type token, defined in the interface file grammar.mli produced by ocamlyacc.

13.4 Syntax of grammar definitions

Grammar definitions have the following format:

  %{
    header
  %}
    declarations
  %%
    rules
  %%
    trailer

Comments are enclosed between /* and */ (as in C) in the “declarations” and “rules” sections, and between (* and *) (as in OCaml) in the “header” and “trailer” sections.

13.4.1 Header and trailer

The header and the trailer sections are OCaml code that is copied as is into file grammar.ml. Both sections are optional. The header goes at the beginning of the output file; it usually contains open directives and auxiliary functions required by the semantic actions of the rules. The trailer goes at the end of the output file.

13.4.2 Declarations

Declarations are given one per line. They all start with a % sign.

  • %token constr … constr
  • Declare the given symbols constr … constr as tokens (terminal symbols). These symbols are added as constant constructors for the token concrete type.
  • %token <typexpr> constr … constr
  • Declare the given symbols constr … constr as tokens with an attached attribute of the given type. These symbols are added as constructors with arguments of the given type for the token concrete type. The typexpr part is an arbitrary OCaml type expression, except that all type constructor names must be fully qualified (e.g. Modname.typename) for all types except standard built-in types, even if the proper open directives (e.g. open Modname) were given in the header section. That’s because the header is copied only to the .ml output file, but not to the .mli output file, while the typexpr part of a %token declaration is copied to both.
  • %start symbol … symbol
  • Declare the given symbols as entry points for the grammar. For each entry point, a parsing function with the same name is defined in the output module. Non-terminals that are not declared as entry points have no such parsing function. Start symbols must be given a type with the %type directive below.
  • %type <typexpr> symbol … symbol
  • Specify the type of the semantic attributes for the given symbols. This is mandatory for start symbols only. Other nonterminal symbols need not be given types by hand: these types will be inferred when running the output files through the OCaml compiler (unless the -s option is in effect). The typexpr part is an arbitrary OCaml type expression, except that all type constructor names must be fully qualified, as explained above for %token.
  • %left symbol … symbol
  • %right symbol … symbol
  • %nonassoc symbol … symbol
  • Associate precedences and associativities to the given symbols. All symbols on the same line are given the same precedence. They have higher precedence than symbols declared before in a %left, %right or %nonassoc line. They have lower precedence than symbols declared after in a %left, %right or %nonassoc line. The symbols are declared to associate to the left (%left), to the right (%right), or to be non-associative (%nonassoc). The symbols are usually tokens. They can also be dummy nonterminals, for use with the %prec directive inside the rules.

The precedence declarations are used in the following way to resolve reduce/reduce and shift/reduce conflicts:

  • Tokens and rules have precedences. By default, the precedence of a rule is the precedence of its rightmost terminal. You can override this default by using the %prec directive in the rule.
  • A reduce/reduce conflict is resolved in favor of the first rule (in the order given by the source file), and ocamlyacc outputs a warning.
  • A shift/reduce conflict is resolved by comparing the precedence of the rule to be reduced with the precedence of the token to be shifted. If the precedence of the rule is higher, then the rule will be reduced; if the precedence of the token is higher, then the token will be shifted.
  • A shift/reduce conflict between a rule and a token with the same precedence will be resolved using the associativity: if the token is left-associative, then the parser will reduce; if the token is right-associative, then the parser will shift. If the token is non-associative, then the parser will declare a syntax error.
  • When a shift/reduce conflict cannot be resolved using the above method, then ocamlyacc will output a warning and the parser will always shift.

13.4.3 Rules

The syntax for rules is as usual:

  nonterminal :
      symbol … symbol { semantic-action }
    | …
    | symbol … symbol { semantic-action }
  ;

Rules can also contain the %prec symbol directive in the right-hand side part, to override the default precedence and associativity of the rule with the precedence and associativity of the given symbol.

Semantic actions are arbitrary OCaml expressions that are evaluated to produce the semantic attribute attached to the defined nonterminal. The semantic actions can access the semantic attributes of the symbols in the right-hand side of the rule with the $ notation: $1 is the attribute for the first (leftmost) symbol, $2 is the attribute for the second symbol, etc.

The rules may contain the special symbol error to indicate resynchronization points, as in yacc.

Actions occurring in the middle of rules are not supported.

Nonterminal symbols are like regular OCaml symbols, except that they cannot end with ' (single quote).

13.4.4 Error handling

Error recovery is supported as follows: when the parser reaches an error state (no grammar rules can apply), it calls a function named parse_error with the string "syntax error" as argument. The default parse_error function does nothing and returns, thus initiating error recovery (see below). The user can define a customized parse_error function in the header section of the grammar file.

The parser also enters error recovery mode if one of the grammar actions raises the Parsing.Parse_error exception.

In error recovery mode, the parser discards states from the stack until it reaches a place where the error token can be shifted. It then discards tokens from the input until it finds three successive tokens that can be accepted, and starts processing with the first of these. If no state can be uncovered where the error token can be shifted, then the parser aborts by raising the Parsing.Parse_error exception.
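For instance, a hypothetical variant of the calculator grammar of section 13.6 can use an error rule to resynchronize at end of line:

```ocaml
/* ocamlyacc fragment (untested sketch): on a syntax error, states and
   tokens are discarded until the error token can be shifted; the rest
   of the line is skipped up to EOL, and the line yields a default
   attribute of 0. */
main:
    expr EOL    { $1 }
  | error EOL   { 0 }
;
```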

Refer to documentation on yacc for more details and guidance on how to use error recovery.

13.5 Options

The ocamlyacc command recognizes the following options:

  • -b prefix
  • Name the output files prefix.ml, prefix.mli, prefix.output, instead of the default naming convention.
  • -q
  • This option has no effect.
  • -v
  • Generate a description of the parsing tables and a report on conflicts resulting from ambiguities in the grammar. The description is put in file grammar.output.
  • -version
  • Print version string and exit.
  • -vnum
  • Print short version number and exit.
  • -
  • Read the grammar specification from standard input. The default output file names are stdin.ml and stdin.mli.
  • -- file
  • Process file as the grammar specification, even if its name starts with a dash (-) character. This option must be the last on the command line.

At run-time, the ocamlyacc-generated parser can be debugged by setting the p option in the OCAMLRUNPARAM environment variable (see section 11.2). This causes the pushdown automaton executing the parser to print a trace of its actions (tokens shifted, rules reduced, etc). The trace mentions rule numbers and state numbers that can be interpreted by looking at the file grammar.output generated by ocamlyacc -v.

13.6 A complete example

The all-time favorite: a desk calculator. This program reads arithmetic expressions on standard input, one per line, and prints their values. Here is the grammar definition:

  /* File parser.mly */
  %token <int> INT
  %token PLUS MINUS TIMES DIV
  %token LPAREN RPAREN
  %token EOL
  %left PLUS MINUS        /* lowest precedence */
  %left TIMES DIV         /* medium precedence */
  %nonassoc UMINUS        /* highest precedence */
  %start main             /* the entry point */
  %type <int> main
  %%
  main:
      expr EOL                  { $1 }
  ;
  expr:
      INT                       { $1 }
    | LPAREN expr RPAREN        { $2 }
    | expr PLUS expr            { $1 + $3 }
    | expr MINUS expr           { $1 - $3 }
    | expr TIMES expr           { $1 * $3 }
    | expr DIV expr             { $1 / $3 }
    | MINUS expr %prec UMINUS   { - $2 }
  ;

Here is the definition for the corresponding lexer:

  (* File lexer.mll *)
  {
  open Parser        (* The type token is defined in parser.mli *)
  exception Eof
  }
  rule token = parse
      [' ' '\t']          { token lexbuf }     (* skip blanks *)
    | ['\n' ]             { EOL }
    | ['0'-'9']+ as lxm   { INT(int_of_string lxm) }
    | '+'                 { PLUS }
    | '-'                 { MINUS }
    | '*'                 { TIMES }
    | '/'                 { DIV }
    | '('                 { LPAREN }
    | ')'                 { RPAREN }
    | eof                 { raise Eof }

Here is the main program, which combines the parser with the lexer:

  (* File calc.ml *)
  let _ =
    try
      let lexbuf = Lexing.from_channel stdin in
      while true do
        let result = Parser.main Lexer.token lexbuf in
        print_int result; print_newline(); flush stdout
      done
    with Lexer.Eof ->
      exit 0

To compile everything, execute:

  ocamllex lexer.mll       # generates lexer.ml
  ocamlyacc parser.mly     # generates parser.ml and parser.mli
  ocamlc -c parser.mli
  ocamlc -c lexer.ml
  ocamlc -c parser.ml
  ocamlc -c calc.ml
  ocamlc -o calc lexer.cmo parser.cmo calc.cmo

13.7 Common errors

  • ocamllex: transition table overflow, automaton is too big
  • The deterministic automata generated by ocamllex are limited to at most 32767 transitions. The message above indicates that your lexer definition is too complex and overflows this limit. This is commonly caused by lexer definitions that have separate rules for each of the alphabetic keywords of the language, as in the following example.
  rule token = parse
    "keyword1" { KWD1 }
  | "keyword2" { KWD2 }
  | ...
  | "keyword100" { KWD100 }
  | ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] * as id
               { IDENT id }

To keep the generated automata small, rewrite those definitions with only one general “identifier” rule, followed by a hashtable lookup to separate keywords from identifiers:

  { let keyword_table = Hashtbl.create 53
    let _ =
      List.iter (fun (kwd, tok) -> Hashtbl.add keyword_table kwd tok)
                [ "keyword1", KWD1;
                  "keyword2", KWD2; ...
                  "keyword100", KWD100 ]
  }
  rule token = parse
    ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] * as id
        { try
            Hashtbl.find keyword_table id
          with Not_found ->
            IDENT id }
  • ocamllex: Position memory overflow, too many bindings
  • The deterministic automata generated by ocamllex maintain a table of positions inside the scanned lexer buffer. The size of this table is limited to at most 255 cells. This error should not show up in normal situations.

13.8 Module Depend : Module dependencies.

Warning: this module is unstable and part of compiler-libs[24].

  module String : Misc.Stdlib.String

  type map_tree =
    | Node of String.Set.t * bound_map

  type bound_map = map_tree String.Map.t

  val make_leaf : string -> map_tree

  val make_node : bound_map -> map_tree

  val weaken_map : String.Set.t -> map_tree -> map_tree

  val free_structure_names : String.Set.t ref

  val pp_deps : string list ref

Dependencies found by preprocessing tools (plugins).

  val open_module : bound_map -> Longident.t -> bound_map

  val add_use_file : bound_map -> Parsetree.toplevel_phrase list -> unit

  val add_signature : bound_map -> Parsetree.signature -> unit

  val add_implementation : bound_map -> Parsetree.structure -> unit

  val add_implementation_binding :
    bound_map -> Parsetree.structure -> bound_map

  val add_signature_binding : bound_map -> Parsetree.signature -> bound_map