As a final modification to our running JSON quotation example, I want to repair a problem noted in the first post—that the default lexer does not match the JSON spec—and in doing so demonstrate the use of custom lexers with Camlp4 grammars. We’ll parse UTF8-encoded Javascript using the ulex library.
To use a custom lexer, we need to pass a module matching the Lexer signature (in camlp4/Camlp4/Sig.ml) to Camlp4.PreCast.MakeGram. (Recall that we get back an empty grammar which we then extend with parser entries.) Let’s look at the signature and its subsignatures, and our implementation of each:
    module type Error = sig
      type t
      exception E of t
      val to_string : t -> string
      val print : Format.formatter -> t -> unit
    end
First we have a module for packaging up an exception so it can be handled generically (in particular it may be registered with Camlp4.ErrorHandler for common printing and handling). We have simple exception needs so we give a simple implementation:
    module Error = struct
      type t = string
      exception E of string
      let print = Format.pp_print_string
      let to_string x = x
    end
    let _ = let module M = Camlp4.ErrorHandler.Register(Error) in ()
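To make "handled generically" concrete, here is a minimal sketch that does not depend on Camlp4. The Report functor is a hypothetical helper (it is not part of Camlp4, and Camlp4.ErrorHandler's real mechanism is not shown here); it only illustrates that code written against the Error signature can catch and format a lexer's exception without knowing what the payload type is.

```ocaml
module type Error = sig
  type t
  exception E of t
  val to_string : t -> string
  val print : Format.formatter -> t -> unit
end

module Error = struct
  type t = string
  exception E of string
  let print = Format.pp_print_string
  let to_string x = x
end

(* Hypothetical helper, not part of Camlp4: run [f] and turn an [Err.E]
   exception into a printable message, whatever [Err.t] happens to be. *)
module Report (Err : Error) = struct
  let run f =
    try Ok (f ()) with Err.E e -> Error (Err.to_string e)
end

module R = Report (Error)

let ok : (int, string) result = R.run (fun () -> 42)

let failed : (int, string) result =
  R.run (fun () -> raise (Error.E "illegal character"))
```

Because Report only uses what the signature exposes, the same functor would work unchanged for a lexer whose error type is a variant or a record.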
Next we have a module defining the tokens our lexer supports:
    module type Token = sig
      module Loc : Loc
      type t
      val to_string : t -> string
      val print : Format.formatter -> t -> unit
      val match_keyword : string -> t -> bool
      val extract_string : t -> string
      module Filter : ... (* see below *)
      module Error : Error
    end
The type t represents a token. This can be anything we like (in particular it does not need to be a variant with arms KEYWORD, EOI, etc., although that is the conventional representation), so long as we provide the specified functions to convert it to a string, print it to a formatter, determine if it matches a string keyword (recall that we can use literal strings in grammars; this function is called to see if the next token matches a literal string), and extract a string representation of it (called when you bind a variable to a token in a grammar, e.g. n = NUMBER). Here’s our implementation:
    type token =
      | KEYWORD of string
      | NUMBER of string
      | STRING of string
      | ANTIQUOT of string * string
      | EOI

    module Token = struct
      type t = token

      let to_string t =
        let sf = Printf.sprintf in
        match t with
          | KEYWORD s -> sf "KEYWORD %S" s
          | NUMBER s -> sf "NUMBER %s" s
          | STRING s -> sf "STRING \"%s\"" s
          | ANTIQUOT (n, s) -> sf "ANTIQUOT %s: %S" n s
          | EOI -> sf "EOI"

      let print ppf x = Format.pp_print_string ppf (to_string x)

      let match_keyword kwd = function
        | KEYWORD kwd' when kwd = kwd' -> true
        | _ -> false

      let extract_string = function
        | KEYWORD s | NUMBER s | STRING s -> s
        | tok ->
            invalid_arg
              ("Cannot extract a string from this token: " ^
                 to_string tok)

      module Loc = Camlp4.PreCast.Loc
      module Error = Error
      module Filter = ... (* see below *)
    end
Not much to it. KEYWORD covers true, false, null, and punctuation; NUMBER and STRING are JSON numbers and strings; as we saw last time antiquotations are returned in ANTIQUOT; finally we signal the end of the input with EOI.
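As a small illustration of how the grammar machinery would consult these functions, the sketch below repeats the token type and the two lookup functions on their own (no Camlp4 needed) and applies them to a hand-written token list. The list toks is made up for the example; the real lexer would also pair each token with a location.

```ocaml
type token =
  | KEYWORD of string
  | NUMBER of string
  | STRING of string
  | ANTIQUOT of string * string
  | EOI

(* Same logic as Token.match_keyword: only a KEYWORD token can match a
   literal string used in the grammar. *)
let match_keyword kwd = function
  | KEYWORD kwd' when kwd = kwd' -> true
  | _ -> false

(* Same logic as Token.extract_string, with a shorter error message. *)
let extract_string = function
  | KEYWORD s | NUMBER s | STRING s -> s
  | ANTIQUOT _ | EOI ->
      invalid_arg "Cannot extract a string from this token"

(* Hand-written tokens for an object with one key, a, mapped to true. *)
let toks =
  [ KEYWORD "{"; STRING "a"; KEYWORD ":"; KEYWORD "true"; KEYWORD "}"; EOI ]

(* A literal brace in the grammar matches the first token. *)
let opens_object = match_keyword "{" (List.hd toks)

(* Binding a variable to the STRING token yields its contents. *)
let key = extract_string (List.nth toks 1)

(* A JSON string containing the text true is not the keyword true. *)
let string_matches = match_keyword "true" (STRING "true")
```

The last binding shows why keywords and strings are separate arms: the string "true" must not be mistaken for the literal true.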
    module Filter : sig
      type token_filter = (t * Loc.t) Stream.t -> (t * Loc.t) Stream.t
      type t
      val mk : (string -> bool) -> t
      val define_filter : t -> (token_filter -> token_filter) -> unit
      val filter : t -> token_filter
      val keyword_added : t -> string -> bool -> unit
      val keyword_removed : t -> string -> unit
    end
The Filter module provides filters over token streams. We don’t have a need for it in the JSON example, but it’s interesting to see how it is implemented in the default lexer and used in the OCaml parser. The argument to mk is a function indicating whether a string should be treated as a keyword (i.e. the literal string is used in the grammar), and the default lexer uses it to filter the token stream to convert identifiers into keywords. If we wanted the JSON parser to be extensible, we would need to take this into account; instead we’ll just stub out the functions:
    module Filter = struct
      type token_filter = (t * Loc.t) Stream.t -> (t * Loc.t) Stream.t
      type t = unit
      let mk _ = ()
      let filter _ strm = strm
      let define_filter _ _ = ()
      let keyword_added _ _ _ = ()
      let keyword_removed _ _ = ()
    end
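For contrast with the stubs, here is a rough sketch of the kind of filtering described above: turning identifiers into keywords when the grammar uses them as literals. Everything in it is an assumption made for illustration. The tok type (with an IDENT arm) and keyword_filter are hypothetical, locations are omitted, and it uses Seq from the standard library in place of Stream; the default lexer's actual implementation differs in detail.

```ocaml
(* Illustrative token type: unlike the JSON tokens it has an IDENT arm. *)
type tok = IDENT of string | KEYWORD of string | EOI

(* Hypothetical stand-in for a filter built from the predicate passed to
   Filter.mk: [is_kwd] says whether a string is used as a literal in the
   grammar. Identifiers that pass the test become keywords. *)
let keyword_filter (is_kwd : string -> bool) (toks : tok Seq.t) : tok Seq.t =
  Seq.map
    (function
      | IDENT s when is_kwd s -> KEYWORD s
      | t -> t)
    toks

(* Pretend the grammar mentions only these two literals. *)
let is_kwd s = List.mem s [ "let"; "in" ]

let filtered =
  List.of_seq
    (keyword_filter is_kwd
       (List.to_seq [ IDENT "let"; IDENT "x"; IDENT "in"; EOI ]))
```

Since the set of keywords depends on the grammar, an extensible grammar has to tell the filter when literals come and go, which is presumably what keyword_added and keyword_removed are for.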
Finally we have Lexer, which packages up the other modules and provides the actual lexing function. The lexing function takes an initial location and a character stream, and returns a stream of token and location pairs:
    module type Lexer = sig
      module Loc : Loc
      module Token : Token with module Loc = Loc
      module Error : Error

      val mk :
        unit ->
        (Loc.t -> char Stream.t -> (Token.t * Loc.t) Stream.t)
    end
I don’t want to go through the whole lexing function; it is not very interesting. But here is the main loop:
    let rec token c = lexer
      | eof -> EOI

      | newline -> next_line c; token c c.lexbuf
      | blank+ -> token c c.lexbuf

      | '-'? ['0'-'9']+ ('.' ['0'-'9']* )?
          (('e'|'E')('+'|'-')?(['0'-'9']+))? ->
            NUMBER (L.utf8_lexeme c.lexbuf)

      | [ "{}[]:," ] | "null" | "true" | "false" ->
          KEYWORD (L.utf8_lexeme c.lexbuf)

      | '"' ->
          set_start_loc c;
          string c c.lexbuf;
          STRING (get_stored_string c)

      | "$" ->
          set_start_loc c;
          c.enc := Ulexing.Latin1;
          let aq = antiquot c lexbuf in
          c.enc := Ulexing.Utf8;
          aq

      | _ -> illegal c
The lexer syntax is an extension provided by ulex; the effect is similar to ocamllex. The lexer needs to keep track of the current location and return it along with the token (next_line advances the current location; set_start_loc is for when a token spans multiple ulex lexemes). The lexer also needs to parse antiquotations, taking into account nested quotations within them.
(I think it is not actually necessary to lex JSON as UTF8. The only place that non-ASCII characters can appear is in a string. To lex a string we just accumulate characters until we see a double-quote, which cannot appear as part of a multibyte character. So it would work just as well to accumulate bytes. I am no Unicode expert though. This example was extracted from the Javascript parser in jslib, where I think UTF8 must be taken into account.)
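The byte-accumulation idea can be checked with a small sketch. In UTF-8 every byte of a multibyte sequence has its high bit set (it is at least 0x80), so it can never equal the double quote (0x22) or the backslash (0x5C). The function closing_quote below is a hypothetical helper written for this illustration, not code from the lexer in this post; it scans raw bytes and treats a backslash as escaping the next byte.

```ocaml
(* Scan the bytes of [s] from index [start], which should be the position
   just after an opening double quote. Return the index of the closing
   double quote, if there is one. A backslash escapes the following byte. *)
let closing_quote (s : string) (start : int) : int option =
  let n = String.length s in
  let rec go i =
    if i >= n then None
    else
      match s.[i] with
      | '"' -> Some i
      | '\\' -> go (i + 2)
      | _ -> go (i + 1)
  in
  go start

(* Sample input, written with hex escapes so the bytes are explicit:
   a JSON string containing a two-byte character (c3 a9), an escaped
   double quote, and a three-byte character (e2 82 ac), followed by
   trailing text outside the string. *)
let input = "\"caf\xc3\xa9 \\\" \xe2\x82\xac\" tail"

let close = closing_quote input 1

(* The raw bytes between the quotes, multibyte sequences left intact. *)
let body =
  match close with
  | Some i -> String.sub input 1 (i - 1)
  | None -> ""
```

The scan stops at the real closing quote and never splits a multibyte character, because it only ever reacts to two ASCII bytes.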
Hooking up the lexer

There are a handful of changes we need to make to call the custom lexer:
In Jq_parser we make the grammar with the custom lexer module, and open it so the token constructors are available; we also replace the INT and FLOAT cases with just NUMBER; for the other cases we used the same token constructor names as the default lexer so we don’t need to change anything.
    open Jq_lexer
    module Gram = Camlp4.PreCast.MakeGram(Jq_lexer)
    ...
        | n = NUMBER -> Jq_number (float_of_string n)
In Jq_quotations we have Camlp4.PreCast open (so references to Ast in the <:expr< >> quotations resolve), so EOI is Camlp4.PreCast.EOI; we want Jq_lexer.EOI, so we need to write it explicitly:
    json_eoi: [[ x = Jq_parser.json; `Jq_lexer.EOI -> x ]];
(Recall that the backtick lets us match a constructor directly; for some reason we can’t module-qualify EOI without it.)
That’s it.
I want to finish off this series next time by covering grammar extension, with an example OCaml syntax extension.
(You can find the complete code for this example here.)