Reading Camlp4, part 10: custom lexers

As a final modification to our running JSON quotation example, I want to repair a problem noted in the first post (the default lexer does not match the JSON spec) and in doing so demonstrate the use of custom lexers with Camlp4 grammars. We’ll parse UTF8-encoded JSON using the ulex library.

To use a custom lexer, we need to pass a module matching the Lexer signature (in camlp4/Camlp4/Sig.ml) to Camlp4.PreCast.MakeGram. (Recall that we get back an empty grammar which we then extend with parser entries.) Let’s look at the signature and its subsignatures, and our implementation of each:

Error

module type Error = sig
  type t
  exception E of t
  val to_string : t -> string
  val print : Format.formatter -> t -> unit
end

First we have a module for packaging up an exception so it can be handled generically (in particular it may be registered with Camlp4.ErrorHandler for common printing and handling). We have simple exception needs so we give a simple implementation:

module Error = struct
  type t = string
  exception E of string
  let print = Format.pp_print_string
  let to_string x = x
end
let _ = let module M = Camlp4.ErrorHandler.Register(Error) in ()

Token

Next we have a module defining the tokens our lexer supports:

module type Token = sig
  module Loc : Loc
  type t
  val to_string : t -> string
  val print : Format.formatter -> t -> unit
  val match_keyword : string -> t -> bool
  val extract_string : t -> string
  module Filter : ... (* see below *)
  module Error : Error
end

The type t represents a token. It can be anything we like; in particular it does not need to be a variant with arms KEYWORD, EOI, etc., although that is the conventional representation. We only need to provide the specified functions: to_string converts a token to a string, print prints it to a formatter, match_keyword determines whether it matches a string keyword (recall that we can use literal strings in grammars; this function is called to see if the next token matches a literal string), and extract_string extracts a string representation of it (called when you bind a variable to a token in a grammar, e.g. n = NUMBER). Here’s our implementation:

type token =
  | KEYWORD of string
  | NUMBER of string
  | STRING of string
  | ANTIQUOT of string * string
  | EOI

module Token = struct
  type t = token

  let to_string t =
    let sf = Printf.sprintf in
    match t with
      | KEYWORD s -> sf "KEYWORD %S" s
      | NUMBER s -> sf "NUMBER %s" s
      | STRING s -> sf "STRING \"%s\"" s
      | ANTIQUOT (n, s) -> sf "ANTIQUOT %s: %S" n s
      | EOI -> sf "EOI"

  let print ppf x = Format.pp_print_string ppf (to_string x)

  let match_keyword kwd = function
    | KEYWORD kwd' when kwd = kwd' -> true
    | _ -> false

  let extract_string = function
    | KEYWORD s | NUMBER s | STRING s -> s
    | tok ->
        invalid_arg
          ("Cannot extract a string from this token: " ^ to_string tok)

  module Loc = Camlp4.PreCast.Loc
  module Error = Error
  module Filter = ... (* see below *)
end

Not much to it. KEYWORD covers true, false, null, and punctuation; NUMBER and STRING are JSON numbers and strings; as we saw last time, antiquotations are returned in ANTIQUOT; finally, we signal the end of the input with EOI.
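
To see how the grammar machinery would use these functions, here is a small standalone sketch. It copies the token type and two of the functions, leaving out the Camlp4-dependent Loc, Error, and Filter submodules so that it runs with plain OCaml; the value names at the bottom are made up for illustration.

```ocaml
(* Standalone copy of the token type and two Token functions,
   without the Camlp4-dependent submodules. *)
type token =
  | KEYWORD of string
  | NUMBER of string
  | STRING of string
  | ANTIQUOT of string * string
  | EOI

let match_keyword kwd = function
  | KEYWORD kwd' when kwd = kwd' -> true
  | _ -> false

let extract_string = function
  | KEYWORD s | NUMBER s | STRING s -> s
  | ANTIQUOT _ | EOI -> invalid_arg "Cannot extract a string from this token"

(* A literal "{" in a grammar rule matches only the KEYWORD token... *)
let brace_matches = match_keyword "{" (KEYWORD "{")

(* ...and not a STRING token that happens to contain the same text. *)
let string_matches = match_keyword "{" (STRING "{")

(* Binding n = NUMBER in a rule yields the lexeme as a string. *)
let lexeme = extract_string (NUMBER "1.5e3")
```

Because match_keyword only ever succeeds on KEYWORD, a JSON string whose contents look like punctuation cannot be mistaken for structure.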

Filter

module Filter : sig
  type token_filter = (t * Loc.t) Stream.t -> (t * Loc.t) Stream.t
  type t
  val mk : (string -> bool) -> t
  val define_filter : t -> (token_filter -> token_filter) -> unit
  val filter : t -> token_filter
  val keyword_added : t -> string -> bool -> unit
  val keyword_removed : t -> string -> unit
end;

The Filter module provides filters over token streams. We don’t have a need for it in the JSON example, but it’s interesting to see how it is implemented in the default lexer and used in the OCaml parser. The argument to mk is a function indicating whether a string should be treated as a keyword (i.e. the literal string is used in the grammar), and the default lexer uses it to filter the token stream to convert identifiers into keywords. If we wanted the JSON parser to be extensible, we would need to take this into account; instead we’ll just stub out the functions:

module Filter = struct
  type token_filter = (t * Loc.t) Stream.t -> (t * Loc.t) Stream.t
  type t = unit
  let mk _ = ()
  let filter _ strm = strm
  let define_filter _ _ = ()
  let keyword_added _ _ _ = ()
  let keyword_removed _ _ = ()
end
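
For contrast with the stubs, here is a rough sketch of what a non-trivial filter of the kind described for the default lexer might look like: one that turns identifiers into keywords when the grammar uses them literally. This is not the Camlp4 implementation. It assumes a hypothetical IDENT token (the JSON lexer has none), uses Seq.t in place of Camlp4's token streams, and drops locations to stay self-contained.

```ocaml
(* Hypothetical token type: IDENT is not part of the JSON lexer. *)
type token = IDENT of string | KEYWORD of string | EOI

(* The filter state is just the keyword predicate handed to mk. *)
type t = { is_kwd : string -> bool }

let mk is_kwd = { is_kwd }

(* Rewrite IDENT tokens that the grammar uses literally into KEYWORD
   tokens. Seq.t stands in for (token * Loc.t) Stream.t here. *)
let filter t (tokens : token Seq.t) : token Seq.t =
  Seq.map
    (function
      | IDENT s when t.is_kwd s -> KEYWORD s
      | tok -> tok)
    tokens

(* Pretend the grammar mentions "true", "false" and "null" literally. *)
let example =
  let is_kwd s = List.mem s [ "true"; "false"; "null" ] in
  List.of_seq
    (filter (mk is_kwd) (List.to_seq [ IDENT "true"; IDENT "foo"; EOI ]))
```

Our JSON lexer emits KEYWORD directly for a fixed set of strings, which is why the identity filter is enough as long as the grammar is not extended.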

Lexer

Finally we have Lexer, which packages up the other modules and provides the actual lexing function. The lexing function takes an initial location and a character stream, and returns a stream of token and location pairs:

module type Lexer = sig
  module Loc : Loc
  module Token : Token with module Loc = Loc
  module Error : Error
  val mk : unit -> (Loc.t -> char Stream.t -> (Token.t * Loc.t) Stream.t)
end

I don’t want to go through the whole lexing function; it is not very interesting. But here is the main loop:

let rec token c = lexer
  | eof -> EOI
  | newline -> next_line c; token c c.lexbuf
  | blank+ -> token c c.lexbuf
  | '-'? ['0'-'9']+ ('.' ['0'-'9']*)? (('e'|'E') ('+'|'-')? (['0'-'9']+))? ->
      NUMBER (L.utf8_lexeme c.lexbuf)
  | ["{}[]:,"] | "null" | "true" | "false" ->
      KEYWORD (L.utf8_lexeme c.lexbuf)
  | '"' ->
      set_start_loc c;
      string c c.lexbuf;
      STRING (get_stored_string c)
  | "$" ->
      set_start_loc c;
      c.enc := Ulexing.Latin1;
      let aq = antiquot c lexbuf in
      c.enc := Ulexing.Utf8;
      aq
  | _ -> illegal c

The lexer syntax is an extension provided by ulex; the effect is similar to ocamllex. The lexer needs to keep track of the current location and return it along with the token (next_line advances the current location; set_start_loc is for when a token spans multiple ulex lexemes). The lexer also needs to parse antiquotations, taking into account nested quotations within them.
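
If the regular expression in the NUMBER rule is hard to read, the following hand-written scanner is a sketch of what it accepts, in plain OCaml without ulex. The function names are invented for this illustration, and it works on a string and an index rather than a ulex buffer.

```ocaml
let is_digit c = c >= '0' && c <= '9'

(* Index just past the run of digits starting at i. *)
let rec skip_digits s i =
  if i < String.length s && is_digit s.[i] then skip_digits s (i + 1) else i

(* Some j if s.[i .. j-1] is the longest match of the number rule
   starting at i, or None if no number starts there. *)
let scan_number s i =
  let len = String.length s in
  let at j c = j < len && s.[j] = c in
  (* optional leading minus *)
  let j = if at i '-' then i + 1 else i in
  let int_end = skip_digits s j in
  if int_end = j then None (* at least one digit is required *)
  else
    (* optional fraction: '.' followed by zero or more digits *)
    let frac_end =
      if at int_end '.' then skip_digits s (int_end + 1) else int_end
    in
    (* optional exponent: consumed only if at least one digit follows *)
    let exp_end =
      if at frac_end 'e' || at frac_end 'E' then
        let k = frac_end + 1 in
        let k = if at k '+' || at k '-' then k + 1 else k in
        let e = skip_digits s k in
        if e = k then frac_end else e
      else frac_end
    in
    Some exp_end

(* Convenience wrapper: the matched lexeme at the start of s, if any. *)
let lexeme s =
  match scan_number s 0 with
  | Some j -> Some (String.sub s 0 j)
  | None -> None
```

Like the rule it mirrors, this accepts a lexeme such as 10. with no digits after the decimal point, which is slightly more permissive than the JSON grammar.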

(I think it is not actually necessary to lex JSON as UTF8. The only place that non-ASCII characters can appear is in a string. To lex a string we just accumulate characters until we see a double-quote, which cannot appear as part of a multibyte character. So it would work just as well to accumulate bytes. I am no Unicode expert though. This example was extracted from the Javascript parser in jslib, where I think UTF8 must be taken into account.)
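
The byte-accumulation argument can be checked with a few lines of plain OCaml. In UTF-8 every byte of a multibyte sequence is 0x80 or higher, while the double-quote is 0x22, so searching for the quote byte cannot stop in the middle of a character. The function name below is hypothetical, and the sketch deliberately ignores backslash escapes, which a real JSON string lexer has to handle.

```ocaml
(* Collect bytes from position i (just past the opening quote) up to the
   next '"' byte. Returns the raw contents and the index after the
   closing quote, or None if the string is unterminated.
   Escape sequences such as \" are NOT handled in this sketch. *)
let scan_string_bytes s i =
  match String.index_from_opt s i '"' with
  | Some j -> Some (String.sub s i (j - i), j + 1)
  | None -> None

(* "café" encoded as UTF-8: the é is the two bytes 0xC3 0xA9. *)
let result = scan_string_bytes "caf\xc3\xa9\" tail" 0
```

The two bytes of the accented character pass through untouched, and the scan stops at the real closing quote.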

Hooking up the lexer

There are a handful of changes we need to make to call the custom lexer:

In Jq_parser we make the grammar with the custom lexer module, and open it so the token constructors are available; we also replace the INT and FLOAT cases with just NUMBER; for the other cases we used the same token constructor names as the default lexer so we don’t need to change anything.

open Jq_lexer

module Gram = Camlp4.PreCast.MakeGram(Jq_lexer)

...
    | n = NUMBER -> Jq_number (float_of_string n)
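
A single NUMBER case is enough because float_of_string accepts every shape of lexeme the lexer rule produces, whether or not it has a fraction or an exponent. A quick sketch, using a one-constructor stand-in for the AST type from earlier in the series (the helper name is made up):

```ocaml
(* Hypothetical one-constructor stand-in for the JSON AST type. *)
type t = Jq_number of float

(* What the grammar action does with the string bound to n. *)
let number_of_lexeme n = Jq_number (float_of_string n)

let integer_like = number_of_lexeme "42"
let with_exponent = number_of_lexeme "-1.5e3"
let trailing_dot = number_of_lexeme "10."
```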

In Jq_quotations we have Camlp4.PreCast open (so references to Ast in the <:expr< >> quotations resolve), so EOI is Camlp4.PreCast.EOI; we want Jq_lexer.EOI, so we need to write it explicitly:

json_eoi: [[ x = Jq_parser.json; `Jq_lexer.EOI -> x ]];

(Recall that the backtick lets us match a constructor directly; for some reason we can’t module-qualify EOI without it.)

That’s it.

I want to finish off this series next time by covering grammar extension, with an example OCaml syntax extension.

(You can find the complete code for this example here.)

