Reading Camlp4, part 10: custom lexers

As a final modification to our running JSON quotation example, I want to repair a problem noted in the first post—that the default lexer does not match the JSON spec—and in doing so demonstrate the use of custom lexers with Camlp4 grammars. We’ll parse UTF8-encoded Javascript using the ulex library.

To use a custom lexer, we need to pass a module matching the Lexer signature (in camlp4/Camlp4/Sig.ml) to Camlp4.PreCast.MakeGram. (Recall that we get back an empty grammar which we then extend with parser entries.) Let’s look at the signature and its subsignatures, and our implementation of each:

Error

    module type Error = sig
      type t
      exception E of t
      val to_string : t -> string
      val print : Format.formatter -> t -> unit
    end

First we have a module for packaging up an exception so it can be handled generically (in particular it may be registered with Camlp4.ErrorHandler for common printing and handling). We have simple exception needs so we give a simple implementation:

    module Error =
    struct
      type t = string
      exception E of string
      let print = Format.pp_print_string
      let to_string x = x
    end
    let _ = let module M = Camlp4.ErrorHandler.Register(Error) in ()
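As a rough illustration of what this module buys us, here is a stand-alone sketch that repeats the Error module without the Camlp4 registration, so it runs with the standard library alone. The helper `describe` is a hypothetical name introduced for this example, not something Camlp4 provides; it just shows a caller catching `Error.E` and rendering it with `to_string`.

```ocaml
(* Stand-alone copy of the Error module; the Camlp4.ErrorHandler
   registration is omitted so no Camlp4 installation is needed. *)
module Error = struct
  type t = string
  exception E of string
  let print = Format.pp_print_string
  let to_string x = x
end

(* Hypothetical helper (not part of Camlp4): run [f] and turn a lexer
   error into a printable message. *)
let describe (f : unit -> string) : string =
  try f () with
  | Error.E msg -> "lexer error: " ^ Error.to_string msg

let () =
  print_endline (describe (fun () -> raise (Error.E "illegal character")))
```

With the registration in place, Camlp4 would do the equivalent of `describe` for us whenever the exception escapes.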

Token

Next we have a module defining the tokens our lexer supports:

    module type Token = sig
      module Loc : Loc
      type t
      val to_string : t -> string
      val print : Format.formatter -> t -> unit
      val match_keyword : string -> t -> bool
      val extract_string : t -> string
      module Filter : ... (* see below *)
      module Error : Error
    end

The type t represents a token. It can be anything we like; in particular it need not be a variant with arms KEYWORD, EOI, etc., although that is the conventional representation. We only have to provide the specified functions: to_string converts a token to a string, print writes it to a formatter, match_keyword determines whether it matches a string keyword (recall that we can use literal strings in grammars; this function is called to see if the next token matches a literal string), and extract_string returns a string representation of it (called when you bind a variable to a token in a grammar, e.g. n = NUMBER). Here’s our implementation:

    type token =
      | KEYWORD of string
      | NUMBER of string
      | STRING of string
      | ANTIQUOT of string * string
      | EOI

    module Token =
    struct
      type t = token

      let to_string t =
        let sf = Printf.sprintf in
        match t with
          | KEYWORD s -> sf "KEYWORD %S" s
          | NUMBER s -> sf "NUMBER %s" s
          | STRING s -> sf "STRING \"%s\"" s
          | ANTIQUOT (n, s) -> sf "ANTIQUOT %s: %S" n s
          | EOI -> sf "EOI"

      let print ppf x = Format.pp_print_string ppf (to_string x)

      let match_keyword kwd =
        function
          | KEYWORD kwd' when kwd = kwd' -> true
          | _ -> false

      let extract_string =
        function
          | KEYWORD s | NUMBER s | STRING s -> s
          | tok ->
              invalid_arg
                ("Cannot extract a string from this token: " ^
                   to_string tok)

      module Loc = Camlp4.PreCast.Loc
      module Error = Error
      module Filter = ... (* see below *)
    end

Not much to it. KEYWORD covers true, false, null, and punctuation; NUMBER and STRING are JSON numbers and strings; as we saw last time antiquotations are returned in ANTIQUOT; finally we signal the end of the input with EOI.
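To make the roles of match_keyword and extract_string concrete, here is a small stand-alone sketch. It copies the token type and the two functions (leaving out Loc, Error, and Filter so that it runs without Camlp4) and checks the behavior a grammar relies on. The checks are my own illustration, not code from the example.

```ocaml
type token =
  | KEYWORD of string
  | NUMBER of string
  | STRING of string
  | ANTIQUOT of string * string
  | EOI

(* Called by the grammar to test the next token against a literal string. *)
let match_keyword kwd = function
  | KEYWORD kwd' when kwd = kwd' -> true
  | _ -> false

(* Called by the grammar when a rule binds a variable to a token. *)
let extract_string = function
  | KEYWORD s | NUMBER s | STRING s -> s
  | ANTIQUOT _ | EOI ->
      invalid_arg "Cannot extract a string from this token"

let () =
  (* A literal "{" in a rule matches only the KEYWORD "{" token... *)
  assert (match_keyword "{" (KEYWORD "{"));
  (* ...and not a string that happens to contain the same text. *)
  assert (not (match_keyword "{" (STRING "{")));
  (* Binding n = NUMBER yields the token's string payload. *)
  assert (extract_string (NUMBER "1.5e3") = "1.5e3");
  print_endline "token helpers behave as described"
```

Note that the number stays a string inside the token; converting it to a float is left to the grammar action.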

Filter

    module Filter : sig
      type token_filter = (t * Loc.t) Stream.t -> (t * Loc.t) Stream.t
      type t
      val mk : (string -> bool) -> t
      val define_filter : t -> (token_filter -> token_filter) -> unit
      val filter : t -> token_filter
      val keyword_added : t -> string -> bool -> unit
      val keyword_removed : t -> string -> unit
    end;

The Filter module provides filters over token streams. We don’t have a need for it in the JSON example, but it’s interesting to see how it is implemented in the default lexer and used in the OCaml parser. The argument to mk is a function indicating whether a string should be treated as a keyword (i.e. the literal string is used in the grammar), and the default lexer uses it to filter the token stream to convert identifiers into keywords. If we wanted the JSON parser to be extensible, we would need to take this into account; instead we’ll just stub out the functions:

    module Filter =
    struct
      type token_filter = (t * Loc.t) Stream.t -> (t * Loc.t) Stream.t
      type t = unit
      let mk _ = ()
      let filter _ strm = strm
      let define_filter _ _ = ()
      let keyword_added _ _ _ = ()
      let keyword_removed _ _ = ()
    end
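For comparison, the keyword-promotion behavior described for the default lexer could be sketched roughly as follows. This is a simplified illustration under several assumptions: the token type `tok` and the function `promote_keywords` are hypothetical names, locations are dropped, and the standard library's Seq stands in for Stream so the snippet runs on its own. It is not the default lexer's actual code.

```ocaml
(* Hypothetical token type for illustration only; the default Camlp4
   lexer has more constructors than this. *)
type tok =
  | IDENT of string
  | KWD of string
  | INT of string

(* Promote identifiers that the grammar uses as literal strings to
   keywords. [is_kwd] plays the role of the argument to [mk]. *)
let promote_keywords (is_kwd : string -> bool) (toks : tok Seq.t) : tok Seq.t =
  Seq.map
    (function
      | IDENT s when is_kwd s -> KWD s
      | t -> t)
    toks

let () =
  let is_kwd s = List.mem s ["let"; "in"] in
  let input = List.to_seq [IDENT "let"; IDENT "x"; INT "1"; IDENT "in"] in
  let output = List.of_seq (promote_keywords is_kwd input) in
  assert (output = [KWD "let"; IDENT "x"; INT "1"; KWD "in"])
```

Because the set of keywords depends on which literal strings appear in the grammar, an extensible grammar has to do this step after lexing; our JSON lexer can hard-code its keywords instead.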

Lexer

Finally we have Lexer, which packages up the other modules and provides the actual lexing function. The lexing function takes an initial location and a character stream, and returns a stream of token and location pairs:

    module type Lexer = sig
      module Loc : Loc
      module Token : Token with module Loc = Loc
      module Error : Error
      val mk : unit -> (Loc.t -> char Stream.t -> (Token.t * Loc.t) Stream.t)
    end

I don’t want to go through the whole lexing function; it is not very interesting. But here is the main loop:

    let rec token c = lexer
      | eof -> EOI
      | newline -> next_line c; token c c.lexbuf
      | blank+ -> token c c.lexbuf
      | '-'? ['0'-'9']+ ('.' ['0'-'9']* )?
          (('e'|'E')('+'|'-')?(['0'-'9']+))? ->
          NUMBER (L.utf8_lexeme c.lexbuf)
      | [ "{}[]:," ] | "null" | "true" | "false" ->
          KEYWORD (L.utf8_lexeme c.lexbuf)
      | '"' ->
          set_start_loc c;
          string c c.lexbuf;
          STRING (get_stored_string c)
      | "$" ->
          set_start_loc c;
          c.enc := Ulexing.Latin1;
          let aq = antiquot c lexbuf in
          c.enc := Ulexing.Utf8;
          aq
      | _ -> illegal c

The lexer syntax is an extension provided by ulex; the effect is similar to ocamllex. The lexer needs to keep track of the current location and return it along with the token (next_line advances the current location; set_start_loc is for when a token spans multiple ulex lexemes). The lexer also needs to parse antiquotations, taking into account nested quotations within them.
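If you don't have ulex at hand, the location bookkeeping can be illustrated with a hand-written sketch in plain OCaml. This is a deliberately reduced version and not the lexer from the example: it handles only punctuation, the three literals, and unsigned integers, it tracks a bare line number instead of a Loc.t, and the function name `lex` is my own.

```ocaml
type token = KEYWORD of string | NUMBER of string | EOI

(* Lex [s] into (token, line number) pairs. Strings, antiquotations,
   and full JSON number syntax are left out of this sketch. *)
let lex (s : string) : (token * int) list =
  let n = String.length s in
  let is_digit ch = ch >= '0' && ch <= '9' in
  let is_alpha ch = ch >= 'a' && ch <= 'z' in
  (* Index just past the run of characters satisfying [p], from [i]. *)
  let rec span p i = if i < n && p s.[i] then span p (i + 1) else i in
  let rec go i line acc =
    if i >= n then List.rev ((EOI, line) :: acc)
    else
      match s.[i] with
      | '\n' -> go (i + 1) (line + 1) acc         (* plays the role of next_line *)
      | ' ' | '\t' | '\r' -> go (i + 1) line acc  (* skip blanks *)
      | '{' | '}' | '[' | ']' | ':' | ',' as ch ->
          go (i + 1) line ((KEYWORD (String.make 1 ch), line) :: acc)
      | ch when is_digit ch ->
          let j = span is_digit i in
          go j line ((NUMBER (String.sub s i (j - i)), line) :: acc)
      | ch when is_alpha ch ->
          let j = span is_alpha i in
          let word = String.sub s i (j - i) in
          if List.mem word ["null"; "true"; "false"]
          then go j line ((KEYWORD word, line) :: acc)
          else failwith ("illegal word: " ^ word)
      | ch ->
          failwith (Printf.sprintf "illegal character %C on line %d" ch line)
  in
  go 0 1 []

let () =
  assert (lex "[1,\n true]" =
          [ (KEYWORD "[", 1); (NUMBER "1", 1); (KEYWORD ",", 1);
            (KEYWORD "true", 2); (KEYWORD "]", 2); (EOI, 2) ])
```

The real lexer does the same thing lazily, producing a stream rather than a list, and records a start and end position for each token rather than a single line number.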

(I think it is not actually necessary to lex JSON as UTF8. The only place that non-ASCII characters can appear is in a string. To lex a string we just accumulate characters until we see a double-quote, which cannot appear as part of a multibyte character. So it would work just as well to accumulate bytes. I am no Unicode expert though. This example was extracted from the Javascript parser in jslib, where I think UTF8 must be taken into account.)
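The byte-accumulation idea can be tried out with a short sketch. It relies on one property of UTF-8 that I am confident of: every byte of a multibyte sequence has its high bit set, so none of them can equal the quote byte 0x22. The function `scan_string` is a hypothetical name for this illustration and is not taken from jslib; it copies backslash escapes through without decoding them.

```ocaml
(* Scan the body of a string literal byte by byte, starting just after
   the opening quote at index [start]. Returns the raw bytes of the body
   and the index of the closing quote. *)
let scan_string (s : string) (start : int) : string * int =
  let n = String.length s in
  let buf = Buffer.create 16 in
  let rec go i =
    if i >= n then failwith "unterminated string"
    else
      match s.[i] with
      | '"' -> (Buffer.contents buf, i)
      | '\\' when i + 1 < n ->
          (* Keep the escape as-is so an escaped quote does not end the string. *)
          Buffer.add_char buf s.[i];
          Buffer.add_char buf s.[i + 1];
          go (i + 2)
      | ch -> Buffer.add_char buf ch; go (i + 1)
  in
  go start

let () =
  (* "é" is the two bytes C3 A9 in UTF-8; both are >= 0x80. *)
  String.iter (fun ch -> assert (Char.code ch >= 0x80)) "\xc3\xa9";
  let input = "\"caf\xc3\xa9 \\\"ok\\\"\" rest" in
  let body, close = scan_string input 1 in
  assert (body = "caf\xc3\xa9 \\\"ok\\\"");
  assert (input.[close] = '"')
```

This only covers finding the end of the string; decoding \u escapes into characters is a separate step that does need to know about the encoding.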

Hooking up the lexer

There are a handful of changes we need to make to call the custom lexer:

In Jq_parser we make the grammar with the custom lexer module, and open it so the token constructors are available; we also replace the INT and FLOAT cases with just NUMBER; for the other cases we used the same token constructor names as the default lexer so we don’t need to change anything.

    open Jq_lexer

    module Gram = Camlp4.PreCast.MakeGram(Jq_lexer)

    ...
        | n = NUMBER -> Jq_number (float_of_string n)

In Jq_quotations we have Camlp4.PreCast open (so references to Ast in the <:expr< >> quotations resolve), so EOI is Camlp4.PreCast.EOI; we want Jq_lexer.EOI, so we need to write it explicitly:

    json_eoi: [[ x = Jq_parser.json; `Jq_lexer.EOI -> x ]];

(Recall that the backtick lets us match a constructor directly; for some reason we can’t module-qualify EOI without it.)

That’s it.

I want to finish off this series next time by covering grammar extension, with an example OCaml syntax extension.

(You can find the complete code for this example here.)
