C++ Blog

Boost.Spirit : Complete Parser Design

Posted in boost, templates by Umesh Sirsiwal on December 25, 2009

Note that this post applies to Spirit.Classic or 2.0.

In the last post we discussed how to write a simple parser using Spirit. Most real-life parsers are a lot more complex: they require several rules, which are combined to form a full grammar. Spirit provides support for declaring a full grammar. Let us create a parser which looks for every mailto: tag, with its associated e-mail address, in an input. All Spirit grammars are derived from the grammar class. The following is the skeleton of a grammar definition:

 struct my_grammar : public grammar<my_grammar>
    {
        template <typename ScannerT>
        struct definition
        {
            rule<ScannerT>  r;
            definition(my_grammar const& self)  { r = /*..define here..*/; }
            rule<ScannerT> const& start() const { return r; }
        };
    };

my_grammar is derived from the grammar class using the curiously recurring template pattern (CRTP). For those who are new to CRTP, the name was coined by Jim Coplien, and Wikipedia has some details on it. Using CRTP, it is possible to achieve an effect similar to dynamic polymorphism without paying the cost of virtual function calls, because the base class knows the derived type at compile time. More on CRTP another day.
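As a rough illustration of the idea, here is a minimal CRTP sketch in plain C++ that does not use Spirit at all. The names shape, square, and rectangle are hypothetical and exist only for this example; the point is that the base class template calls into the derived class through a static_cast rather than through a virtual function.

```cpp
#include <string>

// Hypothetical base class template: it receives the derived type as a
// template argument, so calls are resolved at compile time.
template <typename Derived>
struct shape
{
    std::string describe() const
    {
        // Downcast to the derived type; no virtual dispatch is involved.
        Derived const& self = static_cast<Derived const&>(*this);
        return self.name() + " with area " + std::to_string(self.area());
    }
};

// Each derived class passes itself to the base, which is the
// "curiously recurring" part of the pattern.
struct square : shape<square>
{
    explicit square(int side) : side_(side) {}
    std::string name() const { return "square"; }
    int area() const { return side_ * side_; }
private:
    int side_;
};

struct rectangle : shape<rectangle>
{
    rectangle(int w, int h) : w_(w), h_(h) {}
    std::string name() const { return "rectangle"; }
    int area() const { return w_ * h_; }
private:
    int w_, h_;
};
```

Calling square(4).describe() goes through shape<square>::describe, which in turn calls square::name and square::area directly. Spirit's grammar<my_grammar> relies on the same trick to reach the nested definition struct of the derived grammar.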

Each grammar class must define a nested structure called definition. The definition must have a member function called start, which returns the starting parser rule. So let us start writing rules for the e-mail address parser. The mailto rule can be written as:

     mailTo = "mailto:" >> emailAddress;

From our previous post, emailAddress can be written as:

     emailAddress =  lexeme_d[ +alnum_p >> '@' >> +alnum_p >> *('.' >> +alnum_p)];

Of course the full input has things other than the mailto: tags. So we must skip other characters. You do that as follows:

       r = mailTo | anychar_p;

The above rule says: try to match the input against mailTo, and if that fails, match it against the any-character parser instead.

Combining these, the constructor of definition can now be written as:

      definition(my_grammar const& self)  { 
            r = mailTo | anychar_p;
            mailTo = "mailto:" >> emailAddress;
            emailAddress = lexeme_d[ +alnum_p >> '@' >> +alnum_p >> *('.' >> +alnum_p)];
      }

The first thing one notices is that the rule “r” refers to the rules “emailAddress” and “mailTo” before they are initialized. This works because a rule used inside another rule is held by reference and not by value, so the referred rule can be initialized at any time before parsing starts. This does complicate programming a little bit: it is the user’s responsibility to make sure that the referred rules never go out of scope. So emailAddress cannot be a local variable of the definition constructor. Typically, one declares all rules as members of the definition struct. The full grammar now becomes:

 struct my_grammar : public grammar<my_grammar>
    {
        template <typename ScannerT>
        struct definition
        {
            rule<ScannerT>  r;
            rule<ScannerT>  emailAddress, mailTo;
            definition(my_grammar const& self)  { 
                r = mailTo | anychar_p;
                mailTo = "mailto:" >> emailAddress;
                emailAddress = lexeme_d[ +alnum_p >> '@' >> +alnum_p >> *('.' >> +alnum_p)];
            }
            rule<ScannerT> const& start() const { return r; }
        };
    };

OK! Now we have the grammar. Let us use it.

    my_grammar g;
    if (parse(first, last, g, space_p).full)
        cout << "parsing succeeded\n";
    else
        cout << "parsing failed\n";

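Here first and last are iterators delimiting the input: first points to the first character and last points one past the last character. The full member of the result is true only if the parse succeeded and consumed the entire input.

Note that the above grammar matches exactly one e-mail address or any other single character. It can easily be modified to match all e-mail addresses.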

The new rule will be:

     r = *(mailTo | anychar_p);
