headerphoto

Overview

APG was originally designed to generate recursive-descent parsers directly from the ABNF grammar specification of Context-Free languages (RFC5234, since updated with RFC7405). The approach was to recognize that ABNF defines seven independent operations and that each of these operations could form the node of a recursive-descent parse tree. However, APG quickly evolved away from parsing the strictly Context-Free languages described by ABNF in a couple of significant ways.

This first was through disambiguation (see the ALT and REP operators description below.) A Context-Free language sentence may have more than one parse tree that correctly parses the sentence. (That is, different ways of phrasing the same sentence.) APG disambiguates, always selecting only a single, well-defined parse tree from the multitude.

Second, it was quickly realized that this method of defining a parse tree of node operators did not in any way require that the operators correspond to the ABNF operations. They could be expanded and enhanced in any way that might be convenient for the problem at hand. The first of these node modifications was to add syntactic predicates ( refs. 4 & 5 ) or "look ahead" operators. From here it was a simple matter to add a User-Defined Terminal (UDT). That is, a node operation that is hand-written by the user for a specific phrase-matching problem. The later, JavaScript version adds "look behind", back referencing and other operators to assist the pattern-matching engine built on it. These additional operators enhance the basic ABNF operators but do not change them, resulting in a superset of ABNF, or as often referred to here, SABNF.

Today, APG is a versatile, well-tested generator of parsers in C/C++, Java and JavaScript. Because it is based on ABNF, it is especially well suited to parsing the languages of many Internet technical specifications and, in fact, is the parser of choice for a number of large Telecom companies.

[top]

Parsing Basics

Disclaimer: This is not the place to learn about parser generators and formal language parsing in general. An Internet or online book store search will turn up many good books and publications for this. The references below are some of the ones I have found most useful but other newer books are available and should be considered. The discussion here is a simple, non-technical description of how APG works, not a discourse on parsing theory.

At its simplest, ABNF describes named phrases, a phrase being nothing more than an array of character codes (integers.) The named phrase definitions are called rules. For example, consider the rule,
phrase = "abc" That is, an ABNF parser would match the rule named "phrase" with the array of character codes, [97, 98, 99] The full set of operations is described below, but for the purposes of this primer, we will just note operators: / - alternate phrases, match either phrase
(space) - concatenation, the left and right phrases are concatenated
name - match the phrase defined by the named rule.

Taken together, a set of rules will define the complete set of phrases and alphabet characters that make up the language. The language sentences, themselves, are just the phrases defined by the first, or start rule - the rule name that forms the root node of the parse tree.

We can illustrate this with a simple example.

SENTENCE = (RED / GREEN) (PAPER / LEAVES)
RED = "red "
GREEN = "green "
PAPER = "paper"
LEAVES = "leaves"

This grammar defines:
alphabet: {" ", "a", "d", "e", "g", "l", "n", "p", "r", "s", "v"}
or in ASCII alphabet character codes (lower and upper case)
{32,65,68,69,71,76,78,80,82,83,86,97,100,101,103,108,110,112,114,115,118}
phrases: {"red ", "green ", "paper", "leaves"}
language (the set of sentences): {"red paper", "red leaves", "green paper", "green leaves"}

The syntax tree is shown in Figure 1. Figure 1.
Figure 1.
A recursive-descent parser makes a "depth-first" traversal of the nodes. That is, beginning at the top, root node, the traversal moves down to the left-most, previously unvisited node. From the terminal nodes the direction is reversed back up the branch.

The operation at each node is to give the parser instructions on a) what to do and b) where to go next. For example, when approached from above each ALT or CAT node simply instructs the parser to proceed on to the left-most, unvisited node below. When approached from below, ALT nodes select from alternate phrases and CAT nodes concatenate phrases. RNM nodes, when approached from above, expand the named rule to form a sub-tree within the root node parse tree. When approached from below, they pass the matched phrase on up the tree.

But terminal nodes have much simpler operations. They simply try to match a pre-defined phrase and return to the parent. It is worth noting here that APG has been designed from the beginning to be a phrase-recognition parser. The terminal nodes are not single-character recognizers. They are hand-written phrase recognizers which either succeed or fail to recognize the entire phrase that defines them.

In general one might say, terminal nodes are phrase recognizers and non-terminal nodes are phrase manipulators.

[top]

The ABNF Operators

This will be a brief description of the SABNF operators. A more complete description is given here.

The orginal ABNF operators:

  • ALT (/): alternates. Any of the alternate phrases is acceptable. If a match is found to two or more alternates, the phrasing of the matched sentence is ambiguous. APG resolves the ambiguity with a "first match wins" rule (also referred to as "prioritized-choice" in ref. 4.) That is, the alternates are tried from left to right and the first match that is found is kept and all remaining alternates ignored.
  • CAT (space): concatenation. All phrases are concatenated. If a match to any phrase in the concatenation fails, the entire CAT fails.
  • REP (n*m): repetition. REP succeeds if the defined phrase is matched a minimum of "n" times or up to a maximum of "m" times. (Short-hand notation, *, n* or *m, defaults to "n=0" and "m=infinity".) Related to the disambiguation rules for ALT, the repetition operator is "greedy", meaning that it always matches the longest possible string. It never backtracks to see if a shorter match would succeed. For this reason the rule, A = *"a" "a" would fail to match the string "aaa" even though a strictly Context-Free parser would succeed.
  • RNM (name): rule name. Match the phrase defined by the rule name.
  • TLS ("string"): terminal case-insensitive string. Match the defined string.
  • TBS (%d98.99): terminal case-sensitive string. Match the defined string.
  • TRG (%d48-57): character range. Match any character in the range.

The additional superset operators:

  • AND (&): positive look ahead. Examine the phrase following the &. Succeed on a match, fail otherwise. Note that this operator never consumes characters. It matches the empty string only.
  • NOT (!): negative look ahead. Examine the phrase following the !. Fail on a match, succeed otherwise. Note that this operator never consumes characters. It matches the empty string only.
  • UDT (u_name: User-Defined Terminals. This operator calls a user-written function to do the phrase matching.

JavaScript only:

  • BKA (&&): positive look behind. Examine the phrase preceeding the &&. Succeed on a match, fail otherwise. Note that this operator never consumes characters. It matches the empty string only.
  • BKN (!!): negativelook behind. Examine the phrase preceeding the !!. Fail on a match, succeed otherwise. Note that this operator never consumes characters. It matches the empty string only.
  • BKR (\name): back references. Matches the phrase previously matched by the named rule.
  • ABG (%^): beginning of string anchor. Matches the beginning position of the string. Matches position only. No characters are consumed.
  • ABE (%$): end of string anchor. Matches the end position of the string. Matches position only. No characters are consumed.

[top]

Rule Name Callbacks

Rule name callback functions allow the user to interact with the parser at each defined phrase. The RNM nodes will call a user-written function twice for each node, first on the downward traversal of the node before any parsing of the phrase has been done and then again on the upward traversal after the Parser's phrase recognition work for the node is done. There are several types of actions that rule callback functions can accomplish.

  • downward traversal (pre-parsing of the branch below the node): Administrative work meaning any type of table or data buffer set up or similar data handling that does not affect the parsing of the string. Side calculations that are done parallel to the parsing process.
  • downward traversal: Override the Parser completely. That is, do your own phrase matching and skip the parsing completely. In this case, the rule callback us functioning as a UDT. However, implementation of a UDT in this manner is not obvious and a little clumsy. UDTs were developed to overcome this shortcoming.
  • upward traversal (post-parsing of the branch below): Again, any administrative work that might be convenient to do parallel to the parsing process.
  • upward traversal: Examine the Parsers's phrase matching result and decide whether to accept or reject it. For example, there may be non-CFG requirements that need to be enforced. This might be done by outright rejection, returning no match instead of match. Or you have another chance here for overriding the phrase recognition completely, and returning some phrase other than what the Parser found.

For details about writing and using rule name callback functions see the User's Guide for the target language of your choice.

[top]

User-Defined Terminals (UDTs)

Let's slightly rewrite the example grammar above.
SENTENCE = (RED / GREEN) (PAPER / u_LEAVES)
RED = "red "
GREEN = "green "
PAPER = "paper"

The ABNF terminal LEAVES has now been replaced with the User-Defined Terminal u_LEAVES. Since the underscore is not a valid character in ABNF, the parser generator has no trouble recognizing it as a UDT(*). And since it is user written it needs no ABNF rule to define it.

The new syntax tree now has a new operator UDT, as shown in Figure 2. Figure 2.
Figure 2.
As the name implies, this is a terminal node. Its only operation is to call a user-written, phrase-recognition function and return the result to the parent node. The User's Guides have details on the prototype for these functions and how to identify them to the Parser.

u_LEAVES has some "safe" uses and some "unsafe" uses. Both are very powerful enhancements to the parser.

A "safe" use would be to write it so that it simply recognizes the phrase "leaves" only is optimized to do it much faster than the ABNF terminal does. This use does not change the language that the grammar defines in any way. It just does it much faster. For example, in the SIP example released with the APG code I've managed to increase the parsing speed by a factor of 10 without changing in any way the language the original ABNF grammar defined.

An "unsafe" use would be, for example, to have it look backwards in the sentence string and if it sees "red" then it only recognizes "faced" and if it sees "green" it only recognizes "cheese". That is, it completely alters the language that the parser recognizes in a context-sensitive way.

A search for discussions of handwritten parsers will turn up many, many criticisms of them. The primary complaint is that there is generally no way to know what, exactly, the language is that is being parsed. But handwritten parsers also have two important advantages. The first being that handwritten parsers are usually much faster. The second is that they are capable of parsing non-Context-Free Grammars. For probably both of these reasons the GNU gcc compiler is, in fact, a handwritten, recursive-descent parser[gcc]. And it is for both of these reasons that UDTs have been introduced to APG.

To recap, there are three types of uses envisioned for the UDTs.

  • "safe" - to simply speed up the parsing process.
  • "unsafe" - to incorporate a few non-CFG language requirements into a grammar that is mostly CFG.
  • "totally dangerous" - to simply provide a recursive-descent backbone to a completely handwritten parser. UDTs are normally terminal nodes. However, there is a backdoor in APG to let them move on down the syntax tree. See the User's Guides for the Parser functions vExecuteRule() and vExecuteUdt(). You should really know what you are doing before using them, but you could build a parser for a completely arbitrary language with no ABNF grammar to guide it at all.

(*) UDT names can begin with either the "u_" or "e_" prefix. The difference between them is important. In order for the parser generator to determine the grammar's attributes it is very important for it to know which rules accept empty strings and which don't. Since UDTs are user written, there is no way for the parser generator to know whether a given UDT will or won't accept empty strings. Therefore, the convention has been adopted that only UDTs whose names begin with "e_" are allowed to accept empty strings. Those beginning with "u_" are not allowed to accept empty strings. This convention is enforced by the parser. That is, if a user-written function, u_myfunc() for example, returns a phrase length of zero, the Parser will fail with a terminal error and exit with a call to the currently installed alert handler.

[top]

Attributes

One of the first things you learn about recursive-descent parsers is that you can't parse a left-recursive grammar.
Start = Start "s" / "y" The parser will recurse infinitely (or more practically, until a stack overflow exception is issued) before it ever recognizes a single character of a single phrase. Left recursiveness is considered here to be an attribute of the grammar. It is an important attribute to know about, since it is a fatal flaw in the grammar. But there are other, equally important attributes.

In the following two sections below I'll define the attribute types and the recursion types which you will need to understand some of the information displayed by the parser generator. However, if you want to know the nitty gritty of how it is all done you will have to read the code.

[top]

Attribute Definitions

  • Left Recursiveness (fatal) -
    Start = Start "s" / "y" Fatal as discussed above.
  • Right Recursiveness -
    Start = "s" Start / "y" Not fatal but nice to know.
  • Nested Recursiveness -
    Start = "s" Start "s" / "y" Not fatal but important to know, since it is what distiguishes context-free grammars from regular grammars If this attribute is false the grammar is regular, if true the grammar is context-free.
  • Cyclic (fatal) -
    Start = Start Valid grammar syntax but defines no phrases.
  • Infinite (fatal)-
    Start = "s" Start This defines only infinite string phrases. Therefore, the parser will fail for every finite input string or sentence.
  • Empty -
    Start = "s" Start / "" The language described by this grammar includes empty strings. Not fatal, but knowing whether a rule's phrases can be empty or not is critical for determining the recursive attributes.

Note that it is possible for a rule to have more than one true attribute at the same time. For example, the infinite and empty rules above are also right-recursive. The other important thing to know is that these are aggregate attributes. That is, they only mean that an attribute can happen, not that it always does. For example, the empty example grammar can accept the empty string "", but it can also accept the non-empty string "s".

[top]

Types of Recursion

In the parser generator's displayed results you will see, in addition to the attributes, the rule names listed under several different types. These are:

  • N_RECURSIVE - non-recursive. The rule never refers to itself.
  • R_RECURSIVE - recursive. The rule refers to itself.
  • MR_RECURSIVE - mutually recursive. Sets of rules. Within each set, each rule refers to every other rule in the set including itself.
  • NMR_RECURSIVE - non-recursive but refers to one or more mutually-recursive rules.
  • RMR_RECURSIVE - recursive but also refers to one or more mutually-recursive rule sets of which it is not a member.

The reasons for these distinctions in recursive rule types is that they play an important role in the determination of the attributes. In particular, the mutually recursive sets create a catch-22 situation in that they present the seemingly impossible task of needing the attributes of one member of the set in order to determine the attributes of the others and vice versa. There is a way around this, which I won't go into here.

[top]

References

[1] Alfred V. Aho, Ravi Sethi, Monica S. Lam and Jeffrey D. Ullman, 
    "Compilers: Principles, Techniques, and Tools",
    2nd Edition, Addison-Wesley, 2007
    
[2] John E. Hopcroft and Jeffrey D. Ullman, 
    "Introduction to Automata Theory, Languages and Computation", 
    Addison-Wesley, 1979
    
[3] R Gregory Taylor, 
    "Models of Computation and Formal Languages", 
    Oxford University Press, 1998
    
[4] Bryan Ford, 
    "Parsing Expression Grammars: 
    A Recognition-Based Syntactic Foundation", 
    Proceedings 31st ACM Symposium on 
    Principles of Programming Languages, 
    pp. 111-122, 2004
    
[5] Parr, Terence J. and Quong, Russell W.,
    "Adding Semantic and Syntactic Predicates To LL(k): pred-LL(k)",
    Proceedings of the 5th International Conference on Compiler Construction,
    pp. 263-274, 1994