Parsing type constructors with arguments and comparison expressions

The problem

In C-family languages, type arguments are usually delimited by angle brackets, like this:

Array<String>

In the context of an expression, angle brackets can be confused with infix comparison operators:

x = Foo<Bar>(1 + 2)

x = foo < bar > (1 + 2)

Should the above be parsed as a constructor call or an infix comparison expression?

Solutions

This problem can be solved on different levels:

  • Lexer level
  • Parser level
  • Language design level

Lexer-level solutions

The lexer hack

During scanning, the lexer keeps track of type names: every time it encounters a definition like class Foo, it adds the type name to a symbol table. Then, whenever the lexer encounters an identifier, it checks whether that identifier is present in the symbol table and, depending on the result, emits either an IDENTIFIER or a TYPE_IDENTIFIER token.

This removes the ambiguity for the parser, since TYPE_IDENTIFIER '<' is clearly a start of a type expression, and IDENTIFIER '<' is clearly a start of a comparison.

This approach of keeping a symbol table of types in the lexer and issuing separate IDENTIFIER and TYPE_IDENTIFIER tokens is known as the lexer hack.

There are several problems with this solution:

  • If your language allows forward references then you need two separate passes:
    1. The first pass to construct the symbol table
    2. A second pass to issue tokens
  • Depending on how complex your type declaration syntax is, you might need to do some actual parsing in your tokenizer to extract the type's identifier. For example, in C, type declarations can get quite complicated:
    typedef int (*foo)(int bar, int, baz); /* Valid C89 and C99 */
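As an illustration, here is a minimal sketch of the lexer hack in Python, assuming a toy language where class Foo declares a type; the token kinds and the one-keyword grammar are made up for the example:

```python
import re

# Sketch of the lexer hack for a toy language where `class Foo` declares
# a type. The token kinds and the one-keyword grammar are assumptions.

TOKEN_RE = re.compile(r"\s*(?:(class\b)|([A-Za-z_]\w*)|(.))")

def tokenize(source):
    types = set()              # the lexer's symbol table of type names
    tokens = []
    expect_type_name = False   # True right after the `class` keyword
    for keyword, ident, symbol in TOKEN_RE.findall(source):
        if keyword:
            tokens.append(("KEYWORD", "class"))
            expect_type_name = True
        elif ident:
            if expect_type_name:
                types.add(ident)           # remember the declared type
                expect_type_name = False
            kind = "TYPE_IDENTIFIER" if ident in types else "IDENTIFIER"
            tokens.append((kind, ident))
        elif symbol.strip():               # skip stray whitespace
            tokens.append(("SYMBOL", symbol))
    return tokens
```

With this in place, TYPE_IDENTIFIER followed by < signals a type argument list, while IDENTIFIER followed by < signals a comparison. It also exhibits the forward-reference problem from the list above: Foo<Bar>(1 + 2) appearing before class Bar would tokenize Bar as a plain IDENTIFIER.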
    
Lexer whitespace sensitivity

Depending on whitespace surrounding < and > symbols, the lexer can issue different tokens. For example:

Regular expression | Token
"<\s"              | LESS_THAN
"<"                | LEFT_ANGLE_BRACKET

The reality might be more complicated than this, depending on how you want to handle comments inside a type argument list.

For example:

Array</*o hai*/String>

One language that takes this approach is Swift. Swift uses whitespace to assign meaning to operator tokens.

This solution can be considered language-level, since it affects the set and meaning of accepted programs. This solution is also applicable to lexer-less parsing techniques, like PEG.

Parser-level solutions

Prioritized choice (PEG)

The parsing expression grammar (PEG) formalism has an interesting feature: the prioritized choice operator /, which, unlike the usual choice operator |, allows one grammar rule to be prioritized over another.

For example, it allows you to say that all expressions where < and > are balanced should be type-expressions, otherwise they should be value expressions:

constructor_call <-
    identifier '<' type_parameters '>' '(' parameters ')'
infix_expression <-
    identifier (('<' / '>') identifier)*

expression <-
    # `constructor_call` takes priority if ambiguity arises:
    constructor_call / infix_expression / '(' expression ')'
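A backtracking recursive-descent parser can emulate prioritized choice directly: try constructor_call first, and fall back to infix_expression only if it fails. The sketch below simplifies the grammar to a single type argument and a single call argument, and assumes a (kind, text) token format:

```python
# Prioritized choice emulated by ordered backtracking: try the
# constructor-call rule first; only if it fails, try the infix rule.
# Tokens are (kind, text) pairs; the grammar is simplified to one
# type argument and one call argument (an assumption for brevity).

def parse_expression(tokens):
    # `or` implements PEG's `/`: constructor_call takes priority.
    return parse_constructor_call(tokens) or parse_infix_expression(tokens)

def parse_constructor_call(tokens):
    # identifier '<' identifier '>' '(' identifier ')'
    shape = ["ident", "<", "ident", ">", "(", "ident", ")"]
    if [kind for kind, _ in tokens] == shape:
        return ("constructor_call", tokens[0][1], tokens[2][1], tokens[5][1])
    return None  # rule failed; the caller falls back to the next choice

def parse_infix_expression(tokens):
    # identifier (('<' | '>') identifier)*
    if not tokens or tokens[0][0] != "ident":
        return None
    operands, i = [tokens[0][1]], 1
    while i + 1 < len(tokens) and tokens[i][0] in ("<", ">"):
        if tokens[i + 1][0] != "ident":
            return None
        operands += [tokens[i][0], tokens[i + 1][1]]
        i += 2
    return ("infix", operands) if i == len(tokens) else None
```

For an input like foo < bar > (x), where both rules could match, constructor_call wins because it is tried first.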

Generalized parsing

If PEGs allow you to make a choice between two production rules, generalized parsers (such as GLR) allow you to defer the choice until after parsing. In the case of a syntactic ambiguity, a GLR parser will produce two parse trees: one for the type expression, another for the infix expression. At a later stage of compilation (once you have constructed your symbol table) you can discard one of the parse trees, depending on whether the identifier before the < symbol is in the symbol table of types.

A disadvantage of this approach is that a GLR parser will produce multiple parse trees for every ambiguity in your grammar, whether you know about the ambiguity in advance or not.
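Sketched in Python, the post-parse disambiguation step might look like this; the parse-forest representation below is an illustrative assumption, not what any particular GLR tool produces:

```python
# After a generalized parser returns every possible tree, a later pass
# can discard the implausible ones using the symbol table of type names.
# The parse-forest representation here is an illustrative assumption.

def disambiguate(forest, type_names):
    """Keep only the trees consistent with the symbol table."""
    def plausible(tree):
        kind, head = tree[0], tree[1]
        if kind == "constructor_call":
            return head in type_names    # Foo<...> needs Foo to be a type
        return head not in type_names    # a comparison compares values
    return [tree for tree in forest if plausible(tree)]
```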

Hand-written parser

In the case of a hand-written parser, you can do something similar to the lexer hack, but at the parser level. As you parse, you construct a symbol table of types; whenever you encounter an identifier, you look it up in that table. If it's there, you continue parsing a type expression; otherwise, you parse an infix expression.
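A sketch of this parser-level symbol table, assuming a toy token stream of plain strings and a grammar reduced to class declarations, constructor calls with one type argument, and comparisons:

```python
# Parser-level analogue of the lexer hack: the parser records type names
# as it parses declarations and consults them at the `identifier '<'`
# decision point. The token stream (plain strings) and the tiny grammar
# are assumptions made for the sketch.

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos, self.types = tokens, 0, set()

    def bump(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def parse_statement(self):
        if self.tokens[self.pos] == "class":
            self.bump()                      # 'class'
            self.types.add(self.bump())      # record the declared type name
            return ("class_decl",)
        return self.parse_expression()

    def parse_expression(self):
        name = self.bump()
        if self.pos < len(self.tokens) and self.tokens[self.pos] == "<":
            if name in self.types:           # the decision point
                return self.parse_constructor_call(name)
            return self.parse_comparison(name)
        return ("var", name)

    def parse_constructor_call(self, name):
        self.bump()                          # '<'
        argument = self.bump()               # one type argument, simplified
        self.bump()                          # '>'
        return ("constructor_call", name, argument)

    def parse_comparison(self, name):
        operator = self.bump()               # '<'
        rhs = self.bump()
        return ("comparison", name, operator, rhs)
```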

This is how, for example, the hand-written Kotlin parser works.

Language-level solutions

The root of our problem is the following: in the same context, we want to use the < and > symbols both as infix operators and as parenthesis-like delimiters. The ambiguity is therefore inherent to the problem.

A language-level solution is to use a different syntax for one of the two cases, deciding once and for all whether < and > are infix operators or parenthesis-like delimiters.

Scala

One example of this is Scala, which uses square brackets for type parameters, avoiding the problem altogether:

x = Foo[Bar](1 + 2)

x = foo < bar > (1 + 2)

Because < and > are always parsed as infix operators, while [ and ] are always parsed as parenthesis-like matching delimiters, there’s no confusion.

Rust

Rust is one of the languages that continue the tradition of using the < and > symbols for both type parameters and infix comparison operators. However, in expression context it requires the programmer to explicitly disambiguate by writing :: before the < symbol (a construct colloquially known as the turbofish). For example:

let x: foo::Foo<Bar> = foo::Foo::<Bar>(1 + 2);

Inside the type annotation you can write Foo<Bar>, because it is a type-only context; in an expression context, however, you need to write Foo::<Bar>.

Discussion

You can classify the above solutions into syntactic and semantic ones.

Syntactic solutions are those that make the decision solely based on the syntax, in our case:

  • Lexer whitespace sensitivity
  • Prioritized choice (PEG)

Each of the above accepts a slightly different language.

Semantic solutions are those that rely on a symbol table of type names to disambiguate:

  • The lexer hack
  • Generalized parsing
  • Hand-written parser

These three solutions accept the same language.

If you’re implementing an existing language, you should carefully read the spec (if one exists) to see which of the solutions could be applicable. However, if you’re designing a new language, you have the power to come up with a syntax that is not conceptually ambiguous. Scala and Rust are good examples of the power of that approach.