In C-family languages type arguments are usually delimited by “angular brackets” like this:
In the context of an expression, the “angular brackets” could be confused with infix comparison operators:
```
x = Foo<Bar>(1 + 2)
x = foo < bar > (1 + 2)
```
Should the above be parsed as a constructor call or an infix comparison expression?
This problem can be solved on different levels:
- Lexer level
- Parser level
- Language design level
During scanning, the lexer keeps track of type names: every time it encounters a definition like `class Foo`, it adds the type name to a symbol table. Then, whenever the lexer encounters an identifier, it checks whether that identifier is present in the symbol table, and depending on the result issues either a `TYPE_IDENTIFIER` or a plain `IDENTIFIER` token.
This removes the ambiguity for the parser, since `TYPE_IDENTIFIER '<'` is clearly the start of a type expression, while `IDENTIFIER '<'` is clearly the start of a comparison.
The solution of keeping a symbol table of types in the lexer and issuing separate `TYPE_IDENTIFIER` tokens is also known as “the lexer hack”.
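As a rough sketch (toy grammar, hypothetical token names), a lexer implementing this hack might look like:

```python
import re

# A rough sketch of "the lexer hack" (toy grammar, hypothetical token
# names): the lexer records every `class Foo` definition in a symbol
# table and classifies later identifiers against it.

TOKEN_RE = re.compile(r"\s*(class|[A-Za-z_]\w*|\d+|[<>()=+])")

def lex(src):
    types = set()   # the lexer's symbol table of type names
    tokens = []
    prev = None
    pos = 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if not m:
            break
        text = m.group(1)
        pos = m.end()
        if prev == "class":
            types.add(text)  # `class Foo` defines type `Foo`
        if text == "class":
            kind = "CLASS"
        elif text[0].isalpha() or text[0] == "_":
            kind = "TYPE_IDENTIFIER" if text in types else "IDENTIFIER"
        elif text.isdigit():
            kind = "NUMBER"
        else:
            kind = text  # punctuation tokens stand for themselves
        tokens.append((kind, text))
        prev = text
    return tokens

tokens = lex("class Foo x = Foo < y")
```

Given `class Foo x = Foo < y`, the second `Foo` comes out as a `TYPE_IDENTIFIER` while `y` stays a plain `IDENTIFIER`, which is exactly the distinction the parser needs.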
There are several problems with this solution:
- If your language allows forward references then you need two separate passes:
- The first pass to construct the symbol table
- A second pass to issue tokens
- Depending on how complex your type declaration syntax is, you might need to do some actual parsing in your tokenizer to extract the identifier of your type. For example, in C, type declarations can get quite complicated:
typedef int (*foo)(int bar, int, int baz); /* valid C89 and C99 */
Depending on the whitespace surrounding the `<` and `>` symbols, the lexer can issue different tokens. For example, in `foo < bar` the spaced `<` can be lexed as a comparison operator, while in `Foo<Bar>` the unspaced `<` can be lexed as the opening delimiter of a type argument list.
Reality might be more or less complicated, depending on how you want to handle things like comments inside a type argument list.
One language that takes this approach is Swift. Swift uses whitespace to assign meaning to operator tokens.
This solution can be considered language-level, since it affects the set and meaning of accepted programs. This solution is also applicable to lexer-less parsing techniques, like PEG.
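To illustrate the idea (a toy rule, not Swift’s actual whitespace grammar), a whitespace-sensitive classifier for angle-bracket tokens could look like:

```python
# A toy illustration of whitespace-sensitive lexing. This is NOT Swift's
# actual rule, just the general idea: a `<` or `>` with adjacent
# whitespace is treated as an operator, a glued one as a type bracket.

def classify_angle(src, i):
    assert src[i] in "<>"
    spaced = ((i > 0 and src[i - 1].isspace())
              or (i + 1 < len(src) and src[i + 1].isspace()))
    return "OPERATOR" if spaced else "TYPE_BRACKET"

spaced_kind = classify_angle("foo < bar", 4)   # `<` surrounded by spaces
glued_kind = classify_angle("Foo<Bar>", 3)     # `<` glued to identifiers
```

Under this rule `foo < bar` lexes as a comparison and `Foo<Bar>` as a type application, at the cost of making whitespace significant.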
The parsing expression grammar (PEG) formalism has an interesting feature, the prioritized choice operator `/`, which, unlike the usual choice operator `|`, allows one grammar rule to be prioritized over another.
For example, it allows you to say that all expressions where `<` and `>` are balanced should be type expressions, and otherwise they should be value expressions:

```
constructor_call <- identifier '<' type_parameters '>' '(' parameters ')'
infix_expression <- identifier (('<' / '>') identifier)*
expression <-
    # `constructor_call` takes priority if ambiguity arises:
    constructor_call / infix_expression / '(' expression ')'
```
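Prioritized choice maps naturally onto ordered backtracking in code. The following sketch (drastically simplified rules, hypothetical tree shapes) tries `constructor_call` first and falls back to `infix_expression` only if it fails:

```python
import re

# A sketch of PEG-style prioritized choice as ordered backtracking.
# Rule names follow the grammar above; the tree shapes are hypothetical.

def tokenize(src):
    return re.findall(r"[A-Za-z_]\w*|\d+|[<>()+]", src)

def constructor_call(toks, i):
    # identifier '<' identifier '>' '(' ... ')'
    if (i + 4 < len(toks) and toks[i].isidentifier() and toks[i + 1] == "<"
            and toks[i + 2].isidentifier() and toks[i + 3] == ">"
            and toks[i + 4] == "("):
        try:
            close = toks.index(")", i + 4)
        except ValueError:
            return None  # unbalanced parentheses: the rule fails
        return ("ConstructorCall", toks[i], toks[i + 2]), close + 1
    return None

def infix_expression(toks, i):
    if i < len(toks) and toks[i].isidentifier():
        return ("Infix", toks[i:]), len(toks)
    return None

def expression(src):
    toks = tokenize(src)
    # PEG's prioritized choice `/` becomes short-circuiting `or`:
    # `constructor_call` wins whenever both rules would match.
    result = constructor_call(toks, 0) or infix_expression(toks, 0)
    return result[0] if result else None

tree = expression("Foo < Bar > (1 + 2)")
```

Because the choice is ordered, the ambiguous `Foo < Bar > (1 + 2)` always parses as a constructor call, exactly as the grammar comment promises.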
If PEGs allow you to make a choice between two production rules, generalized parsers (GLR) allow you to defer the choice until after parsing. In the case of syntactic ambiguity, a GLR parser will produce two parse trees: one for the type expression, another for the infix expression. At a later stage of compilation (when you construct your symbol table) you can discard one of the parse trees, depending on whether the identifiers around the `<` symbol are in the symbol table of types or not.
A disadvantage of this approach is that a GLR parser will produce multiple parse trees for all ambiguities in your grammar—whether you know about them in advance or not.
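The post-parse disambiguation step can be sketched as follows (the tree shapes here are hypothetical, not the output of any particular GLR tool):

```python
# A sketch of GLR-style post-parse disambiguation: the parser kept both
# interpretations of `foo < bar > (...)`; once the symbol table of types
# is known, one of them is discarded. Tree shapes are hypothetical.

def disambiguate(forest, type_table):
    kept = []
    for tree in forest:
        kind, head = tree[0], tree[1]
        if kind == "ConstructorCall" and head not in type_table:
            continue  # head is not a known type: can't be a constructor
        if kind == "Comparison" and head in type_table:
            continue  # head names a type: can't be a value comparison
        kept.append(tree)
    return kept

forest = [("ConstructorCall", "Foo", "Bar"), ("Comparison", "Foo", "Bar")]
surviving = disambiguate(forest, {"Foo", "Bar"})
```

With `Foo` registered as a type, only the constructor-call interpretation survives.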
In case of a hand-written parser, you can do something similar to the lexer hack, but at the parser level. As you parse, you construct a symbol table for types, and whenever you encounter an identifier, you look it up in the symbol table and if it’s there you can continue parsing the type expression, otherwise you parse an infix expression.
This is how, for example, the hand-written Kotlin parser works.
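A minimal sketch of this parser-level variant (toy token stream, hypothetical node names): the parser owns the type table and consults it when it sees an identifier followed by `<`:

```python
# A sketch of the parser-level hack: unlike the lexer hack, the symbol
# table lives in the parser, which decides how to continue after seeing
# `identifier <`. Token stream and node names are hypothetical.

class Parser:
    def __init__(self, toks):
        self.toks = toks
        self.types = set()  # the parser's symbol table of type names

    def parse(self):
        out = []
        i = 0
        while i < len(self.toks):
            t = self.toks[i]
            if t == "class":
                # a definition adds a type name to the table
                self.types.add(self.toks[i + 1])
                out.append(("ClassDef", self.toks[i + 1]))
                i += 2
            elif (t in self.types and i + 1 < len(self.toks)
                    and self.toks[i + 1] == "<"):
                # known type followed by `<`: parse a type expression
                out.append(("TypeExpr", t))
                i += 2
            else:
                # anything else: parse an ordinary (infix) expression
                out.append(("Expr", t))
                i += 1
        return out

nodes = Parser(["class", "Foo", "Foo", "<", "foo", "<"]).parse()
```

Here the same `<` token is consumed as a type-argument delimiter after `Foo` but left to the expression parser after `foo`, with no special tokens needed from the lexer.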
The root of our problem is the following issue: in the same context, we want to use the `<` and `>` symbols both as infix operators and as parenthesis-like delimiters. So the ambiguity is inherent to the problem.
A language-level solution would be to use a different syntax for one of the cases and decide once and for all whether `<` and `>` are infix operators or parenthesis-like delimiters.
One example of this is Scala, which uses square brackets for type parameters, avoiding the problem altogether:

```
x = Foo[Bar](1 + 2)
x = foo < bar > (1 + 2)
```
Since `<` and `>` are always parsed as infix operators, while `[` and `]` are always parsed as matching parenthesis-like delimiters, there is no confusion.
Rust is one of the languages that continue the tradition of using the `<` and `>` symbols for both type parameters and infix comparison operators. However, in expression context it requires the programmer to disambiguate them explicitly by writing the `::<` sequence of symbols (the so-called “turbofish”). For example:
```
let x: foo::Foo<Bar> = foo::Foo::<Bar>(1 + 2);
```
Inside the type annotation you can write `Foo<Bar>`, because it is a type-only context; in an expression context, however, you need to write `Foo::<Bar>`.
You can classify the above solutions into syntactic and semantic ones.
Syntactic solutions are those that make the decision solely based on the syntax, in our case:
- Lexer whitespace sensitivity
- Prioritized choice (PEG)
Each of the two accepts a slightly different language.
Semantic solutions are those that rely on a symbol table of type names to disambiguate:
- The lexer hack
- Generalized parsing
- Hand-written parser
These three solutions accept the same language.
If you’re implementing an existing language, you should carefully read the spec (if one exists) to see which of the solutions could be applicable. However, if you’re designing a new language, you have the power to come up with a syntax that is not conceptually ambiguous. Scala and Rust are good examples of the power of that approach.