Programming Languages

As syntaxtrees may define arbitrary tree structures, some kind of validation is necessary to ensure that certain trees conform to certain programming languages. The validation concept is losely based on XML Schema and RelaxNG, the syntax of the latter is also used to describe the grammars in a user friendly textformat.

The traditional approach

A somewhat typical grammar to represent an if statement with an optional else-statement in a nondescript language could look very similar to this:

if        ::=  'if' <expr> 'then' <stmt> ['else' <stmt>]
expr      ::=  '(' <expr> <binOp> <expr> ')'
               | <var_name>
               | <val_const>
expr_list ::=  <expr>
               | <expr> ',' <expr_list>
               | ''
stmt      ::=  <var_name> = <expr>
               | <var_name> '(' <expr_list> ')'

This approach works fine for typical compilers: They need to derive a syntax tree from any stream of tokens. It is therefore important to keep an eye on all sorts of syntactical elements. This comes with its very own set of problems:

  1. The role of whitespace has to be specified.
  2. Some separating characters have to be introduced very carefully. This is usually done using distinctive syntactic elements that are not allowed in variable names (typically all sorts of brackets and punctuation).
  3. Handing out proper error messages for syntactically incorrect documents is hard. A single missing character may change the semantics of the whole document. This implies that semantic analysis is usually only possible on a syntactically correct document.

The BlattWerkzeug approach

But BlattWerkzeug is in a different position: It is not meant to create a syntax tree from some stream of tokens but rather begins with an empty syntax tree. This frees it from many of the problems that have been mentioned above:

  1. There is no whitespace, only the structure of the tree.
  2. There is no need for separation characters, only the structure of the tree.
  3. Syntax errors equate to missing nodes and can be communicated very clearly. Semantic analysis does not need to rely on heuristics on how the tree could change if the “blanks” were filled in.

This entirely shifts the way one has to reason about the validation rules inside of BlattWerkzeug: There is no need to worry about the syntactic aspects of a certain language, the “grammar” that is required doesn’t need to know anything about keywords or separating characters. Its sole job is to describe the structure and the semantics of trees that are valid in that specific language.

Example AST: if-Statement

The following example further motivates this reasoning. An if statement can be described in terms of its structure and the underlying semantics: It uses a predicate to distinguish whether execution should continue on a positive branch or a negative branch.

This is a possible syntaxtree for an if statement in some nondescript language that could look like this:

if (a > b) then
  writeln('foo')
else
  err(2, 'bar')

In BlattWerkzeug, an if statement could be represented by using three child groups that could be called predicate, positive and negative. Each of these child groups may then have their own list of children.

digraph SyntaxTree { graph [fontsize=10 fontname="Verdana" bgcolor="transparent"]; node [fontsize=10 fontname="Verdana" shape=Mrecord]; edge [fontsize=10 fontname="Verdana"]; r [label="{lang.if}"]; subgraph cluster_r_pred { label="pred"; r_pred_0 [label="{{lang.expBin}|{op|gt}}"]; subgraph cluster_r_pred_0_lhs { label="lhs"; r_pred_0_lhs_0 [label="{{lang.expVar}|{name|a}}"]; } r_pred_0 -> r_pred_0_lhs_0; subgraph cluster_r_pred_0_rhs { label="rhs"; r_pred_0_rhs_0 [label="{{lang.expVar}|{name|b}}"]; } r_pred_0 -> r_pred_0_rhs_0; } r -> r_pred_0; subgraph cluster_r_positive { label="positive"; r_positive_0 [label="{{lang.call}|{name|writeln}}"]; subgraph cluster_r_positive_0_arguments { label="arguments"; r_positive_0_arguments_0 [label="{{lang.expConst}|{value|foo}}"]; } r_positive_0 -> r_positive_0_arguments_0; } r -> r_positive_0; subgraph cluster_r_negative { label="negative"; r_negative_0 [label="{{lang.call}|{name|err}}"]; subgraph cluster_r_negative_0_arguments { label="arguments"; r_negative_0_arguments_1 [label="{{lang.expConst}|{value|bar}}"]; r_negative_0_arguments_0 [label="{{lang.expConst}|{value|2}}"]; } r_negative_0 -> r_negative_0_arguments_0; r_negative_0 -> r_negative_0_arguments_1; } r -> r_negative_0; }

Now lets see what happens if the source is invalidated by omitting the predicate and the then:

if
  writeln('foo')
else
  err(2, 'bar')

In a typical language (tm) the most probable error would be something like “Invalid predicate: expression writeln('foo') is not of type boolean” and “Missing keyword then”. But the chosen indentation somehow hints that using the call to writeln as a predicate was not what the author intended to do.

In BlattWerkzeug the predicate may be omitted without touching the positive or negative branch. It is therefore trivial to tell the user that he has forgotten to supply a predicate.

digraph SyntaxTree { graph [fontsize=10 fontname="Verdana" bgcolor="transparent"]; node [fontsize=10 fontname="Verdana" shape=Mrecord]; edge [fontsize=10 fontname="Verdana"]; r [label="{lang.if}"]; subgraph cluster_r_positive { label="positive"; r_positive_0 [label="{{lang.call}|{name|writeln}}"]; subgraph cluster_r_positive_0_arguments { label="arguments"; r_positive_0_arguments_0 [label="{{lang.expConst}|{value|foo}}"]; } r_positive_0 -> r_positive_0_arguments_0; } r -> r_positive_0; subgraph cluster_r_negative { label="negative"; r_negative_0 [label="{{lang.call}|{name|err}}"]; subgraph cluster_r_negative_0_arguments { label="arguments"; r_negative_0_arguments_1 [label="{{lang.expConst}|{value|bar}}"]; r_negative_0_arguments_0 [label="{{lang.expConst}|{value|2}}"]; } r_negative_0 -> r_negative_0_arguments_0; r_negative_0 -> r_negative_0_arguments_1; } r -> r_negative_0; }

Grammar Validation

Todo

Give an example for the if-statement that has been used above.

BlattWerkzeug uses its own grammar language that is very losely inspired by the RelaxNG compact notation. The mental model however is very similar to typical grammars, but is strictly concerned with the structure of the syntaxtree. A BlattWerkzeug grammar consists of a name and multiple node definitions:

grammar "ex1" {
  node "element" {
  }
}

This grammer defines a language named ex1 which allows a single node with the name element to be present in the syntax tree. Allowed properties of nodes are defined as follows:

grammar "ex2" {
  node "element" {
    prop "name" { string }
  }
}

The curly brackets for the property need to denote at least the type of the property, valid values are string and number. Both of these properties may be limited further, see the section Property Restrictions for more details.

Multiple node definitions can be simply stated on after another as part of the grammar section:

grammar "ex3" {
  node "element" {
    prop "name" { string }
  }
  node "attribute" {
    prop "name" { string }
    prop "value" { string }
  }
}

Valid children of a node are defined via the children directive, a name and the corresponding “production rule”. The production rule allows to specify sequences (using a space), alternatives (using a pipe “|”) and “interleaving” (using the ampersand “&”). The mentioned elements can be quantified using the standard * (0 to unlimited), + (1 to unlimited) and ? (0 or 1) multiplicity operators. This example technically defines two sequences “elements” and “attributes” that allow zero or more occurences of the respective entity:

grammar "ex4" {
  node "element" {
    prop "name" { string }
    children "elements" ::= element*
    children "attributes" ::= attribute*
  }
  node "attribute" {
    prop "name" { string }
    prop "value" { string }
  }
}

Property Restrictions

Children Restrictions

Example Grammar: XML

grammar "dxml" {
  node "element" {
    terminal "tag-open-begin" "<"
    prop "name" { string }
    children allowed "attributes" ::= attribute*
    terminal "tag-open-end" ">"
    children allowed "elements" ::= element* & text* & interpolate* & if*
    terminal "tag-close" "<ende/>"
  }
  node "attribute" {
    prop "name" { string }
    terminal "equals" "="
    terminal "quot-begin" "\""
    children allowed "value" ::= text* & interpolate*
    terminal "quot-end" "\""
  }
  node "text" {
    prop "value" { string }
  }
  node "interpolate" {
    children allowed "expr" ::= expr
  }
  node "if" {
    children allowed "condition" ::= expr
    children allowed "body" ::= element* & text* & interpolate* & if*
  }
  typedef "expr" ::= exprVar | exprConst | exprBinary
  node "exprVar" {
    prop "name" { string }
  }
  node "exprConst" {
    prop "name" { string }
  }
  node "exprBinary" {
    children allowed "lhs" ::= expr
    children allowed "operator" ::= binaryOperator
    children allowed "rhs" ::= expr
  }
  node "binaryOperator" {
    prop "operator" { string }
  }
}

Example Grammar: SQL

grammar "sql" {
  typedef "expression" ::= columnName | binaryExpression | constant | parameter | functionCall
  node "columnName" {
    prop "refTableName" { string }
    terminal "dot" "."
    prop "columnName" { string }
  }
  node "constant" {
    prop "value" { string }
  }
  node "parameter" {
    terminal "colon" ":"
    prop "name" { string }
  }
  node "functionCall" {
    prop "name" { string }
    terminal "paren-open" "("
    children sequence "arguments", between: "," ::= expression*
    terminal "paren-close" ")"
  }
  node "starOperator" {
    terminal "star" "*"
  }
  node "relationalOperator" {
    prop "operator" { string { enum "<" "<=" "=" ">=" ">" "LIKE" "NOT LIKE" } }
  }
  node "binaryExpression" {
    children sequence "lhs" ::= expression
    children sequence "operator" ::= relationalOperator
    children sequence "rhs" ::= expression
  }
  node "select" {
    terminal "keyword" "SELECT"
    prop? "distinct" { boolean }
    children allowed "columns", between: "," ::= expression* & starOperator?
  }
  node "delete" {
    terminal "keyword" "DELETE"
  }
  node "tableIntroduction" {
    prop "name" { string }
    prop? "alias" { string }
  }
  node "crossJoin" {
    children sequence "table" ::= tableIntroduction
  }
  node "innerJoinOn" {
    terminal "keyword" "INNER JOIN"
    children sequence "table" ::= tableIntroduction
    terminal "keywordOn" "ON"
    children sequence "on" ::= expression
  }
  node "innerJoinUsing" {
    terminal "keyword" "INNER JOIN"
    children sequence "table" ::= tableIntroduction
    terminal "keywordUsing" "USING"
    children sequence "using" ::= expression
  }
  typedef "join" ::= crossJoin | innerJoinUsing | innerJoinOn
  node "from" {
    terminal "keyword" "FROM"
    children sequence "tables", between: "," ::= tableIntroduction+
    children sequence "joins" ::= join*
  }
  node "whereAdditional" {
    prop "operator" { string { enum "and" "or" } }
    children sequence "expression" ::= expression
  }
  node "where" {
    terminal "keyword" "WHERE"
    children sequence "expressions" ::= sql.expression sql.whereAdditional*
  }
  node "groupBy" {
    terminal "keyword" "GROUP BY"
    children allowed "expressions", between: "," ::= sql.expression*
  }
  node "orderBy" {
    terminal "keyword" "ORDER BY"
    children allowed "expressions", between: "," ::= sql.expression*
  }
  node "querySelect" {
    children sequence "select" ::= select
    children sequence "from" ::= from
    children sequence "where" ::= where?
    children sequence "groupBy" ::= groupBy?
    children sequence "orderBy" ::= orderBy?
  }
  node "queryDelete" {
    children sequence "delete" ::= delete
    children sequence "from" ::= from
    children sequence "where" ::= where?
  }
  typedef "query" ::= querySelect | queryDelete
}