Saturday, January 5, 2008

The Gem proto-language syntax



Gem models use a very simple textual concrete syntax, described below. I call this a proto-language because no semantics are tied directly to the syntax. Instead, the syntax is meant to be interpreted by processing layers applied to the model. XML has a similar proto-language characteristic: you can write XML and use a tool to check that it is well-formed, but the meaning of the XML depends entirely on how you choose to interpret it.

Gem Syntax Grammar
  • Root := Expr*
  • Expr := Prefix? (Token | Group | Quote) Suffix?
    • Prefix := A valid prefix Symbol with no space between it and the following Expr.
    • Suffix := A valid suffix Symbol with no space between it and the preceding Expr.
  • Token := Word | Numeric | Operator
    • Word := A valid Java identifier.
    • Numeric := [0 .. 9]+
    • Operator := Symbol
  • Group := AngleGroup | CurlyGroup | ParenGroup | SquareGroup
    • AngleGroup := < Expr* >
    • CurlyGroup := { Expr* }
    • ParenGroup := ( Expr* )
    • SquareGroup := [ Expr* ]
  • Quote := DoubleQuote | SingleQuote
    • DoubleQuote := " Fragment* "
    • SingleQuote := ' Fragment* '
  • Fragment := Text | Escape
  • Text := A sequence of characters not containing an escape Symbol or quote terminator (" or ').
  • Escape := CharEscape | ExprEscape
    • CharEscape := A valid character escape Symbol followed by a character escape sequence.
    • ExprEscape := A valid expression escape Symbol followed by a single Expr.
  • Symbol := A member of the set of characters on your keyboard which are not letters or digits.
Notes:
  • It's hard to get the look of the grammar correct in Blogger ... sorry about that.
  • The characters *, ?, | and + have their usual grammatical meanings:
    • * : a list of zero or more of the preceding rule
    • ? : an optional single instance of the preceding rule
    • | : used between alternatives
    • + : a list of 1 or more of the preceding rule
  • Characters in large bold Courier like [ and ] are literals.
  • The valid characters for Prefix, Suffix and Escape are not fixed here. They can be assigned when initializing the parser. I have been using ` (back-tick) and ~ (tilde) as prefixes and suffixes. Currently ` is the CharEscape initiator (but it really should be \ to fit the C/Java style) and ~ is the ExprEscape initiator. I'm considering adding ^ as a Prefix/Suffix. All of these could change.
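
To make the grammar a little more concrete, here is a minimal sketch of a recursive-descent parser for the Root, Expr, Token and Group rules. It is not the actual Gem parser. The names parse, _parse_exprs and the tuple-based node format are made up for illustration, Word is approximated as an ASCII-only Java identifier, and Prefix, Suffix and Quote handling are left out entirely (so quote characters simply come out as Operators here).

```python
import re

# Closing delimiter for each opening delimiter (the four Group kinds).
GROUPS = {"<": ">", "{": "}", "(": ")", "[": "]"}
CLOSERS = set(GROUPS.values())

# Assumption: Word is an ASCII-only approximation of a Java identifier.
# Numeric is [0-9]+ and any other non-space character is a Symbol.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<word>[A-Za-z_$][A-Za-z0-9_$]*)|(?P<num>[0-9]+)|(?P<sym>\S))"
)


def parse(text):
    """Parse Root := Expr* and return a list of nodes (hypothetical format)."""
    nodes, _ = _parse_exprs(text, 0, None)
    return nodes


def _parse_exprs(text, pos, closer):
    """Parse Expr* until `closer` is seen (or end of input when closer is None)."""
    nodes = []
    while True:
        m = TOKEN_RE.match(text, pos)
        if m is None:  # nothing but whitespace is left
            if closer is not None:
                raise SyntaxError("missing " + closer)
            return nodes, len(text)
        pos = m.end()
        if m.group("word"):
            nodes.append(("Word", m.group("word")))
        elif m.group("num"):
            nodes.append(("Numeric", m.group("num")))
        else:
            sym = m.group("sym")
            if sym == closer:
                return nodes, pos  # end of the current Group
            if sym in GROUPS:
                children, pos = _parse_exprs(text, pos, GROUPS[sym])
                nodes.append(("Group", sym + GROUPS[sym], children))
            elif sym in CLOSERS:
                raise SyntaxError("unexpected " + sym)
            else:
                nodes.append(("Operator", sym))
```

With this sketch, parse("Hello World!") yields two Word nodes followed by an Operator node, and nested Groups come back as nested lists. Note that < and > always open and close a Group here, so they cannot be used as plain Operators.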

Examples
  • Hello World!
    • two Words followed by an Operator
  • abc [123 {xyz} ]
    • use of Groups and Numerics (and Words)
For the following assume that ~ is a valid Prefix and ^ is a valid ExprEscape.
  • ~foo + ~ bar
    • the first ~ is a Prefix, the second is an Operator (because of the spacing)
  • "Hello ^{foo bar} World"
    • DoubleQuote containing an ExprEscape whose Expr is a CurlyGroup
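
The spacing rule in the ~foo + ~ bar example can be sketched in a few lines. This is only an illustration under some assumptions: the function name classify_symbols and its affixes parameter are hypothetical, the neighbouring characters are assumed to be either whitespace or part of an Expr, and a symbol that touches an Expr on both sides is arbitrarily treated as a Prefix because the grammar above does not say which should win.

```python
def classify_symbols(text, affixes="~"):
    """Label each affix character as Prefix, Suffix or Operator by its spacing.

    Returns a list of (index, label) pairs. Hypothetical helper, not Gem's API.
    """
    labels = []
    for i, ch in enumerate(text):
        if ch not in affixes:
            continue
        # Treat the start and end of the input like whitespace.
        before = text[i - 1] if i > 0 else " "
        after = text[i + 1] if i + 1 < len(text) else " "
        if not after.isspace():
            labels.append((i, "Prefix"))    # no space before the following Expr
        elif not before.isspace():
            labels.append((i, "Suffix"))    # no space after the preceding Expr
        else:
            labels.append((i, "Operator"))  # stands alone between spaces
    return labels
```

For the input ~foo + ~ bar this labels the first ~ (index 0) as a Prefix and the second ~ (index 7) as an Operator, matching the description above.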

Possible Enhancements
  • Add Infix and Outfix which would work like Prefix and Suffix but inside the Group delimiters like this:
    • AngleGroup := < Infix? Expr* Outfix? >
  • Some way of indicating that certain groups don't apply within a scope (e.g. so <> can be used as Operators).
  • The Numeric rule is currently just a string of digits. In the future it could be enhanced to allow more of the Java numeric literal syntax.
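
As a rough idea of what an enhanced Numeric rule might accept, the sketch below uses a regular expression covering a small subset of Java numeric literals: hex integers, decimal integers with an optional L suffix, and simple decimal floats. The names NUMERIC_RE and is_numeric are hypothetical, and forms such as octal literals, 1. or .5 are deliberately left out. Since . is currently a Symbol, a literal like 3.14 would also need special handling in the tokenizer, which this sketch does not address.

```python
import re

# Hypothetical extension of Numeric: a subset of Java numeric literals.
NUMERIC_RE = re.compile(
    r"""
      0[xX][0-9a-fA-F]+[lL]?                      # hex integer
    | [0-9]+\.[0-9]+([eE][+-]?[0-9]+)?[fFdD]?     # simple decimal float
    | [0-9]+[lL]?                                 # current rule plus L suffix
    """,
    re.VERBOSE,
)


def is_numeric(text):
    """Return True if the whole string matches the extended Numeric sketch."""
    return NUMERIC_RE.fullmatch(text) is not None
```

Everything the current rule accepts (a plain string of digits) is still accepted, so this would be a backwards-compatible extension.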
