SSYN Model and Syntax Specification

Jason Diamond

$Id: ssyn_model_and_syntax.xml 29 2004-08-25 22:12:33Z jason $


Table of Contents

1. Introduction
2. Model
2.1. Elements
2.2. Names
2.3. Values
2.4. Child Elements
2.5. Parent Elements
2.6. Common Structures
3. Syntax
3.1. Encoding
3.2. Names
3.3. Simple Values
3.4. Block Values
3.5. Child Elements
3.6. Directives
3.7. Comments
3.8. Escaping Characters
4. Conformance
5. References

1. Introduction

 

A lot of people nowadays forget that the M in XML stands for markup. Markup is a noun derived from a verb: it implies that you start with a piece of continuous text, and you then annotate it, without changing the original content.

 
--Michael Kay  

This Structured Syntax (SSYN) specification defines a set of both abstract and syntactic rules designed for the interchange of structured information.

SSYN has been designed as an alternative to XML for the increasing number of cases where XML is currently used to store information in a rigidly structured manner. SSYN is woefully inadequate for use with document-like information where the core of the document content is “a piece of continuous text” and can, thus, never be a replacement for XML.

SSYN has been designed to be as simple as possible for developers who implement the tools that process SSYN documents, developers who use those tools, and authors of SSYN documents while still being flexible enough to accommodate a variety of use cases.

SSYN is meant to be easy to read. Structure is clearly delineated with indentation.

Example 1. 

purchase order: 1999-10-20
ship to:
    name: Alice Smith
    street: 123 Maple Street
    city: Mill Valley
    state: CA
    zip: 90952
    country: US
bill to:
    name: Robert Smith
    street: 8 Oak Avenue
    city: Old Town
    state: PA
    zip: 95819
    country: US
comment::
    Hurry, my lawn is going wild!
items:
    : 872-AA
        product name: Lawnmower
        quantity: 1
        price: 148.95
        comment::
            Confirm this is electronic.
    : 926-AA
        product name: Baby Monitor
        quantity: 1
        price: 39.98
        ship date: 1999-05-21

SSYN “documents” can be reasoned about in at least two forms: one is the abstract model and the other is the concrete syntax. There can be many syntactic representations of a given model. (TODO: Should we define a canonical syntax in this specification or elsewhere?)

This specification defines both the abstract SSYN model and the concrete SSYN syntax but does not prescribe any specific programmatic interface for dealing with either. Any API that interprets the syntactic constructs described herein in a manner that exposes the necessary model information can be considered conformant (assuming it successfully processes all of the documents in the SSYN test suite so that it can generate the expected results).

Throughout this specification, examples of SSYN documents and elements will be represented with syntax. Do not confuse these lexical serializations with the abstract model they represent.

2. Model

The SSYN model consists of an ordered sequence of elements.

The SSYN model is not isomorphic with any known data model used by the dominant programming languages at this time. The SSYN model is simple and flexible enough that it can be emulated with any decent programming language’s native data types whether they be arrays, lists, vectors, associative arrays, dictionaries, maps, hash tables, or custom classes and objects.

It is not, however, the purpose of this specification to define how the SSYN model can be used in these disparate environments as the best techniques for each would vary from one to the other.

2.1. Elements

Elements are made up of up to four components:

  • An optional name

  • An optional value

  • An optional ordered list of child elements

  • An optional parent element
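
As an illustration only, the four components could be held in a small Python data class. This specification does not prescribe a programmatic interface, so the class name Element, its field names, and the add_child helper are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical in-memory form of one SSYN element."""

    name: Optional[str] = None    # optional name
    value: Optional[str] = None   # optional value
    # Optional ordered list of child elements.
    children: List["Element"] = field(default_factory=list)
    # Optional parent; excluded from repr and comparison to avoid cycles.
    parent: Optional["Element"] = field(default=None, repr=False, compare=False)

    def add_child(self, child: "Element") -> "Element":
        """Append child to this element and record this element as its parent."""
        child.parent = self
        self.children.append(child)
        return child


# A fragment of Example 1: "ship to:" with two named child elements.
ship_to = Element(name="ship to")
ship_to.add_child(Element(name="name", value="Alice Smith"))
ship_to.add_child(Element(name="state", value="CA"))
```

Keeping a reference in both directions lets child elements be reached from the parent and the parent from each child.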

2.2. Names

Names are sequences of valid (except NUL) Unicode characters.

2.3. Values

Values are sequences of valid (except NUL) Unicode characters.

2.4. Child Elements

Child elements are elements with parent elements. Child elements with the same parent element are referred to as sibling elements. An element’s child elements, their child elements, and so on are referred to as that element’s descendant elements. An element’s parent element, that parent’s parent element, and so on are referred to as its ancestor elements.

The parent/child relationship in the SSYN model allows for arbitrary tree-like structures of elements to be constructed by document authors. Child elements in the model should be accessible via parent elements.

2.5. Parent Elements

Parent elements act as “containers” for child elements. Parent elements in the model should be accessible via child elements. Elements without parents are considered “top-level” elements.

2.6. Common Structures

As previously stated, the SSYN model is unique and can most likely not be supported natively by most programming languages as they exist today. This does not mean that SSYN is inappropriate for modelling the data structures of these languages. In fact, the SSYN model can be considered a superset of most other models.

For example, to represent the data structures sometimes referred to as arrays, lists, or vectors, SSYN can use an element with multiple unnamed child elements.

Example 2. 

a list:
  : item1
  : item2
  : item3
  : item4

To represent the data structures sometimes referred to as associative arrays, dictionaries, maps, or hash tables, SSYN can use an element with multiple named child elements.

Example 3. 

a dict:
  key1: value1
  key2: value2

The simplicity of the SSYN model allows one to combine these structures to create new ones.

Example 4. 

a list of lists:
  :
    : item1
    : item2
  :
    : item3
    : item4

Example 5. 

a dict of dicts:
  key1:
    key2: value1
    key3: value2
  key4:
    key5: value3
    key6: value4

Example 6. 

a list of dicts:
  :
    key1: value1
    key2: value2
  :
    key3: value3
    key4: value4

Example 7. 

a dict of lists:
  key1:
    : item1
    : item2
  key2:
    : item3
    : item4
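
The correspondence shown in the examples above can be sketched in Python. This is one possible mapping, not one defined by this specification: the function name to_native, the (name, value, children) tuple layout, and the policy for mixed children are all assumptions. Note that this sketch discards the value of any element that has child elements.

```python
def to_native(element):
    """Convert a (name, value, children) tuple to Python lists and dicts.

    Assumed policy, not part of SSYN: no children -> the value itself,
    all children unnamed -> list, all children uniquely named -> dict,
    anything else -> list of (name, converted) pairs.
    """
    name, value, children = element
    if not children:
        return value
    names = [child[0] for child in children]
    converted = [to_native(child) for child in children]
    if all(n is None for n in names):
        return converted
    if all(n is not None for n in names) and len(set(names)) == len(names):
        return dict(zip(names, converted))
    return list(zip(names, converted))


# Example 7, "a dict of lists", written as nested tuples.
example7 = ("a dict of lists", None, [
    ("key1", None, [(None, "item1", []), (None, "item2", [])]),
    ("key2", None, [(None, "item3", []), (None, "item4", [])]),
])
```

Under these assumptions, to_native(example7) yields a dictionary whose two keys each map to a list of two items.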

3. Syntax

In order to store or transfer a SSYN model onto or via some medium, the model must be encoded as a sequence of characters as defined by this section. Alternate representations may be defined at some future point but they MUST NOT be referred to as SSYN.

The sequence of characters that make up a serialized SSYN model must, of course, be encoded into some sequence of bytes. The section on encoding contains more information on the encodings conformant SSYN parsers are required to support and the means by which encodings can be detected.

The following classes of characters are defined for use throughout this specification.

Character Classes

space characters

TAB (U-0009)

SPACE (U-0020)

end-of-line characters

LF (U-000A)

VT (U-000B)

FF (U-000C)

CR (U-000D) followed by but not including any character other than LF (U-000A)

CR (U-000D) followed by and including LF (U-000A)

NEL (U-0085)

LS (U-2028)

PS (U-2029)

Note that a carriage return character followed by a line feed character is considered one end-of-line character for the purposes of this specification.

When this specification refers to the space character or the end-of-line character, that is to be interpreted as the space or end-of-line character that literally appears in the document which can be any of the above characters according to their class.

This specification makes no requirement on conformant SSYN parsers to expose end-of-line characters to clients as any specific character. Implementations can choose to expose the exact character(s) as found in the input stream or replace those characters with those chosen by the client.
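
A line-oriented implementation might split decoded text on the end-of-line characters listed above. The following Python sketch is one way to do so; the names END_OF_LINE and split_lines are hypothetical. It discards the terminators rather than exposing them to the client, which is one of the choices left open to implementations.

```python
import re

# One alternative per end-of-line character in the Character Classes list.
# CR LF comes first so that the pair is consumed as a single terminator.
END_OF_LINE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_lines(text):
    """Split decoded text into lines, dropping the terminators."""
    lines = END_OF_LINE.split(text)
    # A trailing terminator leaves one empty string at the end; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

Python's built-in str.splitlines is deliberately not used here because it also splits on characters (such as FS, GS, and RS) that this specification does not class as end-of-line characters.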

Conformant SSYN parsers can be implemented in a variety of ways including, but not limited to, character-oriented parsers using a traditional state machine or line-oriented parsers using regular expressions.

As an example, the following regular expression can extract enough interesting information from a single line of input to be useful in implementing a line-oriented parser.

Example 8. 

(\s*)((?:\|\||\|:|[^:])*)(::?)?(\s*)(.*)

The groups matched by the above expression include the spaces leading up to the name (or the empty string), the name (or the empty string), the separator (‘:’, ‘::’ or nothing), the spaces leading up to the value (or the empty string), and (possibly part of) the value (or the empty string).

Using an expression like the above would still require unescaping the characters that needed to be unescaped as well as examining the end of the line for the line continuation character.

The space characters leading up to names and values matched by the above regular expression would need to be used to calculate the necessary indentation of child elements and of values that span multiple lines. The result of that calculation would then be used to check subsequent lines to see whether they contain a child element or more of the current element’s value.
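
As a rough illustration, the expression from Example 8 can be applied with Python's re module. The function name split_line and the shape of its return value are assumptions of this sketch. The line is assumed to have had its end-of-line character removed already, and no unescaping or line continuation handling is attempted.

```python
import re

# The expression from Example 8, unchanged.
LINE = re.compile(r"(\s*)((?:\|\||\|:|[^:])*)(::?)?(\s*)(.*)")


def split_line(line):
    """Return (indent width, raw name, separator, raw value) for one line."""
    # Every part of the expression is optional, so match() always succeeds.
    indent, name, separator, _gap, value = LINE.match(line).groups()
    # The separator group is None when the line contains no colon.
    return len(indent), name, separator or "", value
```

For the line "    name: Alice Smith" this returns an indent of 4, the name "name", the separator ":", and the value "Alice Smith". Note that \s in Python matches more characters than the two space characters defined by this specification, so a stricter parser would substitute an explicit character class.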

3.1. Encoding

All conformant SSYN parsers MUST be able to process documents using the UTF-8, UTF-16, or UTF-32 encodings of Unicode 4.0.

SSYN documents encoded in UTF-16 or UTF-32 MUST begin with a Byte Order Mark. Documents encoded in UTF-8 MAY begin with a Byte Order Mark. The Byte Order Mark is used only to detect the encoding of the document and is not considered part of the document content.

The following table details the bytes that make up the Byte Order Marks for the different encodings.

Table 1. Byte Order Marks

BOM          Encoding
00 00 FE FF  UTF-32, big endian
FF FE 00 00  UTF-32, little endian
FE FF        UTF-16, big endian
FF FE        UTF-16, little endian
EF BB BF     UTF-8

The absence of a Byte Order Mark indicates the document is encoded in UTF-8.
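
The detection rules above can be sketched in Python as follows. The names BOMS and decode_document are hypothetical; the BOM constants come from the standard codecs module.

```python
import codecs

# UTF-32 marks are tested first: the UTF-32 little-endian mark begins
# with the UTF-16 little-endian mark (FF FE), so order matters.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def decode_document(data: bytes) -> str:
    """Decode SSYN bytes, using the Byte Order Mark if one is present."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            # The mark is not part of the document content, so skip it.
            return data[len(bom):].decode(encoding)
    # No mark: the document is encoded in UTF-8.
    return data.decode("utf-8")
```

The endian-specific codec names are used so that the decoder does not look for a second mark after the first has been stripped.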

TODO: Introduce the special !SSYN directive that can be used to declare what encoding a document uses if the encoding is not a UTF encoding.

3.2. Names

Names must be serialized on a single line and are terminated with a literal colon, an end-of-line character, or the end of input. Terminating a name with a literal colon is used to separate an element’s name from its value.

All three of the elements depicted in the following example have a name of “name”.

Example 9. 

name
name:
name: value

Names in the model are defined as being a sequence of Unicode characters but, because of their role in terminating a name, colons and end-of-line characters in names must be escaped in order to be serialized.

The element depicted in the following example has a name of “na:me”.

Example 10. 

na|:me: value

The element depicted in the following example has a name of “na$me” where “$” is a placeholder for the line feed character.

Example 11. 

na|LF!me: value

Names that begin with the “!” character indicate that the element is to be considered a directive. Names that begin with the “#” character indicate that the element is to be considered a comment. Given that, “!” and “#” characters must be escaped if a document author wishes a “normal” element name to begin with a “!” or “#”.

The elements depicted in the following example are not directives or comments.

Example 12. 

|!name: value
|#name: value

See the Escaping Characters section for more information on the syntax of escaping characters.

As defined in the model, names are an optional component of elements.

The element depicted in the following example has no name but does have a value.

Example 13. 

: value

Note that space characters appearing immediately before any non-space characters on a line are only used for determining the depth of the element and are not considered part of the element’s name. To include leading spaces in a name, the spaces must be escaped.

3.3. Simple Values

There’s only one type of value in the model but there are two different ways of encoding values in the syntax.

The simple value syntax is indicated with one literal colon and terminated with an end-of-line character or end of input in the syntax but does not include any terminating end-of-line character in the model.

Simple values will typically be serialized on the same line as the name of an element and will be terminated by the end-of-line character on that same line.

The element depicted in the following example has a value of “value”.

Example 14. 

name: value

Simple values can extend beyond the line the separator appears on if a line continuation character is used.

The element depicted in the following example has a value of “value”.

Example 15. 

name: val|
      ue

Since simple values are normally terminated with end-of-line characters, such characters must be escaped in order to be included in the actual value. Unlike names, however, colons do not need to be escaped.

The element depicted in the following example has a value of “val:ue$” where “$” is a placeholder for the line feed character.

Example 16. 

name: val:ue|LF!

Note that space characters appearing immediately after a literal colon character but before any non-space characters are not considered part of the element’s value. To include leading spaces in a value, the spaces must be escaped.

3.4. Block Values

Block values enable serializing actual values containing end-of-line characters without having to escape those end-of-line characters.

Block values are indicated by using two literal colons instead of the one literal colon used to indicate simple values.

Block values can span multiple lines without the use of the line continuation character. Each line in a block value must be indented by a number of space characters equal to the number of characters on the line leading up to the first non-space character in the block value.

Block values are terminated by the end of input or by a line with fewer leading space characters than were used to indent the first line containing non-space characters in the block value.

End-of-line characters in the block value are included in the actual value unless a line continuation character is used.

The element depicted in the following example has a value of “value$” where “$” is a placeholder for the end-of-line character.

Example 17. 

name:: value

The element depicted in the following example has a value of “line1$line2$” where “$” is a placeholder for the end-of-line character.

Example 18. 

name:: line1
       line2

Care must be taken by document authors to ensure that subsequent lines in a block value are properly indented.

The element depicted in the following example has a value of “value$” where “$” is a placeholder for the end-of-line character. The line containing the string “name2” must be interpreted by conforming parsers as a child element with a name of “name2” and no value.

Example 19. 

name1:: value
  name2

Block values can begin on a line after the line the block value indicator appears on. All space and end-of-line characters leading up to the first non-space character are not included in the actual value.

The element depicted in the following example also has a value of “line1$line2$” where “$” is a placeholder for the end-of-line character.

Example 20. 

name::
  line1
  line2

Note that, like simple values, leading space characters in a block value are not considered part of the element’s value so care must be taken to ensure that any leading space characters are properly escaped.

The element depicted in the following example has a value of “__line1$line2$” where “_” is a placeholder for the space character and “$” is a placeholder for the end-of-line character.

Example 21. 

name::
  |  line1
  line2

3.5. Child Elements

Child elements are indicated by indenting an element a number of space characters greater than the number of space characters used to indent the parent element.

Elements indented with the same number of space characters are considered siblings to each other and children of the nearest parent element previously appearing in the document indented with a lesser number of space characters.

The element depicted in the following example has two child elements with names of “name2” and “name3”.

Example 22. 

name1
  name2
  name3

Elements with child elements can, as defined in the model, also have a value.

The element depicted in the following example has a value of “value” and two child elements with names of “name2” and “name3”.

Example 23. 

name1: value
  name2
  name3
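
The indentation rules above lend themselves to a stack of open ancestors. The Python sketch below is a deliberate simplification: it handles only unescaped names and simple values, ignores block values, directives, and comments, and skips blank lines. The name build_tree and the dictionary layout are assumptions.

```python
def build_tree(lines):
    """Nest elements by indentation and return the top-level elements.

    Each element is a dict with "name", "value" and "children" keys.
    """
    top_level = []
    stack = []  # (indent, element) pairs for the currently open ancestors
    for line in lines:
        body = line.lstrip(" \t")
        if not body:
            continue  # skipping blank lines is an assumption of this sketch
        indent = len(line) - len(body)
        # Splitting on the first colon ignores escaped colons in names.
        name, separator, value = body.partition(":")
        element = {
            "name": name or None,
            "value": value.lstrip(" \t") if separator else None,
            "children": [],
        }
        # Close every open element indented at least as far as this one;
        # what remains on top is the nearest less-indented element.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        siblings = stack[-1][1]["children"] if stack else top_level
        siblings.append(element)
        stack.append((indent, element))
    return top_level


# Example 23 from this section.
tree = build_tree(["name1: value", "  name2", "  name3"])
```

Here tree holds a single top-level element named "name1" whose value is "value" and whose two children are named "name2" and "name3".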

Care must be taken by document authors when an element has both a block value and child elements.

The element depicted in the following example has a value of “line1$line2$” where “$” is a placeholder for the end-of-line character and one child element with a name of “name2”.

Example 24. 

name1:: line1
        line2
  name2

Literal space characters on a line appearing before any non-space characters are only used for determining the indentation level of an element or value. To include leading spaces in a name or value, the spaces must be escaped.

The elements depicted in the following example have names of “_name” and values of “_value” where “_” is a placeholder for the space character.

Example 25. 

| name: | value
  | name: | value

3.6. Directives

Describe the directive syntax here.

Mention the special !SSYN directive.

3.7. Comments

Describe the comment syntax here.

3.8. Escaping Characters

There are three different ways to escape characters. All three use the pipe character as the escape character. The pipe character was chosen because it occurs much less frequently in typical text than the world’s most popular escape character: the backslash. This makes it possible to have names and values in the model that refer to file paths (on systems that use backslashes as the path separator) or regular expressions.

Simple escaping is done by preceding what a SSYN parser would normally consider a special character with a pipe character. There are four special characters in SSYN: pipe (|), colon (:), bang (!), and hash (#).

Pipe characters must always be escaped. Colon characters only need be escaped when appearing in a name but are allowed to be escaped in values. Bang and hash characters only need to be escaped when appearing as the first character in a name but are allowed to be escaped elsewhere.

Another form of simple escaping is escaping space characters which is done by prefixing them with a pipe character. This is normally only required when a name or value needs to begin with leading space characters.

Named escaping is done by preceding a name with a pipe and terminating that name with a bang. The names that can be used in this manner and their substitution values appear in the following table.

Table 2. Named Character References

Name Code Point Description
SOH 0001 Start of heading
STX 0002 Start of text
ETX 0003 End of text
EOT 0004 End of transmission
ENQ 0005 Enquiry
ACK 0006 Acknowledge
BEL 0007 Bell
BS 0008 Backspace
TAB 0009 Horizontal tab
LF 000A Line feed
VT 000B Vertical tab
FF 000C Form feed
CR 000D Carriage return
SO 000E Shift out
SI 000F Shift in
DLE 0010 Data link escape
DC1 0011 Device control 1
DC2 0012 Device control 2
DC3 0013 Device control 3
DC4 0014 Device control 4
NAK 0015 Negative acknowledge
SYN 0016 Synchronous idle
ETB 0017 End transmission block
CAN 0018 Cancel
EM 0019 End of medium
SUB 001A Substitute
ESC 001B Escape
FS 001C File separator
GS 001D Group separator
RS 001E Record separator
US 001F Unit separator
DEL 007F Delete
NEL 0085 Next line
LS 2028 Line separator
PS 2029 Paragraph separator

TODO: Should we include HTML4 entities? Should we include other “control”-like characters from Unicode?

Numeric escaping is done by preceding a sequence of hexadecimal digits with a pipe and terminating those digits with a hash. The replacement character for numeric escapes is the Unicode character with the code point indicated by the hexadecimal digits in the escape. Any valid Unicode character (except NUL) can be referenced in this manner.
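
The three forms of escaping can be sketched with a single regular expression in Python. The names NAMED, ESCAPE, and unescape are hypothetical, and NAMED lists only a few rows of Table 2. This specification does not state how a parser should tell the forms apart; the sketch assumes that the terminating hash or bang decides, and it leaves any pipe that does not start a recognized escape (such as a line continuation character) untouched. It does not reject NUL or out-of-range code points.

```python
import re

# Subset of Table 2; a full implementation would list every row.
NAMED = {"TAB": "\t", "LF": "\n", "CR": "\r", "NEL": "\x85"}

ESCAPE = re.compile(
    r"\|(?:"
    r"(?P<simple>[|:!# \t])"        # simple escape: special or space character
    r"|(?P<hex>[0-9A-Fa-f]+)#"      # numeric escape: hex digits, then hash
    r"|(?P<name>[A-Z][A-Z0-9]*)!"   # named escape: name, then bang
    r")"
)


def unescape(text):
    """Replace simple, numeric and named escapes in a name or value."""
    def replace(match):
        if match.group("simple") is not None:
            return match.group("simple")
        if match.group("hex") is not None:
            return chr(int(match.group("hex"), 16))
        return NAMED[match.group("name")]  # KeyError for names not listed
    return ESCAPE.sub(replace, text)
```

Applied to the name in Example 10, unescape("na|:me") gives "na:me"; applied to the name in Example 11, unescape("na|LF!me") gives the name with a line feed character in place of the escape.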

4. Conformance

Conformance with this specification is determined by validating implementations against the SSYN test suite.

Each test in the test suite consists of a SSYN document and an “expected” document. It’s expected that test runners will parse each SSYN document and create a result document with the same format as the “expected” documents. The result documents could then be compared, byte-for-byte, with the “expected” documents.

In order for this comparison to work reliably the format for the expected/result files is very strict.

Each line in an expected/result file represents one element in the SSYN document. Lines start with the element’s depth encoded as a decimal number. The depth of an element is calculated as one plus the number of ancestors the element has. The depth is followed by a single space.

The name of the element is encoded next. It must be delimited with single quote characters so must begin with a literal single quote. Any characters in the name with a code point of less than 32 or greater than 126 must be escaped with their numeric escape codes. Pipe characters must be escaped with “||”. Single quote characters must be escaped with “|27#”. No other characters may be escaped. The name is followed by a literal single quote and a single space.

The value of the element is then encoded using the same rules as encoding names. The line is terminated with a single line feed character.

The following example represents a test.

Example 26. 

name1: value1
  name2:: value2

The following example represents the expected/result file for the previous example.

Example 27. 

1 'name1' 'value1'
2 'name2' 'value2|A#'
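
The line format described above might be produced by a test runner along the following lines. The function names encode_field and expected_line are hypothetical. Writing the hexadecimal digits in upper case without leading zeros is inferred from the “|A#” and “|27#” escapes shown in this section, and passing an absent name or value as an empty string is an assumption, since the format does not say how absent components are written.

```python
def encode_field(text):
    """Quote a name or value using the expected/result file rules."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch == "|":
            out.append("||")
        elif ch == "'" or code < 32 or code > 126:
            # Numeric escape: pipe, hexadecimal code point, hash.
            out.append("|%X#" % code)
        else:
            out.append(ch)
    return "'" + "".join(out) + "'"


def expected_line(ancestor_count, name, value):
    """Format one line; the depth is one plus the number of ancestors."""
    depth = ancestor_count + 1
    return "%d %s %s\n" % (depth, encode_field(name), encode_field(value))
```

With these definitions, expected_line(0, "name1", "value1") and expected_line(1, "name2", "value2\n") reproduce the two lines of Example 27, each terminated with a line feed character.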

TODO: Maybe it would be a good idea to link each example in the spec to an example in this section that shows the expected/result file for that example. Each example in the spec should also be part of the test suite.

5. References

TODO: Put references here.