A lot of people nowadays forget that the M in XML stands for markup. Markup is a noun derived from a verb: it implies that you start with a piece of continuous text, and you then annotate it, without changing the original content.
--Michael Kay
This Structured Syntax (SSYN) specification defines a set of both abstract and syntactic rules designed for the interchange of structured information.
SSYN has been designed as an alternative to XML for the increasing number of cases where XML is currently used to store information in a rigidly structured manner. SSYN is woefully inadequate for use with document-like information where the core of the document content is “a piece of continuous text” and can, thus, never be a replacement for XML.
SSYN has been designed to be as simple as possible for developers who implement the tools that process SSYN documents, developers who use those tools, and authors of SSYN documents while still being flexible enough to accommodate a variety of use cases.
SSYN is meant to be easy to read. Structure is clearly delineated with indentation.
Example 1.
purchase order: 1999-10-20
  ship to:
    name: Alice Smith
    street: 123 Maple Street
    city: Mill Valley
    state: CA
    zip: 90952
    country: US
  bill to:
    name: Robert Smith
    street: 8 Oak Avenue
    city: Old Town
    state: PA
    zip: 95819
    country: US
  comment::
    Hurry, my lawn is going wild!
  items:
    : 872-AA
      product name: Lawnmower
      quantity: 1
      price: 148.95
      comment::
        Confirm this is electronic.
    : 926-AA
      product name: Baby Monitor
      quantity: 1
      price: 39.98
      ship date: 1999-05-21
SSYN “documents” can be reasoned about in at least two forms: one is the abstract model and the other is the concrete syntax. There can be many syntactic representations of a given model. (TODO: Should we define a canonical syntax in this specification or elsewhere?)
This specification defines both the abstract SSYN model and the concrete SSYN syntax but does not prescribe any specific programmatic interface for dealing with either. Any API that interprets the syntactic constructs described herein in a manner that exposes the necessary model information can be considered conformant (assuming it successfully processes all of the documents in the SSYN test suite so that it can generate the expected results).
Throughout this specification, examples of SSYN documents and elements will be represented with syntax. Do not confuse these lexical serializations with the abstract model they represent.
The SSYN model consists of an ordered sequence of elements.
The SSYN model is not isomorphic with any known data model used by the dominant programming languages at this time. The SSYN model is simple and flexible enough that it can be emulated with any decent programming language’s native data types whether they be arrays, lists, vectors, associative arrays, dictionaries, maps, hash tables, or custom classes and objects.
It is not, however, the purpose of this specification to define how the SSYN model can be used in these disparate environments as the best techniques for each would vary from one to the other.
Elements are made up of up to four components:
An optional name
An optional value
An optional ordered list of child elements
An optional parent element
Child elements are elements with parent elements. Child elements with the same parent element will be referred to as sibling elements. An element's child elements, their child elements, and so on are referred to as that element's descendant elements. An element's parent element, that parent's parent, and so on are referred to as its ancestor elements.
The parent/child relationship in the SSYN model allows for arbitrary tree-like structures of elements to be constructed by document authors. Child elements in the model should be accessible via parent elements.
Parent elements act as “containers” for child elements. Parent elements in the model should be accessible via child elements. Elements without parents are considered “top-level” elements.
As previously stated, the SSYN model is unique and can most likely not be supported natively by most programming languages as they exist today. This does not mean that SSYN is inappropriate for modelling the data structures of these languages. In fact, the SSYN model can be considered a superset of most other models.
For example, to represent the data structures sometimes referred to as arrays, lists, or vectors, SSYN can use an element with multiple unnamed child elements.
To represent the data structures sometimes referred to as associative arrays, dictionaries, maps, or hash tables, SSYN can use an element with multiple named child elements.
The simplicity of the SSYN model allows one to combine these structures to create new ones.
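As an illustration only, the following Python sketch shows one way the model could be emulated with a custom class. The class name `Element` and its methods are assumptions made for this example; this specification does not prescribe any programmatic interface.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Element:
    """One element: optional name, optional value, ordered children, optional parent."""
    name: Optional[str] = None
    value: Optional[str] = None
    children: list[Element] = field(default_factory=list)
    # Excluded from repr and comparison to avoid recursing through the parent link.
    parent: Optional[Element] = field(default=None, repr=False, compare=False)

    def add(self, child: Element) -> Element:
        """Append a child element and record this element as its parent."""
        child.parent = self
        self.children.append(child)
        return child

    def ancestors(self):
        """Yield the parent, the parent's parent, and so on."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


# List-like structure: an element with multiple unnamed child elements.
items = Element(name="items")
for part in ("872-AA", "926-AA"):
    items.add(Element(value=part))

# Map-like structure: an element with multiple named child elements.
ship_to = Element(name="ship to")
ship_to.add(Element(name="city", value="Mill Valley"))
ship_to.add(Element(name="state", value="CA"))

as_list = [child.value for child in items.children]
as_dict = {child.name: child.value for child in ship_to.children}
```

Because every element keeps both its ordered children and its parent, the same class covers list-like, map-like, and mixed structures without separate container types.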
In order to store or transfer a SSYN model onto or via some medium, the model must be encoded as a sequence of characters as defined by this section. Alternate representations may be defined at some future point but they MUST NOT be referred to as SSYN.
The sequence of characters that make up a serialized SSYN model must, of course, be encoded into some sequence of bytes. The section on encoding contains more information on the encodings conformant SSYN parsers are required to support and the means by which encodings can be detected.
The following classes of characters are defined for use throughout this specification.
Character Classes
TAB (U-0009)
SPACE (U-0020)
LF (U-000A)
VT (U-000B)
FF (U-000C)
CR (U-000D) followed by but not including any character other than LF (U-000A)
CR (U-000D) followed by and including LF (U-000A)
NEL (U-0085)
LS (U-2028)
PS (U-2029)
Note that a carriage return character followed by a line feed character is considered one end-of-line character for the purposes of this specification.
When this specification refers to the space character or the end-of-line character, that is to be interpreted as the space or end-of-line character that literally appears in the document which can be any of the above characters according to their class.
This specification makes no requirement on conformant SSYN parsers to expose end-of-line characters to clients as any specific character. Implementations can choose to expose the exact character(s) as found in the input stream or replace those characters with those chosen by the client.
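The end-of-line handling described above could be sketched in Python as follows. The function name `split_lines` is hypothetical, and dropping the end-of-line characters is just one of the choices an implementation is free to make. Python's built-in `str.splitlines` is not used because it also splits at FS, GS, and RS, which are not end-of-line characters here.

```python
from __future__ import annotations

import re

# The end-of-line class from the character classes above. CR LF is tried
# first so that the pair counts as a single end-of-line character.
EOL_RE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_lines(text: str) -> list[str]:
    """Split text at every end-of-line, dropping the end-of-line characters."""
    lines = EOL_RE.split(text)
    # A trailing end-of-line terminates the last line; it does not start a new one.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```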
Conformant SSYN parsers can be implemented in a variety of ways including, but not limited to, character-oriented parsers using a traditional state machine or line-oriented parsers using regular expressions.
As an example, the following regular expression can extract enough interesting information from a single line of input to be useful in implementing a line-oriented parser.
The groups matched by the above expression include the spaces leading up to the name (or the empty string), the name (or the empty string), the separator (‘:’, ‘::’ or nothing), the spaces leading up to the value (or the empty string), and (possibly part of) the value (or the empty string).
Using an expression like the above would still require unescaping the characters that needed to be unescaped as well as examining the end of the line for the line continuation character.
The space characters leading up to names and values matched by the above regular expression would need to be used to calculate the necessary indentation of child elements and values that span multiple lines. The result of that calculation would then be used to check subsequent lines to see if they contained a child element or more of the current element’s value.
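A minimal Python sketch of a line-matching expression with the five groups described above follows. The pattern `LINE_RE` is an assumption written for illustration and is not the expression from this specification. It expects a single line with its end-of-line character already removed, and it performs no unescaping and no line continuation handling.

```python
import re

# Hypothetical pattern; group numbers follow the description in the text.
LINE_RE = re.compile(
    r"^([ \t]*)"          # 1: space characters leading up to the name
    r"((?:[^|:]|\|.)*)"   # 2: name; a pipe escapes the character after it
    r"(::|:|)"            # 3: separator: '::', ':' or nothing
    r"([ \t]*)"           # 4: space characters leading up to the value
    r"(.*)$"              # 5: (possibly part of) the value
)


def split_line(line: str):
    """Return the five groups for one line as a tuple of strings."""
    return LINE_RE.match(line).groups()
```

Every group can match the empty string, so the expression matches any single line. The name group is returned in its escaped form.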
All conformant SSYN parsers MUST be able to process documents using the UTF-8, UTF-16, or UTF-32 encodings of Unicode 4.0.
SSYN documents encoded in UTF-16 or UTF-32 MUST begin with a Byte Order Mark. Documents encoded in UTF-8 MAY begin with a Byte Order Mark. The Byte Order Mark is used only to detect the encoding of the document and is not considered part of the document content.
The following table details the bytes that make up the Byte Order Marks for the different encodings.
Table 1. Byte Order Marks
| BOM | Encoding |
|---|---|
| 00 00 FE FF | UTF-32, big endian |
| FF FE 00 00 | UTF-32, little endian |
| FE FF | UTF-16, big endian |
| FF FE | UTF-16, little endian |
| EF BB BF | UTF-8 |
The absence of a Byte Order Mark indicates the document is encoded in UTF-8.
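The detection rules in Table 1 could be sketched in Python as shown below. The function names are hypothetical. The UTF-32 marks must be tested before the UTF-16 marks because the UTF-32 little endian mark begins with the same two bytes as the UTF-16 little endian mark.

```python
from __future__ import annotations

# Byte Order Marks from Table 1, longest first so that FF FE 00 00 is not
# mistaken for FF FE. The codec names are Python's, not part of SSYN.
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xef\xbb\xbf", "utf-8"),
]


def detect_encoding(data: bytes) -> tuple[str, int]:
    """Return (encoding, length of the Byte Order Mark) for a document."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    # No Byte Order Mark: the document is UTF-8.
    return "utf-8", 0


def decode_document(data: bytes) -> str:
    """Decode a document, excluding the Byte Order Mark from the content."""
    encoding, skip = detect_encoding(data)
    return data[skip:].decode(encoding)
```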
TODO: Introduce the special !SSYN directive that can be used to declare what encoding a document uses if the encoding is not a UTF encoding.
Names must be serialized on a single line and are terminated with a literal colon, an end-of-line character, or the end of input. Terminating a name with a literal colon is used to separate an element’s name from its value.
All three of the elements depicted in the following example have a name of “name”.
Names in the model are defined as being a sequence of Unicode characters but, because of their role in terminating a name, colons and end-of-line characters in names must be escaped in order to be serialized.
The element depicted in the following example has a name of “na:me”.
The element depicted in the following example has a name of “na$me” where “$” is a placeholder for the line feed character.
Names that begin with the “!” character indicate that the element is to be considered a directive. Names that begin with the “#” character indicate that the element is to be considered a comment. Given that, “!” and “#” characters must be escaped if a document author wishes a “normal” element name to begin with a “!” or “#”.
The elements depicted in the following example are not directives or comments.
See the Escaping Characters section for more information on the syntax of escaping characters.
As defined in the model, names are an optional component of elements.
The element depicted in the following example has no name but does have a value.
Note that space characters appearing immediately before any non-space characters on a line are only used for determining the depth of the element and are not considered part of the element’s name. To include leading spaces in a name, the spaces must be escaped.
There’s only one type of value in the model but there are two different ways of encoding values in the syntax.
The simple value syntax is indicated with one literal colon and terminated with an end-of-line character or end of input in the syntax but does not include any terminating end-of-line character in the model.
Simple values will typically be serialized on the same line as the name of an element and will be terminated by the end-of-line character on that same line.
The element depicted in the following example has a value of “value”.
Simple values can extend beyond the line the separator appears on if a line continuation character is used.
The element depicted in the following example has a value of “value”.
Since simple values are normally terminated with end-of-line characters, such characters must be escaped in order to be included in the actual value. Unlike names, however, colons do not need to be escaped.
The element depicted in the following example has a value of “val:ue$” where “$” is a placeholder for the line feed character.
Note that space characters appearing immediately after a literal colon character but before any non-space characters are not considered part of the element’s value. To include leading spaces in a value, the spaces must be escaped.
Block values enable serializing actual values containing end-of-line characters without having to escape those end-of-line characters.
Block values are indicated by using two literal colons instead of the one literal colon used to indicate simple values.
Block values can span multiple lines without the use of the line continuation character. Each line in a block value must be indented by a number of space characters equal to the number of characters on the line leading up to the first non-space character in the block value.
Block values are terminated either by the end of input or by a line with fewer leading space characters than were used to indent the first line containing non-space characters in the block value.
End-of-line characters in the block value are included in the actual value unless a line continuation character is used.
The element depicted in the following example has a value of “value$” where “$” is a placeholder for the end-of-line character.
The element depicted in the following example has a value of “line1$line2$” where “$” is a placeholder for the end-of-line character.
Care must be taken by document authors to ensure that subsequent lines in a block value are properly indented.
The element depicted in the following example has a value of “value$” where “$” is a placeholder for the end-of-line character. The line containing the string “name2” must be interpreted by conforming parsers as a child element with a name of “name2” and no value.
Block values can begin on a line after the line the block value indicator appears on. All space and end-of-line characters leading up to the first non-space character are not included in the actual value.
The element depicted in the following example also has a value of “line1$line2$” where “$” is a placeholder for the end-of-line character.
Note that, like simple values, leading space characters in a block value are not considered part of the element’s value so care must be taken to ensure that any leading space characters are properly escaped.
The element depicted in the following example has a value of “__line1$line2$” where “_” is a placeholder for the space character and “$” is a placeholder for the end-of-line character.
Child elements are indicated by indenting an element a number of space characters greater than the number of space characters used to indent the parent element.
Elements indented with the same number of space characters are considered siblings to each other and children of the nearest parent element previously appearing in the document indented with a lesser number of space characters.
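The indentation rule above can be sketched with a stack of currently open ancestors. This Python example is a simplification under stated assumptions: the function name `build_tree` and the dictionary layout are hypothetical, each input line has already been reduced to its indentation and name, and values (including block values) are ignored.

```python
def build_tree(lines):
    """Build a tree from (indent, name) pairs given in document order."""
    roots = []
    stack = []  # (indent, node) pairs for the currently open ancestors
    for indent, name in lines:
        node = {"name": name, "children": []}
        # Close every element indented at least as far as this one; what
        # remains on top is the nearest preceding element with less indentation.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        if stack:
            stack[-1][1]["children"].append(node)
        else:
            roots.append(node)  # no parent: a top-level element
        stack.append((indent, node))
    return roots


tree = build_tree([(0, "name1"), (2, "name2"), (2, "name3"), (0, "name4")])
```

Here "name2" and "name3" become siblings and children of "name1", while "name4" is a second top-level element.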
The element depicted in the following example has two child elements with names of “name2” and “name3”.
Elements with child elements can, as defined in the model, also have a value.
The element depicted in the following example has a value of “value” and two child elements with names of “name2” and “name3”.
Care must be taken by document authors when an element has both a block value and child elements.
The element depicted in the following example has a value of “line1$line2$” where “$” is a placeholder for the end-of-line character and one child element with a name of “name2”.
Literal space characters on a line appearing before any non-space characters are only used for determining the indentation level of an element or value. To include leading spaces in a name or value, the spaces must be escaped.
The elements depicted in the following example have names of “_name” and values of “_value” where “_” is a placeholder for the space character.
There are three different ways to escape characters. All three use the pipe character as the escape character. The pipe character was chosen because it occurs much less frequently in typical text than the world’s most popular escape character: the backslash. This makes it possible to have names and values in the model that refer to file paths (on systems that use backslashes as the path separator) or regular expressions.
Simple escaping is done by preceding what a SSYN parser would normally consider a special character with a pipe character. There are four special characters in SSYN: pipe (|), colon (:), bang (!), and hash (#).
Pipe characters must always be escaped. Colon characters only need be escaped when appearing in a name but are allowed to be escaped in values. Bang and hash characters only need to be escaped when appearing as the first character in a name but are allowed to be escaped elsewhere.
Another form of simple escaping is escaping space characters which is done by prefixing them with a pipe character. This is normally only required when a name or value needs to begin with leading space characters.
Named escaping is done by preceding a name with a pipe and terminating that name with a bang. The names that can be used in this manner and their substitution values appear in the following table.
Table 2. Named Character References
| Name | Code Point | Description |
|---|---|---|
| SOH | 0001 | Start of heading |
| STX | 0002 | Start of text |
| ETX | 0003 | End of text |
| EOT | 0004 | End of transmission |
| ENQ | 0005 | Enquiry |
| ACK | 0006 | Acknowledge |
| BEL | 0007 | Bell |
| BS | 0008 | Backspace |
| TAB | 0009 | Horizontal tab |
| LF | 000A | Line feed |
| VT | 000B | Vertical tab |
| FF | 000C | Form feed |
| CR | 000D | Carriage return |
| SO | 000E | Shift out |
| SI | 000F | Shift in |
| DLE | 0010 | Data link escape |
| DC1 | 0011 | Device control 1 |
| DC2 | 0012 | Device control 2 |
| DC3 | 0013 | Device control 3 |
| DC4 | 0014 | Device control 4 |
| NAK | 0015 | Negative acknowledge |
| SYN | 0016 | Synchronous idle |
| ETB | 0017 | End transmission block |
| CAN | 0018 | Cancel |
| EM | 0019 | End of medium |
| SUB | 001A | Substitute |
| ESC | 001B | Escape |
| FS | 001C | File separator |
| GS | 001D | Group separator |
| RS | 001E | Record separator |
| US | 001F | Unit separator |
| DEL | 007F | Delete |
| NEL | 0085 | Next line |
| LS | 2028 | Line separator |
| PS | 2029 | Paragraph separator |
TODO: Should we include HTML4 entities? Should we include other “control”-like characters from Unicode?
Numeric escaping is done by preceding a sequence of hexadecimal digits with a pipe and terminating those digits with a hash. The replacement character for numeric escapes is the Unicode character with the code point indicated by the hexadecimal digits in the escape. Any valid Unicode character (except NUL) can be referenced in this manner.
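The three escaping forms could be handled together as in the following Python sketch. The function name `unescape` is hypothetical, line continuation is not handled, and rejecting unknown names with an error is an assumption; this specification does not say how such input should be treated.

```python
import re

# Named character references from Table 2: code points 0001 through 001F
# in order, followed by the four remaining entries.
_C0 = ("SOH STX ETX EOT ENQ ACK BEL BS TAB LF VT FF CR SO SI DLE "
       "DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US").split()
NAMED = {name: chr(cp) for cp, name in enumerate(_C0, start=1)}
NAMED.update({"DEL": "\x7f", "NEL": "\x85", "LS": "\u2028", "PS": "\u2029"})

ESCAPE_RE = re.compile(
    r"\|(?:([0-9A-Fa-f]+)#"   # numeric: pipe, hexadecimal digits, hash
    r"|([A-Z0-9]+)!"          # named: pipe, name, bang
    r"|([|:!# \t]))"          # simple: pipe, special or space character
)


def _replace(match):
    digits, name, literal = match.groups()
    if digits is not None:
        code_point = int(digits, 16)
        if not 0 < code_point <= 0x10FFFF:  # NUL cannot be referenced
            raise ValueError("invalid numeric escape: " + match.group(0))
        return chr(code_point)
    if name is not None:
        if name not in NAMED:
            raise ValueError("unknown named escape: " + match.group(0))
        return NAMED[name]
    return literal


def unescape(text: str) -> str:
    """Replace simple, named, and numeric escapes in a name or value."""
    return ESCAPE_RE.sub(_replace, text)
```

The terminating character tells the forms apart: "|FF!" is the named reference for form feed, while "|FF#" is the numeric reference for code point 00FF.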
Conformance with this specification is determined by validating implementations against the SSYN test suite.
Each test in the test suite consists of a SSYN document and an “expected” document. It’s expected that test runners will parse each SSYN document and create a result document with the same format as the “expected” documents. The result documents could then be compared, byte-for-byte, with the “expected” documents.
In order for this comparison to work reliably the format for the expected/result files is very strict.
Each line in an expected/result file represents one element in the SSYN document. Lines start with the element’s depth encoded as a decimal number. The depth of an element is calculated as one plus the number of ancestors the element has. The depth is followed by a single space.
The name of the element is encoded next. It must be delimited with single quote characters so must begin with a literal single quote. Any characters in the name with a code point of less than 32 or greater than 126 must be escaped with their numeric escape codes. Pipe characters must be escaped with “||”. Single quote characters must be escaped with “|27#”. No other characters may be escaped. The name is followed by a literal single quote and a single space.
The value of the element is then encoded using the same rules as encoding names. The line is terminated with a single line feed character.
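A test runner could produce one line of a result file along the lines of the following Python sketch. The function names are hypothetical. Two details are assumptions that the rules above do not settle: numeric escapes are written with uppercase hexadecimal digits and no zero padding, and a missing name or value is written as an empty quoted string.

```python
def encode_field(text: str) -> str:
    """Quote a name or value using the expected/result file rules."""
    out = ["'"]
    for ch in text:
        code_point = ord(ch)
        if ch == "|":
            out.append("||")
        elif ch == "'" or code_point < 32 or code_point > 126:
            # Assumption: uppercase hexadecimal digits, no zero padding.
            out.append("|%X#" % code_point)
        else:
            out.append(ch)
    out.append("'")
    return "".join(out)


def format_line(depth: int, name, value) -> str:
    """Format one element as: depth, space, name, space, value, line feed."""
    # Assumption: a missing name or value is encoded as an empty string.
    return "%d %s %s\n" % (depth, encode_field(name or ""), encode_field(value or ""))
```

Since result files are compared byte-for-byte, the choices flagged as assumptions would have to match whatever the test suite's "expected" documents actually use.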
The following example represents a test.
The following example represents the expected/result file for the previous example.
TODO: Maybe it would be a good idea to link each example in the spec to an example in this section that shows the expected/result file for that example. Each example in the spec should also be part of the test suite.