TITLE: ISO/IEC WD 19757-5: Document Schema Definition Languages (DSDL) Part 5: Datatype Library Language (DTLL)
SOURCE: Jeni Tennison
PROJECT: WD 19757-5: Document Schema Definition Language (DSDL) Part 5 - Datatypes
PROJECT EDITOR: Mr. Martin Bryan
STATUS: Working Draft
ACTION: For review and comments by national standards bodies prior to discussion at the November 2004 meeting of SC34 WG1
DATE: 2004-09-19
DISTRIBUTION: JTC1, SC34 and Liaisons
REFER TO: N0546a1 - 2004-09-19 - Datatypes in XML
REPLY TO: Dr. James David Mason (ISO/IEC JTC 1/SC 34 Secretariat - Standards Council of Canada) Crane Softwrights Ltd. Box 266, Kars, ON K0A-2E0 CANADA Telephone: +1 613 489-0999 Facsimile: +1 613 489-0995 Network: [email protected] http://www.jtc1sc34.org
Datatype Library Language (DTLL)
Jeni Tennison
Unlike XML Schema, RELAX NG doesn't provide a mechanism for users to define their own types. If they're not satisfied with the two built-in types of string and token, RELAX NG users have to create a datatype library, which they then refer to from the schema.
Most RELAX NG validators provide built-in support for the XML Schema datatype library. Many also support an interface that allows you to plug in datatype modules, written in the programming language of your choice, to define extra datatypes. But the fact that these datatype libraries have to be programmed means that ordinary users find them hard to construct.
One option would be for RELAX NG validators to support datatype definition via XML Schema - using <xs:simpleType> elements to create new atomic types. However, there are several problems with this:
• It wouldn't be particularly easy for implementations to support the <xs:simpleType> elements in isolation, and RELAX NG validators don't want to have to be able to understand XML Schema schemas.
• It wouldn't be particularly easy for RELAX NG users to switch to using the very different style employed by XML Schema, and again RELAX NG users don't want to have to be able to write XML Schema schemas.
• Creating user-defined datatypes based on the XML Schema datatypes means incorporating all the built-in types, including types that are unlikely to be required for a particular schema.
• In general, the XML Schema type system goes against RELAX NG's open philosophy, for example by dictating the required format for numbers and dates when different markup languages might reasonably use different formats (for internationalisation reasons, for example).
So the first motivation for putting together a language for datatype libraries is to enable RELAX NG users to construct their own datatypes without having to resort to a procedural programming language or having to learn how to use XML Schema, which might not be suited for their needs.
The second motivation is to provide a mechanism for defining the datatypes that can be used in XPath 2.0, again without recourse to XML Schema. The noteworthy point here is that, as a processing language rather than a validation language, XPath 2.0 has slightly different requirements from a datatype library language than RELAX NG validators.
datatypes xs = "http://www.w3.org/2001/XMLSchema-datatypes"
default namespace dt = "http://www.jenitennison.com/datatypes"
namespace local = ""
start = \datatypes
<datatypes> is the document element.
The version attribute holds the version of the datatype library language. The current version is 0.3.
If a DTLL version 0.3 processor encounters a datatype library with a version higher than 0.3, it must treat any attributes or elements that it doesn't understand (that are not part of DTLL 0.3) in the same way as it would treat extension attributes or elements found in the same location.
\datatypes = element datatypes {
attribute version { "0.3" },
ns?,
extension-attribute*,
top-level-element*
}
top-level-element |= named-datatype
top-level-element |= top-level-map
top-level-element |= \include
top-level-element |= \div
top-level-element |= extension-top-level-element
<include> elements include datatype libraries from elsewhere. It is as if the content of the included document (the children of the <datatypes> element) is inserted into the datatype library in place of the <include> element.
\include = element include {
attribute href { xs:anyURI },
extension-attribute*
}
<div> elements are simply used to partition a datatype library and to provide a scope for ns attributes.
\div = element div {
ns?,
extension-attribute*,
top-level-element*
}
Extension top-level elements can be used to hold data that is used within the datatype library (such as value elements for enumerations), documentation, or information that is used by implementations. For example, an extension top-level element can be used by an implementation to define extension functions (using XSLT, for example) that can be used in the XPath expressions used within the datatype library.
extension-top-level-element = extension-element
Named datatypes are given at the top level of the datatype library using <datatype> elements. Each named datatype has a qualified name that can be used to refer to it.
The name of the datatype is given in the name attribute. If this is unprefixed, the nearest ancestor ns attribute (including one on the <datatype> element itself) is used to provide the namespace for the datatype.
named-datatype = element datatype {
attribute name { xs:QName }, ns?,
extension-attribute*,
datatype-definition-element*
}
Anonymous datatypes are used to provide the datatype for a parameter, property or variable if that parameter, property or variable's type can't be referred to by name.
anonymous-datatype = element datatype {
extension-attribute*,
datatype-definition-element*
}
Datatypes are referenced using qualified names. If the qualified name hasn't got a prefix, the nearest ancestor ns attribute (including one on the element that's referring to the datatype) is used to resolve the name.
datatype-reference = xs:QName
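The lookup of the nearest ancestor ns attribute, used for unprefixed names, can be sketched in Python as follows. This is an illustration only and not part of DTLL: the helper name nearest_ns, the example namespace URIs and the choice to return None when no ns attribute is in scope are all assumptions.

```python
import xml.etree.ElementTree as ET

DTLL_NS = "http://www.jenitennison.com/datatypes"

def nearest_ns(root, element):
    """Return the nearest ns attribute on element or one of its ancestors.

    Returns None if no ns attribute is in scope (an assumption; the text
    above does not say what happens in that case).
    """
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = element
    while node is not None:
        if "ns" in node.attrib:
            return node.attrib["ns"]
        node = parents.get(node)
    return None

# Hypothetical library: the example.com namespace URIs are made up.
LIBRARY = ET.fromstring(
    '<datatypes xmlns="%s" version="0.3" ns="http://example.com/a">'
    '<div ns="http://example.com/b"><datatype name="x"/></div>'
    '<datatype name="y"/>'
    '</datatypes>' % DTLL_NS
)
```

In this sketch the datatype named x picks up the namespace from the enclosing <div>, while y falls back to the ns attribute on <datatypes>.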
A datatype definition consists of parameters and constraints, value definition, maps to other types, and a collation.
datatype-definition-element |= paramDef
datatype-definition-element |= constraint
datatype-definition-element |= property
datatype-definition-element |= parse
datatype-definition-element |= condition
datatype-definition-element |= except
datatype-definition-element |= variable
datatype-definition-element |= local-map
datatype-definition-element |= super
datatype-definition-element |= sub
datatype-definition-element |= collate
datatype-definition-element |= extension-definition-element
Extension definition elements can be used at any point within a datatype definition. If a processor doesn't recognise an extension definition element, it must ignore it and behave as if the value passed whatever test the extension definition element represented.
Example 1. Using Extension Definition Elements for Documentation
Extension definition elements can be used to hold documentation about the datatype. For example, an <eg:example> element might be used to provide example legal values of the datatype:
<datatype name="RRGGBBColour">
<eg:example>#FFFFFF</eg:example>
<eg:example>#123456</eg:example>
<parse name="RRGGBB">
<regex>#(?[RR][0-9A-F]{2})(?[GG][0-9A-F]{2})(?[BB][0-9A-F]{2})</regex>
</parse>
...
</datatype>
extension-definition-element = extension-element
Certain aspects of a datatype definition can be negated by being placed in an <except> element. A value is only valid if it isn't valid according to any of the datatype definition elements held within an <except> element.
except = element except {
extension-attribute*,
negative-test+
}
negative-test |= condition
negative-test |= constraint
negative-test |= variable
negative-test |= parse
Parsing can perform two functions: it tests whether a value adheres to a particular format, and can assign a tree value to a variable to enable pieces of the string value to be extracted, tested, assigned to properties and so on.
The <parse> element holds any number of parsing methods, one or more of which must be satisfied in order for the value to be considered valid. The name attribute, if present, specifies the name of the variable to which the tree resulting from the parse is assigned. The first successful parse is used to give the value of this variable (thus the processor does not have to attempt any further parses once one has been successful).
A datatype can specify as many <parse> elements as it wishes. All must be satisfied by a value for that value to be a legal value of the datatype. A datatype that doesn't specify a <parse> element (either itself or by inheritance from a supertype) is an abstract datatype.
parse = element parse {
name?, preprocess*,
extension-attribute*,
parsing-method+
}
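The way parsing methods and <parse> elements combine can be sketched in Python as follows. This is a minimal model under stated assumptions, not a DTLL implementation: each parsing method is represented by a hypothetical callable that returns a result on success and None on failure, and the sketch does not model abstract datatypes (those with no <parse> element at all).

```python
import re

def run_parse(methods, value):
    """Try each parsing method in order; return the first successful result.

    An unrecognised extension parsing element is modelled as a callable
    that always returns None.
    """
    for method in methods:
        result = method(value)
        if result is not None:
            return result  # later methods need not be attempted
    return None

def is_valid(parses, value):
    """A value is legal only if every <parse> element is satisfied."""
    return all(run_parse(methods, value) is not None for methods in parses)

# Hypothetical parsing methods, for illustration only.
def unrecognised_extension(value):
    return None

def hex_colour(value):
    return re.fullmatch(r"#[0-9A-F]{6}", value)

def anything(value):
    return re.fullmatch(r".*", value, re.DOTALL)
```

Within one <parse> element the methods act as alternatives, while separate <parse> elements act as a conjunction.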
Before a value is parsed by a <parse> element, it can be preprocessed. This does not change the string value, but it may simplify the specification of the parsing method that's used.
The only built-in form of preprocessing is whitespace processing. The whitespace can be preserved ('preserve'), whitespace characters replaced by space characters ('replace'), or leading and trailing whitespace stripped and sequences of whitespace characters replaced by spaces ('collapse', the default).
preprocess |= attribute whitespace {
"preserve" | "replace" | "collapse"
}
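The three whitespace modes can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that "whitespace characters" means tab, line feed, carriage return and space, as in XML.

```python
import re

def preprocess_whitespace(value, mode="collapse"):
    """Apply one of the three whitespace modes to a string value.

    Assumption: whitespace means tab, line feed, carriage return and space.
    """
    if mode == "preserve":
        return value
    if mode == "replace":
        # Each whitespace character becomes a single space.
        return re.sub(r"[\t\n\r]", " ", value)
    if mode == "collapse":
        # Runs of whitespace become one space; the ends are stripped.
        return re.sub(r"[ \t\n\r]+", " ", value).strip(" ")
    raise ValueError("unknown whitespace mode: %r" % mode)
```

Note that 'replace' keeps the length of the string unchanged, whereas 'collapse' may shorten it.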
There are three core methods of parsing: via a regular expression, by enumerating legal values, and by specifying a list. This set of methods can be supplemented by extension parsing elements.
parsing-method |= regex
parsing-method |= enumeration
parsing-method |= \list
parsing-method |= extension-parsing-element
The <regex> element specifies parsing via an extended regular expression. To be a legal value, the entire string value must be matched by the regular expression. (Although it's legal to use ^ and $ to mark the beginning and end of the matched string, it's not necessary.)
The tree value generated by parsing consists of a root (document) node with text node and element children. The string value of the root (document) node is the string value itself. There is one element for each named subexpression. The element's name is the name of the subexpression, in the namespace indicated by the prefix used in the name. If no prefix is used, the element is in no namespace. The string value of each of these elements is the part of the string value matched by the corresponding subexpression.
Example 2. Regular Expression Parsing
For example, the regex:
(?[year]-?[0-9]{4})-(?[month][0-9]{2})-(?[day][0-9]{2})
parsing the value:
2003-12-19
generates the tree:
(root)
+- year
| +- "2003"
+- "-"
+- month
| +- "12"
+- "-"
+- day
+- "19"
regex = element regex {
regex-flags*,
extension-attribute*,
extended-regular-expression
}
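The named subexpression syntax can be approximated with Python's re module by rewriting (?[name]...) as (?P<name>...). The following is a sketch under stated assumptions: the helper names are hypothetical, only unprefixed names that are valid Python identifiers are handled, and the result is a flat dictionary rather than the tree with text nodes described above.

```python
import re

# Matches the opening of a named subexpression such as "(?[year]".
# Assumption: names are unprefixed and are valid Python identifiers;
# prefixed (namespaced) names are not handled by this sketch.
_NAMED_GROUP = re.compile(r"\(\?\[([A-Za-z_][A-Za-z0-9_]*)\]")

def to_python_regex(dtll_regex):
    """Rewrite (?[name]...) as Python's (?P<name>...)."""
    return _NAMED_GROUP.sub(r"(?P<\1>", dtll_regex)

def parse_with_regex(dtll_regex, value):
    """Return {name: matched text} if the whole value matches, else None."""
    match = re.fullmatch(to_python_regex(dtll_regex), value)
    return None if match is None else match.groupdict()
```

Using re.fullmatch reflects the requirement that the entire string value be matched, so no explicit ^ or $ is needed.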
Four attributes modify the way in which regular expressions are applied. These are equivalent to the flags available within XPath 2.0.
By default, the "." meta-character matches all characters except the newline (#xA) character. If dot-all="true" then "." matches all characters, including the newline character.
regex-flags |= attribute dot-all { boolean }
By default, ^ matches the beginning of the entire string and $ the end of the entire string. If multi-line="true" then ^ matches the beginning of each line as well as the beginning of the string, and $ matches the end of each line as well as the end of the string. Lines are delimited by newline (#xA) characters.
regex-flags |= attribute multi-line { xs:boolean }
By default, the regular expression is case sensitive. If case-insensitive="true" then the matching is case-insensitive, which means that the regular expression "a" will match the string "A".
regex-flags |= attribute case-insensitive { xs:boolean }
By default, whitespace within the regular expression matches whitespace in the string. If ignore-whitespace="true", whitespace in the regular expression is removed prior to matching, and you need to use "\s" to match whitespace. This can be used to create more readable regular expressions.
Example 3. Ignoring Whitespace in Regular Expressions
<regex ignore-whitespace="true">
(?[year][0-9]{4})-
(?[month][0-9]{2})-
(?[day][0-9]{2})
</regex>
Note: This is not the same as <parse whitespace="collapse">...</parse>, which preprocesses the string value itself.
regex-flags |= attribute ignore-whitespace { xs:boolean }
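The four flags have close counterparts in Python's re module, which gives one way to picture their effect. The following is a sketch, not a statement of how a DTLL processor must behave: the function name is hypothetical, patterns are written in Python syntax, and whitespace is stripped everywhere in the pattern, including inside character classes.

```python
import re

def compile_regex(pattern, dot_all=False, multi_line=False,
                  case_insensitive=False, ignore_whitespace=False):
    """Compile a pattern using Python equivalents of the four flags."""
    flags = 0
    if dot_all:
        flags |= re.DOTALL
    if multi_line:
        flags |= re.MULTILINE
    if case_insensitive:
        flags |= re.IGNORECASE
    if ignore_whitespace:
        # Strip whitespace by hand rather than using re.VERBOSE, which
        # would also treat "#" as the start of a comment.
        pattern = re.sub(r"[ \t\n\r]+", "", pattern)
    return re.compile(pattern, flags)
```

Avoiding re.VERBOSE matters for patterns such as the colour examples in this document, which begin with a literal "#".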
Boolean values are 'true' or 'false', with optional leading and trailing whitespace.
boolean = xs:boolean { pattern = "true|false" }
The <enumeration> element specifies parsing against a list of allowed values. Allowed values are represented by value elements (see below). The code attribute holds an XPath expression. A string value is legal if there is a value element such that when the value element is used as the context node for evaluating the expression held in the code attribute, the result is the string value.
Example 4. Enumerated Values
For example, the following allows the string values 'Jan', 'Feb' and so on:
<enumeration name="month" code="@abbr">
<my:month abbr="Jan">January</my:month>
<my:month abbr="Feb">February</my:month>
...
</enumeration>
The value assigned to the variable declared by the <parse> element (if there is one; that is, if the name attribute is given) is a node-set of those value elements for which the result of evaluating the expression held in the code attribute is the string value. Note that the node-set holds the original value elements and it is therefore possible to go up the tree to access information about their ancestors.
If the code attribute is missing, the default is "string(.)". In other words, the default is that the legal enumerated values are the string values of the value elements.
enumeration = element enumeration {
attribute code { XPath }?,
extension-attribute*,
values
}
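Matching a string value against an enumeration can be sketched in Python as follows. This is a simplified illustration: the helper names and the "my" namespace URI are hypothetical, and only two forms of the code expression are supported ("string(.)" and "@name"), which is a limitation of the sketch rather than of DTLL.

```python
import xml.etree.ElementTree as ET

def evaluate_code(code, element):
    """Evaluate a very small subset of XPath against a value element.

    Only "string(.)" and "@name" are supported by this sketch.
    """
    if code == "string(.)":
        return "".join(element.itertext())
    if code.startswith("@"):
        return element.get(code[1:])
    raise NotImplementedError("unsupported expression: %s" % code)

def matching_values(enumeration, value):
    """Return the value elements whose code evaluates to the string value."""
    code = enumeration.get("code", "string(.)")
    return [el for el in enumeration if evaluate_code(code, el) == value]

# Hypothetical enumeration; the "my" namespace URI is made up.
MONTHS = ET.fromstring(
    '<enumeration xmlns="http://www.jenitennison.com/datatypes"'
    ' xmlns:my="http://example.com/my" code="@abbr">'
    '<my:month abbr="Jan">January</my:month>'
    '<my:month abbr="Feb">February</my:month>'
    '</enumeration>'
)
```

The returned list plays the role of the node-set of matching value elements; an empty list means the string value is not legal.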
There are two methods of specifying value elements when creating an enumerated list of legal values: through a values attribute or using the element children of the <enumeration> element.
The values attribute holds an XPath expression that evaluates to a set of value elements. This is particularly useful for referencing lists of legal values that are stored externally.
Example 5. Referencing External Code Lists
<enumeration values="document('languages.xml')/languages/language"
code="@two-letter-abbr" />
values |= attribute values { XPath }
The second way of listing the legal values in an enumeration is via a list of element children of the <enumeration> element. These can be elements in any namespace other than "http://www.jenitennison.com/datatypes" or <value> elements. This method can be used for simple enumerations that are not worth listing externally.
For the purpose of the node-set held by the variable declared via the <enumeration> element, the children of the <enumeration> element are considered to be the only children of a new document node.
Example 6. Parsing Using Enumerations
For example, given:
<enumeration name="whitespace-treatment">
<value>preserve</value>
<value>replace</value>
<value>collapse</value>
</enumeration>
if the string value is 'replace' then the $this.whitespace-treatment variable is set to the second <value> element in the tree:
root
+- value
| +- "preserve"
+- value
| +- "replace"
+- value
+- "collapse"
values |= value-element+
value-element |= element value { anything }
value-element |= extension-value-element
extension-value-element = extension-element
The <list> element specifies parsing of the string value into a list of values. This uses the same method as that used in RELAX NG, except that a separator attribute is used to provide a regular expression that is used to break up the list into items.
The result of parsing the string value based on the <list> element is a node-set of sibling elements, each of which has a string value of the item type. The names of the item elements are implementation-defined.
Example 7. Parsing Lists
For example, if you have:
<list separator="\s*,\s*">
<oneOrMore><data type="integer" /></oneOrMore>
</list>
and the string value:
1, 2, 3, 45
then the variable is set to the elements in the tree:
root
+- item
| +- "1"
+- item
| +- "2"
+- item
| +- "3"
+- item
+- "45"
These elements need not be named 'item'.
The separator attribute specifies a regular expression that matches the separators in the list. The default is "\s+" (one or more whitespace characters). It is an error if the regular expression matches an empty string (i.e. if it matches "").
\list = element list {
attribute separator { regular-expression }?,
extension-attribute*,
item-pattern+
}
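Splitting a string value with a separator regular expression can be sketched in Python as follows. The function names are hypothetical, and the integer check uses an assumed lexical form (an optional sign followed by digits) in place of a real reference to an integer datatype.

```python
import re

def split_list(value, separator=r"\s+"):
    """Split a string value into items using a separator regex."""
    # It is an error for the separator to match the empty string.
    if re.fullmatch(separator, "") is not None:
        raise ValueError("separator must not match the empty string")
    return re.split(separator, value)

def parse_integer_list(value, separator=r"\s+"):
    """Return the items if every item looks like an integer, else None.

    Assumption: an integer is an optional sign followed by digits.
    """
    items = split_list(value, separator)
    if all(re.fullmatch(r"[+-]?[0-9]+", item) for item in items):
        return items
    return None
```

A separator such as "\s*" is rejected by this sketch because it can match the empty string.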
Item patterns follow the same syntax as that used in RELAX NG, except that datatypes are referenced via a slightly different syntax.
item-pattern |= group
item-pattern |= choice
item-pattern |= optional
item-pattern |= zeroOrMore
item-pattern |= oneOrMore
item-pattern |= \list
item-pattern |= item-datatype
item-pattern |= item-value
group = element group { extension-attribute*, item-pattern+ }
choice = element choice { extension-attribute*, item-pattern+ }
optional = element optional { extension-attribute*, item-pattern+ }
zeroOrMore = element zeroOrMore { extension-attribute*, item-pattern+ }
oneOrMore = element oneOrMore { extension-attribute*, item-pattern+ }
item-datatype |= element data {
extension-attribute*,
type, param*
}
item-datatype |= anonymous-datatype
item-value = element value {
extension-attribute*,
type?, text
}
Extension parsing elements can be used to parse values using methods other than the core methods explained above. Extension parsing elements can be used, for example, to parse a value using EBNF or PEGs.
If the extension parsing element isn't recognised, the value is considered to fail the parse. If the extension parsing element occurs in a <parse> element without any alternative parsing methods, this means no value can match the datatype, and the implementation must issue a warning. Usually, an extension parsing element will be used alongside a built-in parsing method.
Example 8. Using Extension Parsing Elements
<parse name="path">
<ext:ebnf ref="http://www.w3.org/1999/xpath" />
<regex dot-all="true">.*</regex>
</parse>
extension-parsing-element = extension-element
Constraints and conditions define tests that must be true. A constraint is a compile-time test (that checks the values of parameters) whereas a condition is a run-time test (that checks a value).
Tests that involve parameters are only evaluated if the parameter has been assigned a value.
Constraints encode relationships between parameters. The test must evaluate as true for the datatype definition to be legal.
Example 9. Constraints
<constraint test="dt:le($type.min, $type.max)" />
constraint = element constraint {
extension-attribute*,
test
}
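The rule that tests involving parameters are only evaluated once those parameters have values can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that dt:le is a less-than-or-equal comparison, which this section does not define.

```python
def check_min_max(params):
    """Model the constraint dt:le($type.min, $type.max).

    params maps parameter names to values. The test is skipped (treated
    as satisfied) when either parameter has not been assigned a value.
    Assumption: dt:le means "less than or equal to".
    """
    if "min" not in params or "max" not in params:
        return True
    return params["min"] <= params["max"]
```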
The <condition> element tests whether a particular condition is satisfied by a value. The value is not valid if the test evaluates to false.
condition = element condition {
extension-attribute*,
test
}
Tests are done through a test attribute which holds an XPath expression. If the effective boolean value of the test attribute is true then the test succeeds.
test = attribute test { XPath }
Parameters, properties and variables all declare variables for use in binding expressions (i.e. XPath expressions). Parameter variables are of the form $type.name where name is the name of the parameter; property variables are of the form $this.name where name is the name of the property; ordinary variables just use the name of the variable.
The type attribute (or <datatype> child element) specifies the type of the parameter, property or variable. The provided value is cast to that type as follows:
1. If the type of the provided value is a string, the value is taken as the string value of the required type and parsed/tested accordingly. It is an error if it is not a legal value or if the required type is an abstract type.
2. If the provided value is a subtype of the required type, that value is used as is.
3. If the provided value is of type T and the required type is type R, and there is a map from T to R, then that map is used to convert the value to the required type. If there are several possible maps then it is an error if they result in different values.
4. If the provided value is a supertype of the required type, that value must meet all the additional constraints specified by the required type, otherwise it's an error.
5. In other cases, the string value of the provided value is taken as the string value of the required type and parsed/tested accordingly. It is an error if it is not a legal string value or if the required type is an abstract type.
Variable binding is carried out in order. It is an error if a variable is referenced without being declared (although note that parameters and properties do not have to be declared locally, since they can be inherited from a supertype).
Parameters hold datatype-level values that can be used to parameterise the conditions that might apply to legal values of the datatype.
Parameters are defined using a <param> element. The value of the parameter is accessible in the bindings and tests of following parameters, properties, constraints and conditions via a variable reference of the form $type.name where name is the name of the param.
Example 10. Parameter Declarations
<param name="min" type="integer" value="0" subtype="ge" />
assigns the integer 0 to the variable $type.min.
paramDef = element param {
name, type?, binding?, subtype?,
extension-attribute*
}
The subtype attribute controls the relationship between the value of the parameter for a subtype and the value of the parameter for its supertype.
subtype = attribute subtype { relation }
There are six built-in relationships (any, eq, lt, le, gt and ge), and this set can be extended. The value 'eq' effectively fixes the value of the parameter. The value 'any' means that the subtype's value can be anything.
relation = "any" | "eq" | "lt" | "le" | "gt" | "ge" | extension-relation
Extension relationships are implementation-defined, and represented by a qualified name. For example, an implementation might define a ext:substring relationship that indicates that the subtype's value must be a substring of the supertype's value. If an implementation doesn't recognise the extension relationship then it must treat it as 'any'.
extension-relation = xs:QName - xs:NCName
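Checking a subtype's parameter value against its supertype's value can be sketched in Python as follows. The names are hypothetical, and the sketch assumes that the relation is read as "subtype value REL supertype value" (so 'ge' means the subtype's value must be greater than or equal to the supertype's).

```python
import operator

# Built-in relations, read as: subtype_value REL supertype_value.
BUILT_IN = {
    "any": lambda sub, sup: True,
    "eq": operator.eq,
    "lt": operator.lt,
    "le": operator.le,
    "gt": operator.gt,
    "ge": operator.ge,
}

def subtype_value_allowed(relation, subtype_value, supertype_value):
    """Return True if the subtype's value satisfies the relation."""
    # Unrecognised (extension) relations are treated as 'any'.
    test = BUILT_IN.get(relation, BUILT_IN["any"])
    return test(subtype_value, supertype_value)
```

An implementation that did recognise an extension relation would add it to the table instead of falling back to 'any'.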
The <param> element defined here (as opposed to parameter declarations, described above), assigns a value to a parameter and is used when subtyping a datatype. A binding must be specified and the value of the parameter must meet the constraints specified on the parameter declaration.
param = element param {
name, binding,
extension-attribute*
}
The <property> element specifies a property of the datatype. The values of properties are available via the dt:property() extension function (or via other APIs). The value of a property for a value can be referenced using $this.name where name is the value of the name attribute on the <property> element. Properties are inherited by subtypes. If no binding is specified for a property, the datatype is an abstract datatype; the value of that property can be supplied in its subtypes or in mappings to the datatype.
Example 11. Mapping and Supertypes
For example, consider:
<datatype name="RGBcolour">
<property name="red" type="byte" />
<property name="green" type="byte" />
<property name="blue" type="byte" />
<map to="HSLcolour">
<property name="hue" select="..." />
<property name="saturation" select="..." />
<property name="luminance" select="..." />
</map>
</datatype>
<datatype name="RRGGBB">
<super type="RGBcolour" />
<regex name="colour" ignore-whitespace="true">
#(?[red][0-9A-F]{2})
(?[green][0-9A-F]{2})
(?[blue][0-9A-F]{2})
</regex>
<property name="red" select="$colour/red" />
<property name="green" select="$colour/green" />
<property name="blue" select="$colour/blue" />
</datatype>
property = element property {
name, type?, binding?,
extension-attribute*
}
The <variable> element binds a value to a variable. Variables are similar to properties except that they are not inherited by subtypes and their values aren't accessible via APIs. Variables therefore must have a binding specified. The value of a variable is accessed through $name, where name is the name of the variable. Variables are used for intermediate calculations.
variable = element variable {
name, type?, binding,
extension-attribute*
}
There are two ways to specify a type: via a type attribute (with <param> elements further parameterising the type) or via an anonymous <datatype> element.
type |= attribute type { datatype-reference }, param*
type |= anonymous-datatype
There are three built-in ways to bind a value to a parameter, property or variable: through the value attribute, which holds a literal value, through a select attribute, which holds an XPath expression, or through a sequence of <property> elements. Implementations can also define their own extension binding elements.
binding = (literal-value | select | property+), extension-binding-element*
If a value attribute is specified, its value is the string value of the value of the variable; the type of the variable is used to interpret that value.
literal-value = attribute value { text }
If a select attribute is specified, the XPath expression it contains is evaluated to give the value of the parameter, property or variable.
select = attribute select { XPath }
If a sequence of <property> elements is used, they provide a value of an abstract type. The type specified by the type attribute (or the anonymous datatype) must be an abstract type.
Extension binding elements are used where more power is needed to specify the value of a parameter, property or variable. This can be used to provide values using methods such as XSLT or MathML. If an implementation does not support any of the extension binding elements specified, then it must assign to the variable the value specified by the value or select attribute instead. If an implementation supports one or more of the extension binding elements, then it must use that element to calculate the value of the variable.
extension-binding-element = extension-element
There are two ways of specifying a mapping to another datatype: through a basic map and through a supertype relationship. Mappings are uni-directional: if there's a mapping from datatype A to datatype B then every legal value of datatype A must map onto a legal value of datatype B, but the reverse is not necessarily the case. The main difference between a map and a supertype is that a subtype inherits properties, parameters and collations from its supertype but doesn't from datatypes onto which it maps.
The <map> element defines a map from one datatype to another. Maps can be local (in which case they define a map to or from the datatype in which they're specified) or top-level. To be a legal value of a datatype, a value must be castable to a legal value of each datatype to which the datatype maps. The content of the <map> element defines how the mapping is done.
Note: It is possible for there to be maps in both directions between two datatypes, but it is not necessarily the case that a round-trip will result in the same string value.
Example 12. Changes When Round-Tripping
For example, with the datatype definitions:
<datatype name="UKDate">
<regex name="date" ignore-whitespace="true">
(?[day][0-9]{1,2})/(?[month][0-9]{1,2})/(?[year][0-9]{4})
</regex>
<property name="year" select="$date/year" />
<property name="month" select="$date/month" />
<property name="day" select="$date/day" />
<map to="ISODate"
select="concat(format-number($this.year, '0000'), '-',
format-number($this.month, '00'), '-',
format-number($this.day, '00'))" />
</datatype>
<datatype name="ISODate">
<regex name="date" ignore-whitespace="true">
(?[year][0-9]{4})-(?[month][0-9]{2})-(?[day][0-9]{2})
</regex>
<property name="year" select="$date/year" />
<property name="month" select="$date/month" />
<property name="day" select="$date/day" />
<map to="UKDate"
select="concat($this.day, '/', $this.month, '/', $this.year)" />
</datatype>
the UKDate "5/1/1947" maps to the ISODate "1947-01-05", which maps back to the UKDate "05/01/1947".
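The round trip described above can be reproduced with a short Python sketch. The function names are hypothetical, zero-padding stands in for format-number(), and the ISODate format is assumed to use "-" as its separator, following the "1947-01-05" result stated above.

```python
import re

def uk_to_iso(uk_date):
    """Map a UKDate string (d/m/yyyy) to an ISODate string (yyyy-mm-dd)."""
    match = re.fullmatch(r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})", uk_date)
    if match is None:
        raise ValueError("not a UKDate: %r" % uk_date)
    day, month, year = (int(part) for part in match.groups())
    # Zero-padding plays the role of format-number(..., '00').
    return "%04d-%02d-%02d" % (year, month, day)

def iso_to_uk(iso_date):
    """Map an ISODate string (yyyy-mm-dd) to a UKDate string."""
    match = re.fullmatch(r"([0-9]{4})-([0-9]{2})-([0-9]{2})", iso_date)
    if match is None:
        raise ValueError("not an ISODate: %r" % iso_date)
    year, month, day = match.groups()
    # Plain concatenation, as in the map from ISODate to UKDate.
    return "%s/%s/%s" % (day, month, year)
```

The value survives the round trip, but its string representation changes because the ISODate form always carries two-digit days and months.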
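Local maps appear within <datatype> elements. A local map with a to attribute defines a map from the datatype in which the <map> element appears to the datatype referenced by the to attribute; a local map with a from attribute defines a map from the datatype referenced by the from attribute to the datatype in which the <map> element appears.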
local-map = element map {
(from | to), mapping,
extension-attribute*
}
Top-level maps appear within the <datatypes> element and define maps from the datatype referenced in the from attribute to the datatype referenced in the to attribute.
top-level-map = element map {
from, to, mapping,
extension-attribute*
}
The to attribute holds a reference to a datatype.
to = attribute to { datatype-reference }
The from attribute holds a reference to a datatype.
from = attribute from { datatype-reference }
Mapping definitions are carried out in two ways depending on whether the target datatype is an abstract datatype or a concrete datatype. If the target datatype is concrete, the map is done through a binding which creates a string which is a valid string value for the target datatype. If the target datatype is abstract, the map provides values for unbound properties defined by the target datatype.
mapping = binding
The <super> element defines a map from the datatype to a supertype. There are several differences between a map and a subtype-supertype relationship:
• a subtype cannot be a supertype of its supertype (no circularity)
• if both types are concrete, all string values that are legal for the subtype must also be legal for the supertype
• a subtype inherits parameters, properties and the collation of its supertype
• it is not legal for an abstract type to have a concrete supertype
A type can have multiple supertypes. It inherits properties and parameters from all of them. A subtype also inherits the collations from its supertypes; either the subtype must define its own collations or all the supertypes must use the same collations.
super = element supertype {
attribute ref { datatype-reference },
extension-attribute*,
param*
}
A datatype can define itself by reference to subtypes, which is equivalent to creating a union type.
Example 13. Creating Union Types
<datatype name="myDate">
<property name="year" type="year" />
<property name="month" type="month" />
<property name="day" type="day" />
<subtype ref="ISODate" />
<subtype>
<parse name="date">
<regex>(?[day][0-9]{1,2}) (?[month][A-Z]{3}) (?[year][0-9]{4})</regex>
</parse>
<property name="year" select="$date/year" />
<property name="month" select="$date/month" />
<property name="day" select="$date/day" />
</subtype>
</datatype>
The <subtype> element references an existing datatype via the ref attribute. The <datatype> element in which the <subtype> element appears is a supertype for the referenced datatype. The subtype can be parameterised.
sub |= element subtype {
attribute ref { datatype-reference },
extension-attribute*,
param*
}
Alternatively, a <subtype> element without a ref attribute creates a local, anonymous datatype and defines its supertype as being the datatype defined by the <datatype> element in which the <subtype> element appears.
sub |= element subtype {
extension-attribute*,
datatype-definition-element*
}
Collations define how two values of a particular datatype can be compared. Collations can be specified either through a simple collation or through a combination of simple collations. Each datatype can specify multiple collations, but there can be only one collation without a test attribute. When comparing two values, the first collation whose test attribute evaluates to true for both values is used to compare the values. If there is no such collation, the values are deemed incomparable.
collate |= simple-collation
collate |= complex-collation
A simple collation compares two values by comparing the values given by the binding for the collation. These values are compared according to the type specified either by the type attribute or the uri attribute; any parameters specified are passed into the relevant collation in order to carry out the comparison.
If no binding is specified, it defaults to using the value itself (the same as select="$this").
If no type or URI is specified, it defaults to using the codepoint collation (the same as uri="http://www.jenitennison.com/datatypes/collations/codepoint").
If order="descending" is specified, the comparison is reversed (if value A is less than value B, the result of the collation is that value A is greater than value B).
simple-collation = element collate {
  test?,
  extension-attribute*,
  (type | uri)?,
  order?,
  param*,
  collation-binding?
}
order = attribute order { "ascending" | "descending" }
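As a rough illustration of a simple collation, the following Python sketch compares the values produced by a binding and reverses the result for descending order. The function name simple_compare and its parameters are hypothetical. The default key models the default binding (the value itself), and Python's built-in string ordering, which is by code point, stands in for the default codepoint collation.

```python
def simple_compare(a, b, key=lambda value: value, descending=False):
    """Compare two values; return -1, 0 or 1.

    key models the binding for the collation; descending models
    order="descending".
    """
    ka, kb = key(a), key(b)
    result = (ka > kb) - (ka < kb)
    return -result if descending else result
```

For instance, simple_compare("10", "9", key=int) orders the values numerically, whereas the default arguments order the same strings by code point.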
A complex collation performs multiple simple collations in sequence. If value A is equal to value B according to the first simple collation, then the two values are compared using the second simple collation. If they are still equal, the third simple collation is used to compare them and so on. Only if they are equal for all the collations are they equal overall.
complex-collation = element collate {
  test?,
  extension-attribute*,
  param*,
  simple-collation, simple-collation+
}
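The sequential behaviour of a complex collation can be sketched as follows. The names by_key and complex_compare are hypothetical, and the dictionaries with "year" and "month" entries are invented purely to stand in for typed values with properties.

```python
def by_key(key):
    """Build a simple comparison function that returns -1, 0 or 1."""
    def compare(a, b):
        ka, kb = key(a), key(b)
        return (ka > kb) - (ka < kb)
    return compare


def complex_compare(a, b, comparators):
    """Apply each simple comparison in turn until one finds a difference."""
    for compare in comparators:
        result = compare(a, b)
        if result != 0:
            return result
    return 0  # equal under every simple collation, so equal overall


# Illustrative only: compare by year first, then by month.
by_year_then_month = [
    by_key(lambda d: d["year"]),
    by_key(lambda d: d["month"]),
]
```

Later comparisons are consulted only when all earlier ones report equality, which is why the order of the nested simple collations matters.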
If a single binding is specified, it is evaluated and the resulting value used to compare the two values.
collation-binding |= binding
If minimum and maximum bindings are specified, both are evaluated. A value A is greater than a value B if A's minimum is greater than B's maximum. A value A is less than a value B if A's maximum is less than B's minimum. A value A is equal to a value B if A's minimum, A's maximum, B's minimum and B's maximum are all equal (i.e. each value's minimum equals its own maximum, and the two values coincide). In all other cases, the order relationship is indeterminate.
If extension binding elements are used, they are paired, with the first specifying the minimum value and the second the maximum value.
collation-binding |= min-binding, max-binding,
                     (min-extension-binding-element,
                      max-extension-binding-element)*
min-binding |= literal-min-value
min-binding |= min-select
max-binding |= literal-max-value
max-binding |= max-select
literal-min-value = attribute value.min { text }
literal-max-value = attribute value.max { text }
min-select = attribute select.min { XPath }
max-select = attribute select.max { XPath }
min-extension-binding-element = extension-binding-element
max-extension-binding-element = extension-binding-element
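The minimum/maximum comparison rules can be expressed as a small Python function. This is a sketch under the assumption that the evaluated minimum and maximum values are directly comparable; the name interval_compare is hypothetical, and None is used here to represent an indeterminate ordering.

```python
def interval_compare(a_min, a_max, b_min, b_max):
    """Return 1, -1 or 0, or None when the ordering is indeterminate."""
    if a_min > b_max:
        return 1   # A lies wholly above B
    if a_max < b_min:
        return -1  # A lies wholly below B
    if a_min == a_max == b_min == b_max:
        return 0   # both values collapse to the same single point
    return None    # the ranges overlap: indeterminate
```

Note that two identical ranges with different minimum and maximum, such as (1, 3) and (1, 3), are indeterminate rather than equal under these rules.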
There are three built-in collation URIs: one for the Unicode codepoint collation, one for numerical ordering, and one for locale-aware string comparisons.
The uri attribute specifies a collation that should be used to compare the values.
uri = attribute uri { collation-uri }
collation-uri |= built-in-collation-uri
collation-uri |= extension-collation-uri
built-in-collation-uri |= codepoint-collation-uri
built-in-collation-uri |= number-collation-uri
built-in-collation-uri |= string-collation-uri
The codepoint collation uses the codepoints of the characters in the values to order the values.
codepoint-collation-uri = xs:anyURI "http://www.jenitennison.com/datatypes/collations/codepoint"
The number collation compares the values numerically. The order is the same as that specified for xs:double in XPath 2.0.
number-collation-uri = xs:anyURI "http://www.jenitennison.com/datatypes/collations/number"
The string collation compares the values as strings. It takes a number of parameters to specify how the strings should be collated, namely:
lang
specifies the language used for comparing the strings; this must be a legal language specifier as in xs:language.
case
specifies the case ordering. The legal values are "upper-first", "lower-first" or "ignore". "ignore" means that tertiary differences are ignored (as in strength="secondary"). The default is language-dependent.
strength
specifies the collation strength. The legal values are "primary", "secondary", "tertiary" or "identical". The default is "identical". Setting it to "primary" or "secondary" ignores case differences.
Example 14. Customising the String Collation
For example, to create a collation for strings that are in English and should be compared case-insensitively, use:
<collate uri="http://www.jenitennison.com/datatypes/collations/string">
  <param name="lang" value="en" />
  <param name="case" value="ignore" />
</collate>
string-collation-uri = xs:anyURI "http://www.jenitennison.com/datatypes/collations/string"
Implementations can define their own collation URIs, including parameterised URIs in order to support language-sensitive collations. Implementations are encouraged to provide users with ways of specifying collations, perhaps using extension top-level elements.
extension-collation-uri = xs:anyURI - built-in-collation-uri
We have several possible choices about what variant of XPath to accept:
• XPath 1.0
• XPath 2.0
• a restricted version of XPath 2.0
• control version via xpath-version attribute
• implementation-defined
Whichever we use, implementations will still be able to support more via extension binding elements. I think, therefore, that the last two options aren't necessary.
The useful things in XPath 2.0 are its support for if expressions and for sequences of atomic values; there's also a lot that's in excess of what's required. Subsetting just makes it harder to get conformant processors and for users to remember which bits are in and which bits are out. I'm inclined to stick to XPath 1.0 for now.
Variable, property and parameter values are available within an XPath expression if the variable, property or parameter is declared prior to the XPath expression. If the property or parameter is declared within a supertype, the reference to that supertype must come before the XPath expression in which the property or parameter is used.
The comparisons =, !=, >, >=, < and <= follow XPath 1.0 rules. To carry out comparisons between typed values that are based on the collation for the value's type, use the following extension functions, each of which takes two arguments and returns true or false:
• dt:eq(), dt:neq(), dt:lt(), dt:le(), dt:gt(), dt:ge()
An empty node-set is returned if the values aren't comparable (do not share the same collation). In most cases, this will be treated the same way as false.
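The difference between the XPath 1.0 operators and the collation-based functions can be illustrated with a rough Python model. In XPath 1.0, < converts both operands to numbers, so two date strings both become NaN and the comparison is false. All names below are hypothetical: xpath_number only approximates the XPath 1.0 number() conversion, dt_lt models dt:lt() with the collation passed in as a function, and None stands in for the empty node-set.

```python
import math
import re

# Assumption: approximates the XPath 1.0 number() lexical form
# (optional minus sign, digits, optional fraction, surrounding whitespace).
_XPATH_NUMBER = re.compile(r"\s*-?(\d+(\.\d*)?|\.\d+)\s*")


def xpath_number(text):
    """Convert a string roughly as number() does; non-numbers become NaN."""
    return float(text) if _XPATH_NUMBER.fullmatch(text) else math.nan


def xpath1_lt(a, b):
    """Model of the XPath 1.0 '<' operator applied to two strings."""
    return xpath_number(a) < xpath_number(b)


def dt_lt(a, b, collation):
    """Model of dt:lt(); collation returns -1, 0 or 1, or is None
    when the two values do not share a collation."""
    if collation is None:
        return None  # stands in for the empty node-set
    return collation(a, b) < 0


def codepoint(a, b):
    """Codepoint comparison; Python orders strings by code point."""
    return (a > b) - (a < b)
```

In this model, xpath1_lt("2004-09-19", "2004-11-01") is false because neither string is a number, while dt_lt with the codepoint comparison orders the two strings as expected. Treating None as false mirrors the way an empty node-set converts to false in a boolean context.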
Within a datatype library, each concrete datatype has a corresponding extension function named after the name of the datatype. This function takes a single argument, which is a string, and returns a typed value based on that string. Note that this works for all datatypes, including lists and unions. A type error is raised if the string does not meet the constraints for that datatype.
Other extension functions are:
dt:comparable(value, value)
returns true if the values are comparable, false otherwise
dt:item(list-value, number)
returns the item in the list-value at the index given by the number (counting starts from 1); returns an empty string if the number is greater than the number of items in the list-value. Values that aren't of a list type are treated like list-type values with a single item.
dt:property(value, prop-name)
returns the value of the named property
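The indexing behaviour described for dt:item() can be sketched in Python as follows. The function name item is hypothetical, Python lists stand in for list-type values, and the handling of indexes below 1 is an assumption, since the text above does not define it.

```python
def item(value, index):
    """Model of dt:item(): 1-based lookup into a list value."""
    # A value that is not of a list type is treated as a one-item list.
    items = value if isinstance(value, list) else [value]
    if index < 1:
        # Assumption: the behaviour for indexes below 1 is not specified.
        raise ValueError("index must be 1 or greater")
    if index > len(items):
        return ""  # past the end of the list: empty string
    return items[index - 1]
```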
An XPath 1.0 expression
XPath = text
An XPath 2.0 regular expression
regular-expression = text
An XPath 2.0 regular expression with named subexpressions. Named subexpressions are specified with the syntax (?[name]regex) where name is the name of the subexpression and regex is the subexpression itself.
Example 15. Extended Regular Expression
(?[year]-?[0-9]{4})-(?[month][0-9]{2})-(?[day][0-9]{2})
extended-regular-expression = text
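One way to experiment with the named-subexpression syntax is to translate it into the named groups of an existing regular expression engine. The Python sketch below rewrites (?[name]regex) as (?P<name>regex). It is a naive illustration only: the function names are hypothetical, names are assumed to be valid Python identifiers (a narrower set than NCName), escaped parentheses are not handled, and Python's regular expression dialect differs from that of XPath 2.0.

```python
import re

# Matches the opening of a named subexpression, e.g. "(?[year]".
# Assumption: names are restricted to Python identifiers.
_NAMED_OPEN = re.compile(r"\(\?\[([A-Za-z_][A-Za-z0-9_]*)\]")


def to_python_regex(extended):
    """Rewrite (?[name]...) as Python's (?P<name>...)."""
    return _NAMED_OPEN.sub(r"(?P<\1>", extended)


def parse(extended, text):
    """Return a dict of named subexpression matches, or None if no match."""
    match = re.fullmatch(to_python_regex(extended), text)
    return match.groupdict() if match else None


iso_date = "(?[year]-?[0-9]{4})-(?[month][0-9]{2})-(?[day][0-9]{2})"
```

Calling parse(iso_date, "2004-09-19") yields the year, month and day substrings keyed by name, which is roughly the information that a <parse> element makes available to later property bindings.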
name = attribute name { xs:NCName }
dt-name = attribute dt:name { xs:NCName }
ns = attribute ns { xs:anyURI }
extension-element = element * - dt:* { anything }
extension-attribute = attribute * - local:* { text }
anything = attribute * { text }*,
           mixed { element * { anything }* }