Analysis
The analysis below refers to the requirements listed in Problem, Requirements, and Basic Approach, so the requirements are repeated here.
Requirements Repeated
1. The "basic" option in ISO 8601, i.e. YYYYMMDD without hyphens,
and HHMMSS without colons.
2. BC dates.
3. Time zones.
4. Year and month only (no day of month), or year only.
5. Questionable dates. E.g. 1992? would mean "possibly" the year
1992, but not "definitely".
6. Approximate dates. E.g. 1992~ would mean "approximately" the
year 1992.
7. Uncertain dates. E.g. 199? would mean some year in the 1990s, but not
certain which year; 1999-?? would mean some month in 1999; 199901?? would
mean some day in the month of 1999-01.
8. Date range (start and end).
9. End date "open" in a date range.
10. Start and/or end date "unknown" in a date range.
Analysis of Requirements
Requirement (1) is to allow raw data to be extracted and put into XML without
conversion. It is not intended to subject all data to this requirement, only
data that is typically encountered in database records and
needs to be converted to XML (data that conforms to ISO 8601 but
omits hyphens and colons). For
data where conversion is not an issue, and which is handled by xs:date or
xs:dateTime, these two built-in types are preferred, because they provide
built-in validation.
For example, BC dates (2) are not often found in bibliographic metadata;
they are mostly original data, so conversion is not a major issue,
and representing them with hyphens is acceptable. The same holds for time
zones (3). BC dates and time zones are handled by xs:date and xs:dateTime. ("-2004-01-01" is a valid xs:date. "2004-01-01T10:10:10Z", indicating the UTC time zone, is a valid xs:dateTime.) So neither needs any special schema logic; they can be entered
as xs:date and xs:dateTime.
Year-and-month-only or year-only dates
(4), although supported by ISO 8601, are not supported
by xs:date. In fact, their use in conjunction
with the additional requirements - questionable (5), approximate (6), or
uncertain (7) dates - is not even supported by ISO 8601.
Similarly,
ISO 8601 supports ranges (8) though xs:date does not, and ISO 8601
does not support ranges in conjunction with the additional requirements "open" (9)
and
"unknown" (10).
None of requirements (4) through (7) applies
to dateTime; they apply only to date. So a pattern (a regular expression) developed for
date requirements will have very different features from one developed for
date/time, and the special requirements for range impose yet additional features. So (at least) three patterns, one (or more) for date, one (or more) for
date/time, and one (or more) for range, are proposed.
The right number of patterns is of course a subjective design decision:
one large complex pattern vs. several smaller, simpler
patterns. Too few patterns increase the complexity of each individual
pattern; too many patterns increase overall complexity. In any case, simplicity
comes at the expense of validation power.
Two patterns for date, one for date/time, and
two for range are shown below. These patterns together with xs:date and xs:dateTime are combined as a union (via xs:union); any string validates if it conforms
to one of these types.
Patterns
Patterns for Date
- <xs:pattern value="\d{2}(\d{2}|\?\?|\d(\d|\?))(-(\d{2}|\?\?))?~?\??"/>
year (yyyy) or year-month (yyyy-mm), where
the last or last two digits of year may be '?' (199? means some year
from 1990 to 1999; 19?? means some year
from 1900 to 1999), or month may be '??' (2004-?? means "some
month in 2004"), and the entire string may end with '?' for "questionable" or '~' for "approximate" (the pattern also accepts '~?').
- <xs:pattern value="\d{6}(\d{2}|\?\?)~?\??"/>
yearMonthDay - yyyymmdd, where 'day' may be
'??' so '200412??' means "some day during the month of 12/2004".
The whole string may be followed by '?' or '~'.
Hyphens are not allowed for this pattern. Year-month-day
with hyphens will validate via xs:date. (It seems unnecessary to support
year-month-day with hyphens along with the additional requirements;
for year-month-day with the additional requirements the non-hyphen form
should suffice.)
The key issue with dates is hyphens.
ISO 8601 requires a hyphen for year-month (with no day); hyphens are optional
for year/month/day. The fact that ISO 8601 requires hyphens for
year-month does not present a big problem with regard to requirement
(1), because most if not all of the date data of concern (data for conversion)
is of the form year/month/day. To be as compatible with ISO 8601
as possible, all dates of the form year-month
include the hyphen. For dates where the day is included, the form
with no hyphens is accommodated by the second pattern; the form with hyphens
is supported via xs:date, so no special schema logic is needed.
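The behavior of the two date patterns can be checked with a small Python sketch. This is only an approximation: Python's `re` module stands in for the XSD regular-expression engine, `re.fullmatch` is used to mimic the whole-string matching of xs:pattern, and the names `YEAR_OR_YEAR_MONTH`, `YEAR_MONTH_DAY`, and `matches_date` are illustrative, not part of the schema.

```python
import re

# The two date patterns, copied verbatim from the text above.
# Assumption: Python's `re` approximates the XSD regex engine, and
# re.fullmatch approximates xs:pattern's whole-string matching.
YEAR_OR_YEAR_MONTH = r"\d{2}(\d{2}|\?\?|\d(\d|\?))(-(\d{2}|\?\?))?~?\??"
YEAR_MONTH_DAY = r"\d{6}(\d{2}|\?\?)~?\??"


def matches_date(value: str) -> bool:
    """Return True if value matches either date pattern (hypothetical helper)."""
    return any(
        re.fullmatch(pattern, value) is not None
        for pattern in (YEAR_OR_YEAR_MONTH, YEAR_MONTH_DAY)
    )


samples = ["1992", "1992?", "1992~", "199?", "19??", "2004-??",
           "200412??", "200412", "2004-12-31"]
for sample in samples:
    print(sample, matches_date(sample))
```

In this sketch "200412" is rejected because year-month requires the hyphen, and "2004-12-31" is rejected because the hyphenated year-month-day form is left to xs:date rather than to these patterns.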
Pattern for Date and Time
- <xs:pattern value="\d{8}T\d{6}"/>
'yyyymmddThhmmss' (with T separator).
Hyphens in date and colons in time are not allowed for this pattern.
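As a rough illustration of what the date/time pattern accepts, the following Python sketch applies it to a few strings. Python's `re.fullmatch` is assumed to be an adequate stand-in for xs:pattern matching, and `DATE_TIME` and `matches_date_time` are hypothetical names.

```python
import re

# The date/time pattern, copied verbatim from the text above.
# Assumption: re.fullmatch approximates xs:pattern's whole-string matching.
DATE_TIME = r"\d{8}T\d{6}"


def matches_date_time(value: str) -> bool:
    """Return True if value has the form yyyymmddThhmmss (hypothetical helper)."""
    return re.fullmatch(DATE_TIME, value) is not None


print(matches_date_time("20050705T071500"))      # basic form, no separators
print(matches_date_time("2005-07-05T07:15:00"))  # extended form, left to xs:dateTime
```

The extended form with hyphens and colons does not match this pattern; in the proposed union it would validate as xs:dateTime instead.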
Patterns for Range
- <xs:pattern value="((\d{4}(-\d{2})?)|unknown)/((\d{4}(-\d{2})?)|unknown|open)"/>
For years - 'yyyy/yyyy'; for year/month
- 'yyyy-mm/yyyy-mm'. Beginning or end of range value may be 'unknown'. End
of range value may be 'open'. Both keywords are lowercase, as written in the pattern. Hyphens are mandatory when month is present.
- <xs:pattern value="(\d{4}((-)?\d{2}((-)?\d{2}(T\d{2}(:)?\d{2}((:)?\d{2}(\.\d*)?)?((Z|(\+|-)\d{2}(:)?\d{2})?))?)?)?|unknown)/(\d{4}((-)?\d{2}((-)?\d{2}(T\d{2}(:)?\d{2}((:)?\d{2}(\.\d*)?)?((Z|(\+|-)\d{2}(:)?\d{2})?))?)?)?|unknown|open)"/>
Extends support for a range by supporting a date/time range. For example:
"20050705T0715-0500/20050705T0720-0500". Hyphens in the date and colons in the time may be included or excluded. The time zone is optional. An endpoint may also stop at the year, the month, or the day, with no time.
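The two range patterns can be exercised in the same way. In this sketch the long date/time range pattern is assembled from a shared endpoint expression, which yields the same string as the pattern above; `ENDPOINT`, `SIMPLE_RANGE`, `DATE_TIME_RANGE`, and `matches_range` are illustrative names, and Python's `re.fullmatch` is assumed to approximate xs:pattern matching.

```python
import re

# Simple range pattern, copied verbatim from the text above.
SIMPLE_RANGE = r"((\d{4}(-\d{2})?)|unknown)/((\d{4}(-\d{2})?)|unknown|open)"

# One endpoint of the date/time range pattern: year, then optional month,
# day, time (seconds and fraction optional), and time zone.
ENDPOINT = (r"\d{4}((-)?\d{2}((-)?\d{2}(T\d{2}(:)?\d{2}((:)?\d{2}(\.\d*)?)?"
            r"((Z|(\+|-)\d{2}(:)?\d{2})?))?)?)?")

# Same string as the second range pattern: start may be 'unknown',
# end may be 'unknown' or 'open'.
DATE_TIME_RANGE = rf"({ENDPOINT}|unknown)/({ENDPOINT}|unknown|open)"


def matches_range(value: str) -> bool:
    """Return True if value matches either range pattern (hypothetical helper)."""
    return any(
        re.fullmatch(pattern, value) is not None
        for pattern in (SIMPLE_RANGE, DATE_TIME_RANGE)
    )


for sample in ["1990/1995", "2004-06/2004-08", "unknown/2006", "2004/open",
               "open/2004", "2004/OPEN",
               "20050705T0715-0500/20050705T0720-0500"]:
    print(sample, matches_range(sample))
```

Note that "open/2004" and "2004/OPEN" are both rejected here: 'open' is allowed only as the end of a range, and the keywords are matched case-sensitively.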
Limitation
None of these patterns provides more than rudimentary validation. They
enforce, for example, that the date has eight digits, that the time has six
digits, that only certain masking characters appear, and that the words “open” or “unknown” appear in
a range. But a pattern cannot (without excessive complexity)
validate that the month is between 1 and 12 and the day between 1 and 31, much less
that the day is consistent with the month (e.g. that if the month is
04 then the day must be 30 or less), and so on.
These are things that xs:date
does very well. So xs:date and xs:dateTime should be used whenever
none of the special features that they do not support is needed.
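The gap between pattern matching and real date validation can be seen in a short Python sketch. Here `datetime.strptime` is only a stand-in for the calendar checking that xs:date performs (it is not an XSD validator), and `passes_pattern` and `is_calendar_date` are hypothetical helper names.

```python
import re
from datetime import datetime

# The yearMonthDay pattern, copied verbatim from the text above.
YEAR_MONTH_DAY = r"\d{6}(\d{2}|\?\?)~?\??"


def passes_pattern(value: str) -> bool:
    """Pattern-only check: digit counts and masking characters."""
    return re.fullmatch(YEAR_MONTH_DAY, value) is not None


def is_calendar_date(value: str) -> bool:
    """Calendar check; strptime stands in for xs:date here (an assumption)."""
    try:
        datetime.strptime(value, "%Y%m%d")
        return True
    except ValueError:
        return False


for sample in ["20040430", "20040431"]:
    print(sample, passes_pattern(sample), is_calendar_date(sample))
```

Both strings pass the pattern, but only "20040430" is a real calendar date; April 31 slips through the pattern because the pattern checks only the shape of the string.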
Schema
The sample schema below incorporates the above patterns. For usage, see Tools and Usage.
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">
<!--
Extended Date/Time Format: edtf
This schema is an "include" file. It does not define a root, it defines a single simple type, edtfSimpleType. A schema may "include" this schema and then reference it for example as follows:
<xs:element name="dateOfBirth" type="edtfSimpleType"/>
************************* edtfSimpleType
edtfSimpleType is the union of three simple types - xs:date, xs:dateTime,
and edtfRegularExpressions. ("union" means that any string
conforming to any one of the types in the union will validate.) xs:date
and xs:dateTime are built-in W3C Schema types. edtfRegularExpressions
is a set of five regular expressions which are described below. So any
string that conforms to one of the two built-in types or any of the five
regular expressions will validate.
-->
<xs:simpleType name="edtfSimpleType">
<xs:union memberTypes="xs:date xs:dateTime edtfRegularExpressions"/>
</xs:simpleType>
<!--
******** edtf
-->
<xs:simpleType name="edtfRegularExpressions">
<xs:restriction base="xs:string">
<!--
The following pattern is for year (yyyy) or year-month (yyyy-mm).
The last or last two digits of year may be '?' meaning "one year
in that range but not sure which year", for example 199? means some
year from 1990 to 1999. Similarly month may be '??' so that 2004-?? means
"some month in 2004". And the entire string may end with '?' or '~'
for "questionable" or "approximate".
Hyphen must separate year and month.
-->
<xs:pattern value="\d{2}(\d{2}|\?\?|\d(\d|\?))(-(\d{2}|\?\?))?~?\??"/>
<!--
The following pattern is for yearMonthDay - yyyymmdd, where 'dd' may
be '??' so '200412??' means "some day during the month of 12/2004".
The whole string may be followed by '?' or '~' to mean "questionable" or "approximate".
Hyphens are not allowed for this pattern.
-->
<xs:pattern value="\d{6}(\d{2}|\?\?)~?\??"/>
<!--
The following pattern is for date and time with T separator: 'yyyymmddThhmmss'.
Hyphens in date and colons in time are not allowed for this pattern.
-->
<xs:pattern value="\d{8}T\d{6}"/>
<!--
The following pattern is for a date range in years: 'yyyy/yyyy'; or
year/month: 'yyyy-mm/yyyy-mm'. Beginning or end of range value may be 'unknown'.
End of range value may be 'open'. Both keywords are lowercase.
Hyphens are mandatory when month is present.
-->
<xs:pattern value="((\d{4}(-\d{2})?)|unknown)/((\d{4}(-\d{2})?)|unknown|open)"/>
<!-- The following pattern extends support for a range by supporting a date/time range. For example:
"20050705T0715-0500/20050705T0720-0500". Hyphens in the date and colons in the time may be included or excluded. The time zone is optional. An endpoint may also stop at the year, the month, or the day, with no time.
-->
<xs:pattern value="(\d{4}((-)?\d{2}((-)?\d{2}(T\d{2}(:)?\d{2}((:)?\d{2}(\.\d*)?)?((Z|(\+|-)\d{2}(:)?\d{2})?))?)?)?|unknown)/(\d{4}((-)?\d{2}((-)?\d{2}(T\d{2}(:)?\d{2}((:)?\d{2}(\.\d*)?)?((Z|(\+|-)\d{2}(:)?\d{2})?))?)?)?|unknown|open)"/>
<!-- -->
</xs:restriction>
</xs:simpleType>
</xs:schema>