Sustainability of Digital Formats: Planning for Library of Congress Collections


SIARD (Software Independent Archiving of Relational Databases) Version 1.0


Identification and description

Full name SIARD (Software Independent Archiving of Relational Databases) Version 1.0
Description

An open format developed by the Swiss Federal Archives, designed for archiving relational databases in a vendor-neutral form. A SIARD archive is a ZIP-based package of files based on XML and SQL:1999. A SIARD file incorporates not only the database content, but also machine-processable structural metadata that records the structure of database tables and their relationships. The ZIP file contains an XML file describing the database structure (metadata.xml) as well as a collection of XML files, one per table, capturing the table content. The SIARD archive may also contain text files and binary files representing database large objects (BLOBs and CLOBs). SIARD permits direct access to individual tables by exploring with ZIP tools. A SIARD archive is not an operational database but supports re-integration of the archived database into another relational database management system (RDBMS) that supports SQL:1999. In addition, SIARD supports the addition of descriptive and contextual metadata that is not recorded in the database itself and the embedding of documentation files in the archive.

A relational database archived in the SIARD format consists of two components: the metadata, in a folder tree with root header which documents the structure of the archived database, and the table data in a folder tree with root content. The structure of a typical content folder is:

  • content/
  • -- schema0/
  • ---- table0/
  • ------ table0.xml
  • ------ table0.xsd
  • ---- table1/
  • ...
  • ---- table2/
  • ...

Notice that, for each table, the SIARD format requires a .xsd file defining the number of columns and the datatype for each column. This XML Schema document is typically derived automatically as a SIARD archive is created from the database. The table data is in the corresponding .xml file. The SIARD metadata is stored in a file called metadata.xml in the header folder along with the corresponding metadata.xsd file. The structure of the metadata.xml file must match the structure of the content file, e.g., have a corresponding number of tables and number of columns in each table.
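The folder layout described above can be checked with any ZIP library. The following is a minimal sketch using Python's standard zipfile module; the function names list_tables and tables_missing_schema are illustrative and are not part of any SIARD tool, and the demo archive holds placeholder file contents rather than real SIARD data.

```python
import io
import zipfile
from collections import defaultdict


def list_tables(archive):
    """Map each 'schema/table' folder under content/ to the files it holds."""
    tables = defaultdict(list)
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            parts = name.split("/")
            # Expected shape: content/<schema>/<table>/<file>
            if len(parts) == 4 and parts[0] == "content" and parts[3]:
                tables[f"{parts[1]}/{parts[2]}"].append(parts[3])
    return {key: sorted(files) for key, files in tables.items()}


def tables_missing_schema(archive):
    """Return table folders that lack either the .xml or the .xsd file."""
    missing = []
    for key, files in list_tables(archive).items():
        table = key.split("/")[1]
        if f"{table}.xml" not in files or f"{table}.xsd" not in files:
            missing.append(key)
    return sorted(missing)


# Demo: an in-memory archive with placeholder contents (not real SIARD data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("header/metadata.xml", "placeholder")
    zf.writestr("content/schema0/table0/table0.xml", "placeholder")
    zf.writestr("content/schema0/table0/table0.xsd", "placeholder")
    zf.writestr("content/schema0/table1/table1.xml", "placeholder")

print(list_tables(buffer))
print(tables_missing_schema(buffer))
```

In the demo, table1 is reported because its .xsd file is absent. A path to a .siard file can be passed in place of the in-memory buffer.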

SIARD 1.0 permits use of the 64-bit extension of the ZIP format (ZIP64), introduced by PKWARE Inc. with version 4.5 of the ZIP specification, in order to avoid the 4 GB size limitation. ZIP64 is not as widely supported as ZIP32. Although compression as supported by ZIP was originally not permitted, use of the "deflate" algorithm (the default compression option for ZIP) is supported in the SIARD software and was announced as a planned change in a February 2015 presentation on SIARD from the Swiss Federal Archives.
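Whether the entries in a given archive are stored uncompressed or use "deflate" can be read from the ZIP directory. A minimal sketch with Python's standard zipfile module follows; the function name compression_methods is illustrative, and the demo archive is a placeholder built in memory, not a real SIARD file.

```python
import io
import zipfile
from collections import Counter

# Labels for the two compression methods discussed above.
METHOD_NAMES = {zipfile.ZIP_STORED: "stored", zipfile.ZIP_DEFLATED: "deflate"}


def compression_methods(archive):
    """Count archive entries by ZIP compression method."""
    counts = Counter()
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            label = METHOD_NAMES.get(info.compress_type, f"other ({info.compress_type})")
            counts[label] += 1
    return dict(counts)


# Demo: one stored entry and one deflated entry (placeholder contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", allowZip64=True) as zf:
    zf.writestr("header/metadata.xml", "placeholder",
                compress_type=zipfile.ZIP_STORED)
    zf.writestr("content/schema0/table0/table0.xml", "placeholder " * 50,
                compress_type=zipfile.ZIP_DEFLATED)

print(compression_methods(buffer))
```

An archive that follows the original no-compression rule would report only "stored" entries.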

Production phase Designed specifically as a non-operational software independent archival format for relational databases.
Relationship to other formats
    Subtype of ZIP_6_3_3, ZIP File Format, Version 6.3.3 (PKWARE). The SIARD 1.0 specification refers to version 6.3.2 of the ZIP specification (APPNOTE.TXT), but 6.3.2 and 6.3.3 are technically identical, with the update consisting only of "formatting changes to support easier referencing of this APPNOTE from other documents and standards."
    Defined via XML_Schema_1_0, W3C XML Schema 1.0

Local use

LC experience or existing holdings  
LC preference  

Sustainability factors

Disclosure

Non-proprietary, openly documented standard developed by the Swiss Federal Archives, starting in the early 2000s. Adopted and republished as a Swiss e-Government standard (CH-0165) in 2013.

    Documentation

The SIARD 1.0 specification is available from the site for Swiss E-Government standards (eCH). The original is in German. A menu with versions of the specification in German, French, and English is at https://www.ech.ch/vechweb/page?p=dossier&documentNumber=eCH-0165&documentVersion=1.0. The menu includes associated documents, such as the schema for the SIARD metadata.xml file and requests for changes.

Archiving Tools: SIARD Suite, from the Swiss Federal Archives, also provides access to the format specification in English.

The published SIARD 1.0 specification states that SIARD stands for "Software Independent Archival of Relational Databases" but the Swiss Federal Archives website and recent presentations use "Archiving" rather than "Archival." The compilers of this resource have chosen to use "Archiving" in the title for this format description.

Adoption

In 2008, SIARD_1_0 was adopted by the PLANETS project as the recommended format for the preservation of relational databases. In 2013, the format was adopted as a Swiss E-Government standard. According to a draft list of formats adopted by various European archives as of early 2012, SIARD (as-is or adapted) was used in national archives in Denmark, the Netherlands, and Germany, in addition to Switzerland. Relational databases are a very important content category for national archives, but less common in the collections of libraries.

A reference suite of software is available for downloading with a free license from the Swiss Federal Archives. The source for this software is not open, but a project under E-ARK (European Archival Records and Knowledge Preservation) is aiming at an enhanced format and a fully open-source suite of software. The updated format was originally named SIARD-E, but a public draft was issued in July 2015 as SIARD 2.0. According to E-ARK Project Update of March 2015, the enhanced format will be based on the best practices from the existing SIARD, SIARDDK and DBML formats. DBML was developed for archiving databases at the National Archives of Portugal (Direcção Geral de Arquivos). SIARDDK (often written SIARDK) is a variant of SIARD 1.0 developed by the Danish National Archives, intended as a full SIP (Submission Information Package) that packages related documentation with the SIARD database object. See Notes below for more information on DBML and SIARDDK.

    Licensing and patents No license is required to use the format or implement software to create, render, or manipulate files in the SIARD format.
Transparency

SIARD was designed to be transparent, representing relational database content as a collection of easily interpreted XML-based files in straightforward schemas with element names that reflect the database structure. These files are human-readable and easily processed with widely available XML tools. Queries and other SQL-based commands that are stored in the SIARD file are represented in UTF-8 (Unicode). The files are packaged in a constrained form of the ZIP format that is supported by applications that can extract files from a ZIP32 or ZIP64 archive file. However, many ZIP tools, including those distributed with Mac OS and Windows, do not support ZIP64.

Since relational database content may include Binary Large Objects (BLOBs), SIARD archives may include files that are not transparent.

Self-documentation

The SIARD format includes a schema for representing the structural metadata for the original relational database. If semantic names for tables, columns, etc. were present in the original database, they can be captured automatically. However, some elements are likely to need manual entry. The schema also includes elements that can document the archiving process. See Notes below for more details on the metadata schema, which can be downloaded from the eCH-0165 specification page.

External dependencies None
Technical protection considerations None. Encryption and password protection are not permitted in a SIARD archive file.
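The ZIP format marks an encrypted entry by setting bit 0 of its general purpose bit flag, so the rule against encryption can be spot-checked from the ZIP directory alone. The sketch below uses Python's standard zipfile module; encrypted_entries is an illustrative name, and the demo sets the flag by hand on a ZipInfo object because the standard library cannot write encrypted entries.

```python
import zipfile

ENCRYPTED_FLAG = 0x1  # bit 0 of the ZIP general purpose bit flag


def encrypted_entries(infos):
    """Return the names of ZIP entries whose encryption flag is set."""
    return [info.filename for info in infos if info.flag_bits & ENCRYPTED_FLAG]


# Demo: two directory records, one with the encryption flag set by hand.
plain = zipfile.ZipInfo("content/schema0/table0/table0.xml")
locked = zipfile.ZipInfo("content/schema0/table1/table1.xml")
locked.flag_bits |= ENCRYPTED_FLAG

print(encrypted_entries([plain, locked]))
```

For a file on disk, the list returned by zipfile.ZipFile(path).infolist() can be passed in; an empty result is what the rule above would lead one to expect.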

Quality and functionality factors

Dataset
Normal functionality

SIARD_1_0 is designed as a non-operational snapshot of a relational database consistent with the SQL:1999 standard for the Structured Query Language. Subclause 4.1 of Part 2 of the SQL:1999 specification specifies the data types defined in SQL:1999. These include a variety of types for character data, numeric data, and date/time values. Requirement P_4.3-3 of the SIARD specification indicates the XML datatypes that should be used for each data type supported in SQL:1999. This section indicates that three time-related types in SQL:1999 are not supported in SIARD 1.0: TIME WITH TIME ZONE; TIMESTAMP WITH TIME ZONE; and INTERVAL. SIARD 1.0 does define how to handle Binary Large Objects (BLOBs) and Character Large Objects (CLOBs).
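Before archiving, it can be useful to flag columns whose declared type falls into one of the three unsupported categories named above. The following is a minimal sketch; the helper names and the normalization rules (upper-casing and dropping precision arguments such as "(6)") are assumptions for illustration and are not taken from the SIARD specification.

```python
import re

# The two exact type names reported as unsupported in SIARD 1.0;
# INTERVAL is matched by its leading keyword because it has many variants.
UNSUPPORTED_EXACT = ("TIME WITH TIME ZONE", "TIMESTAMP WITH TIME ZONE")


def normalize(sql_type):
    """Upper-case a type name, drop '(precision)' parts, collapse whitespace."""
    without_precision = re.sub(r"\([^)]*\)", " ", sql_type.upper())
    return " ".join(without_precision.split())


def is_unsupported(sql_type):
    """True if the type is one of the three SQL:1999 types SIARD 1.0 omits."""
    name = normalize(sql_type)
    return name in UNSUPPORTED_EXACT or name.split(" ")[0] == "INTERVAL"


for declared in ("timestamp(6) with time zone", "INTERVAL DAY(2) TO SECOND(6)",
                 "TIMESTAMP", "VARCHAR(255)"):
    print(declared, "->", is_unsupported(declared))
```

Real database catalogs may spell these types differently (for example with vendor-specific aliases), so a production check would need a mapping per source RDBMS.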

The SIARD 1.0 schema for metadata.xml does not enumerate the strings allowed for naming data types in the metadata for columns but merely constrains the form of permissible strings to letters and numbers. RFC 2015-13 is a request to specify data type mappings in more detail in the SIARD specification and metadata schema.

Support for software interfaces (APIs, etc.)

A SIARD 1.0 archive file can be unpacked using tools that can handle ZIP64 files. Apart from BLOBs, individual component files are XML-based or text in UTF-8 and can be read using widely available viewers for text and/or XML.
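As an illustration of reading table content with a generic XML library, the sketch below parses a small table document with Python's xml.etree.ElementTree. It assumes that each record is a row element whose children (named c1, c2, and so on here) hold the column values; the sample document and its namespace URI are made up for the example, so the actual element names should be confirmed against the table's .xsd file.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree adds to a tag."""
    return tag.rsplit("}", 1)[-1]


def read_rows(xml_source):
    """Return one dict per row element, keyed by the cell element names."""
    root = ET.fromstring(xml_source)
    rows = []
    for row in root:
        if local_name(row.tag) != "row":
            continue  # skip anything that is not a row element
        rows.append({local_name(cell.tag): cell.text for cell in row})
    return rows


# Hypothetical sample; element names and namespace are assumptions.
SAMPLE = """<table xmlns="urn:example:table0">
  <row><c1>1</c1><c2>Alice</c2></row>
  <row><c1>2</c1><c2>Bob</c2></row>
</table>"""

for record in read_rows(SAMPLE):
    print(record)
```

This loads the whole document into memory; for very large tables, a streaming parser such as ET.iterparse would be a better fit.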

A SIARD archive file is designed to permit import into an RDBMS that supports SQL:1999 using an API based on ODBC.

Data documentation (quality, provenance, etc.) The SIARD 1.0 metadata schema includes metadata elements designed to identify a source RDBMS and document important aspects of the conversion into SIARD. See Notes below for more on the provenance-related metadata elements in metadata.xsd. The SIARD specification states that a SIARD archive is intended to be an object that will be embedded in a package that can incorporate more contextual documentation.
Beyond normal functionality

The SIARD_1_0 format is used by the SIARD 1.0 suite of software, which can both extract the database structure and content into a SIARD_1_0 archive file and regenerate a functioning database from a SIARD_1_0 file. The relational database management systems supported by the SIARD software suite (according to the SIARD Suite website in April 2017) are:

  • Oracle
  • Microsoft SQL Server
  • MySQL
  • IBM DB2
  • Microsoft Access

File type signifiers and format identifiers

Tag | Value | Note
Filename extension | siard | From requirement G_4.1-4 in SIARD 1.0 specification.
PRONOM PUID | fmt/161 | See https://www.nationalarchives.gov.uk/PRONOM/fmt/161.
Wikidata Title ID | Q27861463 | See https://www.wikidata.org/wiki/Q27861463.

Notes

General

Metadata: The pre-defined metadata schema for the mandatory metadata.xml file has the following elements that apply to the SIARD archive as a whole, including provenance details for the archiving process:

  • dbname -- name of the archived database [mandatory]
  • description -- short free form description of the database content
  • archiver -- name of person responsible for archiving the database
  • archiverContact -- contact data (telephone number or email address) of archiver
  • dataOwner -- name of data owner (section and institution responsible for data) of database when it was archived [mandatory]
  • dataOriginTimespan -- time span during which data were entered into the database [mandatory]
  • producerApplication -- name and version of program that generated the metadata file
  • archivalDate -- date of creation of archive [mandatory]
  • messageDigest -- message digest code over all primary data in folder "content" [mandatory]
  • clientMachine -- DNS name of client machine from which SIARD was running for archiving
  • databaseProduct -- name of database product and version from which database originates
  • connection -- connection string used for archiving
  • databaseUser -- database user used for archiving
  • schemas -- list of schemas in database [mandatory]
  • users -- list of users in the archived database [mandatory]
  • roles -- list of roles in the archived database
  • privileges -- list of privileges in the archived database
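The elements marked [mandatory] in the list above lend themselves to a simple presence check. The sketch below assumes that these elements appear as direct children of the root element of metadata.xml; the root element name and namespace in the sample are hypothetical, and the function name missing_mandatory is illustrative.

```python
import xml.etree.ElementTree as ET

# Elements marked [mandatory] in the list above.
MANDATORY = ("dbname", "dataOwner", "dataOriginTimespan", "archivalDate",
             "messageDigest", "schemas", "users")


def missing_mandatory(metadata_xml):
    """Return the mandatory element names absent from the document root."""
    root = ET.fromstring(metadata_xml)
    # Ignore namespaces by keeping only the local part of each child tag.
    present = {child.tag.rsplit("}", 1)[-1] for child in root}
    return [name for name in MANDATORY if name not in present]


# Hypothetical sample; the root name and namespace are assumptions.
SAMPLE = """<siardArchive xmlns="urn:example:metadata">
  <dbname>orders</dbname>
  <dataOwner>Example Department</dataOwner>
  <archivalDate>2015-07-01</archivalDate>
  <schemas/>
</siardArchive>"""

print(missing_mandatory(SAMPLE))
```

A presence check like this is weaker than validating against metadata.xsd, which also constrains element order and content.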

Key structural metadata is also held in the metadata.xml file. For each table, this includes the name of the table in the source database; its location in the archive folder hierarchy (see layout in the main format description above); an optional free text description; a list of columns each with name and data type; a count of rows; identification of the column serving as primary key; and other optional technical details. The metadata.xml file also holds structural details for other database features supported in SQL:1999, such as views, routines, constraints, triggers, and additional keys.

For each table, the SIARD archive requires a schema (e.g., table0.xsd) derived from the structure of a table in the source database (and documented in the metadata.xml file in the SIARD archive) and mappings from source data types to W3C XML Schema (XSD) data types. The SIARD 1.0 specification uses XML 1.1 as a normative reference, but is not specific about which version of XSD should be employed, possibly because XSD 1.1 was not approved by W3C until 2012. The XSD data types used in the mappings in P_4.3-3 of the SIARD 1.0 specification are all defined in both XSD 1.0 and XSD 1.1. The SIARD archive also requires an instance (e.g., table0.xml) that holds the data value content for table0 conforming to the corresponding schema (in this case, table0.xsd).

Proposed changes to SIARD: Proposals for changes to the SIARD specification are documented at eCH-0165: SIARD-Formatspezifikation as RFCs. RFC 2015-30 indicates that there is a plan to submit SIARD for standardization by ISO. In a February 2015 presentation, Save your databases using SIARD!, the Swiss Federal Archives noted particularly that planned changes included: data compression using 'deflate'; splitting of the ZIP archive for very large files; and extension to full SQL:1999 support, including user-defined data types.

History

Related formats and future joint developments: The E-ARK project plans to develop a specification of an E-ARK archival relational database format based on the best practices from the existing SIARD, SIARDDK and DBML formats. Early announcements used the name "SIARD-E", but as work continued, the name "SIARD 2.0" emerged as preferred. Feedback on a draft of the SIARD 2.0 specification was requested by September 30, 2015.

The DBML format was developed as part of the RODA project at the National Archives of Portugal and archiving in DBML was part of the RODA Database Preservation Toolkit. The RODA repository is now an open-source product of the company KEEPSolutions, a spin-off of the University of Minho formed by the developers. The Database Preservation Toolkit is now a separate project. Support for SIARD 1.0 was added rather than producing a new version of DBML with fuller capture of database structure and contents. KEEPSolutions is an active partner in the E-ARK project.

SIARDDK is a variant of SIARD adopted by the Danish National Archives. The SIARD 1.0 specification assumes that the SIARD archive is a digital object that will be packaged inside an OAIS SIP (Submission Information Package) or AIP (Archival Information Package) with related documentation. The SIARDDK variant is designed as a SIP, using the table structures as in SIARD, but also defining a complete structure for incorporating contextual documentation and package metadata. In addition, SIARDDK uses the usual 32-bit ZIP format rather than ZIP64 and allows only one SQL schema per package. The relationship between SIARD 1.0 and SIARDDK is documented at https://github.com/eark-project/siard-e-format/tree/master/SIARDDK.

RFC 2014-105 requests that the SIARD specification be modified to permit use of "deflate" compression rather than requiring SIARD files to be uncompressed in the ZIP wrapper. The RFC indicates that SIARD software already supports this compression technique. In early June 2015, the eCH website indicated that this was approved as an amendment to SIARD in March 2015; however, in mid-July 2015, this amendment could not be retrieved.



Last Updated: Monday, 17-Apr-2017 13:08:50 EDT