Sustainability of Digital Formats: Planning for Library of Congress Collections


Standard Generalized Markup Language (SGML). ISO 8879:1986


Identification and description

Full name ISO 8879:1986. Information processing -- Text and office systems -- Standard Generalized Markup Language (SGML)
Description

SGML (Standard Generalized Markup Language) is an openly documented and freely implementable international standard for semantic markup of textual documents in a manner that permits the separation of the underlying content from the formatting instructions for display or printing. Published as ISO 8879:1986, the standard is maintained and periodically reviewed under the auspices of ISO/IEC JTC 1/SC 34. To quote from the standard's introduction, "SGML can be used for publishing in its broadest definition, ranging from single medium conventional publishing to multi-media data base publishing. SGML can also be used in office document processing when the benefits of human readability and interchange with publishing systems are required." SGML has been widely used, although in many contexts it has been superseded by XML and HTML, markup languages that were derived from SGML. See Notes below for more detail on the influence of SGML on the HTML and XML formats.

SGML was designed to enable the sharing of machine-readable documents across different technical environments and to support a long readable life, particularly as required for documents produced in government, law, and industry. An SGML file is encoded as plain text. As a "generalized" markup language, it incorporated the concept of a "document type definition" (DTD). For a particular document application, a company or government agency would develop an SGML DTD, declaring names and constraints for document elements and their attributes. Given a source document marked up in conformance with that DTD, SGML processing software could be used to produce human-readable output for printing or display by associating appropriate formatting instructions, perhaps using the DSSSL style sheet language. For example, a US Department of Defense (DOD) specification for technical manuals, MIL-M-38784C, issued in 1990, incorporated an SGML DTD. Its successor standard, MIL-STD-38784: Standard Practice for Technical Manuals, published in 1995, is still in force, still based on an SGML DTD after a 2018 update.
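To make the DTD idea concrete, the following is a minimal sketch in Python, not a conforming SGML parser. The "memo" document type, its element names, and the helper functions are all hypothetical and invented for illustration. The sketch only checks that every element name used in a document has been declared in the DTD; real SGML processing software also enforces content models, attribute constraints, and markup minimization rules.

```python
import re

# Hypothetical DTD for a tiny "memo" document type (illustrative only).
MEMO_DTD = """
<!ELEMENT memo    - - (title, para+)>
<!ELEMENT title   - - (#PCDATA)>
<!ELEMENT para    - - (#PCDATA)>
<!ATTLIST memo    status (draft|final) "draft">
"""

# A source document marked up with the element names declared above.
MEMO_DOC = """
<memo status="final">
<title>Quarterly report</title>
<para>First paragraph.</para>
<para>Second paragraph.</para>
</memo>
"""


def declared_elements(dtd: str) -> set:
    """Collect element names declared as <!ELEMENT name ...> (single names only)."""
    return {name.lower() for name in re.findall(r"<!ELEMENT\s+([A-Za-z][\w.-]*)", dtd)}


def used_elements(doc: str) -> set:
    """Collect element names that appear in start-tags of the document."""
    return {name.lower() for name in re.findall(r"<([A-Za-z][\w.-]*)", doc)}


def undeclared_elements(dtd: str, doc: str) -> set:
    """Return element names used in the document but never declared in the DTD."""
    return used_elements(doc) - declared_elements(dtd)


if __name__ == "__main__":
    print(undeclared_elements(MEMO_DTD, MEMO_DOC))
    print(undeclared_elements(MEMO_DTD, "<memo><note>x</note></memo>"))
```

An empty result means only that the names match. It says nothing about whether the elements appear in the order and number the content model requires.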

The introduction to the SGML specification highlights key components of the language, including an "abstract syntax" for descriptive markup of document elements and a "reference concrete syntax" that specifies the use of particular delimiter characters, etc. Although the specification suggests that other concrete syntaxes might be used, the compilers of this resource are not aware of any such variations. Comments welcome. Also defined were special delimiters for processing instructions, a general "entity reference" mechanism for referring to content outside the mainstream content of a document, and generalized support for non-text. To encourage acceptance, the authors of the SGML specification followed other design objectives: the ability to enter text and markup on "the millions of existing text entry devices"; no character set dependency; no national language bias; and markup usable by both humans and programs.
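The entity reference mechanism can be illustrated with a small Python sketch. It handles only the simplest case, internal text entities declared with a quoted literal, and assumes the reference concrete syntax delimiters "&" and ";". The entity names and helper functions are hypothetical. External entities, which point to content stored outside the document, are not modeled here.

```python
import re

# Hypothetical entity declarations, as they might appear in a DTD.
DECLARATIONS = """
<!ENTITY lc  "Library of Congress">
<!ENTITY std "ISO 8879:1986">
"""


def parse_entities(declarations: str) -> dict:
    """Map entity names to replacement text for simple quoted declarations."""
    pattern = re.compile(r'<!ENTITY\s+([A-Za-z][\w.-]*)\s+"([^"]*)"\s*>')
    return dict(pattern.findall(declarations))


def expand(text: str, entities: dict) -> str:
    """Replace each &name; reference with its declared text.

    References to undeclared names are left untouched in this sketch;
    a real SGML parser would report them as errors.
    """
    def substitute(match):
        return entities.get(match.group(1), match.group(0))

    return re.sub(r"&([A-Za-z][\w.-]*);", substitute, text)


if __name__ == "__main__":
    entities = parse_entities(DECLARATIONS)
    print(expand("SGML (&std;) is described by the &lc;.", entities))
```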

In their 1995 Introduction to SGML, Eve Maler and Jeanne El Andaloussi identified some design strengths they considered unique based on their experience with publishing and type-setting. The declarative (rather than procedural) nature of SGML markup let document producers put the same document data to multiple uses, such as delivering documents in a variety of online and paper formats. Its deliberately generic, non-proprietary design helped make documents independent of software platform and vendor and also protected documents against changes in computer hardware and software.

SGML was adopted in many publishing contexts. The Cover Pages resource, assembled by Robin Cover and now hosted as an archive by OASIS, was a widely consulted resource about SGML in the 1990s. The resource lists and describes dozens of SGML applications in various domains: General; Government, Military, and Heavy Industry; and Academic. When Robin Cover ceased adding to or updating this area of the Cover Pages, he wrote (as of July 2002), "relatively few enterprise-level projects are started as SGML applications, but many SGML applications implemented before 1999 are still running productively." See Adoption section in Sustainability Factors below for a few particular examples.

Relationship to other formats
    Has subtype XML (Extensible Markup Language). Appendix C of the XML specification states, "XML is designed to be a subset of SGML, in that every XML document should also be a conforming SGML document."
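The subset relationship runs in one direction only. As a rough illustration, the Python sketch below uses the standard library XML parser to show that markup written with SGML-style minimization (omitted end-tags, unquoted attribute values) is rejected as XML. The sample strings and the helper name are hypothetical, and whether the minimized string is valid SGML depends on a DTD and SGML declaration that are not shown here.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup: str) -> bool:
    """Return True if the standard library XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Every end-tag present and every attribute value quoted: acceptable as XML.
EXPLICIT = '<doc><p id="a">One</p><p id="b">Two</p></doc>'

# SGML-style minimization: omitted end-tags and unquoted attribute values.
MINIMIZED = "<doc><p id=a>One<p id=b>Two</doc>"

if __name__ == "__main__":
    print(is_well_formed_xml(EXPLICIT))
    print(is_well_formed_xml(MINIMIZED))
```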

Local use

LC experience or existing holdings

In the 1990s, the American Memory project at the Library of Congress developed an SGML DTD that was used for manual transcriptions of a few hundred out-of-copyright books and some materials from collections of personal or organizational papers. See American Memory DTD for Historical Documents and Text Conversion and SGML-encoding from a 1996 request for proposals. For public access, the SGML files were originally converted to HTML. More recently, the SGML has been converted to XML and PDF to support updated public access for most collections created during the American Memory project. Recent conversion of textual materials at the Library of Congress has been based on OCR from scanned images and has not employed SGML.

Another example of experience with SGML at the Library of Congress in the 1990s was participation in development, testing, and use of the first generation of the Encoded Archival Description (EAD). See Development of the Encoded Archival Description DTD and Introduction to the EAD Tag Library for Version 1.0. Library of Congress staff produced a number of finding aids in the SGML version of EAD before adopting the XML DTD for EAD v2002, and migrating the existing finding aids in 2005.

LC preference The Library of Congress Recommended Formats Statement (RFS) for textual documents includes SGML as an acceptable digital format, when a DTD accompanies the document or is accessible.

Sustainability factors

Disclosure An international standard, currently maintained and periodically reviewed under the auspices of ISO/IEC JTC 1/SC 34. The standard was originally prepared under the auspices of Technical Committee ISO/TC 97, Information processing systems.
    Documentation ISO 8879:1986. Information processing -- Text and office systems -- Standard Generalized Markup Language (SGML)
Adoption

During the 1990s, SGML was adopted for marking up the structure of documents in a wide variety of contexts, for example, for transcription of old printed or manuscript materials, in new workflows for scholarly publishing, in software manuals, for information exchange in industries such as telecommunications and railroads, and in requirements associated with contract work for the U.S. Department of Defense. See Cover Pages for lists of many such applications. Some particular applications are:

  • The US Department of Defense (DOD) had a long-term project to reduce the cost of supporting and constructing equipment used by the military. Using the acronym CALS, standing originally for Computer-aided Acquisition and Logistic Support and then for Continuous Acquisition and Life-cycle Support, the DOD developed a family of standards for digital information of various types. According to A Brief History of the Development of SGML, the SGML portion of CALS was initiated in February 1987 and the first SGML-based standard from the DOD, MIL-M-28001, "Markup Requirements and Generic Style Specification for Exchange of Text and its Presentation" was published in February 1988. This standard, last modified in 1993, is still active. The CALS standards have been widely adopted across the world and are still in use.
  • In the early 1980s, the Association of American Publishers sponsored the Electronic Manuscript Project to develop an initial SGML application for book, journal, and article creation, intended for manuscript interchange between authors and their publishers. The specification was approved as ANSI/NISO standard Z39.59 in December 1988. It was submitted for standardization through ISO and became ISO 12083: 1994. This was the beginning of widespread adoption of SGML in scholarly publishing workflows. XML has largely superseded SGML in this context, with the XML-based JATS standard being widely adopted for scholarly publishing.
  • The Text Encoding Initiative (TEI) was established in 1987 as a "cooperative undertaking of the textual research community to formulate and disseminate guidelines for the encoding and interchange of machine-readable texts intended for literary, linguistic, historical, or other textual research." See the Poughkeepsie Principles that guided the efforts. Adapting from Learn the TEI: Introducing the Guidelines by adding dates, "The original TEI language (P1 through P3 [1990-1999]) used SGML syntax. With P4 [2002], users were given a choice of using SGML or XML; with P5 [2007], SGML is no longer an option." A Gentle Introduction to SGML from the TEI Guidelines was a widely cited resource. See also TEI: History.
  • The DocBook SGML DTD was originally designed and implemented around 1991. It was developed primarily for exchange of UNIX documentation. To quote from DocBook: The Definitive Guide (Version 1.0.3, 1999), "DocBook provides a system for writing structured documents using SGML or XML. It is particularly well-suited to books and papers about computer hardware and software, though it is by no means limited to them. DocBook is an SGML document type definition (DTD)." DocBook was adopted by the open source community, where it has become a standard for creating documentation for many projects. As with the TEI DTD, the DocBook DTD was converted to a DTD consistent with both SGML and XML (versions 4.x, 2000-2016). The normative definition for DocBook 5.0, published in November 2016, is an XML-based schema expressed as RNG. There is no SGML DTD for DocBook 5.0. See the Wikipedia entry for DocBook.

During the 1990s, there were active open-source software projects to parse SGML and to use style sheets to display and print SGML documents. They included SP: An SGML System Conforming to International Standard ISO 8879 by James Clark; SGML-tools; and SGML-tools Lite. An Overview Of Required Tools for installing and configuring an SGML DocBook authoring system for Windows lists some others. However, none of these projects is still under active maintenance. See Why SGML DocBook is dead.

A description in the IBM Knowledge Center on History and Relationships of SGML, HTML and XML states that XML was invented because there were "barriers to delivering SGML over the web. These include the lack of stylesheet support, no mainstream browser support, software complexity, and obstacles to interchange of SGML data because of varying levels of SGML compliance among SGML Software Packages. Due to the lack of SGML support in mainstream Web browsers, most applications delivering SGML information over the Web convert the SGML to HTML." XML became a W3C recommendation in February 1998. Since then, many applications of SGML have migrated to XML and the number of commercial software tools supporting SGML has dropped.

A limited set of commercial SGML authoring tools exist in 2018. They include:

    Licensing and patents No concerns.
Transparency SGML files are both human-readable and machine-processable. For the contents to be understood, a well-documented DTD is needed. Human-comprehensible element tags are advantageous for transparency.
Self-documentation An SGML DTD can, and often does, define elements to be used to describe the content of a conforming document and its context.
External dependencies None beyond SGML processing software.
Technical protection considerations The SGML specification mentions no internal capabilities for encryption. However, files containing sensitive material will likely be encrypted if transferred over public networks.

Quality and functionality factors

Text
Normal rendering

Mechanisms are provided in the SGML specification to allow any characters in Unicode 2.0 (ISO/IEC 10646) to be included in the character set for an SGML document. A document typically begins with one or more statements that comprise a definition for the character set used in the document. The first statement is typically a reference to a "base character set" using a recognized "public identifier." Additional characters can be defined using their Unicode (ISO/IEC 10646) code points. See Handling of the SGML declaration in SP for a listing of character sets recognized by the SP software toolkit.

A primary design objective for SGML was to allow explicit representation of the logical structure of text, such as paragraphs, headings, sections, indexes, etc. Effective support for normal rendering is dependent on an appropriate DTD and associated style sheet (or other formatting instructions). Another important objective was to support indexing of the textual content.

Integrity of document structure The primary intent of SGML markup is to represent document structure.
Integrity of layout and display Best practice is to have the SGML represent the logical document structure and use stylesheets to render the text in a form appropriate for printing or for display to an end user. A style language for use with SGML is ISO/IEC 10179:1996 Information technology -- Processing languages -- Document Style Semantics and Specification Language (DSSSL).
Support for mathematics, formulae, etc. Depends on particular DTD specification. For example, the SGML DTD in ISO 12083:1994 Information and documentation -- Electronic manuscript preparation and markup has a module that supports mathematics display.
Functionality beyond normal rendering Depends on particular DTD specification.

File type signifiers and format identifiers

Tag | Value | Note
Filename extension | sgm, sgml | See PRONOM record for SGML.
Internet Media Type | text/sgml, application/sgml | See IANA registration at RFC 1874. The text/sgml media-type is for use only when the contents of the SGML entity can be understood without SGML display software. In other cases, application/sgml should be used.
Pronom PUID | x-fmt/195 | See http://www.nationalarchives.gov.uk/PRONOM/x-fmt/195.
Wikidata Title ID | Q207819 | See https://www.wikidata.org/wiki/Q207819.
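
The choice between the two media types listed above can be sketched as a small Python helper. The function name and its boolean parameter are hypothetical. The decision about whether a file can be understood without SGML software is left to the caller, since it cannot be determined from the filename alone.

```python
from pathlib import PurePath
from typing import Optional

# Filename extensions listed for SGML in the table above.
SGML_EXTENSIONS = {".sgm", ".sgml"}


def sgml_media_type(filename: str, readable_without_sgml_software: bool) -> Optional[str]:
    """Pick an Internet media type for an SGML file, or None for other files.

    text/sgml is chosen only when the caller judges that the content can be
    understood without SGML display software; otherwise application/sgml.
    """
    if PurePath(filename).suffix.lower() not in SGML_EXTENSIONS:
        return None
    if readable_without_sgml_software:
        return "text/sgml"
    return "application/sgml"


if __name__ == "__main__":
    print(sgml_media_type("manual.sgml", readable_without_sgml_software=False))
    print(sgml_media_type("letter.SGM", readable_without_sgml_software=True))
    print(sgml_media_type("notes.xml", readable_without_sgml_software=True))
```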

Notes

General

The significance of SGML is not only in its direct use in publishing workflows for a couple of decades, but also in its influence on markup formats that are much more widely used: XML and HTML. By 1995, it was determined that web browsers were unlikely to provide full support for SGML, and W3C began work on a simpler version. The term SGML-Lite had been used informally for a while, but XML, for Extensible Markup Language, was the name chosen by the W3C working group. Design goals guiding the group included: XML shall be straightforwardly usable over the Internet; and XML shall be compatible with SGML. XML, which has been described as a subset of SGML, was first published as a W3C Recommendation in February 1998. XML adopted the concept and the syntax for DTDs from SGML. HTML started life, not as a subset of SGML, but as an application of SGML, defined by an SGML DTD. This remained true for HTML versions through 4.01, for which associated DTDs were published. In practice, however, web browsers and website developers have not used the DTD for HTML, but have developed and used HTML-specific tools for rendering, creating, and validating HTML. As stated in the invitation to a 30th anniversary party for SGML, "Our current information ecosystem is largely (perhaps almost totally) unaware of SGML, yet it is totally dependent on SGML's progeny: HTML and XML. SGML may not have made much of a splash but the ripples it created are still spreading."

History

SGML was adapted from IBM's Generalized Markup Language (GML), which Charles Goldfarb, Edward Mosher, and Raymond Lorie developed in the 1960s. The term "GML" was based on the initials of the three surnames. See The Roots of SGML, a memoir by Charles F. Goldfarb that goes back to the 1960s and A Conversation with Charles F. Goldfarb.

The first working draft of the SGML standard was published in 1980. By 1983, the Graphic Communications Association (GCA) recommended the latest (sixth) working draft as an industry standard (GCA 101-1983). Major adopters included the US Internal Revenue Service (IRS) and the US Department of Defense. In 1984, the working group was reconstituted as an ISO/IEC working group and produced a draft international standard in October 1985, which was approved and published in October 1986 as ISO 8879:1986. See A Brief History of the Development of SGML for more detail on the history of SGML through 1990.



Last Updated: 12/13/2018