Sustainability of Digital Formats: Planning for Library of Congress Collections


HyperText Markup Language (HTML) Format Family

Format Description Properties

Identification and description

Full name HyperText Markup Language (HTML) Format Family
Description

HyperText Markup Language (HTML) is the primary markup language used for creating pages and applications on the World Wide Web. HTML originated around 1990 as a language intended for the distribution of relatively simple structured documents, suitable for use by authors who were scientists or academics rather than experts in printing or other aspects of document formatting. As the Web became more and more popular, HTML was enhanced to satisfy demand for access to multimedia, greater control of layout and fonts, and support for interactive applications. As of 2018, a trio of widely adopted complementary technologies is used for displaying content and supporting interaction on the Web: HTML for structural markup for underlying content, Cascading Style Sheets (CSS) for applying formatting to that content, and JavaScript for supporting interaction and as the basis for content management frameworks that assemble pages dynamically from chunks of content. This description focuses on characteristics that are common to all versions of HTML with an emphasis on the file format as used for distributing documents or document-like content. It also covers the overall history of the format's development. Descriptions for a number of subtypes provide more detail for the most widely used versions of HTML.

HTML files are textual files, viewable and editable in plain text editors. Originally, HTML files used 7- or 8-bit character sets (typically US ASCII, ISO/IEC 8859-1, or Windows-1252); now UTF-8 is recommended. Over the years, HTML has been developed and maintained as a freely implementable, openly documented format under the auspices of several different organizations.

The format was first developed by Tim Berners-Lee, while he was working at CERN. The first formal specification was published as RFC 1866: Hypertext Markup Language 2.0 under the auspices of the Internet Engineering Task Force (IETF) in November 1995. The World Wide Web Consortium (W3C), which had been founded in October 1994 with Berners-Lee as Director, took over development responsibility for HTML and published the HTML 3.2 Reference Specification as a W3C Recommendation in January 1997. W3C published HTML 4.0 (December 1997) and HTML 4.01 (December 1999) as W3C Recommendations and then turned its focus to XHTML (Extensible Hypertext Markup Language). The W3C Recommendation XHTML 1.0, published in January 2000, was a reformulation of HTML 4.01 in a form that would be valid XML; an important objective was to facilitate the use of widely available XML tools for parsing, editing, and validating source markup for web pages.

In 2004, a group of browser developers (from Apple, the Mozilla Foundation, and Opera Software) felt that the W3C's focus on XHTML was failing to address the needs of real-world website developers, particularly because the Web was now being used for applications, not just distributing document-like content. They formed a new entity, the Web Hypertext Application Technology Working Group (WHATWG), to continue development of HTML, using HTML 4.01 as a starting point and a 2004 position paper from the Mozilla Foundation and Opera Software as initial guidance. Since then, work on the fifth major generation of HTML specifications has proceeded in both WHATWG and W3C. The relationship has been somewhat complex, as made clear in the History clause of the HTML 5 specification. Nevertheless, in recent years, many important features have been added to the HTML specification. In practice, progress has been incremental, with WHATWG addressing individual problems and needs. Its members, who include the major browser developers, design and test solutions collaboratively, with an emphasis on interoperability and backwards compatibility. W3C has put effort into compiling regular chronological versions that incorporate new features that have interoperable support in two or three browser engines. The first W3C Recommendation for HTML5 (referred to here as HTML 5.0 to distinguish it from later specifications) was published in October 2014 with editors from both WHATWG and W3C named. The document states, "the bulk of the text of this specification is also available in the WHATWG HTML Living standard." Since then, the two inter-related activities have continued, leading to the following situation in early 2018.

  • WHATWG continues to maintain a specification for HTML as a "living standard" and does not use the version number 5. To quote from its HTML Standard FAQ 'Going forward, the WHATWG is just working on "HTML", without worrying about version numbers. When people talk about "HTML5" in the context of the WHATWG, they usually mean just "the latest work on HTML", not necessarily a specific version.'
  • The W3C has produced further snapshot specifications of the HTML 5 standard: HTML 5.1 (first published in November 2016); HTML 5.2 (first published in December 2017); and HTML 5.3 (in draft as of early 2018).

The original HTML was based on SGML (Standard Generalized Mark-up Language), an international standard [ISO 8879:1986 (SGML)] for marking up text into structural units such as paragraphs, headings, list items and so on. SGML could be implemented on any machine. The idea was that the language was independent of the formatter (the browser or other viewing software) which actually displayed the text on the screen. The use of pairs of tags such as <TITLE> and </TITLE> is taken directly from SGML. The SGML elements used in the original HTML included P (paragraph); H1 through H6 (heading level 1 through heading level 6); OL (ordered lists); UL (unordered lists); LI (list items). See the 1992 description of HTML tags. HTML 4.01 (published in 1999) was still described as an SGML application. However, by that time, it was clear that SGML, although widely adopted in publishing workflows, was not going to be accepted as a distribution format on the Web. Nor were open source tools being developed for working with SGML as a format. So W3C pursued the idea of basing a future version of HTML on XML, itself a subset of SGML. After W3C had produced two versions of XHTML (XHTML 1.0 and XHTML 1.1) as W3C Recommendations, the strategy changed. With HTML5, the approach has been to recognize HTML and XML serializations for the same set of elements and attributes. When the term "XHTML5" is used, it refers to the XML serialization for HTML5. However, the clause on XML syntax from the HTML 5.2 specification states "XML documents may contain a DOCTYPE if desired, but this is not required to conform to this specification. This specification does not define a public or system identifier, nor provide a formal DTD." This is in contrast to XHTML 1.1 which provided both an XML DTD and a schema in the W3C XML Schema language.
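
The practical difference between the two serializations can be illustrated with a minimal Python sketch. It assumes the standard library's generic XML parser as a stand-in for "widely available XML tools"; the two sample documents and the helper names are hypothetical and do not come from any specification. The XML serialization parses with an ordinary XML parser, while the same content written in HTML syntax with omitted end tags does not.

```python
import xml.etree.ElementTree as ET

# Namespace that the XML serialization of HTML uses for its elements.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample in the XML serialization: every element is closed.
xhtml_doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Sample</title></head>"
    "<body><p>First paragraph.</p><p>Second paragraph.</p></body>"
    "</html>"
)

# Hypothetical sample in HTML syntax: the P end tags are omitted,
# which HTML parsers accept but XML parsers reject.
html_doc = (
    "<html><head><title>Sample</title></head>"
    "<body><p>First paragraph.<p>Second paragraph.</body></html>"
)


def is_well_formed_xml(text):
    """Return True if a generic XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def paragraph_texts(text):
    """Return the text of each P element in an XML-serialized document."""
    root = ET.fromstring(text)
    return [p.text for p in root.iter("{%s}p" % XHTML_NS)]
```

Calling is_well_formed_xml(xhtml_doc) returns True and is_well_formed_xml(html_doc) returns False. A document that fails here may still be perfectly valid in the HTML serialization; the check says only whether generic XML tooling can read it.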

Production phase The primary use of HTML is as a final-state format for web pages made available on the Internet. However, HTML files can be edited directly in a text editor and are sometimes also used as a middle-state format in workflows that assemble web pages from several content files for final user display. As of 2018, many email systems support a subset of HTML as a middle-state format for transport of messages. See Wikipedia entry for HTML email.
Relationship to other formats
    Has subtype HTML_early, HyperText Markup Language (HTML), versions prior to 2.0
    Has subtype HTML_2_0, HyperText Markup Language (HTML) 2.0
    Has subtype HTML_3_2, HyperText Markup Language (HTML) 3.2
    Has subtype HTML_4_0, HyperText Markup Language (HTML) 4.0
    Has subtype HTML_4_01, HyperText Markup Language (HTML) 4.01
    Has subtype HTML_5, HyperText Markup Language (HTML) 5.x. HTML_5 has merged HTML and XHTML as two serializations (not described separately at this website at this time) of the same set of elements and attributes.
    Affinity to XHTML_1_0,
    May contain CSS,

Local use

LC experience or existing holdings

The Library of Congress launched its first website in 1994. Since then, staff have created HTML files using various chronological versions of the format. Early web pages did not identify the HTML version with a DOCTYPE declaration and likely corresponded to HTML 2.0. Starting around 1997, the Library adopted, in turn, HTML 3.2, HTML 4.01 Transitional, and XHTML 1.0 Transitional. Starting around 2011, all new and redesigned pages have used HTML 5.

The Library of Congress has acquired large volumes of HTML files for its collection via its web archiving program. See Library of Congress Web Archiving.

LC preference The Library of Congress Recommended Formats Statement (RFS) for textual documents includes HTML and XHTML as acceptable digital formats, when accompanied by DOCTYPE declaration and presentation stylesheet. The RFS does not distinguish between HTML versions.

Sustainability factors

Disclosure HTML has been an openly documented specification since its origin in the early 1990s. Initially it was developed and documented at CERN and via the www-talk mailing list. Between 1993 and 1995, documents were distributed publicly by IETF, first in June 1993 as an Internet Draft, and culminating in November 1995, with the publication of HTML 2.0. W3C took over responsibility for HTML development in 1995. Since 2004, HTML public documentation has been provided under the auspices of both W3C and WHATWG, with WHATWG documenting HTML as a "living standard" and W3C documenting snapshots.
    Documentation

The most recent specifications of HTML available are the latest W3C snapshot of HTML 5 and the WHATWG Living Standard for HTML. See Format Specifications below for links to previous versions.

Adoption

HyperText Markup Language (HTML) is the primary markup language used for creating pages and applications on the Web. Because HTML is now developed as a living standard, adoption needs to be assessed by element or other feature. Resources exist to check whether particular aspects of the standard are supported in the popular browsers. See Useful References below. For example, https://caniuse.com/#search=audio provides information about support for the audio element and the audio codecs listed in the HTML specification. The MDN element page for audio has a table of browser support at the bottom of the page. These resources show support by browser. However, current browsers are not built on independent code bases. Each browser has its own interface and functionality, but depends on a browser engine to handle the work of rendering HTML documents. The Wikipedia List of Web Browsers lists browsers by engine. There are two primary independent open-source browser engines, Gecko and WebKit. Gecko was developed by Mozilla and is used primarily for Firefox; WebKit is maintained by Apple for Safari (for macOS and iOS). Blink is a fork of the WebKit code by Google, first used for Chrome and then adopted by Opera. Trident is the proprietary engine that was used for Internet Explorer (IE); the engine for IE's successor, Edge, is called EdgeHTML. For more detail on browser engines, see the Wikipedia entry for Web Browser Engine.

Tools for creating HTML documents come in many varieties. The Wikipedia List of HTML Editors includes text-based editors designed for working with source code and visual editors such as Adobe Dreamweaver. Dreamweaver also functions as a high-level content management framework for building entire websites as do WordPress, Drupal, and many others.

As of early February 2018, statistics from World Wide Web Technology Surveys indicate that 80% of websites use HTML and 20% use XHTML. Of those using HTML, 86% use HTML 5 and 13% use HTML 4 Transitional.

As of 2018, many email systems support a subset of HTML for messages. See Wikipedia entry for HTML email.

    Licensing and patents

In all stages of the development of HTML as the markup language for documents on the web, the specification has been developed as a non-proprietary, openly documented, freely implementable standard. The W3C Patent Policy has the goal of assuring that all W3C Recommendations can be implemented on a royalty-free basis. The WHATWG Intellectual Property Rights Policy is similarly designed to secure intellectual property commitments from contributors to promote royalty-free licensing.

Transparency

HTML files can be opened and viewed in text editors. The HTML markup is human-readable, with human-comprehensible element tags, and is also designed for straightforward automatic parsing. The widespread use of CSS, and particularly of JavaScript, has resulted in less transparency.

The transparency of image and video files intended for incorporating into the rendered display depends on the formats of those files. Note that such files are not stored within the HTML file, but referenced by URL. The URL may be absolute or relative to the HTML file.

Self-documentation

HTML specifications since HTML 2.0 define a META element within the HEAD section of an HTML document. This element, which may have NAME and CONTENT attributes to hold name/value pairs, is widely used for recording descriptive or administrative metadata for the document. Web browsers do not typically display this data. Any NAME can be used but WHATWG has a wiki page for MetaExtensions that web authors are recommended to consult " to avoid choosing a metadata name that's already in use, and to avoid duplicating the purpose of any metadata names that are already in use." The MetaExtensions list includes Dublin Core elements and elements from several other vocabularies, including AGLS Metadata Standard (defined by the Australian government) and elements recommended by Google Scholar.
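
The name/value pairs described above can be read back out of a document with a few lines of code. The following is a minimal sketch using Python's standard html.parser module; the class name, function name, and sample document are hypothetical, and the metadata names in the sample are illustrative only.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from META elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <META NAME=...>
        # and <meta name=...> are treated alike.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name is not None and content is not None:
            self.pairs.append((name, content))


def extract_meta(html_text):
    """Return the NAME/CONTENT pairs found in the document, in order."""
    collector = MetaCollector()
    collector.feed(html_text)
    collector.close()
    return collector.pairs


# Hypothetical sample document.
sample = """<html><head>
<title>Sample page</title>
<META NAME="DC.title" CONTENT="Sample page">
<meta name="description" content="A hypothetical example document.">
<meta charset="utf-8">
</head><body><p>Text</p></body></html>"""
```

With this sample, extract_meta(sample) returns the two name/value pairs and skips the charset declaration, which has no NAME attribute.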

According to 3.2.4.2.1 Metadata content in HTML 5.2 and 3.2.5.2.1 Metadata content in the WHATWG HTML specification, metadata chunks in XML-based specifications, e.g., RDF, can be used in the XML serialization for HTML, but not in the regular HTML serialization. The compilers of this resource have not been able to determine the degree to which this feature has been adopted. Comments welcome.

External dependencies HTML markup does not have external dependencies that prevent interpretation of the source code. However, Web pages that use external style sheets depend on those resources for rendering as intended by authors. More problematic are scripts in JavaScript or other programming languages incorporated into web pages. These can create dependencies on resources that are only available dynamically in real time. The problems raised by JavaScript were discussed in two 2014 blog posts: HTML and fuzzy validity, by Gary McGath, and Web Archiving in the JavaScript Age, by Andy Jackson of the British Library.
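
One way to get a first impression of a page's external dependencies is to list the style sheets and scripts it references. The sketch below is a minimal illustration using Python's standard html.parser module; the class name, function name, and sample document are hypothetical. It reports only dependencies declared in the markup and cannot see resources that scripts request at run time, which are the harder problem described above.

```python
from html.parser import HTMLParser


class DependencyCollector(HTMLParser):
    """Collect URLs of external style sheets and scripts named in markup."""

    def __init__(self):
        super().__init__()
        self.stylesheets = []
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link":
            # rel may hold several space-separated keywords.
            rel = (attributes.get("rel") or "").lower().split()
            href = attributes.get("href")
            if "stylesheet" in rel and href:
                self.stylesheets.append(href)
        elif tag == "script":
            # Inline scripts have no src attribute and are skipped.
            src = attributes.get("src")
            if src:
                self.scripts.append(src)


def list_dependencies(html_text):
    """Return the declared external style sheets and scripts."""
    collector = DependencyCollector()
    collector.feed(html_text)
    collector.close()
    return {"stylesheets": collector.stylesheets, "scripts": collector.scripts}


# Hypothetical sample document.
sample = """<html><head>
<link rel="stylesheet" href="styles/main.css">
<link rel="icon" href="favicon.ico">
<script src="https://cdn.example.org/lib.js"></script>
<script>var inline = true;</script>
</head><body><p>Text</p></body></html>"""
```

For the sample, list_dependencies(sample) reports one style sheet and one external script; the icon link and the inline script are not counted.
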
Technical protection considerations HTML specifications did not incorporate any features supporting encryption or technical protection until 2017. In September 2017, W3C published Encrypted Media Extensions (EME) as a W3C Recommendation. EME specifies an API to control playback of encrypted content. This extension to HTML has been very controversial. It applies only to HTML 5. See Notes below for more detail on the proposal and the surrounding controversy. According to Can I Use? in February 2018, EME was supported by all the major browsers for computers and Chrome for Android phones but not for iOS Safari (for iPad and iPhone). Comments welcome.

Quality and functionality factors

Text
Normal rendering

The scope outlined in the November 1990 WorldWideWeb: Proposal for a HyperText Project at CERN focuses on hypertext and retrieval of documents over the Internet. But some of the stated objectives imply the need to satisfy normal text functionality for the future HTML: "to provide some method of reading at least text (if not graphics) using a large proportion of the computer screens in use at CERN"; and "to provide a keyword search option." The initial HTML design was for minimal markup that supported flowed text on different devices and window sizes. Text was naturally in reading order, since the design assumed a user reading on a screen, and HTML files were easily indexed by ignoring the tags.
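
Indexing by ignoring the tags can be sketched in a few lines. The example below is a minimal illustration using Python's standard html.parser module; the class name, function name, and sample markup are hypothetical. It suits the simple markup of early HTML; for modern pages it would also pick up the contents of script and style elements, and it inserts a space at every tag boundary, which can split a word that is only partly emphasized.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Keep the character data of a document and discard the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_words(html_text):
    """Return the words of the document in reading order."""
    collector = TextCollector()
    collector.feed(html_text)
    collector.close()
    # Joining with a space keeps adjacent blocks (heading, paragraph) apart.
    return " ".join(collector.chunks).split()


# Hypothetical sample in the style of early HTML.
sample = "<H1>Report</H1><P>Flowed <EM>text</EM> reads in order.</P>"
```

For the sample, visible_words(sample) yields the six words of the heading and paragraph in their original order, ready to be fed to a keyword index.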

Initially, HTML was based on a limited character set. For interoperability, the underlying character encoding used for transport was typically 7-bit ASCII, with numerical character references used to represent non-ASCII characters from a larger document character set. For example, &#246; represents a small letter o with an umlaut. The first formal specification (for HTML 2.0) based the document character set on ISO 8859-1, an ASCII-based 8-bit character set also sometimes referred to as Latin-1. RFC 2070: HTML Internationalization, published in January 1997, extended the HTML document character set to the Universal Character Set (UCS), standardized as ISO 10646 and introduced a mechanism for declaring character encoding in a META tag. For example, <META http-equiv="Content-Type" content="text/html; charset=EUC-JP"> in the HEAD section of an HTML document declared the character encoding used in the stored document as EUC-JP. This syntax was used through HTML 4.01. In HTML 4.0 and 4.01 both ISO 8859-1 and UTF-8 were common as character encodings on websites in English and in European languages. In HTML 5, the default character encoding became UTF-8. Other encodings can be specified, using a simplified syntax, e.g., <meta charset="euc-jp">.
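
The character-handling mechanisms described above can be illustrated with a minimal Python sketch. It uses only the standard library; the helper names are hypothetical, and the charset lookup is a simplification that inspects parsed META elements rather than following the byte-level detection steps that browsers apply.

```python
import html
import re
from html.parser import HTMLParser


def to_ascii_with_references(text):
    """Encode text as 7-bit ASCII, using numeric character references
    for anything outside ASCII (the early interoperable approach)."""
    return text.encode("ascii", errors="xmlcharrefreplace")


class CharsetFinder(HTMLParser):
    """Find the first character encoding declared in a META element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # Simplified syntax used with HTML 5: <meta charset="...">
            self.charset = attributes["charset"].strip().lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # Syntax used through HTML 4.01: charset inside CONTENT.
            match = re.search(
                r"charset\s*=\s*([^\s;\"']+)",
                attributes.get("content") or "",
                re.IGNORECASE,
            )
            if match:
                self.charset = match.group(1).lower()


def declared_charset(markup):
    """Return the declared encoding name in lowercase, or None."""
    finder = CharsetFinder()
    finder.feed(markup)
    finder.close()
    return finder.charset


# The two declarations quoted in the text above.
old_style = '<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">'
new_style = '<meta charset="euc-jp">'
```

Here html.unescape("&#246;") returns the single character o with umlaut (U+00F6), and both declaration styles resolve to the same encoding name. The same character occupies one byte (0xF6) in ISO 8859-1 and two bytes (0xC3 0xB6) in UTF-8, which is why a correct declaration matters when a stored file is decoded.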

Integrity of document structure

Semantic tags for paragraphs, up to six heading levels, and list structures of several types have been part of HTML since the earliest use of the markup language. See HTML Tags (from 1992). A specification for tagging a table structure, after considerable debate about alternative proposals, was published as RFC 1942 in May 1996 as an extension of HTML 2.0. HTML 4 added elements to mark inserted text (<INS>), deletions (<DEL>), and inline quotations (<Q>) and added some structural elements to identify headers and footers for tables. New structural elements were added to HTML 5 for header, footer, section, article, and figure (with optional caption).

Integrity of layout and display

Preserving particular aspects of layout was not an objective for HTML as originally conceived. The focus was on making the textual content and semantic structure of a resource conveniently readable on different devices.

A syntax for cascading style sheets, permitting the separation of styling instructions for HTML content from its semantic structure, was proposed in October 1994 by Håkon Lie based on some guiding principles. The November 1995 specification for HTML 2.0, RFC 1866, indicated that the <LINK> element could be used to associate an external stylesheet with an HTML document. Formal standardization for the Cascading Style Sheets (CSS) language started in W3C as soon as the consortium was formed, leading to the publication in December 1996 of the W3C Recommendation for CSS 1.0.

A formal specification for tables was added to HTML in RFC 1942 in May 1996. In the past, tables have been used for layout in HTML documents, although the intent of this specification was to represent the equivalent of a table in a printed article. The specification for the HTML table element now says explicitly, "Tables must not be used as layout aids."

With HTML 4, came an increased emphasis on the use of style sheets, for example using CSS, to allow web-page authors and website administrators to apply sophisticated designs and to control layout and formatting in a convenient way.

Support for mathematics, formulae, etc.

A syntax for embedding MathML into HTML using the <math> element was introduced with HTML 5. See clause 4.7.17. MathML from the most recent HTML 5 specification. According to Can I Use?, MathML is not supported in Chrome or Microsoft's Edge, although it is supported in Firefox and Safari. When tested in February 2018 against a single-page presentation of the MathML specification in which the examples are embedded as MathML, Firefox 58 did not render all of the examples correctly when compared with the same examples embedded as images. However, it did render more examples correctly than Chrome 63.

MathML has indeed not fulfilled the hopes of its developers. The W3C has formed a Community Group for Math on Web Pages. The group states, "While MathML was supposed to solve the problem of rendering mathematics on the web it lacks in both implementations and general interest from browser vendors. However, in the past decade, many math rendering tools have been pushing math on the web forward using HTML/CSS and SVG. One of the identified issues is that, while browser manufacturers have continually improved and extended their HTML and CSS layout engines, the approaches to render mathematics have not been able to align with these improvements. In fact, the current approaches to math layout could be considered to be largely disjoint from the other technologies of OWP [the Open Web Platform]." The group also argues on its website that representation of mathematical content on the web needs to move away from the single solution of MathML, which is designed to represent the visual layout of mathematics but not to convey the semantics.

Functionality beyond normal rendering

HTML was developed specifically to support linking among online resources. Compared with other digital formats used for primarily textual document-like content, the HTML format has been extended to support embedded audio and video and the development of interactive applications. These functionalities are widely adopted.


File type signifiers and format identifiers

Tag | Value | Note
Filename extension | html, htm | See https://www.iana.org/assignments/media-types/text/html (updated 2014).
Internet Media Type | text/html | See https://www.iana.org/assignments/media-types/text/html (updated 2014).
Magic numbers | See note. | The initial registration for the text/html Internet media (MIME) type states, "No sequence of bytes can uniquely identify an HTML document." However, in practice, tools can identify an HTML file and its version by finding <HTML> start and end tags or DOCTYPE declarations for particular chronological variants or profiles.
Mac OS file type | TEXT | From the registration for the text/html Internet media (MIME) type.
Pronom PUID | See note. | PRONOM does not provide a record for the entire HTML format family. See Notes below for PUIDs for different chronological versions.
Wikidata Title ID | Q8811 | See https://www.wikidata.org/wiki/Q8811 for the HTML Format Family. Wikidata also has entries for individual versions of HTML.
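
Because no byte sequence uniquely identifies HTML, identification tools fall back on heuristics of the kind mentioned in the magic numbers note. The following is a minimal Python sketch of such a heuristic; the function name and the 1024-byte window are assumptions made for illustration and do not reproduce the rules of any particular identification tool.

```python
import re

# Look for a DOCTYPE declaration naming html, or an HTML start tag.
_HTML_HINT = re.compile(rb"<!doctype\s+html|<html[\s>]", re.IGNORECASE)


def looks_like_html(data, window=1024):
    """Guess whether a byte string is an HTML document by searching
    the first `window` bytes for a DOCTYPE or an HTML start tag."""
    return _HTML_HINT.search(data[:window]) is not None
```

For example, looks_like_html(b"<!DOCTYPE html><html>") and looks_like_html(b"<HTML><HEAD>") both return True, while looks_like_html(b"%PDF-1.4") returns False. A heuristic like this misses early pages that omit both the doctype and the HTML element, so a negative result is not conclusive.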

Notes

General

Identifying versions of HTML: HTML and XHTML documents conforming to HTML specifications starting with HTML 2.0 begin with a doctype declaration. This declaration had a specific meaning in SGML and its use was maintained so that browsers and validators could recognize different doctypes to determine which version of HTML was being used. HTML 5, since it was intended as a "living standard" and was no longer based on SGML, uses a minimalist doctype, <!doctype html>, designed to be recognizable by browsers but no longer giving any indication of which specific variety of HTML (5.0 and up) is in use.
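
Version identification from the doctype can be sketched as a lookup on the public identifier. The Python example below is a minimal illustration, not a validator: the function name and result labels are hypothetical, and the table lists only a handful of the public identifiers published with the IETF and W3C specifications.

```python
import re

# A partial table of public identifiers; many variants are omitted.
_PUBLIC_IDS = {
    "-//IETF//DTD HTML 2.0//EN": "HTML 2.0",
    "-//W3C//DTD HTML 3.2 Final//EN": "HTML 3.2",
    "-//W3C//DTD HTML 4.0//EN": "HTML 4.0 Strict",
    "-//W3C//DTD HTML 4.0 Transitional//EN": "HTML 4.0 Transitional",
    "-//W3C//DTD HTML 4.01//EN": "HTML 4.01 Strict",
    "-//W3C//DTD HTML 4.01 Transitional//EN": "HTML 4.01 Transitional",
    "-//W3C//DTD XHTML 1.0 Strict//EN": "XHTML 1.0 Strict",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "XHTML 1.0 Transitional",
    "-//W3C//DTD XHTML 1.1//EN": "XHTML 1.1",
}

# Match a doctype and capture the public identifier when one is present.
_DOCTYPE = re.compile(
    r'<!doctype\s+html(?:\s+public\s+"([^"]*)")?[^>]*>', re.IGNORECASE
)


def classify_doctype(text):
    """Return a version label based on the document's doctype."""
    match = _DOCTYPE.search(text)
    if match is None:
        return "no doctype found"
    public_id = match.group(1)
    if public_id is None:
        # The minimalist doctype carries no version information.
        return "HTML 5 or later (version not indicated)"
    return _PUBLIC_IDS.get(public_id, "unrecognized public identifier")
```

For instance, classify_doctype('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">') returns "HTML 4.01 Transitional", whereas classify_doctype("<!doctype html>") can report only that the document belongs to the HTML 5 generation.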

Embedded images: In contrast to document formats such as PDF, ODT, and DOCX, images intended for incorporating into rendered HTML pages are not stored in the HTML file. Each image is referenced by a URL, which may be absolute or relative to the HTML document. The same is true for audio and video in a web page, whether simply linked or intended for playing automatically as the page is rendered.
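
Resolving such references against the location of the HTML document follows the standard URL resolution rules, which Python's urllib.parse.urljoin implements. The sketch below is a minimal illustration; the function name, page URL, and reference values are hypothetical examples.

```python
from urllib.parse import urljoin


def resolve_references(page_url, references):
    """Resolve each reference found in a page against the page's own URL."""
    return [urljoin(page_url, reference) for reference in references]


# Hypothetical page location and references of the kinds found in markup.
page_url = "https://www.example.org/collections/item/index.html"
references = [
    "images/map.png",                       # relative to the page's directory
    "../shared/logo.gif",                   # relative, one directory up
    "/static/site.css",                     # relative to the site root
    "https://media.example.net/clip.mp4",   # absolute, left unchanged
]

resolved = resolve_references(page_url, references)
```

A consequence for preservation is that a copy of the HTML file alone does not carry its images or media: relative references resolve correctly only if the referenced files are kept at the same relative locations, and absolute references depend on the remote host remaining available.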

Frames and framesets: The concept of a frameset as the structure for a web page was introduced in HTML 4.0 and included in the specifications for XHTML 1.0 and XHTML 1.1. A frameset defined the positioning of rectangular frames in a browser window. The content for each frame was an HTML document referred to by URL. One of the most popular uses of frames was to present a coherent body of content with a navigation menu in one frame and to use a separate frame to contain the page selected by a user. Some disadvantages of framesets are listed at Advantages and disadvantages of frames. Use of mobile devices with small screens presented additional challenges. The frameset structure and associated elements and attributes were dropped from HTML 5.

Encrypted Media Extensions (EME): EME is a W3C specification for providing a communication channel between web browsers and digital rights management (DRM) agent software. The first use of EME was by Netflix in April 2013. See HTML5 Video at Netflix. The first public draft of a specification for EME was published on May 10, 2013. Many revisions followed, particularly in late 2015 and early 2016. After a period of controversy, it was published as a W3C Recommendation in September 2017. See Useful references, below, for more about the controversy. Comments welcome.

History

The Description above covers the history of the HTML file format and the involvement of different entities in its development and documentation. Below is an annotated timeline of the most significant chronological versions of HTML.

  • 1990-1994: Informal specifications for HTML shared on www-talk list.
  • November 1995: Publication of HTML 2.0, as IETF RFC.
  • January 1997: Publication of HTML 3.2 as W3C Recommendation
  • December 1997: Publication of HTML 4.0 as W3C Recommendation
  • December 1999: Publication of HTML 4.01 as W3C Recommendation
  • January 2000: Publication of XHTML 1.0 as W3C Recommendation
  • May 2001: Publication of XHTML 1.1 as W3C Recommendation
  • October 2014: Publication of HTML 5.0 as W3C Recommendation

Format specifications


Useful references

URLs


Last Updated: 03/30/2018