Sustainability of Digital Formats: Planning for Library of Congress Collections

Text >> Quality and Functionality Factors

This discussion concerns individual works consisting primarily of text; it does not take into account any additional requirements for serial publications or supplementary non-text materials. The focus of the initial analysis is on quality and functionality factors that apply to text as a form of expression and hence to all genres of text. However, in practice, the choice of digital format for a textual object will require consideration of genre (novel, reference work, journal article, e-mail, syndicated news feed, etc.), tools and formats used in creation or publication of the work, and whether the Library needs or intends to provide current access to the work using the digital object submitted or can support current use through other means (e.g. a subscription for online access). There is no single digital format in use today that will be best in all circumstances for textual content.

Significant characteristics of text: the challenge for analysis
The assessment of significant characteristics for textual items is complex, as is the impact of this assessment when selecting digital formats. Textual documents are created for a variety of purposes and their uses at the time of creation may be supplanted by other uses later. For example, journalistic accounts may be the subject of historical or linguistic analysis, or be re-purposed into a new work. Some current digital formats may be easy to make available now and sustainable, yet offer limited functionality in the future. The accompanying sidebar highlights some examples of significant characteristics, as they may relate to the requirements of creators and users. The characteristics listed here make no assumptions about what might be considered fair use; some may require explicit permission of rights-holders during the period of copyright protection. For some genres of text (such as e-mail or news wire service transmissions) very few of the characteristics may actually be relevant.

Significant characteristics of text: a simplified view
For practical support of decisions about formats appropriate for content to be added to the Library of Congress collections in the near future, some attempt to simplify is needed. As a starting point, it is useful to note that some digital formats are intended to support accurate rendering of final layout and design choices for text (e.g. PDF) and others are intended primarily to represent the logical structure of a document with the expectation that different design choices for layout, font, etc. can be made through the use of stylesheets (e.g. XML or SGML). Most word-processing and desktop-publishing application programs provide user controls for layout in two ways. Direct controls let users apply font changes manually to individual words or line indents to individual paragraphs. Alternatively, templates and styles can be used to generate documents that are tagged with their logical structure. Many authors do not use the features that generate the logical structure tags, but where such tags are present, they may be valuable to retain. If documents are stored in a format designed primarily to retain integrity of layout and design, tools for automated analysis (now and in the future) will not be able to take advantage of any logical structure information that contributed to the creation process. A format that represents the logical structure is also likely to be more useful for future re-purposing (other than as a straightforward facsimile).
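The distinction between logical structure and presentation can be illustrated with a minimal sketch. It assumes a hypothetical XML vocabulary (`article`, `title`, `section`, `heading`, `para`) and two hypothetical "stylesheets" expressed as plain Python mappings; none of these names come from a real DTD or from any Library of Congress specification. The same tagged content is rendered two ways without altering the underlying file.

```python
import xml.etree.ElementTree as ET

# Hypothetical document tagged with its logical structure.
# The element names are illustrative assumptions, not a real schema.
DOC = """<article>
  <title>On Formats</title>
  <section>
    <heading>Background</heading>
    <para>Text has structure.</para>
  </section>
</article>"""

# Two hypothetical "stylesheets": each maps a logical element
# to a presentation choice for that element's text.
PLAIN_STYLE = {
    "title": str.upper,
    "heading": lambda s: "== " + s + " ==",
    "para": lambda s: s,
}
INDENTED_STYLE = {
    "title": lambda s: "* " + s + " *",
    "heading": lambda s: s + ":",
    "para": lambda s: "    " + s,
}

def render(xml_text, style):
    """Walk the document in order and format each styled element's text."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        formatter = style.get(elem.tag)
        # Skip container elements and elements holding only whitespace.
        if formatter is not None and elem.text and elem.text.strip():
            lines.append(formatter(elem.text.strip()))
    return lines

plain_lines = render(DOC, PLAIN_STYLE)
indented_lines = render(DOC, INDENTED_STYLE)
```

A layout-oriented format would, in effect, store only one of these rendered outputs, so the knowledge that "Background" is a heading would have to be inferred from its appearance rather than read from a tag.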

A preliminary effort to cluster significant characteristics led to the identification of three integrity factors: integrity of document structure and navigation; integrity of layout, font, and other design features; and integrity of rendering for mathematics, chemical formulae, diagrams, etc. Whether these factors prove useful in practice for Library staff considering digital formats appropriate for any particular body of content will be evaluated as this format-analysis activity proceeds.

Normal rendering for textual items
Normal rendering for textual items includes convenient linear reading on screen, the ability to print sections of the document to paper, to excerpt quotations as text strings, to search for words within an item, and to index for searching as part of a corpus of documents. Rendering of any text item must reflect the intent of the author in representing the individual characters, paragraph structure, lists, headings, and indicators of emphasis.

Integrity of document structure and navigation
The degree to which the navigation options and automated analysis made possible by the logical structure of a textual work are essential to its usability. This factor will be important for highly structured documents, such as directories and encyclopedias, and for works that use a formal logical structure (e.g., front matter, body, list of illustrations, references, index) that can be exploited for enhanced user access and for analysis when the work is considered part of a larger corpus. Explicit representation of logical document structure is also important for accessibility for readers with visual impairments, as exemplified by the importance of the "publication structure" in the standard for the Digital Talking Book.

Integrity of layout, font, and other design features
The degree to which the look and feel of a text document, including exact choices of features such as font and column layout, is essential to its meaning. The importance of this factor will be a matter of policy or individual judgment, with the importance being greater if the layout is used to convey information and considerably less if the document consists only of textual sections presented in a linear or hierarchical structure. It may be worth noting that web pages are rendered differently by different browsers (indicating that layout and design are not fixed even in the original). Similarly, in the traditional print publications of journals, choices of font and layout are made by the publisher; the authors may not feel that they are essential and may indeed prefer choices used for drafts. Users may be indifferent, as long as the layout and font are easy to read.

Integrity of rendering for mathematics, chemical formulae, diagrams, etc.
The degree to which the accurate rendering of non-textual elements is key to the informational content of the document. This factor is separated out in light of the shortcomings of structural markup languages in the representation of equations. The integrity of a rendering that has been approved by an author should be considered significant.

The relative importance of these three integrity factors, applied to a particular item or body of content, must be considered when selecting among format options that are reasonable given the creation or publishing process for the content. It is probably not reasonable to expect a document created by a process in which the logical structure was never a consideration to be deposited for copyright registration as an XML file conforming to a DTD or schema designed to express the formal structure of a journal article.

Beyond normal rendering
How a digital textual work behaves (the functionality experienced by a reader) is supported by the combination of text tagged with logical structure information and the capabilities of a particular online environment or dedicated player (or e-book reader) to take advantage of the structured data. When considering functionality observed in particular viewing environments (such as a side-bar with a collapsible tree view of a table of contents), the potential functionality supported by the underlying markup (in this case, representing the underlying hierarchy of the table of contents) must be distinguished from the particular view of the hierarchy. Many aspects of functionality for text (such as bookmarking, or searching for words and permitting skipping to the next occurrence) are properties of a particular viewer rather than the underlying content. E-books using a particular proprietary "reader," which may be an application for a general-purpose computer or a special-purpose device, may provide functionality beyond that of a book presented as a linked set of web pages. This is true, even if the web pages and the e-book "file" were derived from the same marked up file.
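The difference between the hierarchy carried by the markup and a particular view of it can be sketched as follows. The nested `TOC` structure and the `outline` function are hypothetical illustrations, not part of any real viewer or e-book format; the `max_depth` parameter stands in for a viewer's collapsed tree view.

```python
# Hypothetical table-of-contents hierarchy, as underlying markup might
# represent it: each node is a (title, children) pair.
TOC = ("Book", [
    ("Chapter 1", [("Section 1.1", []), ("Section 1.2", [])]),
    ("Chapter 2", []),
])

def outline(node, depth=0, max_depth=None):
    """Return indented lines for the hierarchy.

    max_depth mimics a collapsed tree view: nodes deeper than
    max_depth are hidden from the view but remain in the data.
    """
    title, children = node
    lines = ["  " * depth + title]
    if max_depth is None or depth < max_depth:
        for child in children:
            lines.extend(outline(child, depth + 1, max_depth))
    return lines

expanded_view = outline(TOC)
collapsed_view = outline(TOC, max_depth=1)
```

Both views are produced from the same `TOC` data. Collapsing or expanding is a property of the viewing function, whereas the ability to offer either view at all depends on the hierarchy being present in the content.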

One significant functionality beyond normal rendering for text relates to links from within the document to external resources. Links observed on the web today may be links created explicitly by the author. In this case, a format that retains the identifiers or addresses (e.g. URLs) behind the links is likely to be preferable. However, links on websites are often generated by an application that constructs links dynamically. This is particularly likely to be the case when online services display links from references in articles to the cited articles or customize displays to particular users or groups of users. Whether the underlying digital content file includes information that could reproduce such links probably cannot be determined in a general way.
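For the simpler case of links created explicitly by the author, a minimal sketch shows how the addresses behind the links can be recovered when the format retains them. It uses Python's standard `html.parser` module; the `LinkCollector` class, the `extract_links` function, and the sample markup with its `example.org` address are illustrative assumptions. Dynamically generated links would not appear in a stored file and so would not be found this way.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> element that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Illustrative sample: one explicit link and one anchor without an address.
SAMPLE = (
    '<p>See <a href="https://example.org/cited">the cited article</a> '
    "and <a>an unlinked anchor</a>.</p>"
)

def extract_links(html_text):
    """Return the explicit link addresses found in the markup, in order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links

found_links = extract_links(SAMPLE)
```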

Last Updated: Thursday, 05-Jan-2017 15:59:12 EST