EAD Application Guidelines for Version 1.0


Chapter 6: SGML and XML Concepts

6.1. Introduction
6.2. SGML Documents
6.2.1. Document Type Definition (DTD)
6.2.2. SGML Declaration
6.2.3. Document Instance
6.2.4. Enforcing Compliance
6.3. Elements
6.4. Attributes
6.5. Entities
6.5.1. Parameter Entities
6.5.2. General Entities
6.5.2.1. Character Entities
6.5.2.2. Declaring Entities in the Document Prolog
6.5.2.3. Internal Entities
6.5.2.4. External Entities
6.5.2.4.1. Externally Stored Data Intended to be Parsed
6.5.2.4.2. Externally Stored Data Not Intended to be Parsed


6.1. Introduction

This chapter provides an overview of some important concepts in SGML-based information systems in order to give readers of these Guidelines a basic understanding of some key technical issues. Differences between SGML and XML are addressed where they are relevant. (97) These issues are examined from the perspective of the EAD DTD itself, with most examples taken directly from the DTD. In addition, this chapter provides the following:

Some readers may find chapter 6 overly technical or confusing upon first reading. Chances are that some of its content will not immediately be useful when approaching EAD as a complete novice or with a limited understanding of basic computing technology. It may prove more helpful when revisited later on, after the reader grows more comfortable with the practicalities of EAD encoding. In the same way that you do not need to understand the chemistry of baking in order to make an apple pie provided you can identify the ingredients and follow the proper sequence in mixing them, most archivists will be able to create an EAD-encoded finding aid without a detailed understanding of either SGML or XML.

On the other hand, other readers may find this chapter overly basic if they need even more in-depth information about SGML or XML systems. These Guidelines provide ample citations to more comprehensive information resources that are available for those needing more detail than can be provided here (see the footnotes throughout this chapter, as well as the SGML/XML section of the bibliography in appendix G).

6.2. SGML Documents

The SGML standard (ISO 8879) is a metalanguage that defines the components of an SGML document. (98) These components are as follows:

6.2.1. Document Type Definition (DTD)

A Document Type Definition supplies a formal, human- and machine-readable specification of the elements that can occur in a certain class of document and the markup that can be used to represent those elements. The elements have a logical, hierarchical relationship to one another that is specified in the DTD, which also provides a formal set of rules for the order and frequency of the elements. Figure 6.2.1a illustrates the concept of SGML as a metalanguage.

SGML data structures are based on the premise that the contents of a type of document, such as a finding aid, can be described as a series of hierarchies. For example, a collection may contain series, those series may contain files, and those files may consist of individual documents. In SGML, these parent-child relationships are expressed in the DTD. The relationship of elements to one another may be visualized as a tree with a root and central trunk subdividing into smaller and smaller branches with nodes at the furthest extremity of each. Figure 6.2.1b illustrates this concept using the beginnings of EAD's hierarchical structure. The base of this tree structure is referred to in SGML parlance as the document element. In an EAD document it is represented by the start-tag <ead>, which signals the beginning of the body of an encoded instance, the base of this tree.
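The tree described above can be sketched in a few lines of Python, using the standard library's XML parser. This is only an illustration: the skeletal instance is hypothetical, contains no finding aid content, and shows just a handful of elements near the top of EAD's hierarchy; the function name outline is likewise made up for this sketch.

```python
import xml.etree.ElementTree as ET

# A skeletal, purely illustrative instance: only a few elements from the top
# of EAD's hierarchy, with no textual content at all.
SKELETON = "<ead><eadheader/><frontmatter/><archdesc><did/><dsc/></archdesc></ead>"

def outline(element, depth=0):
    """Return one indented line per element, parents before their children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(SKELETON)   # root is the document element, <ead>
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline corresponds to one step further out along a branch of the tree.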

6.2.2. SGML Declaration

The SGML declaration provides technical information to SGML applications, such as authoring and publishing software, about how the DTD and encoded document will use SGML. This includes the base character set employed and the special features of SGML, such as OMITTAG or SHORTREF (99), that are to be permitted. While most SGML applications expect the use of a standard default SGML declaration called the concrete reference syntax (CRS), a particular DTD may employ a customized declaration. EAD does employ a customized SGML declaration, which is contained in the file eadsgml.dcl. (100) The XML standard, on the other hand, does not permit such modifications, requiring the use of its own implicit SGML declaration. As a result, as noted in section 4.3.1, EAD documents in XML do not require the use of the eadsgml.dcl file. The eadsgml.dcl file provides an SGML declaration that modifies the CRS in order to come as close to the implicit declaration used in XML as possible. This was necessary in EAD so that a single DTD, with only the slight modifications noted in section 4.3.2.1, could be used in both SGML and XML systems.

6.2.3. Document Instance

Individual document instances must begin with a prolog naming the DTD to which they adhere, which, as noted above, references the SGML declaration being used, except in XML-compliant files, where the SGML declaration is implicit and cannot be modified. Following this prolog, the body of an individual document instance conforming to a particular DTD is composed solely of two types of information:

The prolog of all EAD-encoded document instances must begin with the following document type declaration:

	<!DOCTYPE  ead  PUBLIC  "-//Society of American Archivists//DTD
	ead.dtd (Encoded Archival Description (EAD) Version 1.0)//EN">

The string <!DOCTYPE identifies this as a document type declaration, which should not be confused with the similarly named "Document Type Definition." The document type declaration references the DTD that is being used by the document instance. Immediately following the <!DOCTYPE string is the document type name, which is defined by the SGML standard as the minimum literal string that can stand uniquely as a representation of the document type being declared, in this case "ead." In an SGML system, the document type name may be in either upper or lower case. However, since XML is case sensitive and the EAD DTD prescribes that all element and attribute names be in lower case, it is safest always to render it in lower case so that your files will work in either context.

Having identified the document type, the instance must now make its DTD available to processing applications. This can happen in either of two ways. The entire DTD may be embedded within the DOCTYPE declaration itself in what is called the declaration subset. The content of the declaration subset is delimited at the beginning by a left square bracket ([) and by a right square bracket (]) at the end. (101) More frequently, the DTD is stored for efficiency in an external file, which the DOCTYPE declaration references at this point, as in the EAD DOCTYPE declaration shown above. The reference may consist of a formal public identifier whose text is preceded by the keyword PUBLIC. Alternatively, a location-specific system identifier may be given as a reference, signaled by the keyword SYSTEM. The relative merits of public identifiers, system identifiers, and a combination of the two are discussed in section 6.5.2.4.1.

6.2.4. Enforcing Compliance

Any SGML-aware software application attempting to process an SGML document instance is entitled to expect that the document will do two things:

A valid document instance is one that uses markup in the manner specified by the particular DTD that it references. XML permits well-formed document instances that do not reference a DTD, but that must still adhere to the rules in the XML specification. Archivists choosing to use the EAD DTD in XML mode will produce document instances that are both well-formed and valid.

In SGML-based systems, the process of testing a document for compliance with the referenced DTD is called "parsing." Parsing is the process of resolving something complex into its component parts. The parsing application in an SGML system goes through several steps to accomplish this. In a typical scenario, it first reads the SGML declaration if there is one and then tests the DTD itself for SGML conformity. Next it reads or parses the markup, expanding text entities and separating text from markup. The document is then transformed into a tree structure so that in the final phase the application can validate the document by comparing its structure to that of the DTD. Simply stated, parsing identifies the markup, and validation compares that markup against the DTD.
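The difference between parsing and validation can be seen with a small Python sketch. It assumes an XML (not SGML) context and uses the parser in Python's standard library, which resolves markup into a tree but does not compare that tree against a DTD. It therefore demonstrates only the first half of the process described above; the function name and the sample markup are illustrative.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser can resolve the markup into a tree."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<change><date>1998</date><item>Converted to EAD</item></change>"
bad = "<change><date>1998</date><item>Converted to EAD</change>"  # <item> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

A document that passes this check is only known to be parseable. Whether its elements occur in the order and frequency the DTD requires is a separate question that a validating parser must answer.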

6.3. Elements

Elements are the primary building blocks for the markup that is specified by a DTD. Elements, in the form of the tags that represent them, provide the framework for an encoded SGML-compliant document instance. This framework surrounds textual or other data, which is the information content of the document that is intended to be conveyed to an audience.

The element declaration in a DTD is composed of the string <!ELEMENT, followed by an element name, followed by a content model, as in the following example:

	<!ELEMENT  element_name  content_model>

The element name forms the core of each tag that delineates the content of an element in an individual instance of an encoded document. The SGML standard specifies that tags are identified by a left angle bracket (<), followed by the text of the element name and a trailing right angle bracket (>), and that there are two possible tags for each content-bearing element: a start-tag and an end-tag. In their simplest form these tags are identical, with one important exception: the end-tag includes a forward slash (/) between the left angle bracket and the element name.

	Start-tag:			End-tag:
	<element_name>		</element_name>

The start-tag can have a more complex form because it can be qualified by attributes (discussed in section 6.4), while end-tags cannot be so modified.

The following examples contain two fairly simple element declarations from the EAD DTD to illustrate the relationship between element declarations and the tags used in an individual instance of an encoded document. Figure 6.3a contains the EAD DTD declarations for the elements EAD <ead> and Change <change>:

		<!ELEMENT  ead   (eadheader, frontmatter?, archdesc)>
		<!ELEMENT  change  (date, item+)>

	Figure 6.3a.  Element declarations taken from the EAD DTD.

The element name for each element as declared in the EAD DTD is used to construct tags in individual EAD document instances, as illustrated in this example:

	Start-tag:		End-tag:
		<ead>			</ead>
		<change>		</change>

The content model section of an element declaration in an SGML DTD specifies three characteristics of the element:

The order of occurrence in the content model sequence of the element declaration is determined by the following symbols:

	Symbol	Name			Meaning
	,	Comma			Required order (x then y)
	|	Vertical bar (pipe)	No required order (x or y)
	( )	Parentheses		Content groupings within the broader content model

The permitted frequency of occurrence of subelements is established by the following frequency indicators, which can appear immediately after the name of a particular element or after a content grouping delineated by a set of parentheses:

	Symbol	Name		Meaning
	(none)	Blank		Required, may occur only once
	+	Plus		Required, may occur one or more times
	?	Question mark	Optional, may occur zero or one times
	*	Asterisk	Optional, may occur zero, one, or more times

Both elements as declared in figure 6.3a are examples of a content model that can only contain other elements, also referred to as subelements. The first element declaration in figure 6.3a states that the element Encoded Archival Description <ead> must contain a single required subelement EAD Header <eadheader>, possibly followed by a single optional subelement Front Matter <frontmatter>, followed by a single required subelement Archival Description <archdesc>. The second element declaration states that the element Change <change> must contain a single required subelement Date <date>, followed by one or more required Item <item> subelements (see chapter 3 for a discussion of the use of these and other EAD elements and for examples of their use in encoded finding aids).
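One way to get a feel for how order and frequency indicators work is to notice that an element-only content model behaves much like a regular expression over the names of an element's children. The Python sketch below relies on that analogy. It is not how an SGML parser is actually implemented, it handles only the symbols discussed above, and the function names are invented for the illustration.

```python
import re

def model_to_regex(content_model):
    """Translate an element-only content model into a regular expression.

    Each element name becomes a group matching that name plus one space.
    Commas (sequence) are dropped; | ? + * and parentheses keep their meaning.
    """
    compact = re.sub(r"\s+", "", content_model)
    named = re.sub(r"[A-Za-z][A-Za-z0-9.-]*",
                   lambda m: "(?:" + m.group(0) + " )", compact)
    return re.compile(named.replace(",", ""))

def children_allowed(content_model, child_names):
    """Check a list of child element names against a content model."""
    sequence = "".join(name + " " for name in child_names)
    return model_to_regex(content_model).fullmatch(sequence) is not None

# The two content models from figure 6.3a.
EAD_MODEL = "(eadheader, frontmatter?, archdesc)"
CHANGE_MODEL = "(date, item+)"

print(children_allowed(EAD_MODEL, ["eadheader", "archdesc"]))
print(children_allowed(EAD_MODEL, ["archdesc", "eadheader"]))
print(children_allowed(CHANGE_MODEL, ["date", "item", "item"]))
```

The first and third checks succeed, while the second fails because the comma in the content model fixes the order of the subelements.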

The above element declaration examples illustrate an element-only content model, which means that these elements can have only other elements as content. Obviously, not all elements declared in the EAD DTD can follow this content model, since we must be able to put the textual data that makes up an archival finding aid somewhere within individual EAD-encoded document instances. An SGML-based encoding scheme such as EAD uses the term PCDATA to indicate that "parsed character data" is allowed in the content model for an element. You may not think of the text of your finding aid as "parsed character data," but that is what it is to an SGML-aware software package once your finding aid text is part of an EAD-encoded document. Any text that the content model for an element defines as PCDATA must be parsed by the software in order to determine that it is not markup. The software cannot assume automatically that this element content is or is not markup, and so it must resolve (analyze) its component parts.

SGML-based content models use the term CDATA (character data) to indicate to processing software that data allowed in certain places will never contain markup and therefore does not have to be parsed in order to be validated. One common example of the use of CDATA that will be discussed in section 6.4 is supplying attribute values, which can never contain other markup or character entities.

The Abstract <abstract> element provides an illustration of a mixed content model that can contain both textual data and other elements or markup. The following example illustrates the content model for <abstract> as established in the EAD DTD:

<!ELEMENT abstract (#PCDATA | ptr | extptr | emph | lb | abbr | expan |
ref | extref | linkgrp | bibref | title | archref)*>

This element declaration states that <abstract> has no required content and can contain parsed character data or any of the enumerated elements in any order and as many times as they are needed. In an EAD document, this means that you can put text and the enumerated tags in any combination that you wish between the start-tag <abstract> and the end-tag </abstract> (see section 3.5.1.2.6 for more information on the use of <abstract>).
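To see what mixed content looks like to software, the following Python sketch parses a short, invented <abstract> and lists its text runs and subelements in document order. The sample text and the helper function pieces are hypothetical; only the element names come from the content model above.

```python
import xml.etree.ElementTree as ET

abstract = ET.fromstring(
    "<abstract>Papers of <emph>Jane Doe</emph>, a hypothetical donor.</abstract>"
)

def pieces(element):
    """List the text runs and child tags of an element in document order."""
    found = []
    if element.text:                        # text before the first child
        found.append(("text", element.text))
    for child in element:
        found.append(("element", child.tag))
        if child.tail:                      # text that follows this child
            found.append(("text", child.tail))
    return found

parts = pieces(abstract)
for kind, value in parts:
    print(kind, repr(value))
```

The parser had to examine every character of the abstract to decide which parts were markup and which were text, which is exactly what the term "parsed character data" implies.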

An SGML-based markup system can define either content-bearing elements, as discussed above, or empty elements. The vast majority of elements defined by the EAD DTD contain either other elements or text. Several, in what is known as "mixed content," contain both. A few others, such as the element Pointer <ptr>, are defined as EMPTY. The declaration for this element specifies the following content model:

	<!ELEMENT  ptr  EMPTY>

Empty elements can contain no textual or element content and are tagged using only the start-tag, not the end-tag. SGML processing software knows not to look for an end-tag (see section 4.3.2.4 for information on XML syntax for empty elements). The principal value of the capacity to declare empty elements in an SGML system is to gain access to the attributes available for those elements; this is especially useful for elements that facilitate both internal and external linking from a particular spot in a document. You might use such an element for creating cross references between different points within an encoded finding aid or to create a link to digitized facsimiles of items from the collection described in a finding aid. Chapter 7 contains encoded examples and a more in-depth discussion of EAD's linking features.
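The point that an empty element exists chiefly to carry attributes can be illustrated with a short Python sketch. It assumes the XML syntax for empty elements (a slash before the closing angle bracket) and a TARGET attribute on <ptr>; the attribute value "series2" is a made-up identifier.

```python
import xml.etree.ElementTree as ET

# XML empty-element syntax; the value "series2" is hypothetical.
ptr = ET.fromstring('<ptr target="series2"/>')

print(ptr.text)            # no textual content
print(len(ptr))            # no subelements
print(ptr.get("target"))   # the attribute is the only information carried
```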

6.4. Attributes

Attributes in an SGML-based encoding system allow information to be added to qualify an element in some way, such as to specify the type of an element, to provide it with a unique identifier, or to provide instructions for how it should be processed or displayed. Attributes are declared in DTDs in much the same way as elements. They are identified by the string <!ATTLIST, and they have values that can either be constrained or unconstrained by the attribute declaration. Possible values for constrained attributes are specified within the attribute declaration, forcing the encoder to choose from a closed list of values. Unconstrained attribute values are generally specified in the DTD with a CDATA content model, which allows the user to input any value.

Attributes are always related to a previously declared element in a DTD. In other words, the SGML standard does not permit an attribute to stand alone without an element for which it provides some qualification. Attributes provide metainformation about the data content delineated by a particular element and can only appear after the element name within start-tags in an SGML-based encoding system. Attributes are placed into start-tags in the following manner:

	<element_name  attribute_name="attribute_value">

	For example: 	<c level="series">

For example, the Component element, delineated by the start-tag <c> and the end-tag </c>, provides, through its subelements and the textual data they contain, a variety of identification and contextual information about a particular component in an archival description. The LEVEL attribute on that component is the means through which the EAD DTD provides for the encoding of metainformation about the level of a particular descriptive component (collection, series, file, item) within the larger encoded archival description.
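The distinction between an element's content and the metainformation carried by its attributes shows up clearly when an encoded component is read by software. In the Python sketch below, the component and its title are invented for illustration, and the markup is assumed to be processed as XML.

```python
import xml.etree.ElementTree as ET

# A hypothetical component; the title text is invented.
component = ET.fromstring(
    '<c level="series"><did><unittitle>Correspondence</unittitle></did></c>'
)

level = component.get("level")                # metainformation from the start-tag
title = component.findtext("did/unittitle")   # textual data from a subelement

print(level, title)
```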

An attribute declaration in a DTD is composed of the string <!ATTLIST, followed by an element name indicating the element that the attribute will be modifying, followed by the attribute name, followed by the specification of allowable attribute value(s), followed by either a default value or an attribute type.

	<!ATTLIST  element_name  ATTRIBUTE_NAME  value(s)  type_or_default>

Figure 6.4a provides two examples of EAD element and attribute declarations:

 1.    <!ELEMENT  editionstmt  (edition | p)+>
	<!ATTLIST editionstmt
		id 		ID 			#IMPLIED
		altrender 	CDATA 			#IMPLIED
		audience  	(external | internal) 	#IMPLIED
		encodinganalog 	CDATA 			#IMPLIED>

2.    <!ELEMENT  archdesc  (runner*, did, (admininfo | bioghist | controlaccess |
				odd | scopecontent | organization | arrangement |
				add | dsc | dao | daogrp | note)*)>
	<!ATTLIST archdesc
		id 		ID 			#IMPLIED
		altrender 	CDATA 			#IMPLIED
		audience  	(external | internal) 	#IMPLIED
		type		(inventory | register
				 | othertype)		#IMPLIED
		othertype	CDATA			#IMPLIED
		level		(series | collection |
			 	file | fonds | item |
			 	otherlevel | recordgrp
			 	| subgrp | subseries)	#REQUIRED
		otherlevel	CDATA			#IMPLIED
		langmaterial	CDATA			#IMPLIED
		legalstatus	(public | private |
				 otherlegalstatus)	#IMPLIED
		otherlegalstatus CDATA			#IMPLIED
		encodinganalog 	CDATA 			#IMPLIED
		relatedencoding	CDATA			#IMPLIED>

	Figure 6.4a.  EAD DTD element and attribute declarations for
		<editionstmt> and <archdesc>. The three columns in the attribute
		declarations represent, in order, the name of the attribute, the content
		model, and the value designation.

The LEVEL attribute for the Archival Description <archdesc> element provides a good example of both constrained and unconstrained values (see the second attribute declaration example in figure 6.4a). The attribute declaration provides the following closed list of possible values for this attribute, thus constraining the choices an archivist can make: collection, file, fonds, item, otherlevel, recordgrp, series, subgrp, subseries. Providing a closed list of values makes inputting those values easier in SGML-aware authoring systems (see chapter 4 for more information on EAD authoring) and ensures consistency of attribute values across repositories for certain important types of information that may be crucial to union databases of encoded finding aids. There are, however, other legitimate names that some repositories may use for levels of archival description. While the list above is closed, one of the choices is OTHERLEVEL, another attribute that is declared in the DTD with a content model of CDATA, meaning that its value is unconstrained (see section 3.5.1 for a discussion of encoding the LEVEL attribute in <archdesc>).
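The effect of a closed list can be approximated in a few lines of Python. This sketch mimics, in a very simplified way, what a validating parser does with the LEVEL attribute as declared in figure 6.4a: the value must be present and must come from the list, while OTHERLEVEL is free text. The function name and the sample values are illustrative only.

```python
# The closed list declared for LEVEL on <archdesc> in figure 6.4a.
LEVEL_VALUES = {
    "collection", "file", "fonds", "item", "otherlevel",
    "recordgrp", "series", "subgrp", "subseries",
}

def check_level(attributes):
    """Return a list of problems with the LEVEL attribute of <archdesc>."""
    problems = []
    level = attributes.get("level")
    if level is None:
        problems.append("level is required")
    elif level not in LEVEL_VALUES:
        problems.append("level %r is not in the closed list" % level)
    # OTHERLEVEL is CDATA, so any value given for it is accepted as-is.
    return problems

print(check_level({"level": "series"}))
print(check_level({"level": "otherlevel", "otherlevel": "accession"}))
print(check_level({"level": "box"}))
```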

When attribute values are encoded within tags, they are treated by an SGML-aware processing system as literal values. This term denotes a string of characters enclosed between either single (') or double (") quotation marks that will not be broken down further for processing. For example, an encoder cannot use an entity reference as the content of an attribute value and expect that the processing software will recognize and resolve that entity (entities are discussed in section 6.5).

Attribute values can be acted upon by SGML-aware processing software in a variety of ways:

In an individual document instance it is important to remember that, unlike the textual data that is the content of many elements, the actual data values of attributes are not immediately available to the end user of that encoded document instance. In order that certain attribute values display to an end user (for example, the LANGMATERIAL attribute value of <archdesc> or the LABEL attribute value of a <did> subelement), you must be using a system or stylesheet that can act upon attribute values and transform them for display (see section 5.3.3 for more on stylesheets).

From a technical perspective, the most striking difference between attributes and elements is the fact that elements, as we have seen, can contain either text or other elements, while attributes can never contain other elements. Attribute values, as previously noted, are expressed in terms of CDATA rather than PCDATA so that an SGML-aware processing software package attempting to validate an encoded document instance will never have to parse those attribute values in search of further markup. Also, as you can see in the attribute declaration examples in figure 6.4a, there is no means in the attribute declaration to control the order in which attributes should occur. In an EAD-encoded document, you may therefore place the declared attributes for any given tag in any order you wish, while elements must be encoded in the order (if any) specified by the element declarations in the DTD.

The examples in figure 6.4a illustrate the variety of attribute values that can be declared in a DTD. The attribute ALTRENDER, which allows an encoder to indicate output rendering preferences to processing software, has a value designation of CDATA, meaning that it is unconstrained and an encoder therefore can assign it whatever value is needed. The attribute AUDIENCE, on the other hand, is constrained to one of two values supplied in a closed list, "internal" or "external." The attribute ID has a value designation of ID, which is an SGML term for a string of characters that begins with an upper- or lower-case letter, contains no whitespace, and is composed only of alphanumeric characters, underscores (_), hyphens (-), colons (:), and full stops (.). A further requirement for attributes with an ID value designation is that their value must be unique within a particular encoded document instance and that there can only be one such attribute per element.
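The two requirements for ID values, correct form and uniqueness within the instance, can be sketched in Python as follows. The pattern simply restates the description given above (a letter first, then letters, digits, underscores, hyphens, colons, and full stops); the function name and sample values are hypothetical, and a real parser performs these checks as part of validation.

```python
import re

# A letter, followed by letters, digits, underscores, full stops, colons, or hyphens.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_.:-]*")

def id_problems(id_values):
    """Report ID values that are malformed or repeated within one instance."""
    problems = []
    seen = set()
    for value in id_values:
        if ID_PATTERN.fullmatch(value) is None:
            problems.append("not a valid ID: " + value)
        if value in seen:
            problems.append("duplicate ID: " + value)
        seen.add(value)
    return problems

print(id_problems(["ser1", "ser2", "ser1"]))
print(id_problems(["1ser", "box 1"]))
```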

Attribute types are specified at the end of each attribute declaration line in the <!ATTLIST examples in figure 6.4a. The vast majority of the attributes in the EAD DTD are declared as IMPLIED, meaning that individual SGML-based systems may imply values for them if not otherwise declared, but that the DTD will not enforce their occurrence. You will notice in the <!ATTLIST declaration example for <archdesc> that the attribute LEVEL is declared as REQUIRED, which means that a parser will not validate an EAD instance when this attribute is missing.

6.5. Entities

Entities allow an encoder to declare an abbreviated name that serves as a substitute for something else. That "something else" can be one of several things:

Once an entity has been declared within a document instance, the encoder can use the abbreviated name as many times as necessary. Processing software, when encountering the abbreviated name, will expand the abbreviation to whatever the entity declaration references. How the entity expansion behaves is chiefly determined by the processing software, but an encoder can often use markup to provide some direction to the software. This is discussed at greater length in chapter 7 on linking elements.

		Declaration:
		<!ENTITY tp-address PUBLIC "-//ABC University::Special
		Collections Library//TEXT (titlepage: name and address)//EN"
		"tpspcoll.sgm">

		Expansion:
		<list type="simple">
		<head>Repository Address </head>
		<item>Special Collections Library</item>
		<item>ABC University</item>
		<item>Main Library, 40 Circle Drive</item>
		<item>Ourtown, Pennsylvania</item>
		<item>17654 USA</item>
		</list>

	Figure 6.5a.  An example of an entity declaration followed
			by the entity expansion.

Goldfarb and Prescod provide a helpful, if perhaps oversimplified, analogy for entities. They suggest thinking of an entity as a box with a label. The box contains some specified text or data, while the label (the abbreviated name) offers a shorthand way of referring to the box. (102) Entities can range from simple to complex, but they provide powerful ways to increase efficiency, avoid redundancy, and incorporate non-SGML data into encoded document instances.
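The "box with a label" idea can be tried out with a short Python sketch. To keep the example self-contained it assumes an XML context and uses an internal entity declared in the declaration subset, rather than an external file as in figure 6.5a; the entity name spcoll and the sample sentence are invented.

```python
import xml.etree.ElementTree as ET

# The label is "spcoll"; the box holds the repository name.
DOCUMENT = """<!DOCTYPE p [
  <!ENTITY spcoll "Special Collections Library, ABC University">
]>
<p>Held by the &spcoll;. Contact the &spcoll; for access.</p>"""

paragraph = ET.fromstring(DOCUMENT)
print(paragraph.text)
```

The entity is declared once and referenced twice. By the time the text reaches an application, the parser has already replaced each reference with the contents of the box.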

Entities can be one of two types: parameter or general.

6.5.1. Parameter Entities

Parameter entities provide a good introduction to some basic entity syntax, but we will not spend much time discussing them, because they can only appear in DTDs (either stand-alone DTDs like EAD, or DTDs that are embedded in the declaration subset of a document instance) and not in the body of individual document instances. Parameter entities are generally used by DTD writers, as they are in the EAD DTD, to bundle commonly used element and attribute declarations into a content model that can be reused easily throughout a DTD, or even shared between DTDs. Parameter entity declarations in DTDs use the following formula:

	<!ENTITY  %  entity_name  entity_value>

Parameter entities must be declared in DTDs before they can be referenced elsewhere in the text of the DTD. A DTD writer would reference a parameter entity as follows:

	%entity_name;

The following example provides an illustration of a parameter entity declaration taken from the EAD DTD:

	<!ENTITY  %  a.common
			'id		ID			#IMPLIED
			 altrender	CDATA			#IMPLIED
			 audience	(external | internal)	#IMPLIED'>

You may notice from prior attribute examples that the literal single-quoted string above looks suspiciously like the contents of an <!ATTLIST declaration, which it is. This entity declaration example basically states that wherever an application reading this DTD encounters the reference "%a.common;" it should substitute the attribute list that appears between the single quotation marks in the above example. This particular parameter entity is referenced frequently throughout the EAD DTD, since the attributes ID, ALTRENDER, and AUDIENCE are available for the modification of the majority of EAD elements. The very first element declared in the "Encoded Archival Description Element Declarations" section of the EAD DTD appears as follows:

	<!ELEMENT  ead   (eadheader, frontmatter?, archdesc)>
	<!ATTLIST  ead
		%a.common;
		relatedencoding 	CDATA 		#IMPLIED>

At the point at which an SGML-aware software package encounters this parameter entity reference, it will expand the reference so that an encoder utilizing an <ead> tag actually has four available attributes for that tag, RELATEDENCODING plus the three defined by the "a.common" entity declaration. This expansion of the entity reference is one of the steps all SGML-aware software packages must take prior to processing any SGML-compliant encoded text.
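The expansion step can be imitated with simple text substitution. The Python sketch below is a deliberately naive model of what a parser does: it makes a single pass, does not handle parameter entities that reference other parameter entities, and uses invented function and variable names. The declarations themselves are the ones shown above, with whitespace simplified.

```python
import re

# Replacement text of the a.common parameter entity, whitespace simplified.
PARAMETER_ENTITIES = {
    "a.common": (
        "id ID #IMPLIED\n"
        "altrender CDATA #IMPLIED\n"
        "audience (external | internal) #IMPLIED"
    ),
}

def expand(dtd_text, entities):
    """Replace each %name; reference with its declared replacement text."""
    return re.sub(r"%([A-Za-z][A-Za-z0-9.-]*);",
                  lambda m: entities[m.group(1)], dtd_text)

ATTLIST = "<!ATTLIST ead\n%a.common;\nrelatedencoding CDATA #IMPLIED>"
expanded = expand(ATTLIST, PARAMETER_ENTITIES)

# After expansion, one attribute is declared per line following the first.
attribute_names = [line.split()[0] for line in expanded.splitlines()[1:]]
print(attribute_names)
```

The resulting list contains the four attributes available on <ead>: the three from the entity plus RELATEDENCODING.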

There is one final point to make about the above example that illustrates an important property of entities. The example illustrates an internal entity declaration, which means that the text for the expansion of the entity reference is declared as part of the entity declaration itself. If that declaration had referenced a text or data file stored outside of the file in which the reference is included (we will see an example of this shortly), it would be an example of an external entity declaration.

The percent sign (%) in both the entity declaration and at the beginning of the entity reference in the examples above is what identifies these as parameter, and not general, entities.

6.5.2. General Entities

As noted in section 6.5.1, parameter entities can only be used in DTDs. Encoders of individual document instances using an SGML DTD also have entities available to them, called general entities. General entities are more complicated than parameter entities because they can serve a variety of functions within an encoded document instance. The following four sections discuss the types of general entities that are available to EAD users and the way in which some of those entities must be declared in a document instance. Section 6.5.2.1 discusses entities that are not individually declared in the document instance because they are character entities either defined as part of the EAD DTD when it is used in SGML mode, or, in the case of XML, defined as part of the XML specification itself. Section 6.5.2.2 discusses the declaration of entities as part of the prolog to a document instance. Section 6.5.2.3 discusses internal entities, in which the content of the entity expansion is provided as part of the entity declaration. Section 6.5.2.4 discusses external entities that provide addresses for external files required to expand the entity.

6.5.2.1. Character Entities

The simplest entities are those that are incorporated through eadchars.ent, one of the files associated with the EAD DTD (see section 4.3.1 for more information on the EAD DTD and associated files). These character entities are defined in standardized ISO (International Organization for Standardization) character entity sets (103) and are generally for characters and symbols (e.g., a "u" with an umlaut, the copyright symbol, the Greek letter lambda) that are not available on the standard English-language keyboard. The EAD DTD references the following 10 ISO character sets by default:

	Added Latin 1			Monotoniko Greek
	Added Latin 2			Diacritical Marks
	Greek Symbols			Numeric and Special Graphics
	Alternative Greek Symbols	Publishing
	Greek Letters			General Technical
	

Note that the character entity set references in the EAD DTD do not by themselves make these character entities available for use in EAD instances. You must have these character sets available in your SGML system in order to make them work in EAD. This is a topic that should be discussed with your system administrator if you plan to use character entities in your encoded finding aids.

In SGML, special typographic and graphic characters are represented with SDATA (specific character data) entity declarations, which provide the abbreviations you will use in character entity references within your encoded documents. These entity declarations appear in the SGML ISO character entity mapping tables that must be a part of your system if you are going to reference nonkeyboard characters. In your EAD instances you can reference individual character entities in the following manner:

	&abbreviation;

As an alternative to the SDATA abbreviation you can also use the decimal number assigned to the character in the ISO character entity set in use, preceded by a number sign (#), for example:

						Character Entity Reference
	Desired Character or Symbol	ISO SDATA Abbreviation	ISO Decimal Reference
	©				  &copy;		  &#169;

XML uses the functionality of Unicode to replace SGML's SDATA-based character entity scheme. Unicode is designed to be all encompassing, incorporating all diacritics, symbols and characters into a single character entity set.

As explained by Goldfarb and Prescod:

	If you are a native English speaker you may only need the fifty-two upper- and
	lower-case characters, some punctuation, and a few accented characters.  The
	pervasive 7 bit ASCII character set caters to this market.  It has just enough
	characters (128) for all of the letters, symbols, some accented characters and
	some other oddments.  ASCII is both a character set and a character encoding.
	It defines what set of characters is available and how they are to be encoded
	in terms of bits and bytes.

	XML's character set is Unicode, a sort of ASCII on steroids.  Unicode includes
	thousands of useful characters from languages around the world.  However, the
	first 128 characters of Unicode are compatible with ASCII and there is a character
	encoding of Unicode, UTF-8, that is compatible with 7 bit ASCII.  This means that
	at the bits and bytes level, the first 128 characters of UTF-8 Unicode and 7 bit
	ASCII are the same.  This feature of Unicode allows authors to use standard plain-
	text editors to create XML immediately. (104)

A character or symbol can be referenced using either the decimal number assigned to it in Unicode or its hexadecimal alphanumeric reference in the following manner:

	Decimal:	&#somenumber;
	Hexadecimal:	&#xsomenumber;

						Character Entity Reference
	Desired Character or Symbol	Unicode Decimal Reference	Unicode Hexadecimal Reference
	©				&#169;				&#xA9;
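
The equivalence of the two numeric forms can be checked with any XML parser. The following is a minimal sketch using Python's standard library; the <p> wrapper and the variable names are chosen only for illustration and are not taken from the EAD DTD.

```python
import xml.etree.ElementTree as ET

# One decimal and one hexadecimal character reference, both for the
# copyright symbol (Unicode code point 169, hexadecimal A9).
snippet = "<p>&#169; and &#xA9;</p>"

# An XML parser replaces each reference with the character it names.
text = ET.fromstring(snippet).text
print(text)
```

Both references come back from the parser as the same character, so the choice between them is a matter of local convention.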

SGML and SGML systems as they currently exist do not recognize the hexadecimal alphanumeric references, though XML systems do. Furthermore, SGML systems only recognize the Unicode numeric references for the 128 7-bit ASCII characters. Work is currently underway to alter the SGML standard to fully recognize the Unicode character entity set. EAD implementers using SGML software should use the ISO SDATA abbreviations when including character entity references in their EAD instances. When XML-compliant mapping tables become available, it will be easy to swap these for the SGML ISO tables in the system without necessitating any markup changes.

One other point worth mentioning is that although special characters can be included in EAD documents using SDATA abbreviations or decimal and hexadecimal references, many search engines cannot search these entity references, which may cause searches to fail. Until the time when Unicode becomes a standard in use in all of the various software packages utilized in encoding, manipulating, and delivering encoded instances, there will be disparities in how different software expresses special characters. A repository must consider the importance of being able to index and display these special characters in the light of the difficulty in maintaining them through the various stages of the markup and delivery of encoded finding aids.

6.5.2.2. Declaring Entities in the Document Prolog
All other general entities to be used in encoded document instances must be declared by placing an entity declaration for each in the declaration subset in the DOCTYPE declaration at the beginning of the EAD instance. As mentioned in section 6.2.3, the declaration subset appears between square brackets ([ and ]) at the end of the DOCTYPE declaration. Because whitespace is not significant within SGML declarations, a popular typographical convention is to begin a new line just after the opening square bracket and just before the closing square bracket in order to make encoded instance files easier to read. Figure 6.5.2.2a shows a prolog to an EAD instance in which an external general entity is declared:

Both the DOCTYPE and ENTITY declarations shown in figure 6.5.2.2a contain quote-delimited external identifiers. External identifiers can either be public or system identifiers. Public identifiers provide a form of destination address that is not specific to any one system. Use of a public identifier relies on the SGML system to resolve the nonspecific address to a specific one where the referenced file can be found. Public identifiers are provided for in the SGML standard (ISO 8879), while the syntax for Formal Public Identifiers (FPI) (105), a subset of public identifiers, is specified in the ISO 9070 standard. EAD does not require that all public identifiers be FPIs.

If a public identifier is used (indicated by the keyword PUBLIC), it may be followed by a system identifier in the form of a Uniform Resource Identifier (URI) for the resource. A URI is a broader construct that includes as a subset the more familiar URL (Uniform Resource Locator). (106) URLs, and more broadly URIs, are the addressing mechanisms that facilitate pointing to resources on the World Wide Web. The keyword SYSTEM, instead of PUBLIC, will precede an external identifier when only a URI, and no public identifier, is given for the file being referenced. Both keywords are important components in document type declarations and external entity declarations. The relative merits of public identifiers, system identifiers, and a combination of the two are discussed in section 6.5.2.4.1.

Entities like the one shown in figure 6.5.2.2a will be discussed shortly in greater detail. Once an entity has been declared in the declaration subset of a document instance, it can be referenced in the document instance itself at any point where markup is permitted by the DTD (it cannot be used in a place where content has been declared as CDATA). The entity declared in figure 6.5.2.2a would be referenced as follows:

	&brblmain;

6.5.2.3. Internal Entities
As mentioned earlier, general entities that may be used in document instances come in two basic flavors, internal and external. General internal entities are the simplest in that they contain the referenced content expansion directly as a component of the entity declaration. General internal entities contain text and are always included within the instance being parsed. General internal entity declarations in the declaration subset of a document instance begin with the string <!ENTITY, followed by the entity name, followed by the entity content delimited by either single or double quotation marks, as in the following example:

	<!ENTITY  entity_name  "specification_of_content">

Note that the keywords PUBLIC and SYSTEM are not necessary in internal entity declarations. Internal entities are only useful for text that is used repetitively within a particular encoded finding aid; they cannot be referenced from other document instances. For example, a repository might utilize a general internal entity in encoding a finding aid in which the name of the organization whose records are described in the finding aid is long, complex, and subject to typographical errors. In such a case, the encoder might declare an entity in the document type declaration as follows:

	<!DOCTYPE  ead  PUBLIC  "-//Society of American Archivists//DTD ead.dtd
	(Encoded Archival Description (EAD) Version 1.0)//EN"  [
	<!ENTITY  stuffsoc  "Society for the Preservation, Beautification, and
	General Betterment of the Stufftown Memorial Stuffatorium">
	]>

The encoder could then, at multiple points throughout the EAD instance, reference this declared entity as follows:

	<origination><corpname>&stuffsoc;</corpname></origination>
		[...]
	<prefercite><head>Preferred Citation</head><p>[identification
	of item], &stuffsoc; Records, Stufftown Memorial Stuffatorium, Stufftown,
	NS.</p></prefercite>
	[...]
	<bioghist><p>The &stuffsoc; was established in 1872 by the town
	council of Stufftown and was endowed initially with a fund to cover operating and
	acquisition expenses through the generous benefaction of ... </p></bioghist>

Any SGML-aware software encountering these entity references in processing can expand them to the full text provided in the entity declaration prior to processing the encoded instance.
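
This expansion can be observed with a small script. The following is a minimal sketch using the XML parser in Python's standard library. It assumes an XML environment, omits the external identifier from the DOCTYPE declaration so that no DTD file is needed, and uses a heavily abbreviated instance that a validating parser would not accept as complete EAD.

```python
import xml.etree.ElementTree as ET

# Abbreviated, non-validated instance. The entity is declared in the
# declaration subset and then referenced twice in the body.
instance = """<!DOCTYPE ead [
<!ENTITY stuffsoc "Society for the Preservation, Beautification, and General Betterment of the Stufftown Memorial Stuffatorium">
]>
<ead>
<origination><corpname>&stuffsoc;</corpname></origination>
<bioghist><p>The &stuffsoc; was established in 1872.</p></bioghist>
</ead>"""

root = ET.fromstring(instance)

# By the time the application sees the text, each reference has already
# been replaced by the declared content.
corpname = root.find("origination/corpname").text
paragraph = root.find("bioghist/p").text
print(corpname)
print(paragraph)
```

Correcting the name in the single entity declaration would change every place in the instance where the entity is referenced.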

6.5.2.4. External Entities
General external entities are intended to be either parsed or unparsed when an SGML-aware software package processes the document instance that references them. In the case of an entity that points to a file containing SGML character data, such as a large, oft-repeated markup section, you would want the parsing process to involve both the document instance and the text of the external entity. In other cases, the encoder would not want the externally referenced data files to be validated along with the encoded instance. Examples of the latter include entities that reference external SGML document instances containing other EAD-encoded finding aids or text encoded using a different DTD, or that reference non-SGML data files such as images captured as GIF or JPEG files or MPEG video files.

The next two sections describe the details of the entity declarations that are used to specify parsed or unparsed data.

6.5.2.4.1. Externally Stored Data Intended to be Parsed
First we will consider externally stored EAD-encoded data that is intended to be parsed as part of the document instance in which it is referenced. These external entity references are similar to general internal entity references, except that the literal quoted string must identify the external location of the file to be included in the document instance. This can be done using public identifiers, system identifiers, or a combination of the two.

Using entities in this way can assist a repository in the management of frequently updated information that appears widely across its encoded finding aids. Instead of entering such information as a part of each EAD instance, it can be stored as a separate file and referred to from within each instance using an entity reference. Figure 6.5.2.4.1a below provides an example of contact information that is stored as a separate file, tpncdsp.sgm, so that it can be referenced as an entity in each of the repository's finding aids. Updating this single file when, for example, the repository's area code changes, would change the contact information in all of the repository's encoded finding aids. If the area code information had been hard-coded into each of the individual files, such an update would be much more labor-intensive.

		<list type="simple">
			<head>Contact Information </head>
			<item>Rare Book, Manuscript, and Special Collections Library</item>
			<item>Duke University</item>
			<item>P.O. Box 90185</item>
			<item>Durham, North Carolina</item>
			<item>27708-0185 USA</item>
			<item>Phone: 919/660-5822</item>
			<item>Fax: 919/660-5934</item>
			<item>Email: [email protected]</item>
			<item>URL: http://scriptorium.lib.duke.edu/</item>
		</list>

	Figure 6.5.2.4.1a.  The content of the file tpncdsp.sgm.

The following is an example of an entity declaration that references the file tpncdsp.sgm using only a public identifier:

	<!ENTITY tp-ncd-spcoll PUBLIC "-//Duke University::Rare Book,
	Manuscript, and Special Collections Library//TEXT (titlepage: name
	and address)//EN">

This form of entity declaration is valid only in SGML systems. In XML a system identifier must be supplied as well, as illustrated in the following example, in which a relative URI is used:

	<!ENTITY tp-ncd-spcoll PUBLIC "-//Duke University::Rare Book,
	Manuscript, and Special Collections Library//TEXT (titlepage: name
	and address)//EN" "tpncdsp.sgm">

Finally, the information in figure 6.5.2.4.1a could be declared as an entity using only a system identifier. This approach would also be valid in XML. In the example below the system identifier is given as an absolute URI:

	<!ENTITY tp-ncd-spcoll SYSTEM
	"http://scriptorium.lib.duke.edu/eadfiles/tpncdsp.sgm">

The choice of whether to use a system identifier (a relative or absolute URI) or a public identifier (either an FPI or a less formal local public identifier) is largely determined by the system or systems in which you are storing and delivering encoded finding aid data (see section 5.4 for a related discussion of file management). Use of a relative URI, one that does not give the entire address of the referenced file beginning with the transfer protocol used (such as http://), assumes that all referenced files will inhabit a stable directory structure regardless of the server on which they reside. It should be noted here that use of relative URIs may be problematic for managers of union databases of encoded finding aids; archivists planning to submit their finding aids to such collaborative projects should consult the systems manager of the union database prior to deciding to use relative URIs. Using an absolute URI commits you to some file maintenance overhead anytime the files being referenced are moved to a new server, since the entire address of each URI will have to be edited. However, a simple "find and replace" routine will probably alleviate this overhead in most cases.
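
The practical difference between relative and absolute URIs can be illustrated with a short script. This is a sketch only: the finding aid file name findingaid.sgm and the union database address are hypothetical, and the standard URI resolution performed by Python's urljoin stands in for whatever your delivery system does.

```python
from urllib.parse import urljoin

RELATIVE = "tpncdsp.sgm"
ABSOLUTE = "http://scriptorium.lib.duke.edu/eadfiles/tpncdsp.sgm"

# Hypothetical locations of the same finding aid on two different servers.
home_copy = "http://scriptorium.lib.duke.edu/eadfiles/findingaid.sgm"
union_copy = "http://union.example.org/duke/findingaid.sgm"

# A relative URI is resolved against the location of the referencing file,
# so it follows the finding aid from server to server. An absolute URI
# always points at the original server.
for base in (home_copy, union_copy):
    print(urljoin(base, RELATIVE), "|", urljoin(base, ABSOLUTE))
```

On the second server the relative reference succeeds only if tpncdsp.sgm was copied alongside the finding aid, which is why a union database manager should be consulted first.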

Use of a public identifier assumes the existence of an SGML catalog file (107) to which a system can turn in order to map, or resolve, that public identifier into a URI. The great strength of a file management system based on public identifiers is that changes in file locations on or among servers can easily be accommodated by changing a single address in the catalog file, rather than changing the entity reference in each individual EAD instance. Planning for future storage and delivery system possibilities requires careful thought as you decide which addressing approach to adopt. A fuller discussion of various options for providing file addresses in external entity declarations is provided in section 7.5.
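
The mapping that a catalog file performs can be sketched in a few lines. The example below assumes a catalog entry of the form PUBLIC "public identifier" "system identifier", as used in SGML Open style catalogs; the target path /eadfiles/tpncdsp.sgm and the function names are hypothetical, and real SGML software carries out this lookup internally.

```python
import re

# One catalog entry: PUBLIC "public identifier" "system identifier".
# The target path is hypothetical.
CATALOG = '''
PUBLIC "-//Duke University::Rare Book, Manuscript, and Special Collections Library//TEXT (titlepage: name and address)//EN" "/eadfiles/tpncdsp.sgm"
'''

def normalize(public_id):
    """Collapse runs of whitespace so a line-wrapped identifier still matches."""
    return " ".join(public_id.split())

def load_catalog(text):
    """Return a dict mapping each normalized public identifier to its location."""
    entries = re.findall(r'PUBLIC\s+"([^"]*)"\s+"([^"]*)"', text)
    return {normalize(public): system for public, system in entries}

def resolve(public_id, catalog):
    """Look up a public identifier; None means the catalog has no entry for it."""
    return catalog.get(normalize(public_id))

catalog = load_catalog(CATALOG)

# The identifier as it appears, wrapped over three lines, in an entity declaration.
wrapped = """-//Duke University::Rare Book,
Manuscript, and Special Collections Library//TEXT (titlepage: name
and address)//EN"""
print(resolve(wrapped, catalog))
```

Moving the file to a new server then means editing the one catalog entry; the entity declarations in the individual EAD instances stay untouched.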

There is an important difference between referencing files containing encoding excerpts, such as the one illustrated in figure 6.5.2.4.1a, in SGML systems as compared to XML systems. In the former, any chunk of text can be excerpted for reuse provided that it is encoded using the same DTD as the document instance in which the file will be referenced as an entity. XML imposes the additional requirement that the excerpted chunk of encoded text be "well-formed," meaning that its markup must be balanced: every element that begins within the file must also end within it. The example in figure 6.5.2.4.1a meets this XML requirement, since each element opened in the file, from <list> down to each <item>, is also closed there. If the closing </list> tag alone were removed, however, this file would no longer satisfy the well-formed requirement in XML.

All general external entity declarations, regardless of whether they utilize a public or a system identifier, can be referenced in encoded document instances in the same way that the general internal entity was referenced earlier (see section 6.5.2.3), as shown in the following example:

	<publisher>Rare Book, Manuscript, and Special Collections
	Library<lb>
	Duke University<lb>
	Durham, North Carolina</publisher>
	&tp-ncd-spcoll;
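
How a parser pulls a parsed external entity into the document can be sketched with the expat parser in Python's standard library. The sketch assumes an XML environment and a system identifier only; a dictionary named FILES stands in for the file system, the content of tpncdsp.sgm is abbreviated, and the function and handler names are illustrative.

```python
from xml.parsers import expat

# Stand-in for the file system: system identifier -> file content (abbreviated).
FILES = {
    "tpncdsp.sgm": '<list type="simple"><head>Contact Information</head>'
                   "<item>Duke University</item></list>",
}

INSTANCE = """<!DOCTYPE ead [
<!ENTITY tp-ncd-spcoll SYSTEM "tpncdsp.sgm">
]>
<ead><publisher>Duke University</publisher>&tp-ncd-spcoll;</ead>"""

def element_names(instance, files):
    """Parse the instance, pulling in external entities, and list element names."""
    names = []
    parser = expat.ParserCreate()

    def on_start(name, attrs):
        names.append(name)

    def on_external_entity(context, base, system_id, public_id):
        # Parse the referenced file in a sub-parser that inherits the handlers.
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(files[system_id], True)
        return 1  # a nonzero value tells expat the entity was handled

    parser.StartElementHandler = on_start
    parser.ExternalEntityRefHandler = on_external_entity
    parser.Parse(instance, True)
    return names

print(element_names(INSTANCE, FILES))
```

The elements stored in the external file are reported in document order, exactly as if they had been typed at the point of the entity reference.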

6.5.2.4.2. Externally Stored Data Not Intended to be Parsed
The second type of external entity reference points to files containing data that is not intended to be parsed along with the document instance. As noted earlier, there are a variety of reasons why you might not want the parsing software to attempt to resolve the content of the entity that you are referencing. Two prominent reasons are that the referenced file is not composed of textual data that SGML-aware processing software would recognize (such as GIF-format images), or that the referenced file contains SGML-recognizable character data that is not intended to be a part of your document instance, such as another EAD-encoded finding aid or a literary text encoded using the TEI DTD.

The SGML standard specifies that entity declarations not intended to be parsed use the keyword NDATA followed by a notation name to indicate to application software the type of external file that the entity declaration references. A notation declaration, either in the DTD or in the prolog of an EAD instance, specifies a notation name and a formal public identifier (FPI) for the type of NDATA file used and must occur before you can declare a general external entity that uses that notation. The file eadnotat.ent, one of the files affiliated with the EAD DTD, includes a series of standard notation declarations. This provides notations for SGML as well as a number of common non-SGML data types, including HTML, JPEG, MPEG, XML, PCX, GIF, and TIFF. Because these notations are declared in the EAD DTD, you can declare external entities in the prolog of your document instance that reference such files. Each notation declaration contains a notation name that you will need to create entity references to that type of file. Notation declarations follow the same format as general external entity declarations. Figure 6.5.2.4.2a illustrates the notation declaration for GIF files from the EAD DTD:

When encoding your repository's finding aids, you may wish to include a GIF image of your repository seal or of the repository itself. This can be done by creating a general external entity declaration in the declaration subset of the prolog of each EAD instance. This entity declaration, as previously noted, must provide an address for the image file on your server using a public identifier, a system identifier, or both. A general external entity declaration, using as a system identifier an absolute URI, for the purpose described above might look like this:

	<!ENTITY lcseal SYSTEM
	"http://lcweb2.loc.gov/sgmlstd/panorama/lcseal.gif" NDATA gif>

The following is an example of an entity declaration for the same purpose that uses both a public and a system identifier, in this case a relative URI:

	<!ENTITY dukeseal PUBLIC "-//Duke University::Rare Book,
	Manuscript, and Special Collections Library//NONSGML (dukeseal)//EN"
	"dukeseal.gif" NDATA gif>

By including the NDATA keyword following the URI, you are signaling SGML-aware processing software that the referenced file contains data that should not be processed with your EAD instance. By providing the declared notation name for a data type after the NDATA keyword, you are further giving the processing software a clue about how it might handle this data that it is not supposed to parse.
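
What the processing software learns from such a declaration can be sketched with the expat parser in Python's standard library. The example assumes an XML environment; the notation declaration uses a placeholder system identifier ("image/gif") rather than the identifier given in the EAD DTD, and the function name is illustrative.

```python
from xml.parsers import expat

# The notation's identifier below is a placeholder, not the one in the EAD DTD.
PROLOG = """<!DOCTYPE ead [
<!NOTATION gif SYSTEM "image/gif">
<!ENTITY dukeseal SYSTEM "dukeseal.gif" NDATA gif>
]>
<ead/>"""

def unparsed_entities(document):
    """Return {entity name: (system identifier, notation name)}."""
    found = {}
    parser = expat.ParserCreate()

    def on_unparsed(name, base, system_id, public_id, notation_name):
        # The parser reports the declaration but never opens the file itself.
        found[name] = (system_id, notation_name)

    parser.UnparsedEntityDeclHandler = on_unparsed
    parser.Parse(document, True)
    return found

print(unparsed_entities(PROLOG))
```

The parser hands the application the file address and the notation name and leaves the decision about how to display or fetch the GIF file to the application.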

Because this entity is external to your document instance and not intended to be parsed as part of it, you would not refer to it in your document using the direct entity reference format discussed in section 6.5.2.4.1. Instead, you would use one of EAD's linking elements with an ENTITYREF attribute to provide a reference to the external entity (see section 7.3 for a fuller discussion of EAD's external linking elements and attributes).


Footnotes

  97. Throughout this chapter, an SGML environment is implied unless XML is explicitly mentioned.

  98. For a more detailed overview of SGML, see C. M. Sperberg-McQueen and Lou Burnard, eds., "A Gentle Introduction to SGML" (Chapter 2 of the Text Encoding Initiative Guidelines), available at: <http://www-tei.uic.edu/orgs/tei/sgml/teip3sg>.

  99. For further technical information about the SGML declaration, see Charles Goldfarb, The SGML Handbook (Oxford: Oxford University Press, 1990), 450-75.

  100. See section 4.3.1 for information regarding the EAD DTD and its associated files.

  101. Entity declarations are also included in the declaration subset. See section 6.5.2.2 for details.

  102. Charles Goldfarb and Paul Prescod, The XML Handbook (Upper Saddle River, N.J.: Prentice Hall, 1998), 478.

  103. Robin Cover's SGML/XML Web Page provides further information about ISO character entity sets, available at: <http://www.oasis-open.org/cover/topics.html#entities>.

  104. Charles Goldfarb and Paul Prescod, The XML Handbook (Upper Saddle River, N.J.: Prentice Hall, 1998), 37.

  105. For a fuller, more technical discussion of Formal Public Identifiers, see Steven J. DeRose, The SGML FAQ Book (Dordrecht: Kluwer, 1997), 211-12. See also Charles Goldfarb, The SGML Handbook (Oxford: Oxford University Press, 1990), 382-90.

  106. The World Wide Web Consortium maintains an excellent web site that contains definitive information concerning definitions and the ongoing development of URIs, URLs, and FPIs. Available at: <http://www.w3.org/Addressing/>.

  107. Further information regarding the SGML catalog file and the formal FPI structure is available in Encoded Archival Description Retrospective Conversion Guidelines, Section VI, "Naming and Declaring Referenced External Entities," available at: <http://sunsite.berkeley.edu/amher/upguide.html#VI>. These guidelines were used in both the American Heritage Virtual Archive Project and the Online Archive of California Project.

Copyright Society of American Archivists, 1999.
All Rights Reserved.


Library of Congress Help Desk (11/01/00)