The overview of each type explains how that type of software generally functions, cites representative products available at the time these Guidelines were written, and characterizes the general advantages and disadvantages of each method. This marketplace currently is very dynamic, with the emergence of XML and XML-compliant software as a potentially significant force in a variety of environments, including office productivity tools, relational database managers, and electronic commerce.(75)
Additional steps will be required to produce an attractive printed version of the finding aid for public use, since the document will include all of the EAD tagging but will lack any formatting or other presentational directions. Several options are available, including stylesheets written in formatting languages such as XSL and DSSSL, which can specify the layout of print copies generated from EAD finding aids. See section 5.3.3 for information on stylesheets and section 5.3.4 for a discussion of print output.
Advantages: Low cost, ready availability, and user familiarity are the chief virtues of these products.
Disadvantages: Text editors and word processor applications have no built-in knowledge of the rules of the EAD DTD and hence no method of verifying conformance to it. You must rely, therefore, on the encoder's knowledge of EAD to ensure that data elements and attributes are correctly applied. You will have to employ a separate application to validate the document. Several are available currently as freeware, including NSGMLS and XML-specific parsers from Microsoft, IBM, and others.
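As a rough illustration of this separate checking step, the following sketch uses Python's standard library to test whether an XML instance is well-formed. This is a weaker test than validation against the EAD DTD, which still requires a validating parser such as those named above. The function name and the sample strings are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message).

    This checks well-formedness only (balanced tags, quoted attributes);
    it does NOT check conformance to the EAD DTD.
    """
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)


good = "<ead><eadheader/><archdesc level='collection'><did/></archdesc></ead>"
bad = "<ead><archdesc><did></archdesc></ead>"  # <did> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

A check of this kind catches keying slips such as an unclosed tag, but it cannot tell you whether an element is permitted at a given point in the finding aid.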
When keying an EAD document using a text processor, you must be particularly careful about the use of certain characters and symbols. For example, the characters &, <, >, ", and ' have special meaning for SGML processors, as they may be either part of the text or part of the markup, and it is necessary to differentiate between the two. It is possible to include these and other "nonstandard" characters, such as letters that carry diacritics in non-English languages or other symbols not found on standard keyboards, in EAD documents by use of entity references (see section 184.108.40.206 for a discussion of character entity references). Extensive manual keying of entity strings may, however, significantly slow the authoring process. When using a word processor to create an EAD instance, particular care must be taken to ensure that an entity reference to the appropriate ISO character is inserted for such non-Latin letters and symbols instead of the proprietary escape codes that word processors typically use to display such characters. You will also have to generate the "prolog" section of the EAD instance manually (see section 6.2.3 for information about the EAD prolog).
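The substitution of references for special characters can be automated. The sketch below is one possible approach in Python, not a prescribed method: it escapes the five markup characters and, as a simplifying assumption, replaces every non-ASCII character with a numeric character reference rather than a named ISO entity. The function name is hypothetical.

```python
from xml.sax.saxutils import escape


def to_sgml_text(text):
    """Escape markup characters and replace non-ASCII characters.

    &amp;, &lt;, &gt;, &quot; and &apos; are among XML's predefined
    entities. Non-ASCII characters become numeric character references,
    an assumption made here for simplicity; a named ISO entity could be
    substituted where the DTD's character sets declare one.
    """
    escaped = escape(text, {'"': "&quot;", "'": "&apos;"})
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_sgml_text('Papers of José Martí & family, "1890 <draft>"'))
```

Running text through a routine like this before pasting it into an instance avoids most of the manual keying of entity strings.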
With regard to look and feel, some software displays the text of the EAD instance in a linear fashion, with both the markup and the content of the finding aid appearing as a continuous block of text. In other software, markup appears in one window in the form of a tree structure that displays the hierarchical and nesting relationships of the elements, and the text of the finding aid appears in a parallel window. Software applications representative of both categories are described below, emphasizing those currently being used by archivists to create EAD documents.
Corel's WordPerfect word processing software began including SGML authoring capabilities with version 6.0. It too requires that the DTD be converted into its internal structure as a "logic" file (ead.lgc, available via the EAD Help Pages) and also incorporates the standard text manipulation features one would expect in a word processor. Printing the finding aid in a format suitable for public use (without tags and with appropriate physical layout on the page) requires application of the "styles" feature in WordPerfect.
The ADEPT Editor software from ArborText similarly displays the element structure as a tree in a secondary window and the text of the EAD document in the main window. Like other products in this category, it offers a full range of text processing features and has probably the most complete set of specialized functions for SGML authoring of any available tool. The format of both the screen view and the print output of the SGML instance is governed by two separate stylesheets written according to the FOSI specification (see section 5.3.1 and section 5.3.3 for a discussion of stylesheet languages). ADEPT Editor is available for most operating systems, including Windows, Macintosh, UNIX, and OS/2. It was among the first commercial authoring packages to incorporate XML functionality.
Similar products in this group include Adobe's FrameMaker+SGML and Vervet's XML Pro. As XML enters the commercial marketplace, software companies such as SoftQuad (with XMetaL) and Macromedia (with Dreamweaver 2) are adding XML editing capabilities to their HTML editors.
Advantages: Both types of native editors (linear and tree structured) have many useful features that make them an attractive option. The software "knows" SGML in general and the DTD being used in particular. By directly incorporating the DTD, the software can provide continuous validation of a document during the authoring process. These particular applications include many of the features that you would expect in a full-featured word processor, including a spelling checker, thesaurus, macros, internal styles governing the display of text, and templates. They also will manage entities and generate the document's prolog.
Although native editors assume that users have a general knowledge of the structure and application of a particular DTD, the software's prompts and pull-down menus aid the user in the selection of elements and the assigning of attributes. They also help encoders to insert and manage character and file entities. In a sense, the effort is done "up front" during the initial data entry phase. Once the document is finished, no further work is required, with the possible exception of printing a user-friendly version of the inventory.
Disadvantages: Some knowledge of the DTD is required of the encoder, though mastering the software itself generally is no more complex than for any typical office computer application. You probably will have to learn on your own, however, since local training centers are not likely to feature courses in such specialized tools. In addition, the cost of software may be a factor with some products. All of these applications are priced as specialized rather than commodity products, with prices beginning around $450, though Corel does offer significant educational discounts for WordPerfect.
Generating finely formatted print copies often involves additional steps and skills and sometimes additional software. Native editors are best suited to the keying of new or existing inventories not already in electronic form rather than for conversion of existing electronic files. This is because using such editors to encode existing machine-readable texts requires much cutting and pasting of the file after it has been opened in the editor and therefore may actually prove more time consuming (and therefore more costly) than simply rekeying the text.
Conversion always proceeds from the premise that there is information available in the source document that permits the conversion software to map equivalencies between text or codes in the original document and comparable EAD elements. Such information may include physical formatting data such as punctuation, capitalization, tabbing and indention; word processing styles; or other markup codes such as elements from another SGML DTD or from MARC tags. In general, the greater the consistency in the application of these clues in the source document, the more reliable and complete the conversion. Do not presume that these techniques can be applied successfully to any and all existing electronic texts, absent such consistent conversion "hooks."
Microsoft's SGML Author for Word is a generalized tool for converting documents created in the Windows version of Word into SGML documents. It accomplishes the conversion by using Word's styles and templates features. You create styles, one corresponding to each EAD element, in a Word template for your documents, and then you formally create a link in the SGML Author software between each style and the equivalent EAD tag. This map is stored in an association file. As the finding aid is keyed into a Word document, appropriate formatting styles are applied to the text. During the conversion process, the software reads the association file and encodes the text of a particular document with the appropriate SGML tagging. TagPerfect software from the Finnish firm Delta Computers offers comparable functionality for converting Word documents into SGML.
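The idea of an association between styles and tags can be sketched in a few lines of Python. This is only an illustration of the principle, not a description of how SGML Author or TagPerfect work internally; the style names, the mapping table, and the function name are all hypothetical.

```python
from xml.sax.saxutils import escape

# Hypothetical association table: word processor style name -> EAD element.
STYLE_TO_EAD = {
    "Collection Title": "unittitle",
    "Collection Dates": "unitdate",
    "Scope Note": "p",
}


def convert(styled_paragraphs):
    """Wrap each (style, text) pair in the EAD element mapped to its style.

    Paragraphs whose style has no mapping are returned separately so that
    they can be reviewed by hand rather than silently dropped.
    """
    tagged, unmapped = [], []
    for style, text in styled_paragraphs:
        element = STYLE_TO_EAD.get(style)
        if element is None:
            unmapped.append((style, text))
        else:
            tagged.append(f"<{element}>{escape(text)}</{element}>")
    return tagged, unmapped


source = [
    ("Collection Title", "Smith & Jones Papers"),
    ("Collection Dates", "1975-1997"),
    ("Body Text", "An unstyled paragraph."),
]
tagged, unmapped = convert(source)
print("\n".join(tagged))
print("Needs review:", unmapped)
```

The list of unmapped paragraphs shows why consistent application of styles in the source document matters: anything keyed without a recognized style falls outside the automatic conversion.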
As an alternative to these off-the-shelf solutions, you may choose to create your own program to accomplish the conversion. There are three general categories of tools for doing so, and they are distinguished by the complexity of the effort and power of the languages involved. The simplest languages to learn and apply are the internal macro languages of Word and WordPerfect, and some archival repositories have successfully used them for conversion. The macro programming language in version 8.0 of WordPerfect includes special features that address SGML-specific issues. Beginning with Word 97, Microsoft changed its macro language from WordBasic to Visual Basic for Applications. The macros that can be written using these tools can range in complexity from very simple to highly sophisticated.
A number of special purpose programs have been developed expressly for the task of converting structured text into SGML documents. These include DynaTag from INSO Corporation and Balise from AIS Software. They employ a complex programming syntax and are geared to experienced programmers. Balise is described, for example, as closely resembling the C++ language.
Another conversion option that falls somewhere between these two poles is Perl. A widely employed and well-documented programming language, Perl was designed expressly for the type of text manipulation that is required for conversion of existing finding aids to EAD. It has been used at several repositories by staff with an affinity for such technical undertakings. While one can purchase an introductory Perl manual in order to begin learning this programming language, be forewarned that it will take time to master Perl.
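The kind of pattern-based text manipulation involved can be illustrated briefly. The sketch below is written in Python rather than Perl, and it assumes a legacy container list in which every line follows one fixed pattern; both the line format and the function name are assumptions made for the example.

```python
import re
from xml.sax.saxutils import escape

# Assumed legacy line format: "Box 1, Folder 3: Correspondence, 1975-1980"
LINE = re.compile(
    r"^Box (?P<box>\d+), Folder (?P<folder>\d+): "
    r"(?P<title>.+?), (?P<date>\d{4}(?:-\d{4})?)$"
)


def line_to_component(line):
    """Convert one container-list line to a <c01>, or None if it does not match."""
    match = LINE.match(line.strip())
    if match is None:
        return None  # inconsistent line: leave for manual encoding
    return (
        "<c01><did>"
        f'<container type="box">{match.group("box")}</container>'
        f'<container type="folder">{match.group("folder")}</container>'
        f'<unittitle>{escape(match.group("title"))}</unittitle>'
        f'<unitdate>{match.group("date")}</unitdate>'
        "</did></c01>"
    )


print(line_to_component("Box 1, Folder 3: Correspondence, 1975-1980"))
print(line_to_component("Miscellaneous notes"))
```

Lines that do not fit the pattern return nothing, which is a reminder that such routines succeed only to the degree that the source document is consistent.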
Advantages: Converters permit you to leverage existing machine-readable files and familiar software, provided that existing files are structured in a manner that will enable such transformation. By using the same software for creating inventories as you do for other office documents, you avoid the cost of a new suite of software. You also eliminate or substantially reduce the time required to learn a new application, thereby improving the likelihood of staff acceptance. There is an implicit assumption in such an ex post facto conversion scenario that the authors of documents will need only minimal, if any, knowledge of the underlying DTD structure. Also, since the original document was probably produced using a word processor, you avoid the need for additional steps to generate print copies for public use. Text conversion may be an effective approach for the encoding of legacy data already in electronic form and may also be suitable as part of your workflow for the production of new inventories.
Disadvantages: While staff costs (in terms of the overhead associated with knowing specialized software and the EAD DTD) are assumed up front during the authoring process when you use a native editor, similar overhead costs occur both before and after the fact with converters. First, source documents must be carefully formatted in advance to facilitate subsequent conversions. Development of the conversion routines themselves may involve an extended iterative design process. The conversion itself may prove more or less automatic, but manual intervention or post-conversion manipulation might be required. Some converters may be sensitive to variations occurring in source documents, either because the organizational structure of archival collections themselves varies or because of changes in finding aid formats over time. Programming adjustments may be required. Careful quality control review is necessary to ensure that automated processes actually generate the desired output.
Eloquent Software's Gencat program is a proprietary DBMS that offers output of files in multiple formats, including EAD. Some archives have already written their own applications to export data from a database as EAD documents, but fairly advanced programming skills are required to do so. A potentially complex, yet extremely critical, issue in the design of such a database is the development of an architecture that supports the multilevel hierarchical structure of the components of an archival collection that is at the heart of EAD. Part of the challenge may lie in the fact that many archival database systems are very "flat," allowing only one or two levels of hierarchy to be expressed.
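One common way to hold a multilevel hierarchy in a relational table is to give each row a pointer to its parent. The sketch below assumes such a design and shows how rows of that kind might be exported as nested components. It is not a description of Gencat or of any existing archival system; the table layout, sample rows, and function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows from a database table: (id, parent_id, level, title).
ROWS = [
    (1, None, "series", "Correspondence"),
    (2, 1, "subseries", "Incoming"),
    (3, 2, "file", "Letters, 1975"),
    (4, None, "series", "Financial Records"),
]


def build_dsc(rows):
    """Nest rows into <c> elements under a <dsc> using parent_id links.

    Assumes every parent row appears before its children.
    """
    dsc = ET.Element("dsc", type="combined")
    nodes = {None: dsc}  # rows with no parent attach directly to <dsc>
    for row_id, parent_id, level, title in rows:
        component = ET.SubElement(nodes[parent_id], "c", level=level)
        did = ET.SubElement(component, "did")
        ET.SubElement(did, "unittitle").text = title
        nodes[row_id] = component
    return dsc


dsc = build_dsc(ROWS)
print(ET.tostring(dsc, encoding="unicode"))
```

A "flat" database that lacks the parent link has no way to produce the nesting shown here, which is the limitation described above.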
The use of databases may become more widespread and simpler to implement in the future when producers of relational databases such as Oracle, Sybase, and others build XML functionality into their products, as they have promised.
Advantages: Use of a DBMS may be advantageous for institutions with substantial investments in such applications. It may also be valuable for those needing to interchange descriptive metadata of the type found in EAD finding aids with other applications, such as collection management or records management systems.
Disadvantages: The programming required to implement such a database and export its data to EAD may require highly specialized training or skills. In addition, conversion of a "flat" database structure into EAD will fail to exploit some of EAD's power to express archival hierarchies.
ead.dtd: This is the core EAD DTD file. It is brief, containing a version history of the DTD plus entity references that invoke the other files in the EAD suite. It also contains three conditional sections that enable or disable the following features: XML compatibility, XLink functionality, and the specialized features of EAD's array of tabular elements. The use of these features is described in section 220.127.116.11 (XML compatibility), section 18.104.22.168 (tabular layout), and section 7.2.4 (XLink functionality).
eadbase.ent: This is the largest file of the group and contains the SGML rules for EAD.
eadnotat.ent: This file contains references to the various types of notational (nontext) files that might be used within an EAD document. These include common image file formats such as GIF, JPEG, TIFF, and MPEG (see section 22.214.171.124.2 for more information on notational files).
eadchars.ent: This file contains references to the various character sets that might be used in an EAD document. All character sets are referenced by their standard ISO identifiers. This file is not required if the document is created in XML, which uses the Unicode character set (or some subset thereof) by default (see section 126.96.36.199 for more information on character sets).
eadsgml.dcl: This is the SGML declaration file, which specifies various features of the DTD that a processing application may need to know. While many DTDs utilize a standard SGML reference declaration, EAD employs its own version. Some software applications incorporate the text of the declaration at the beginning of each SGML instance. All XML documents employ a default declaration and so do not require the use of this file.
When you make this change, observe that the explanatory note in this section of the DTD file points out that "for XML, the eadnotat.ent file should be invoked in the declaration subset of [the] individual instance." This means that the file "eadnotat.ent" must be explicitly declared as an entity in the prolog of each EAD instance that contains links to notational (nontextual) data such as graphics files (see section 6.2.3 for a general discussion of the document prolog). For XML instances, the prolog of EAD-encoded finding aids should therefore read:
<!DOCTYPE ead PUBLIC "-//Society of American Archivists//DTD ead.dtd (Encoded Archival Description (EAD) Version 1.0)//EN" "ead.dtd" [
<!ENTITY % eadnotat PUBLIC "-//Society of American Archivists//DTD eadnotat.ent (Encoded Archival Description (EAD) Notation Declarations Version 1.0)//EN" "eadnotat.ent">
%eadnotat;
]>
While it is not necessary to declare the notation file "eadnotat.ent" if the finding aid does not contain a link to notational data such as graphics files, it is probably safest to add it in all cases as a default. Note that the Uniform Resource Identifiers (URIs), in this case simple file names that refer to the "ead.dtd" and the "eadnotat.ent" files, must point to the exact physical location of these two files on your system. Their content may therefore vary from the above examples in accordance with your local storage practices for the DTD and its associated files.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
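Because the system identifiers must reflect where the DTD files are stored locally, some repositories may find it convenient to generate the prolog from a small routine rather than key it by hand. The following Python sketch is one way this might be done; the function name and the sample directory are hypothetical. Note that the XML specification requires the encoding declaration to precede the standalone declaration, and the sketch follows that order.

```python
EAD_FPI = (
    "-//Society of American Archivists//DTD ead.dtd "
    "(Encoded Archival Description (EAD) Version 1.0)//EN"
)
NOTAT_FPI = (
    "-//Society of American Archivists//DTD eadnotat.ent "
    "(Encoded Archival Description (EAD) Notation Declarations Version 1.0)//EN"
)


def build_prolog(dtd_path="ead.dtd", notat_path="eadnotat.ent"):
    """Return an XML declaration and DOCTYPE pointing at local DTD files."""
    return (
        # In the XML declaration, encoding must come before standalone.
        '<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n'
        f'<!DOCTYPE ead PUBLIC "{EAD_FPI}" "{dtd_path}" [\n'
        f'  <!ENTITY % eadnotat PUBLIC "{NOTAT_FPI}" "{notat_path}">\n'
        "  %eadnotat;\n"
        "]>\n"
    )


# Hypothetical local storage location for the DTD suite.
prolog = build_prolog("dtds/ead.dtd", "dtds/eadnotat.ent")
print(prolog)
```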
Both the Windows and Macintosh operating systems permit the transfer of data from one application to another (via the clipboard and scrapbook respectively), and it is therefore a simple matter to cut-and-paste text between a catalog record and a finding aid document. You can simply open your catalog editing software and EAD authoring application simultaneously, in separate windows on the desktop, and transfer the information. This may be particularly useful if the existing legacy finding aid comprises only a container list and can be combined with the contents of a MARC record containing summary contextual information such as a scope and content note.
Some repositories may require a more automated approach in order to transfer large quantities of such data in batch mode. One option would be to write your own conversion program to transform the data from MARC into EAD, or vice versa. Another approach would be to use the MARC DTD developed by the Library of Congress. A simple DOS program from Logos Research(80) converts records from the MARC "transmission format" into the MARC DTD structure, and vice versa. The Library of Congress also offers two free programs, written in the Perl language, to convert records between these two formats.(81) Once a MARC record has been converted into the MARC DTD structure, you can use a transforming application to render the data from the MARC DTD syntax into the appropriate EAD syntax. Such transformations may be accomplished by various tools such as an XSL processor used in conjunction with an XSL stylesheet (see section 188.8.131.52 for a discussion of XSL and stylesheets). Among these tools is PatML, a freeware product from IBM.(82)
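For those who choose to write their own conversion program, the core of the task is a crosswalk between MARC fields and EAD elements. The Python sketch below illustrates the idea with four fields only. It assumes that the MARC record has already been read into a simple dictionary keyed by tag and subfield; that structure, the short crosswalk table, and the function names are assumptions made for the example, and the sketch does not use the MARC DTD or any of the tools named above.

```python
import xml.etree.ElementTree as ET

# Illustrative crosswalk: (MARC tag, subfield) -> path of EAD elements
# below <archdesc>. A production crosswalk would cover many more fields.
CROSSWALK = {
    ("245", "a"): "did/unittitle",
    ("245", "f"): "did/unitdate",
    ("300", "a"): "did/physdesc",
    ("520", "a"): "scopecontent/p",
}


def get_or_create(parent, tag):
    """Return the first child named tag, creating it if absent."""
    child = parent.find(tag)
    return child if child is not None else ET.SubElement(parent, tag)


def marc_to_archdesc(record):
    """Build an <archdesc> from a dict keyed by (tag, subfield)."""
    archdesc = ET.Element("archdesc", level="collection")
    ET.SubElement(archdesc, "did")  # create <did> first so that it leads
    for key, value in record.items():
        path = CROSSWALK.get(key)
        if path is None:
            continue  # unmapped field: skipped in this sketch
        node = archdesc
        for tag in path.split("/"):
            node = get_or_create(node, tag)
        node.text = value
    return archdesc


record = {
    ("245", "a"): "Smith family papers",
    ("245", "f"): "1975-1997",
    ("520", "a"): "Correspondence and diaries.",
    ("999", "a"): "local note",
}
archdesc = marc_to_archdesc(record)
print(ET.tostring(archdesc, encoding="unicode"))
```

Repeated fields and fields without a mapping are not handled here; a working converter would need a policy for both.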
Future development of the xml:namespace standard may make it possible to include information encoded in more than one DTD within a single EAD instance. As a result, we may have a third option in the future in which MARC data, in the MARC DTD structure, might be embedded directly in the EAD instance without first necessitating its transformation into EAD.
We cannot necessarily anticipate the actions that future processors will take on current text. With both SGML and XML, it is therefore prudent not to attempt to format text by incorporating whitespace in your document, other than between words, but rather to manipulate display completely through a stylesheet. Two examples may help illustrate how SGML and XML handle whitespace.
Keying text in the following manner into an SGML authoring application
<p>November 1:      The work of the Commission began ... </p>
will not ensure that rendering software such as a browser will actually display the text as follows:
November 1:      The work of the Commission began ...
It is much more likely with current software that the six blank spaces will be compressed into one. There are certain circumstances, however, in which one must be careful to ensure that at least one space does appear between words. This is true for inline elements, especially those that might be nested. Consider the following example:
<p>The movie, <title render="italic">Shakespeare in Love</title>, won the Academy Award for best picture.</p>
Without the space before the <title> start-tag and after the </title> end-tag, the text might be rendered as follows:
The movie,Shakespeare in Love, won the Academy Award for best picture.
Where the need for spacing in a prescribed situation can be anticipated (such as when a <unitdate> always follows <unittitle>) and a universally valid style rule can therefore be applied, a stylesheet may be used to supply the whitespace required. Unfortunately, not all situations are so predictable; for example, as shown above, you may not be able to guarantee that a space will be required after every instance of <title> when it occurs within a <p>. In such cases, your markup should include a single space following the inline element. It's better to be safe than sorry! Most processing software will reduce extra whitespace to a single space, but it would be quite problematic to expect your system to supply spacing where none exists.
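Both behaviors can be demonstrated with a short Python sketch. It imitates, in a simplified way, a display process that collapses runs of whitespace and strips tags; the function names are hypothetical, and real rendering software may behave differently.

```python
import re
import xml.etree.ElementTree as ET


def normalize(text):
    """Collapse every run of whitespace to a single space."""
    return re.sub(r"\s+", " ", text).strip()


def rendered_text(xml_fragment):
    """Concatenate all text in a fragment, as a tag-stripping display might."""
    return normalize("".join(ET.fromstring(xml_fragment).itertext()))


# Extra spaces keyed for layout purposes do not survive.
print(normalize("November 1:      The work of the Commission began"))

spaced = '<p>The movie, <title render="italic">Shakespeare in Love</title>, won.</p>'
unspaced = '<p>The movie,<title render="italic">Shakespeare in Love</title>, won.</p>'
print(rendered_text(spaced))    # the keyed space is preserved
print(rendered_text(unspaced))  # nothing supplies the missing space
```

The process can reduce whitespace that is present, but it has no basis for inserting whitespace that was never keyed.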
Punctuation within the body of a paragraph of text must be entered as data. Marks of punctuation such as colons and commas that are used between EAD elements for visual recognition or clarity, however, may be more safely supplied by a stylesheet; this approach enables global changes in such formatting to be accomplished at a later date with a minimum of effort by simple changes to the stylesheet.
The XSL style language provides the ability to reorder the sequence of elements, and such resorting of text may affect output punctuation as well. Elements that are initially keyed in a particular sequence and separated by some mark of punctuation may later be resorted into another sequence. For example, embedded punctuation in the <unittitle> element in the following markup
might read correctly in one case when displayed as:
but would include a superfluous mark of punctuation if this text were presented "out of line," in this manner:
Title: Papers, Dates: 1975-1997
It is therefore preferable to supply such punctuation for display purposes through your stylesheet. In some circumstances, however, such as the <persname> element in the following example, you should supply the punctuation within the markup; this is because you cannot predict whether a comma will always follow a <persname> element within a <p>. Moreover, this text is unlikely to be reordered for display. Since the text might be extracted for indexing purposes, however, it is advisable to place the second comma outside the <persname> element so that it is not inadvertently treated as a part of the name:
<p>The author, <persname altrender="bold">Bill Smith</persname>, was born in 1912.</p>
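The advantage of supplying display punctuation outside the markup can be shown with a brief sketch. Here Python stands in for a stylesheet, which is an assumption made purely for illustration; the sample markup and function names are hypothetical. Because the elements contain no punctuation of their own, the same data can be displayed in either order without a stray comma.

```python
import xml.etree.ElementTree as ET

# The markup carries no punctuation between the title and the date.
DID = ET.fromstring(
    "<did><unittitle>Papers</unittitle><unitdate>1975-1997</unitdate></did>"
)


def render_inline(did, separator=", "):
    """Display title and date on one line, joined by a supplied separator."""
    return separator.join([did.findtext("unittitle"), did.findtext("unitdate")])


def render_labeled(did):
    """Display the same data reordered under labels, with no stray comma."""
    return f"Dates: {did.findtext('unitdate')}; Title: {did.findtext('unittitle')}"


print(render_inline(DID))
print(render_labeled(DID))
```

Changing the separator in one place changes it for every finding aid processed, which is the global flexibility described above.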
On the other hand, the content of a heading may be unique to a particular finding aid in a way that cannot be anticipated by a stylesheet or derived from other text in the document; in such a case, the use of <head> is ideal.(85)
Within the <dsc> element, EAD includes an optional model of tabular displays that does require the deliberate specification of each cell, wrapping <drow> and <dentry> tags around them. Experience with EAD has shown, however, that effective tabular displays can be generated in the <dsc> and other areas of the finding aid by using stylesheets without the need to add this extra layer of tabular markup. Both the Cascading Style Sheets (CSS) language and the Extensible Style Language (XSL) can create tabular layout (see section 5.3.3 on stylesheet languages). Consequently, the <drow> and <dentry> tabular model is not included as a default feature of the EAD DTD, nor is it detailed in these Guidelines, though its application is documented in the Tag Library.(86)
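To show that ordinary component markup carries enough structure for a tabular display, the following sketch derives table rows directly from <c01> elements without any <drow> or <dentry> tagging. Python is used here in place of a CSS or XSL stylesheet, and the sample data, column choice, and function names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

DSC = ET.fromstring(
    '<dsc type="in-depth">'
    '<c01><did><container type="box">1</container>'
    "<unittitle>Correspondence</unittitle><unitdate>1975-1980</unitdate></did></c01>"
    '<c01><did><container type="box">2</container>'
    "<unittitle>Diaries</unittitle><unitdate>1981</unitdate></did></c01>"
    "</dsc>"
)


def dsc_to_rows(dsc):
    """Return one (box, title, date) tuple per <c01>; missing values become ''."""
    return [
        (
            did.findtext("container", default=""),
            did.findtext("unittitle", default=""),
            did.findtext("unitdate", default=""),
        )
        for did in dsc.iterfind("c01/did")
    ]


def rows_to_html(rows):
    """Lay the rows out as a simple HTML table."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


print(rows_to_html(dsc_to_rows(DSC)))
```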
Should you wish to invoke tabular layout, you must alter the section of the ead.dtd file headed "<!-- TABULAR DSC INCLUSION/EXCLUSION -->" in the following two ways:
Section 5.4 discusses the effects that issues such as changing file names, file directory structure schemes, or Web site locations have on the publication process and suggests options (such as an SGML catalog, file handlers, and PURLs) for dealing with them. Good file management, however, begins during the authoring process with the systematic assignment of a standard naming protocol for files and a logical directory structure for organizing files on your computer. You also will need some type of system (a file, an index, or even a database) that tracks the names that have been used and associates them with a unique and meaningful description of the collection represented by the electronic version. This is necessary both to ensure administrative sanity and to enable the proper functioning of systems for user indexing, display, and retrieval.
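A tracking system of this kind need not be elaborate. The sketch below shows a minimal in-memory registry in Python that enforces a naming protocol and refuses to assign the same name twice. The naming pattern (up to eight lowercase letters or digits plus an extension) and the class and method names are assumptions made for the example; each repository will set its own conventions, and a working system would store its entries in a file or database.

```python
import re

# Assumed local convention: lowercase letters/digits, then ".xml" or ".sgm".
NAME_PATTERN = re.compile(r"^[a-z0-9]{1,8}\.(xml|sgm)$")


class FileRegistry:
    """Track which finding aid file names are in use and what they describe."""

    def __init__(self):
        self._entries = {}

    def is_available(self, filename):
        """True if the name follows the protocol and is not yet assigned."""
        return bool(NAME_PATTERN.match(filename)) and filename not in self._entries

    def register(self, filename, description):
        """Assign a file name to a collection description."""
        if not self.is_available(filename):
            raise ValueError(f"{filename!r} is malformed or already assigned")
        self._entries[filename] = description

    def describe(self, filename):
        """Return the collection description recorded for a file name."""
        return self._entries[filename]


registry = FileRegistry()
registry.register("ms0042.xml", "Smith Family Papers, 1975-1997")
print(registry.describe("ms0042.xml"))
print(registry.is_available("ms0042.xml"))
```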
Use of the <revisiondesc> subelement within <eadheader> may be useful in this regard (see section 184.108.40.206 for additional information). Documentation of the processes that you develop for encoding will also be helpful.
The Library of Congress