This option requires space on a Web site where a repository can load its files and the technical training necessary to design and create HTML-encoded pages.
Advantages: This approach is relatively easy to accomplish and inexpensive to carry forward if you have access to a Web site.
Disadvantages: The only information that the user initially has about the contents of the collection is the context that is provided on the page that contains the link to it. This may make the process of selecting relevant finding aids somewhat tedious.
The USMARC format includes field 856, which is used to record a citation to any external electronic resource, such as an EAD-encoded finding aid that is related to the collection described in that MARC record. Field 856 may include a uniform resource locator (URL) for that resource. Web interfaces to online catalogs render the URL as a hyperlink, which, when selected, displays the associated file, in this case a finding aid, in the user's browser. Repositories that use a Uniform Resource Name (URN) can record their globally unique identifier as a "handle" in the 856 field.
856 42 $3 finding aid $d eadpnp $f pp996001 $g urn:hdl:loc.pnp/eadpnp.pp996001 $u http://hdl.loc.gov/loc.pnp/eadpnp.pp996001
856 42 $3 An electronic version of the inventory for this collection may be found at $u http://www.mnhs.org/library/findaids/00020.xml
Figure 5.2.2a. Two examples of the USMARC 856 field. The first contains a URN ($g) followed by a URL ($u) and the second only a URL ($u).
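As an illustration of how a Web interface might pull the link out of such a field, the following minimal Python sketch splits the displayed form of an 856 field into its subfields. The function name parse_856 and the returned dictionary layout are assumptions made for this example, and the sketch handles only the "$x value" display convention used in the figure, not the binary MARC communications format.

```python
import re


def parse_856(field: str) -> dict:
    """Split a displayed 856 field into tag, indicators, and subfields."""
    tag, indicators, rest = field.split(" ", 2)
    # Split on "$x " markers; the capture group keeps each subfield code.
    parts = re.split(r"(?:^|\s+)\$(\w)\s+", rest)
    subfields = {}
    for code, value in zip(parts[1::2], parts[2::2]):
        # A subfield code may repeat, so values are collected in a list.
        subfields.setdefault(code, []).append(value.strip())
    return {"tag": tag, "indicators": indicators, "subfields": subfields}


record = parse_856(
    "856 42 $3 finding aid $d eadpnp $f pp996001 "
    "$g urn:hdl:loc.pnp/eadpnp.pp996001 "
    "$u http://hdl.loc.gov/loc.pnp/eadpnp.pp996001"
)
# The $u value is what a Web catalog would render as a hyperlink.
link = record["subfields"]["u"][0]
```

A catalog interface would typically use the $3 text as the visible label of the hyperlink and the $u value as its target.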
Advantages: The technical requirements for this approach are relatively modest and inexpensive if your online catalog has a Web interface. You will need to modify the relevant MARC record by adding the appropriate linking data in the MARC 856 field. Additionally, you will need space on a Web-accessible server to mount the EAD files.
This approach leverages existing indexing in the online catalog that captures information about the context and content of each collection. It mimics existing two-step finding aid systems in which the researcher first queries the catalog for broad name, place, or topical indexing of the collection, and is then directed to a finding aid in a separate file for more detailed information. With the introduction of MARC catalogs into many repositories, the first part of this process (consulting the online catalog) was automated. With online access to EAD files, this becomes an all-electronic process as the MARC record is linked dynamically to the online finding aid.
Disadvantages: This scenario presumes the availability of both MARC records for collections and a Web-accessible online catalog. The cost of creating the former or installing the latter just to make finding aids accessible may be quite expensive. It also creates an ongoing cost in the maintenance of the electronic links between the catalog entry and the EAD file (see section 5.4 for a discussion of file management issues).
While using the catalog as a gateway to one's finding aids provides many access points into the collection, users still cannot initially search the full text of the finding aid; whether this is a weakness or an advantage is a matter of perspective. One point of view holds that this is less than fully satisfactory, since full searching of the rich content of one or many finding aids is not possible. The other side argues that the summary catalog description, with access terms constructed in standardized forms, performs a useful filtering function, limiting search results to those aspects of the collection deemed sufficiently important to appear in a synopsis. For many reasons, this may be preferable to being bombarded with a large number of irrelevant hits such as a full-text search of large finding aid files may bring. The results of such a query may resemble the consequences of attempting to drink a sip of water from a gushing fire hydrant.
The applications that fall into this broad group offer so many different features that they are difficult to categorize precisely. They include basic Web authoring and distribution products that offer search and retrieval capabilities, as well as complex, feature-rich document management systems, usually marketed to large corporations, that have numerous files and sophisticated publishing requirements. Many are quite expensive. Nevertheless, such systems may be suitable for large institutions with extensive holdings of electronic texts, or for consortia that share online services. Additionally, several firms offer substantial educational discounts, which can greatly reduce a repository's software costs.
The Enigma search engine from Insight, Inc., the InQuery search engine from the Center for Intelligent Information Retrieval, the BASIS text database from Open Text, and the Dual Prism software suite from AIS Software offer electronic publishing for SGML and XML files. Other applications include document management software such as the Livelink system from Open Text; Live Publish from the Folio Products Division of Open Market; Arbortext's Epic system; and INSO's DynaText family of products, which includes DynaTag and a Web publishing component called DynaWeb.
To provide access to collections in this manner, an archives must install the software that will index the files and format them for display on a Web-accessible server (see section 5.3 for information on the use of stylesheets in formatting documents). Each of these products includes a search interface that is partially to fully customizable by the archives. In a typical query, the user enters search terms that are passed to the server. The server executes the request and returns the results to the user's browser, typically listing brief information about each relevant collection. The user may then select and download the desired finding aids one at a time. This process is analogous to the brief listings of book titles that many online catalogs deliver in response to a search that results in multiple hits.
Advantages: The chief virtue of this method is that users can search simultaneously the full text of many finding aids, either from a particular repository or from multiple institutions, as in a union catalog. A query may reveal information about some portion of a collection that is significant to a particular researcher, but that does not constitute a sufficiently large part of the collection to have been noted in a summary MARC catalog record; many finding aids contain a wealth of information about subject content that the researcher can mine in this way. The search interface may be structured to permit retrieval based on specific content markup, such as enabling a search for all data encoded in a <corpname> element by prompting users to limit a search to "names of organizations." This approach provides easy, integrated access to many collections.
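The difference between a full-text query and one limited to a specific element can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two finding aids are invented and heavily simplified, the search function is hypothetical, and a production search engine would index the files rather than parse them for every query.

```python
import xml.etree.ElementTree as ET

# Two tiny, made-up finding aids (hypothetical content, simplified EAD).
FINDING_AIDS = {
    "aid-001.xml": (
        "<ead><archdesc level='collection'><did>"
        "<unittitle>Company Records</unittitle>"
        "<origination><corpname>Acme Mining Company</corpname></origination>"
        "</did></archdesc></ead>"
    ),
    "aid-002.xml": (
        "<ead><archdesc level='collection'><did>"
        "<unittitle>Smith Family Papers</unittitle>"
        "</did><scopecontent><p>Includes letters mentioning the Acme mine."
        "</p></scopecontent></archdesc></ead>"
    ),
}


def search(aids, term, element=None):
    """Return names of finding aids containing term, optionally only inside element."""
    term = term.lower()
    hits = []
    for name, xml_text in aids.items():
        root = ET.fromstring(xml_text)
        # With no element given, search the text of the whole document.
        nodes = root.iter(element) if element else [root]
        if any(term in "".join(n.itertext()).lower() for n in nodes):
            hits.append(name)
    return hits


full_text_hits = search(FINDING_AIDS, "acme")
organization_hits = search(FINDING_AIDS, "acme", element="corpname")
```

Here the full-text query matches both files, while the query limited to "names of organizations" matches only the finding aid in which Acme is encoded as a <corpname>.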
Disadvantages: Search engines are relatively expensive to acquire and require advanced computing skills to program and maintain. As mentioned in section 5.2.2, your ability to perform in-depth queries may increase recall but will certainly decrease the precision of many searches. The nature of the query interface, the type and level of indexing, and the presentation of the results set are additional theoretical and practical concerns that have not been clearly developed at this early stage in the implementation of EAD. More experience with retrieval issues will clarify user understanding of and requirements for both the search interface and the display of results, which will, in turn, help to refine these applications.
Before learning about various stylesheet formats, we will examine how a browser might process text files encoded in SGML, XML, or HTML in order to display them correctly. In a typical scenario, the browser initially reads the encoding scheme of the document and interprets its structure as a tree, in which the document element (see figure 6.2.1b), such as <ead> or <html>, is the base of the tree. For example, the tree for an EAD document will have two or three major branches (<eadheader>, the optional <frontmatter>, and <archdesc>). The <archdesc> branch subdivides into <did>, <scopecontent> and other branches, which in turn further subdivide until one finally comes to the "nodes" or leaves at the end of each branch, where the textual content of the finding aid is found. For an example of how such a tree might be graphically represented, examine the Windows Explorer feature of the Windows operating system, which displays the hierarchical and nested relationships of the drives, directories, subdirectories, and files on your computer.
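The tree described above can be made visible with a short Python sketch that parses a small XML instance and prints its branches as an indented outline. The sample document is invented and deliberately incomplete (it would not validate against the EAD DTD), and the outline function is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

# A deliberately small, made-up instance; not valid against the EAD DTD.
SAMPLE = """<ead>
  <eadheader><eadid>example-001</eadid></eadheader>
  <archdesc level="collection">
    <did>
      <unittitle>Example Family Papers</unittitle>
      <unitdate type="inclusive">1850-1900</unitdate>
    </did>
    <scopecontent><p>Correspondence and diaries.</p></scopecontent>
  </archdesc>
</ead>"""


def outline(node, depth=0):
    """Return one indented line per element; leaf text follows the element name."""
    text = (node.text or "").strip()
    lines = ["  " * depth + node.tag + (f": {text}" if text else "")]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(SAMPLE))
print("\n".join(tree_lines))
```

The document element <ead> appears at the left margin, each branch is indented beneath its parent, and the textual content shows up only at the leaves.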
Once the tree has been built, the browser compares the file to its DTD to ensure conformance. A presentation engine in the browser then renders the text of each node for display, controlling properties such as screen placement and the size, type, and color of the font. In doing this, it follows certain rules embedded in a stylesheet.
The formatting rules (or stylesheet) for HTML documents are hardwired into the browser; in other words, they are included as part of the programming of the browser software. The display of each HTML element in Navigator or Internet Explorer is thus determined in advance by Netscape or Microsoft (each in a slightly different way). This is feasible because there are only about 80 HTML tags whose display must be predefined.
The default display may be overridden, however, by an external stylesheet file that causes the presentation engine in the browser to display the document in a different way. Such external stylesheets are optional for HTML files but are required when the document sent to the browser is encoded in XML or SGML; this is because the formatting of elements in these schemes is not built into the browser. Indeed, this formatting cannot be preordained, since the number of elements that could be defined by current and future DTDs in SGML or XML is virtually unlimited. A stylesheet therefore must be employed.
To deliver SGML manifestations of EAD-encoded finding aids on the Web, you must mount your EAD files on a Web-accessible server, along with the necessary stylesheets (87) and navigators and the EAD DTD files. In turn, users must have loaded either the Panorama or MultiDoc Pro software on their computers and have properly configured their browser to work with it. EAD documents are matched with specific stylesheets either through an association provided by a catalog file on the server (see section 6.5.2.4.1) or by a processing instruction embedded in each finding aid that points to the relevant stylesheet file, also stored on the server. Processing instructions (PIs) are an SGML device for inserting into a document information that is intended for processing by a proprietary software application rather than by a parser.
Advantages: Panorama and MultiDoc provide a very effective presentation of the finding aid, including a useful navigator feature that provides a visual road map of the document, enhances user understanding of the collection, and aids in sophisticated searching that is built into the software. The presence of the entire document, with its full SGML structural encoding, on the user's computer permits fast and powerful searching based on content markup, as well as speedy navigation through the document once it has been downloaded.
Disadvantages: Unfortunately, unlike many other Web viewers and plug-ins, neither of these applications is free; each must be purchased by any user who wishes to display your finding aids. Since casual users may be unwilling to spend the time or money necessary to acquire the software, the usefulness of this scenario is diminished for general Web distribution. It may be more feasible in closed environments, such as a single archives, library or campus that can supply all users with the viewer (which could then be used not only for viewing finding aids, but also other SGML-encoded documents such as scholarly texts). Additionally, although the stylesheet language employed by each product is accessible through a robust editor, both are proprietary. Style specifications developed for this publishing environment will not be transferable to others.
Advantages: As with SGML, the browser can take full advantage of the structural markup in EAD to effect fast and powerful searches of the document's content. Unlike the SGML scenario, however, no helper application is required because all the required functionality is included in the standard Web browser. The end user needs no special software.
Disadvantages: The chief drawback to this approach is that, at the time these Guidelines were written, XML functionality was as yet available only in Internet Explorer 5.0, although Netscape is building XML capability into the next release of its Navigator browser. Users with older browsers may use Panorama Publisher or MultiDoc Pro as a helper application to display XML documents in the same way that they process SGML files. It will be a number of years before a critical mass of Web users will be using newer browser versions that can read XML files directly. Until that time, archives will need to provide an alternative delivery method, probably in HTML.
The experience of libraries and archives in disseminating MARC-encoded catalog records provides an informative analogy. While such institutions continue to appreciate the many virtues of creating, storing and searching catalog records in MARC format, they have quickly embraced Web interfaces to their online catalogs that deliver records to users in HTML format. No one has seriously suggested, however, that use of MARC be discontinued and that catalog data be created directly in HTML, since the result would be loss of the ability to search by specific types of data such as author, title, and subject. Such a move would seriously cripple user searching of collections, as well as the long-term viability of the catalog records as data rather than as undifferentiated text.
You can achieve the best of two worlds by encoding your finding aids in EAD and then using HTML as the vehicle for publishing them. You accomplish this by converting the markup of the finding aid from the EAD encoding scheme into HTML syntax. This process, technically referred to as "transformation," may happen at the repository in either of two ways.
Several of the publishing systems described earlier, including DynaWeb and Dual Prism, can generate HTML versions of a finding aid in real time at the moment that the user requests the file; this is known as dynamic transformation. Custom scripting, in programming languages such as Perl, works in conjunction with SGML-aware search engines to generate the HTML version "on the fly," with the script acting as a stylesheet to map data from one tag set to the other. Another software option in this category is Microsoft's Web server software (IIS), which can use a stylesheet written in the XSL language to transform an XML file into HTML, and then send the file out to a reader using its Active Server Page technology. As XML tools mature, such transformation into HTML may occur on both the user's computer and the repository's server.
Alternatively, a finding aid may be rendered into HTML code by the repository and stored on its Web server before any user requests the file. Currently employed conversion techniques include word processing macros, scripts written in Perl, and transforming software such as the Microsoft XSL Processor. The Internet Archivist authoring software has a built-in SGML to HTML converter. The problem with such a priori transformation is that one loses some of the functionality of the stylesheet. If a change in the document structure is required, the SGML-encoded master copy is updated, and the HTML version is regenerated. In this scenario, therefore, each finding aid must be individually reprocessed. With dynamic transformation, on the other hand, the results that the user gets on the browser reflect the most up-to-date version of the format without requiring that the individual documents be edited.
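A script that acts as a stylesheet by mapping one tag set to the other can be sketched as follows. This Python example is an assumption-laden illustration rather than any of the tools named above: the TAG_MAP table, the to_html function, and the sample finding aid are all hypothetical, and only a handful of EAD elements are mapped.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical mapping from EAD element names to HTML element names.
TAG_MAP = {"unittitle": "h1", "unitdate": "p", "p": "p"}

# A made-up, simplified finding aid used as the master copy.
MASTER = (
    "<ead><archdesc level='collection'><did>"
    "<unittitle>Example Family Papers</unittitle>"
    "<unitdate type='inclusive'>1850-1900</unitdate>"
    "</did><scopecontent><p>Letters &amp; diaries.</p></scopecontent>"
    "</archdesc></ead>"
)


def to_html(ead_text):
    """Render mapped EAD elements as HTML; unmapped elements are skipped."""
    root = ET.fromstring(ead_text)
    parts = ["<html><body>"]
    for node in root.iter():  # document order
        html_tag = TAG_MAP.get(node.tag)
        text = (node.text or "").strip()
        if html_tag and text:
            parts.append(f"<{html_tag}>{escape(text)}</{html_tag}>")
    parts.append("</body></html>")
    return "\n".join(parts)


page = to_html(MASTER)
```

Calling to_html once and saving the result corresponds to the stored, a priori approach, in which every finding aid must be reprocessed whenever TAG_MAP changes; calling it at the moment of each request corresponds to dynamic transformation. Note that the output no longer distinguishes a date from an ordinary paragraph, which is the loss of content markup that HTML delivery entails.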
Advantages: Delivering finding aids as HTML solves the immediate problem that not all users can currently read SGML or XML files. HTML documents are accessible on any browser without additional effort by the researcher. Because standard HTML tags are used, no additional stylesheet need be generated.
Disadvantages: Unless the access and retrieval environment permits the user to search the original EAD-encoded document, the value of structured searching is lost. This searching limitation will exist as long as the file on the user's computer contains only the presentation markup of HTML. A less significant potential disadvantage is that staff will have to know both HTML syntax and the transformation language employed in order to implement this delivery option. Storing both an SGML or XML source file and an HTML presentation file for each encoded finding aid will also increase, and perhaps double, the file storage space required on your server. Maintaining and updating two versions of each document is an additional expense, one that complicates both file management and processing workflows.
While your repository will have many different finding aids, you probably will need only a few stylesheets, one for each style of finding aid that you produce (such as one for small collections and another for more complex finding aids, or one for paper-based collections files and another for microfilms). Considerable potential exists for cross-institutional sharing in the development and use of stylesheets, with repositories adopting, or borrowing and modifying, existing ones from a shared pool of models. The resulting standardization in finding aid appearance, both within and across repositories, might well enhance user comprehension and interpretation of these complex information tools; such sharing also would simplify the finding aid distribution process. Sharing of stylesheets would mandate, however, substantial agreement within and across archives as to the format in which finding aids are to be encoded and displayed. While this obviously involves decisions relating to layout on the screen or page, the inclusion or omission of particular EAD elements also will affect such sharing, especially of legacy data.
In May 1998, the more robust Level 2 version was approved by the World Wide Web Consortium (W3C) as an official Web Recommendation. Microsoft and Netscape both promise full support in their next software releases for Level 2, which features substantially richer formatting capabilities such as tables, as well as specifications for the output of print and screen displays.
Its functionality is straightforward. Once the browser creates the document tree, it applies CSS styles to elements in the order in which they appear in the document. These styles may be applied to either XML or HTML documents, either by embedding the styling specifications directly in the document, or by linking the encoded file to a separate stylesheet file via an HREF link or a processing instruction in the finding aid.
Supporters of XSL describe it as a more robust styling language than CSS, one that is intended to be employed in more complex presentation situations. Certainly XSL's pattern matching and formatting syntax is more sophisticated than CSS, though with a concomitant penalty in complexity. In addition to its styling functionality, XSL may also function as a transformation agent for the conversion of data from one syntax to another.
XSL also applies styles in a different way than CSS. Once the document tree has been constructed in the processor, XSL creates a second tree, the output tree; hence, the structure of the output can be different than that of the source. For example, you might decide to display the location <physloc> of an item before its title <unittitle>, even though <physloc> follows <unittitle> in the EAD instance. An XSL stylesheet can simply reorder the elements in the output tree without any alterations to the source document; styles are then applied to the output tree. This property of XSL also provides the capability to repeat the same data in two different parts of the display, such as by extracting headings to create a separate table of contents while also presenting the headings in situ throughout the finding aid. XSL accomplishes display either by using its own detailed format object specifications or by using the simpler display language of HTML. None of this is possible using CSS, which applies formatting directly to the document tree.
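The idea of a separate output tree can be imitated in Python to show why reordering leaves the source untouched. This is a sketch of the concept only, not an XSL processor: the build_output_tree function, the use of HTML-like div and span elements, and the sample data are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up source fragment in which <physloc> follows <unittitle>.
SOURCE = "<did><unittitle>Diaries</unittitle><physloc>Box 4</physloc></did>"


def build_output_tree(did):
    """Build a new tree that lists the location before the title."""
    out = ET.Element("div")
    for tag in ("physloc", "unittitle"):  # desired display order
        for node in did.findall(tag):
            span = ET.SubElement(out, "span", {"class": tag})
            span.text = node.text
    return out


source = ET.fromstring(SOURCE)
output = build_output_tree(source)
rendered = ET.tostring(output, encoding="unicode")
source_order = [child.tag for child in source]
```

The rendered output places "Box 4" ahead of "Diaries", while source_order still reports the elements in their original sequence.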
The relatively long approval schedule for XSL has not stifled the development of application software, including incorporation of XSL into Microsoft Internet Explorer 5.0. Several experimental tools are available, including XT and Jade from James Clark, the Koala XSL engine, and Microsoft's MSXSL "technology preview."
The following examples show how stylesheet rules would be written in various languages to define the style for the display of the inclusive dates of the records in a finding aid. The stylesheet instructions specify that the dates of the archival materials are to appear on a separate line, in 12-point Times New Roman, colored navy, and prefaced by the text "Dates: ".
The first example uses an XSL rule to define display by use of XSL format objects: (92)
<xsl:template match="archdesc[@level='collection']/did/unitdate[@type='inclusive']">
  <fo:block color="navy" font-size="12pt" font-family="times new roman">Dates: <xsl:apply-templates/></fo:block>
</xsl:template>
The second example uses an XSL rule to define display through the use of HTML formatting conventions. Technically, this is an XML to HTML transformation, since an XSL processor will generate HTML-encoded output as the result of applying this rule: (93)
<xsl:template match="archdesc[@level='collection']/did/unitdate[@type='inclusive']">
  <P><FONT color="navy" face="times new roman" point-size="12">Dates: <xsl:apply-templates/></FONT></P>
</xsl:template>
The third example utilizes the conventions of the Cascading Style Sheets Level 2 specification:
archdesc[level="collection"] > did > unitdate[type="inclusive"] {display: block; color: navy; font-family: "Times New Roman"; font-size: 12pt}
archdesc[level="collection"] > did > unitdate[type="inclusive"]:before {content: "Dates: "}
SGML printing applications typically may be used with any SGML instance and are not restricted to files created by related authoring tools from the same vendor, though the two software packages may be closely bundled. Microsoft's SGML Author for Word can convert any SGML document into a Word file, in the same way that it executes a conversion in the opposite direction from Word to SGML. Stylesheet languages and other processors may also be utilized. Current applications include use of the DSSSL standard to generate print output from SGML applications. The capacity to control printing is included in both the CSS and XSL languages, though no implementations of either have yet appeared.
Institutions that create HTML manifestations of their EAD documents might use the HTML file as a source of print copy. This would not be done through the print function of the browser, which typically has limited formatting capabilities, but by importing the HTML file into a word processor (see section 5.3.2.3 for more information on use of HTML files). Current versions of Word, for example, can import HTML documents, remove the tags, and convert the results into a word processing format. Lacking more robust solutions, one can simply remove the tags from the ASCII SGML file and format the document manually. Freeware programs are available in Perl that will strip out the markup from SGML documents. (94) The Internet Archivist authoring software can generate a simply formatted ASCII text output of an EAD file as well. Other authoring tools such as Author/Editor, ADEPT Editor, and XMetaL, as well as the browser plug-in Panorama Publisher, also can produce nicely printed copies of encoded finding aids.
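The tag-stripping approach can be sketched in a few lines. The Python function below is a rough illustration and not one of the Perl programs mentioned above; strip_markup is a hypothetical name, the sample fragment is invented, and the regular expressions do not handle every SGML construct (a DOCTYPE declaration with an internal subset, or entities declared only in a DTD, would need additional handling).

```python
import re
from html import unescape


def strip_markup(text):
    """Remove comments and tags, resolve common entities, and tidy whitespace."""
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)  # comments
    text = re.sub(r"<[^>]+>", " ", text)  # start tags, end tags, simple declarations
    text = unescape(text)  # resolves standard entities such as &amp;
    return re.sub(r"\s+", " ", text).strip()


# Made-up fragment of an encoded finding aid.
sample = (
    "<did><!-- summary --><unittitle>Smith &amp; Jones Papers</unittitle>\n"
    '<unitdate type="inclusive">1850-1900</unitdate></did>'
)
plain = strip_markup(sample)
```

The result is unformatted running text, so headings, indentation, and page layout would still have to be supplied by hand in a word processor.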
Standardized file-naming conventions for internal storage, as well as techniques such as file-handling databases and purl resolvers or handle servers for maintaining persistent file names on the Internet, are critical both to the long-term sanity of the program administrator and to the accessibility of the files over time. Even the novice Web user has encountered the frustrating phenomenon of selecting a hyperlink on a Web page and receiving in return a message that the file could not be found. While there are many potential causes of this problem, one of the most frequent is that the creator has moved the file to another computer location without updating all links, resulting in broken connections.
A single online directory may hold many files, and therefore one promising solution is the creation of an index or other third-party device that stores the server and directory location of multiple files. This limits the information contained in any given hypertext link to a reference to a single persistent location (a server), where the storage details of many files are kept and may be updated simultaneously as filing systems change. This is made possible in SGML via a convention called the SGML catalog. Because images, text and other documents may be declared as entities and referred to by name rather than by address in an SGML document, it is possible to store the details about the actual computer location of these entities in an external and centralized catalog file. Unfortunately, XML has not incorporated this feature, but rather requires that all entities include both a relative entity name and a specific address in the form of a Uniform Resource Identifier (URI).
Other solutions might be employed, including purl resolvers and handle servers, both of which work in essentially the same manner.
The purl (persistent uniform resource locator) mechanism utilizes software developed by and freely available from OCLC,(95) and it functions in the following manner. When creating links in Web documents, the author embeds a purl in the document instead of using a conventional uniform resource locator (URL). The purl contains the Internet address of the purl server and a unique name for the document, image or other object that is referenced, instead of the document's absolute Internet address. When selected, the link sends a message to the resolver, which stores a full address for the external object and redirects the query to that location. In this way absolute Internet addresses are maintained on the resolver, where mass updates of information such as server and directory name changes are possible, rather than embedded in individual files, where any alterations in addressing would require the editing of many individual documents.
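The resolution step can be modeled with a simple lookup table. The Python sketch below illustrates the principle only and does not represent the OCLC software: the PurlResolver class, its method names, and the example.org addresses are all hypothetical.

```python
class PurlResolver:
    """Toy resolver: maps persistent names to current absolute addresses."""

    def __init__(self):
        self._targets = {}

    def register(self, name, url):
        """Record the current address of a named object."""
        self._targets[name] = url

    def resolve(self, name):
        """Return the address to which a request for this name is redirected."""
        return self._targets[name]

    def relocate(self, old_prefix, new_prefix):
        """Mass update after a server or directory change; returns the count changed."""
        changed = 0
        for name, url in list(self._targets.items()):
            if url.startswith(old_prefix):
                self._targets[name] = new_prefix + url[len(old_prefix):]
                changed += 1
        return changed


resolver = PurlResolver()
resolver.register("findaids/00020", "http://old.example.org/ead/00020.xml")
resolver.register("findaids/00021", "http://old.example.org/ead/00021.xml")

# The files move to a new server; no finding aid or catalog record is edited.
updated = resolver.relocate("http://old.example.org/", "http://new.example.org/")
current = resolver.resolve("findaids/00020")
```

Documents that embed the name "findaids/00020" continue to work after the move, because only the resolver's table was changed.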
This approach does, however, have limitations. The purl software from OCLC runs only on a variety of UNIX platforms. Moreover, the purl naming convention does not effectively support links to specific locations within a document, only to the document as a whole.
Handles are one implementation of Uniform Resource Names (URNs) developed by the Corporation for National Research Initiatives (CNRI). (96) Handles are universally unique identifiers that are registered with a "naming authority," much in the same way that ISBNs are currently distributed and registered for published books. When a handle is used as a resource address in an encoded document, a handle server must resolve the handle, or unique identifier, into an actual address at which the desired resource can be located. Handles and handle servers are a relatively new development, and archivists interested in using this approach may obtain further information on The Handle System Web site.
The Library of Congress