Carl Fleischhauer

Library of Congress

Carl Fleischhauer holds degrees from Kenyon College and Ohio University. His work experience includes film and video production at West Virginia University and media documentation activities at the American Folklife Center at the Library of Congress. His publications include the videodisc The Ninety-Six: A Cattle Ranch in Northern Nevada (Library of Congress, 1985) and the photographic book Documenting America, 1935-1943 (University of California Press, 1988). Today Fleischhauer coordinates the American Memory project at the Library of Congress. American Memory is a program to disseminate historical collections in digital form.

Organizing Digital Archival Collections: American Memory's Experiences with Bibliographic Records and Other Finding Aids

Frameworks and Finding Aids

I. Introduction

During the five-year American Memory pilot that ended on September 30, 1994, my Library of Congress colleagues and I wrestled with dozens of issues. The first part of this paper reports on one of these issues--perhaps I should say family of issues--for which I can offer no perfect name. The family includes (or partially includes) elements that we call catalog, finding aid, or bibliographic records. But for us, this domain also extended to collection organization and the tangible work of collection processing, inventory, and control.

My use of the nouns collection and processing in the preceding paragraph reveals that American Memory concerned itself with archival collections. The formats we treated included manuscripts, sound recordings, prints and photographs, motion pictures, printed matter, and combinations thereof. Although a number of the pilot-period collections include monographs, thus far these have been presented as parts of larger groupings.

The collections have not only been varied in terms of the original formats but also in terms of structure. In some cases, we worked with the Library's custodial divisions to reproduce a complete, pre-existing entity: all 4,000 of the Prints and Photographs Division's panoramic photographs, say, or every document in the Manuscript Division's Federal Writers' Project Folklore Life Histories--a complete series within a larger manuscript collection.

Sometimes, considerations of cost or the level of collection readiness led us to digitize representative cross-sections of larger collections: 1,118 of about 4,000 Civil War photographs or 116 works selected from the 900 items in the Rare Book and Special Collections Division's National American Woman Suffrage Association library.

On three or four occasions, usually when we received a grant, we ourselves selected a body of material. For example, Laurence Rockefeller funded the assembly of an anthology of several hundred documents in all media that trace the evolution of the conservation movement in America.

Collections in each of the three categories manifest a high degree of cohesion and coherence; the items they contain tend to be interrelated. Thus our construction of electronic editions required not only that we produce digital reproductions of individual items but also that we create frameworks that present each collection as a coherent entity. Some speakers would use the term finding aid [FOOTNOTE: Please note that my use of the term finding aid is intended to name any entity that helps a researcher locate material in a collection. It explicitly includes both bibliographic-record databases and registers of the sort used by archivists.] to name this framework; some might say that the finding aid is within the framework.

II. Architecture

My metaphor today is architecture. Individual items in electronic archival collections may be thought of as bricks or two-by-fours; the collections themselves are rooms within a larger edifice. Thus the frameworks mentioned in the last paragraph give shape and definition to rooms in a larger edifice. I'll return to this larger edifice in a moment.

What sort of frameworks have we constructed? At first, the American Memory team mocked up collections on CD-ROM disks in both Macintosh and IBM-compatible environments. The collection-specific frameworks on these CD-ROMs brought together prose texts and bibliographic records to create a hybrid of narrative and database that has a different look and feel from the typical library catalog.

Each CD collection led off with a title page whose typography and layout took advantage of the graphical capabilities of today's computers. This title page might better be called a table of contents in that it offers researchers menu-style access to both the finding-aid database and background texts: scope and content notes, chronologies, technical or editorial remarks, bibliographies, and the like. This title page and the associated background texts provide rich and complete collection-level information.

When a researcher selects the finding-aid database option from the collection title-page menu, he or she is provided with search-and-retrieval software that can search for words or terms in the bibliographic records or, for some collections, in the full texts of the items. In our experimental mood, we used a variety of search software.

The bibliographic records that make up our pilot-period finding aids vary tremendously. The records for the Woman Suffrage documents, for example, were prepared by the Rare Book division for distribution to the national bibliographic utilities. They are thorough, high-quality records that can stand on their own. In contrast, the records for the portrait photographs by Carl Van Vechten are skeletal. [FOOTNOTE: The Van Vechten records contain a brief title preceded by the phrase "Portrait of." To make the records work in the Prints and Photographs Division's general catalog, the catalogers also provided a subject heading with the name of the sitter and birth and death dates.] But once a researcher is "inside" the Van Vechten collection, the hybrid combination of brief bibliographic records and background texts provides an excellent framework for and access to the photographs.

Recently, we have joined the world in embracing the Internet, working with the Library's computer center to place collections on a World Wide Web server. Our Web presentations, like the CD-ROM mockups that preceded them, feature narrative texts and, thanks to Dean Wilder, the Library's principal World Wide Web programmer, also employ search-and-retrieval software. This combination permits us to continue to mix prose texts with searchable bibliographic records. (We did a little code-switching, however, and now call our title pages home pages, following the custom of the Web.)

All of these mockups were reasonably satisfactory to those of our colleagues who were willing to embrace the use of bibliographic records for a finding aid. But it went a little against the grain for our co-workers in units like the Manuscript Division, Music Division, and the American Folklife Center. "Bibliographic records are a procrustean bed," they said, asking, "Why can't we continue with our traditional finding aid: the register?" And there were two other ideas afoot: first, that it might be faster and easier to draft a register than to assemble a bibliographic-record database and, second, that a database represented overkill when item-description did not include subject access.

The term register, of course, names a directory to a collection, traditionally produced in printed form. The register guides researchers to the various series and subseries used to define the categories of material in the collection. Then--depending upon the materials at hand and the resources to apply to the task--the register goes on to list the collection's containers, folders, and sometimes even documents.

Our colleagues' prodding has started us designing online registers. This is easiest to imagine in the example of a collection of the papers of an individual. The online register's table of contents would provide access to scope and content notes, a chronology of the person's life, remarks on the papers' provenance, a statement on restrictions, and the like. The table of contents will also provide access to a menu from which the researcher may select a series of interest, then a subseries, and finally a list of containers or folders. Clicking on the name of a folder or a thumbnail image (representing the first document page in the folder) brings forth the digital reproduction: a set of facsimile images, a searchable text, or both.

Of course, an online register will be provided within a software package that permits a researcher to search for words across the whole register. If time permits, archivists can add controlled-vocabulary index or subject terms to permit searching by subject.

The basic tool for the creation of an online register is coded markup language, the very thing discussed by Susan Hockey, my predecessor at this podium. Susan Hockey and her Text Encoding Initiative (TEI) coworkers have been very helpful to American Memory; three years ago, they convinced us of the potency of markup schemes and American Memory adopted the TEI guidelines for the conversion of historical texts.

I would also like to tip my hat to Daniel Pitti from the University of California, Berkeley, and a participant in this meeting. Daniel Pitti has developed a set of very sophisticated models for online registers using Standard Generalized Markup Language (SGML). The preliminary American Memory mockups employ the hypertext markup language (HTML) associated with today's commonplace implementations of the World Wide Web, but we share Pitti's preference for SGML. [FOOTNOTE: We would not call the hierarchical access tool we have prepared for the American Memory WPA Federal Writers' Project collection (the only example that we have on the Internet at this writing) a register. The Writers' Project finding aid has a special, even eccentric, form due to certain exigencies that arose in the course of its preparation. We think it works fine, but are not proposing it as a model.]

Having heard how we assembled and framed our rooms, i.e., individual collections, you may ask about the larger edifice. What happens when a collection framed by a register coexists with a collection framed by a database? Shall we catalog the collection of collections? If so, would such a metacatalog provide research access via collection-level records, the full texts for home-pages, or by some other means? When a researcher is choosing a collection to consult, will he or she be able to take advantage of the "inside-collection" indexes derived from finding-aid bibliographic records, online registers, or the full texts of items within the collections? How easy will it be for researchers to compare and contrast items in different collections? How will our ensemble of archival collections play off against a typical OPAC catalog of monographs and serials, some of which may also be reproduced in digital form?

Our answers to these questions are very tentative; in fact, we look to meetings like this one for guidance. Less tentative is our sense that neither the Library of Congress nor any other institution will have the resources to construct perfect searchable entities. Fortunately, the world at large is constructing facile and sophisticated searching tools, and we look forward to the application of those tools to our complex and diverse materials.

III. A Few Specifics on Bibliographic Records

During the American Memory pilot and in parallel efforts carried out by the Prints and Photographs Division, Library staff members have experimented with bibliographic-record databases. To some degree, the records in these databases vary from those customarily produced for, say, monographs. I will warn that few of our experiments have been fully realized; this report is meant to provide a few snapshots of our activity.

Linking to the reproduction. The most obvious element to be added to a bibliographic record in a digital-content system is the link to the reproduction of the item described. For many years, we used a local Library of Congress field (938) as an "electronic call number," but recently we have moved toward use of the new MARC field 856 (Electronic Location and Access). In this field, we provide an identifier for the item, i.e., a number that is incorporated into the filenames for all of the reproductions of the item. (See the section below on text collections for elaboration of this point.) We also provide what we call a setname for directories of files that are stored together in the computer delivery system. In effect, a set is a directory that contains a number of related files.

It is worth noting that our 856 field content functions within the closed loop of a collection finding aid and that our digital reproductions are on the premises. This is a bit different from 856 field content in a record found in the catalog of library A that points to content on a server in library B. I should also note that the filenames for Library of Congress digital reproductions employ DOS naming conventions. This is because our digitizing contractors currently deliver materials to us in DOS media and we carry out our quality review on IBM-compatible hardware. (We hope to add other means of delivery in the future.)

Figure 1 is an example of use of the 856 field in the case of a single still photograph; figure 2 is an example in which the item is reproduced by multiple digital files.

FIGURE 1. Example of an 856 linking field for a single still photograph. Subfield d contains the setname; f contains the image filename.

 001 van94000092/PP
 005 19931124123618.0
 040 #ace#DLC#DLC#gihc
 050 00 #abu#Item in LOT 12735,#no.  90#
 100 1 #ade#Van Vechten, Carl,#1880-1964,#photographer.
 245 10 #ah#Portrait of Tallulah Bankhead #[graphic].
 260 #c#1934 Jan 25.
 300 #ab#1 photographic print :#silver gelatin.
 500 #a#Van Vechten no.  XXXII E 16.
 530 #a#Use reference surrogate in Prints and Photographs Reading Room. 
 541 #cad#Gift,#Carl Van Vechten Estate,#1966. 
 600 10 #ad#Bankhead, Tallulah,#1902-1968. 
 655 7 #a2#Portrait photographs.#gmgpc
 755 #a2#Silver gelatin prints.#gmgpc
 773 0 #tw#Carl Van Vechten Photograph Collection (Library of Congress)#(DLC) 89715440
 856 #3df#original#LCPP005A#5A51689
 985 #ae#pp/van#ammem 

FIGURE 2. Example of an 856 linking field for a multipart still photograph, reproduced by multiple digital files. Subfield d contains the setname; f contains the filename for the first image; g contains the filename for the last. The system assumes that all filenames falling between these two are to be displayed.

 040 #aDLC #cDLC #egihc
 050 00 #aPAN US GEOG - New Jersey, no.  6 #u(F size) 
 017 #aJ131241 #bU.S. Copyright Office
 245 00 #aAtlantic City, N.J. from Lawrence Captive Airship, 800 feet above boardwalk, 1909 #h[graphic]. 
 260 #c1909. 
 300 #a1 photographic print :  #bsilver gelatin ; #c19.5 x 50 in. 
 500 #aCopyright claimant's address:  Chicago. 
 500 #aNo.  1.
 541 #cCopyright deposit; #aGeo.  R. Lawrence Co.; #dSeptember 1, 1909. 
 530 #aUse surrogate on videodisc available in the Library of Congress Prints and Photographs Reading Room. 
 650 7 #aWaterfronts.  #2lctgm
 650 7 #aBeaches.  #2lctgm
 650 7 #aPiers & wharves.  #2lctgm
 655 7 #aCityscape photographs.  #2gmgpc
 710 2 #aGeo.  R. Lawrence Co., #ecopyright claimant. 
 752 #aUnited States #bNew Jersey #dAtlantic City. 
 755 #aPanoramic photographs.  #2gmgpc
 755 #aAerial photographs.  #2gmgpc
 755 #aSilver gelatin prints.  #2gmgpc
 773 #tPanoramic photographs (Library of Congress) #w(DLC) 93845487
 852 #aDLC #bP&P
 856 #3original #dLCPP006A #f6A19567 #g6A19572
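
The first-and-last filename convention described for figure 2 can be illustrated with a short Python sketch. This is not the code used in the Library's delivery system. It assumes the "#" subfield delimiter shown in figure 2, and it assumes that the first and last filenames share a prefix and end in a run of digits that counts up by one. The function names parse_subfields and expand_range are invented for this illustration.

```python
import re


def parse_subfields(field_body: str) -> dict:
    """Split a field body such as '#3original #dLCPP006A' into {code: value}.

    Assumes '#' introduces each subfield and the next character is its code,
    as in the figure 2 display. Repeated subfield codes are not handled.
    """
    subfields = {}
    for chunk in field_body.split("#")[1:]:
        if chunk:
            subfields[chunk[0]] = chunk[1:].strip()
    return subfields


def expand_range(first: str, last: str) -> list:
    """List every filename from first to last, inclusive.

    Assumption (not stated in the source): both names share a prefix and
    end in digits of equal width, e.g. 6A19567 through 6A19572.
    """
    m_first = re.fullmatch(r"(.*?)(\d+)", first)
    m_last = re.fullmatch(r"(.*?)(\d+)", last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError("filenames do not share a prefix and a numeric tail")
    prefix, width = m_first.group(1), len(m_first.group(2))
    start, end = int(m_first.group(2)), int(m_last.group(2))
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


if __name__ == "__main__":
    sub = parse_subfields("#3original #dLCPP006A #f6A19567 #g6A19572")
    for name in expand_range(sub["f"], sub["g"]):
        print(sub["d"], name)
```

For the record in figure 2, this yields six filenames within the set LCPP006A. A single-image record like the one in figure 1 has no subfield g and would need no expansion.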

Things get a little more complicated for printed matter and manuscript collections, especially when the reproductions consist of both an image set and searchable text. Here again, we use an identifier, a number that is incorporated into the filenames for all of the reproductions associated with a given item. (Note that, for us, a manuscript item is typically a file folder.)

One of our woman suffrage pamphlets, for example, carries the identifier N1043. This is the number provided in the 856 field. The searchable text is stored in a file named N1043.SGM, indicating a text with SGML markup. The searchable text contains pointers (tagged elements) to the page images and to a separate set of reproductions of illustrations. Thus a retrieval system can present a researcher with the text as the primary item and hold the images in reserve. The researcher can access the images "from the text," if desired.

The page-images are in files named N1043001.TIF (for the first page image), N1043002.TIF (for the second page image), and so on. The extension TIF indicates the use of TIFF (Tagged Image File Format) headers. The intricacies of handling the pamphlet's illustrations have led us (for now) to create a separate set of digital images that reproduce the illustrations; in the illustration-image filenames, the identifier is followed by a hyphen or a T (to indicate a thumbnail image) and a two-digit serial number (the illustration's serial number), and end with the extension PCX (the illustrations are stored in the "PCX" file format).
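
The naming pattern just described can be sketched in a few lines of Python. This is only an illustration and not a tool we used. It assumes a three-digit page serial (as in N1043001.TIF), it assumes that the hyphen marks the full-size illustration and the T its thumbnail, and it checks names against the DOS limit of eight characters plus a three-character extension. All function names are hypothetical.

```python
def text_filename(identifier: str) -> str:
    """Searchable text with SGML markup, e.g. N1043.SGM."""
    return f"{identifier}.SGM"


def page_image_filename(identifier: str, page: int) -> str:
    """Page image, e.g. N1043001.TIF; a three-digit page serial is assumed."""
    return f"{identifier}{page:03d}.TIF"


def illustration_filename(identifier: str, serial: int, thumbnail: bool = False) -> str:
    """Illustration image in PCX format.

    Assumption: '-' marks the full-size image and 'T' the thumbnail,
    followed by a two-digit illustration serial number.
    """
    marker = "T" if thumbnail else "-"
    return f"{identifier}{marker}{serial:02d}.PCX"


def fits_dos_8_3(filename: str) -> bool:
    """True if the name fits the DOS pattern of up to 8 characters plus up to 3."""
    base, _, ext = filename.partition(".")
    return 1 <= len(base) <= 8 and len(ext) <= 3


if __name__ == "__main__":
    names = [
        text_filename("N1043"),
        page_image_filename("N1043", 1),
        page_image_filename("N1043", 2),
        illustration_filename("N1043", 1),
        illustration_filename("N1043", 1, thumbnail=True),
    ]
    for name in names:
        print(name, fits_dos_8_3(name))
```

With a five-character identifier such as N1043, every name generated here uses the full eight characters available, which suggests why the serial numbers are kept so short.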

We have not yet placed an image-only manuscript or printed-matter item in a retrieval system, but anticipate that a researcher will first receive a list of images and, having displayed one, will be able to navigate within the set: next, previous, and jump to.

Linking to added-value information. From time to time, we have wanted to link the bibliographic record not only to reproductions of the item proper but also to what I will call here added-value elements. Thus far, our added-value elements have consisted of images and/or texts, selected or written in a pedagogical mood; they are intended to provide special explanations to students or other first-time users or, in some cases, to simply add a little eye appeal.

Here's one example: the Nation's Forum political speeches from the Motion Picture, Broadcasting, and Recorded Sound Division. There are three types of reproductions for this collection: the sound recording, a transcript of the speech text, and an image of the original 78 rpm record label. In the final version of the cataloging, these will be cited in 856 fields. Then there are two added-value elements: a portrait photo of the person making the speech, from a Prints and Photographs Division biographical photograph file (an entirely different collection) and the caption cum identification for the photo. Meanwhile, in another collection, we plan to accompany selected items with "context notes," i.e., brief paragraphs that place the item in a larger historical context. Our current plan is to cite these added-value elements in a local 956 field, structured to mirror the 856.

Managing processing activities. The preparation of large, multipart, and sometimes multiformat collections includes a number of steps, e.g., items may have to be tracked through preservation treatment, sent to or retrieved from storage facilities, subjected to copyright searches, and (of course) digitally captured by scanning or other means. Archivists control these process steps by means of a variety of databases. In the interest of efficiency, we found ourselves asking how those processing databases might be transmuted into finding-aid databases when processing came to an end. Or, to ask it another way, could not the catalog in an early state serve the work of collections processing?

This topic came home with a vengeance when we worked up a collection of late 19th-century and early 20th-century books about California. Since these books are from the General Collections, our tiny American Memory coordination office managed the process. We sent batches of books to one contractor for text conversion and then to a second contractor to have the illustrations scanned. Sign-out/sign-in documentation was prepared for each batch with the assistance of the Loan Division. We did our best to monitor the whole process in a manual mode, and kept finding ourselves out of synch with movement of materials.

In an after-action analysis, our consultant Elaine Woods--who has provided us with more useful guidance on bibliographic databases than any other single person--proposed adding ad hoc 980 and 984 fields to the bibliographic record. [FOOTNOTE: The bibliographic records for this collection are PREMARC records that have been upgraded by a Library of Congress Cataloging Directorate editorial unit. In order to prepare our CD-ROM prototype, we downloaded the cataloging into the Minaret PC-based cataloging software where new fields could be added.] As indicated in figure 3, subfields told where a book was and recorded the delivery-disk name and filenames for the contractor-produced texts and illustrations. The 984 field also includes a subfield that records the ranking our editor assigned to the book during the selection process. At the end of processing, of course, this information can be stripped from the record.

FIGURE 3. Example of American Memory California book record. Ad hoc fields 980 and 984 may be used for processing control; only 984 is shown here. We have also used field 035 to hold the American Memory processing serial number for the book within our California collection.

In field 984, subfield b contains the "editor's quality ranking" for the book; subfield n indicates that this book was handled by the Binding Office and identifies the contractor delivery batches for the converted images (Input Solutions batch 2) and the text (Federal Prison Industries batch 7); the first subfield x identifies the current physical location of the actual book (a desk charge); and the second subfield x indicates the need for a copyright search.

 Control #:  01019696 Date Added:  810825 Date modified:  19930204134812.0
 Type date:  s Dates:  1850 - Country:  nyu Illus:  Audience: 
 Form:  Contents:  Govt:  Conf:  0 Fest:  0 Index:  0
 Fic:  0 Biog:  Lang:  eng
 010 arc01-801
 035 a011
 040 aDLC cCarP dDLC
 042 apremarc
 043 an-us-ca
 050 0 aF865 b.K57 u
 100 1 a[Kip, Leonard], d1826-1906. 
 245 10 aCalifornia sketches, bwith recollections of the gold mines. 
 260 aAlbany, bE.H.  Pease & co., c1850. 
 300 a57 p. c21 x 13 cm. 
 651 0 aCalifornia xDescription and travel y1848-1869. 
 651 0 aCalifornia xGold discoveries. 
 852 aDLC bGC
 906 kx n7 o7 pl ql
 917 aYES
 952 aY b9 nIS0
 984 b10 nBinding Office/IS2;FPI7 xLM603 xIncomplete - needs copyright 
        clearance; test & images recorded twice??
 985 aGC/CA-First person eAmMem fBook
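
The stripping of processing information at the end of processing can be sketched as follows. This is an illustrative Python fragment, not the routine used with the Minaret software. It assumes the record is available as lines of text in the layout of figure 3, where a field begins with a three-digit tag and a wrapped field continues on an indented line that does not begin with a tag. The function name and the default tag set are hypothetical.

```python
PROCESSING_TAGS = frozenset({"980", "984"})  # the ad hoc fields named in the text


def strip_processing_fields(record_lines, tags=PROCESSING_TAGS):
    """Return the record lines with the given fields removed.

    A line is treated as the start of a field when its first three
    non-blank characters are digits followed by a space (or nothing).
    Continuation lines are dropped along with the field they belong to.
    Caveat: a continuation line that happens to begin with three digits
    and a space would be misread as a new field.
    """
    kept = []
    skipping = False
    for line in record_lines:
        stripped = line.strip()
        tag = stripped[:3]
        starts_field = len(tag) == 3 and tag.isdigit() and stripped[3:4] in ("", " ")
        if starts_field:
            skipping = tag in tags
        if not skipping:
            kept.append(line)
    return kept


if __name__ == "__main__":
    record = [
        " 852 aDLC bGC",
        " 984 b10 nBinding Office/IS2;FPI7 xLM603 xIncomplete - needs copyright ",
        "        clearance; test & images recorded twice??",
        " 985 aGC/CA-First person eAmMem fBook",
    ]
    for line in strip_processing_fields(record):
        print(line)
```

Keeping the processing data in clearly separate local fields is what makes this final step simple: the finding-aid record is whatever remains once those tags are removed.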

Tracking copyright information. One important lesson taught by the Library's Optical Disk Pilot Project (1982-1987) was that a full-content system can neither be assembled nor provided to end-users without the permission of owners of copyrighted material. The Library's digital offering will include some copyrighted material. It will be very difficult for the Library to secure permissions unless it can demonstrate that its computer retrieval system can track use and manage a fee system.

The World Wide Web system under development in our computer center has not yet begun to wrestle with the delivery side of this problem, but we have begun to work on data-gathering and data-recording. I present our ad hoc scheme here as much to provoke discussion as to recommend.

The first step in a permissions process is to determine the facts of ownership (to the degree possible), and I believe that the bibliographic record is a good place to record those facts. In contrast, volatile information about permissions and fees belongs in a look-up table rather than the bibliographic record.
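
The division of labor proposed here, stable facts in the bibliographic record and volatile terms in a look-up table, can be pictured with a small Python sketch. Everything in it is hypothetical: the class names, the attribute names, and the sample values are invented for illustration and do not describe any Library system or any real item.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CopyrightFacts:
    """Stable facts of ownership, recorded with the bibliographic record."""
    control_number: str
    registration_number: Optional[str]  # e.g. the content of a field 017
    claimant: Optional[str]


@dataclass
class PermissionEntry:
    """Volatile terms, kept in a separate look-up table."""
    permission_granted: bool
    fee_per_use: float


def delivery_terms(facts: CopyrightFacts,
                   table: Dict[str, PermissionEntry]) -> Optional[PermissionEntry]:
    """Find the current terms for an item; None means nothing is on file."""
    return table.get(facts.control_number)


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    facts = CopyrightFacts("EXAMPLE-1", "A 000000", "Hypothetical Claimant")
    table = {"EXAMPLE-1": PermissionEntry(permission_granted=False, fee_per_use=0.0)}

    # Terms change; the bibliographic facts do not need to be touched.
    table["EXAMPLE-1"] = PermissionEntry(permission_granted=True, fee_per_use=0.25)
    print(delivery_terms(facts, table))
```

The point of the separation is that a renegotiated fee or a withdrawn permission requires a change to one table entry, while the catalog record stays as it was.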

I advocate capture of the facts of copyright during collection processing. When a collection is destined for electronic access, this information is every bit as important as, say, data that supports preservation. Figure 4 lists our ad hoc copyright-data fields and notes a variation developed by the Prints and Photographs Division. Figure 5 provides an example of a bibliographic record from our collection of California books containing copyright data.

FIGURE 4. Field 017 and ad hoc local fields used in the American Memory California books database.

FIGURE 5. Example of an American Memory California book record, showing the use of copyright-information field 017 and ad hoc field 917 (field 918 not used in this example).

 Control #:  25010432 Date Added:  821108 Date modified:  19921202114622.0 
 Type date:  s Dates:  1923 - Country:  cau Illus:  ac Audience: 
 Form:  Contents:  Govt:  Conf:  0 Fest:  0 Index:  0
 Fic:  0 Biog:  Lang:  eng
 017 aA 856825
 035 a173
 040 aDLC cCarP dDLC
 042 apremarc
 043 an-us-ca
 050 0 aF864 b.M7
 100 1 aMoak, Sim, d1845-
 245 14 aThe last of the Mill Creeks, and early life in northern California, cby Sim Moak. 
 260 aChico, Calif., c1923. 
 300 a47, [1] p. bfront.  (3 port.)  illus.  c24 cm. 
 650 0 aFrontier and pioneer life zCalifornia. 
 650 0 aMill Creek Indians. 
 852 qCB
 906 kx n7 o7 pl ql
 917 aSimeon Moak f1-8-24
 919 sSearch completed aref and bibl staff bY cN d8-20-90, James Roberts memo 
 952 aY b2 nIS2
 984 b9
 985 fcf05

Bibliographic records and reproductions in searchable-text form. Our use of SGML markup for full texts led us to explore two issues that pertain to bibliographic records. The first issue emerged when we were designing our Document Type Definition (DTD), following the guidelines of the Text Encoding Initiative. The TEI requires a header that indicates the work's author, title, and so on. We asked ourselves, "Should we include all of the data from the work's bibliographic record in the TEI header?"

In our case, the answer was no. For one thing, items like manuscript folders have no formal bibliographic record (although they may have a finding-aid record or register entry). For another, our American Memory team--pressed by other matters--did not want to take on the task of designing a TEI-header scheme that could capture all the nuances of a MARC record.

As shown in figure 6, our TEI header includes one title (within the larger title statement) that reproduces the title from the bibliographic record (MARC field 245) for the item with the added phrase "a machine-readable transcription." Within the publication statement, we provide the Library of Congress Card Number (LCCN) for the record, when such a number has been assigned. (Some of our finding-aid-quality bibliographic records are created in local databases and do not receive official card numbers.)

FIGURE 6. Example of TEI header for a document from our Woman Suffrage collection, reformatted for clarity in presentation. The title printed here in boldface has been copied from the bibliographic record's 245 field with the phrase a machine-readable transcription appended. Another element (also boldface here) provides the Library of Congress Card Number (LCCN) for the official bibliographic record for this monograph.

<TEI2><TEIHEADER TYPE="text" CREATOR="American Memory, Library of
 Congress" DATE.CREATED="09/08/93">
<TITLE>Powers of Congress to prohibit inequality, caste and oligarchy
of the skin. Speech of Hon. Charles Sumner of Massachusetts, delivered in the
Senate of the United States, February 5, 1869: a machine-readable transcription.</TITLE>
<TITLE>Winning the Vote for Women: The National American Woman Suffrage
Association Collection; American Memory, Library of Congress.</TITLE>
<RESP><ROLE>Selected and 
converted.</ROLE><NAME>American Memory, Library of
Congress. </NAME></RESP>
<P>Washington, 1993.</P>
<P>Preceding element provides place and date of transcription.</P>
<P>This transcription intended to be 99.95% accurate.</P>
<P>For more information about this text and this American Memory
collection, refer to accompanying matter.</P>
<COLL>Selected from the National American Woman Suffrage Association
Collection, Rare Book and Special Collections Division, Library of Congress.</COLL>
<COPYRIGHT>Copyright status not determined.</COPYRIGHT>

The second SGML issue pertains to computer character sets. This matter has nothing to do with the Text Encoding Initiative (TEI) guidelines, but grows out of the customary practices associated with the creation and use of SGML documents.

In their raw form, of course, SGML documents employ only the "lower 128" characters in the American Standard Code for Information Interchange (ASCII) set. In the American Memory implementation of TEI SGML, for example, the letter é is coded as &eacute; (called the entity reference), consistent with the rendering outlined in ISO 8879-1986.

But the SGML declaration associated with the American Memory TEI implementation states that we comply with ISO 646-1983, in which é is character number 130 (hexadecimal 82). In contrast, ALA USMARC provides the diacritic ' (character 226, hexadecimal E2) followed by the letter e (character 101, hexadecimal 65).

Why did we declare our preference for ISO 646-1983? We were told that the software used by contractors for optical character recognition, rekeying, and/or markup is generally oriented to this standard, as is typical SGML display software. It would be cumbersome to require contractors or software producers to use the ALA character set.

The disjunct between the SGML and ALA character sets, however, means that post-processing may be needed before a bibliographic record for an item and the searchable text for that item can be displayed on a single computer monitor or output to a printer. During the American Memory pilot, we reprocessed data when collections were loaded into our CD-ROM retrieval systems. Our pilot-period efforts also revealed character-set conflict when we extracted the 245 title field from a bibliographic record for use in the TEI header (see figure 6 above). If the title in the MARC record included a diacritic, we would have to convert it to the entity reference before placing the extracted title expression into the header.
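
The conversion of a MARC title to SGML entity references can be sketched in Python. This is a minimal illustration, not the post-processing routine used in the pilot. It handles only the acute accent (hexadecimal E2 followed by the base letter, as described above), it assumes the title arrives as raw bytes, and the function name is hypothetical. A working converter would need a table covering every diacritic in the ALA character set.

```python
# Diacritic byte -> suffix of the ISO 8879 entity name.
# Only the acute accent discussed in the text is included here.
DIACRITIC_SUFFIX = {0xE2: "acute"}


def marc_title_to_entities(data: bytes) -> str:
    """Replace 'diacritic byte + ASCII letter' pairs with entity references.

    In the ALA character set the diacritic precedes the letter it modifies,
    so E2 65 (acute, then 'e') becomes '&eacute;'. Any other byte above
    127 raises an error rather than being passed through silently.
    """
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        suffix = DIACRITIC_SUFFIX.get(byte)
        nxt = data[i + 1] if i + 1 < len(data) else None
        if suffix and nxt is not None and nxt < 128 and chr(nxt).isalpha():
            out.append(f"&{chr(nxt)}{suffix};")
            i += 2
        elif byte < 128:
            out.append(chr(byte))
            i += 1
        else:
            raise ValueError(f"unhandled byte 0x{byte:02X} at position {i}")
    return "".join(out)


if __name__ == "__main__":
    print(marc_title_to_entities(b"Caf\xe2e society"))
```

Raising an error on an unmapped byte is a deliberate choice in this sketch: a title that silently loses a diacritic is harder to catch in quality review than one that fails to convert.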

The alternative to post-processing would be to develop end-user tools capable of correctly displaying and printing two or more character sets. But it is fair to say this option would be just as challenging as post-processing.

IV. Conclusion

I regret ending my report on what may seem to be a blue note. In fact, it has been extremely interesting to participate in the development of digital collections for the Library of Congress and the rocks in our road have been no more numerous than one would expect. My Library of Congress colleagues and I look forward to continuing the effort! Thank you.