David Bearman

Museum and Archives Informatics


David Bearman is Editor of the journal Archives and Museum Informatics: Cultural Heritage Informatics Quarterly. He is internationally known as a consultant on cultural heritage information services and electronic records management. He led the development of such archives and museum information interchange standards as MARC-AMC and CIMI and he pioneered the International Conferences on Hypermedia and Interactivity in Museums which have fostered development of multimedia interactives in museums since 1991. Prior to forming his own firm, Bearman served as Deputy Director of the Office of Information Resources Management at the Smithsonian Institution.


Keynote speaker: Virtual Electronic Junkyard or Cultural Treasure Trove?

Useful Electronic Space or Virtual Junkheap?

Library of Congress

October 13, 1994

by David Bearman

Editor, Archives and Museum Informatics


As this morning's announcement by the Librarian of Congress made clear, we are entering a new era in the production and distribution of information. Over the next twenty-five years we can expect to take part in a worldwide effort to represent the entire corpus of civilization in digital form. The last similar undertaking took place between the 15th and 18th centuries when the knowledge of the ancient world was recast into print. As any manuscript librarian will tell you, that transfer was never entirely completed, although well over 99% of the ancient record has been printed today. It may be germane to note, however, that long before the conversion of the corpus was complete, new writing, printed and widely disseminated in Europe, urged and contributed to the overthrow of the existing political and economic order.

While the "re-presentation" of knowledge in which we will be engaged is likely to occur in concert with similar fundamental changes in our society, I will not speculate on long-term trends but instead would like to raise some relatively short term political, intellectual, professional and ethical challenges related to this monumental endeavor. My intention is to leave you with a number of questions and provocations about the enterprise in which we are engaged, rather than propose answers.

Political:

Standing as we are at the end of the age of the nation-state and on the verge of a new political order whose outlines are unclear but whose tendencies towards globalism are self-evident, we need to think in both national and trans-national terms in imagining how our cultural heritage will be brought forward to electronic media. We must simultaneously provide structures for nations to re-capture their "patrimony", and ways for the international system to support this endeavor. The model recently adopted by the international community for support of documentation of the world's biological heritage is not only closely parallel to what we require for cultural heritage; it also suggests a way to achieve the political goal of acquiring adequate resources for the task. Briefly, what the world's nations agreed to in Rio at the Environmental summit was to require each nation to commit itself to documenting its bio-diversity in the coming decade. This world-wide effort exceeds the financial capacity of the poorer nations and will therefore need to be funded through taxing the wealthy nations of the world, which are anxious to gain commitments to other parts of the agreement.

The state of documentation of our biological heritage closely parallels that of our cultural heritage. In the richer parts of the world, where there is the least biological diversity, documentation is quite good, but elsewhere there is substantially greater diversity and less documentation. Much of the expertise to identify and describe species is concentrated in the first world, while the material to be documented can be obtained only through field work conducted elsewhere. The order of magnitude of the task is reflected in the billions of known species: 10 to at least the 9th power.

The cultural environment is not dissimilar. While every community and ethnic group has a heritage that deserves documentation, those of the first world have been well documented and are ready for re-presentation in digital form because they have largely made the transition to printed form, photography, and sound recording. That of the third world is embedded in oral tradition, the un-documented built environment, and the customs of the peoples as well as in a literature largely beyond our bibliographic control and archives without finding aids. Again, the order of magnitude of the task is 10 to the 9th power, or billions.

Our challenge is to imagine, and bring into being, an engine capable of driving the documentation of cultural heritage, of human diversity, equivalent to that constructed to document bio-diversity. I believe the political ingredients for such an effort are in place:

* the first world countries desperately need to demonstrate their respect for other cultures because the history of imperialism and the effect of their continued economic power will be to erode, if not disrespect, cultural diversity.

* almost every nation-state in the world today finds itself challenged, or threatened, by ethnic identity and needs to show that it respects the cultural heritage of its diverse populations while at the same time giving them an experience of the binding glue of national identity.

* multi-national corporations have the resources to contribute to cultural documentation and need to demonstrate their concern for the peoples in the countries in which they are pursuing their business interests.

Thus the first critical action we need to take to "re-present" the cultural heritage of the world in digital form is to put forward a political agenda for an international effort to document the cultural heritage of the world. This is not just the Global Information Infrastructure (GII) effort announced recently by the NII Advisory Council and the Clinton administration; it is a movement to populate the GII with the wealth of knowledge created over thousands of years by human civilizations. Let us consider how we can ensure that the next round of international trade talks, or the next international summit, incorporates commitments to document mankind and its achievements.

Intellectual:

The second challenge we face is intellectual. If you had heard Dr. Billington's announcement this morning, you would have heard the refrain "libraries and archives" as the source materials to be digitized. Because I come from what has been called by librarians the "non-book" tradition, I am keenly aware that the information in the Library of Congress and the nation's archives and museums that is most exciting as digitized data comes from manuscripts, maps, photographs, drawings, and sound recordings. Remember: "non-book" materials constitute the vast majority of the over 100 million items held by the Library of Congress. Even by the most generous definition, "books", for which we have relatively straightforward "on-ramps" for the digital highway, constitute less than 25% of the holdings.

Those of us in the archives and museum community have long been dealing with a problem in the organization of knowledge that is too frequently overlooked in traditional cataloging circles:

* the materials with which we deal are too voluminous to be individually identified,

* they do not reveal the elements by which they might be described, individually or collectively, and do not possess "subjects" whereby they might be indexed,

* and the potential audiences for these materials have widely divergent interests in them.

My experience with describing cultural objects includes encounters with the 100,000 mosquitoes, 500,000 military buttons, and 10 million 20th century stamps at the Smithsonian, and the hundreds of millions of archival documents generated annually by the U.S. Federal Government. These present cataloging issues similar to the ones we will face in the digital cultural environment we are striving to create - the one about which the Washington Post editorial on September 17 asked the rhetorical question I borrowed as the title of this talk. I suggest that the answer to how we will make the Internet or Information Infrastructure a "usable space" rather than a "virtual junkheap" lies in our ability to use the lessons of intellectual control over museum and archival holdings in "cataloging" the digital environment.

Most of the 100,000 mosquitoes were collected by the Surgeon General of the United States during the construction of the Panama Canal. They are cross referenced to data describing the temperature, time of day, barometric pressure and altitude of their capture. It makes no sense to describe these mosquitoes except by reference to the context of their collection and the purposes of the research they were used for. They are too voluminous to be described individually, and they derive their meaning from their collective existence. Those who might want to consult them today have very different interests than those who collected them, or would have described them. They are mosquitoes; but they are not about mosquitoes, or public health, or imperialism, or the bio-ecology of Panama.

The military buttons were collected by many individuals and by the curators over considerable time. They derive their significance as collectibles from the social meaning of regimental buttons, which requires that they be associated with the countries, services and military units that issued them. However, they also have interest as aesthetic, commercial and functional entities. Like any objects, they do not dictate the terms under which they might be described, lacking title pages or other evidence of what they mean or how they were created. Groups, rather than individual items, must be linked to external entities such as military units or manufacturers in order to provide meaningful access; they cannot be reasonably described by a subject heading that would retrieve all 500,000.

The stamps, which as printed copies do contain some information about themselves, turn out to be better described by reference to the International Postal Union Convention of 1906 than as individual items. Because they were received under the terms of a treaty which required each signatory nation to provide each other signatory nation with two proof sheets of every stamp they issue, all ten million can be completely described in a single cataloging record that references the treaty obligation. Secondary sources are more than adequate to document each issue and its philatelic history. No number of added entries could provide as rich an understanding as almost any inexpensive stamp catalog.

Consider the difference between a book on the Great Pyramid of Cheops and the pyramid itself. The users of the book can be expected to be interested in the subject of the book - the pyramid and the Egyptian culture which produced it. The "users" of the thing itself may be interested in building techniques, in art, in the environmental conditions in a particular passageway and their effects on pigments in a mural, or in the story that can be constructed from objects in a burial.

Or imagine the difference between a book on the Earth Orbiting Satellite system and either the archival record of the EOS program or the terabytes of scientific data that the program produces. The two collectivities of data produced by the program do not have as their "subject" the EOS system. The archival record documents governmental activity such as contracting and oversight; the scientific record documents the state of the earth. Neither is "about" the EOS satellites; each is vast, and each consists of component items that might be of interest at lower levels of granularity to very different types of users.

These and many other collections of information and records will comprise the universe of electronic information resources, the new information environment which, as the Washington Post said on September 19, is threatening to become "a vast virtual junk heap" rather than a "usable space". The archives and museum community has long dealt with collections of hundreds of millions of unique items and has developed approaches that will be critical in the age of Internet resources: methods to find information in very large haystacks when subject indexing is an impossible (and indeed unnecessary) dream.

Sometimes we forget that it has only been since the end of the last century that cataloging has focused on creating uniform surrogates of items, with prescribed rules for transcription of information and for headings management. It arrived at this point after overthrowing a century of tradition in the organization of knowledge devoted to universal classification schemes. We now face an equally profound revolution. We need to forge a new tradition of cataloging and access or, more accurately, to improve and extend methods from another tradition. The more fundamental challenge is to understand that we have been in the business of organizing knowledge rather than cataloging, and that the organization of knowledge business is about to grow tremendously and become a great added value enterprise in the networked environment.

The new tradition is going to be one that documents intellectual products based on their provenance and locates them in the collectivities in which they now reside. An example of the challenge this presents to us is the Government Information Locator Service (GILS). GILS is typical of the kinds of information systems we will soon see developed to provide access to information: its contents include what might traditionally have been viewed as published and unpublished material, archival records and historical data, systems and services. As such it will serve as a means to access the hundreds of millions of public records created each year by the Federal Government. Descriptive cataloging and subject indexing will not be able to cope with providing access because the contents of the information systems are classically of, and not about, what the agency does. The action we will need to take in the face of the intellectual challenge is nothing short of reinventing the methods by which we have been organizing knowledge around new approaches designed to achieve the same ends in a very different environment.

Professional:

The third challenge we face is professional. In the near future, when the published output of our age is available in full text and the published output of the past is nearing complete conversion, the realities of the end of descriptive cataloging will be more evident. Today we have arrived at the end of one road and have little choice but to turn onto an unknown, but challenging, new path.

Twenty-five years ago, the library community invented a very robust interchange protocol now known as ISO2709/ANSI Z39.2 for exchange of a new kind of information: Machine Readable Cataloging or MARC. The content standard that is at the heart of the MARC record has evolved significantly in the decades since MARC was introduced, but the interchange protocol has been stable. We are now on the edge of a change in how the interchange format is used, one that will take place without any necessary changes to the format itself, but which will place Machine Readable Cataloging in an entirely new realm. Essentially, the future is that there will be no activity called cataloging that is separate from creation and acquisition of records and publications. (I note that thirty years ago, before cooperative cataloging was made possible by MARC, the number of professional catalogers at universities was many times what it is today, so if we take this change in stride it will not be the first revolution we have weathered.)

The reason is that the publications themselves will be "machine-readable" so it will not be necessary to create surrogates which are only extracts from sources of information that are capable of being machine-read. Now it is only necessary to be able to write the rules by which the desired information would be extracted. Of course, if the information necessary to describe and index the publications can be extracted, it need not actually be separated from the record of the whole -- no new databases need to be created. All we need to do is to enable the publication to provide the information required by our knowledge-bases. In a sense, this represents a logical extension of the Cataloging in Publication program initiated fifteen years ago.
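The idea of writing extraction rules, rather than building surrogate records, can be made concrete with a small sketch in Python. It is only an illustration under stated assumptions: the "Title:", "Author:" and "Date:" header lines, the rule table RULES and the function describe are all hypothetical and belong to no cataloging standard. The point is that the description is derived from the publication on demand and is never stored as a separate record.

```python
import re

# Hypothetical extraction rules: each maps a descriptive element to a pattern
# applied to the publication's own text. The element names and the header
# layout are illustrative assumptions, not part of MARC or any standard.
RULES = {
    "title": re.compile(r"^Title:\s*(.+)$", re.MULTILINE),
    "author": re.compile(r"^Author:\s*(.+)$", re.MULTILINE),
    "date": re.compile(r"^Date:\s*(.+)$", re.MULTILINE),
}


def describe(publication_text):
    """Derive a description from the publication itself; nothing is stored."""
    description = {}
    for element, pattern in RULES.items():
        match = pattern.search(publication_text)
        if match:
            description[element] = match.group(1).strip()
    return description


SAMPLE = """Title: Useful Electronic Space or Virtual Junkheap?
Author: David Bearman
Date: October 13, 1994

As this morning's announcement made clear, we are entering a new era.
"""

print(describe(SAMPLE))
```

Changing what the knowledge-base asks of a publication then means editing the rules, not re-cataloging the items.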

Catalogers, indexers and reference librarians will not, however, find themselves without a job in this environment. On the contrary, they will have more to do to catalog above the item level than they ever previously did at the item level. Because "publication" will become an activity in which anyone can engage, millions of new sources of information will become available in the next decade that would never have been published in a traditional sense. We will need to develop methods to separate the meaningful information into categories and discard the noise, to map the universe of knowledge in a way that the general public can navigate. But we need to begin the new activity now; not continue wasting money and human intelligence on methods we know have no future.

My colleague Ian Wilson, archivist and previously Librarian of Ontario, summed it up nicely last month in a poignant story of a dedicated member of the staff of the National Archives of Canada who spent a lifetime in London copying out by hand the colonial documents relating to Canada, barely survived the blitzkrieg and retired in broken health after World War II. After the war, the work she and her colleagues had done since 1890 was microfilmed, along with dozens of times as much untranscribed material, in a matter of three years. Wilson pointed out that her superiors in Ottawa had long been tracking microfilm technology and knew it would soon be practicable to film everything in situ but kept the team in London doing its job. The work that we as book catalogers have done for the past hundred years stands in exactly the same relation to the new technologies we face in the digital era.

The self-documenting digital resource is not only a possibility, it is a professional necessity. We need to put ourselves out of the descriptive cataloging business because what we see is not what we get (under software control, data structure and context may not be evident) and because the larger challenge is to develop means for those with diverse and unknown intellectual perspectives to retrieve pertinent information from the cultural knowledge-base. Earlier today you heard some discussion of the concept of an electronic CIP; in the archives community we are discussing the more radical concept of metadata standards for self-documenting transactions. In the near future cataloging interests must define such self-documentation requirements and get individual catalogers out of a line of business that is the effective equivalent of Ian Wilson's sad transcribers.

Let me provide the simplest example of this, describe the research implications of it, and then discuss how it impacts us today. Imagine that we have a list of terms that can be used to index a collection of objects, be they books, archives or field notebook entries from an archaeological dig. We can sit down with each of these items and construct a description, according to the different cataloging traditions described above, that uses terminology from our controlled vocabularies and thereby ensures that it will "file" correctly in a larger system. Or we can capture the documents we have in front of us in their full uncontrolled text and use the controlled vocabularies as an entree to them in the search process.

The model we have used in the past has been to create vocabulary controlled surrogates. There are many theoretical arguments for why this has always been a poor idea, not the least with historical material where our term assignment obscures the original language, but these theoretical arguments are now being overtaken by practical realities. The fact is that in the next fifty years or so the complete corpus of the knowledge of civilization will be digitized, not as surrogates but in its full text, image and sound. The first items that will be captured will be all published books and with them all libraries will be transformed from places where books are to places where experts assist users in finding information. You may say (and to some extent legitimately) that this has been true now for a decade or more, but the reality is only now with us.

In principle, this means we will have lots of language and few cataloging records. We will need to provide relevant access to information without catalogers having described it, drawing instead on the greater richness of its original text and meaning. This means we will need to develop methods to search natural language with greater recall and precision than we have heretofore been able to achieve. How? By interspersing the languages of classification between the users and the database both on the way in and the way out of a query.

The syndetic structure of a faceted vocabulary gives us an example of the benefits of such a strategy and alerts us to some of the software requirements if we are to make it really work. The benefits are that a user who knows only a relatively broad term can find chestnut and cherry tables when looking for wood tables because wood is a broader term, while one concerned with only chestnut furniture can find tables, chairs and lamps all made of chestnut in a single query because furniture is a broader term. The disadvantage, as I've recently discussed in greater detail elsewhere, is that we lack the functionality to make all the kinds of quite reasonable user assumptions about such an environment operative.
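The wood and furniture example can be sketched in a few lines of Python. This is a minimal illustration and not a description of any existing retrieval system: the vocabulary fragment BROADER, the sample texts and the functions narrower_terms and search are all hypothetical, and matching is on exact lowercase words, so plurals and word variants are ignored. The query is expanded through the vocabulary on the way in, and the texts themselves stay uncontrolled.

```python
# Hypothetical fragment of a faceted vocabulary: each term maps to its
# broader term within a facet (material or object type).
BROADER = {
    "chestnut": "wood",
    "cherry": "wood",
    "table": "furniture",
    "chair": "furniture",
    "lamp": "furniture",
}


def narrower_terms(term):
    """Return the term together with every term beneath it in the hierarchy."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for narrow, broad in BROADER.items():
            if broad in found and narrow not in found:
                found.add(narrow)
                changed = True
    return found


def search(documents, *query_terms):
    """Return ids of texts that satisfy every query term.

    A query term is satisfied when the text contains that term or any
    narrower term from the vocabulary.
    """
    expanded = [narrower_terms(term) for term in query_terms]
    hits = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if all(words & terms for terms in expanded):
            hits.append(doc_id)
    return sorted(hits)


# Uncontrolled free-text descriptions, invented for illustration.
DOCUMENTS = {
    "a": "chestnut table with turned legs",
    "b": "cherry table from a farmhouse",
    "c": "chestnut chair with rush seat",
    "d": "brass lamp with glass shade",
    "e": "chestnut lamp base",
}

print(search(DOCUMENTS, "wood", "table"))          # ['a', 'b']
print(search(DOCUMENTS, "chestnut", "furniture"))  # ['a', 'c', 'e']
```

Neither "wood" nor "furniture" appears in any of the sample texts; both queries succeed only because the vocabulary stands between the user and the text.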

But, frankly, it doesn't matter. The existing corpus is going to be captured and made available so fast we need to act now to get ahead of the flood and act with architectures that assume full-text analysis on the fly. Catalogers, as we have described them in the past, still have a role to play in this environment, but they need to take on the tasks that are presumed in the academic course descriptions of their discipline: "the organization of knowledge". It is no accident that "cataloging" is not used as a description of these courses which train the current novice in the art they will practice in the twenty-first century.

So what does it mean to be an organizer of knowledge rather than a cataloger? As I have recently said in horrifying my archival colleagues, it means that we do our work without touching (or reading) the "stuff". We become agents in developing the means by which users access the records or information they require but the agency we are now part of is not that of term substitution but of constructing an intellectual scaffold from which the holdings of our collections can be accessed.

Note that these traditions assume different types of knowledge about what we hold and different uses for the information. However, each in its way creates a surrogate--for the item, the collection or the function that led to its creation. The future may involve a very different way of providing access--instead of creating surrogates for objects that we want to retrieve, we may have a universe of objects that could be retrieved and spend our intellectual efforts creating rules, semantics, and frames through which the information that is captured as part of the item is made meaningful to a researcher or query. Of particular interest are methods to reveal the structural dimensions in large bodies of information using tools that require only repetitive low intelligence processing such as those that are used in image recognition or citation mapping. These techniques will include term co-occurrence analysis in place of labor intensive indexing, mapping of genre attributes in place of descriptive cataloging, and discourse analysis in place of expert systems or intelligent front ends.
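Term co-occurrence analysis is the simplest of these techniques to illustrate. The Python sketch below counts how often pairs of words appear in the same text, which is the repetitive, low-intelligence processing described above. The sample texts, the stopword list and the function name cooccurrence are illustrative assumptions, and a real system would need far larger bodies of text and some statistical weighting.

```python
from collections import Counter
from itertools import combinations

# A tiny stopword list, assumed for illustration only.
STOPWORDS = {"the", "of", "in", "a", "and", "at", "for", "by"}


def cooccurrence(documents):
    """Count how often each pair of terms appears in the same text."""
    pairs = Counter()
    for text in documents:
        # Sorting makes each pair appear in one canonical order.
        terms = sorted({w for w in text.lower().split() if w not in STOPWORDS})
        pairs.update(combinations(terms, 2))
    return pairs


# Invented sample texts, loosely echoing the collections discussed earlier.
DOCUMENTS = [
    "mosquitoes collected in panama",
    "temperature and altitude of mosquitoes in panama",
    "regimental buttons collected by curators",
]

counts = cooccurrence(DOCUMENTS)
print(counts[("mosquitoes", "panama")])  # 2
print(counts[("buttons", "panama")])     # 0
```

Pairs with high counts suggest terms that belong together in a map of the collection, without anyone having indexed the individual items.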

Ethical:

Finally, but not trivially, we are confronting a major ethical challenge that will determine whether the undertaking succeeds. The library and academic communities, where much knowledge is made and distributed, are currently standing in the way of its widespread digital capture by insisting on the concept that they can make copies of publications without payment of fees to authors under the terms of the "fair use" exemption of the copyright law. The effect of insisting on this so-called "right" to date has been to make it very difficult to obtain the rights to convert books and certain images into digital form, because if some copies that are in the universe of copies have been made under the terms of fair use, it is much more expensive to enforce and restrict copying.

Now I understand as well as anyone else the arguments that have been advanced for "fair use", but I am convinced from my experience monitoring multimedia product development in the past few years that unless we give up the concept of fair use, we will be the victims of it. We will see an increasing proportion of intellectual property seek protections like licensing, patents and trade secrecy, that are much more restrictive than copyright which was, after all, intended to allow authors to broadly disseminate their ideas while maintaining control. We will starve the universal knowledge-base that might otherwise explode on the networks, and particularly limit its diet of current publication which is the very substance that could justify, and pay for, the rest of the system. In the end, we will drive the good information into protective regimes that will not be available to the general public, and defeat the first purpose of copyright which was to enable creators of ideas to make them publicly known.

You will have noticed that Dr. Billington's talk this morning tiptoed around copyright, but left little doubt that the materials that are in copyright and in heavy demand won't be available digitally, not because this wouldn't be socially desirable (because it obviously would) but because we are still without frameworks for a social contract in this area. I hope that efforts to site-license large collections of digital material to universities, such as the Tulip project sponsored by Elsevier, or the Museum Educational Site Licensing project sponsored by the Getty Trust and MUSE, will fashion an acceptable model for intellectual property administration. And I hope that the library community will be able to see that the long-term interests of users and those of knowledge creators coincide and that the money saved in acquisition, cataloging, binding and circulating books can, in part, be returned to the publishers and authors. The alternative could be a world in which knowledge is made available only under contracts and can be withdrawn at any time.

Conclusions:

In the coming months, you will have numerous opportunities to act affirmatively on each of these challenges, and many similar opportunities will arise in the coming years. To obtain the appropriate sort of funding and support for conversion of the cultural heritage to digital form, you can begin by commenting on the NII Task Force Application paper "Arts, Humanities and Culture" released for public comment on September 7. In so doing, you might find "Humanities and Arts on the Information Highways: A Profile" published by the Getty Trust, ACLS and Coalition for Networked Information, a valuable tutorial. To adopt new methods of intellectual access you can support implementation of the Government Information Locator Service by the National Archives and urge the critical importance of the provenance of records over approaches using subject headings more appropriate for access to publications. Comments on the current draft GILS Bulletin are due at OMB tomorrow, but other policy guidances will follow. To accept the challenge being presented to cataloging, you can work through the Register of Copyrights on the adoption of metadata standards for documentation of electronic texts and resources and develop means whereby copyright registration and cataloging take place simultaneously with publication. To act on copyright, you can comment on the proposals of the NII Task Force on modifications to the copyright law, and sign up to participate as one of the institutions involved in testing out new methods of accounting for intellectual property usage in the Museum Educational Site Licensing project or one of the other testbeds.

The challenges we face are many. Interestingly, few of them are really technological. The real issue is to face, and indeed lead, social change.