Musings on Priscilla Caplan's "International Metadata Initiatives: Lessons in Bibliographic Control"

Robin Wendler, commentator

Final version

Any attempt to review metadata initiatives on an international scale is daunting not only because of the sheer number of efforts but also because of the range of forms they take and the range of purposes they serve. Some metadata initiatives produce data element lists and data dictionaries, of course, but others produce encoding syntaxes, registries, controlled vocabularies and thesauri, rules for selecting or formulating the content of data elements, and so on. Metadata elements or values with the same semantics can exist in diverse schemes designed to support intellectual access, rights management, preservation, commerce, and structural representation. Complicating the picture is the fact that many metadata schemes encompass more than one of these categories to some extent.

Priscilla Caplan's paper focuses primarily but not exclusively on schemes which support intellectual access. She provides a very focused overview and an insightful analysis of the key metadata initiatives of relevance to libraries and to the kinds of organizations most closely allied with them, such as archives, museums, and publishers. Rather than comment on the initiatives individually, I'd like to highlight some of the conclusions she draws. These are essential points. Not only are they the logical lessons to be drawn from a review of these metadata efforts, but they reinforce the lessons libraries themselves have learned about creating and sharing metadata on a large scale.

"Approaches to resource description differ because the underlying functional needs for the metadata differ."

Access to information resources does not occur in some abstract space. Resources are described within a context, and a description of a resource reflects the perspective of that context. With the exception of the Dublin Core, each of the schemes Cilla analyzes has emerged from a specific community with a specific worldview and was developed to fulfill the requirements of that community. Those requirements often differ markedly from those that AACR2 and MARC were designed to support.

Cilla describes the marketing and rights management needs of the publishing community, and contrasts them with the need of libraries to manage huge inventories over long periods of time. Another example of how the same materials are described in very different ways comes from the image world. A given photograph held in a visual resource collection or an archive would receive radically different treatments in those environments than it would in a library catalog. The visual resource approach gives primacy to the subject of the photograph, as opposed to the photograph as an object per se, and requires a richer and more integrated way of managing relationships among various "works" and images of those works than MARC provides. Archivists, like librarians, tend to describe the photograph as an artifact. However, while they generally provide little information about an individual photograph, they do require the ability to express how it fits within the intellectual organization of a complex body of materials.

Seeing how differently another community conceives its information reminds us to be sensitive to their functional needs. In some sense, the contrast does, or should, make us see our own needs in a new light. This means not only questioning our own assumptions about how resources should be described and why, but also valuing the requirements which survive such scrutiny. When we evaluate new metadata schemes for their potential applicability (either directly or indirectly) in a library setting, we must

Doing these things will enable us to make trade-offs in an informed way. What we might be trading away becomes clear when we look at another of Cilla's key points:

"Metadata schemes without content rules are not very useable"

Library cataloging according to AACR2 and MARC is the foundation of an incredibly complex and robust flow of data that libraries rely on not only for public access, but also for acquisitions, copy cataloging, resource sharing, cooperative collection development, and cooperative preservation. An amazing quantity of library metadata flows daily among countless computer systems in countless organizations. Our systems ingest this metadata, index it in sophisticated ways, sort it, identify duplicate records, correct errors in it, and update names and subject terminology within it. We can automate these functions precisely because the form and nature of the content is regulated and well-understood. Libraries have achieved an impressive degree of interoperability (that infamous word!) because we have created, maintain, and apply metadata standards, including cataloging rules. The initiatives Cilla describes have not yet resulted in anywhere near the volume of interoperable descriptive metadata that libraries have, but as they begin to scale up, they find themselves limited by the lack of consistency in their metadata.

Most of our catalogs, fortunately, still reflect Cutter's "objects", and we achieve these through the application of content rules and authority control. With consistently formulated metadata, we can provide fairly consistent and comprehensive retrieval of items by a given author, with a given title, and on a given topic. We can present searchers with well-ordered (and evidently ordered) lists of search results. In the absence of consistently formulated metadata we cannot do these things. To the extent that we choose to abandon or downplay content rules, we choose to limit what we can do, what functions we can support, with our metadata.
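The effect of authority control on retrieval can be shown with a minimal sketch. Everything here is an assumption made for illustration: the `AUTHORITY` table stands in for a real authority file, `index_by_author` is a hypothetical helper, and the three sample records are invented. The point is only that the same three records collocate under one heading when names are controlled and scatter under three when they are not.

```python
from collections import defaultdict

# Hypothetical authority file: variant forms of a name mapped to the single
# authorized heading. A real authority file is far richer than this.
AUTHORITY = {
    "Mark Twain": "Twain, Mark, 1835-1910",
    "Twain, Mark": "Twain, Mark, 1835-1910",
    "Clemens, Samuel Langhorne": "Twain, Mark, 1835-1910",
}


def index_by_author(records, authority=None):
    """Build an author index from (author, title) pairs.

    When an authority table is supplied, each author string is replaced by
    its authorized heading before indexing; otherwise it is used as found.
    """
    index = defaultdict(list)
    for author, title in records:
        heading = authority.get(author, author) if authority else author
        index[heading].append(title)
    # Sort titles so that each result list is predictably ordered.
    return {heading: sorted(titles) for heading, titles in index.items()}


# Invented sample records with inconsistently formulated names.
records = [
    ("Mark Twain", "Roughing It"),
    ("Twain, Mark", "Adventures of Huckleberry Finn"),
    ("Clemens, Samuel Langhorne", "Life on the Mississippi"),
]

uncontrolled = index_by_author(records)           # three separate headings
controlled = index_by_author(records, AUTHORITY)  # one heading, three titles
```

A search on any one form of the name in the uncontrolled index finds a single title; the controlled index returns all three in order.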

Martin Dillon calls for creating DC records in MARC.[1] Sarah Thomas calls for using Dublin Core in order to reduce the time spent cataloging books.[2] However, the decision to "use DC", whether in MARC or some other form, is not enough. It does restrict the universe of metadata elements, but was choosing which fields to fill in ever the tricky and time-consuming part of cataloging? It fails to address the question of how the content of the data elements should be selected, how it should be formulated, and whether any elements should be required. It is silent on the functional relationship of this metadata to other library metadata, particularly that in our OPACs. If we choose to create DC records without content rules or authority control, we must ask what the library can do with the resulting metadata. Can we continue to fulfill our objectives? Do we still have objectives, and can we articulate them at the level of detail that Cutter did?

A decision by libraries to use DC or any other metadata scheme as a native form of metadata should be accompanied or, ideally, preceded by far more important decisions about what functions this metadata must support and the rules that will be necessary to enable those functions. (And no, "catalog more stuff" does not constitute functional analysis.) By developing a library-community application profile for the Dublin Core Metadata Element Set, complete with refinements, extensions, and content rules, two things would become possible. DC would become a responsible option for native library metadata, and librarians could more fairly assess the costs and benefits of using DC as opposed to our current minimal or core-level cataloging records.

However, I am not advocating that the Dublin Core initiative as a whole create or adopt some cataloging code. Quite the contrary. I would modify Cilla's maxim slightly:

"Metadata schemes without content rules are not very useful as native forms of metadata."

The absence of content rules is critically important for two things the Dublin Core does extremely well: 1) serve as a kernel around which specific communities can develop their own richer metadata schemes, and 2) mediate between community-specific metadata schemes.

The Dublin Core was not intended to be sufficient to meet the internal metadata requirements of any single community. In fact, as a "native" scheme, it is almost always used with refinements and extensions. It is true that for each community, the process of developing content standards is difficult and time-consuming. However, given the variety of materials and perspectives represented in the DC effort today, the difficulty of creating or imposing a single content standard is immense. Nor am I convinced that these communities would continue to see Dublin Core as a viable option were they forced to adopt a single content standard.

Cilla has identified three ways the Dublin Core can be used to achieve the second goal, that is, to mediate between community-specific metadata schemes:

Note that each of these implies the existence of a richer description, presumably operating within an environment that takes advantage of that richness. The Dublin Core elements are, in these scenarios, either extracted from or mapped to elements in that richer description. If the Dublin Core Metadata Element Set itself were to be tied to a particular set of content rules, it would be difficult if not impossible to use it in these ways, since such Dublin Core-specific rules would inevitably conflict with community-specific rules. Which brings us to Cilla's next point:

"Simply mapping from a semantic or syntactical element in one scheme to a comparable element in another does not guarantee the usability of the converted metadata."

We have a tendency in this community to wave our hands and say "Oh, we have crosswalks -- it'll be fine." (And for use of Dublin Core as a so-called "switching language" among richer metadata sets, it generally is, because all we are trying to accomplish is fairly coarse, high-level discovery.) But the fact is that mapping between metadata schemes always results in loss: loss of data, loss of meaning, loss of specificity, loss of accuracy. As Caroline Arms notes in her paper, it is relatively easy to map from a richer scheme into a simpler one, accepting such loss, but mapping between rich metadata schemes is difficult, costly, and, I would add, rarely very effective.[3] What you get is often the proverbial dancing bear: it's not that he does it well -- the wonder is that he can do it at all. Or as Greg Colati of Tufts recently noted, mapping between metadata schemas enables us to communicate in grunts.

Semantic and syntactical mapping are themselves extremely imperfect. Differences in concept and in specificity inevitably result in metadata which reflects the lowest common denominator. In addition, as Cilla points out, metadata must be created according to content rules in order to be reliably useful. Therefore, any mapping which does not also transform the element content where applicable will result in metadata that is not very useful. The application of content rules usually requires human judgment in conjunction with an examination of the resource itself. In contrast, mapping metadata from one scheme to another is done algorithmically, takes place in the absence of the original resource, and is generally performed without human intervention. Transformation of the content of the elements (its selection or its form) is rarely attempted during such a process, and for good reason.
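A purely algorithmic crosswalk of the kind described above might look like the following sketch. It rests on stated assumptions: the `CROSSWALK` table is a hypothetical mapping from a handful of MARC tags to unqualified Dublin Core elements chosen for illustration, not an official crosswalk; `marc_to_dc` is an invented helper; and the sample record is simplified to (tag, value) pairs rather than real MARC structure.

```python
# Hypothetical crosswalk from a few MARC tags to unqualified Dublin Core.
# Illustrative only; a published crosswalk covers many more fields.
CROSSWALK = {
    "100": "creator",      # main entry, personal name
    "700": "contributor",  # added entry, personal name
    "245": "title",        # title statement
    "650": "subject",      # topical subject heading
    "651": "subject",      # geographic subject heading
}


def marc_to_dc(record):
    """Map (tag, value) pairs to a dict of DC element -> list of values.

    Element names are translated, but values are copied unchanged: nothing
    here reformulates content according to the target's content rules.
    Tags with no entry in the crosswalk are dropped and reported.
    """
    dc = {}
    dropped = []
    for tag, value in record:
        element = CROSSWALK.get(tag)
        if element is None:
            dropped.append(tag)
            continue
        dc.setdefault(element, []).append(value)
    return dc, dropped


# Simplified sample record, invented for illustration.
record = [
    ("100", "Twain, Mark, 1835-1910"),
    ("245", "Adventures of Huckleberry Finn"),
    ("300", "366 p. ; 21 cm"),
    ("650", "Runaway children -- Fiction"),
    ("651", "Mississippi River -- Fiction"),
]

dc, dropped = marc_to_dc(record)
```

In this sketch the topical and geographic headings collapse into a single undifferentiated `subject` list, the physical description disappears, and the creator arrives in inverted, dated form whether or not the target environment expects names formulated that way.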

Specifically, mapping will not necessarily allow metadata that was designed to operate in one context -- in support of a given set of functionality -- to operate in another context, in support of a different set of functionality. Converted metadata will certainly not operate as well as metadata created expressly for the context would. This view contrasts with that expressed by Carl Lagoze in his paper. Carl advocates that libraries promote the catalog as a "mapping mechanism", and envisions an environment based on a "model that recognizes distinct entities that are common across virtually all descriptive schemas - people, places, creations, dates, and the like - and that includes events as first-class objects."[4] Carl is certainly correct that libraries have a mediating role, and high-level discovery across domains is a service many libraries are actively developing. Hopefully, these services will be in addition to full-featured catalogs tailored to particular communities and particular kinds of research, not in place of them. But the kind of cross-domain interoperation that Carl envisions supposes a degree of coordination and like-mindedness among all information producers that is hard to imagine, and a level of operational complexity beyond anything we know today. Even further, it underestimates the fundamental differences in worldview among various information-describing communities.

How well mapping will serve you depends on how far apart the schemes are in structure, semantics, and content rules, and on how much functionality from either the source or the target environment you need to retain. Mapping has its uses, but we need to recognize its limitations up front and not oversell its capabilities.

Cilla also exhorts librarians to

"...begin thinking about basic bibliographic metadata as a commodity, produced and exchanged by a number of communities in order to serve a number of purposes."

Metadata created in the publishing arena is a natural fit with libraries. Clearly this is an area where active coordination and necessary compromise could yield real benefits. However, the differences in approach are not trivial. As Cilla points out, publishers are even more passionate than libraries are about controlling certain information such as the identification of rights holders. Unfortunately, given their interest in current materials, the publishing community's set of rights holders intersects but is not coextensive with the set of personal and corporate names so important to libraries, which cover many centuries of authorship.

Conceivably, commodity metadata could extend into other arenas as well. Perhaps there is a role for commodity metadata in any case where the thing described is mass-produced, mass-accessible, or mass-referenced. This takes us beyond bibliographic metadata for published materials and into metadata for commonly referenced but singularly occurring works such as the Mona Lisa or the Vietnam Veterans' Memorial, into metadata which describes agents such as authors, performers, etc., and into gazetteer-type metadata for geographic locations.

The examples Cilla gives of how we can fruitfully approach the meaningful sharing of metadata between the publishing and library communities:

are useful areas to examine whenever libraries want to interact with another metadata-producing community. Making an informed choice about which metadata schemes to adopt or to adapt requires analysis and decision-making at this level of detail.

Finally, Cilla makes a critical observation that is often lost in the frenzy to "catalog the web":

"The key question is not bibliographic control of Web resources, but rather bibliographic control of both digital and non-digital resources in the Web environment."

The theme of "selection" came up time after time in this conference. Libraries have never cataloged every take-out menu and place mat, which is the level of much of the content of the "free web" today. As more of the substantive content that libraries have always chosen to provide and preserve moves to restricted, fee-based web delivery, our formal relationships with the publishing community will become more important, and with them, the ability to repurpose the "commodity metadata" of which Cilla has spoken.

It is no accident that most of the metadata schemes Cilla enumerated apply to both digital and non-digital media. Enormous quantities of valuable physical resources exist and will continue to exist, and any model of bibliographic control for the new millennium must take these into account. The web permits vast amounts of non-digital information to be exposed for discovery through descriptive databases. Such non-digital material exists not only in libraries but also in such diverse organizations as museums, natural history collections, and visual resource collections. Much of this information will never be made digital due to the impossibility of capturing its artifactual value in digital form, to the economics of converting and maintaining information in digital form, or to other constraints. The problem facing libraries derives only in part from the proliferation and complexity of web resources. It also lies in the challenge and opportunity to help users make sense of the flood of resources, both digital and non-digital, that the web reveals.

  1. Dillon, Martin. "Metadata for Web Resources: How Metadata Works on the Web", Bicentennial Conference on Bibliographic Control for the New Millennium, November 15-17, 2000
  2. Thomas, Sarah. "The Catalog as Portal to the Internet", Bicentennial Conference on Bibliographic Control for the New Millennium, November 15-17, 2000
  3. Arms, Caroline. "Some Observations on Metadata and Digital Libraries", Bicentennial Conference on Bibliographic Control for the New Millennium, November 15-17, 2000
  4. Lagoze, Carl. "Business Unusual: How 'Event-Awareness' May Breathe Life Into the Catalog", Bicentennial Conference on Bibliographic Control for the New Millennium, November 15-17, 2000

Library of Congress
December 21, 2000