In two centuries we have come a long way in the construction of our bibliographic catalogs. We have evolved from the book style catalogs of a former century that were also made famous in this century by the enormous catalogs such as those of the British Library and our own National Union Catalog-surely the book catalog/union list to end all catalogs-to card catalogs that comprised 1000s of drawers at large institutions to microfilm or microfiche and thence to comfiche and then to online OPACs. While I can say that all in a single list of nouns with a grammatically syndetic structure (that is to say, an 'and-ed' string), the fact is that this small list-book, card, film, fiche, computer output microform, and online-represents a credible chronological period. Call it 150 to 200 years, give or take. Another way to measure it is to say that it encompasses about 30% of the time that has elapsed since the invention of the printing press with moveable type.
And this only addresses the issue of the form or format of the individual catalog records and the catalog itself. I'd like you to view the catalog records as analyzed pieces of data, discrete but interconnected, and the catalog proper as the container for the actual catalog records. The individual records linked together in what we used regularly to call a 'syndetic' structure are what make a catalog out of discrete bits of data.
On this level we have created the actual standards and protocols of the catalog records that really are the infrastructure of the catalog itself. These are the standards upon which individual records interconnect and hold together as a unified, coherent whole. These codes, both the historical codes and the present-day codes, are the standards and protocols of the catalog just as DNA is the protocol of the genes that make up all living beings. Some among us who are a bit jaundiced about our ability to adapt might claim that the codes are more immutable by far than DNA-and even sometimes as hard to understand. Change happens slowly-but, unlike with life forms, never by accident. There are no accidental mutations in our world. Our catalog code mutations are deliberate and are subject to national and international standards bodies that give them their seal of approval only after due consideration.
It is worth observing that the catalog formats and the catalog codes with which we are most familiar were developed in an overlapping fashion. Most of us would probably consider the late 19th and the 20th centuries as the period that comprises both the age of substantial catalog form development and the age when bibliographic control or, at any rate, meaningful bibliographic cataloging codes developed.
But neither time nor technology stands still and now we have the World Wide Web. The Web. No one needs to ask what you mean; there is only one Web and it is 'the' Web. You don't even need to capitalize the 'T' in 'the' nor, I think, pronounce 'the' as 'the' with a long 'e'. That is what the Channel 56 television news does in Boston to show that they are the first and original 10 PM newscast: they call themselves "The Ten O'clock News." The Web is surely different, however. It is not Charlotte's Web, nor anyone else's Web. It belongs to all of us, if it can be said to belong to anyone. After all, the Internet is famous for its loose governance structure. That is what makes it so exciting and so revolutionary, but it is also what has come to make life so challenging, not to say difficult, for those of us whose lives are centered on the concept of bibliographic control.
I hope that those of you who may have had training in classics, as Eric Jul and I did, or at least in mythology, will agree with me when I say that I find parallels between the sudden efflorescence of the Web and the birth of the goddess Athena. According to Hesiod, Athena was born fully-grown from the head of Zeus. Pindar expounded on that a bit further and tells us that Hephaistos struck open Zeus' head with an axe to release Athena. The Web is not yet a decade old and its precursors go back considerably further. Yet it has come upon us so suddenly and with such overwhelming force that it can be thought to have sprung almost as fully grown and fully armed as Athena did at her first appearance. For a technology that young it has attained a tremendous level of maturity and acceptance, but bibliographic standards as we know them are not yet among its major accomplishments. That being said, I feel safe in saying it has certain characteristics of a young and rebellious child in that it refuses to be governed by the long-standing rules of its elders. Unfortunately for those of us whose lives revolve around issues of bibliographic control, its growing pains are our parental pains.
In the early 1990s I was a member of OCLC's Cataloging and Database Services Advisory Committee when OCLC undertook an experiment, the InterCat project, to catalog electronic resources. Electronic lists and journals were still few enough in number that it seemed possible to actually control this new realm in a semi-traditional fashion. It was at this time that the MARC field 856 was defined for the URL. Not too many years later the first of the Dublin Core conferences was held, and by then it was abundantly clear that we had entered into a new dimension where the traditional no longer held sway. In talking about cooperative relationships in building metadata I need to note the absolutely pivotal position of OCLC and its Office of Research in this, not only because of this InterCat experiment and the development of the 856, but also because from the beginning this experiment was conceived of as cooperative in nature. Martin Dillon and Eric Jul were visionary within OCLC's Office of Research then, and they continue to help guide us today. In a hallway conversation at this summer's ALA conference Martin and I were talking about the program entitled "Is MARC dead?" We were talking about XML and Dublin Core, and he observed (perhaps quoting Fred Kilgour?) that the struggle to define and refine and redefine Dublin Core has taken longer than World War II did to fight and win!
On the face of it that is true and you have got to wonder about our sense of priorities and perspective. Perhaps a rephrasing of the famous Latin phrase Vita brevis, ars longa (Life is short, but Art is long) is called for: Data brevia, regula longa, or, It only takes a moment to create a record, but the route to standards is long! Finally, I would alter Martin's hallway comment about the Dublin Core and compare it not to World War II but to the Hundred Years' War. Like the Hundred Years' War, MARC and AACR2 and Dublin Core have their adherents and their political and religious believers. Like many of the struggles in bibliographic control, such as that over Full- vs. Minimal- vs. Core-level cataloging, the struggle resolves itself to a common denominator of ideology.
Let me turn back to the subject at hand. The data that underpins our catalogs has a long and distinguished history. It is what some people would refer to as real cataloging or real metadata. Of course, very few of us had ever heard, let alone used, the term metadata until cataloging of Internet resources became a part of our job descriptions. I suggest that the new term was adopted partly because new standards became an issue and so a new term seemed appropriate, but also because the issue of providing descriptive and other sorts of data about them had been co-opted by computer scientists. In her keynote presentation at the ALCTS Conference "Metadata: Libraries and the Web" at last summer's ALA Annual Conference, Jennifer Younger reported on a brief survey of online literature for use of the term metadata. Searching the Web of Science she found some 279 hits on the term, of which the oldest was from 1982. I myself had previously tried searching for it in the online Oxford English Dictionary and Merriam-Webster, but to no avail. It has not yet entered popular parlance to the degree that it is recognized by those authorities.
So, if scientists first made use of the term to refer to categorization and classification and then passed it on to the library science community, I would not be surprised. We have had and continue to have our challenges in dealing with a scientific community that is not grounded in our discipline and that is amazed that we had a tradition of authority control long before they ever found a need for it. I suspect they concocted the term 'metadata' because they did not know that we already had a perfectly good name for what they were engaged in doing. By that I mean 'cataloging', of course.
By now we have all absorbed the concept of metadata. I am not sure that we have all come to recognize it as a semi-Aristotelian concept, however. Aristotle wrote the Physica and then the Metaphysica. For him it was an issue of works on physics and the things that came after that. In Greek it was really the Physica and the Meta ta Physica, the Physics and the AfterPhysics. We might as well speak of CatalogData and MetaData or TraditionalData and NaïveData or Metadata. I cannot take credit for NaïveData; it is what my good friend and colleague from Yale, Matthew Beacom, likes to call the New MetaData. By Naïve Metadata he is referring to cataloging created by non-catalogers and designed to describe newer types of electronic materials.
It is another perspective on what Elizabeth Mangen of the Geography and Map Division here at the Library of Congress said at the ALCTS metadata conference: "Metadata is a complement to cataloging data, not a replacement for it."
As comforting as it would be to espouse such sentiments, I am afraid that I cannot. If Cicero were here with us today, he might well exclaim, "O tempora, o mores"-"Oh, the times, oh, the customs." I cannot. It is simply too atavistic to pretend that the old solutions are still viable and that the world is not changing in revolutionary ways around us. Nor that we can hunker down in our bibliographic bunkers and go it alone. We cannot.
Be that as it may, I suspect that if we as a profession had been more nimble on our feet, if we had been quicker to change and to recognize the profound changes that were about to confront us, and if we could have known with the power of hindsight the absolute explosion of data that was to engulf us, we might have tried to accelerate our standards processes to encompass these new data types. We might have, but I am not convinced that we could have done it. Standards-setting bodies by their nature are not prone to moving quickly. I remember one of the very early formative meetings of the Cooperative Cataloging Council even before it became the Program for Cooperative Cataloging. A group of forward-thinking individuals was concerned with adapting our cataloging codes more quickly to the challenges of Internet resources and seeing us be proactive, not reactive by a factor of years.
We have come a long way, but we still too much resemble, to my mind, the legal profession. It is obvious that the legal profession lags far behind technology in societal and ethical issues, whether it is the length of time it has taken to establish the validity of digital signatures or the plethora of medical conundrums that advances in medical science force upon us. While I recognize the necessity of affirming and reaffirming our roles as custodians and catalogers in an historical and intellectual framework, we will lose our role unless we learn to swim faster than ever before. It is either that or we drown. Or, as I suggest, we learn to engage in cooperative swimming. We must, because we are no longer swimming in a small, well-defined, and clearly bounded swimming pool. We are attempting to swim in an ocean, and one being constantly replenished with enormous rivers of data every second of every day.
I am reminded of a NELINET annual meeting a few years ago. Marshall Keys, the then Executive Director, was giving his annual state of NELINET address. He told us of a job description for an institution in North Carolina, I think it was, that included this statement: "Successful candidate does not have to be able to walk on water, but definitely must be a strong swimmer." I think that is an apt description of what we are trying to accomplish in technical services today.
Now, if I am candid and truthful, I will admit to you that in the academic library circles where I have spent most of my professional life we were already in danger of drowning a long time ago. It is not the Internet phenomenon that alone threatens to overwhelm us. I spent many years at a large, well-known institution in Cambridge, Massachusetts, at the opposite end of Massachusetts Avenue from MIT-the very place that Priscilla Caplan and Robin Wendler and I met-that never managed to process, let alone seriously catalog, anything like the major part of its enormous annual intake of new materials. As long ago as 1992 I was venting on these subjects in a talk entitled "Minimal-level Cataloging and Other Solutions to the Backlog" in a seminar on "Cataloging in the '90's," sponsored by New England Technical Services Librarians and NELINET. I may be guilty of revisionist history, but I don't seem to recall advocating minimal-level cataloging at that time, but I did strongly recommend a more fundamental reassessment of what we meant by copy cataloging and inter-institutional cooperation. These became, in fact, some of the basic tenets of the Program for Cooperative Cataloging. Minimal-level cataloging concepts, on the other hand, led by various pathways to the definition of the Dublin Core. To revisit a Reagan-era mantra, I believe that I advocated the concept of 'trust but verify' in the context where verification was predicated on machine-assistance from both the client and the server levels.
So, if we were already in danger of drowning in a sea of print-based materials by 1992, how can I characterize our state in 2000? Really drowning? Really, really drowning? Or really, really in need of help and new helpers? How much drowning in a sea of uncataloged or uncontrolled materials can any one individual or institution endure before they are dead, or really, really dead?
Let me turn back to the issue of standards, new standards, and the information landscape that stretches before us.
In the beginning of this story we had the ALA data standard, then the Anglo-American standards, AACR, known in retrospect as AACR1, then AACR2, finally AACR2 with its various revisions. The Germans have their RAK. On the other side are MARC21 and other data structures: the Dublin Core metadata standard is the most prominent these days. But there is an entire host of other standards out there as well, whether it is the Visual Resources Association (VRA) Core, the Encoded Archival Description (EAD), the Text Encoding Initiative (TEI), the Computerized Interchange of Museum Information (CIMI), or the Records Export for Art and Cultural Heritage (REACH). There are geospatial standards and there are non-Anglo-American standards-the German MAB and MAB2. We Anglophones need to accept the fact that, while English is at the center of the modern information network, it is not universal, and we need to reach out and be linguistically inclusive and find ways to map and marry different national, linguistic data standards to our own every bit as much as we need to consider different data structures. As hard as it has proved to harmonize the former USMARC and CanMARC into MARC21, and with UKMARC still to join the other two, reaching across linguistic boundaries is even harder. I am vitally intrigued and hopeful that someday we will see a dynamic structure that allows all linguistic communities to participate in metadata creation, both bibliographic and authority, and that we can have a system where we Anglophones can use a heading such as Cologne (Germany) and the Germans can use Köln (Deutschland) and intelligent systems can substitute one for the other based on individual and institutional contexts. In essence this is Barbara Tillett's thesis. But I am not naïve about the challenges-I do not expect to see such a system anytime soon, perhaps not even this decade.
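To make the heading-substitution idea a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the AuthorityRecord class, its field names, the record identifier, and the language codes are all hypothetical and are not drawn from any existing authority format or system. The point is simply that one authority record can carry several equally valid headings and that the display form is chosen from the user's or institution's context.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AuthorityRecord:
    """One entity with several language-specific headings (hypothetical model)."""
    record_id: str
    # Maps a language code to the heading preferred by that community.
    headings: Dict[str, str] = field(default_factory=dict)

    def heading_for(self, language: str, fallback: str = "en") -> str:
        """Return the heading for the requested language, else the fallback one."""
        if language in self.headings:
            return self.headings[language]
        return self.headings[fallback]


# The identifier below is made up for illustration.
cologne = AuthorityRecord(
    record_id="place-0001",
    headings={"en": "Cologne (Germany)", "de": "Köln (Deutschland)"},
)

# The same record displays differently depending on context.
print(cologne.heading_for("en"))
print(cologne.heading_for("de"))
print(cologne.heading_for("fr"))  # no French heading recorded, so the fallback is used
```

The design choice worth noticing is that neither heading is "the" authorized form; each is authorized within its own community, and the substitution happens at display time rather than in the stored record.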
How well do you remember the old style book catalog? It was very one-dimensional and extraordinarily static. The National Union Catalog and its regular cumulations were like that. There tended to be but a single access point, and that was it. With card catalogs we developed the ability to photocopy multiple cards, create additional access points, and maintain a multiplicity of complementary files. A declining number of us-a number becoming fewer every year-enjoyed the rigors of indentions, spacing, and typing in red! As I like to recall, one of the inhibitions on multiple access points was the need for the photocopying and manual typing. OCLC and RLIN card sets put that particular inhibition to rest. In fact, I suggest that catalog records endured a gradual increase in length over time because the computer-assisted nature of card production made that aspect of catalog generation easier and easier. However, fewer and fewer of us have recollections of dealing with multiple catalogs (union, author, title, subject, dictionary). Those cards have all but disappeared-and, as I like to say, to a humiliating end, since they have lost their lot in life as scrap cards to the ubiquitous yellow stickies. But let us not forget that those cards offered the first meaningful ability to cross-link and trace from one card to another, and from one alphabetical arrangement in the catalog to another.
What confronts us then is finding a way to reconcile-I will not say to 'unite', but to 'reconcile'-different metadata structures and standards and to develop sensible presentations of them. I will try to keep this simple and presume that the metadata structures are Anglophone-centric, not in their content but rather in their bibliographic and authority structures. Now let me take it one step further and say that the future record, whether you prefer to call it a catalog record or a metadata record, will not be one-dimensional or static. Rather, it will be multi-faceted and dynamic. It will be composed on-the-fly of a variety of different "metadata-lets": the traditional bibliographic description at its core, but with a number of concentric circles associated with it and including such information as citations, reviews, dust jacket illustrations, author information, links for delivery or ordering, etc. I think it fair to say that we have been moving gradually over a long period of time toward that vision, but technology is only now getting us to a point where it can be realized. Widespread use of A&I and citation databases and the creation of table of contents databases separate from the MARC record were the first glimmers of enhanced records. Alive to the possibilities, OCLC has now presented sketches of its planned successor service to WorldCat that will be based on a CORC standard and tentatively called 'eXtended WorldCat':
Now, when I was asked to lead this panel by John Byrum more than a year ago, I was Associate Dean of Libraries and Director of Technical Services at the Indiana University Libraries-Bloomington. Since that time, as you know, I have left a lifetime in academia and taken the position of Director of Product Management for Ex Libris (USA), Inc. As I envisioned the panel we would have a group of vendors talking about the catalog record as a dynamic entity and their role in creating it. What do I mean by a dynamic record? It is a record that has a core consisting of a traditional bibliographic description, but it is regularly enhanced and/or refreshed by a series of elements that we have not routinely associated with the record. These elements will be provided to us as part of a regular bibliographic commerce conducted on the fly. In addition, we will benefit from regular infusions of data files from other sources, data that will accompany regular infusions of digital text files.
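A dynamic record of this kind can be sketched in a few lines of Python. This is a hedged illustration, not a description of any vendor's system: the supplier functions (fetch_toc, fetch_reviews), the record layout, and the sample values are all invented for the example. The core description is held locally, each enhancement is requested from its own supplier at display time, and a supplier that fails leaves the core intact.

```python
from typing import Callable, Dict

# The locally held core bibliographic description (illustrative values only).
CORE = {"title": "Example title", "author": "Example author"}


def fetch_toc(record_id: str) -> dict:
    """Hypothetical supplier of table-of-contents data."""
    return {"chapters": ["Chapter 1", "Chapter 2"]}


def fetch_reviews(record_id: str) -> dict:
    """Hypothetical supplier that happens to be unreachable right now."""
    raise ConnectionError("review supplier unavailable")


def compose_record(record_id: str, core: dict,
                   suppliers: Dict[str, Callable[[str], dict]]) -> dict:
    """Assemble the core plus whatever enhancements can be obtained on the fly."""
    record = {"id": record_id, "core": dict(core), "enhancements": {}, "missing": []}
    for name, supplier in suppliers.items():
        try:
            record["enhancements"][name] = supplier(record_id)
        except Exception:
            # A failed supplier must never prevent display of the core description.
            record["missing"].append(name)
    return record


record = compose_record("rec-1", CORE, {"toc": fetch_toc, "reviews": fetch_reviews})
print(record["core"]["title"])
print(sorted(record["enhancements"]))
print(record["missing"])
```

Keeping the core separate from the enhancements mirrors the distinction drawn above: the traditional description is always present, while the surrounding elements are refreshed or dropped as their sources allow.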
I want to make clear that I am not talking about outsourcing our bibliographic responsibilities in a wholesale sense, but there is an element of it here and I want to talk about it in a non-traditional sense. I want to discuss it as involving either discrete parts of the record, on the one hand, or of aggregations, on the other. Where parts are at issue, I want to talk of outsourcing the way General Motors would when it buys transmissions from one supplier and engines from another, or better, when it buys optional, value-added parts of the automobile-roof racks or moon roofs-from outside vendors. Engines and transmissions are essential to automobiles just as a core bibliographic description is essential to the bibliographic record. I want to talk about purchasing parts of records from diverse sources that are in the best position to manufacture them, provided of course they assure us of the quality control that we have every right to demand.
I was originally intrigued by several varieties of cataloging or cataloging enhancements that are becoming more and more significant to us and to our patrons. Let me offer to you four views of them.
* The first concerns aggregators and aggregations.
Like their printed or microformat counterparts of the 1970s and 1980s, electronic aggregations-principally of serials-threaten to overwhelm us. In their size and complexity they rival the major microform sets that we purchased during the past two to three decades. To some degree we managed bibliographic control over those sets by cooperative cataloging and bulk purchases and batch loading. In their absence, we usually made do with a printed index to the set. Unfortunately, few libraries are left that have the time or the staff to handle cataloging of these massive sets any longer. It is my perception that a collection-level record in the catalog for databases such as JStor or Project Muse or Ovid or any of the others has little utility beyond providing a place for recording payments. While some users might approach their journal searching on this level, most users clearly are after specific titles and not the database in which they are contained. (More on this below.) Into this void the publishers-for example, Bell and Howell and Primary Source Media-have now, thankfully, stepped and are offering to sell us the cataloging copy that goes with the sets. I would frankly be happier if they recognized that electronic copy for these sets is essential and the sets should be priced with the copy included and above all did not treat the cataloging copy as an add-on available at additional cost.
Where electronic, primarily serial aggregations are concerned, decisions are routinely made to purchase large electronic sets and then we in technical services are left holding the virtual bag trying to offer access. The Program for Cooperative Cataloging is setting basic standards and enticing the aggregators to provide us with copy, and I applaud EBSCO and others who are participating in our experiment. But, when the individual titles come and go at the speed of light, even their aggregator creators may not always know what is in the aggregations. In fact, I can tell you, without naming names, that one of the largest aggregators welcomed the PCC initiative precisely because it gave them an excuse to try to inventory their offerings. It is essential for us to make the issue of regular, low-cost or no-cost access to electronic catalog records a matter of competitive market advantage in our dealings with the vendors of aggregated content.
It should by no means be beyond the ability of aggregators to provide us with this data, and the best incentive for them is likely an economic one if libraries, principally collection development officers, are willing to wield it as a trump card. I have in mind that electronic records for all sets be provided in standard form by publishers as a part of their set and that their presence be considered an essential part of the evaluative process that precedes buying them. I have invited Amira Aaron, Director of Marketing and Programs for Faxon/RoweCom, to help us think through all these challenges.
Having had experience with purchasing some hundreds of thousands of major microform-style records while Head of Database Management at Harvard University in the early 1990s and then as Director of Technical Services at Indiana University-Bloomington in the late 1990s, I can tell you that there are substantial standards and data-loading issues to overcome. One of the principal challenges we face with both major microform sets and electronic aggregations is that of multiple versions and the single-record approach to the catalog. The challenge, as I see it, is more static in the microform arena than that of the electronic aggregation, and I personally favor the Mulver approach long held at Harvard University. Yet, while it can be relatively easy to match and load the microform copy to the record for the paper original for monographs and then to forget about it, we have no such luxury where the paper and digital worlds collide. That is because a very high percentage of the digital content in the typical aggregation comes and goes with alarming frequency. I have a very serious challenge to pose to the aggregators and to the system vendors among us, and it is this:
Libraries have a pressing need to develop a cradle-to-grave approach to handling electronic sets. That means obtaining the corresponding electronic records from the aggregator and receiving regular maintenance updates to the sets. Maintenance would ideally include additions, deletions, changes in coverage, etc. Now, as fond as I am in a theoretical sense of the Mulver or one-record approach also to electronic journals, I frankly do not see any simple way for us to allow the aggregator's data file to reach deep down into our integrated library systems and to touch data on this level if the data is buried within a Mulver-style record. It is simply fraught with too many difficulties. So I reluctantly conclude that, if we want to look to a hands-off, computer-to-computer data interchange, we need to keep the basic building blocks of that exchange as simple as possible. From the perspective of a former Director of Technical Services in one of the so-called Big Heads libraries, I have to admit that managing large serial aggregations is an impossibility at the local level without firm and decisive actions from the aggregators and help from the ILS vendors. I am desperate for an EDI-type solution for all aspects of aggregations that drills right down into the local catalog to solve this problem that I see as growing increasingly intractable.
So let me add a specific recommendation to this complaint. Contrary to what I said earlier about preferring the Mulver solution to disparate formats, I have reluctantly concluded that a partnered solution here will mean keeping the electronic journal records separate from their paper counterparts. That way an incoming record, particularly an incoming maintenance-level record, can be programmed to behave in predictable ways vis-à-vis the existing database record. This puts my public services preference at odds with my technical services solution, but I believe the solution that best serves technical services and that provides the best, easiest, and most up-to-date access will ultimately prove the best public services solution. Besides, it might prove possible to then merge records for the public view that are kept separate in the technical services components of our catalogs.
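The recommendation above can be illustrated with a small Python sketch. Everything in it is an assumption made for the sake of the example: the shape of the maintenance update, the action names (add, change, delete), the placeholder ISSNs, and the aggregation name are hypothetical and do not represent any aggregator's or ILS vendor's actual interchange format. It shows why separate electronic records make hands-off maintenance predictable, and how a merged public view can still be produced.

```python
from typing import Dict, Tuple

# Print records and electronic records live in separate stores.
print_records: Dict[str, dict] = {
    "ISSN-A": {"title": "Journal A", "holdings": "v.1-"},
}
# Electronic records are keyed by (aggregation name, ISSN).
e_records: Dict[Tuple[str, str], dict] = {}


def apply_update(store: Dict[Tuple[str, str], dict], update: dict) -> None:
    """Apply one aggregator maintenance transaction to the electronic records only."""
    key = (update["aggregation"], update["issn"])
    action = update["action"]
    if action == "add":
        store[key] = {"title": update["title"], "coverage": update["coverage"]}
    elif action == "change":
        if key in store:  # ignore changes to titles we never loaded
            store[key]["coverage"] = update["coverage"]
    elif action == "delete":
        store.pop(key, None)  # the print record is never touched
    else:
        raise ValueError(f"unknown action: {action!r}")


def public_view(issn: str) -> dict:
    """Merge the separately maintained records for display to the user."""
    electronic = [
        {"aggregation": agg, **rec}
        for (agg, rec_issn), rec in e_records.items()
        if rec_issn == issn
    ]
    return {"print": print_records.get(issn), "electronic": electronic}


updates = [
    {"action": "add", "aggregation": "Aggregation X", "issn": "ISSN-A",
     "title": "Journal A", "coverage": "1995-"},
    {"action": "add", "aggregation": "Aggregation X", "issn": "ISSN-B",
     "title": "Journal B", "coverage": "1998-"},
    {"action": "change", "aggregation": "Aggregation X", "issn": "ISSN-A",
     "coverage": "1995-1999"},
    {"action": "delete", "aggregation": "Aggregation X", "issn": "ISSN-B"},
]
for update in updates:
    apply_update(e_records, update)

print(public_view("ISSN-A"))
```

Because each transaction addresses one self-contained electronic record, nothing in the update has to reach inside a combined multiple-versions record, which is exactly the difficulty described above.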
Part and parcel of this is how we use this data to facilitate generalized user access to these resources. I mentioned my feeling that a collection-level record for electronic sets has little utility beyond serving as a locus to record payments. I think that most technical services and collection development librarians agree with me in this regard. And, while some libraries have extended the scope of this collection-level record by including in it so-called analytical added entries for individual serial titles, the extent of most aggregations renders that solution impractical, if not impossible. Furthermore, communicating a record for resource sharing that might be enhanced or, better, burdened, with hundreds or even thousands of analytic titles is beyond the scope of the MARC record. Lastly, let us not forget that a title entry within a collection-level record is still bereft of all the other access points such as subjects and responsible corporate bodies that our users have every right to expect.
Even if it were possible, however, such a listing has minimal utility unless it resides in a Web-enabled, clickable catalog. Thankfully, most of us are finally reaching that point in our integrated library systems development. In the meantime, however, most libraries of which I am aware have taken the approach of creating Web listings of Internet resources. For the most part these include the major database offerings, such as Lexis-Nexis, JStor, etc., as well as an itemized list of electronic journals, both for individually obtained ejournals and sometimes for those included within aggregated databases. For the latter, you will typically find such a list only where it has proven feasible to create one, either because the number of journals comprised in the database is manageable or because the task is facilitated by a publisher-supplied inventory.
So, then, libraries have created various workarounds, most involving Web listings of their electronic journals. This is an unfortunate but largely unavoidable consequence of technological change coming to the library in varying stages. It greatly troubles me because it represents duplicate, wasted effort at a time of diminished resources and because it takes effort away from the centrality of the catalog. One might argue that in an age of digital collections the catalog is destined to lose its centrality anyway. So why should I worry?
To this argument I have two objections. The first is that I distinguish between textual journals, whether paper or electronic, whether born-digital or digitally reincarnated, and other sorts of digital resources that live on the Internet. Moreover, I believe that our users do so as well. Secondly, I am emboldened by a survey that Indiana University-Bloomington conducted recently of its users' behavior regarding ejournals in which they discovered that their users prefer search and discovery that begins with the catalog, not with a Web page listing. I feel this assertion is corroborated by plans that Pat Sabosik and Daviess Menefee of Elsevier Science Direct made at a conference sponsored by Der Kooperative Bibliotheksverbund Berlin-Brandenburg (KOBV) in June 2000. (As a matter of product placement, I should note that KOBV is an Ex Libris ALEPH 500 consortium.) It is their belief that local systems will serve as the primary information portal for ejournals and that more types of information and linking will begin to occur within the local systems context. The information paradigm of the future will soon resemble this flow:
* Discover ==> Navigate ==> Locate document ==> Order document ==> Obtain document (Document delivery)
And what of digital resources not under bibliographic control in the catalog proper but still somehow listed as part of a local or remote digital collection? How will the users access them? My fond expectation is that we are beginning to see the development of search interfaces and search engines that will simultaneously search and unify heterogeneous databases consisting of MARC records in all the bibliographic formats, as well as databases comprising all present and future data structures: Dublin Core, EAD, TEI, CIMI, VRA Core, MAB, MAB2, etc. This is the promise of the new MetaLib(tm) product already being tested by Ex Libris in several European libraries. It is a recognition that one standard will no longer rule the roost and that we have great need of a Universal Gateway for our users. It can or will handle many of the data format and structure issues, but by itself it will not handle different authority conventions. We need to adopt Barbara Tillett's approach to do that.
By channeling our bibliographic energies in this direction we can maintain the integrity of our traditional authority controlled databases and expand our concepts of bibliographic control in new directions. It will not be easy to accomplish this, and not just because of the requirements of crosswalks and data mappings, but rather precisely because of the difficulties of conjoining files subject to different versions of authority control or non-control.
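To make the point about crosswalks concrete, here is a minimal sketch in Python of a data mapping from Dublin Core elements to MARC-style fields. The tag and subfield choices in the table are assumptions made for illustration, not an official crosswalk, and the function name `crosswalk` is hypothetical. Note what the sketch does not do: it copies each creator string verbatim, so a name that was never under authority control arrives in the target file just as uncontrolled as it left.

```python
# Illustrative mapping only: Dublin Core element -> (MARC tag, subfield code).
# These tag choices are assumptions for this sketch, not an official crosswalk.
DC_TO_MARC = {
    "title": ("245", "a"),
    "creator": ("720", "a"),
    "subject": ("653", "a"),
    "description": ("520", "a"),
    "publisher": ("260", "b"),
    "date": ("260", "c"),
}


def crosswalk(dc_record):
    """Map a Dublin Core record (element -> list of values) to MARC-style
    (tag, subfield, value) tuples. Returns (fields, unmapped_elements)."""
    fields = []
    unmapped = []
    for element, values in dc_record.items():
        target = DC_TO_MARC.get(element)
        if target is None:
            # No rule for this element; report it instead of guessing.
            unmapped.append(element)
            continue
        tag, code = target
        for value in values:
            # Values are copied as-is: no authority lookup happens here.
            fields.append((tag, code, value))
    return sorted(fields), sorted(unmapped)


if __name__ == "__main__":
    record = {
        "title": ["Adventures of Huckleberry Finn"],
        "creator": ["Mark Twain"],  # uncontrolled form of the name
        "rights": ["Public domain"],
    }
    fields, unmapped = crosswalk(record)
    for tag, code, value in fields:
        print(f"{tag} ${code} {value}")
    print("Unmapped elements:", unmapped)
```

The mapping table handles the format question; reconciling "Mark Twain" with an established heading is a separate step that no field-to-field table can supply.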
There is one further aspect to all this, too, and that is user preference and user understanding. In general, I have maintained that a Mulver record best serves users provided that the display can be made clear and sensible. That means that all the information that is pertinent only to the reproduction needs to be clearly associated with the reproduction. Problems still abound for issues of searching and proper indexing where format or date of reproduction issues are concerned. (Though I would hope that dual indexing of 008 and 006 format types and somehow capturing and indexing original date and date of reproduction be factored into the equation as well.) Still, as a user in a long ago graduate lifetime, I will say that finding all the relevant records, regardless of format, in a single, intelligible bibliographic construct would have been the arrangement I would have found most appealing. As I have already commented, it has advantages and disadvantages for technical services, and the short-term solution many have opted for may not in fact be the best long-term solution if we can get content providers and system vendors to cooperate on a shared approach to the problem.
* My second observation concerns ancillary or adjunct data.
To start with, you must understand that I do not mean 'ancillary' or 'adjunct' in a demeaning or derogatory fashion. Among the oldest examples that come to my mind are A&I databases and tables of contents, such as those pioneered by Blackwell North America and others in the late 1980s and early 1990s. For tables of contents, a library would subscribe to some or all of a set, or purchase TOC data for individual titles. Local systems have developed the means to load individual tables of contents records into the actual OPAC where the TOC data is generally keyword searchable. Along with similar sorts of data, such as dust jacket blurbs or back of the book data, there is a wealth of useful data here for our users. Tables of contents often offer a contents-rich vocabulary on the chapter level that goes far beyond the limitations of our controlled subject terminologies and therefore has great appeal to our users. Moreover, while our copy-cataloging staffs who labor in a manual environment have long been held hostage to limitations on the length of those TOCs, scanning technology and, even better, electronic texts offer elegant solutions to the problem posed by manual keying. A few years ago the inventive and technologically adept wizards at the Library of Congress embarked on twin attempts to enrich their records with TOC data through their electronic CIP and business reference books experiments. In a sign of the dynamic future of the record that I am envisioning, their business project did not directly and physically encapsulate this information within the record, but rather pointed to it with a URL. (Of course, this is also one way to overcome some of the limitations of length in the MARC record.)
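The keyword access to tables of contents described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record identifiers and TOC texts are invented, and the functions `build_toc_index` and `search` are hypothetical names, not features of any local system. The index maps each word to the records whose TOC contains it, and it works the same way whether the TOC text is stored in the record or held elsewhere and merely pointed to.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its distinct alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_toc_index(toc_by_record):
    """Build an inverted index: word -> set of record ids whose TOC has it."""
    index = defaultdict(set)
    for record_id, toc_text in toc_by_record.items():
        for word in tokenize(toc_text):
            index[word].add(record_id)
    return index


def search(index, query):
    """Return the ids of records whose TOC contains every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Intersect the posting sets of all query words (implicit AND).
    return set.intersection(*(index.get(word, set()) for word in words))


if __name__ == "__main__":
    # Hypothetical records; in practice the text might be fetched from a
    # repository that the catalog record points to.
    tocs = {
        "rec1": "1. Cataloging rules -- 2. Authority control",
        "rec2": "1. Metadata schemes -- 2. Authority files",
    }
    index = build_toc_index(tocs)
    print(sorted(search(index, "authority")))
    print(sorted(search(index, "authority control")))
```

Because the index is separate from the record, the length of a table of contents no longer has to fit inside the record itself.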
TOC data is but one piece of a constellation in a galaxy of similar constellations: why do we not add back of the book indexes, author portraits, author pictures, fly-leaf information, back cover information, summaries, or book reviews? Along with TOC data, these are parts of my dynamically enhanced record. To represent that perspective, I have invited Jeff Calcagno to join us from Syndetic Solutions, Inc. Syndetic Solutions, whose motto is "Bringing books and readers together", is in the catalog enrichment services business. In addition to tables of contents, they provide fiction and biographical descriptors, cover images, author notes, and summaries and reviews. I know from my brief stint at Indiana that reviews were a hot topic among the CIC libraries, and reviews and TOC data are how I first came to learn of Syndetic Solutions. Among the challenges here is that of linking up TOC data held in a central repository with a database that will have both its retrospective and ongoing aspects and adding that magical TOC button-or the magical added info button-to let the user know more information is available. Beyond that there is the technical challenge of continuing to provide keyword access to the valuable TOC metadata if the TOC metadata is not actually already resident in the catalog but only retrieved when requested.
* Third, there is the challenge posed by the rapidly growing body of full-text digital books and providing appropriate metadata for them.
I am delighted, therefore, that Lynn Connaway, Vice President for Research and Development of netLibrary, is here today to talk to us about the role of metadata in their strategic plans.
In my future paradigm that starts with Discovery and ends with Ordering and Document delivery, inclusion of titles such as those of netLibrary and others provides a crucial link. The library and the library catalog become conduits not only for discovery and for the full-text, but potentially also for the user to obtain personal copies. Needless to say, this is not limited to delivery of electronically available documents. The Harvard Business School's Baker Library is an example of a library that showcases its new books on its electronic bookshelf and makes available direct links for ordering. This is an example of non-traditional document delivery-one where the user pays for the privilege of having his or her own printed copy and having it delivered direct to the doorstep-and it is also an example of institutional entrepreneurship where the library receives a percentage of the purchase price for funneling the purchaser to the online bookstore.
* I have one final, fourth bibliographic challenge. This is the world of the Internet at large.
This is a universe that is largely uncontrolled and, I dare say, uncontrollable except insofar as Internet search engines such as AltaVista or Excite or Lycos can make any claim to indexing and retrieval. But, even within this vast and chaotic jungle where a new Internet domain is registered every 3 seconds day and night, there are pockets of rationality and hope. I say 'pockets' because I think it too much to think that we could or would even want to catalog the Internet. Such an attempt would be not only doomed to failure, it would be a foolish exercise in cataloging an electronic rubbish heap. We used to have a needlepoint sign in the Office for Systems Planning and Research at Harvard University: "The DUC (i.e., The Distributable Union Catalog) is not a Dump." That referred primarily to the quality of the catalog records, but today we could as easily say that it refers to the quality of the resources for which we want to provide quality metadata. We do not want to consume our precious and limited talents on individual homepages or on junk-like resources.
So, with this in mind, I am very attracted by the assertion of OCLC's Office of Research that controlling 100,000 top-level, intellectually viable and valuable Web sites in a cooperative venture built on human investments would provide a solid and proportionate investment in control of the Internet. That having been said, I still have to admit a residual attraction to the long-espoused notion that those who create meaningful Internet resources-you will have to determine for yourselves precisely what 'meaningful' connotes to you-should be offered an appropriate, simplified template for metadata self-creation. At least for data archives such as preprint servers and similar repositories I see this as an integral registration service that is the best approach to creating the item-level metadata records that are so important in Internet discovery.
I do not intend to do more than mention non-textual Internet realms, such as those of art images, music, etc. They require more expertise than I can bring to bear on them, but they, too, require metadata, obviously even more than their textual counterparts since otherwise they are basically un-index-able.
The bottom line in all this is the need for widespread cooperation and for standards. It is, or at least should be, a fundamental tenet of all cooperative ventures that there be standards embedded in them, and it is likewise fundamental that there be widespread consensus in standards building. There is nothing more frustrating than trying to bring order out of a bibliographic chaos that could have been avoided had appropriate standards been established from the get-go. I cannot help but think of all the hard work we would have made unnecessary had we as a profession adopted Pinyin instead of Wade-Giles as our standard for Chinese Romanization 20 years ago when the body of Chinese bibliographic data was relatively small. Yet, that is where we find ourselves because librarians, specifically technical services librarians, were not fully involved in either the birthing of the Web as a home for Internet resources or in the first attempts at devising a non-cataloging metadata standard. As libraries move with greater assurance into digital collection development both on the creation and collection levels, it will be paramount that we get our bibliographic control house in order.
But that is not and cannot be the entirety of the argument. We need to accept the fact that more materials, both printed and digital, will always demand more of our attention than we can possibly accommodate. As I observed earlier, this was true in a print-only world and it is exponentially true now in the current environment. We need to consider the feasibility of two complementary approaches to this overwhelming bibliographic tsunami.
The first is employing more technology, and more sophisticated technology at that. I have in mind machine analysis of digital documents. For PDF documents this will involve OCR conversion as the basis for full-text indexing and then automatic metadata creation based on the OCR derived output. For documents that are already XML-based, as I expect will ultimately be the case for most non-traditional metadata and ultimately even for most MARC data, we will see automatic metadata creation or, perhaps for an indeterminate period of time, machine-assisted, cataloger-approved metadata creation. This is the plan espoused, for instance, by Ex Libris for its upcoming DigiTooLibrary(tm) product. This will follow the model of the mid- to late-1990s authority record creation that has come to be a staple of participants in the NACO program. It has proven itself successful and been a tremendous help in automating the more mundane, predictable parts of authority control.
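For the XML case, the machine-assisted, cataloger-approved workflow can be sketched as follows. This is a minimal Python illustration, not a description of any vendor's product: the element names (`title`, `author`, `date`), the sample document, and the functions `draft_record` and `approve` are all assumptions made for the example. The machine proposes a draft from the document's own markup, and nothing counts as a record until a cataloger has reviewed it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are assumptions.
SAMPLE = """<document>
  <front>
    <title>Metadata and the Catalog</title>
    <author>Doe, Jane</author>
    <date>2000</date>
  </front>
  <body><p>Text of the document.</p></body>
</document>"""


def draft_record(xml_text):
    """Machine step: pull candidate metadata out of the document markup."""
    root = ET.fromstring(xml_text)
    draft = {"status": "needs-review"}
    for name in ("title", "author", "date"):
        element = root.find(f".//{name}")
        has_text = element is not None and element.text
        draft[name] = element.text.strip() if has_text else None
    return draft


def approve(draft, corrections=None):
    """Cataloger step: apply any corrections and mark the record approved.
    The draft itself is left unchanged."""
    record = dict(draft)
    record.update(corrections or {})
    record["status"] = "approved"
    return record


if __name__ == "__main__":
    draft = draft_record(SAMPLE)
    print(draft)
    # The cataloger replaces the transcribed name with an established form.
    print(approve(draft, {"author": "Doe, Jane, 1960-"}))
```

The split into two functions mirrors the division of labor: the predictable transcription is automated, while judgment about headings stays with the cataloger.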
If we are to see this concept through to implementation, though, more is needed than just standards and technology. What is needed will be true partnerships and true commitment, substantially greater than we have now in OCLC or RLG where librarians often pay lip service to cooperation and to trust. Rather, they maintain lists of acceptable and non-acceptable libraries and are constantly re-inventing the bibliographic record to meet their own, internal and internalized standards. I have said it often before in venues devoted to issues of cataloging of printed materials, and let me say it here as well. We cannot afford such arbitrary distinctions. The days of golden records are long gone. I once heard Sarah Thomas quote Winston Tabb as saying, "What good is a bibliographic record if it is not there when we need it and at a price we can afford?" This is all too true. If the world is simply not to pass us by as an anachronistic profession, or if we in technical services are not to become an archaic sub-profession, we must adapt and seek out new partnerships. These partnerships will be technology-based and help us to do more faster and more accurately, and they will also be based on new arrangements with content creators, both individual and corporate, and with corporate content providers. These partnerships should be based on mutual understandings of what is desirable and what is possible, and a realization that not all that is desirable in terms of traditional bibliographic control is within our grasp even with the most advanced technology.
OCLC's CORC project is an attempt in this direction. I have been watching the development of the software for about 2 years now. It has the right elements in its repertoire, it has the correct pieces of an Internet toolkit, and OCLC has spent millions of dollars of research time on it. Is it all that we want or hope? No, at least not yet. But it does show that metadata about the new Internet frontier is crucial to our collective future well-being.
I was recently reminded of a pithy aphorism that I formerly used in talking about standards and workflows: "Better is the enemy of good enough". Regina Reynolds, of the National Serials Data Program, put a name to my saying when she quoted Voltaire as saying, "The perfect is the enemy of the good." My saying is a slight variation on Voltaire, to be sure, and I had no particular source for it. I have to admit, with some sense of embarrassment, that it does not sound particularly self-flattering to admit that you are willing to settle for less than perfection. To return to Matthew Beacom's comment about Naïve Metadata, the fact is that searching and linking that is dependent on metadata can only be as complete or as precise as the metadata itself is complete or precise. But, in telling my former staff at Indiana that there is the Good and the Good Enough, I also pointed out that metadata description, taken in its full context of description and classification, is an art, not a science, and certainly not an exact, reproducible science. Those of us who have long considered ourselves the guardians of the bibliographic universe need to broaden our horizons and recognize that the world is not as narrow as we once defined it. It requires new approaches, new technologies, and new allies. Let us accept that and frame the debate in wider, more inclusive terms than formerly.
I do not want to end on a 'down' note. The subject is too important and the consequences of failure or failure to act are too high. So I want to draw your attention to some recent research and a new opportunity to link together our disparate metadata resources. This comes in the form of a relatively new adjunct to the Net and the Web, and I became aware of it, coincidentally, as I left almost 30 years in academia for the vendor world. This is the technology created by Herbert van de Sompel and his colleagues at the University of Ghent. Called SFX, for Special Effects-and not to be confused with the SFX technology that exists primarily to deliver audio-visual resources over the Web-this is nothing short of a revolution in how we should envision research on the Web. The University of Ghent is an Aleph 500 site, so it is no surprise that Ex Libris, the creator of the Aleph library management system, saw the potential in this product and now has an exclusive right to license this technology. SFX is a framework for context-sensitive linking between Web resources. It is the means to unite or link disparate, heterogeneous electronic resources such as abstracts and full-text, all the while keeping in mind the context in which the user works and that some sources of data may be institutionally more appropriate for that user than others. It also has the ability to link to related occurrences of authors, subjects, and other metadata access points.
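The idea of context-sensitive linking can be illustrated with a small Python sketch. It is emphatically not the SFX implementation: the targets, URLs, institution names, and the function `resolve` are all invented for the example. What it shows is the principle that the same citation metadata yields different links depending on who is asking and on how complete the metadata is.

```python
from urllib.parse import urlencode

# Hypothetical link targets. "needs" lists the metadata a target requires;
# "subscribers" lists institutions entitled to it (None means open to all).
TARGETS = [
    {
        "name": "Full text",
        "base": "https://fulltext.example.org/find",
        "needs": ("issn", "volume", "spage"),
        "subscribers": {"univ-a"},
    },
    {
        "name": "Abstract",
        "base": "https://abstracts.example.org/lookup",
        "needs": ("issn",),
        "subscribers": {"univ-a", "univ-b"},
    },
    {
        "name": "Catalog holdings",
        "base": "https://catalog.example.org/search",
        "needs": ("issn",),
        "subscribers": None,
    },
]


def resolve(citation, institution):
    """Return (label, url) links suited to this citation and this user."""
    links = []
    for target in TARGETS:
        subscribers = target["subscribers"]
        if subscribers is not None and institution not in subscribers:
            continue  # this institution is not entitled to the target
        if not all(citation.get(key) for key in target["needs"]):
            continue  # the metadata is too thin to build this link
        query = urlencode({key: citation[key] for key in target["needs"]})
        links.append((target["name"], f"{target['base']}?{query}"))
    return links


if __name__ == "__main__":
    citation = {"issn": "1234-5678", "volume": "12", "spage": "45"}
    for label, url in resolve(citation, "univ-a"):
        print(label, url)
```

Drop the volume and starting page from the citation and the full-text link disappears, which is the practical meaning of saying that linking can only be as good as the metadata behind it.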
This is truly exciting stuff. Yet I am struck by the notion, heard all too often but I do believe true in this case, that we have hit on one of the Holy Grails of research. The Holy Grail is that of 'seamless interconnectivity'. To back up a step, though, this technology is seamless only because the metadata exists as seams in an information architecture. SFX then takes the seams one step further and turns them into a library-defined seamless whole. Please note, that is 'whole' and not 'hole'.
What is most gratifying about the SFX solution to reference-linking is its unabashed reliance on metadata. The fuller and more accurate it is, the better the reference-linking that will result. In a world with result sets numbered in the 1,000s or 10,000s, precision and recall are constantly at odds. I have therefore found myself intrigued by the concept of metadata as the 'wrapper' of Internet resources. Now, it is true that wrapper has a specific meaning beyond what I am according it here. To me, if I can make use of a bit of license, the metadata wrapper is as simple as that of the gift wrap of a present or of a candy bar wrapper. In the case of a present, the purpose of the gift-wrapping is to disguise or even hide the contents. In the case of a candy bar, the wrapper serves to convey all the information required to accurately know what is contained inside: the brand of candy bar (= title), the manufacturer and place of manufacture (= imprint), the weight (= collation). It also has a key number (scannable barcode), list of ingredients (= table of contents?) and nutritional information (based on a 2,000 calorie/day diet), and the notation that it is Kosher-Dairy. So, for example, my half-serious attempt at cataloging a Nestle(r) Crunch(r) candy bar:
Nestle(r) Crunch(r) [candy bar].-Glendale, CA, USA : Nestle USA, Inc., Confections Division, . 1.55 oz. "28000-13170." "OU-D." Nutrition facts: Calories, 230; Fat cal., 100. Ingredients: milk chocolate ...
This is important metadata because not only does it ensure that I do not buy a bag of M&Ms(r) when what I really want is a Crunch(r) bar, but it gives me all the essentials except the price. Content providers could do worse than imitate Nestle(r) Inc. What I do not want them to do is give me an information present enclosed in metadata gift wrapping that gives no clue to the information present's true identity or, even worse, to give me the information present with no metadata wrapper at all. Think of it as an anonymous candy bar or a tin can that has lost its label. Who would have any interest in such an object?
Those of us who are devotees of and believers in bibliographic control need to recognize the absolutely pivotal role metadata has in the new information economy. Many times in recent months I have heard that we need to claim our rightful place as information managers. Moreover, the truth is that the information we now must manage is of an entirely different magnitude than what we faced before. If we ever thought that we could manage it alone-and I think we were mistaken if we thought we could-the fact is that we can no longer do so. We need to seek out and develop our natural partnerships in the information and systems communities and make certain that these partnerships are based on a shared vision.
January 23, 2001