DATE: May 26, 1995
NAME: Defining a Generic Author Field in USMARC
SOURCE: OCLC/NCSA Metadata Workshop
SUMMARY: This paper discusses the options for recording author names in USMARC records that do not use standard cataloging rules. The OCLC/NCSA Metadata Workshop held in Dublin, Ohio in March established a list of core data elements needed for discovery and retrieval of Internet resources ("metadata"). These included "Author" and "Other Agent", which do not distinguish between personal and corporate names. This paper suggests three options for mapping these to USMARC: 1) choose a single already established field to be used for generic author despite formal definitions; 2) relax the definition of the 700-711 fields to include a value in the first indicator for "unknown or not specified"; or 3) define a new, repeatable USMARC field for names of authors not formulated according to cataloging rules.
RELATED: DP86 (June 1995)
KEYWORDS: OCLC/NCSA Metadata Workshop; Dublin core data elements; Author
5/26/95 - Forwarded to USMARC Advisory Group for discussion at the MARBI meetings.
6/26/95 - Results of USMARC Advisory Group discussion - Discussion indicated that there was interest in a proposal for a generic name field with the following characteristics:
- field is to be for uncontrolled names;
- first indicator for type might have only personal broken out, options needed;
- no subfield for other information but a subfield $e for relationship is needed, the role would indicate whether the entity is an author or other agent.
DISCUSSION PAPER NO. 88: Defining a Generic Author Field

1. INTRODUCTION

According to the General Introduction to the USMARC Concise Formats, the USMARC record is composed of three elements: record structure, content designation, and data content. Content designation for bibliographic records is defined in the USMARC Format for Bibliographic Data. The data content itself is usually prescribed by standards outside the USMARC formats, including ISBD, AACR2, and various thesauri. For example, USMARC content designation accommodates a topical Library of Congress subject heading in a field tagged 650 with a second indicator of value "0". The external document Library of Congress Subject Headings (LCSH), however, provides the authority for the data content by defining valid and well-formed topical LC subject headings.

The USMARC bibliographic format was designed primarily to support library cataloging, and in particular to support the Anglo-American Cataloging Rules. Therefore there are many data elements defined in USMARC that relate specifically to particular cataloging constructs. It is generally accepted that when USMARC is being used to represent a cataloging record, the cataloging rules should govern the data content whenever applicable. When a USMARC record is created for some purpose unrelated to cataloging, fields can be used for data congruent with the field definition, even if the data content is not formulated according to the cataloging rules.

For names of authors and other agents responsible for all or part of the intellectual content of the work, there is a particularly close relationship between the content designation defined in USMARC and the cataloging rules. It often surprises non-librarians to discover that there is no MARC field defined specifically for author, but rather sets of fields defined for main and added entries, concepts that exist only within certain cataloging codes and that also encompass a number of non-authorial relationships.
Such integral support for cataloging is clearly a desirable feature in the bibliographic formats, but it also raises problems when creating bibliographic data for other purposes. For one thing, since the 1XX and 7XX tag ranges are defined explicitly in terms of the cataloging concepts main and added entry, it is difficult to use them in an environment that lacks these concepts. A second problem with the 1XX and 7XX content designation in USMARC is that to encode it properly, one has to know quite a large number of things, including the author's relation to the work, whether the name in question is that of a person, corporate body, or meeting, and, to set the first indicator correctly, the form of the entry element of the name. There is no option in USMARC to choose not to supply any of this information or to indicate that it is unknown. This can pose a barrier to use of these fields for non-cataloging purposes.

While storing and communicating library cataloging data is without doubt the predominant use of the USMARC bibliographic formats, there are other uses commonly made of the formats which certainly are to the advantage of the library community. To name only a few:

- A bibliographic record might be used in a library acquisitions system for the purpose of creating a purchase order. In this case, the data is not cataloging data, and the acquisitions clerk creating the data may neither know the rules governing form and choice of entry nor have sufficient information in his citation to assign content designation congruent with the cataloging rules.

- Bibliographic records might be created for the purpose of generating a set of references (endnotes or footnotes) according to some external authority such as the Chicago Manual of Style, which has quite different rules for citing authors' names than the cataloging standard. For example, the seminar papers published in 1974 under the title Networks for research and education: Sharing of computer and information resources nationwide had four editors. According to the Chicago Manual, the names of all four editors should be listed before the title. According to AACR2, there would be no main entry and only the name of the first editor would be recorded as an added entry.

- An increasingly common use of USMARC bibliographic records is as a vehicle for metadata created by various communities according to various other standards. For example, the Government Information Locator Service (GILS) defines a set of GILS Core Elements and specifies that these must be represented in three different record syntaxes, one of them USMARC. More recently another standard known as the Dublin Core Element set was proposed for describing network accessible electronic resources. The Dublin Core, with only minor variations, is also being incorporated into the emerging IETF standard for Uniform Resource Characteristic (URC).

This last use is gaining importance in the networked environment where libraries are only one player in an increasingly complex system of information creators, publishers and disseminators. There is clearly great utility in being able to represent metadata created according to standards other than AACR and AACR2 in USMARC. The data, which is inherently bibliographic in nature, can then be edited and manipulated by the many existing software packages for processing MARC bibliographic records, and the records can be integrated into existing library catalogs and searched by MARC-based bibliographic retrieval systems. Both the library community and the information providers benefit.

When it is possible to map other metadata element schemes into MARC content designation, in general the most problematic element is the author.
The GILS Core Element set has no element for author in the sense of AACR2, but does have an element for originator which identifies the originator of the information resource. This is by convention mapped to the USMARC 710 field. The x10 was chosen because GILS originators can be assumed to be government agencies, and the 7XX block was chosen over the 1XX block because of its repeatability.

The Dublin Core, which is more fully described in Discussion Paper No. 86, contains two data elements for names of entities responsible for intellectual content, Author and Other Agent, which are not necessarily governed by cataloging rules and which map only imperfectly to USMARC 1XX and 7XX fields. In its simplest form, the Author element could be recorded simply as:

    AUTHOR = Miller, Bruce

with no indication of the relationship of the author to the work. A cataloger converting this data to USMARC is likely to be able to infer that this is a personal name, but may be less likely to know whether Mr. Miller is related to the resource in the capacity of main or added entry.

The Dublin Core does allow for qualifiers which, if extensively used, could provide enough information about a name for accurate human or even machine mapping to USMARC, assuming the name was formulated according to AACR. The "scheme" qualifier can be used to specify the cataloging ruleset; the "type" qualifier can specify personal, corporate or other authorship; and so on:

    AUTHOR (scheme = AACR2, role = Main Entry, type = Personal, form = Single Surname) = Miller, Bruce

However, since the Dublin Core was defined expressly for the purpose of encouraging metadata creation by non-catalogers, the likelihood of this information being supplied for the majority of objects is low.

2. SOLUTIONS

A possible solution, and the one in common practice now, is simply to ignore the formal definition of the 1XX and 7XX fields and choose a single field to be used for authors generically.
Usually a 700 or 710 is chosen, either to avoid the implication of main entry or because these fields are legally repeatable. The advantages are that this can be easily done, no change to USMARC is required, and these fields are generally treated appropriately by relevant software programs. There are also disadvantages: it is formally an inappropriate use of the data element, the mapping may actually be incorrect, and systems cannot distinguish data content that is properly represented from data content that is not. A significant problem is that the 1XX and 7XX fields have a close relationship to name fields in the authorities format, and many local systems are designed to require some form of linked or unlinked authority control on name headings, making extensive use of non-standard and non-authority controlled data in these fields problematic.

A second option would be to change the USMARC bibliographic formats to relax the definition of the 700-711 tag range. Rather than being described as added entry fields, they could be redefined as appropriate for names not known to be main entries. A value for "unknown or not specified" could be added to the first indicator position (Type of ... name entry element) [See also DP85] and a value for "generic name entry" added to the second indicator (Type of added entry). This option would still require that personal, corporate and meeting names be distinguished from each other (which might be done by inference as for GILS data or by format recognition), with the concomitant disadvantage that this distinction would often be subverted or erroneous.

A third option would be to define a new, repeatable USMARC field explicitly for names of authors not formulated according to the cataloging rules or contained in an authority file or list. This has been done in the format for other types of access points: field 653 (Index Term -- Uncontrolled), which may contain subject terms (topical, name, etc.) that are not controlled by an authority file or list, and field 740 (Added Entry -- Uncontrolled Related/Analytical Title), which serves the same function for title access points. This "generic author" field would not distinguish between main and added entry, and would not require the distinction between types of authors. Internal content designation would be optional and kept to a minimum. A possible definition of a field for generic author would be:

    720 (R)  Author

    Indicators
      First    Type of name
                 #  Unknown or not specified
                 1  Personal
                 2  Corporate
                 3  Meeting
                 4  Other
      Second   Undefined; contains a blank

    Subfield codes
      $a  Name
      $b  Other information

    720 1# $aBlacklock, Joseph

3. QUESTIONS

1. No matter which option is used, integrating cataloging and non-cataloging data in bibliographic systems raises problems for indexing, display and retrieval. Are we more likely to want to integrate "generic" authors with standard name fields or to segregate them in separate keyword or alphabetical indexes? Which option would have the least adverse effect on existing systems? Which would give most flexibility in treatment?

2. It is possible that we should think of a generic "agent" field that would allow relationships other than author to be recorded. In this case the field name could be "720 (R) Name (or Name -- Uncontrolled)". The subfield for "other information" could be used for relator information, or a third subfield defined specifically for this.

    720 1# $aVonderohe, Robert$b1934-$eeditor
    720 2# $aCAPCON Library Network$eauthor

3. If we define a generic "author" or "agent" field, how much content designation should be provided? Would additional subfielding be useful or impractical if its use were optional? If, for example, a Dublin Core element with qualifiers was being mapped, where would qualifier data be recorded?

4. In a generic "author" or "agent" field, is there any virtue to using the second indicator position to indicate type of entry or form of name? It might be useful to record if a personal name in the $a is in inverted form, direct order, or unknown order, so that alphabetical indexes could be limited to only inverted forms of names.
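
To make the third option concrete, the proposed 720 definition in section 2, together with the $e relator subfield raised in question 2, can be sketched in code. The Python below is an illustration only and is not part of any proposal: the function name format_720, the TYPE_INDICATOR table, and the display convention of writing "#" for a blank indicator are assumptions made for this example. Every qualifier is optional, so an unqualified Dublin Core Author value such as "Miller, Bruce" still produces a well-formed field.

```python
# Hypothetical sketch of the proposed 720 field; all names are illustrative.

# First indicator values taken from the proposed definition.
# "#" is used here as the display form of a blank indicator.
TYPE_INDICATOR = {
    None: "#",         # unknown or not specified
    "personal": "1",
    "corporate": "2",
    "meeting": "3",
    "other": "4",
}


def format_720(name, name_type=None, other_info=None, relator=None):
    """Return a display-form 720 field for an uncontrolled name."""
    if not name:
        raise ValueError("subfield $a (name) is required")
    key = name_type.lower() if name_type else None
    if key not in TYPE_INDICATOR:
        raise ValueError(f"unrecognized type of name: {name_type!r}")
    # The second indicator is undefined in the proposal, so it is always blank.
    field = f"720 {TYPE_INDICATOR[key]}# $a{name}"
    if other_info:
        field += f"$b{other_info}"
    if relator:
        # $e is the relator subfield suggested in question 2.
        field += f"$e{relator}"
    return field


if __name__ == "__main__":
    # Unqualified Dublin Core author: type of name is not specified.
    print(format_720("Miller, Bruce"))
    # Qualified name with dates and a relator term.
    print(format_720("Vonderohe, Robert", "Personal", "1934-", "editor"))
```

Run as a script, this prints "720 ## $aMiller, Bruce" followed by "720 1# $aVonderohe, Robert$b1934-$eeditor". Using a single lookup table with a default for "unknown" reflects the main point of the third option: the person creating the record is never forced to assert a type of name, a form of entry, or a main/added entry decision that he or she does not know.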