DATE: May 5, 1995
REVISED:
NAME: Mapping the Dublin Core Metadata Elements to USMARC
SOURCE: OCLC/NCSA Metadata Workshop; Library of Congress
SUMMARY: This paper reviews the discussions held at the OCLC/NCSA Metadata Workshop in Dublin, Ohio in March about core data elements for discovery and retrieval of Internet resources ("metadata") by a diverse group of Internet users. The data elements as defined by the participants at the Workshop are listed with possible equivalents in USMARC. Problems in mapping are reviewed and options for resolutions are suggested.
KEYWORDS: OCLC/NCSA Metadata Workshop; Dublin core data elements; Internet resources
RELATED: DP87 (June 1995); DP88 (June 1995); 94-9 (June 1994)
STATUS/COMMENTS:
5/5/95 - Forwarded to USMARC Advisory Group for discussion at the June 1995 MARBI meetings.
6/26/95 - Results of USMARC Advisory Group discussion - Participants were interested in this effort and wanted to know how it would be implemented. The discussion paper will be reposted on the USMARC list, and the OCLC/NCSA Metadata Workshop Report will also be posted to continue discussion. Specific comments were as follows:
1) Date was unclear. It either should be generalized or more specifically defined.
2) It would be preferable to merge Author and OtherAgent and make them one element.
3) Attention needs to be given to the question of version; there was no specific element for it.
4) Questions arose as to why Coverage was included, since it is not generally applicable. In addition, the term "Coverage" is vague and another term might be considered.
5) Abstract would be a useful element; it was explained that this data could be under Subject with a scheme subelement=abstract.
6) It was suggested that 040$e be used to identify the origin of the metadata as outside the traditional cataloging record. Another option is in Leader/17 (Encoding level) as partial or preliminary. This needs to be considered further.
7) ObjectType: needs further consideration. Much of this is in 516.
DISCUSSION PAPER NO. 86: Mapping the Dublin core metadata elements to USMARC

I. INTRODUCTION

The USMARC Advisory Group has discussed the creation of USMARC records for Internet resources several times and has modified the USMARC bibliographic format in several ways to accommodate them. Field 856 (Electronic Location and Access) was first defined in January 1993 with Proposal No. 93-4 to provide location and access information for electronic resources, and Internet resources in particular, as a result of the OCLC Internet Resources Project. The field has been modified and enhanced in several proposals since its initial approval. Most recently, Proposal No. 94-9 (Changes to the USMARC Bibliographic Format to Accommodate Online Systems and Services) was discussed in June 1994 to make further changes in the bibliographic format to allow for the creation of records for online systems and services. That paper included a list of data elements needed for the description of these types of resources and their mapping to USMARC fields.

The OCLC/NCSA Metadata Workshop, held in Dublin, Ohio, March 1-3, 1995, was organized by OCLC and the National Center for Supercomputing Applications (NCSA) to address the problem of providing metadata for a larger proportion of network-accessible materials. ("Metadata" is defined as data about data; it is roughly equivalent to a bibliographic description.) The original intent was to recognize various "stakeholder" communities with an interest in the search and retrieval of Internet resources, to understand the uses descriptive metadata would serve for these communities, and to achieve if possible some consensus on a limited data element set for identifying these resources. Workshop participants included librarians and archivists, researchers, computer and information scientists, software developers, publishers, and members of Internet Engineering Task Force (IETF) working groups.

Within these constituencies there was tremendous diversity of approach. Some participants were concerned with electronic data resources in general while others focused on particular types of materials, such as humanities texts or geospatial metadata. Some were interested in the network services and protocols that would make use of the metadata, while others took the point of view of the author, publisher or end-user. The one thing that united all participants was a belief that nearly any standard metadata would be better than none, since currently there is little agreement and no standardization.

Nonetheless, early in the course of the workshop it became evident that no single data element set, whether limited or unlimited, would satisfy the widely divergent and highly specific needs of the various stakeholders. The emphasis therefore shifted to something that was perceived as both useful and doable: the definition of a simple data element set that could be used by information providers to describe their own resources. The goal was to draft a single sheet of instructions that an author or publisher mounting a document on a network server would be able to follow without excessive effort or additional knowledge.

Such a data element set, if it could become an official or de facto standard, would have at least four different uses. It would encourage authors and publishers to provide metadata simultaneously with their data. It would allow the developers of authoring tools for network publishing to include templates for this information directly in their software, making it even easier for the information providers to supply it. The metadata created by the information providers would serve as a basis for more detailed cataloging or description when warranted by specific communities. And it would ensure a common core set of elements that could be understood across communities, even if more specific information was required within a particular interest group.

In order to make the task more manageable, some limitations on scope were imposed. First, the universe of materials to be described was limited to document-like objects, or DLOs. The universe of document-like objects itself was left undefined, but intuitively this would seem to include certain things like texts and digitized photographs, and to exclude others like human beings and computer services. Second, the data elements themselves were limited to those supporting discovery and retrieval of the DLO; that is, to ascertaining that an object exists and obtaining a copy of it. Enough description should be included to allow the searcher to confirm that the DLO in question is actually the desired object, but not necessarily all of the information one would need to support other valid purposes like security, authentication, purchase or use.

Because it was agreed that a fairly short list of data elements would be the most useful and the simplest for inexperienced users, the concept of extensibility was established. The metadata element set concentrated on describing intrinsic properties of the resource. Extrinsic data, such as cost and access limitations, were considered outside the scope of the core set. The extension mechanism would allow for the base set to be extended for a variety of purposes. Specific user communities may have additional data elements that are of particular importance. Local additions are accommodated by allowing any elements to be added to the record for a resource. A particular user community may establish a list of additional elements that may be incorporated for specialized purposes.

A scheme sub-element is defined for some of the elements in the core set and may also be used for the extensible sets. This allows for the specification of established schemes or sets of rules that govern the syntax or semantics of an element.
(For instance, the URL scheme might be specified for electronic location information; the various subject thesauri might be specified for subjects.)

The metadata element set that emerged is documented in a position paper which is available through OCLC's World Wide Web server at the following URL:

<URL:http://www.oclc.org:5046/conferences/metadata/dublin_core_report.html>

It includes thirteen elements described briefly with examples of their use. These can be grouped roughly into three categories:

 - access points (Title, Subject, Author, OtherAgent, Identifier)
 - information to facilitate identification (Publisher, Date, ObjectType, Language, Form, Coverage)
 - information to relate this object to other objects (Relation, Source)

All elements are optional and repeatable, since the participants did not feel they could predict every specialized use for the metadata. Some elements have additional subelements defined.

The Dublin Core metadata element set is a core set in the sense that it is a small number of elements, judged to have general applicability, that will be universally understood if the standard is followed. It is not a core data element set in the sense of being a minimum number of required elements. The assumption is that while the information provider is encouraged to supply all of these elements, information that is not applicable or not readily available can be omitted. It is also not a core data element set in the sense of being the minimum number of elements adequate to describe an object. As mentioned above, the extensibility mechanism allows for additional data to support other purposes. Any implementation will require an extensibility mechanism to include other elements, either of local significance or pointers to other established element sets (MARC, GILS, TEI, etc.).

Perhaps the most important thing to note about the Core Metadata Element set is that it is syntax-independent; that is, the meaning and content of the elements are defined and described with no necessary relation to any particular encoding or transport syntax. The intent is that the Core set can be mapped to any desired syntax (e.g., USMARC, Standard Generalized Markup Language, etc.). This situation can be compared to a cataloging code such as AACR2 that identifies standard data elements but does not define a format to use them.

It is important to consider the different issues raised depending on whether a human being or a machine performs the mapping to MARC. If a cataloger uses the metadata as a basis for creating a catalog record, appropriate decisions can be made on a case-by-case basis. If the mapping is done by machine, it becomes more problematic. For example, the SCHEME element may be helpful in machine mapping, but only if the content of a SCHEME field is itself taken from an authority list.

II. The "Dublin Core"

Below is the list of core data elements, with definitions and examples where available, that were formulated at the Dublin Metadata Workshop. A mapping to USMARC fields is indicated (formulated by the Network Development and MARC Standards Office, and not at the Workshop). In some cases questions are posed for resolution of the problems in mapping. Note that a mechanism would have to be in place to convert the data from one transport syntax to USMARC.

Subject: words or phrases indicative of the information content. If the value comes from a controlled vocabulary, the SCHEME sub-element is used to indicate which vocabulary.

     EXAMPLES: English language -- style -- data processing
               Dogs

     USMARC: 653 (Index Term--Uncontrolled) or 650 (Subject Added Entry--Topical Term)

This element has a SCHEME subelement defined to indicate the vocabulary.
Thus, Library of Congress Subject Heading terms or other controlled thesaurus terms could be used as data. Field 650 can be used for the subject headings or terms, but there may be cases of incorrect mapping, such as when a geographic name (which should be coded as 651) or a personal name (which should be coded as 600) is used as subject. The Dublin core element set does not distinguish these. The indicator would be set according to the SCHEME. This is an example of the content of SCHEME needing to be authority controlled to be useful for machine mapping. Field 653 can be used if SCHEME is not present, but is less than optimal for controlled subject headings, because the implication is that the headings are uncontrolled.

Title: the title, name, or short description of the object.

     EXAMPLES: Moby Dick: an electronic version
               Photograph of the Empire State Building

     USMARC: 245 (Title Statement)

This could include subtitles. Either everything can be placed in 245$a, or the conversion would have to use punctuation to parse the data into subfields (245$a for title proper; 245$b for subtitle).

Author: the name or creator of the content.

     EXAMPLES: Melville, Herman
               Mao tse-tung
               von Neuman Janos
               von Neuman, John

     USMARC: 100 (Main Entry--Personal Name) or 110 (Main Entry--Corporate Name) or 700 (Added Entry--Personal Name) or 710 (Added Entry--Corporate Name)

Mapping the author brings up several questions. If using 1XX fields, the concept of main entry is not entirely applicable, since main entry is an AACR concept, and there is no assumption that these materials are being described according to library cataloging rules. For our purposes, author could be main entry, but the 1XX fields are not repeatable, so any additional authors would go in 700. Since all elements are repeatable in the Dublin Core, there could be more than one "author" (i.e., person responsible for the general content of the work without a specified role).
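
The non-repeatability of the 1XX fields noted above suggests one possible machine rule: place the first Author in field 100 and every further Author in field 700. The Python sketch below illustrates that rule only. The function name `map_authors` and the (tag, value) tuple representation are hypothetical, and the sketch assumes every name is a personal name, which would mis-tag any corporate name supplied as an Author.

```python
def map_authors(authors):
    """Assign Dublin Core Author values to USMARC tags.

    Assumption (not a decision of the paper): all names are treated as
    personal names, so the first goes to 100 (main entry) and the rest
    go to 700 (added entries), because 1XX fields are not repeatable.
    """
    fields = []
    for position, name in enumerate(authors):
        tag = "100" if position == 0 else "700"
        fields.append((tag, name))
    return fields
```

For instance, `map_authors(["Melville, Herman", "von Neuman, John"])` would tag the first name 100 and the second 700. A rule that avoided the main entry decision altogether, as the GILS mapping did with field 710, would simply return a single added-entry tag for every name.
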
It could be difficult to determine whether the data belongs in Author or OtherAgent, although OtherAgent would always have a role defined, so it could be distinguished from Author on that basis. Distinguishing personal from corporate names is difficult in a mapping, so whichever field is chosen would result in a certain percentage of incorrect mappings. However, we may be able to assume that the majority of "Authors" will be personal names, and that corporate names will probably have a ROLE subelement attached. The USMARC Advisory Group might consider the definition of a generic author field, i.e. an author that is undistinguished by type. See Discussion Paper No. 88 for a discussion of this issue.

The problem of an author element also arose when the Network Development and MARC Standards Office provided a USMARC mapping for the Government Information Locator Service (GILS). The GILS profile used field 710, since it was not desirable to require that a decision be made on main entry. However, for that project, the majority of authors would be government agencies, and under AACR2 would probably not be entered as main entry.

The name should be given in the natural sort order of the language being used.

Publisher: the name of the entity responsible for making the object available.

     EXAMPLES: Oxford University Press
               OCLC
               [Privately distributed]

     USMARC: 260$b (Name of publisher, distributor, etc.)

OtherAgent: the name of any other entity responsible for the content of the object; the ROLE sub-element describes the type of responsibility.

     EXAMPLES: otherAgent role=illustrator: Maurice Sendak
               otherAgent role=compiler: John Bear

     USMARC: 700 or 710 (Added entry--Personal name or Added entry--Corporate name)

The same problem of distinguishing personal and corporate authors is evident here; see above under Author for discussion of the issue. OtherAgent includes a ROLE subelement; this would correspond to 700 or 710$e (Relator term).
The data may or may not be inverted; how this will be handled needs to be resolved. A proposal could be circulated to the Dublin metadata group to include two "OtherAgents" so that we can distinguish between personal and corporate authorship (OtherAgent (Person) and OtherAgent (Organization)).

Date: the date of publication. Specifically not of the content but of the actual object described.

     USMARC: 260$c (Date of publication, distribution, etc.)

There are many dates defined in the USMARC bibliographic format. The only one considered a core element in the Dublin set is the publication date. Note that the date is also given in 008/07-10 in a standardized form (the date in 260$c could include other elements in addition to date of publication). The extensibility mechanism would be needed in many cases, where other types of dates are particularly important (e.g., date of an original for digitized texts).

Identifier: a character string or number used to distinguish this object from other objects; a SCHEME subelement identifies the authority.

     EXAMPLE: Identifier (URL): http://www.oclc.org

     USMARC: 010 (LC Control Number)
             020 (ISBN)
             022 (ISSN)
             024 (Other Standard Identifier)
             856$u (Uniform Resource Locator)

Since all elements are repeatable and a SCHEME subelement is defined for Identifier, it can be mapped to various USMARC fields.

Object-type: conceptual description of the object.

     EXAMPLES: book map graphic illustration

     USMARC: Leader/06 (Type of record)

Specific object types would convert to an equivalent value in Leader/06. For instance, the object type "book" would convert to code a for language material; "map" to code e for printed map (or cartographic material if Proposal No. 95-16 is approved). When more than one value is available, as for sound recordings (nonmusical sound recording and musical sound recording), it may not be possible to make a distinction, but one unambiguous conversion will need to be supplied.
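
The Leader/06 conversion described above can be sketched as a lookup table with a separate list of defaults for ambiguous types. This is an illustration under stated assumptions, not a proposed conversion: the names `LEADER_06`, `AMBIGUOUS_DEFAULTS` and `leader_06_for` are hypothetical, the "online database" entry is taken from Example 2 in Attachment A, and the default chosen for sound recordings (code i, nonmusical, rather than code j, musical; both codes come from the USMARC format, not from the Workshop report) is arbitrary.

```python
# Leader/06 codes named in this paper or in its Attachment A examples.
LEADER_06 = {
    "book": "a",             # language material
    "map": "e",              # printed map
    "online database": "m",  # computer file (Attachment A, Example 2)
}

# Object types that match more than one Leader/06 code. The single value
# chosen here is an arbitrary illustration, not a recommendation.
AMBIGUOUS_DEFAULTS = {
    "sound recording": "i",  # nonmusical (i) versus musical (j)
}


def leader_06_for(object_type):
    """Return a Leader/06 code for an object type, or None if unmapped."""
    key = object_type.strip().lower()
    if key in LEADER_06:
        return LEADER_06[key]
    return AMBIGUOUS_DEFAULTS.get(key)
```

Keeping the ambiguous defaults in their own table makes the arbitrary choices easy to find and revise once a single conversion has been agreed.
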
In some cases, using Leader/06 may not be specific enough, since something similar to a Specific Material Designator (SMD) may be used. In those cases, a code in the appropriate 008 character position might be more appropriate, but how to map these to USMARC is unclear.

Form: physical, logical, or encoding characteristics. (Information as to how it got represented in its current form.)

     EXAMPLES: TIFF ver. 2.3.4.5.6
               SGML / TEI P3-1994

     USMARC: 538 (System Details Note)

Relation: Important known relationship to other objects; the TYPE sub-element describes the nature of the relationship; the SCHEME sub-element identifies the notation used to identify the related object(s).

     EXAMPLE: Relation (supersedes)(url): http://www.oclc.org/cr0.9

     USMARC: 772 (Parent Record Entry); 773 (Host Item Entry); 775 (Other Edition Entry); 776 (Additional Physical Form Entry); 780 (Preceding Entry); 785 (Succeeding Entry); 787 (Nonspecific Relationship Entry)

The TYPE sub-element indicates the relationship being expressed, and thus the fields to be used. The SCHEME sub-element might map to a URL or a record control number. See Discussion Paper No. 87 (Addition of Subfield $l in Linking Entry Fields 76X-78X in the USMARC Bibliographic Format) for a discussion of defining a URL in linking entry fields.

Language: natural language of the object content; the SCHEME element identifies the controlled vocabulary.

     USMARC: 041 (Language code) or 546 (Language Note)

The SCHEME sub-element could identify the USMARC Code List for Languages if 041 is used. Alternatively, field 546 could be used for a textual note. Language is also given in coded form in 008/35-37.

Source: object from which this object was derived; contains a nested object description.

     USMARC: 786 (Data Source Entry) or 776 (Additional Physical Form Entry)

Since this element includes a "nested object description", a linking entry field is appropriate with the separate data elements of the nested description in defined subfields.
Field 786 was recently defined for Data Source Entry. However, the following elements are not available in the linking entry fields: subject, object type.

Coverage: describes the spatial and temporal characteristics of the object and is the key element for supporting spatial or temporal range searching on document-like objects. Coverage can be modified by the qualifiers "spatial" and "temporal".

     USMARC: Spatial: 034 (Coded Cartographic Mathematical Data) or 255 (Cartographic Mathematical Data)

Whether the data is recorded in a coded or textual form would determine which USMARC field would be used. This element was added to the Dublin Core elements as this paper was being finalized; an example shows bounding coordinates as spatial data to be recorded here. In USMARC field 034 coordinates are recorded in separate subfields (westernmost, easternmost, etc.); in field 255 they are recorded in a textual form in subfield $c. Note that the USMARC and Content Standards for Digital Geospatial Metadata (CSDGM) crosswalk uses both fields.

     USMARC: Temporal: 045 (Time Period of Content) or 513 (Type of Report and Period Covered Note)

Again, the data may be recorded in a formatted form as yyyymmddhh in field 045 or in a textual form in 513$b. The example in the Dublin document shows the data as formatted. Note that the USMARC and CSDGM crosswalk uses field 045 for this data. The Government Information Locator Service (GILS) mapping includes both fields, depending on the data.

III. CONCLUSIONS

A series of messages distributed on the USMARC list in 1994 discussed the need for a specific identification of the record as a non-traditional library catalog record, indicating that the description and access points may not conform to expected standards (AACR2, ISBD, USMARC). In this case, the record might conform to USMARC in structure and tagging, but not precisely to the definitions associated with the tags and other content designators.
The Network Development and MARC Standards Office answered that there were several data elements in USMARC that together could be used to show this: Leader/18 set to blank (non-ISBD) or u (unknown), and a code defined in field 042 (Authentication Code) for a particular project that would serve to identify the record as non-standard cataloging (e.g. gils). However, the USMARC Advisory Group might consider the definition of a specific code (perhaps in Leader/18) that would unambiguously indicate the nature of the record.

The USMARC Advisory Group might consider creating more generic fields in MARC to accommodate this type of application that requires flexibility in definition and use of fields. It is possible to map the metadata author or subject elements to MARC fields but, as with some other data, the data elements will not always be completely accurate representations of data that is expected in the fields as they are presently defined. One instance is with the "author" element.

The Dublin core metadata elements will probably be revised and refined as a wider discussion takes place. There is some thought of distributing a document summarizing the work as a Request for Comments (RFC) to the Internet Engineering Task Force. A drafting committee has been formed to continue the effort, which includes representatives from the library community, including MARBI and the Library of Congress. Any resultant document will probably include the following:

 - Goals and scope
 - Underlying assumptions and philosophy
 - List of data element set
 - Extensibility framework
 - Guidelines for application
 - Relationship to other work
 - Thesaurus for multidisciplinary vocabulary

The USMARC Advisory Group will be kept informed of the progress of this effort. It is important to settle on the USMARC mapping early in the process, so that if there are any changes needed to USMARC to better accommodate the metadata elements, work can begin.

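
To make the machine-mapping questions concrete, part of the mapping in Section II can be sketched as a small Python crosswalk. This is a minimal sketch under stated assumptions, not an agreed conversion. The names `SIMPLE_TARGETS`, `IDENTIFIER_TARGETS`, `map_element` and `crosswalk` are hypothetical; a metadata record is assumed to arrive as a list of (element, value) or (element, value, scheme) tuples; the scheme labels "URL", "LCCN", "ISBN" and "ISSN" are assumed spellings; subfield $a is assumed wherever the paper names only a field; and sending an Identifier with an unrecognized scheme to 024 is an assumption. Elements that raise open questions (Author, OtherAgent, Object-type, Relation, Source, Coverage, Language) are deliberately left unmapped.

```python
# Elements that the paper maps to a single field and subfield.
SIMPLE_TARGETS = {
    "Title": ("245", "a"),      # everything placed in 245$a, no parsing
    "Publisher": ("260", "b"),
    "Date": ("260", "c"),
    "Form": ("538", "a"),
}

# Identifier targets chosen by SCHEME; the scheme spellings are assumed.
IDENTIFIER_TARGETS = {
    "URL": ("856", "u"),
    "LCCN": ("010", "a"),
    "ISBN": ("020", "a"),
    "ISSN": ("022", "a"),
}


def map_element(name, value, scheme=None):
    """Return (tag, subfield, value) for one element, or None if unmapped."""
    if name == "Subject":
        # A SCHEME implies a controlled vocabulary (650); otherwise 653.
        tag = "650" if scheme else "653"
        return (tag, "a", value)
    if name == "Identifier":
        # Assumption: unknown or missing schemes fall back to 024.
        tag, code = IDENTIFIER_TARGETS.get((scheme or "").upper(), ("024", "a"))
        return (tag, code, value)
    if name in SIMPLE_TARGETS:
        tag, code = SIMPLE_TARGETS[name]
        return (tag, code, value)
    return None


def crosswalk(record):
    """Map a list of element tuples, silently skipping unmapped elements."""
    mapped = (map_element(*element) for element in record)
    return [field for field in mapped if field is not None]
```

Called with `[("Subject", "Dogs"), ("Subject", "Computer networks", "LCSH")]`, the sketch tags the first subject 653 and the second 650. It also reproduces the limitation discussed under Subject: a geographic or personal name supplied with a scheme would still be tagged 650 rather than 651 or 600.
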
Additional information about the Dublin Metadata Workshop is available at the following URL: http://www.oclc.org:5046/conferences/metadata. See Attachment A for examples of metadata elements for some Internet resources.

------------------------------------------------------------------
ATTACHMENT A

Example 1: Resource description for electronic versions of a print OCLC Research Report

Element Name        USMARC Field   Content

subject             650            scheme=LCSH
                                   Internet (Computer network)
                                   Cataloging of computer files
                                   Information networks
                                   Computer networks
                                   Libraries--Communication systems
                                   Information storage and retrieval systems
title               245$a          Assessing Information on the Internet:
                                   Toward Providing Library Services for
                                   Computer Mediated Communication
Responsible agent   700?           role=author Martin Dillon
                                   role=author Erik Jul
                                   role=author Mark Burge
                                   role=author Carol Hickey

     Note that this is a deviation from the core set, in that it does not distinguish between author and OtherAgent. This issue is under discussion.

Publisher           260$b          OCLC
Date                260$c          1994
Identifier          856            Scheme=URL
                                   http://ftp.rsch.oclc.org/pub/internet_resources_project/report/internet.ps
Object type:        Leader/06=a    Scheme=USMARC
                                   Language material
Form:               538            7 postscript files
                                   1 Unix tar file

     Note: Field 256 could also be used for this data.

Language:           041            Scheme=USMARC
                                   English
Source:             786
                    ?              Subject: same as above
                    $a             Responsible agent: same as above
                    $t             Title: same as above
                    $d             Date: 1993
                    ?              Object type: same as above
                    $h             Form: Scheme=AACR2; 1 v. (various pagings) : ill. ; 29 cm.
                    $d             Publisher: same as above
                    $b             Edition: NA

     Note that subfield $h was defined with Proposal No. 94-17A.

Other information known about this resource not included in the Dublin Metadata record:
     URL, names, and file sizes of 7 postscript files
     URL, name, and file size of tar file

Example 2: Resource description for LC MARVEL

Element Name        USMARC Field   Content

subject             650            scheme=LCSH
                                   Library of Congress
                                   Library of Congress--Catalogs

     (This points out a mapping problem, since Library of Congress should be coded as 610.)

title               245            LC MARVEL: machine-assisted realization of
                                   the virtual electronic library
Responsible agent   700?           role=database provider?
                                   Library of Congress
Publisher           260$b          Library of Congress
Date                260$c          19??
Identifier          856            scheme=URL
                                   gopher://marvel.loc.gov
                                   telnet://marvel.loc.gov
Object type:        Leader/06      online database

     This would map to code m for Computer file. "Online system or service" is available in 008/26.

Form:               538            gopher server? telnet?
Relation:                          NA
Language:           041            English
Coverage:           045            ??
Source:                            NA (This is an original work.)

Other information known about this resource not included in the Dublin Metadata record:
     Contact for assistance: LC MARVEL Design Team, Library of Congress, Washington, DC 20541; email:[email protected]

Example 3: Resource description for TEI tagged electronic text of The Haunted Hotel by Wilkie Collins

Element Name        USMARC Field   Content

subject             650            could apply a genre term here
title               245            The haunted hotel: a mystery of modern Venice
Responsible agent   700? 100?      role=author
                                   Wilkie Collins
                    700? 710?      role=creator of electronic text
                                   University of Virginia Library
                                   Electronic Text Center
Publisher           260$b          University of Virginia Library
Date                260$c          1993
Identifier          856            Scheme=URL
                                   ftp:\\etext.lib.virginia.edu\...
Object type:        Leader/06      Language material
                    or 008/26=d    electronic text?
Form:               538            1 ascii file with minimal TEI tagging
Relation:           776$w          [LCCN of original]
                                   Type=additional physical form
                                   Scheme=LCCN

     This is a machine readable version of the item specified in source.

Language:           041            English
Coverage:                          NA
Source:             786
                    ?              Subject: same as above
                    $a             Responsible agent (role=author): same as above
                    $t             Title: same as above
                    $d             Date: 1992
                    ?              Object type: same as above
                    $h             Form: Scheme=AACR2; 327 p. ; 29 cm.
                    $d             Publisher: Dover Publications
                    $b             Edition: NA

Other information known about this resource not included in the Dublin Metadata record:
     Copies of the file are available to UVA faculty, staff, and students.
     Other details about the editorial principles and practices applied during the encoding of the text.