The Library of Congress >> Especially for Librarians and Archivists >> Standards

MARC Standards


MARC PROPOSAL NO. 2012-03

DATE: May 21, 2012
REVISED:

NAME: Data Provenance in the MARC 21 Bibliographic Format

SOURCE: Deutsche Nationalbibliothek, Library of Congress, and OCLC

SUMMARY: This proposal addresses two approaches for documenting data provenance in the MARC 21 Bibliographic Format.

KEYWORDS: Field 082 (BD); Field 083 (BD); Field 084 (BD); Field 883 (BD); Data provenance; Machine-generated data

RELATED:

STATUS/COMMENTS:
05/21/12 - Made available to the MARC community for discussion.

06/23/12 - Results of MARC Advisory Committee discussion: Option 2 was approved with the following amendments: Change the definition of $u to include only a URI. Use $a for the name or some description of the process (formerly in $u). Add also to the Authority and Classification formats. Change the subfield code $1 to $c.

07/25/12 - Results of LC/LAC/BL review - Agreed with the MARBI decision.


Proposal No. 2012-03: Data provenance in the MARC 21 Bibliographic Format

1. BACKGROUND

Several use cases have emerged that require storage of information about provenance of classification data in MARC bibliographic records. Over the last several months, colleagues at Deutsche Nationalbibliothek, the Library of Congress, and OCLC have been meeting regularly to explore ways of indicating machine generation of classification metadata and identifying the underlying processes.

Two approaches are proposed: 1) a proposal that addresses the immediate need of documenting information about machine generation of classification data in 082, 083, and 084 fields by defining additional subfields; 2) a second approach that suggests a possible way of dealing with these issues in a more general and encompassing manner with the introduction of a new data provenance field. In both options, the intention is to describe provenance of data that are fully machine-generated, or generated by some named process other than intellectual assignment.

2. DISCUSSION

2.1 Option 1: Provenance of Machine-Generated Classification Data

On the most basic level, we want to document (1) whether a classification number was machine-generated, (2) the generating process or activity, (3) the agent responsible for the process, and (4) a basic confidence measure. The responsible agency is already addressed by subfield $q Assigning agency in the 082, 083, and 084 fields.

$i  - Method of assignment designator
Designates whether the classification number contained in the field was fully machine-generated or generated by some other process (other than direct human assignment). The following codes are used: m (fully machine-generated) and x (not fully machine-generated).

$u - Process of assignment
Describes the process designated in subfield $i used to produce the classification number contained in the field.  The subfield may contain a URI, a process name, or some other description.

$1 - Confidence value
Describes the confidence of the assigning agency with regard to the classification number generated by the process described in subfield $u. The subfield contains a floating point value between 0 and 1. Either a comma or a point may be used as a decimal marker.
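
The confidence value rules (a number between 0 and 1, with either a comma or a point as decimal marker) can be illustrated with a minimal Python sketch. The function name parse_confidence is hypothetical and is not part of MARC or of any library; it only shows one way a consuming system might normalize the value.

```python
def parse_confidence(raw: str) -> float:
    """Parse a confidence value written with ',' or '.' as decimal marker.

    Hypothetical helper for illustration; raises ValueError when the
    text is not a number or lies outside the range 0 to 1.
    """
    # Normalize the decimal comma to a point before converting.
    value = float(raw.strip().replace(",", "."))
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"confidence must be between 0 and 1, got {raw!r}")
    return value


if __name__ == "__main__":
    for text in ("0.5", "0,75", "1"):
        print(text, "->", parse_confidence(text))
```

Normalizing to a single numeric type at read time means that "0,75" and "0.75" compare as equal downstream, whichever marker the assigning agency used.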

These subfields are assumed to depend on one another in a “cascade”: subfield $1 may be used only if subfield $u is present, and subfield $u may be used only if subfield $i is present. In addition, if subfield $i is used and contains the code x, subfield $u should be present.

While the code x also includes named processes that do not involve direct machine assistance (e.g., assignment guided by information found in Classify; copying of a number from a parallel record by a cataloger), it should not be used to indicate direct intellectual assignment of a class number. Intellectual assignment is regarded as the standard process that does not require additional provenance information through subfields $i, $u, or $1. This implies that existing records for which the assignment method is either fully intellectual or unknown need not be changed.
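
The dependency rules for subfields $i, $u, and $1 described above can be sketched as a small check. This is an illustrative Python sketch only (Python 3.9 or later): the function name check_cascade, the dictionary representation of a field, and the wording of the messages are assumptions and not part of the proposal.

```python
def check_cascade(subfields: dict[str, str]) -> list[str]:
    """Return a list of problems with the $i/$u/$1 dependency cascade.

    `subfields` maps subfield codes to values for a single 082, 083,
    or 084 field (a simplified, hypothetical representation that
    ignores repeated subfields).
    """
    problems = []
    if "1" in subfields and "u" not in subfields:
        problems.append("$1 requires $u")
    if "u" in subfields and "i" not in subfields:
        problems.append("$u requires $i")
    if "i" in subfields:
        if subfields["i"] not in ("m", "x"):
            problems.append("$i must be m or x")
        elif subfields["i"] == "x" and "u" not in subfields:
            # The proposal says "should", so treat this as a warning.
            problems.append("$i x should be accompanied by $u")
    return problems


if __name__ == "__main__":
    print(check_cascade({"a": "394.12", "2": "22", "i": "m"}))
    print(check_cascade({"a": "004", "2": "22", "1": "0.5"}))
```

A field without any of the three subfields passes the check, which matches the intention that intellectually assigned numbers need no provenance subfields.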

Examples

  1. DDC 23 number assigned by LC using AutoDewey*. The AutoDewey process involves machine assistance followed by intellectual review:

    082 00 $a829/.3$223$ix$uautodewey$11

    *MARC style conventions call for lower case to identify the process in subfield $u.

  2. DDC 22 number assigned by OCLC in a fully automated way using information in Classify. OCLC has assigned a 0.5 confidence value to the process:

    082 04 $a394.12$222$im$uclassify$10.5$qOCoLC-D

  3. SAB class (MARC code “kssb”) assigned by LIBRIS using the SAB-DDC Conversion table:

    084 ## $aPud$2kssb$im$uhttp://export.libris.kb.se/DS/default.asp?lim=0&Text=006.3&Typ=Dewey$10.9$qSE-LIBR
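
To make the examples above easier to read, the following Python sketch splits a field written in this display notation into its tag, indicators, and subfields. It assumes the notation used in this document (tag, space, two indicator characters, space, then subfields introduced by "$") and that no subfield value contains a literal "$". The name parse_field is hypothetical, and the sketch does not handle the binary MARC exchange format.

```python
def parse_field(line: str):
    """Split a display-form field into (tag, indicators, subfields).

    Subfields are returned as a list of (code, value) pairs so that
    repeated subfields and their order are preserved.
    """
    tag, indicators, data = line.strip().split(" ", 2)
    subfields = [
        (chunk[0], chunk[1:])      # first character is the subfield code
        for chunk in data.split("$")
        if chunk                   # skip the empty piece before the first "$"
    ]
    return tag, indicators, subfields


if __name__ == "__main__":
    tag, indicators, subfields = parse_field(
        "082 04 $a394.12$222$im$uclassify$10.5$qOCoLC-D"
    )
    print(tag, indicators)
    for code, value in subfields:
        print(f"  ${code} {value}")
```

Read this way, the second example yields $i m, $u classify, and $1 0.5, that is, a fully machine-generated number with a confidence value of 0.5.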

2.2 Option 2: New MARC Field for Metadata Provenance

A possible second approach to metadata provenance would be the creation of a new field that would be repeatable and linkable to other fields via subfield $8. This new field could then be used to explicitly specify the provenance of metadata recorded in other fields. With this approach it would be considerably easier to (1) distinguish between provenance of information recorded and provenance of the recorded data itself, (2) align the way data provenance is documented with emerging standards from other communities, and (3) provide a full account of data provenance without having to deal with the heavy use of subfields in certain fields like 6XX*.

* Because subfield $8 is repeatable, multiple instances might already be present in the record. However, the link to field 883 will be unambiguous by using (1) a different linking number and (2) the proposed new field link type code p for a subfield $8 that links a field with provenance information of its data.

883 - Data provenance (R)

First Indicator
Method of assignment
# - No information provided
0 - Fully machine-generated
1 - Not fully machine-generated

Second Indicator
Undefined
# - Undefined

$d - Date on which the linked field was generated
The date on which the linked field was generated. This also serves as the beginning of the period of validity. Date is recorded in the format yyyymmdd in accordance with ISO 8601, Representation of Dates and Times.

$u - Process used to generate linked field
Describes the process used to produce the data contained in the field to which the field is linked. The subfield may contain a URI, a process name, or some other description.

$q - Agency using the process/activity to generate the linked field
MARC organization code of the institution using the process/activity to generate the linked field. Code from: MARC Code List for Organizations.

$x - Ending date of validity
Date representing expected end of period of validity. Date is recorded in the format yyyymmdd in accordance with ISO 8601, Representation of Dates and Times.

$0 - Authority record control number or standard number
Subfield $0 contains the system control number of the related authority record, or a standard identifier such as an International Standard Name Identifier (ISNI). The control number or identifier is preceded by the appropriate MARC Organization code (for a related authority record) or the Standard Identifier source code (for a standard identifier scheme), enclosed in parentheses.

$1 - Confidence value
Describes the confidence of the agency using the process/activity described in subfield $u to generate the linked field. The subfield contains a floating point value between 0 and 1. Either a comma or a point may be used as a decimal marker.

$8 - Field link and sequence number
Identifies linked fields and may also propose a sequence for the linked fields. Subfield $8 may be repeated to link a field to more than one other group of fields. The structure and syntax for the field link and sequence number subfield is:

$8 [linking number].[sequence number]\[field link type]

If the sequence number is not needed, the syntax is:

$8 [linking number]\[field link type]

Field link type

p - Data provenance
Used in a record to link a field with another field containing information about provenance of the metadata recorded in the linked field.
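
The subfield $8 syntax given above can be illustrated with a short Python sketch that separates the linking number, the optional sequence number, and the field link type. The names FieldLink and parse_field_link are hypothetical, and the assumption that the field link type is a single character is made for this sketch only.

```python
import re
from typing import NamedTuple, Optional

# [linking number], optional .[sequence number], backslash, [field link type]
FIELD_LINK = re.compile(r"(\d+)(?:\.(\d+))?\\(\w)")


class FieldLink(NamedTuple):
    linking_number: int
    sequence_number: Optional[int]
    link_type: str


def parse_field_link(value: str) -> FieldLink:
    """Parse the content of a subfield $8 such as 1\\p or 1.2\\p."""
    match = FIELD_LINK.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid $8 value: {value!r}")
    number, sequence, link_type = match.groups()
    return FieldLink(
        int(number),
        int(sequence) if sequence is not None else None,
        link_type,
    )


if __name__ == "__main__":
    print(parse_field_link(r"1\p"))
    print(parse_field_link(r"2.3\p"))
```

A consumer interested only in provenance would keep the links whose link_type is p and ignore any other subfield $8 already present in the record.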

Examples

  1. DDC 23 number assigned by LC on 7 April 2012 using AutoDewey:

    082 00 $81\p$a829/.3$223
    883 1# $81\p$uautodewey$d20120407$qDLC$11

  2. DDC 22 number added to record by OCLC on 7 April 2012 using information in Classify. OCLC has assigned a 0.5 confidence value to the process:

    082 04 $81\p$a394.12$222$qOCoLC-D
    883 0# $81\p$uclassify$d20120407$qOCoLC-D$10.5

  3. In the following example, the National Library of Norway has used a fictitious process (DeweyClassifierV0.1) to generate the Dewey number 004 in the 082 field using a set of GND / German DDC 22 mappings. The 883 field contains the process name (DeweyClassifierV0.1), the date the linked 082 field was generated, the expected end date of validity (in this case, when the German DDC 22 will be superseded by the German DDC 23), the agency generating the linked field, the confidence value (expressed with a comma instead of a decimal point), and the GND record for the heading “Informatik” that was the source of the mapping data.

    082 04 $81\p$a004$222/ger$qNO-OsNB
    883 0# $81\p$udeweyclassifierv0.1$d20120101$x20141231$qNO-OsNB$10,75$0(DE-101)040268942

    Example of source record indicated by subfield $0 (abbreviated):

    LDR 00774nz  a2200241n  4500
    001 040268942
    003 DE-101
    005 20090425195049.0
    008 880701n||azznnaabn           | ana    |c
    024 7# $ahttp://d-nb.info/gnd/4026894-9$2uri

    083 04 $a004$9d:3$9t:2007-01-01$222/ger
    150 ## $aInformatik
    ...

  4. In the following example, the National Library of Norway has copied the 082 field from the one assigned by DNB to the original German-language version of the work and associated it with the English-language version. The contents of the 082 field have been copied without any changes (including DNB as assigning agency).

    082 04 $81\p$a004$222/ger$qDE-101
    883 0# $81\p$uparallelrecordcopy$d20120101$x20141231$qNO-OsNB

  5. In the following example, a bibliographic record has been enriched with a BISAC subject heading by the OCLC Metadata Service for Publishers. The service derives BISAC headings from Dewey mappings, which are included in the MARC version of the BISAC records; an identifier for the record containing the mapping is again provided by $0. As is indicated through the subfield $8 linkage, both added fields have the same data provenance.

    072 #7 $81\p$aANT$x006000$2bisacsh
    650 #7 $82\p$aANTIQUES & COLLECTIBLES$xBottles$2bisacsh
    883 0# $81\p$82\p$uhttp://publishers.oclc.org/en/metadata/$d20120206$qOCoLC$10.85$0(OCoLC)ANT006000

  6. In the following example, a fully automatic process used information in a VIAF record to add an additional access point (improving retrieval for German information seekers) for a personal name as a subject.

    600 10 $aPeretz, Isaac Leib$d1851 or 2-1915
    600 17 $81\p$0(DE-588)119010046$aPerec, Icchok Leib$d1852-1915$2gnd
    883 0# $81\p$uviafgerman$d20110106$qOCoLC$11$0(OCoLC)viaf27070050
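
As an illustration of how the subfield $8 linkage in these examples might be resolved, the following Python sketch pairs each linked field with the 883 field or fields describing it. It uses the subfield codes as proposed in this document (before the amendments noted under STATUS/COMMENTS). The record is the BISAC example above, transcribed by hand into a simple list structure; the structure and the names linking_numbers and provenance_report are assumptions made for this sketch.

```python
from collections import defaultdict

# First indicator values of field 883 as defined in this proposal.
METHOD = {
    "#": "no information provided",
    "0": "fully machine-generated",
    "1": "not fully machine-generated",
}

# Each field is (tag, indicators, [(subfield code, value), ...]).
record = [
    ("072", "#7", [("8", r"1\p"), ("a", "ANT"), ("x", "006000"),
                   ("2", "bisacsh")]),
    ("650", "#7", [("8", r"2\p"), ("a", "ANTIQUES & COLLECTIBLES"),
                   ("x", "Bottles"), ("2", "bisacsh")]),
    ("883", "0#", [("8", r"1\p"), ("8", r"2\p"),
                   ("u", "http://publishers.oclc.org/en/metadata/"),
                   ("d", "20120206"), ("q", "OCoLC"), ("1", "0.85"),
                   ("0", "(OCoLC)ANT006000")]),
]


def linking_numbers(subfields):
    """Linking numbers of all $8 subfields with field link type p."""
    numbers = []
    for code, value in subfields:
        if code == "8" and value.endswith("\\p"):
            # Drop the trailing "\p" and any ".sequence" part.
            numbers.append(value[:-2].split(".")[0])
    return numbers


def provenance_report(fields):
    """Return (tag, method, process) for every field linked to an 883."""
    provenance = defaultdict(list)
    for tag, indicators, subfields in fields:
        if tag == "883":
            process = dict(subfields).get("u")
            for number in linking_numbers(subfields):
                provenance[number].append((METHOD[indicators[0]], process))
    report = []
    for tag, indicators, subfields in fields:
        if tag != "883":
            for number in linking_numbers(subfields):
                for method, process in provenance[number]:
                    report.append((tag, method, process))
    return report


if __name__ == "__main__":
    for row in provenance_report(record):
        print(row)
```

Because the single 883 field carries two $8 subfields, both the 072 and the 650 field resolve to the same provenance statement.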

3. PROPOSED CHANGES

• Option 1:

In the MARC 21 Bibliographic Format, define subfields $i, $u, and $1 in fields 082, 083, and 084 as described above.

• Option 2:

In the MARC 21 Bibliographic Format, define field 883 and field link type p for subfield $8 as described above.


( 07/25/2012 )