The Library of Congress >> Especially for Librarians and Archivists >> Standards
DATE: May 21, 2012
REVISED:
NAME: Data Provenance in the MARC 21 Bibliographic Format
SOURCE: Deutsche Nationalbibliothek, Library of Congress, and OCLC
SUMMARY: This proposal addresses two approaches for documenting data provenance in the MARC 21 Bibliographic Format.
KEYWORDS: Field 082 (BD); Field 083 (BD); Field 084 (BD); Field 883 (BD); Data provenance; Machine-generated data
RELATED:
STATUS/COMMENTS:
05/21/12 - Made available to the MARC community for discussion.
06/23/12 - Results of MARC Advisory Committee discussion: Option 2 was approved with the following amendments: Change the definition of $u to include only URI. Use $a for the name or some description of the process (formerly in $u). Add also to the Authority and Classification formats. Change the subfield code $1 to $c.
07/25/12 - Results of LC/LAC/BL review - Agreed with the MARBI decision.
Several use cases have emerged that require storage of information about provenance of classification data in MARC bibliographic records. Over the last several months, colleagues at Deutsche Nationalbibliothek, the Library of Congress, and OCLC have been meeting regularly to explore ways of indicating machine generation of classification metadata and identifying the underlying processes.
Two approaches are proposed: 1) an approach that addresses the immediate need to document machine generation of classification data in fields 082, 083, and 084 by defining additional subfields; 2) a second approach that suggests a possible way of dealing with these issues in a more general and encompassing manner by introducing a new data provenance field. In both options, the intention is to describe the provenance of data that are fully machine-generated, or generated by some named process other than intellectual assignment.
On the most basic level, we want to document (1) whether a classification number was machine-generated, (2) the generating process or activity, (3) the agent responsible for the process, and (4) a basic confidence measure. The responsible agency is already addressed by subfield $q Assigning agency in the 082, 083, and 084 fields.
$i - Method of assignment designator
Designates whether the classification number contained in the field was fully machine-generated or generated by some other named process (other than direct intellectual assignment). The following codes are used: m (fully machine-generated) and x (not fully machine-generated).
$u - Process of assignment
Describes the process designated in subfield $i used to produce the classification number contained in the field. The subfield may contain a URI, a process name, or some other description.
$1 - Confidence value
Describes the confidence of the assigning agency with regard to the classification number generated by the process described in subfield $u. The subfield contains a floating point value between 0 and 1. Either a comma or a point may be used as a decimal marker.
A dependency among these subfields, or "cascade," is assumed. For subfield $1 to be used, subfield $u has to be present. For subfield $u to be used, subfield $i has to be present. Also, if subfield $i is used and contains the code x, subfield $u should be present.
While the code x also includes named processes that do not involve direct machine assistance (e.g., assignment guided by information found in Classify; copying of a number from a parallel record by a cataloger), it should not be used to indicate direct intellectual assignment of a class number. Intellectual assignment is regarded as the standard process, which does not require additional provenance information through subfields $i, $u, or $1 (with the implication that existing records for which the assignment method is either fully intellectual or unknown need not be changed).
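The cascade described above can be sketched as a small validation routine. This is only an illustration: the function names and the dictionary representation of a field's subfields are assumptions made for the example, and the subfield codes follow the text of Option 1 as written here, not the amendments recorded in the status notes.

```python
from typing import Dict, List


def parse_confidence(raw: str) -> float:
    """Parse a $1 confidence value; a comma or a point may be the decimal marker."""
    value = float(raw.replace(",", "."))
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"confidence {raw!r} is outside the range 0 to 1")
    return value


def check_cascade(subfields: Dict[str, str]) -> List[str]:
    """Return the problems found; an empty list means the cascade holds.

    `subfields` maps a subfield code to its value (hypothetical representation).
    """
    problems = []
    method = subfields.get("i")
    process = subfields.get("u")
    confidence = subfields.get("1")

    if method is not None and method not in ("m", "x"):
        problems.append(f"$i must be 'm' or 'x', got {method!r}")
    if process is not None and method is None:
        problems.append("$u requires $i")
    if confidence is not None and process is None:
        problems.append("$1 requires $u")
    if method == "x" and process is None:
        problems.append("$i with code x should be accompanied by $u")
    if confidence is not None:
        try:
            parse_confidence(confidence)
        except ValueError as exc:
            problems.append(f"$1 is not a valid confidence value: {exc}")
    return problems
```

A field with none of the three subfields passes the check, which matches the statement that intellectually assigned numbers need no additional provenance information.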
Examples
*MARC style conventions call for lower case to identify the process in subfield $u.
082 00 $a829/.3$223$ix$uautodewey$11
082 04 $a394.12$222$im$uclassify$10.5$qOCoLC-D
084 ## $aPud$2kssb$im$uhttp://export.libris.kb.se/DS/default.asp?lim=0&Text=006.3&Typ=Dewey$10.9$qSE-LIBR
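To make the examples easier to read, the following sketch splits a field written in the display convention used in this document (tag, indicators, then subfields introduced by "$" and a one-character code) into its parts. The `Field` type and `parse_display_field` function are hypothetical names for this illustration; the sketch handles only this printed notation, not the MARC 21 exchange format, and it assumes that no subfield value contains a literal "$".

```python
import re
from typing import List, NamedTuple, Tuple


class Field(NamedTuple):
    tag: str
    indicators: str
    subfields: List[Tuple[str, str]]  # (code, value) pairs in original order


def parse_display_field(line: str) -> Field:
    """Split a line such as '082 04 $a394.12$222...' into tag, indicators, subfields."""
    tag, indicators, data = line.split(" ", 2)
    # Each subfield is "$", a one-character code, then everything up to the next "$".
    subfields = re.findall(r"\$(.)([^$]*)", data)
    return Field(tag, indicators, subfields)


field = parse_display_field("082 04 $a394.12$222$im$uclassify$10.5$qOCoLC-D")
for code, value in field.subfields:
    print(f"${code} = {value}")
```

Read this way, "$222" is subfield $2 with the value 22, and "$10.5" is subfield $1 with the value 0.5.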
A possible second approach to metadata provenance would be the creation of a new field that would be repeatable and linkable to other fields via subfield $8. This new field could then be used to explicitly specify the provenance of metadata recorded in other fields. With this approach it would be considerably easier to (1) distinguish between provenance of information recorded and provenance of the recorded data itself, (2) align the way data provenance is documented with emerging standards from other communities, (3) provide a full account of data provenance without having to deal with the heavy use of subfields in certain fields like 6XX*.
* Because subfield $8 is repeatable, multiple instances might already be present in the record. However, the link to field 883 will be unambiguous by using (1) a different linking number and (2) the proposed new field link type code p for a subfield $8 that links a field with provenance information of its data.
883 - Data provenance (R)
First Indicator
Method of assignment
# - No information provided
0 - Fully machine-generated
1 - Not fully machine-generated
Second Indicator
Undefined
# - Undefined
$d - Date on which the linked field was generated
The date on which the linked field was generated. This also serves as the beginning of the period of validity. Date is recorded in the format yyyymmdd in accordance with ISO 8601, Representation of Dates and Times.
$u - Process used to generate linked field
Describes the process used to produce the data contained in the field to which this field is linked. The subfield may contain a URI, a process name, or some other description.
$q - Agency using the process/activity to generate the linked field
MARC organization code of the institution using the process/activity to generate the linked field. Code from: MARC Code List for Organizations.
$x - Ending date of validity
Date representing expected end of period of validity. Date is recorded in the format yyyymmdd in accordance with ISO 8601, Representation of Dates and Times.
$0 - Authority record control number or standard number
Subfield $0 contains the system control number of the related authority record, or a standard identifier such as an International Standard Name Identifier (ISNI). The control number or identifier is preceded by the appropriate MARC Organization code (for a related authority record) or the Standard Identifier source code (for a standard identifier scheme), enclosed in parentheses.
$1 - Confidence value
Describes the confidence of the agency using the process/activity described in subfield $u to generate the linked field. The subfield contains a floating point value between 0 and 1. Either a comma or a point may be used as a decimal marker.
$8 - Field link and sequence number
Identifies linked fields and may also propose a sequence for the linked fields. Subfield $8 may be repeated to link a field to more than one other group of fields. The structure and syntax for the field link and sequence number subfield are:
$8 [linking number].[sequence number]\[field link type]
If the sequence number is not needed, the syntax is:
$8 [linking number]\[field link type]
Field link type
p - Data provenance
Used in a record to link a field with another field containing information about provenance of the metadata recorded in the linked field.
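The two forms of the subfield $8 syntax given above can be recognized with a single pattern. The sketch below is illustrative only: `FieldLink` and `parse_field_link` are hypothetical names, and the pattern assumes that the linking and sequence numbers are written as digits and that the field link type is a single lowercase letter that is always present, as in both forms shown.

```python
import re
from typing import NamedTuple, Optional

# [linking number], optional .[sequence number], then \[field link type]
LINK_PATTERN = re.compile(r"^(\d+)(?:\.(\d+))?\\([a-z])$")


class FieldLink(NamedTuple):
    linking_number: int
    sequence_number: Optional[int]
    link_type: str


def parse_field_link(value: str) -> FieldLink:
    """Parse the content of a subfield $8, e.g. r'1\\p' or r'1.2\\p'."""
    match = LINK_PATTERN.match(value)
    if match is None:
        raise ValueError(f"not a field link and sequence number: {value!r}")
    number, sequence, link_type = match.groups()
    return FieldLink(
        int(number),
        int(sequence) if sequence is not None else None,
        link_type,
    )


link = parse_field_link(r"1\p")
print(link.linking_number, link.sequence_number, link.link_type)
```

A link whose `link_type` is "p" marks the field as one side of a data provenance link.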
Examples
082 00 $81\p$a829/.3$223
883 1# $81\p$uautodewey$d20120407$qDLC$11
082 04 $81\p$a394.12$222$qOCoLC-D
883 0# $81\p$uclassify$d20120407$qOCoLC-D$10.5
082 04 $81\p$a004$222/ger$qNO-OsNB
883 0# $81\p$udeweyclassifierv0.1$d20120101$x20141231$qNO-OsNB$10,75$0(DE-101)040268942
Example of source record indicated by subfield $0 (abbreviated):
LDR 00774nz a2200241n 4500
001 040268942
003 DE-101
005 20090425195049.0
008 880701n||azznnaabn | ana |c
024 7# $ahttp://d-nb.info/gnd/4026894-9$2uri
…
083 04 $a004$9d:3$9t:2007-01-01$222/ger
150 ## $aInformatik
...
082 04 $81\p$a004$222/ger$qDE-101
883 0# $81\p$uparallelrecordcopy$d20120101$x20141231$qNO-OsNB
072 #7 $81\p$aANT$x006000$2bisacsh
650 #7 $82\p$aANTIQUES & COLLECTIBLES$xBottles$2bisacsh
883 0# $81\p$82\p$uhttp://publishers.oclc.org/en/metadata/$d20120206$qOCoLC$10.85$0(OCoLC)ANT006000
600 10 $aPeretz, Isaac Leib$d1851 or 2-1915
600 17 $81\p$0(DE-588)119010046$aPerec, Icchok Leib$d1852-1915$2gnd
883 0# $81\p$uviafgerman$d20110106$qOCoLC$11$0(OCoLC)viaf27070050
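As the BISAC example shows, one 883 field can describe several fields by repeating subfield $8. The sketch below shows one way a consumer of such records might pair 883 fields with the fields they describe and test the period of validity given by $d and $x. The tuple representation of a field and all function names are assumptions made for the illustration; the sketch also assumes that the list holds the fields of a single record and that the ending date in $x is inclusive.

```python
from datetime import date, datetime
from typing import Dict, List, Optional, Set, Tuple

Subfields = List[Tuple[str, str]]
Field = Tuple[str, str, Subfields]  # (tag, indicators, subfields)


def provenance_links(subfields: Subfields) -> Set[str]:
    """Collect linking numbers from $8 values whose field link type is p."""
    numbers = set()
    for code, value in subfields:
        if code == "8" and value.endswith("\\p"):
            # Drop the link type, then any sequence number after the point.
            numbers.add(value[:-2].split(".")[0])
    return numbers


def attach_provenance(fields: List[Field]) -> Dict[int, List[Field]]:
    """Map the position of each described field to the 883 fields linked to it."""
    provenance = [field for field in fields if field[0] == "883"]
    linked: Dict[int, List[Field]] = {}
    for position, (tag, _indicators, subfields) in enumerate(fields):
        if tag == "883":
            continue
        numbers = provenance_links(subfields)
        matches = [p for p in provenance if numbers & provenance_links(p[2])]
        if matches:
            linked[position] = matches
    return linked


def parse_yyyymmdd(value: Optional[str]) -> Optional[date]:
    return datetime.strptime(value, "%Y%m%d").date() if value else None


def is_valid_on(field_883: Field, day: date) -> bool:
    """Check `day` against $d (start) and $x (end, treated as inclusive)."""
    values = dict(field_883[2])
    start = parse_yyyymmdd(values.get("d"))
    end = parse_yyyymmdd(values.get("x"))
    if start is not None and day < start:
        return False
    if end is not None and day > end:
        return False
    return True


record = [
    ("072", "#7", [("8", r"1\p"), ("a", "ANT"), ("x", "006000"), ("2", "bisacsh")]),
    ("650", "#7", [("8", r"2\p"), ("a", "ANTIQUES & COLLECTIBLES"),
                   ("x", "Bottles"), ("2", "bisacsh")]),
    ("883", "0#", [("8", r"1\p"), ("8", r"2\p"),
                   ("u", "http://publishers.oclc.org/en/metadata/"),
                   ("d", "20120206"), ("q", "OCoLC"), ("1", "0.85"),
                   ("0", "(OCoLC)ANT006000")]),
]
links = attach_provenance(record)
```

Here `links` pairs both the 072 and the 650 field with the same 883 field, because the 883 field carries both linking numbers.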
• Option 1:
In the MARC 21 Bibliographic Format, define subfields $i, $u, and $1 in fields 082, 083, and 084 as described above.
• Option 2:
In the MARC 21 Bibliographic Format, define field 883 and field link type p for subfield $8 as described above.
The Library of Congress >> Especially for Librarians and Archivists >> Standards ( 07/25/2012 ) |