MARC Standards


MARC DISCUSSION PAPER NO. 2021-DP10

DATE: May 27, 2021
REVISED:

NAME: Recording Data Provenance in the MARC 21 Formats

SOURCE: MARC/RDA Working Group

SUMMARY: This paper discusses the potential for encoding data provenance in the MARC 21 Formats.

KEYWORDS: Data Provenance (All formats); Metadata Statement (All formats); Metadata Description Set (All formats); RDA Toolkit Restructure and Redesign Project; RDA

RELATED: 2021-DP06

STATUS/COMMENTS:
05/27/21 – Made available to the MARC community for discussion.

06/29/21 – Results of MARC Advisory Committee discussion: At the outset, MAC was generally unfavorable toward several of the options for recording data provenance information explored in the paper. However, various voices from the RDA development constituency made three points: that this data would be optional in RDA, up to individual communities and agencies to deploy or not; that such communities and agencies would be unnecessarily constrained in their use if MARC did not support it; and that this data would likely need to be round-tripped through MARC even if such communities and agencies did not use MARC natively. It was therefore decided that work on a solution should move forward. Options for expanding the use of field 883 and deploying non-standard subfield coding were unpopular and will not be taken forward at this time. Instead, a follow-up proposal or discussion paper from the MARC/RDA Working Group will set out the case for developing the hitherto underused subfield $7 to accommodate data provenance, with other subfields being used where $7 is no longer available (hereafter referred to as $7+). This approach will advocate the usage of coded values in combination with $7+ in order to designate the category of data provenance information and the part of the field content to which a data provenance value relates.


Discussion Paper No. 2021-DP10: Recording Data Provenance in the MARC Formats

1. BACKGROUND

The new RDA Toolkit glossary defines data provenance as: "Information about the metadata recorded in an element or set of elements. Metadata about metadata, or metametadata." The Toolkit guidance chapter on data provenance explains that: "This information can be used to infer the context and quality of the metadata". Data provenance information, as conceived by RDA, can already be recorded in MARC 21 at the record, field and subfield level using a variety of elements. However, coverage at the field and subfield level is often sparse and uneven. This paper sets out the parameters of data provenance in RDA and its current coverage in MARC 21. It goes on to make a case for expanding MARC 21's accommodation of data provenance in order to better support established and emerging applications. Finally, it puts forward several alternatives for how MARC 21 might be adapted as regards its coverage of data provenance. These are based upon some of the options which were set out in the previous MARC discussion paper 2021-DP06. Other options from 2021-DP06 which met with less favor in the community at the 2021 Midwinter MAC meetings are not being developed further at this time. RDA does not make the recording of data provenance information a required aspect of resource description.

2. DISCUSSION

2.1. RDA and Data Provenance

Data provenance information can be recorded in various ways using the new RDA Toolkit. It offers a number of elements, levels of granularity and recording methods for doing so.

2.1.1. RDA and data provenance elements

RDA supports the recording of data provenance information using a range of different elements. Some elements can only be used for the purpose of recording data provenance information, while others can be used for this and other purposes. The following elements can be used exclusively to record data provenance information and may, collectively, be referred to as "meta-elements":

The following elements can be used exclusively to record data provenance information for Nomen entity-related elements:

The following elements can be used to record data provenance information in addition to other aspects of resource description. In each case the circumstance under which an element may be used for recording data provenance is given in parentheses:

These 21 elements or a selection of them could be accommodated at a greater level of granularity in MARC 21.

2.1.2. RDA and data provenance granularity

Data provenance information can be recorded using different levels of granularity in RDA. It does this by specifying two subcategories of metadata work: metadata statements and metadata description sets.

A metadata statement is defined in the glossary as:

"A piece of metadata that assigns a value to an RDA element that describes an individual instance of an RDA entity."

In a MARC 21 context, examples of metadata statements include the values recorded in individual subfields.

A metadata description set is defined in the glossary as:

"One or more metadata statements that describe and relate individual instances of one or more RDA entities."

In a MARC 21 context, examples of metadata description sets include the values recorded collectively in records and in the fields belonging to those records which are composed of more than one subfield.

2.1.3. RDA and data provenance recording methods

Besides providing a range of elements and degrees of specificity for recording data provenance information, RDA also offers several methods with which to record it. These are listed below:

2.2. RDA Data Provenance Information in MARC 21

At the present time, MARC 21 can be used to record all of the RDA elements previously mentioned, using varying degrees of granularity and selective recording methods. Some examples are given below. In each case, the RDA element label, granularity and recording method of the data provenance information is provided for context.

Example 1

005 20190129073611.0

Element : date of publication (recording a timespan when metadata are published)
Granularity : metadata description set (any MARC 21 format)
Recording method : structured

Example 2

040 ## $d DLC

Element : author agent (recording an agent who records metadata)
Granularity : metadata description set (any MARC 21 format)
Recording method : identifier

Example 3

LDR / 18 a

Element : source consulted (recording a content standard used for metadata)
Granularity : metadata description set (MARC 21 bibliographic format)
Recording method : identifier

Example 4

667 ## $a Machine-derived non-Latin script reference project.

Element : context of use
Granularity : metadata description set (MARC 21 authority format)
Recording method : unstructured

Example 5

382 0# $a clarinet $n 1 $a piano $n 1 $s 2 $2 lcmpt

Element : source consulted (recording a content standard used for metadata)
Granularity : metadata description set (MARC 21 bibliographic and authority formats)
Recording method : identifier

Example 6

338 ## $a volume $2 rdact

Element : source consulted (recording a content standard used for metadata)
Granularity : metadata statement (MARC 21 bibliographic and holdings formats)
Recording method : identifier

2.2.1. Benefits of recording data provenance in MARC 21

Recording data provenance information in a resource description context is beneficial from the perspectives of managing the metadata creation process and supporting end users. It serves library staff engaged in collection-related cataloging activities as well as library patrons whose goal it is to access holdings. For example, a cataloger may choose to reuse one shared record in preference to another based on the organization which created it. Equally, a researcher may choose to order one holding in preference to another based on identifying features such as the source of transcribed information on a manifestation. Apart from these more traditional functions, data provenance information also supports the development of emerging products and services which are based on the selective transformation of cataloging metadata into non-MARC formats. For example, a dataset may be generated from a library's catalog using individual field or subfield contents rather than whole MARC records. If data provenance information can be included in such a dataset, then this may lead to a better understanding of what it contains.

2.2.2. Recording data provenance and granularity in MARC 21

More value can be derived from data provenance information when it is recorded to a greater rather than lesser degree of granularity. Specificity reduces the need for interpretation on the part of a human user. It also increases the machine actionability of data provenance information. Conversely, recording data provenance information at the record or field rather than at the subfield level can result in its being of limited value. This has a bearing on the utility of data both within and outside the library catalog. For instance, the following example of an 040 field shows a case in which multiple organizations have contributed to the same MARC 21 record in the NACO authority file over time:

040 ## $a DLC $b eng $e rda $c DLC $d DLC $d MH $d OCoLC $d NjP $d IU $d WU $d DLC $d OCoLC $d WU $d DLC $d DLC-OK $d Uk $d SG-SiILA $d CU-HE $d Uk $d DLC $d InU $d OCoLC $d CSt $d OCoLC $d CSt $d UPB $d DFo $d WU $d DFo $d DLC $d MdRoLAC $d CaOONL $d DeU

Under these circumstances, the last organization sequenced in the 040 subfield $d string may be regarded as the latest author agent. Equally, it may be understood that the author agent subscribes to a shared set of policies for creating metadata. However, without subfield level linkage, it may prove difficult to establish what exactly has been contributed by the latest author agent as opposed to other author agents listed in the same 040 string. If shared policy has been applied incorrectly over time or is open to question, then feedback may be harder to deliver and clarification may be harder to seek.
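The reading of the 040 string described above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the display convention used in this paper ("$" followed by a single lowercase letter or digit, with no "$" inside subfield values), the function names are hypothetical, and the sample field is an abridged form of the example above.

```python
import re

def parse_subfields(field_text):
    """Split a display-form field such as '040 ## $a DLC $d MH' into (code, value) pairs.

    Assumption: '$' plus one lowercase letter or digit introduces each subfield,
    and '$' never occurs inside a subfield value.
    """
    return [(m.group(1), m.group(2).strip())
            for m in re.finditer(r"\$([a-z0-9])\s*([^$]*)", field_text)]

def modifying_agencies(field_text):
    """Return the values of every subfield $d in the order in which they occur."""
    return [value for code, value in parse_subfields(field_text) if code == "d"]

# Abridged form of the 040 example above (illustrative only).
field_040 = "040 ## $a DLC $b eng $e rda $c DLC $d DLC $d MH $d OCoLC $d DeU"

agencies = modifying_agencies(field_040)
latest_author_agent = agencies[-1]  # last agency in the $d sequence
```

The sketch identifies the latest author agent, but nothing in the field itself indicates which subfields elsewhere in the record that agent added or changed, which is the limitation discussed above.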

To take another example, a data set may be composed of field values derived from bibliographic records in a catalog which has been created using different content standards. The MARC 21 records which were mined to produce this data set may carry either of the following values:

LDR / 18 a

OR

040 ## $e rda

Equally, if some records were created using AACR and later enhanced using RDA, then they may contain the following values in combination:

LDR / 18 a

AND

040 ## $e rda

Where such different sources consulted are reflected in the same data set, then the values recorded in selected subfields are likely to show a significant degree of variation: Latinizations, abbreviations and corrections will occur in some, but not in others; general material designations will occur in some, but not in others, etc. Without the contextualization provided by a source consulted at the field level, any discrepancies in cataloging policy may appear random and incomprehensible to the researcher or linked data practitioner. Equally, the detection and resolution of these discrepancies by a program / computer script may prove all the more challenging.
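As a rough illustration of how a script might flag such records before mining them, the following Python sketch combines the two record-level signals mentioned above. It assumes that only value "a" in LDR/18 is of interest (as in the examples in this paper), treats any 040 $e code as a content standard, and uses hypothetical function names and sample leaders.

```python
def content_standards(leader, field_040_e):
    """Return the set of content standards signalled at record level.

    leader      -- the 24-character MARC 21 leader as a string
    field_040_e -- list of values found in 040 subfield $e (may be empty)
    """
    standards = set()
    # LDR/18 value "a" is taken here to signal AACR 2, as in the examples above.
    if len(leader) > 18 and leader[18] == "a":
        standards.add("aacr2")
    standards.update(code.strip().lower() for code in field_040_e)
    return standards

def classify(standards):
    """Summarize a record as 'mixed', a single standard, or 'unknown'."""
    if len(standards) > 1:
        return "mixed"
    if standards:
        return next(iter(standards))
    return "unknown"

# Hypothetical sample leaders; only position 18 matters for this sketch.
leader_aacr2 = "00000nam a2200000 a 4500"
leader_other = "00000nam a2200000 i 4500"

aacr2_only = classify(content_standards(leader_aacr2, []))
rda_only = classify(content_standards(leader_other, ["rda"]))
hybrid = classify(content_standards(leader_aacr2, ["rda"]))
```

A record classified as "mixed" still gives no indication of which fields follow which standard; that would require a source consulted at the field level.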

The MARC 21 formats do provide some coverage of data provenance information at the field and subfield as well as record level. In the case of an author agent, this may be located in subfield $5. The following provides one such example in an authority record:

680 ## $i May be combined with geographic name in the form $a Baroque sculpture-Germany. $5 CaQMCCA

However, the presence of $5 is piecemeal across the formats.

2.2.3. Recording data provenance and efficiency in MARC 21

It may be considered that expanding the coverage of data provenance information in the MARC formats will result in a more labor intensive process of resource description. However, this could be minimized at the start of the cataloging process by using templates which prepopulate data provenance information in authority, bibliographic and other format records. In cases where records are derived from an external source and enhanced locally, then data provenance information could be added using macros which enter standardized strings of metadata. In the current cataloging environment, these processes may already be used to populate data provenance information at the record level and, selectively, at the field level. Standardized LDR values and values contextualized by a source code in subfield $2 are two current examples of this approach.

2.2.4. Recording data provenance and interoperability in MARC 21

It may also be considered that the full range of data provenance elements and recording methods available for recording them are surplus to any single, local need. However, it should be noted that the recording of data provenance information is not a requirement in RDA. Rather, RDA sets out the practical benefits of recording data provenance information and a variety of options for doing so. One cataloging agency may choose to record a specific type or types of data provenance information, while another chooses to record others. The key requirement is that the data which they share remains interoperable. In order to avoid conflicts or duplicated effort in the recording of additional data provenance information, cataloging cooperatives may need to agree upon a set of best practices for doing so.

2.3. Options for Expanded Data Provenance Coverage in MARC 21

There are several ways in which the existing coverage of data provenance information in MARC 21 could be expanded from its present state. The preceding discussion paper 2021-DP06 put forward various options for doing so. Some of these met with more approval than others from the MARC 21 community. A series of straw polls conducted at the MAC Midwinter meetings in 2021 indicated that the following approaches should not be taken forward at this time: defining a new MARC format or formats with which to record data provenance information; noting the recording method which has been used to record data provenance information in MARC; making distinctions between vocabulary and string encoding schemes in MARC. Further straw polls indicated that, of the remaining options set out, a further development of field 883 or the definition of a new subfield across the formats may be acceptable ways to better accommodate data provenance information in MARC. These options are further developed in the analysis which follows. In each case, the option considers coding data provenance information at the field and subfield level. In the case of options 2 and 3, subsequent sections consider how the subcategories of data provenance could be coded.

2.3.1. Deployment of Additional Subfields in Field 883

Field 883 is already defined to carry data provenance information in all of the MARC 21 formats. It is currently defined as follows:

"Used to provide information about the provenance of metadata in data fields in the record. Field 883 contains a link to the field to which it pertains."

Field 883 contains the following subfields:

$a - Creation process
$c - Confidence value
$d - Creation date
$q - Assigning or generating agency
$x - Validity end date
$u - Uniform Resource Identifier
$w - Bibliographic record control number
$0 - Authority record control number or standard number
$1 - Real World Object URI
$8 - Field link and sequence number

Although some of the subfield codes which field 883 currently contains do not align with RDA's categories of data provenance information, other subfields do. For example, a correspondence exists between subfield $q (Assigning or generating agency) on the one hand and an author agent on the other. Further subfields could be added to field 883 in order to record RDA's other categories of data provenance information on the proviso that it is not necessary to record either subfield $a (Creation process), $c (Confidence value) or $u (Uniform Resource Identifier) in the same string as these. The concept of a process, a confidence value, and a URI used to identify a process do not correspond to RDA's present element set or its guidance on the recording of data provenance information.

An advantage of expanding field 883 with subfields which support the recording of additional data provenance information is that it would offer the means to provide coverage at the field level in combination with subfield $8. In addition, the same approach could be used consistently across the MARC 21 formats. Sufficient alphabetic subfield codes are still available to represent all the subcategories of data provenance information set out by RDA. Of the 21 RDA data provenance elements listed and grouped in section 2.1.1., 17 could form a new subfield in field 883, e.g.:

$b - Note on metadata work
$e - Recording source
$f - Scope of validity*
$g - Source consulted**
$h - Assigned by agent**
$i - Context of use
$j - Date of usage**
$k - Reference source**
$l - Status of identification
$m - Undifferentiated name indicator
$n - Publisher agent (recording an agent who publishes metadata)*
$o - Author agent (recording an agent who records metadata)*
$p - Related work of work (recording a string encoding scheme used for metadata)
$r - Language of expression (recording a language of description)
$s - Script (recording a script of description)
$t - Date of publication (recording a timespan when metadata are published)*
$v - Related timespan of work (recording a timespan for validity of metadata)

* Note that some elements could be recorded using existing 883 subfields if less granularity is considered sufficient: "scope of validity" could be recorded using subfields $d and $x; "date of publication" could be recorded using subfield $d;  "author agent", "assigned by agent" and "publisher agent" could be recorded using subfield $q.

** Note that the inverse relationship elements "source consulted of", "assigned by agent of", "date of usage of" and "reference source of" are not included in the list of subfield labels modeled above. If subfield $i were used to record "Relationship information" as it is elsewhere in the formats, then it might be possible to record all RDA data provenance relationships in a MARC 21 context. However, application of the relationships "source consulted of", "assigned by agent of", "date of usage of" and "reference source of" appears to be predicated on the existence of authority records for sources consulted (e.g., RDA), assigning agencies (e.g., DNB), timespans (e.g., the year 2021) and reference sources (e.g., a dictionary of names) from which links are generated to any record which has been created using those sources, by those agencies and on those dates. Given the number of linked fields which would be required in any such scenario, this may be unachievable in a MARC context. The exclusion of these inverse relationships also applies to the alternative coding for data provenance elements which is modeled in subsequent sections of this paper.

Using field 883, subfield $8 would be mandatory and used to pair two fields together, i.e., the main field and field 883 containing meta-information about the content of the main field.

Subfield $8 has an internal structure:

$8 [linking number].[sequence number]\[field link type]

The linking number has to be a numeric value that is unique in the context of the record as a whole. Usually, only one main field and one field 883 share the same linking number. The linking number doesn’t necessarily start with the value "1", and it doesn’t necessarily follow the sequence of MARC fields in one record.

For the purposes of recording provenance data, the second part of the $8 content, the sequence number, is omitted. The third part of the $8 content, the field link type, is always set to the value "p", for "Metadata provenance".
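The structure of subfield $8 set out above can be illustrated with a short Python sketch. It is offered as an assumption-laden example rather than a reference implementation: the class and function names are hypothetical, and the pattern treats both the sequence number and the field link type as optional parts of the value.

```python
import re
from typing import NamedTuple, Optional

class FieldLink(NamedTuple):
    linking_number: int
    sequence_number: Optional[int]
    link_type: Optional[str]

# [linking number].[sequence number]\[field link type]
_LINK_PATTERN = re.compile(r"(\d+)(?:\.(\d+))?(?:\\([a-z]))?")

def parse_subfield_8(value):
    """Parse the content of a subfield $8 into its three parts."""
    match = _LINK_PATTERN.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"not a valid $8 value: {value!r}")
    number, sequence, link_type = match.groups()
    return FieldLink(int(number),
                     int(sequence) if sequence is not None else None,
                     link_type)

def is_provenance_link(link):
    """True when the link has no sequence number and field link type 'p'."""
    return link.sequence_number is None and link.link_type == "p"

provenance = parse_subfield_8("1\\p")   # the value 1\p
other = parse_subfield_8("3.2\\a")      # a link with a sequence number and another type
```

Here `is_provenance_link(provenance)` holds, while `other` is rejected because it carries a sequence number and a different field link type.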

Although subfield $8 would provide field level linkage, data provenance information may relate to a single subfield within a string, rather than the string as a whole. If this degree of specificity is required, then a second subfield would be necessary in order to identify the subfield within a string to which data provenance information recorded in field 883 applies. If 883 subfields $a - $x, $0, $1 and $8 are taken up for other purposes, then the following subfields remain available in order to support subfield linkage:

$y, $z, $2, $3, $4, $5, $6 and $7

For example, 883 subfield $y could contain characters a-z and 0-9 which match the code of a subfield for which data provenance information is the target.

A main disadvantage of using field 883 to record data provenance information at field level is that it would require implementation of subfield $8. Records with fields 883 are somewhat complex and fragile artifacts. A system always has to keep track of the links between the main fields and the 883 fields. In a cooperative environment, data maintenance, e.g., matching and merging of records with subfields $8 in them, poses specific problems. Whenever a new main field with $8 and a corresponding field 883 is inserted, the remaining structure has to be handled very carefully. The sequence of linking numbers can have gaps in it, but the links must remain unique on both sides. Whenever a main field with $8 "\p" is deleted, the corresponding field 883 has to be deleted, too.
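The bookkeeping described above can be made concrete with a small consistency check. The following Python sketch assumes a strict reading in which every linking number of type "p" must occur on exactly one main field and exactly one field 883; the input format and function name are hypothetical.

```python
from collections import defaultdict

def check_provenance_links(links):
    """Report linking numbers that are not paired one-to-one.

    links -- iterable of (tag, linking_number) pairs, one for each subfield $8
             of field link type "p" found in a record.
    Returns a dict mapping each problematic linking number to a tuple
    (count on main fields, count on 883 fields).
    """
    main_counts = defaultdict(int)
    prov_counts = defaultdict(int)
    for tag, number in links:
        if tag == "883":
            prov_counts[number] += 1
        else:
            main_counts[number] += 1

    problems = {}
    for number in sorted(set(main_counts) | set(prov_counts)):
        pair = (main_counts[number], prov_counts[number])
        if pair != (1, 1):
            problems[number] = pair
    return problems

# Gaps in the numbering (1, 4, 7) are acceptable; the orphaned 883 with
# linking number 7 is not, e.g. after its main field has been deleted.
record_links = [("100", 1), ("883", 1), ("650", 4), ("883", 4), ("883", 7)]
problems = check_provenance_links(record_links)
```

A system would need to run a check of this kind after every merge, insertion or deletion that touches a field carrying subfield $8.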

There are situations where the need for pairing fields has to be combined with other cases where field pairs or even field groups might be needed. Using fields 082 or 083 with fields 085 to provide single parts of a DDC notation is already an ambitious task; in combination with metadata provenance using field 883, it poses specific challenges. While in theory subfield $8 is repeatable, a combination of two use cases for pairing fields in one record does not look like good practice. A similar practical restriction seems to apply to the combination of primary data in original script and in transliterated form with meta-information: having a main field with both a subfield $6 and a field 880 (following model A of "Appendix D - Multiscript Records") and a subfield $8 and a field 883 might be possible in theory, but not in practice.

2.3.2. Deployment of Non Standard Subfield Delimiters

MARC 21's record structure is partially based upon the Format for Information Exchange (ISO 2709) which, in combination with the American Standard Code for Information Interchange (ASCII), allows for the use of capitalization, punctuation and mathematical symbols as identifiers within data fields. A range of such characters could be defined as subfields across the formats in order to record data provenance information.

The idea of a new subfield "$_" was brought into the discussion in January 2021. In search of one or more subfield codes, the MARC/RDA Working Group carried out a broader and more thorough analysis of this approach, looking specifically at the first block of 128 characters used in ISO/IEC 10646 / Unicode, "C0", also called "Basic Latin (ASCII)", and equivalent to ISO/IEC 646, of which a code chart is available via https://unicode.org/charts/ or directly at: http://www.unicode.org/charts/PDF/U0000.pdf.

Sorting out those characters which are already defined or in use, as an interim result, the following characters seem to be available (with comments on expectations):

Control characters             will likely / possibly cause records to break
Space / Blank                  hard to recognize, will likely cause confusion
"@"                            possible candidate
"A" through "Z"                risk of conflicts with lowercase equivalents
"[", "\", "]", "^", and "_"    possible candidates
"`"                            hard to recognize, will likely cause confusion
"{", "|", "}", and "~"         possible candidates
Control character DEL          will likely cause data loss

A closer examination by different criteria (the character for the new subfield code should not carry too much "meaning" in itself, it should not be used in regular expressions, it should not be at high risk of getting mixed up with a diacritic, a solution has to be feasible and valid in the context of MARCXML) narrowed the list down to 3 subfield codes which may be discussed further:

$@      Commercial at
$_      Underscore, Low line
$~      Tilde

Although a single non-standard subfield delimiter would provide field level linkage for data provenance information, this information may relate to a single part of a string, rather than the string as a whole. If such specificity is required, then a second non-standard subfield delimiter would be necessary in order to identify that part of a string to which data provenance information applies. If, for example, $_ were used to express the category of data provenance information (see sections 2.3.4. and 2.3.5.), then the following subfields remain available in order to support subfield linkage:

$@, $~

Subfield $~ or $@ could contain characters a-z and 0-9 which match the code of a subfield for which data provenance information is the target.

Preliminary questions put to technical experts about which character(s) to prefer and which one(s) to exclude in this scenario led to mixed results. The MARC/RDA Working Group is aware that it might be useful to canvass the MARC data programming community, e.g., the maintainers of the widely used MARC-oriented programming language libraries, as well as the authors of MARC-oriented toolboxes.

An advantage of using non-standard characters in order to record data provenance information is that it would offer field/subfield level coverage. Such an approach would also provide a consistent solution across the MARC 21 formats. In addition, adopting this as a method of recording data provenance information would avoid the requirement to implement $8, a subfield which remains unconfigured within many MARC 21 based library systems.

A disadvantage of introducing non-standard characters for subfield coding purposes is that it would go beyond the current scope of what the MARC 21 Formats: Background and Principles state may be used in this context. In setting out the range of delimiters which may be used, section 8.4.2 of the document states the following:

"Subfield codes in the MARC 21 formats consist of two characters--a delimiter [1F(16), 8-bit], followed by a data element identifier. Data element identifiers may be a lowercase alphabetic or a numeric character."

Even if it were decided to broaden the range of characters which can be used for subfield coding purposes in the MARC 21 formats, this may still result in problems for library management systems which are not set up to allow for a greater variety than that which is already possible. This situation may result in the new character or characters introduced to support data provenance becoming a form of local coding analogous to $9.
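The constraint quoted from section 8.4.2 lends itself to a simple check, sketched below in Python. The sketch assumes field data in its ISO 2709 form, with the delimiter 1F(16) preceding each data element identifier; the function names and the sample field content are hypothetical.

```python
import string

DELIMITER = "\x1f"  # subfield delimiter, 1F(16)
STANDARD_IDENTIFIERS = set(string.ascii_lowercase + string.digits)

def is_standard_identifier(char):
    """True for a single lowercase alphabetic or numeric character."""
    return len(char) == 1 and char in STANDARD_IDENTIFIERS

def nonstandard_codes(raw_field):
    """List the data element identifiers in a raw field that fall outside a-z and 0-9."""
    # Everything before the first delimiter (e.g. indicators) is skipped.
    segments = raw_field.split(DELIMITER)[1:]
    return [seg[0] for seg in segments if seg and not is_standard_identifier(seg[0])]

# Hypothetical field content using "_" and "@" as data element identifiers.
raw = "  \x1faBaroque sculpture\x1f_lDLC\x1f@a"
flagged = nonstandard_codes(raw)
```

A library system that validates incoming records along these lines would reject, strip or quarantine the flagged subfields, which is how such coding could end up being treated as local data.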

2.3.3. Deployment of Various Subfields for the Same Purpose Across Different Fields

Although no lower case alphabetic or numeric character remains wholly undefined in the MARC 21 formats, there are some characters which have rarely been used up until this point. For example, subfield $7 has hitherto only been deployed in two fields belonging to the Authority format and only twenty-two fields belonging to the Bibliographic format. In both cases alternative subfield characters are still free for use in fields where $7 has already been defined. The use of different subfields to record equivalent values is generally avoided in the MARC 21 formats, but is not entirely without precedent. For example, subfields $e and $j in the Bibliographic format may both carry a value for a relator term in circumstances where that term describes the relationship between a name and a work.

In the MARC 21 Authority Format, no fields apart from the following contain subfield $7:

856 (Electronic Location and Access)
Note: Subfields $e, $0, $1, $4 and $5 are still available for use and have not been previously made obsolete.

880 (Alternate Graphic Representation)
Note: With the exception of subfield $6 (Linkage), 880 subfielding follows that in linked fields, so this tag is excluded from further analysis.

In the MARC 21 Bibliographic Format, no fields apart from the following contain subfield $7:

533 (Reproduction Note)
Note: Subfields $g-$l, $0, $1, $2, $4 and $5 are still available for use.

760-787 (Linking Entries)
Note: Subfields $l (lowercase ell), $0, $1, $2, $3 and $5 are still available for use.

800-830 (Series Added Entries)
Note: Subfields $i, $y and $z are still available for use.

856 (Electronic Location and Access)
Note: Subfields $e, $0, $1, $4 and $5 are still available for use and have not been previously made obsolete.

880 (Alternate Graphic Representation)
Note: With the exception of subfield $6 (Linkage), 880 subfielding follows that in linked fields, so this tag is excluded from further analysis.

Although $7 and another subfield by exception (hereafter called $7+ for short) would provide field level linkage for data provenance information, this information may relate to a single part of a string, rather than the string as a whole. If such specificity is required, then additional coding would be necessary in order to identify that part of a string to which the data provenance information applies. If $7 were used to express the category of data provenance information (see sections 2.3.4. and 2.3.5.), then an identifier from a controlled list containing the characters a-z and 0-9 could be appended to this category which matches the code of a subfield for which data provenance information is the target.

For example, the following values model a selection of codes for such a controlled list:

dpsfa - Data provenance target subfield $a
dpsfb - Data provenance target subfield $b
dpsfc - Data provenance target subfield $c

Characters a-z and 0-9 could be used in isolation when linked to the category of data provenance (in a similar manner to the scenarios modeled in sections 2.3.1. and 2.3.2.). The formulation set out above, which consists of an abbreviation "dp" for data provenance, "sf" for subfield and an alphanumeric character for the subfield value, offers an alternative approach.
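By way of illustration, a code of the form modeled above could be resolved to its target subfield as follows. This Python sketch assumes that the controlled list consists solely of "dpsf" followed by one character from a-z or 0-9; the function name is hypothetical.

```python
import re

# "dp" (data provenance) + "sf" (subfield) + one subfield code character.
_TARGET_CODE = re.compile(r"dpsf([a-z0-9])")

def target_subfield(code):
    """Return the subfield code targeted by a value such as 'dpsfa', or None."""
    match = _TARGET_CODE.fullmatch(code)
    return match.group(1) if match else None

target_a = target_subfield("dpsfa")    # subfield $a
target_7 = target_subfield("dpsf7")    # subfield $7
invalid = target_subfield("dpsfA")     # uppercase is outside the modeled list
```

The fixed "dpsf" prefix makes the value self-describing when it is extracted from its field, at the cost of four extra characters per occurrence.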

An advantage of using $7+ coding across the formats for recording data provenance information is that it could offer field/subfield level coverage. Adopting this as a method of recording data provenance information would also avoid the requirement to implement $8 and would avoid the need to amend MARC 21's Background and Principles.

A disadvantage of using $7+ to record data provenance at the field level across the MARC 21 formats is that it would make the machine processing of such information a more complex task. Subfield sequencing of data provenance information may also not be consistent on a field by field basis.

2.3.4. The Technique of "Sublabels" in a Subfield

Both the option of deploying non-standard subfield delimiters and the option of using $7+ need a way to designate more specifically which RDA data provenance element is expressed in a given subfield.

One option is the definition of "sublabels": The subfield itself is defined as repeatable. The character immediately following the name of the subfield is the sublabel and specifies the remaining content of the subfield. The range is broad enough to cover each kind of element that has been defined in RDA as belonging to the group of provenance data elements.

Using the RDA elements of data provenance listed and grouped in section 2.1.1. (with the exception of those elements noted in section 2.3.1.), the following syntax is suggested modeling $_:

$_a - Note on metadata work
$_b - Recording source
$_c - Scope of validity
$_d - Source consulted
$_e - Assigned by agent
$_f - Context of use
$_g - Date of usage
$_h - Reference source
$_i - Status of identification
$_j - Undifferentiated name indicator
$_k - Publisher agent (recording an agent who publishes metadata)
$_l - Author agent (recording an agent who records metadata)
$_m - Related work of work (recording a string encoding scheme used for metadata)
$_n - Language of expression (recording a language of description)
$_o - Script (recording a script of description)
$_p - Date of publication (recording a timespan when metadata are published)
$_q - Related timespan of work (recording a timespan for validity of metadata)

There is room enough for many more sublabels, i.e., the remaining lowercase letters, all uppercase letters, and numeric characters.
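As an illustration only, the content of a sublabelled subfield could be split as sketched below. The table repeats the sublabels suggested above; the names SUBLABELS and split_sublabel are hypothetical, and the sketch assumes that the first character of the subfield content is always the sublabel.

```python
from typing import Tuple

# Sublabels as suggested in the list above (element names shortened).
SUBLABELS = {
    "a": "Note on metadata work",
    "b": "Recording source",
    "c": "Scope of validity",
    "d": "Source consulted",
    "e": "Assigned by agent",
    "f": "Context of use",
    "g": "Date of usage",
    "h": "Reference source",
    "i": "Status of identification",
    "j": "Undifferentiated name indicator",
    "k": "Publisher agent",
    "l": "Author agent",
    "m": "Related work of work",
    "n": "Language of expression",
    "o": "Script",
    "p": "Date of publication",
    "q": "Related timespan of work",
}


def split_sublabel(content: str) -> Tuple[str, str, str]:
    """Split subfield content into (sublabel, element name, value)."""
    if not content:
        raise ValueError("empty subfield content")
    # Assumption: the first character is the sublabel, the rest is the value.
    sublabel, value = content[0], content[1:].strip()
    if sublabel not in SUBLABELS:
        raise ValueError(f"undefined sublabel: {sublabel!r}")
    return sublabel, SUBLABELS[sublabel], value
```

Applied to the content "lDLC", the sketch would yield the sublabel "l", the element "Author agent" and the value "DLC".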

Some advantages of this option are that the context (the field) is always clear (as opposed to the 883 approach), that one subfield can more easily be extracted / filtered out if needed, and that punctuation issues are avoided.

On the other hand, the technique of sublabels has disadvantages: There is the risk of mixing up "$a" and "$_a" occurring in the same field (or similarly, the risk of mixing up "$a" and "$7a"). Regarding the format design of this solution, it seems unclear whether to look at the "a" in "$_a[Note on metadata work]" as being part of the content of "$_", or as being part of the content designation of "$_a". Furthermore, the sublabels would be a defined part of the syntax, listed in the MARC documentation, which may be less flexible than the option using coded values and punctuation, described in the next section.

2.3.5. The Technique of using Coded Values in a Subfield

If a new subfield is to be used (e.g. $_ or $7+) to specify the subcategory of data provenance information and the technique of sublabels is avoided, there is the option of defining an internal structure by using coded values and punctuation. As an analogy, subfield $0 defined as "Authority record control number or standard number", is described as follows:

"Subfield $0 contains the system control number of the related authority or classification record, or a standard identifier such as an International Standard Name Identifier (ISNI). These identifiers may be in the form of text or a Uniform Resource Identifier (URI). If the identifier is text, the control number or identifier is preceded by the appropriate MARC Organization code (for a related authority record) or the Standard Identifier source code (for a standard identifier scheme), enclosed in parentheses. When the identifier is given in the form of a Web retrieval protocol, e.g., HTTP URI, no preceding parenthetical is used."
(cf. e.g., MARC 21 Bibliographic, Appendix A-Control Subfields, https://www.loc.gov/marc/bibliographic/ecbdcntf.html).

So, the information about the context is given in parentheses, either taken from the list of "MARC Organization Codes", or from the list of "Standard Identifier Source Codes", and the main information follows.
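The parenthetical convention of subfield $0 can be read with a short routine such as the following sketch. The names ControlNumber and parse_subfield_0 are illustrative assumptions; the sketch simply treats any content that does not begin with a parenthetical as an unprefixed identifier such as a URI.

```python
import re
from typing import NamedTuple, Optional


class ControlNumber(NamedTuple):
    source: Optional[str]  # code given in parentheses, or None (e.g. for a URI)
    value: str             # the control number or identifier itself


# A parenthetical source code followed by the identifier.
PREFIXED = re.compile(r"\((?P<source>[^()]+)\)\s*(?P<value>.+)")


def parse_subfield_0(content: str) -> ControlNumber:
    """Separate the parenthetical source code from the identifier, if present."""
    content = content.strip()
    match = PREFIXED.fullmatch(content)
    if match:
        return ControlNumber(match["source"], match["value"])
    return ControlNumber(None, content)
```

For example, "(DE-588)1215943776" would be separated into the source "DE-588" and the value "1215943776", while "https://d-nb.info/gnd/1215943776" would be returned without a source.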

For the accommodation of data provenance, a similar design may be developed for the new subfield (regardless of whether "$7+" or "$_" is chosen): The type of the data provenance element is given as a code in parentheses, and the data provenance information itself follows as the remaining content of the subfield. This code would have to be taken from a code list consisting of all the types of data provenance elements according to RDA, and possibly beyond. Such a code list would have to be created and maintained by NDMSO at the Library of Congress. The following codes are suggested here (with the exception of those elements noted in section 2.3.1.):

dpenmw     Note on metadata work
dpers      Recording source
dpesv      Scope of validity
dpesc      Source consulted
dpeaba     Assigned by agent
dpecou     Context of use
dpedou     Date of usage
dperf      Reference source
dpesoi     Status of identification
dpeuni     Undifferentiated name indicator
dpepa      Publisher agent (recording an agent who publishes metadata)
dpeaa      Author agent (recording an agent who records metadata)
dperwow    Related work of work (recording a string encoding scheme used for metadata)
dpeloe     Language of expression (recording a language of description)
dpes       Script (recording a script of description)
dpedop     Date of publication (recording a timespan when metadata are published)
dpertow    Related timespan of work (recording a timespan for validity of metadata)

In the future, more types of data provenance elements may be added to the code list, either according to their adoption by the RDA community, or beyond. One example here might be the "norm or standard used for transliteration", cf. https://www.loc.gov/marc/mac/2016/2016-dp26.html.

In addition to this approach, the information in parentheses may be extended to express which subfield of the main field is covered, by adding a second code to the first one. For example, the code "dpsfb" would indicate that the data provenance information applies to subfield $b. Inside the parentheses, the two codes may be distinguished from each other by a "/".
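The parenthetical structure described here, with an optional second code after a "/", could be read as in the following sketch. It is a minimal illustration under stated assumptions: the name parse_provenance, the subset of codes in ELEMENT_CODES and the exact punctuation rules are hypothetical, since the code list and syntax have not been defined.

```python
import re
from typing import NamedTuple, Optional

# Subset of the element codes suggested in this section, for illustration.
ELEMENT_CODES = {
    "dpenmw": "Note on metadata work",
    "dpesc": "Source consulted",
    "dpeaa": "Author agent",
    "dpeloe": "Language of expression",
}

# "(element)" or "(element/dpsfX)", followed by the provenance value.
PATTERN = re.compile(
    r"\((?P<element>[a-z]+)(?:/dpsf(?P<target>[a-z0-9]))?\)\s*(?P<value>.*)"
)


class Provenance(NamedTuple):
    element: str                    # element code, e.g. "dpesc"
    target_subfield: Optional[str]  # subfield character, or None if not given
    value: str                      # the data provenance information


def parse_provenance(content: str) -> Provenance:
    """Split subfield content into element code, target subfield and value."""
    match = PATTERN.fullmatch(content.strip())
    if match is None:
        raise ValueError(f"unexpected structure: {content!r}")
    if match["element"] not in ELEMENT_CODES:
        raise ValueError(f"unknown element code: {match['element']!r}")
    return Provenance(match["element"], match["target"], match["value"])
```

Under these assumptions, "(dpesc/dpsfb) AACR 2" would be read as the element "dpesc" applying to subfield $b with the value "AACR 2", and "(dpesc) rda" as the same element with no subfield specified.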

Some advantages of this option are that the context (the field) is always clear (as opposed to the 883 approach), and that one subfield can more easily be extracted / filtered out if needed. In comparison to the sublabel technique, there does not seem to be an immediate risk of mixing up content and content designation. In addition, and maybe more importantly, this option is more flexible than the sublabel technique, because the code (designating which type of provenance data is given) does not have to be defined and documented in the main part of the MARC documentation, but only in a code list. This would make extension and maintenance easier.

A possible disadvantage is that punctuation might be seen as something to be avoided, because it has to be parsed carefully in order to extract the information and its context from the subfield.

3. SUMMARY OF OPTIONS

In summary, the following options and their possible combinations are to be discussed:

OPTION 1: Deployment of additional subfields in field 883, as described in section 2.3.1. ("field 883")

OPTION 2: Deployment of non-standard subfield delimiters, as described in section 2.3.2., in combination with the technique of sublabels, as described in section 2.3.4. ("$_ with sublabels")

OPTION 3: Deployment of non-standard subfield delimiters, as described in section 2.3.2., in combination with the technique of coded values, as described in section 2.3.5. ("$_ with coded values")

OPTION 4: Deployment of various subfields for the same purpose across different fields, as described in section 2.3.3., in combination with the technique of sublabels, as described in section 2.3.4. ("$7+ with sublabels")

OPTION 5: Deployment of various subfields for the same purpose across different fields, as described in section 2.3.3., in combination with the technique of coded values, as described in section 2.3.5. ("$7+ with coded values")

4. EXAMPLES

The following examples model the changes which are discussed under each of the options set out above. These are intended to be illustrative rather than prescriptive. In each case, the RDA element label, granularity and recording method of the data provenance information is provided for context. Additional notes are supplied where necessary.

4.1. Deployment of Additional Subfields in Field 883

Example 1

100 1# $8 3\p $0 (DE-588)1215943776 $0 https://d-nb.info/gnd/1215943776 $0 (DE-101)1215943776 $a Tolkiehn, Niels $d 1985- $e Verfasser $4 aut $2 gnd
883 ## $8 3\p $q DE-101 $r ger

Element : language of expression (recording a language of description)
Granularity : metadata description set (MARC 21 bibliographic format)
Recording method : identifier

Additional Notes:
A value for language of expression is modeled in 883 $r.

Example 2

041 ## $8 1\p $a eng
883 0# $8 1\p $a aep-lc $c 1,00000 $d 20190913 $q DE-101 $u https://d-nb.info/provenance/plan#aep-lc $g MARC Code List for Languages

Element : source consulted (recording a content standard used for metadata)
Granularity : metadata statement (MARC 21 bibliographic format)
Recording method : unstructured

Additional Notes:
A value for source consulted is modeled in 883 $g.

Example 3

700 1# $8 4\p $a Rapp, Christof $e Akademischer Betreuer $4 dgs
883 1# $8 4\p $a npi $d 20200824 $q DE-101 $r ger $y e

Element : language of expression (recording a language of description)
Granularity : metadata statement (MARC 21 bibliographic format)
Recording method : identifier

Additional Notes:
A value for language of expression is modeled in 883 $r ; a value for the subfield specified is modeled in $y.
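The pairing of a field with its 883 field through subfield $8 could be resolved as in the following sketch. The $8 structure (linking number, optional sequence number, and a field link type after a backslash) follows the existing MARC 21 definition; the record representation as (tag, list of subfields) pairs, the function names, and the use of $r and $y (which are only proposed in this paper) are assumptions made for illustration.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

Subfields = List[Tuple[str, str]]
Field = Tuple[str, Subfields]


class Link(NamedTuple):
    number: str
    sequence: Optional[str]
    link_type: Optional[str]


# <linking number>[.<sequence number>][\<field link type>]
LINK = re.compile(r"(?P<number>\d+)(?:\.(?P<sequence>\d+))?(?:\\(?P<type>[a-z]))?")


def parse_link(content: str) -> Link:
    """Parse the content of subfield $8, e.g. '4\\p'."""
    match = LINK.fullmatch(content.strip())
    if match is None:
        raise ValueError(f"unexpected $8 content: {content!r}")
    return Link(match["number"], match["sequence"], match["type"])


def provenance_links(subfields: Subfields) -> List[str]:
    """Return the linking numbers of all $8 subfields with link type 'p'."""
    links = (parse_link(value) for code, value in subfields if code == "8")
    return [link.number for link in links if link.link_type == "p"]


def linked_provenance(fields: List[Field]) -> List[Tuple[str, Subfields]]:
    """Pair each non-883 field tag with the 883 subfields linked to it."""
    by_number = {}
    for tag, subfields in fields:
        if tag == "883":
            for number in provenance_links(subfields):
                by_number.setdefault(number, []).append(subfields)
    pairs = []
    for tag, subfields in fields:
        if tag != "883":
            for number in provenance_links(subfields):
                pairs.extend((tag, prov) for prov in by_number.get(number, []))
    return pairs


# Example 3 above, in the assumed representation.
record = [
    ("700", [("8", "4\\p"), ("a", "Rapp, Christof"),
             ("e", "Akademischer Betreuer"), ("4", "dgs")]),
    ("883", [("8", "4\\p"), ("a", "npi"), ("d", "20200824"),
             ("q", "DE-101"), ("r", "ger"), ("y", "e")]),
]
pairs = linked_provenance(record)
```

The sketch also shows the cost of this option noted earlier: the data provenance information sits in a separate field, so the context has to be reconstructed by following the link.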

4.2. Deployment of Non-Standard Subfield Delimiters (here exemplified in combination with the technique of sublabels)

Example 1

370 ## $a Radzimyn, Poland $b Surfside, Fla. $_ lDLC

Element : author agent (recording an agent who records metadata)
Granularity : metadata description set (MARC 21 authority format)
Recording method : identifier

Additional Notes:
A value for author agent is modeled in 370 $_ l.

Example 2

500 1# $a Bland, Robin $_ p20190904105359.0

Element : date of publication (recording a timespan when metadata are published)
Granularity : metadata statement (MARC 21 authority format)
Recording method : structured

Additional Notes:
A value for date of publication is modeled in 500 $_ p.

Example 3

373 ## $a Faculty of Life Science, Manchester University $s 2005 $_ lUk $~ s

Element : author agent (recording an agent who records metadata)
Granularity : metadata statement (MARC 21 authority format)
Recording method : structured

Additional Notes:
A value for author agent is modeled in 373 $_ l ; a value for the subfield specified is modeled in $~ s.

4.3. Deployment of Various Subfields for the Same Purpose Across Different Fields (here exemplified in combination with the technique of coded values)

Example 1

245 00 $a Thorium, preparation and properties / $c J. F. Smith ... [et al.]. $7 (dpesc) AACR 2

Element : source consulted (recording a content standard used for metadata)
Granularity : metadata description set (MARC 21 bibliographic format)
Recording method : structured

Additional Notes:
A code for source consulted, a code for the recording method and an associated value are modeled in 245 $7. The code "dpesc" represents the data provenance element "source consulted"; "AACR 2" represents the value associated with the element.

Example 2

500 ## $a Title should read : Hierarchy in organizations. $7 (dpesc) rda

Element : source consulted (recording a content standard used for metadata)
Granularity : metadata statement (MARC 21 bibliographic format)
Recording method : identifier

Additional Notes:
A code for source consulted, a code for the recording method and an associated value are modeled in 500 $7. The code "dpesc" represents the data provenance element "source consulted"; "rda" represents the value associated with the element.

Example 3

245 10 $a Songs, duets, trios, &c. in Fontainbleau; or, our way in France. $b A [sic] comic opera. As performed at the Theatre-Royal in Covent-Garden. Written by Mr. O'Keeffe.  $7 (dpesc/dpsfb) AACR 2

Element : source consulted (recording a content standard used for metadata)
Granularity : metadata statement (MARC 21 bibliographic format)
Recording method : structured

Additional Notes:
A code for source consulted, a code for the recording method, a code for the subfield specified and an associated value are modeled in 245 $7. The code "dpesc" represents the data provenance element "source consulted"; the code "dpsfb" represents the subfield specified; "AACR 2" represents the value associated with the element.

5. BIBFRAME DISCUSSION

(Carried over from 2021-DP06) BIBFRAME allows the recording of provenance-type information. Once the change to MARC is determined, further analysis of any needed changes would be made.

6. QUESTIONS FOR DISCUSSION

6.1. Is the case for expanding MARC 21's accommodation of data provenance to better support established and emerging applications sufficiently articulated? (See 2.2.1. - 2.2.2.)

6.2. Have the overall challenges and mitigating strategies for expanding accommodation been sufficiently articulated? (See 2.2.3. - 2.2.4.)

6.3. Of the five options listed in Section 3 and described in 2.3., are there any advantages or disadvantages which have not been addressed?

6.4. Of the five options listed in Section 3 and described in 2.3., which one is considered preferable, and why?

6.5. Should inverse relationships for data provenance be included in any coding choice which is preferred and, if so, how?

6.6. Should the provision of coding for note on metadata work be sufficient as a catch all for information which does not fall into a more specific category of data provenance information?

6.7. Should it be permissible to record URIs using the notation modeled using non-standard subfields (e.g., $_ and $7), or should coding be confined to $0 and $1?

6.8. If another option is preferred, then what would this be?

6.9. Is there anything else which should be taken into account?


(10/22/2021)