Machine-Derived Name Authority Records

presented by
Deta S. Davis
Team Leader
Special Materials Cataloging Division
for the
Authorities Subcommittee
Bibliographic Control Committee
Music Library Association

February 13, 1998
Rev. July 1998
Boston, Massachusetts

Background

In order to deploy its resources more efficiently, the Library of Congress plans to use purchased OCLC bibliographic records for sound recordings, largely without cataloger intervention. We feel this is possible because a study at the Library of Congress has shown that access points in OCLC records are fairly accurate for all but uniform titles. Consequently, a special joint project between the Library of Congress and OCLC was created to correct uniform titles via machine process using correction algorithms, the LC-OCLC Uniform Title Correction Project. As an essential part of this project, the Library of Congress requested that OCLC develop procedures for creating machine-derived authority records and this part of the process has been completed.

As one of the steps in the critical path for the preparation for the uniform title correction project, the machine-derived authority records were created and added to the Name Authority File. These new authority records, in addition to headings already in the Name Authority File, are the foundation of the uniform title correction project. In the correction project, as each title is evaluated, it is first compared against headings in the Name Authority File. We needed as many correct headings in the Name Authority File as possible for this step to work optimally. In addition, the records needed to be specially created because the standard operating procedures in music cataloging at the Library of Congress are not to create an authority record for a uniform title if the authority record was not needed to record research or references. The creation of the machine-derived authority records has provided substitute authority records for these valid headings that had no previous authority records. No one knew how many authority records would be generated by this project until the process was completed. We had never kept statistics on how many headings were not covered by an authority record, and we had no mechanism to find out. Guesses ranged from 50,000 or 60,000, up to 93,000 records. On the day before the records began the loading process, December 15, 1997, we learned there were about 66,000 authority records created. On the final day of loading the actual total was 64,194.

Preparations

The machine-derived authority records were created after OCLC provided Authority Control Service (ACS) and uniform title processing on the MUMS Music File. The first step was to ensure that all headings were accurate so the derived headings would be accurate. In total, we manually corrected more than 3,000 headings in the MUMS Music File, and examined thousands of others, as part of the process to prepare for the authority record creation. OCLC provided LC with this correction service so that the best results could be attained in the machine-derived authority processing.

OCLC provided the heading corrections of the MUMS Music File in two steps. The first step covered personal names, corporate bodies, series, and subject headings. The latter two types of corrections (series and subjects) did not affect the uniform title creation, but we took the opportunity to have all access points included in the file cleaned up. The first round of review and corrections, done in June and July of 1997, took about five weeks to complete and involved about 2,500 corrections. The process was focused on eliminating variant forms of the same headings, typos, incorrect spacing, etc., which could adversely affect the machine-derived authority record creation. For example, if any heading was off by even one character in the text or tagging, it would generate an incorrect name authority record.

In October 1997, we received the first trial output from the uniform title correction algorithms, which were also run against the MUMS Music File. While the output produced numerous false hits because the algorithms were not yet fully refined, we still found another 570 records to correct. After these corrections were made, OCLC proceeded with the creation of machine-derived name authority records from the verified bibliographic records in the LC Music File for headings that had never been covered by an authority record. The records finally began the load process into the LC Authority File on December 16, 1997. The last records were loaded on February 4, 1998. The uniform title processing, which all of this was done to support, will take place sometime later in 1998.

An additional objective of creating the derived authority records was to allow an efficient, but reliable, replacement for the procedures currently used by the music catalogers in Special Materials Cataloging Division. The derived authority records will enable music catalogers to find authoritative uniform titles by searching the Names File alone. Because the MUMS Music File has been limited to AACR2 records since its creation in 1984, catalogers have been able to assume that access points in the MUMS Music File are in correct AACR2 form even if they are not represented by name authority records in the Names File. The usual practice is for the music catalogers to search a combination of the MUMS Music File and Names File to find all current and valid forms of music names and titles. When the purchased sound recording bibliographic records from OCLC are loaded into the MUMS Music File, this reliance on the two files will no longer be possible. While the uniform title correction project by OCLC will clean up a significant number of music uniform titles on these records, not all will be corrected by this process. Why not? Because not all undesirable variations in uniform titles can be predicted, nor does it make sense to write code for errors that occur very infrequently. Some of these will inevitably slip through the several checks in the system.

After we had evaluated and corrected the personal names and corporate bodies, I returned information to OCLC regarding conflicts in usage between their internal authority file and headings we used in our bibliographic records. OCLC made adjustments in their internal authority file so that the additional forms of each name are linked to one usage. We hope this will reduce the number of false corrections to name headings in the future.

After the machine-derived authority records (MDARs) were loaded, we found many errors that the ACS had not caught, such as tagging errors (a 100 tagged as a 110, missing subfields) and dates that did not match. We have discussed this with OCLC. They feel that these problems could be resolved with additional enhancements in the future.

Process

I. LC Decisions Regarding Format

This project began as an exploratory LC-OCLC joint effort. OCLC created machine-derived authority records from name and uniform title headings in verified bibliographic records currently in LC's MUMS Music File that were not already represented by an authority record. Verified records are those which have had all cataloging work completed on them and the cataloger is satisfied that all aspects of the record are correct. The headings from verified records that were used as the source of the authority records are all 1XX and 7XX names not already represented by authority records, all 100/240 and 700 $t combinations, all 130 and 730's, and any 710 $t combinations. The machine-derived authority records can contain a 1XX field with any of the following subfields in cases of uniform titles:

     t    Uniform title (converted from subfield a of 240)
     d    Date of treaty signing
     f    Date of a work
     k    Form subheading
     l    Language of a work
     m    Medium of performance for music
     n    Number of part/section of a work
     p    Name of part/section of a work
     r    Key for music

The authority records should not include subfield o (Arranged statement for music) nor subfield s (Version). If a uniform title in the Music File included a subfield "o" or "s", an authority record should have been established for the title if it did not already exist, but without these subfields. The name portion of a uniform title was also established if it was not already represented in the Authority File.
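As an illustration only, the subfield rule described above might be sketched in Python as follows. The representation of a field as a list of (code, value) pairs and the function name are assumptions made for this sketch; they do not represent OCLC's actual programming. The sample data omits the usual punctuation between subfields, which a real process would also have to tidy after dropping a subfield.

```python
# Subfields the project allowed in the title portion of a name/title 1XX.
ALLOWED_TITLE_SUBFIELDS = {"t", "d", "f", "k", "l", "m", "n", "p", "r"}


def derive_title_subfields(field_240):
    """Turn 240 subfields into the title subfields of an authority heading.

    field_240 is a list of (code, value) pairs (a hypothetical layout).
    Subfield a becomes subfield t; subfields o (arranged statement) and
    s (version), and anything else not on the allowed list, are dropped.
    """
    derived = []
    for code, value in field_240:
        if code == "a":
            code = "t"  # the 240 $a is carried over as $t
        if code in ALLOWED_TITLE_SUBFIELDS:
            derived.append((code, value))
    return derived


if __name__ == "__main__":
    sample = [("a", "Sonatas"), ("m", "piano"), ("r", "C major"), ("o", "arr.")]
    print(derive_title_subfields(sample))
```

Keeping the allowed subfields in one set makes the exclusion of o and s a consequence of the list itself, so the rule can be adjusted in a single place.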

In addition to the 1XX field, each record also contains a 670 which identifies the item from which the heading was taken. To make these records clearly identifiable as coming from Library of Congress records, each 670 begins with the legend "LCCN" and is followed by the LCCN of the source of the heading. While current name authority creation practice is to generally exclude the main entry from the 670 citation, for this project, we decided that the surname and first initial from the 100, or the 110, be included in the 670. This is because in many cases 245 titles for music records are not distinctive and this addition will provide sufficient identification of the work in these cases. The title for the 670 was taken from the 245 subfield a, and the date taken from subfield c in the 260. If usage of the 1XX heading occurred in the subfield c of the bibliographic record, it was included in the subfield b of the 670. The 040 field has OCLC's symbol in both the "a" and "c" subfields because OCLC created the records.
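The assembly of the 670 described above can be pictured with a short Python sketch. The exact punctuation and spacing of the citation are not specified here, so the layout below is an assumption, as are the function name, the parameter names, and the made-up LCCN and headings in the example. Punctuation that normally ends a 245 subfield a is not handled.

```python
def build_670(lccn, title, date, personal_name=None, corporate_name=None, usage=None):
    """Build 670 subfields as a list of (code, value) pairs (hypothetical layout).

    lccn           LCCN of the bibliographic record the heading came from
    title          text of the 245 subfield a
    date           text of the 260 subfield c
    personal_name  100 heading; reduced to surname and first initial
    corporate_name 110 heading; used in full when there is no 100
    usage          form of the heading as found on the item, for subfield b
    """
    if personal_name:
        surname, _, forenames = personal_name.partition(",")
        forenames = forenames.strip()
        entry = surname.strip()
        if forenames:
            entry += ", " + forenames[0]  # first initial only
    else:
        entry = (corporate_name or "").strip()

    # The legend "LCCN" marks the record as derived from an LC record.
    body = ". ".join(part for part in (entry, f"{title}, {date}") if part)
    subfields = [("a", f"LCCN {lccn}: {body}")]
    if usage:
        subfields.append(("b", usage))
    return subfields


if __name__ == "__main__":
    # The LCCN below is a placeholder, not a real record number.
    print(build_670("00-000000", "Sonatas", "1990",
                    personal_name="Beethoven, Ludwig van, 1770-1827"))
```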

To identify these records as machine derived, each record has a 667 with the text: "Machine-derived authority record," as a short-, medium-, or even long-term solution. At a later date these could be converted by OCLC to either an 008/08 or 008/33 (if decided by the Program for Cooperative Cataloging and MARBI) or an 042. At present none of these options is possible: the former because it has not yet been decided, and the latter because the 042 has not been implemented in the MUMS Authority File. Until a decision is made by MARBI, the 008/33 will be coded as "d" for "preliminary." The 008/39, Cataloging Source, will be "c" because these records are actually being generated at OCLC, which is a NACO participant. The source code "c" indicates that the creator of the authority data is a participant (other than the National Agricultural Library and the National Library of Medicine) in a cooperative cataloging program with the Library of Congress.
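The identifying elements just described can be summarized in a brief Python sketch. The dictionary layout, the function names, and the placeholder symbol "XXX" are assumptions for illustration only; the actual records are MARC authority records, and the real OCLC symbol would appear in the 040.

```python
MDAR_NOTE = "Machine-derived authority record"


def apply_mdar_coding(record, oclc_symbol):
    """Add the project's identifying elements to a record held as a dict."""
    record["008/33"] = "d"  # preliminary, pending a MARBI decision
    record["008/39"] = "c"  # cooperative cataloging program participant
    record["040"] = [("a", oclc_symbol), ("c", oclc_symbol)]
    record.setdefault("667", []).append(MDAR_NOTE)
    return record


def is_machine_derived(record):
    """Report whether a record carries the machine-derived 667 note."""
    return any(note.startswith(MDAR_NOTE) for note in record.get("667", []))


if __name__ == "__main__":
    rec = apply_mdar_coding({"100": [("a", "Example, Name")]}, "XXX")
    print(rec["008/33"], rec["008/39"], is_machine_derived(rec))
```

Testing for the 667 note rather than the 008/33 value reflects the fact that the note is the one element meant to stay with the record for its lifetime.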

II. What OCLC Did

OCLC consulted with the Library of Congress in testing the results of the creation of derived authority records. Together we reviewed sample records representing all types of authority records including personal names, corporate bodies, and name/uniform titles. OCLC made several useful additions to the records, such as appropriate cross references which can be structured from the rules. An example of this is providing a cross-reference from the second surname in a compound surname. As we had needed, the machine-derived authority records were delivered and distributed before any bibliographic corrections processing started. The records were not reviewed by OCLC staff prior to delivery to LC. However, OCLC staff reported any errors that they discovered while loading.

When the machine-derived authority records were created, OCLC processed the file in the following manner:

The file was pre-processed to check for certain types of errors before being loaded into the OCLC Batchload Authority Save File (in increments of less than 10,000 records). From the Batchload Authority Save File, a manually applied macro added 1,000 to 3,000 records each weekday (or five days per week, excluding holidays) to the authorities contribution file. The daily loading continued until all machine-derived authority records (MDARs) were loaded and distributed to the Library of Congress.
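As a rough check on the loading schedule, the stated rate of 1,000 to 3,000 records per weekday can be compared with the final total of 64,194 records and the load dates of December 16, 1997 through February 4, 1998. The Python sketch below assumes that the contribution at OCLC ran over the same window as the LC load and does not subtract holidays; both are simplifications.

```python
import math
from datetime import date, timedelta

TOTAL_RECORDS = 64_194          # final total reported for the project
FIRST_LOAD = date(1997, 12, 16)
LAST_LOAD = date(1998, 2, 4)
MIN_PER_DAY, MAX_PER_DAY = 1_000, 3_000


def weekdays_between(start, end):
    """Count Monday-Friday dates from start to end inclusive (holidays ignored)."""
    span = (end - start).days + 1
    return sum(1 for i in range(span)
               if (start + timedelta(days=i)).weekday() < 5)


weekdays = weekdays_between(FIRST_LOAD, LAST_LOAD)
fastest = math.ceil(TOTAL_RECORDS / MAX_PER_DAY)   # fewest weekdays needed
slowest = math.ceil(TOTAL_RECORDS / MIN_PER_DAY)   # most weekdays needed
average = TOTAL_RECORDS / weekdays                 # implied records per weekday

if __name__ == "__main__":
    print(f"weekdays in load window: {weekdays}")
    print(f"weekdays needed at stated rates: {fastest} to {slowest}")
    print(f"implied average per weekday: {average:.0f}")
```

The implied average falls inside the stated daily range, which suggests the reported dates, rate, and total are consistent with one another.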

III. The LC Load

The process by which OCLC submits machine-derived authority records to the Library of Congress is described below:

By 5:30 a.m. (EST), name authority records from OCLC and other NACO nodes are retrieved (via FTP) and loaded in LC's online Name Authority File. Records are assigned an LCCN by the contributing NACO node.

OCLC contributes records each day, seven days a week. LC loads records six days a week because input/update is not operational in the LC system on Sundays. Records contributed by OCLC are the records that were "approved" for contribution by NACO participants during the previous day. These records could have been input or updated at an earlier date.

By 10:30 p.m. (EST) the same day, reports are generated by the LC mainframe that list all non-LC contribution and response activity. The reports are based on symbols present in the 040 subfield a and the last 040 subfield d of each record. A listing of all the records in the daily distribution queue is also generated.

Beginning at 1:30 a.m. (EST) the following morning, FTP files are created from LC online files. The following FTP files are generated:

All NACO nodes, except ODE and NLS, retrieve the LC Distribution Queue each day. All NACO nodes retrieve their Response Queue each day.

Records remain in the LC online files for fourteen calendar days and in FTP files for twenty calendar days. Problems with the size of the online files--resulting from loading large numbers of records--were a possibility at LC, but fortunately never a reality. (High numbers had never been tested.) In addition, it should also be noted that other internal load procedures also utilize the same queue processing, such as CONSER load, daily OCLC bibliographic load, CJK load, CDS internal distribution, etc.

There were a few technical problems with loading these records at OCLC, which caused only a few days' delay in reaching the Library of Congress. The problems were all resolved quickly.

Usage

These records may be used, modified, and upgraded according to normal Library of Congress and NACO authority procedures. If appropriate, the "Preliminary" encoding "d" in the 008/33 should be replaced. The 667, "Machine-derived authority record," should be retained in the record regardless of additional changes that are made to the record. There are some downsides to creating authority records on such a large scale. OCLC provided the Authority Control Service processing for names, corporate bodies, and uniform titles of the Music File, and we have been systematically searching for other problems. Even so, and much as we would like to think the bibliographic records in the Music File are perfect, not all of them are. We're human, and we have been known to make mistakes. Some of these have been enshrined in the machine-derived authority records. We had no realistic human way of finding all the errors. If you encounter any of these errors, please report them through usual channels, or let me know. My E-mail address is ddav@loc.gov. We do want to eliminate as many incorrect headings as possible, as quickly as possible.

Benefits

Unlike the music catalogers at the Library of Congress, music catalogers on the outside had not had the ability to readily or easily access all authoritative music headings, and particularly music uniform titles. This part of the LC/OCLC Uniform Titles Corrections Project will make all names and uniform titles used by the Library of Congress readily available to all users of the Name Authority File. This benefits music catalogers significantly by decreasing the amount of authority work they need to do locally.

Lessons Learned

There were many duplicate and incorrect MDARs created during this first experience in the production of machine-derived authority records. If we can learn how to improve the process from what we have achieved so far, then even the problematic MDARs provide essential information. An analysis of the errors shows that they fall into about five basic types:

  1. Duplicates created of existing headings already covered by a NAR
  2. Duplicates created by the delay in loading the records
  3. Typos, mis-tagging, and judgement errors
  4. Records with non-European diacritics
  5. Other

While there may be a tendency to believe that the category of typos, mis-tagging, and judgement errors would generate the largest number of errors, it didn't. I have ranked the errors above in order of frequency, based only on experiential evidence, though the first two categories are rather close in the number of records.

So how did so many duplicates get created for headings that were already in the Names File? It appears possible that the copy of the Name Authority File was not retrieved in sync with the copy of the Music File which was sent to OCLC. If this was so, it would explain why the Music File had already-established headings which did not appear in the copy of the Authority File that OCLC used. With help from Gary Strawn at Northwestern University, we found these records and more, and removed them relatively quickly. When we do this again, we need to make sure that all files are concurrent.

The second category of duplicates, those caused by the delay in loading the records, was seen as inevitable with the current system limitations between OCLC and LC. I had also expected that the first two types of duplicates listed above would be found and corrected with already established procedures that are in place at the Library of Congress. This particular set of duplicates shows how many records our NACO-Music libraries are contributing, the largest source of this type of duplicate. We would want to implement pre-coordination with our NACO-Music contributors for future batch processing.

The third category revealed some limitations of the OCLC ACS process as it previously existed. For example, while it found many errors which we did correct before the MDAR creation, it did not detect any MARC tagging or subfield coding errors. It also did not provide any corrections to the d subfield for personal names. We have asked that OCLC provide a normalized comparison of headings in their authority control processing and they are currently looking into it. There wasn't much more we could have done about the judgement and lack-of-searching errors other than upholding the usual high standards of the music catalogers.

Regarding the records with non-European diacritics, I have forwarded many of these records to OCLC and they are currently working on eliminating this problem. The other problems are those that occur rather infrequently and cannot be addressed on a large scale, such as NARs not verified in a timely way and records that are out in "Cyberia," but not in LC's files. The latter are referred to as "zombie NARs" by Larry Dixson, a Senior Network Specialist in the Network Development and MARC Standards Office.
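To show what a normalized comparison of headings could look like, here is a minimal Python sketch. It is a simplification written for this illustration and is not the NACO normalization rules or OCLC's processing: it lowercases, removes diacritics, treats most punctuation as spaces, and collapses spacing. The sharp, flat, ampersand, and plus signs are kept on the assumption that they can distinguish music headings (for example, keys). The function names are hypothetical.

```python
import re
import unicodedata


def normalize_heading(heading):
    """Reduce a heading to a simplified comparison form."""
    decomposed = unicodedata.normalize("NFD", heading)
    # Drop combining marks so that diacritic variants compare equal.
    unmarked = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation with spaces, keeping a few significant signs.
    cleaned = re.sub(r"[^\w\s#♯♭&+]", " ", unmarked.lower())
    # Collapse runs of whitespace to single spaces.
    return " ".join(cleaned.split())


def find_duplicates(new_headings, existing_headings):
    """Return the new headings whose normalized form already exists."""
    existing = {normalize_heading(h) for h in existing_headings}
    return [h for h in new_headings if normalize_heading(h) in existing]


if __name__ == "__main__":
    established = ["Dvořák, Antonín, 1841-1904"]
    candidates = ["Dvorák,  Antonín, 1841-1904", "Smetana, Bedřich, 1824-1884"]
    print(find_duplicates(candidates, established))
```

A comparison of this kind would flag headings that differ only in spacing, punctuation, or diacritics before a new record is created, which is the kind of variation behind many of the duplicates described above.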

In summary, I think this project proceeded very well, especially considering that large-scale creation of MDARs had never been tried before in any way. We encountered a few problems that are solvable with experience. We are also able to correct the problems generated by the MDAR programming. In addition, OCLC is being responsive in improving the process so that it will work better the next time it is used.

