Task Group on Normalization
Computer systems often need to determine whether two pieces of text are identical or may be treated as if identical. (Example: systems can compare headings in bibliographic records against established headings and see reference tracings in authority records.) Such pieces of text may vary in ways that are not considered significant for the purposes of the comparison; they may, for example, reflect differences in punctuation or spacing. If systems were to compare texts character by character as they appear in their sources, they would be compelled to create multiple entities where a single entity was intended. To ignore minor variations and treat pieces of text as the same despite them, systems normalize texts before comparing them. This process of normalization has typically consisted of the conversion of all alphabetic characters to the same case (for example, all uppercase), the removal of diacritics, the substitution or removal of special characters and marks of punctuation, and the regularization of spacing.
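The four steps just listed can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name normalize is hypothetical, every punctuation mark is replaced by a space, and no attempt is made to follow the NACO rules or any other published standard.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Illustrative normalization only; not the NACO rules."""
    # Decompose accented letters so each diacritic becomes a separate
    # combining mark, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Convert all alphabetic characters to the same case.
    folded = without_marks.upper()
    # Assumption: every punctuation mark or symbol (and the underscore)
    # is replaced by a space rather than deleted.
    no_punct = re.sub(r"[^\w\s]|_", " ", folded)
    # Regularize spacing: collapse runs of whitespace, trim the ends.
    return " ".join(no_punct.split())


print(normalize("Brontë, Charlotte,  1816-1855."))
```

Under these assumptions, "Brontë, Charlotte, 1816-1855." and "bronte charlotte 1816 1855" produce the same normalized string and would be treated as the same text.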
Catalogers and others who formulate headings are likewise often asked to compare texts that might contain minor differences, to normalize those texts mentally, and to act appropriately depending on the results of the normalized comparison. (Example: catalogers do not create a title access point in a bibliographic record that is the same, after normalization, as another title access point in the same record.)
Normalized texts play various roles in library work. Here are three important roles:
1. When a library system receives a user search query, it typically normalizes the search query, and compares it to the normalized form of texts extracted from the records of interest.
2. When a system compares two headings, the system normalizes the headings before deciding whether they are the same or different.
3. When a library system prepares a search results screen, it typically uses a normalized form of relevant texts to sort the items in the result set into some meaningful order.
Each of the roles to which normalization is put calls for the preparation of a normalized form tailored to that role. The normalized form best suited to one role may not be at all appropriate for other roles. For example, unless the searcher can be expected to include punctuation, the normalized form used in searching must remove punctuation; but punctuation may play an important role in the identification of unique headings and the sorting of search results. Unfortunately, library systems to date seem to be content to use a single normalized form for all purposes.
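The difference between role-specific forms can be illustrated with a short Python sketch. The names search_key and match_key are hypothetical, and the choice to retain commas in the matching form is an assumption made only for illustration; it is not taken from any existing standard.

```python
import re
import unicodedata


def _fold(text: str) -> str:
    # Shared first step: remove diacritics and convert to uppercase.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).upper()


def search_key(text: str) -> str:
    # Searching: the searcher is not expected to type punctuation,
    # so all punctuation is replaced by spaces.
    return " ".join(re.sub(r"[^\w\s]|_", " ", _fold(text)).split())


def match_key(text: str) -> str:
    # Heading comparison (assumption): commas are kept so that headings
    # differing only by a comma remain distinct.
    kept = re.sub(r"[^\w\s,]|_", " ", _fold(text))
    kept = re.sub(r"\s*,\s*", ", ", kept)
    return " ".join(kept.split())


for heading in ("Chung, Hui", "Chung Hui"):
    print(search_key(heading), "|", match_key(heading))
```

In this sketch the two headings share a search form, so a query for either finds both, while their matching forms differ, so a system would not merge them into a single heading.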
Various standards have been formulated to serve the purpose of the operation now called normalization. For example, roughly 40% of the ALA filing rules (1968) is devoted to the recognition and treatment of variant forms. The most important current normalization standard is the set of Authority file comparison rules (NACO normalization) issued by the Program for Cooperative Cataloging. (1) This standard, however, is concerned only with the normalization performed to determine whether two headings are to be considered the same or different; it is not concerned with normalization performed for other purposes.
Because there is no accepted standard for all purposes, system designers, operating in isolation, have developed their own schemes for their own needs. This causes problems when moving records from one system to another: a title added entry that would be redundant in one system may be required in another.
Systems may need to be aware of their operating context in order to prepare the most suitable normalized form. For example, a system operating in one country may need to handle the Polish slash-l (Ł) as a character distinct from the character 'L' without a slash, while a system operating in another country may need to handle the two characters as if they were the same.
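A context-dependent rule of this kind could be expressed as a setting passed to the normalization routine. The Python sketch below is illustrative only; the function fold_letters and its keep_l_stroke parameter are hypothetical. It relies on the fact that the Polish L with stroke (U+0141) has no canonical decomposition in Unicode, so it is not reduced to a plain L when diacritics are stripped and must be handled explicitly.

```python
import unicodedata

L_WITH_STROKE = "\u0141"  # LATIN CAPITAL LETTER L WITH STROKE


def fold_letters(text: str, keep_l_stroke: bool) -> str:
    """Uppercase and strip diacritics; optionally treat L with stroke as L."""
    out = []
    for ch in unicodedata.normalize("NFD", text.upper()):
        if unicodedata.combining(ch):
            # Drop combining diacritical marks.
            continue
        if ch == L_WITH_STROKE and not keep_l_stroke:
            # Context in which the two letters are treated as the same.
            ch = "L"
        out.append(ch)
    return "".join(out)


print(fold_letters("Łódź", keep_l_stroke=True))
print(fold_letters("Łódź", keep_l_stroke=False))
```

With keep_l_stroke set to True the letter survives normalization as a distinct character; with False the result is indistinguishable from a text spelled with a plain L.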
The ongoing implementation of Unicode™ adds further complexity. Existing normalization standards (such as the NACO standard) typically treat only the core portion of the MARC-8 character set (i.e., omitting Arabic, Chinese, Cyrillic, Greek, Hebrew, Japanese and Korean characters). Designers of systems that need to handle the full MARC repertoire (MARC-8, UTF-8, or eventually full Unicode) have been left to their own best lights, with predictable variations in application.
The PCC Policy Committee charges the Standing Committee on Automation Task Group on Normalization to investigate the normalization issue in all of its aspects. The outcome of this group will be:
1. An identification of the various purposes to which normalization has been or may be used in library systems and the kind of normalization appropriate to each.
2. A detailed normalization scheme (intended to supplant the existing NACO scheme) for the handling of the extended Latin character set, together with a description of the work required on the part of library system vendors to implement it.
3. An extension of the normalization scheme for the extended Latin character set to Arabic, Cyrillic, Greek and Hebrew characters, together with a description of the work required on the part of library system vendors to implement it.
4. Principles for the extension of the normalization scheme to other alphabetic scripts.
5. Principles for the extension of the normalization scheme to other scripts.
While recognizing the need for different normalized forms in different contexts, the Task Group will limit its recommendations to normalized forms to be employed by systems used by institutions building their catalogs under the English version of the Anglo-American cataloging rules.
The Task Group will limit its recommendations to the left-anchored access fields in authority and bibliographic records. (The Task Group will consider all bibliographic access fields--those under authority control, and those not under authority control.) The Task Group is not asked to consider the normalization of call numbers, or normalization used to arrange items by date. The Task Group is asked to update the recommendations of the Task Group on Series Numbering (2) as necessary.
A draft final report is due no later than May 2006, with the final report to follow no later than December 2006.
Members and e-mail addresses:
University of Florida
Library of Congress