Standing Committee on Automation
Task Group on Normalization
Background:
Computer systems often need to determine whether two pieces of text
are identical or may be treated as if identical. (Example: systems
can compare headings in bibliographic records against established
headings and see reference tracings in authority records.)
Such pieces of text may vary in ways that are not considered significant
for the purposes of the comparison; they may for example reflect
differences in punctuation or spacing. If systems were to compare
texts character-by-character as they appear in their sources, systems
would be compelled to create multiple entities where a single entity
was intended. To help systems ignore minor variations and treat pieces
of text as the same despite those variations, systems normalize texts
before comparing them. This process of normalization has
typically consisted of the conversion of all alphabetic characters
to the same case (for example, all uppercase), the removal of diacritics,
the substitution or removal of special characters and marks of punctuation,
and the regularization of spacing.
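
The four steps just listed can be sketched in a few lines of Python. This is only an illustration under simple assumptions; it is not the NACO rules or any system's actual algorithm. The function name `normalize` is hypothetical, and the choice to replace every non-alphanumeric character with a space is one of several possible treatments.

```python
import unicodedata

def normalize(text: str) -> str:
    """Illustrative normalization: diacritics, case, punctuation, spacing."""
    # Decompose so that diacritics become separate combining characters.
    decomposed = unicodedata.normalize("NFD", text)
    # Remove the combining marks (the diacritics).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Convert all alphabetic characters to the same case.
    upper = stripped.upper()
    # Substitute a space for anything that is not a letter or digit.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in upper)
    # Regularize spacing: collapse runs of spaces and trim both ends.
    return " ".join(cleaned.split())

print(normalize("Brontë,  Charlotte."))  # BRONTE CHARLOTTE
```

Under this sketch, "Brontë,  Charlotte." and "bronte charlotte" produce the same normalized form and would be treated as the same text.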
Catalogers and others who formulate headings are likewise often
asked to compare texts that might contain minor differences, mentally
to submit those texts to a process of normalization, and to act appropriately
depending on the results of a normalized comparison. (Example: catalogers
do not create a title access point in a bibliographic record that
is the same, after normalization, as another title access point in
the same record.)
Normalized texts play various roles in library work. Here are three
important roles:
1. When a library system receives a user search query, it typically
normalizes the search query, and compares it to the normalized form
of texts extracted from the records of interest.
2. When a system compares two headings, the system normalizes the
headings before deciding that they are the same, or different.
3. When a library system prepares a search results screen, it typically
uses a normalized form of relevant texts to sort the items in the
result set into some meaningful order.
Each of the roles to which normalization is put calls for the preparation
of a normalized form tailored to that role. The normalized forms
best suited to one role may not be at all appropriate for other roles.
For example, unless the searcher can be expected to include punctuation,
the normalized form used in searching must remove punctuation; but
punctuation may play an important role in the identification of unique
headings and the sorting of search results. Unfortunately, up to
this point library systems seem to have been content to use a single
normalized form for all purposes.
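
The difference between role-specific forms can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the function names `search_key` and `match_key` are hypothetical, and the decision to keep commas in the heading-comparison form is just one example of punctuation that might be treated as significant.

```python
import unicodedata

def _fold(text: str) -> str:
    """Strip diacritics and uppercase (shared by both forms)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).upper()

def search_key(text: str) -> str:
    """Form for matching user queries: all punctuation removed."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in _fold(text))
    return " ".join(cleaned.split())

def match_key(text: str) -> str:
    """Form for comparing headings: commas kept as significant (assumption)."""
    cleaned = "".join(ch if ch.isalnum() or ch == "," else " " for ch in _fold(text))
    return " ".join(cleaned.split())

a, b = "Chung, Hui", "Chung Hui"
print(search_key(a) == search_key(b))  # True: a searcher need not type the comma
print(match_key(a) == match_key(b))    # False: the headings remain distinct
```

A system that used only the first form for every purpose would merge the two headings; one that used only the second would fail searches entered without the comma.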
Various standards have been formulated to serve the purpose of the
operation now called normalization. For example, roughly 40% of the ALA
filing rules (1968) is devoted to the recognition and treatment
of variant forms. The most important current normalization standard
is the set of Authority file comparison rules (NACO normalization) issued
by the Program for Cooperative Cataloging. (1) This standard, however,
is concerned only with the normalization performed to determine whether
two headings are to be considered the same or different; it is not
concerned with normalization performed for other purposes.
Because there is no accepted standard for all purposes, system
designers, operating in isolation, have developed their own schemes
for their own needs. This causes problems when moving records from
one system to another: a title added entry that would be redundant
in one system may be required in another.
Systems may need to be aware of their operating context in
order to prepare the most suitable normalized form. For example,
a system operating in one country may need to handle the Polish
slash-l (Ł) as a character distinct from the character 'L' without
a slash, while a system operating in another country may need to
handle the two characters as if they were the same.
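
A context-dependent rule of this kind might be sketched in Python as follows. The function name `fold_l_with_stroke` and its `treat_as_plain_l` switch are hypothetical, standing in for whatever configuration a real system would use. An explicit mapping is needed because the Unicode characters U+0141 and U+0142 have no canonical decomposition, so generic diacritic stripping would leave them unchanged.

```python
def fold_l_with_stroke(text: str, treat_as_plain_l: bool) -> str:
    """Optionally map the Polish slash-l to a plain L, depending on context."""
    if not treat_as_plain_l:
        # Context where the slash-l is a distinct character: leave it alone.
        return text
    # Context where the two characters are treated as the same.
    return text.replace("\u0141", "L").replace("\u0142", "l")

print(fold_l_with_stroke("Łódź", treat_as_plain_l=True))   # Lódź
print(fold_l_with_stroke("Łódź", treat_as_plain_l=False))  # Łódź
```

Only the slash-l is affected here; the other diacritics in the example are left for whatever general rule the system applies.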
The ongoing implementation of Unicode™ adds further complexity.
Existing normalization standards (such as the NACO standard) typically
treat only the core portion of the MARC-8 character set (i.e.,
omitting Arabic, Chinese, Cyrillic, Greek, Hebrew, Japanese and
Korean characters). Designers of systems that need to handle the
full MARC repertoire (MARC-8, UTF-8, or eventually full Unicode)
have been left to their own best lights, with predictable variations
in application.
Charge:
The PCC Policy Committee charges the Standing Committee on
Automation Task Group on Normalization to investigate the normalization
issue in all of its aspects. The outcomes of the group's work will be:
1. An identification of the various purposes for which normalization
has been or may be used in library systems and the kind of normalization
appropriate to each.
2. A detailed normalization scheme (intended to supplant the
existing NACO scheme) for the handling of the extended Latin character
set, together with a description of the work required on the part
of library system vendors to implement it.
3. An extension of the normalization scheme for the extended
Latin character set to Arabic, Cyrillic, Greek and Hebrew characters,
together with a description of the work required on the part of
library system vendors to implement it.
4. Principles for the extension of the normalization scheme
to other alphabetic scripts.
5. Principles for the extension of the normalization scheme
to other scripts.
While recognizing the need for different normalized forms in
different contexts, the Task Group will limit its recommendations
to normalized forms to be employed by systems used by institutions
building their catalogs under the English version of the Anglo-American
cataloging rules.
The Task Group will limit its recommendations to the left-anchored
access fields in authority and bibliographic records. (The Task
Group will consider all bibliographic access fields--those under
authority control, and those not under authority control.) The
Task Group is not asked to consider the normalization of call numbers,
or normalization used to arrange items by date. The Task Group
is asked to update the recommendations of the Task Group on Series
Numbering (2) as necessary.
Time frame:
A draft final report is due no later than May 2006, with the
final report to follow no later than Dec. 2006.
Members and e-mail addresses:
Notes:
1. Available at: http://www.loc.gov/catdir/pcc/naco/normrule.html
2. Available at: http://www.loc.gov/catdir/pcc/tgsernum02_rpt.pdf