DATE: May 1, 1997

NAME: Addition of new characters to existing USMARC sets

SOURCE: Research Libraries Group

SUMMARY: This proposal suggests defining three new characters in the existing USMARC character set for the basic Arabic script

KEYWORDS: Arabic script; Character sets



5/1/97 - Forwarded to USMARC Advisory Group for discussion at the June 1997 MARBI meetings.

6/28/97 - Results of USMARC Advisory Group discussion - Approved.

8/21/97 - Result of final LC review - Approved.

PROPOSAL NO. 97-14:  Addition of new characters to existing USMARC sets


Several non-Latin character sets have been developed for use in
USMARC records since the late 1970's.  All USMARC character sets
are based on standards, when available.  An attempt has always
been made to keep the USMARC character sets in sync with other
standards when possible.  This proposal recommends the addition
of three Arabic script characters to the USMARC Basic Arabic set
to synchronize it with several standards as well as the Arabic
implementations of USMARC users.


The USMARC character sets for the Arabic script were developed in
the late 1980's and early 1990's to support cataloging in the
vernacular for Arabic and Persian (Farsi) language materials. 
The basic and extended Arabic script character sets provide
sufficient characters to support vernacular cataloging in other
languages as well, including Kashmiri, Kurdish, Moplah, Turkish
(Ottoman period), Pushto, Sindhi, Uighur, and Urdu.  The basic
USMARC Arabic script set was based on two standards, ASMO
Standard Specification 449 (an Arab standard), and ISO 9036
(Information Processing--Arabic 7-bit Coded Character Set for
Information Interchange).  The extended USMARC Arabic script set
was developed at the same time as ISO 11822 (Information and
documentation--Extension of the Arabic Alphabet Coded Character
Set for Bibliographic Information Interchange).  The USMARC and
ISO extended Arabic script sets are completely in sync.

Due to slightly different requirements for bibliographic
applications, the USMARC basic Arabic set has some differences
from ISO 9036 and ASMO 449.  The most noteworthy difference is
the inclusion of Arabic style digits 0 through 9 in the USMARC
set rather than the Indic style digits.  The USMARC basic Arabic
script set also includes two additional characters, defined in
character code positions that are unassigned in ISO 9036 and ASMO
449.  The extra characters correspond to letters that sometimes
appear in cataloging data.

During the recent work of the MARBI Character Set Subcommittee,
which has been developing a mapping of the existing USMARC
character sets to ISO 10646 (Information Technology--Universal
Multiple-Octet Coded Character Set (UCS)), discrepancies between
the published USMARC Arabic set and USMARC implementations of
that set were discovered.  The Research Libraries Group (RLG),
which has worked closely with MARBI and the Library of Congress
in the development and implementation of non-Latin character
sets, implemented the Arabic sets in November 1991.  Their
implementation has been the source of the largest number of
Arabic vernacular cataloging records in the U.S.  Differences
between RLG's Arabic implementation and the USMARC sets are
considered important because of the number of vernacular Arabic
script cataloging records they maintain.  The MARBI Character Set
Subcommittee came to the conclusion that the three extra Arabic
characters in the RLG Arabic implementation should be added to
the USMARC set.  Mappings to the universal coded character set
have already been determined and will be added to the mapping

The basic Arabic characters in question are the following (UCS
character names have been used):

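    ARABIC THOUSANDS SEPARATOR: USMARC character code '78' (maps
        to U+066C)
    RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK: USMARC character
        code '79' (maps to U+00BB)
    LEFT-POINTING DOUBLE ANGLE QUOTATION MARK: USMARC character
        code '7A' (maps to U+00AB)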
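
The three code assignments listed above can be written as a small
lookup table.  The following Python sketch is illustrative only:
the names NEW_BASIC_ARABIC, to_ucs, and describe are hypothetical
and are not part of any USMARC specification or implementation,
and the character names are looked up in Python's own Unicode
database rather than taken from this proposal.

```python
import unicodedata

# Hypothetical table: USMARC Basic Arabic code (hex) -> UCS code point.
# Only the three characters discussed in this proposal are included.
NEW_BASIC_ARABIC = {
    0x78: 0x066C,  # ARABIC THOUSANDS SEPARATOR
    0x79: 0x00BB,  # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    0x7A: 0x00AB,  # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
}


def to_ucs(code: int) -> str:
    """Return the UCS character for one of the three new codes.

    Raises KeyError for any code not covered by this sketch.
    """
    return chr(NEW_BASIC_ARABIC[code])


def describe(code: int) -> str:
    """Format one mapping as 'code -> U+XXXX NAME'."""
    ch = to_ucs(code)
    return f"{code:02X} -> U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    for code in sorted(NEW_BASIC_ARABIC):
        print(describe(code))
```

A full converter would also need the rest of the Basic Arabic
set and the handling of character set designation, neither of
which is attempted here.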

Initial justification for including these characters in the
Arabic set was that they were found to occur on Arabic script
title pages.  For example, the Arabic thousands separator occurs
in the classic title "1001 Nights".  The characters are also
included in the DOS and Windows code pages for the Arabic script,
upon which most Arabic implementations are based.  Other USMARC
implementers such as VTLS and Innovative Interfaces use the
Arabic version of Windows as the platform for their Arabic
systems.  These characters are also, of course, in the universal
coded character set, which was based on independent assessments of
the existence and usefulness of characters for a variety of

It is unclear how many occurrences of these characters can be
found in existing USMARC records.  (Statistics of this sort are
not available for all USMARC databases.)   Considering the number
of years that USMARC Arabic implementations have existed, and the
presence of these characters in user documentation, it is likely
that they have been used and will need to be handled in
conversion someday to universal character encodings.  It is best
to add these missing characters to the USMARC sets now.  It is
important to note that the addition of characters to the USMARC
Arabic set will not affect most USMARC users.  Only a small
number of USMARC implementations support the Arabic sets.  It is
suspected that those implementations already support the
characters suggested in this proposal.


The following is presented for consideration:

    -   Define the following three new characters in the
        existing USMARC Basic Arabic set:
             ARABIC THOUSANDS SEPARATOR: USMARC character code
                 '78' (maps to U+066C)
             RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK: USMARC
                 character code '79' (maps to U+00BB)
             LEFT-POINTING DOUBLE ANGLE QUOTATION MARK: USMARC
                 character code '7A' (maps to U+00AB)

Library of Congress
Library of Congress Help Desk (09/02/98)