The Library of Congress >> Especially for Librarians and Archivists >> Standards
HOME >> MARC Development >> Proposals List
DATE: May 31, 2006
NAME: Lossless technique for conversion of Unicode to MARC-8
SOURCE: Unicode-MARC Forum and MARC advisory committee
SUMMARY: This paper specifies a lossless technique utilizing Numeric Character References for converting unmappable characters when going from Unicode to MARC-8 for systems that cannot handle Unicode encoding. It is intended to be an alternative to the lossy technique approved in 2006-04. The MARC advisory committee recommended that both a lossy and a lossless technique be officially adopted.
KEYWORDS: Unicode (all formats); MARC-8 (all formats); character sets; UCS/Unicode
RELATED: Assessment of Options for Handling Full Unicode Character Encodings in MARC 21 - Part 1: New Scripts (January 2004); Assessment of Options for Handling Full Unicode Character Encodings in MARC 21 - Part 2: Issues (June 2005); 2006-04 (January 2006)
5/31/06 - Made available to the MARC 21 community for discussion
06/24/06 - Results of the MARC Advisory Committee discussion - Approved. Some editorial changes will be made to the documentation, such as declaring the length of the Unicode code point that should be used and whether the alphabetic characters used as hexadecimal digits are to be upper or lowercase.
10/12/06 - Results of LC/LAC/BL review - Approved
One major issue that needed resolution for the adoption of full Unicode was how the mapping from Unicode with 95,000+ characters to MARC-8 with 16,000+ was to be handled. From August to October 2005 the Unicode-MARC Forum (discussion list) visited that issue. The discussion focused on three main techniques before reaching a consensus that was based on input from the systems that would be most affected by implementing a standard method. The discussions and conclusions from the Unicode-MARC Forum were summarized in 2006-04, and are repeated here only when they pertain specifically to a lossless technique. (The archive of all the messages is available via the MARC home page, www.loc.gov/marc/: under General Information, click on Unicode-MARC Forum.)
In Proposal 2006-04, two classes of techniques were considered: "lossy," where data would be converted to a substitute that does not support recovery of the original data, and "lossless," where data would be carried over into the MARC-8 environment in a coded form and could therefore be recovered on reconversion to Unicode. When 2006-04 was discussed at MARBI, the lossy technique recommended in the Unicode-MARC Forum was approved, but there was also a recommendation that a lossless technique be standardized for situations in which the receiver of a record would prefer to have the unmappable character identified so that it could be used in the future, for display or for round-tripping. Both the lossy and lossless methods will have official status, and systems can choose to implement either or both, as their needs and the needs of their users dictate.
The technique approved in January 2006 was the use of a placeholder character to substitute for any character not mappable to MARC-8. This would produce records that could not be converted back to Unicode encoding without loss. It would be relatively simple and cheap to implement by both the receiver and the source. The placeholder character that was approved is the vertical bar (ASCII hex 7C) rather than a new character since it is already implemented in most MARC systems.
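The placeholder substitution described above can be sketched in a few lines. This is an illustration only, not the approved implementation: the function names are hypothetical, and the `is_mappable` test is a stand-in (it treats only ASCII as mappable), since a real converter would consult the full Unicode to MARC-8 mapping tables.

```python
PLACEHOLDER = "|"  # vertical bar, ASCII hex 7C


def ascii_only(ch):
    """Hypothetical stand-in for a real MARC-8 mappability test."""
    return ord(ch) < 0x80


def convert_lossy(text, is_mappable=ascii_only):
    """Replace every character that cannot be mapped with the placeholder."""
    return "".join(ch if is_mappable(ch) else PLACEHOLDER for ch in text)


if __name__ == "__main__":
    # U+2603 (SNOWMAN) is outside the stand-in repertoire, so it becomes "|".
    print(convert_lossy("Snow \u2603 man"))
```

Once the bar has been written, nothing in the record says which character it replaced, which is why this route cannot be reversed.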
Proposal 2006-04 also considered preprocessing that might be employed prior to conversion to MARC-8. At the least, decomposition of most of the precomposed Unicode characters would be desirable, since MARC-8 has very few precomposed forms. Other normalizations could also be employed to enable more characters to convert to MARC-8. One approach would be to use the Unicode algorithm NFKD, which both decomposes and converts some characters to compatible equivalents. The library community would need to develop an appropriate exceptions list. NFKD has the advantage of being maintained by the Unicode Consortium and being incorporated into products because its specifications are downloadable from the Unicode site.
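To make the difference between plain decomposition and NFKD concrete, the following sketch uses Python's standard `unicodedata` module. The sample characters are chosen for illustration and are not taken from the proposal; the helper name `preprocess` is hypothetical, and no exceptions list is applied.

```python
import unicodedata


def preprocess(text, compatibility=True):
    """Normalize text before conversion: NFKD if compatibility is True, else NFD."""
    form = "NFKD" if compatibility else "NFD"
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    samples = {
        "e with acute": "\u00e9",    # precomposed letter
        "fi ligature": "\ufb01",     # compatibility character
        "circled digit 1": "\u2460", # compatibility character
    }
    for label, ch in samples.items():
        nfd = preprocess(ch, compatibility=False)
        nfkd = preprocess(ch, compatibility=True)
        print(label, [hex(ord(c)) for c in nfd], [hex(ord(c)) for c in nfkd])
```

Both forms split the precomposed letter into a base letter plus a combining accent, but only NFKD rewrites the ligature as "fi" and the circled digit as "1". Those compatibility replacements cannot be undone afterwards, since the decomposed text no longer records that a ligature or circled form was present.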
There was agreement that the goal of preprocessing Unicode is to maximize mapping to MARC-8, not merely to decompose characters. OCLC and RLG have both been doing Unicode record processing long enough to have developed similar translation lists to increase Unicode to MARC-8 mapping. As a first step, after reconciling their approaches, OCLC and RLG intend to give the resulting translation list to LC to coordinate review by the MARC 21 community. While converting compatible equivalents may be useful, especially for the lossy technique, a much stricter translation may be desirable when using the lossless technique, since many compatible equivalents would not be reversible.
The leading lossless technique favored in the Unicode-MARC Discussion Forum was the use of a Numeric Character Reference (NCR) to substitute for each Unicode character that cannot be mapped to MARC-8. This technique preserves the Unicode value information, so the character can be restored when the record is reconverted to Unicode. In some situations display software might also be able to take the reference and display the correct Unicode character even when the basic system is MARC-8.
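An NCR for one character has the following structure:
&#xXXXX;
that is, an ampersand, a number sign, the letter x, the hexadecimal value of the character's Unicode code point (XXXX), and a closing semicolon.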
Some of the discussion of this technique pointed out that each unmappable character could become as many as 8 ASCII characters, increasing the length of a field and record. While this could logically be a problem with maximum field (9999 characters) and record (99999 characters) lengths, it was considered not likely to cause serious problems. An alternative decimal specification for representation of the character's code point was briefly considered but the discussion favored using only one type of NCR, the hexadecimal one. It was noted that this technique is easy for the source to implement as there is a 1-1 substitution, in the sense of 1 Unicode character to 1 NCR.
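The length concern can be checked with simple arithmetic. The sketch below assumes the 8-character form of the NCR (four hexadecimal digits) and assumes that the size of the already mapped part of the field is known in MARC-8 characters; the function names are hypothetical, and only the 9999 limit comes from the text above.

```python
MAX_FIELD_LENGTH = 9999  # maximum field length cited in the discussion
NCR_LENGTH = 8           # "&#x" + 4 hex digits + ";" (assumed 4-digit form)


def field_length_with_ncrs(mapped_length, unmappable_count):
    """Length of a field after each unmappable character becomes one NCR."""
    return mapped_length + unmappable_count * NCR_LENGTH


def fits_in_field(mapped_length, unmappable_count):
    """True if the converted field stays within the maximum field length."""
    return field_length_with_ncrs(mapped_length, unmappable_count) <= MAX_FIELD_LENGTH


if __name__ == "__main__":
    # A 9,900-character field with 20 unmappable characters grows by 160.
    print(field_length_with_ncrs(9900, 20), fits_in_field(9900, 20))
```

A field overflows only when it is already close to the limit and contains many unmappable characters, which is consistent with the view that serious problems are unlikely.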
Adopt a standard lossless technique for the distribution of records to MARC-8 systems from Unicode systems, with the following specifications.
*** For each Unicode character that cannot be converted to MARC-8, substitute a Numeric Character Reference constructed as follows:
&#xXXXX;
where XXXX is the Unicode character code point in hexadecimal.
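A minimal round-trip sketch of the substitution follows. Several details are assumptions rather than part of the specification as stated here: hexadecimal digits are written in uppercase and padded to at least four, code points above FFFF are written with as many digits as they need, and the `is_mappable` test is a hypothetical stand-in that treats only ASCII as mappable. All function names are illustrative.

```python
import re

# Accepts 4 to 6 hex digits in either case (an assumption, see lead-in).
NCR_PATTERN = re.compile(r"&#x([0-9A-Fa-f]{4,6});")


def ascii_only(ch):
    """Hypothetical stand-in for a real MARC-8 mappability test."""
    return ord(ch) < 0x80


def to_ncr(ch):
    """Build the NCR for one character, e.g. U+2603 -> '&#x2603;'."""
    return "&#x{:04X};".format(ord(ch))


def encode_lossless(text, is_mappable=ascii_only):
    """Substitute one NCR for each character that cannot be mapped."""
    return "".join(ch if is_mappable(ch) else to_ncr(ch) for ch in text)


def _restore(match):
    value = int(match.group(1), 16)
    # Leave the text untouched if the value is not a valid code point.
    return chr(value) if value <= 0x10FFFF else match.group(0)


def decode_ncrs(text):
    """Turn NCRs back into the Unicode characters they stand for."""
    return NCR_PATTERN.sub(_restore, text)


if __name__ == "__main__":
    original = "Snow \u2603 man"
    encoded = encode_lossless(original)
    print(encoded)
    print(decode_ncrs(encoded) == original)
```

This sketch does not distinguish an NCR produced by conversion from text that already contained the same sequence of ASCII characters in the source record; how to treat that case is outside what is shown here.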
( 12/21/2010 )