NAME: Non-filing characters
SOURCE: USMARC electronic list
SUMMARY: This discussion paper presents problems and solutions for dealing with non-filing characters associated with variable field data in USMARC records.
KEYWORDS: Non-filing characters; Field 245, 2nd indicator (Bibliographic); Field 246 (Bibliographic)
5/1/97 - Forwarded to USMARC Advisory Group for discussion at the June 1997 MARBI meetings.
6/30/97 - Results of USMARC Advisory Group discussion - There were different preferences for the technique to be used -- subfields, graphic characters, and control characters. Subfields would be easy to implement; graphic characters would be difficult to identify; and control characters are theoretically desirable but have some system drawbacks. There was a preference for two distinct characters, to be used before and after the non-filing part.
Several participants pointed out that the function in question is non-filing, not non-indexing. For example, the English word "the" might not be indexed anywhere in a string, but the characters being identified here are only those that occur at the beginning of a string. There was consensus that the discussion continue at Midwinter.
DISCUSSION PAPER NO. 102: Non-filing characters

1. BACKGROUND

A technique using one of the indicator positions is currently provided in USMARC for dealing with non-filing characters that appear at the beginning of certain variable fields. Generally, the handling of non-filing characters by an indicator value works well in those fields for which it is defined, but such an indicator is not defined for all fields where initial articles and other non-filing characters might occur. The use of both available indicator positions in many variable fields prevents the extension of this technique to all fields where it is needed.

Stimulated by a December 1996 message to the USMARC list from a USMARC user in Israel, this problem has come to the forefront, particularly with regard to the wider use of field 246 (Variant Title) in USMARC following Format Integration. Titles recorded in field 246, like those in field 245 (Title Statement), sometimes have initial articles. Field 246 does not have a non-filing indicator and both indicator positions are already defined for other uses.

This discussion paper presents the problems and issues surrounding the handling of non-filing characters in MARC records. It describes techniques suggested for dealing with initial articles, with the advantages and disadvantages of each. This paper is intended to foster discussion and lead to a solution to the problem that guarantees the least negative impact on USMARC systems and users.

2. DISCUSSION

This paper deals with non-filing characters that appear at the beginning of cataloging data in access fields. The current USMARC technique for identifying non-filing characters retained in records involves the use of an indicator position that carries a digit (0 through 9) representing the number of characters to be ignored.
In the USMARC Format for Bibliographic Data, a non-filing indicator is defined in the following eleven fields:

   130 Main Entry--Uniform Title
   222 Key Title
   240 Uniform Title
   242 Translation of Title by Cataloging Agency
   243 Collective Uniform Title
   245 Title Statement
   440 Series Statement/Added Entry--Title
   630 Subject Added Entry--Uniform Title
   730 Added Entry--Uniform Title
   740 Added Entry--Uncontrolled Related/Analytical Title
   830 Series Added Entry--Uniform Title

A similar indicator was defined for the X00 (Personal Name), X10 (Corporate Name), X11 (Meeting Name), and X30 (Uniform Title) fields in the USMARC Format for Authority Data. The indicator was made obsolete in 1993 for all except the X30 fields. The change was made in the USMARC Authority format because the X00, X10, and X11 fields in the bibliographic format did not have corresponding non-filing indicators. For library systems with integrated authority control, authority format indicators with no bibliographic equivalents served no practical use. Since it was not possible to add the indicator to the X00, X10, and X11 fields in the Bibliographic format, it was made obsolete in the Authority format in those fields.

MARC records are created with data elements to support the processing of the information in a variety of ways. MARC records are processed to create printed output products (e.g., catalog cards, book catalogs, and COM catalogs), and for online applications. Online applications center on the indexing of certain fields to provide access to records using predetermined search keys. Search keys provide access through titles, named persons and corporate bodies, subject terms, classification, and standard numbers. It is access points for titles and names that sometimes include parts of speech or other character strings which are not always significant for output and retrieval.

3. INITIAL ARTICLES

The most common non-filing characters in MARC data are initial definite and indefinite articles, "the" and "a"/"an" in English and their foreign language counterparts. Non-filing characters can include other character strings that are to be ignored in processing.

Articles play an important role in many languages but are often dropped or ignored in processes such as filing. Articles are almost universally ignored in sorting and filing when they appear at the beginning of a name or title because they tend to be used intermittently. Titles and names may be found with or without an initial article. For example, the political leader Anwar al-Sadat is usually listed by a surname that omits the initial Arabic article "al-". Likewise, titles that might be spoken or written with an article (for example: The Meaning of Life) are almost always listed without the definite article "the".

Not all languages possess parts of speech such as articles (all Slavic languages except Bulgarian lack articles), or the articles associated with the first word of a title may be enclitic (for example, articles in Bulgarian and Romanian are appended to the end of a word). For languages with independent, initial articles, their use can be very important grammatically. German, for example, expresses grammatical case through a variety of initial definite and indefinite articles. Arabic and Hebrew use initial definite articles with both nouns and adjectives. Articles are used less often in English but still play an important role in grammar. For example, English speakers use initial articles with personal names only when the names are applied to inanimate objects (for example: "The Henry" would be grammatically correct if referring to a hotel or ship). In German it is grammatically possible to say "der Heinrich" (that is, "the Henry"), even when referring to a living person.
It suffices to say that articles are important enough that many cataloging rules allow them to be included in bibliographic data in MARC records.

4. OTHER NON-FILING CHARACTERS

Initial articles are not the only non-filing characters that might appear at the beginning of cataloging data. It is common for special marks to occur at the beginning of access points, particularly titles. An opening quotation mark is perhaps the most common non-filing character to be found at the beginning of titles. For some languages, other marks can occur. For example, in Spanish the inverted question mark and inverted exclamation mark occur at the beginning of phrases that also end in the regular question mark ("?") and exclamation point ("!"). Other non-filing characters found in MARC data include the opening square bracket ("[", signifying a cataloger-supplied title), the opening parenthesis ("("), as well as initial periods ("...") or dashes ("--") used to replace them.

Alphanumeric characters that are not articles can also be ignored in some cases. In MeSH (Medical Subject Headings), for example, names of chemical compounds that include prefixed letters or numbers are sorted and filed ignoring the prefixes. If not handled in some way, these characters can affect the proper placement of names, titles, and descriptors in alphabetic indexes.

Examples of Other Nonfiling Characters

   ...and then I said [Book title]
   [inverted exclamation mark]Baile comigo! [Song title]
   [inverted question mark]Quien es quien en el Peru? [Book title]
   16,16-Dimethylprostaglandin E2 [Subject descriptor for a chemical compound]
   N,N-Dimethyltryptamine [Subject descriptor for a chemical compound]

5. HANDLING OF NON-FILING CHARACTERS

Use of Indicators

The current USMARC solution for dealing with non-filing characters has been described briefly already. It makes use of an indicator position to signal the number of initial characters in a field to be ignored in processing.
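
The indicator technique just described can be sketched in a few lines. The Python fragment below is only an illustration, not part of USMARC: the function name `filing_form` and the decision to treat a blank or non-digit indicator as zero are assumptions made for the example.

```python
def filing_form(indicator: str, data: str) -> str:
    """Return the field data with the leading non-filing characters skipped.

    `indicator` is the single-character non-filing indicator value.
    Treating anything other than the digits 0-9 as 0 is an assumption
    made for this sketch.
    """
    if len(indicator) == 1 and indicator in "0123456789":
        skip = int(indicator)
    else:
        skip = 0
    return data[skip:]


# "The " is four characters long (three letters plus the space).
print(filing_form("4", "The meaning of life"))  # meaning of life
# A value of 0 means nothing is skipped.
print(filing_form("0", "Hamlet"))               # Hamlet
```

Because the count is carried outside the data, the title string itself stays untouched, but a single digit cannot describe more than nine leading characters.
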
This technique has these advantages:

- Creator of the data can decide on the number of characters to be skipped in filing.
- The data itself in the first indexed subfield is not polluted with extraneous graphic or control characters.

Disadvantages to this solution include:

- Some variable fields do not have an available position to use for a non-filing indicator. This is why field 246, which has no available indicator position, is used so often as an example of the problem.
- As currently defined, nine (9) is the largest value possible in any of the non-filing indicators. (Note: higher values could be coded if alphabetic characters were allowed as indicator values, and these were assigned decimal values).
- Use of an indicator cannot identify characters to be ignored in other parts of a field, for example, at the end of words.

Use of Graphic Characters as Delimiters

Other solutions have been suggested or used in other formats. For example, it has been reported that the unused graphic character SPACING UNDERSCORE ("_") is used in some German systems to set off non-filing characters wherever they appear. The advantages to this are:

- The character is available in most computer systems, and
- All non-filing characters can be easily delimited.

Disadvantages include:

- Regular cataloging data is polluted with additional graphic characters which must be omitted in printed output and displays.
- The SPACING UNDERSCORE character is now found as part of legitimate cataloging data, particularly in Internet addresses and file names, which makes its exclusive use as a non-filing character delimiter questionable.

Use of Special Control Characters

A pair of special control characters, such as the NON-SORTING CHARACTER(S), BEGIN and NON-SORTING CHARACTER(S), END characters defined in ISO 6630 (Bibliographic control set), could be used as delimiters for strings of non-filing characters.
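
As a rough illustration of the paired-delimiter idea, the Python sketch below removes each delimited span when building a filing form and removes only the delimiters when building a display form. The names `strip_nonsort` and `strip_delimiters` are hypothetical, and the code values 0x88 and 0x89 used for the two delimiters are assumptions for the example; this paper does not specify them.

```python
import re

# Assumed delimiter values, for illustration only.
NSB = "\x88"  # NON-SORTING CHARACTER(S), BEGIN (assumed value)
NSE = "\x89"  # NON-SORTING CHARACTER(S), END (assumed value)

# Matches one delimited span: BEGIN, the shortest run of text, END.
_NONSORT_SPAN = re.compile(re.escape(NSB) + ".*?" + re.escape(NSE), re.DOTALL)
# Translation table that deletes the two delimiter characters.
_DELIMITERS = str.maketrans("", "", NSB + NSE)


def strip_nonsort(data: str) -> str:
    """Filing form: drop every delimited span, delimiters included."""
    return _NONSORT_SPAN.sub("", data)


def strip_delimiters(data: str) -> str:
    """Display form: keep all the text, drop only the delimiters."""
    return data.translate(_DELIMITERS)


title = NSB + "The " + NSE + "meaning of life"
print(strip_nonsort(title))     # meaning of life
print(strip_delimiters(title))  # The meaning of life
```

The same pattern would work with a graphic delimiter such as the underscore, subject to the drawback already noted that the character may also occur in legitimate data. Unlike an indicator count, the delimited span may sit anywhere in the string and may be of any length.
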
Use of such characters has the following advantages:

- As specially-defined control characters, they are unique and do not conflict with graphic characters that might occur in data.
- The control characters can be used anywhere, initially, medially, and finally. This allows the demarcation of initial articles in subfields, which could be useful in subfield $t of the USMARC linking entry (76X-78X) fields and elsewhere.

Disadvantages of the control character solution include:

- Special characters require system implementation that affects hardware, software, and existing data.
- Cataloging data include special control characters which must be handled in printed output and displays.
- They are not mappable to universal character set encodings.

6. OTHER SOLUTIONS

System Recognition of Articles

A commonly-suggested solution to dealing with non-filing initial articles is to program library systems to recognize grammatical articles automatically. In theory this solution sounds attractive. Machine handling of articles, it is suggested, would not be subject to human error. Unfortunately, in practice it is very difficult for a computer to identify initial articles. Character strings such as "the" and "a"/"an", which are certainly English articles, are also legitimate non-article words in other languages. For example, in French, "the" means "tea", "a" means "to", and "an" means "year". If these strings occurred initially in a French title they should be filed upon. The language coding in USMARC records is not designed to control computer handling of initial articles in access fields. The variety of languages which might be represented in a single record, both in the description and access points, makes machine determination of articles based on language coding impractical.

Many systems already deal programmatically with non-filing characters to a limited degree. Special marks (for example, quotation marks) are already ignored in sorting, indexing, and retrieval.
The case (i.e., upper or lower) of alphabetic characters is also ignored, as are associated diacritical marks, which in some cases are counted as "non-filing characters".

Subfield for Articles

It has also been suggested that a special subfield could be defined in USMARC for non-filing characters. Subfield $i is often recommended. Unfortunately, subfield $i is already defined for other data in several variable fields. Subfield $1 (one) is the only subfield currently undefined in all variable fields, but there would likely be considerable opposition to using a control subfield at this level for initial articles. The implementation of a subfield-level data element for non-filing characters would also have the disadvantage of separating pieces of titles and names that belong together for other processing.

Omission of Articles

One of the most widely used solutions for dealing with initial articles has been to omit them from cataloging data altogether. This solution has been particularly widespread in the treatment of initial articles associated with personal names, so much so that the non-filing indicator for personal, corporate, and meeting names was made obsolete in USMARC in 1993. Rules for the inclusion or omission of initial articles in other access points vary but have tended to favor omission in recent years. In field 240 (Uniform Title) and field 740 (Added Entry--Uncontrolled Title), although a non-filing indicator is defined, generally it is not used (i.e., it is always set to value 0) and any initial articles are omitted. This solution has been suggested for field 246 as well.

Omitting initial articles simply because they cannot be handled otherwise is not totally acceptable to some USMARC users. European and Middle Eastern libraries have been particularly vocal in their call for a generalizable technique, like the UNIMARC control character technique, for indicating non-filing characters.
Their chief argument has been that the simple omission of articles corrupts the cataloging data grammatically and yields title strings that the public finds unacceptable.

7. QUESTIONS

Whatever the ultimate solution to this problem, if it involves a change to the USMARC formats themselves, current USMARC-based systems, users, and data would have to accommodate the change. Many worry that the impact would be considerable. Some of the questions that have been raised are:

- Would a new technique for dealing with initial articles replace or supplement the existing techniques in USMARC?
- Would existing USMARC records have to be modified to reflect the new technique? (If not, how would new systems deal with old records, and vice versa?)
- How important is dealing with this problem, considering the increasing international use of USMARC?