MARC-SGML and SGML-MARC Conversion Programs Maintenance Guide

1. Introduction

1.1 Organization of this Manual

1.2 User Manual

1.3 Typographic Conventions

2. MARC Description File

2.1 Mapping Leader 06 and 07 Values to Document Types, Format Types, and Record Types

2.2 Field Group Specifications

2.3 Control Field Specifications

3. Character-Entity Map File

4. Files in the Distribution

Appendix A: MARC-to-SGML Conversion Specification

Appendix B: SGML-to-MARC Record Conversion Specification

1. Introduction

The term "MARC DTD" (MAchine Readable Cataloging Document Type Definition) refers to implementations of Standard Generalized Markup Language (SGML). SGML is a technique for representing documents in machine-readable form which was approved as an international standard, ISO 8879 (Information processing - Text and office systems - Standard Generalized Markup Language). It was developed to fill the need for a non-proprietary standard for text encoding so that machine-readable data could be exchanged between dissimilar text encoding environments. SGML is widely used in the publishing industry where documents are created using various computer systems. SGML supports the definition of sets of elements, some of them abstract, that constitute specific document types (for example, journal articles). The MARC DTDs treat machine-readable cataloging records as a distinct type of document. They define all the elements that might constitute a MARC record in parallel with the lists of data elements defined in the five USMARC formats.

The MARC-SGML structure was designed to be an alternate format for the information in MARC structures, with full mappability between the two formats. MARC-SGML was developed because there are some situations in which users find SGML a more appropriate format than MARC. Because some users will prefer MARC structures for one task or process and MARC-SGML for another, it is very helpful to have tools to convert from one to the other as needed.

The Network Development and MARC Standards Office of the Library of Congress has funded development of two conversion programs for converting between MARC records and MARC-SGML. The programs are:

mrc2sgm.pl - MARC-to-SGML converter

sgm2mrc.pl - SGML-to-MARC record converter

1.1 Organization of this Manual

Chapter 1, Introduction - This introduction.

Chapter 2, MARC Description File - Describes the format of the MARC Description File that controls the conversions.

Chapter 3, Character-Entity Map File - Describes the format of the Character-Entity Map File that specifies the conversions from character numbers to SGML entities and from SGML entities to character numbers.

Chapter 4, Files in the Distribution - Describes each of the files in the distribution.

Appendix A, MARC-to-SGML Conversion Specification - A complete copy of the specification for the mrc2sgm.pl program.

Appendix B, SGML-to-MARC Conversion Specification - A complete copy of the specification for the sgm2mrc.pl program.

1.2 User Manual

Additional information about the operation of the two programs plus the complete error message list is contained in the companion User Manual.

1.3 Typographic Conventions

-help	Text that is input to a computer or output by a computer, command line option, SGML attribute name, or program name
file	Example text that, in user input, should be replaced by a correct value or that, in computer output, will be replaced by an actual value
<marcdesc>	SGML element
[dtd_type]	Description of a portion of an SGML element's tag name. In the SGML produced or read by the conversion programs, this portion of the tag name will be replaced by an actual value, e.g. mrcb

2. MARC Description File

The MARC Description File is an SGML file containing information about the makeup of MARC records and of the SGML representation of those MARC records.

The MARC Description File logically divides into three sections:

Mapping of Leader 06 and 07 values to specific document types, format types, and record types;
Field group specifications, one for each document type; and
Control Field specifications.

These sections are described in detail later in this chapter.
The structure of the MARC Description File is shown in the following figure.

The following table lists the elements in the MARC Description File in alphabetical order.

Element Tag Name	Descriptive Name	Description
<control.fields>	Control Fields	Description of the control fields of MARC records
<doctype.selector>	Document Type Selector	Each contains the mapping between a particular Leader 06 and 07 value and its related document type, format type, and record type
<dtd.groups>	DTD Groups	Set of field spans and group names for a grouping of MARC fields. For example, 100 to 199 is mrcb-main-entry
<field.span>	Field Span	Tag name for a grouping of MARC fields (field span) and the start and end field numbers for its range
<key.cluster.group>	Key-Cluster Group	A key value and the subtype and cluster information associated with that value
<ldr.cluster.group>	Leader Cluster Group	A Leader 06 and 07 value and the subtype and Leader cluster information associated with that value
<ldr.to.doctype>	Leader-to-Document Type Map	Container element to hold the <doctype.selector> elements
<ldr>	Leader	Container element to hold the <ldr.cluster.group> elements
<marc.desc>	MARC Description	Root element of the MARC description. Just a container for the other elements.
<marc.tag>	MARC Tag	MARC field tag
<noindsf>	Field Without Indicators or Subfields	Names a field that has no indicators or subfieldsⁱ
<posdef>	Positionally-defined Field	Names a positionally-defined field. Each element contains a <marc.tag> element and a set of <key.cluster.group> elementsⁱⁱ

________________________________

ⁱ Note that the fields without indicators or subfields specified in the MARC Description File are not limited to values specified in valid USMARC.

ⁱⁱ Note that the positionally-defined fields specified in the MARC Description File are not limited to values specified in valid USMARC.

2.1 Mapping Leader 06 and 07 Values to Document Types, Format Types, and Record Types

The <marc.desc> element contains one <ldr.to.doctype> element, which contains one or more <doctype.selector> elements.

<doctype.selector> is an empty element with four attributes:

leader.06.07 - Data in Leader 06 and 07 character positions of the MARC record
doctype - Document type. This must have the same value as the doctype attribute of one of the <dtd.groups> elements (although case is not important).
format.type - Format type
record.type - Record type

The <doctype.selector> elements are used during conversion from MARC-to-SGML to determine the prefixes to be used when creating SGML element names. When the data in Leader 06 and 07 character positions matches the value of the leader.06.07 attribute, the remaining attributes specify the document type (mrca or mrcb), the format type (e.g. bd, ci, or hd), and the record type (e.g. bk, mu, mx, or vm) of the generated element names.

An example portion of a complete <ldr.to.doctype> element is shown below:

<ldr.to.doctype>
<doctype.selector leader.06.07="aa" doctype="mrcb" format.type="bd" record.type="bk">
<doctype.selector leader.06.07="ac" doctype="mrcb" format.type="bd" record.type="bk">
<doctype.selector leader.06.07="ad" doctype="mrcb" format.type="bd" record.type="bk">
<doctype.selector leader.06.07="am" doctype="mrcb" format.type="bd" record.type="bk">
</ldr.to.doctype>

When converting from MARC to SGML, the data in Leader character positions 06 and 07 is extracted from each MARC record and used as the key for selecting which <doctype.selector> element applies to that record. If the Leader data matches the value of a leader.06.07 attribute, then the values of that element's doctype, format.type, and record.type attributes are used when generating tags for the SGML version of the MARC record. Using the example above, when the data in Leader character positions 06 and 07 is aa, it matches the first <doctype.selector> element, so the generated top-level tag is <mrcb>, and the element generated for the Leader is <mrcbldr-bd>, etc. If there is no matching leader.06.07 attribute value, an error is signaled.

This information is not used when converting from SGML to MARC records.
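
The selection step described above can be sketched in a few lines. This is only an illustration in Python, not the code of mrc2sgm.pl; the dictionary layout and the function names are assumptions, and the Leader element name is built by following the <mrcbldr-bd> example rather than a documented rule.

```python
# Hypothetical in-memory form of the <doctype.selector> elements shown above,
# keyed by the value of the leader.06.07 attribute.
DOCTYPE_SELECTORS = {
    "aa": {"doctype": "mrcb", "format_type": "bd", "record_type": "bk"},
    "ac": {"doctype": "mrcb", "format_type": "bd", "record_type": "bk"},
    "ad": {"doctype": "mrcb", "format_type": "bd", "record_type": "bk"},
    "am": {"doctype": "mrcb", "format_type": "bd", "record_type": "bk"},
}


def select_doctype(leader):
    """Return the selector attributes for a 24-character Leader string."""
    key = leader[6:8]  # Leader character positions 06 and 07
    if key not in DOCTYPE_SELECTORS:
        # The guide says an error is signaled when nothing matches.
        raise ValueError("no <doctype.selector> matches Leader 06-07 value %r" % key)
    return DOCTYPE_SELECTORS[key]


def leader_element_name(selector):
    """Build the Leader element name, following the <mrcbldr-bd> example."""
    return "%sldr-%s" % (selector["doctype"], selector["format_type"])
```

For a Leader such as "00714cam a2200205 a 4500", positions 06 and 07 hold "am", so the sketch selects the mrcb document type.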

2.2 Field Group Specifications

To avoid a completely flat hierarchy, the MARC DTDs divide the MARC tags into named ranges, and each <dtd.groups> element specifies these ranges for one of the MARC DTDs. Each named range provides a grouping element (for example, <mrcb-control-fields>) that acts as a container for a series of MARC fields. Each <field.span> element in the MARC Description File names one of these grouping elements generated in the output SGML and provides a start number and end number for the group's MARC field range.

The <marc.desc> element contains one or more <dtd.groups> elements, which in turn contain one or more <field.span> elements. The <dtd.groups> element has a single doctype attribute, and the value of the doctype attribute of the <doctype.selector> element selected using the Leader 06 and 07 data of a MARC record must have the same value as the doctype attribute of one of the <dtd.groups> elements in the MARC Description File.

The <field.span> element is empty, and it has three attributes:

start - The start field number for the range of MARC fields;
end - The end field number for the range. This attribute should be omitted when the range is a single field, in which case the field number is specified in the start attribute; and
label - The complete element name used to label the field group.

An example usage is shown below.

<dtd.groups doctype="mrcb">
<field.span start="001" end="009" label="mrcb-control-fields">
<field.span start="010" end="099" label="mrcb-numbers-and-codes">
<field.span start="100" end="199" label="mrcb-main-entry">
<field.span start="200" end="259" label="mrcb-title-and-title-related">
<field.span start="260" end="299" label="mrcb-edition-imprint-etc">
<field.span start="300" end="399" label="mrcb-physical-description">
<field.span start="400" end="499" label="mrcb-series-statement">
<field.span start="500" end="599" label="mrcb-notes">
<field.span start="600" end="699" label="mrcb-subject-access">
<field.span start="700" end="759" label="mrcb-added-entry">
<field.span start="760" end="799" label="mrcb-linking-entry">
<field.span start="800" end="849" label="mrcb-series-added-entry">
<field.span start="850" end="851" label="mrcb-holdings-notes">
<field.span start="852" label="mrcb-location">
<field.span start="853" end="855" label="mrcb-captions-and-patterns">
<field.span start="856" end="859" label="mrcb-access">
<field.span start="860" end="865" label="mrcb-enumeration-and-chron">
<field.span start="866" end="868" label="mrcb-textual-holdings">
<field.span start="870" end="875" label="mrcb-variant-names">
<field.span start="876" end="879" label="mrcb-item-information">
<field.span start="880" end="889" label="mrcb-linkages">
<field.span start="900" end="999" label="mrcb-local">
</dtd.groups>

When converting from MARC to SGML, the Leader 06 and 07 values select which document type, and which set of grouping tags, to use, and the appropriate grouping tags are output around the elements output for the fields in the input MARC record. If the MARC record does not contain any fields within the range for one of these groups, then the grouping tag is not output. When converting from SGML to MARC records, the top-most tag for each MARC record indicates the document type, which is used to select which <dtd.groups> element applies, and the conversion program checks that each of the grouping tags in the SGML is defined for the document type. If a grouping tag is not defined in the MARC Description File, an error is signaled.
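
As a rough illustration of the grouping behavior, the following Python sketch looks up the grouping label for each field tag and collects consecutive fields under one label. It is not taken from the conversion programs; the tuple layout and function names are assumptions, and only a few of the spans from the example are included.

```python
from itertools import groupby

# A subset of the <field.span> elements from the example, as
# (start, end, label). A span without an end attribute covers one field.
FIELD_SPANS = [
    (1, 9, "mrcb-control-fields"),
    (10, 99, "mrcb-numbers-and-codes"),
    (100, 199, "mrcb-main-entry"),
    (200, 259, "mrcb-title-and-title-related"),
    (852, 852, "mrcb-location"),
]


def group_label(tag):
    """Return the grouping element name for a field tag, or None."""
    number = int(tag)
    for start, end, label in FIELD_SPANS:
        if start <= number <= end:
            return label
    return None


def group_fields(tags):
    """Collect consecutive tags that share a grouping label.

    Groups with no fields never appear, which mirrors the rule that an
    empty group's tag is not output.
    """
    return [(label, list(members)) for label, members in groupby(tags, key=group_label)]
```

The sketch assumes the tags arrive in increasing order, so each label appears at most once in the result.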

2.3 Control Field Specifications

The <marc.desc> element contains one <control.fields> element, which contains one <ldr> element followed by one or more <noindsf> and <posdef> elements, in any combination. The <ldr> element is used to obtain a description of the logical clustering of the bytes in the Leader field.ⁱⁱⁱ The <noindsf> element is used to name the MARC fields that are not expected to have indicators or subfields.ⁱᵛ The <posdef> element both names the MARC fields that are expected to be positionally-defined and provides a description of the logical clustering of the bytes in each field.ᵛ

_________________________

ⁱⁱⁱ Note that the clustering of Leader character positions specified in the MARC Description File is not limited to values specified in valid USMARC.

ⁱᵛ Note that the fields without indicators or subfields specified in the MARC Description File are not limited to values specified in valid USMARC.

ᵛ Note that the positionally-defined fields and the clustering of character positions specified in the MARC Description File are not limited to values specified in valid USMARC.

2.3.1 Leader

The <ldr> element contains one or more <ldr.cluster.group> elements, which are empty elements with two attributes:

key - Key value used when selecting which cluster arrangement to use. If the value of Leader 06 matches this value, then the data in the clusters attribute determines the division of clusters of characters into separate elements in the output SGML; and
clusters - Definition of clusters of character positions. This is a sequence of space-separated numbers or number ranges. They determine the division of the character positions within the Leader into elements in the output SGML.

An example portion of a <ldr> element is shown below:

<ldr>
<ldr.cluster.group key="a" clusters="05 06 07 08 09 17 18 19">
<ldr.cluster.group key="b" clusters="05 06 07 08 09 17 18 19">
<ldr.cluster.group key="c" clusters="05 06 07 08 09 17 18 19">
<ldr.cluster.group key="e" clusters="05 06 07 08 09 17 18 19">
</ldr>

When converting from MARC records to SGML, the Leader 06 value selects which Leader cluster group to use, and the appropriate tags are output around the data from the clusters indicated by the value of the clusters attribute. This information is not used when converting from SGML to MARC records.

2.3.2 Control Fields

The <noindsf> element contains a <marc.tag> element that contains the tag number of a MARC record field.

The <posdef> element contains a <marc.tag> element followed by one or more <key.cluster.group> elements. The <posdef> element has a key attribute with possible values leader0607 and field00 that indicates what data from the MARC record to use when selecting a cluster group arrangement for the positionally-defined field.

The <key.cluster.group> element is an empty element with three attributes:

key - Key value used when selecting which cluster arrangement to use. If the value of either Leader 06 and 07 or the 00 character position of the current field (as selected by the key attribute of the containing <posdef> element) matches this value, then the data in the clusters attribute determines the division of clusters of characters into separate elements in the output SGML;
subtype - String to be output in the subtype portion of the tag names in the output tag-valid SGML; and
clusters - Definition of clusters of character positions. This is a sequence of space-separated numbers or number ranges. They determine the division of the character positions within the positionally-defined field into elements in the output SGML.

An example portion of a <control.fields> element containing <noindsf> and <posdef> elements is shown below:

<control.fields>
<ldr>
...
</ldr>
<noindsf><marc.tag>001</marc.tag></noindsf>
<noindsf><marc.tag>002</marc.tag></noindsf>
<noindsf><marc.tag>003</marc.tag></noindsf>
<noindsf><marc.tag>004</marc.tag></noindsf>
<noindsf><marc.tag>005</marc.tag></noindsf>
<posdef key="field00"><marc.tag>006</marc.tag>
<key.cluster.group key="a" subtype="bk" clusters="00 01-04 05 06 07-10 11 12 13 14 15 16 17">
<key.cluster.group key="t" subtype="bk" clusters="00 01-04 05 06 07-10 11 12 13 14 15 16 17">
</posdef>
<posdef key="field00"><marc.tag>007</marc.tag>
<key.cluster.group key="a" subtype="a" clusters="00 01 02 03 04 05 06 07">
<key.cluster.group key="c" subtype="c" clusters="00 01 02 03 04 05">
<key.cluster.group key="d" subtype="d" clusters="00 01 02 03 04 05">
</posdef>
</control.fields>

When converting from MARC records to SGML, if a field is specified in a <marc.tag> element, it is treated as a control field rather than as a data field. If it is specified within a <noindsf> element, it is treated as a control field without indicators and subfields, and the field data is output between a start-tag and end-tag for the field. If it is specified within a <posdef> element, it is treated as a positionally-defined field. For positionally-defined fields, either the Leader 06 and 07 value or the 00 value from the current field selects which cluster group to use, and the appropriate tags are output around the data from the clusters indicated by the value of the clusters attribute.
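
To make the meaning of the clusters attribute concrete, here is a minimal Python sketch that parses a clusters value and cuts field data into the corresponding pieces. It is not the logic of the Perl programs; the function names are hypothetical, and it assumes that character positions are zero-based and that a range such as 01-04 includes both ends.

```python
def parse_clusters(spec):
    """Turn '00 01-04 05' into [(0, 0), (1, 4), (5, 5)]."""
    ranges = []
    for token in spec.split():
        first, _, last = token.partition("-")
        start = int(first)
        end = int(last) if last else start  # a single number is a one-position cluster
        ranges.append((start, end))
    return ranges


def split_by_clusters(data, spec):
    """Return (start, end, text) for each cluster of a positionally-defined field."""
    return [(start, end, data[start:end + 1]) for start, end in parse_clusters(spec)]
```

Each returned piece would then be wrapped in its own element in the output SGML; how the element names are derived from the positions and the subtype is outside the scope of this sketch.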

When converting from SGML to MARC records, if an element matches the pattern for a field without indicators or subfields, then it is processed without reference to the information from the MARC Description File, but if it matches the pattern for a positionally-defined field, then the length of the output field data is fixed at the length of the last number in the last cluster specified in the clusters attribute, and any unused character positions in the field are set to the "fill" character.

3. Character-Entity Map File

The Character-Entity Map File is an SGML file containing information about the mapping between character numbers in the MARC Extended Latin character set and SGML entities which are, or should be, defined in the MARC DTDs. The structure of the Character-Entity Map File is shown in the following figure.

[Figure: structure of the Character-Entity Map File (entitymap, character, entity, desc)]

Element Tag Name	Descriptive Name	Description
<character>	Character Information	Information about a single (possibly compound) character, including any entities that are equivalent
<desc>	Entity Description	Description of the entity
<entity>	Entity Name	Entity name
<entitymap>	Character-Entity Map	Map of characters to entities

The <entitymap> element contains one or more <character> elements, which in turn contain one or more pairs of <entity> and <desc> elements.

The <character> element has two attributes:

hex - Hexadecimal representations of one or more character codes, separated by spaces. Although there are two attributes for two representations of the character number, this attribute is the only one used when setting up the character-to-entity or entity-to-character conversion.
dec - Decimal representations of one or more character codes, separated by spaces. This should be the decimal representation of the hexadecimal number in the hex attribute.

The <entity> element contains the text of the entity name for one of the entities corresponding to the character code, and the <desc> contains a description of the entity. Since it is possible that more than one entity may represent a character, the <character> element may contain more than one pair of <entity> and <desc> elements.

An example portion of an <entitymap> element is shown below:

<entitymap>
<character hex="E2 61" dec="226 097">
<entity>aacute</entity><desc>latin small letter a with acute</desc>
</character>
<character hex="E2 41" dec="226 065">
<entity>Aacute</entity><desc>latin capital letter a with acute</desc>
</character>
<character hex="8D" dec="141">
<entity>joiner</entity><desc>joiner control character</desc>
<entity>x8D</entity><desc>hex value 8D</desc>
</character>
</entitymap>

When converting from MARC records to SGML, the character number or sequence of character numbers in the hex attribute of each <character> element is converted to the entity named in the first <entity> element. Compound characters are converted ahead of single characters so, for example, the sequence "E2 41" is converted to "&Aacute;" rather than "&acute;A".

When converting from SGML to MARC records, every entity declared in the Character-Entity Map File is converted to its corresponding character number or sequence of character numbers. If an entity is declared multiple times in the file, only the first definition has any effect.
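
The two rules above (compound characters before single characters, and first definition wins) can be illustrated with a short Python sketch. This is not the code used by the conversion programs. The table layout and function names are assumptions, and the single-byte "acute" entry is a hypothetical addition used only to show why compound sequences must be tried first.

```python
# Entries mirror the <character> elements in the example: the byte
# sequence from the hex attribute, then the entity names in file order.
CHAR_MAP = [
    (bytes.fromhex("E2 61"), ["aacute"]),
    (bytes.fromhex("E2 41"), ["Aacute"]),
    (bytes.fromhex("8D"), ["joiner", "x8D"]),
    (bytes.fromhex("E2"), ["acute"]),  # hypothetical single-character entry
]


def to_entities(data):
    """MARC to SGML: replace mapped byte sequences with entity references."""
    # Try longer (compound) sequences before shorter ones.
    table = sorted(CHAR_MAP, key=lambda item: len(item[0]), reverse=True)
    out = []
    i = 0
    while i < len(data):
        for sequence, entities in table:
            if data.startswith(sequence, i):
                out.append("&" + entities[0] + ";")  # first <entity> is used
                i += len(sequence)
                break
        else:
            out.append(chr(data[i]))  # unmapped byte passes through
            i += 1
    return "".join(out)


def entity_table():
    """SGML to MARC: map every entity name to bytes; the first definition wins."""
    table = {}
    for sequence, entities in CHAR_MAP:
        for name in entities:
            table.setdefault(name, sequence)
    return table
```

With this table, the bytes E2 41 become "&Aacute;", while an E2 followed by an unmapped letter falls back to the hypothetical single-character entry.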

4. Files in the Distribution

mrc2sgm.pl - MARC-to-SGML conversion program in Perl
sgm2mrc.pl - SGML-to-MARC record conversion program in Perl
MANIFEST - List of files in the distribution. Used by the install.me installation program
install.me - Installation program in Perl. See the installation instructions in the user's guide.
sgmlspl.pl - SGML conversion Perl function library
sgmlspl.pm - Object-oriented version of sgmlspl.pl
SGMLS.pm - SGML conversion Perl function library
charconv.sgm - Character conversion Character-Entity Map File selected by the -charconv command line option
register.sgm - Upper-register character to entity conversion Character-Entity Map File selected by the -registerconv command line option
entmap.dtd - Character-Entity Map File DTD
marcdesc.dtd - MARC Description File DTD
marcdesc.sgm - Default MARC Description File
marcdesc.pl - MARC Description File Perl function library
Output.pm - SGML conversion Perl function library
Refs.pm - SGML conversion Perl function library

Appendix A. MARC Record to Tag-valid SGML Conversion Specification

CONVERSION FROM MARC RECORDS TO UNPARSED SGML DATA

1.  FUNCTIONALITY
---------------------------

The program(s) will serve as a generalized facility for converting
from MARC (ISO 2709) records to unvalidated SGML data.  The output
data cannot be considered SGML until it has been parsed by a
validating SGML parser.


2.  INPUT/OUTPUT
--------------

2.1  Input
----------

Input to the conversion utility is a string of one or more
MARC-structured records.  Ideally these will be valid USMARC records,
but other varieties of MARC and locally-extended MARC can also be
converted.

The input must be and the output will be "well-formed", but neither is
guaranteed to be valid.  "Well-formed" means that the MARC input files
must contain MARC records with correctly structured Leader, Directory,
and variable control and data fields, followed by an End-of-Record
mark.  Leader fields 06 and 07 must be present and contain values that
are legal according to the MARC Description File.* With this small
exception, no particular fields in the Leader, control, or data fields
are required nor are particular subfields required within a given
field.  Field and subfield order need not follow the USMARC standard.
In short, the record must be well-structured MARC but need not meet
the full requirements of valid USMARC records.  [*Footnote: Note that
06 and 07 in the Description File can be modified by the person
running the conversion and are thus not limited to values specified in
valid USMARC.]

2.2  Output
-----------


The output will be a datastream of tagged but unvalidated SGML data,
with one tagged element (containing many sub-elements) for each
properly-structured MARC record in the input file.  Each record
element will contain unique sub-elements for its
Leader, each of its variable control fields, variable data fields, and
subfields, in addition to any grouping elements specified.  Ideally,
the resulting tagged element will be a MARC SGML stream that is valid
according to one of the MARC SGML DTDs, but validation is not required
for conversion.

The input must be and the output will be "well-formed", but neither is
guaranteed to be valid.  "Well-formed" means that in the tagged output
all start-tags and end-tags will match up and that all attributes will
be quoted.  (Note: This is not "well-formed" in the XML sense since
XML uses a different syntax for empty tags and requires that all
entities are declared).  Element and attribute names will be
constructed according to the mechanism established in the MARC
Bibliographic and Authority DTDs, and grouping elements will be
constructed according to the MARC Description file.  The specific
elements, any grouping elements, and the relationships between the
elements will not necessarily be those in the current MARC DTDs.
Therefore the output instances are not guaranteed to parse cleanly.
[Note that the rules of SGML 8879 will be followed and therefore a DTD
could be written that was valid for any particular tagged instance.]
In addition, as far as is known, no DTD exists for a collection of
MARC instances, only DTDs for individual instances, so there is no
parser validation for the collected output.

Unless an output filename is specified as a command option, the
program's output will be written to the file "stdout.sgm".

2.3  Control of the Conversion Process
--------------------------------------

The conversion process will be table-driven, with top-level
controlling data coming directly from the conversion operator and
detailed controlling data provided in the (user modifiable) MARC
Description File.  MARC input is verified (but not validated) and
SGML output is constructed according to the MARC Description
File. There is no direct connection to the MARC DTDs except as is
built into this file.


3.  USER INTERFACE
------------------

The conversion utility will be a Perl script executed from the command
line.  Command options may be entered on the command line.  One of the
command-line options will be the name of a command file containing
information equivalent to some or all of the command options.  Options
specified on the command line will override the options specified in
the command file.  Options controlling processing each have a default
value that will be used if not specified either on the command line or
in the command file.

The format of the command line syntax is:

mrc2sgm.pl [-command file]
   [-sgmlconv | -registerconv | -charconv | -userconv file]
   [-log file] [-o file] [-marcdesc file] input-file

where:
-command file
        Read program command options from "file"

-sgmlconv
        Perform minimal, "SGML sanity" character conversion using the
        built-in conversion table

        This is the default character conversion.

-registerconv
        Convert upper-register characters to lower-register characters
        using the built-in conversion table.

        The minimal SGML conversion will also be performed.

-charconv
        Convert characters to entities using the built-in conversion
        table

        The minimal SGML conversion will also be performed.

-userconv file
        Perform character conversion using the user-supplied
        conversion specification in "file"

        An error will be signaled if "file" is not specified or if
        "file" cannot be opened, or if "file" is not a file of the
        correct format.

        The minimal SGML conversion will also be performed.

-log file
        Write the output log to "file".  If this option is not
        specified, the log will be written to "mrc2sgm.log" in the
        current directory.

-o file
        Write the unvalidated SGML output to "file" instead of to
        the default file "stdout.sgm".

-marcdesc file
        Read the MARC Description File named "file" instead of the
        default MARC Description File that the program automatically
        reads on initialization.

input-file
        The name of the input MARC record file
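
The precedence rule described in this section (command line over
command file over built-in defaults) can be sketched as follows.  This
is an illustration in Python only, not the option handling of
mrc2sgm.pl.  The dictionary keys and the function name are
hypothetical, and the command file format is not modeled; the default
values are the ones stated above.

```python
from collections import ChainMap

# Defaults stated in this section: minimal SGML character conversion,
# log written to "mrc2sgm.log", output written to "stdout.sgm".
DEFAULTS = {"conversion": "sgmlconv", "log": "mrc2sgm.log", "o": "stdout.sgm"}


def effective_options(command_line, command_file):
    """Merge options; earlier mappings in the chain take precedence."""
    return dict(ChainMap(command_line, command_file, DEFAULTS))
```

For example, if the command file sets both the log file and the
output file and the command line sets only the output file, the
command-line output file is used together with the command file's log
file.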


4.  ASSUMPTIONS
---------------

4.1  Hardware/Operating Systems
-------------------------------

Any IBM PC or clone that runs DOS, either natively, or under Windows
(3.1x, 95, 97, NT, etc) or OS/2, or any UNIX system for which Perl
and nsgmls are available.

4.2  Software
-------------

The conversion program will be written in Perl 5 (Practical Extraction
and Report Language).  The advantages of Perl are:

 - Perl is free;

 - Perl is available for a wide variety of
   platforms, including DOS, Windows 3.1x, Windows
   95, Windows NT, OS/2, Macintosh, and UNIX;

 - Perl scripts are interpreted, so it is not
   necessary to recompile the conversion script to
   use it on a different platform;

 - There is a free SGML function library that
   simplifies the manipulation of SGML text;

 - Perl is optimized for scanning text;

 - Perl is good with binary data; and

 - Perl can read in external data files, such as a
   user-defined character conversion table, and
   evaluate it as part of the program rather than
   just store it as data.

Being interpreted, Perl does not execute as fast as a compiled
language such as C, but the script should still execute sufficiently
quickly for the satisfaction of most operators.  A large part of the
speed difference between Perl and a compiled program is the startup
time while the script is being interpreted into byte codes, so the
program will be more efficient, per-record, for large data sets.


4.3  File Size
--------------

MARC record files could be 1 MByte and could even be 1 GByte.  The
traditional Perl approach of reading the entire file into memory and
writing it out again will eventually fail for some input.  Instead,
the program will read in a chunk of data, find the first record and
process it, find the next record and process it, and so on until only
a partial record is left (or the end of the file is reached in the
first chunk).  If a partial record is left, the program will append
the next chunk of the file and carry on.
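
The chunked reading strategy can be sketched as a generator.  The
sketch below is in Python rather than Perl and is not the program's
actual code; the function name and the chunk size are assumptions.  It
relies on the End Of Record character (0x1D) described in section
4.4.1.2.

```python
import io

END_OF_RECORD = b"\x1d"


def read_records(stream, chunk_size=65536):
    """Yield complete MARC records from a binary stream, one at a time."""
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last End Of Record mark is complete;
        # whatever follows it is a partial record kept for the next chunk.
        *complete, buffer = buffer.split(END_OF_RECORD)
        for record in complete:
            yield record + END_OF_RECORD
    if buffer:
        raise ValueError("input ends with a partial record")


# Usage with an in-memory stream and a deliberately tiny chunk size.
records = list(read_records(io.BytesIO(b"AAA\x1dBB\x1dC\x1d"), chunk_size=4))
```

Memory use is bounded by the chunk size plus the length of the longest
record, regardless of the size of the input file.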


4.4  Hard-wired Knowledge About MARC Records
--------------------------------------------

4.4.1    MARC Record Start and End

4.4.1.1  MARC records begin with a Leader field
 
4.4.1.2  MARC records end with an End Of Record character (hexadecimal
         1D, or "0x1D" in Perl syntax).

4.4.1.3  There may be blank characters at the beginning of a file but not
         between records or at the end of the file. The blank
         character is " " (0x20).

4.4.2 Leader

4.4.2.1  The Leader is the first 24 characters in the record.

4.4.2.2  The format of the Leader is fixed.

4.4.2.3  The Leader lacks indicators and subfields.

4.4.2.4  The Leader is a positionally-defined field, and Leader
         character positions 06 and 07 are used as the key for much of
         the processing of the MARC record.  The Leader is processed
         similarly to other positionally-defined fields.

4.4.2.5  Descriptions of both Leader character positions 06 and 07 are
         required in the MARC Description File.

4.4.2.6  Content is required for both Leader character positions 06
         and 07 in the MARC input record.

4.4.2.7  The value of Leader positions 06 and 07, concatenated
         together, will be checked against data read from the MARC
         Description File, and an error will be signaled if the Leader
         data does not match a value from the MARC Description File.

4.4.2.8  Leader character positions 06 and 07 will be used to
         determine (from the MARC Description File) the DTD type, the
         record type and the format type to be used in the generated
         SGML.

4.4.2.9  The Leader is followed immediately by the Directory.
 
4.4.3 Directory

4.4.3.1  The Directory begins in the first character position after
         the Leader.

4.4.3.2  The Directory ends with an End Of Field character (0x1E).

4.4.3.3  The length of the Directory (excluding the EOF character) is
         a multiple of 12 characters.  An error will be signaled if
         the length of the Directory (excluding the EOF character) is
         less than 12 characters.  An error will be signaled if the
         length of the Directory is not a multiple of 12 characters.

4.4.3.4  The Directory consists of multiple 12-character entries, each
         of which identifies the tag of a variable field, the length
         of that field (including EOF character), and the starting
         position of the field.

4.4.3.5  The field numbers of entries in the Directory are not
         necessarily in increasing numerical order.  The entries will
         be sorted in increasing numerical order before the fields are
         processed.

4.4.3.6  The Directory entries are used to locate fields within the
         body of the MARC record.

4.4.3.7  The tagged text for the fields in the MARC record will be
         output in the order in which they appear in the Directory.
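
A possible reading of the Directory rules, sketched in Python rather
than the program's Perl.  The split of each 12-character entry into a
three-character tag, four-character length, and five-character
starting position follows the layout given in Appendix B; the function
name parse_directory is an assumption made for this example.

```python
EOF = b"\x1e"    # End Of Field character
LEADER_LEN = 24  # the Leader is the first 24 characters
ENTRY_LEN = 12   # each Directory entry is 12 characters


def parse_directory(record):
    """Return (entries, base), where entries is a sorted list of
    (tag, length, start) and base is the offset just past the
    Directory EOF, from which field positions are measured."""
    end = record.find(EOF, LEADER_LEN)
    if end == -1:
        raise ValueError("Directory has no End Of Field character")
    directory = record[LEADER_LEN:end]
    if len(directory) < ENTRY_LEN or len(directory) % ENTRY_LEN:
        raise ValueError("Directory length is not a multiple of 12")
    entries = []
    for i in range(0, len(directory), ENTRY_LEN):
        entry = directory[i:i + ENTRY_LEN].decode("ascii")
        entries.append((entry[0:3], int(entry[3:7]), int(entry[7:12])))
    # Entries are sorted into increasing numerical order (4.4.3.5).
    entries.sort(key=lambda e: int(e[0]))
    return entries, end + 1
```

A field's data is then record[base + start : base + start + length],
including its trailing EOF character.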

4.4.4 Variable Fields

4.4.4.1  Variable fields may be either control or data fields. 

4.4.4.2  The control and variable data fields follow the Directory EOF
         character.  The starting position of each variable field is
         calculated based on the position immediately following the
         Directory EOF character.

4.4.4.3  An undetermined number of variable control fields may be
         present between the end of the Directory and the start of the
         first data variable field.

4.4.4.4  Each control or variable data field ends in an EOF character.

4.4.4.5  MARC variable fields may be numbered 001 to 999.

4.4.4.6  None of the variable fields are specifically required to be
         present, but there must be at least one variable field
         present in the MARC input record.

4.4.4.7  All variable fields are repeatable, and all subfields in
         variable data fields are repeatable.

4.4.5 Grouping Tags

4.4.5.1  Tagged output for fields will be grouped by field number.
         The grouping tag and the range of field numbers for the group
         are specified in the MARC Description File.

4.4.5.2  The MARC Description File may contain multiple specifications
         for grouping fields.  The specification to use will be chosen
         based upon the values of Leader character positions 06 and
         07.

4.4.5.3  An error will be signaled if a MARC record contains a field
         for which the MARC Description File does not designate a
         grouping tag.

4.4.6 Fields Lacking Both Indicators and Subfields

4.4.6.1  Any field that lacks indicators and subfields (excluding the
         Leader, since it is always present in the MARC record) must
         be explicitly named in the MARC Description File.

4.4.6.2  The tag name generated for fields that lack indicators and
         subfields that are not also positionally-defined fields, will
         be the concatenation of the document type (for example,
         "mrcb"), the minus character ("-"), and the field number (for
         example, "003").

4.4.6.3  A start tag and an end tag are generated for each field that
         lacks indicators and subfields that is not also a
         positionally-defined field.  The contents of the MARC field
         are output as the content of the element.

4.4.7 Positionally-defined Fields

4.4.7.1  Any field that lacks indicators and subfields may also be
         specified as a positionally-defined field in the MARC
         Description File.

4.4.7.2  It is an error if a field is specified as a
         positionally-defined field and not also defined as a field
         lacking indicators and subfields.

4.4.7.3  Positionally-defined fields are divided into clusters of
         character positions.

4.4.7.4  The arrangement of the clusters within a positionally-defined
         field may vary.

4.4.7.5  For all positionally-defined fields, including the Leader,
         any clustering of data from multiple character positions into
         a single tag in the output tagged text is specified in the
         MARC Description File.

4.4.7.6  For the Leader, the key for determining the arrangement of
         clusters to use is the data in Leader character position 06.

4.4.7.7  For each positionally-defined field except the Leader: 
         
         1) The key for determining which arrangement of clusters to
         use is either the value of the 00 character position of that
         field or the value of Leader character positions 06 and 07.

         2) The choice between using the 00 position or using the
         Leader 06 and 07 is specified in the MARC Description File.

         3) The same key also determines the selection of a string to
         be used when constructing tag names for the tagged text
         output.  The string selected by the key is used to construct
         tags which group the tags of the clusters.  For example, if
         the document type is "mrcb", the string selected by the key
         is "AA", and the field number is "006", then the name of the
         grouping tag for each of the clusters is "mrcb006-AA".

         4) The tag name for each cluster is generated by appending
         the cluster's character position (or character position
         range) to the name of this grouping tag.  For example, if the
         cluster is the single character position 06, then (continuing
         from the previous example) its tag name is "mrcb006-AA-06".
         If the cluster is a range of character positions 06 to 09,
         then its tag name is "mrcb006-AA-06-09".

4.4.7.8  For the Leader:

         1) The name for a tag that groups tags for the
         positionally-defined fields is generated by concatenating the
         document type, the string "ldr-", and the format type
         (specified previously).  For example, if the document type is
         "mrcb" and the format type is "ci", then the tag name for the
         grouping tag for the positionally-defined fields in the
         Leader is "mrcbldr-ci".

         2) The tag name for each cluster is generated by appending
         the minus character ("-") and the cluster's character
         position (or character position range) to the name of the
         enclosing tag.  For example, if the cluster is the single
         character position 06, then (continuing from the previous
         example) the tag name is "mrcbldr-ci-06".  If the cluster is
         the range of character positions 06 to 09, then its tag name
         is "mrcbldr-ci-06-09".

4.4.7.9  The elements for all clusters within positionally-defined
         fields are all EMPTY (an SGML keyword meaning that there is
         no content and they have a start tag but no end tag).  The
         data in the character positions for the cluster is inserted
         into the start tag as the value of a "value" attribute.  For
         example, if the data in Leader character position 06 is "a"
         and Leader character position 06 is a single-character
         cluster, then (continuing from the previous example) its tag
         is '<mrcbldr-ci-06 value="a">'.  If the data in Leader
         character position 06 is "a", in 07 is "b", in 08 is "c", and
         in 09 is "d" and the cluster is the range of character
         positions 06 to 09, then its tag is
         '<mrcbldr-ci-06-09 value="abcd">'.
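
The cluster naming rules in 4.4.7.7 to 4.4.7.9 can be illustrated with
a short Python sketch (not the program's Perl code).  The function
name cluster_tag is hypothetical, two-digit zero-padded positions are
assumed from the examples in the text, and any character conversion of
the attribute value is left out.

```python
def cluster_tag(group_tag, start, end, field_data):
    """Build the EMPTY start tag for one cluster.

    group_tag  -- name of the enclosing grouping tag, e.g. "mrcbldr-ci"
    start, end -- first and last character position (inclusive)
    field_data -- the full text of the positionally-defined field
    """
    if start == end:
        name = "%s-%02d" % (group_tag, start)
    else:
        name = "%s-%02d-%02d" % (group_tag, start, end)
    value = field_data[start:end + 1]
    return '<%s value="%s">' % (name, value)
```

With a Leader whose positions 06 to 09 hold "abcd", the call
cluster_tag("mrcbldr-ci", 6, 9, leader) produces a tag named
"mrcbldr-ci-06-09" whose "value" attribute is "abcd".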

4.4.8 Variable Data Fields

4.4.8.1  The variable data fields contain indicators and subfields.

4.4.8.2  All fields that are not listed as lacking indicators and
         subfields are treated as variable data fields containing
         indicators and subfields.

4.4.8.3  Variable data fields comprise two single-character indicators
         then one or more subfields and an End of Field mark.  A
         subfield comprises a delimiter character (0x1F), a
         single-character subfield code, and a sequence of data
         characters.  The End of Field mark terminates both the last
         subfield and the variable data field.

4.4.8.4  Start and end tags are output for each variable data field.
         The tag name is the concatenation of the document type
         (e.g. "mrcb") and the field number (e.g. "010").

4.4.8.5  The values of the indicators are output as attributes in the
         start tag for the data variable field.

4.4.8.6  The attribute names are "I1" for Indicator 1 and "I2" for
         Indicator 2.  Indicator 1 attribute values are prefixed with
         "i1-", and Indicator 2 values are prefixed with "i2-".  Blank
         indicator values (" ", or 0x20) are output as "blank", and
         fill indicator values ("|") are output as "fill" (plus the
         appropriate prefix).

         Note that blank indicator values are represented by "#" in
         the initial specification from the Network Development and
         MARC Standards Office (and in many MARC-related documents),
         but the blank character appearing in MARC records is always
         " " (0x20).

4.4.8.7  Start and end tags are output for each subfield of a variable
         data field.  The tag name is the concatenation of the tag
         name of the enclosing tag (e.g. "mrcb010"), "-", and the
         subfield identifier (e.g. "a").

4.4.8.8  Subfields of variable data fields do not have any
         attributes.
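
The tagging of a variable data field described in 4.4.8.3 to 4.4.8.8
might be sketched as below.  This is Python for illustration only (the
program itself is Perl); the function names are hypothetical, and
character-to-entity conversion of the subfield data is omitted.

```python
SUBFIELD_DELIM = "\x1f"
EOF = "\x1e"


def indicator_value(prefix, ch):
    """Map an indicator character to its attribute value (4.4.8.6)."""
    if ch == " ":
        return prefix + "blank"
    if ch == "|":
        return prefix + "fill"
    return prefix + ch


def variable_field_to_sgml(doc_type, field_number, field):
    """Return the tagged text for one variable data field."""
    field = field.rstrip(EOF)
    tag = doc_type + field_number  # e.g. "mrcb" + "010"
    out = ['<%s I1="%s" I2="%s">' % (
        tag,
        indicator_value("i1-", field[0]),
        indicator_value("i2-", field[1]))]
    # Everything after the two indicators is a run of subfields,
    # each introduced by the delimiter and a one-character code.
    for sub in field[2:].split(SUBFIELD_DELIM)[1:]:
        if not sub:
            continue  # delimiter with no code: nothing to tag
        code, data = sub[0], sub[1:]
        out.append("<%s-%s>%s</%s-%s>" % (tag, code, data, tag, code))
    out.append("</%s>" % tag)
    return "".join(out)
```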



5.  CHARACTER CONVERSION
------------------------

5.1      The program will "come with" three possible levels of
         character conversion -- minimal, upper register to lower
         register, and all special characters.  The user will have the
         option of specifying one of those three levels or of
         supplying the filename for a user-supplied conversion
         specification.

5.2      User-supplied character conversion specifications will be
         formatted as "well-formed" SGML data.



6. ERROR HANDLING
-----------------

6.1      All error messages will be written to the log file.

6.2      MARC records with errors will be skipped and not written to
         the output.


7.  INCOMPATIBILITIES BETWEEN PROGRAM ASSUMPTIONS AND USMARC
------------------------------------------------------------

7.1      Every field (except the Leader and the Directory) and every
         subfield is assumed to be repeatable, although USMARC
         explicitly specifies the repeatability or nonrepeatability of
         each field and subfield.

7.2      The number of, and field numbers of, the variable control
         fields (which are the only fields that lack indicators and
         subfields) can vary from USMARC and must therefore be
         specified in the MARC Description File.  (USMARC explicitly
         specifies that fields 001 to 009 are the only variable
         control fields.)

7.3      The number of, and field numbers of, the fields that are
         positionally-defined can vary from USMARC and must be
         specified in the MARC Description File, although USMARC
         explicitly specifies that the Leader and fields 006, 007, and
         008 are the only positionally-defined fields.


8.  REQUIRED AUXILIARY DATA FILES
---------------------------------

8.1 MARC Description File
-------------------------

This conversion utility is controlled by a single SGML file containing
a description of the MARC record format.  The program itself is not
hard-wired for any implementation of MARC records, and it reads the
description file to find out what to expect in the MARC records.  An
error is signaled if an input MARC record does not conform to the
description.

8.2 Character to Entity Conversion Files
----------------------------------------

These files control the conversion of characters in the MARC data to
entities in the SGML output.  Two conversion files are
required by the program -- an upper-register to entity conversion file
and a character to entity conversion file.  An additional file may be
specified by the user.  The mapping in the selected conversion file is
converted into program code executed by the program to perform the
character to entity conversion.

The same conversion specification file format is used to specify
entity to character conversion for the SGML to MARC record
conversion program.
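
To make the two directions of conversion concrete, here is a small
Python sketch driven by one mapping table.  It is only an
illustration: the program described here turns the conversion file
into Perl code, whereas this sketch applies a dictionary directly, and
the three mapping entries are assumptions chosen for the example
rather than the contents of any supplied conversion file.

```python
import re

# Assumed sample mapping; real mappings come from the conversion files.
MAPPING = {"&": "&amp;", "<": "&lt;", "\u00e9": "&eacute;"}


def to_entities(text, mapping=MAPPING):
    """Replace each mapped character with its entity reference."""
    return "".join(mapping.get(ch, ch) for ch in text)


def to_characters(text, mapping=MAPPING):
    """Replace each known entity reference with its character."""
    inverse = {entity: ch for ch, entity in mapping.items()}
    # Longest entities first so no entity shadows a longer one.
    names = sorted(inverse, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in names))
    return pattern.sub(lambda m: inverse[m.group(0)], text)
```

Using the same table in both directions mirrors the statement that one
conversion specification format serves both programs.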


9.  PROCESS FLOW
----------------

9.1 Main Process Flow
---------------------

 - Initialize and read command-line arguments

 - Open log file

 - Read control data

 - Open input file

 - Open output file

 - Get additional user input (if required)

 - Process file, one record at a time.

 - If anything left at the end, warn about junk at
   end of file 

 - Close input file

 - Close output file

 - Write log file end message

 - Close log file


9.2 Per-record processing
-------------------------

For each MARC record:

 - Split record into fields

 - Validate leader

 - Determine document type from data in Leader
   character positions 06 and 07 

 - Check fields against MARC Description File data

 - Generate top-level start-tag and required "format-type" attribute as
   determined from data in Leader character positions 06 and 07

 - Process fields

 - Generate top-level end-tag

 - Output unvalidated SGML data


9.3 Per-field processing
------------------------

For each field:

 - If field starts new group

    - Generate end tag for previous group, if
      necessary

    - Generate start tag for new group

 - Process individual field according to field
   type

 - If field is the last field, generate end tag for
   last group
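
The group handling in the per-field loop can be sketched in Python as
follows (again an illustration, not the Perl implementation).  The
group table passed in stands for the ranges read from the MARC
Description File; the group names used in the usage note are
hypothetical, and the grouping tag format is taken from the start-tag
format given in Appendix B.

```python
def group_for(field_number, groups):
    """Find the grouping tag whose range contains the field number."""
    n = int(field_number)
    for low, high, name in groups:
        if low <= n <= high:
            return name
    raise ValueError("no grouping tag for field %s" % field_number)


def emit_with_groups(doc_type, fields, groups):
    """fields is a list of (field_number, tagged_text) in output order.
    Returns the output pieces with group start and end tags added."""
    out = []
    current = None
    for number, text in fields:
        name = group_for(number, groups)
        if name != current:
            if current is not None:
                out.append("</%s-%s>" % (doc_type, current))
            out.append("<%s-%s>" % (doc_type, name))
            current = name
        out.append(text)
    if current is not None:
        out.append("</%s-%s>" % (doc_type, current))
    return out
```

With groups such as [(1, 9, "ctl"), (200, 249, "title")], fields 001
and 008 share one pair of group tags and field 245 opens a second.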


9.4 Non-positionally-defined fields without subfields and indicators
--------------------------------------------------------------------

For each field without indicators and subfields
that is not positionally-defined:

 - Generate start tag

 - Generate field contents

 - Generate end tag


9.5 Positionally-defined fields
-------------------------------

For each positionally-defined field:

 - Generate start tag

 - For each positionally-defined subfield:

    - Generate start tag with subfield contents as
      "value" attribute

 - Generate end tag


9.6 Variable data fields
------------------------

For each variable data field:

 - Generate start tag with indicator values as "i1"
   and "i2" attributes

 - For each subfield

    - Generate start tag

    - Generate field contents

    - Generate end tag

 - Generate end tag

 
11. TEST FILES
--------------

 - One clean MARC record

 - Three or more clean MARC records

 - Junk at end of file

 - Junk at beginning of file

 - Junk at both ends of file

 - Corrupt leader

 - Corrupt directory

 - Record with multiple EOR marks 

 - Invalid funny character

 - Whitespace at beginning of file

 - Whitespace at end of file


Appendix B. Tag-valid SGML to MARC Record Conversion Specification

CONVERSION FROM TAGGED TEXT TO MARC RECORDS
===========================================

1.  FUNCTIONALITY
-----------------

The program(s) will serve as a generalized facility for converting
from tagged text to MARC (ISO 2709) records.


2.  INPUT/OUTPUT
----------------

2.1  Input
----------

Input to the conversion utility is a string of one or more tag-valid
SGML instances of MARC data, marked up in the style of the MARC SGML
DTDs.  The input data should be, and is assumed to be, valid parsed
SGML, marked up according to a DTD.  However, since this conversion
utility is table-driven and does not reference a DTD, the utility
cannot verify the validity of the input data.


2.2  Output
-----------

The output will be a datastream of MARC record data.

The input must be and the output will be "well-formed", but neither is
guaranteed to be valid.  "Well-formed" means that the MARC output
files will contain MARC records with correctly structured Leader,
Directory, and variable control and data fields, followed by an
end of record mark.  Leader fields 06 and 07 will be present and will
contain values that are legal according to the MARC Description File.*
With this small exception, no particular fields in the Leader,
control, or data fields are required nor are particular subfields
required within a given field.  Fields will be listed in the Directory
in ascending numerical order.  In short, the record will be
well-structured MARC but need not meet the full requirements of valid
USMARC records.  [*Footnote: Note that 06 and 07 in the Description
File can be modified by the person running the conversion and are thus
not limited to values specified in valid USMARC.]

Unless an output filename is specified as a command option, the
program's output will be written to the file "stdout.mrc".


2.3  Control of the Conversion Process
--------------------------------------

The conversion process will be table-driven, with top-level
controlling data coming directly from the conversion operator and
detailed controlling data provided in the (user modifiable) MARC
Description File.  MARC SGML input is verified and a full
MARC record is constructed according to the MARC Description
File. There is no direct connection to the MARC DTDs except as is
built into this file.


3.  USER INTERFACE
------------------

The conversion utility will be a Perl script executed from the command
line.  Command options may be entered on the command line.  One of the
command-line options will be the name of a command file containing
information equivalent to some or all of the command options.  Options
specified on the command line will override the options specified in
the command file.  Options controlling processing each have a default
value that will be used if not specified either on the command line or
in the command file.

The format of the command line syntax is:

sgm2mrc.pl [-command file]
   [-sgmlconv | -registerconv | -charconv | -userconv file]
   [-log file] [-o file] [-marcdesc file] input-file

where:
-command file
        Read program command options from "file"

-sgmlconv
        Perform minimal, "SGML sanity" character
        conversion using the built-in conversion
        table

-registerconv
        Convert upper-register characters to
        lower-register characters using the
        built-in conversion table

        The minimal SGML conversion will also be performed.

-charconv
        Convert characters to entities using the
        built-in conversion table

        The minimal SGML conversion will also be performed.

-userconv file
        Perform character conversion using the
        user-supplied conversion specification
        in "file"

        An error will be signaled if "file" is not
        specified or if "file" cannot be opened,
        or if "file" is not a file of the correct
        format.

        The minimal SGML conversion will also be performed.

-log file
        Write the output log to "file".  If this
        option is not specified, the log will be
        written to "mrc2sgm.log" in the current
        directory.

-o file
        Write the MARC record output to "file"
        instead of to the file "stdout.mrc".

-marcdesc file
        Read the MARC Description File named
        "file" instead of the default MARC
        Description File that the program automatically
        reads on initialization.

input-file
        The name of the input tagged-text (SGML) file

4.  ASSUMPTIONS
---------------

4.1  Hardware/Operating Systems
-------------------------------

Any IBM PC or clone that runs DOS, either natively, or under Windows
(3.1x, 95, 97, NT, etc.) or OS/2, or any UNIX system for which Perl
and nsgmls are available.

4.2  Software
-------------

The conversion program will be written in Perl 5 (Practical Extraction
and Report Language).  The advantages of Perl are:

 - Perl is free;

 - Perl is available for a wide variety of
   platforms, including DOS, Windows 3.1x, Windows
   95, Windows NT, OS/2, Macintosh, and UNIX;

 - Perl scripts are interpreted, so it is not
   necessary to recompile the conversion script to
   use it on a different platform;

 - There is a free SGML function library that
   simplifies the manipulation of SGML text;

 - Perl is optimized for scanning text;

 - Perl is good with binary data; and

 - Perl can read in external data files, such as a
   user-defined character conversion table, and
   evaluate them as part of the program rather than
   just store them as data.

Being interpreted, Perl does not execute as fast as a compiled
language such as C, but the script should still execute sufficiently
quickly for the satisfaction of most operators.  A large part of the
speed difference between Perl and a compiled program is the startup
time while the script is being interpreted into byte codes, so the
program will be more efficient, per-record, for large data sets.


4.3  File Size
--------------

It is not known how large the SGML input files will be.
MARC record files could be 1 MByte and could even be 1 GByte, and with
the addition of SGML-style tags which include attribute values, the
tagged-text versions of MARC records will be significantly larger.
The traditional Perl approach of slurping everything into memory and
spitting it out again is going to break for somebody sometime.
Instead, the program will read in a chunk of data, find the first
record element and process it, find the next record element and
process it, and so on until you're left with a partial record (or
reach the end of the file in the first chunk).  If you have not found
the end tag for the current record element, append the next chunk of
the file and carry on.
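
The chunked reading strategy can be sketched as a generator.  This is
a Python illustration rather than the program's Perl code; the
function name, the default chunk size, and the assumption that every
record element uses the same top-level tag name (for example "mrcb")
are all choices made for the example.

```python
import io


def iter_record_elements(stream, element, chunk_size=65536):
    """Yield one record element at a time from a text stream,
    reading chunk_size characters per read instead of the whole file."""
    start_tag = "<" + element
    end_tag = "</" + element + ">"
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        buffer += chunk
        while True:
            end = buffer.find(end_tag)
            if end == -1:
                break  # partial record: need the next chunk
            end += len(end_tag)
            start = buffer.find(start_tag)
            if start == -1 or start >= end:
                raise ValueError("end tag without matching start tag")
            yield buffer[start:end]
            buffer = buffer[end:]
        if not chunk:
            break  # end of file
    if buffer.strip():
        raise ValueError("incomplete record element at end of input")


# Usage with an in-memory stream and a deliberately tiny chunk size;
# the "format-type" value shown is a made-up placeholder.
sample = io.StringIO('<mrcb format-type="xx"><mrcb001>1</mrcb001></mrcb>\n')
records = list(iter_record_elements(sample, "mrcb", chunk_size=7))
```

Only the unprocessed tail of the input is kept in memory, so memory
use is bounded by the chunk size plus the largest single record.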



5. HARDWIRED KNOWLEDGE ABOUT MARC AND TAGGED TEXT DATA
------------------------------------------------------

5.1 Record, Field, and Subfield Delimiters in MARC Records

5.1.1    A MARC record will be terminated by an end of record (eor)
         character (0x1D).

5.1.2    All numbered fields and the Directory (but not the Leader)
         will be output followed by an end of field (eof) character
         (0x1E).

5.1.3    All subfields will be output preceded by a subfield delimiter
         character (0x1F).


5.2 Leader

5.2.1    Some parts of the Leader are contained within the tag-valid
         SGML text; other parts are constant for all MARC records and
         will be automatically inserted in the output.  The length
         portion of the Leader must be calculated once the rest of the
         MARC record is constructed.

5.2.2    The Leader contains the length of the MARC record, from the
         first character of the Leader to the end of record character
         at the end of the record.  The length is calculated after all
         of the fields and subfields have been processed and the
         Directory has been built and is then inserted into character
         positions 00 to 04 of the Leader.

5.2.3    The length portion of the Leader is right-justified, with
         unused character positions padded with zeroes.

5.2.4    It is an error if the length of the MARC record exceeds
         99,999, which is the maximum length that may be recorded in
         the Leader.

5.2.5    Some of the Leader character positions are filled with
         constant data as follows:

                Character Position      Data
                        10                2
                        11                2
                        20                4
                        21                5
                        22                0
                        23                0

5.2.6    The Leader contains the combined length of the Leader and the
         Directory (including the eof character at the end of the
         Directory).  Since the Leader has a fixed length, this is
         effectively the length of the Directory plus 24 characters.
         This length is inserted into character positions 12 to 16 of
         the Leader.
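
Putting 5.2.1 to 5.2.6 together, Leader construction might look like
the following Python sketch (the program itself is Perl).  The
function name and argument shapes are assumptions.  The record length
is written as five digits, which is what the 99,999 limit in 5.2.4
implies, and single-byte characters are assumed so that string length
equals byte length.

```python
LEADER_LEN = 24
# Constant Leader data from 5.2.5: position -> character.
CONSTANTS = {10: "2", 11: "2", 20: "4", 21: "5", 22: "0", 23: "0"}


def build_leader(sgml_positions, directory, fields):
    """sgml_positions -- {position: one-character value} from the SGML
    directory      -- Directory text including its eof character
    fields         -- all field data, each field ending with eof
    """
    leader = ["|"] * LEADER_LEN  # positions with no data are filled
    for cp, value in sgml_positions.items():
        leader[cp] = value
    for cp, value in CONSTANTS.items():
        leader[cp] = value
    base = LEADER_LEN + len(directory)  # Leader plus Directory
    total = base + len(fields) + 1      # plus the end of record mark
    if total > 99999:
        raise ValueError("record length exceeds 99,999")
    leader[0:5] = "%05d" % total        # right-justified, zero-padded
    leader[12:17] = "%05d" % base
    return "".join(leader)
```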

5.3 Top-level Tags (DTD Level)

5.3.1    The start-tag format is:

         <[dtd_type]>

5.3.2    The top-level tags specifying the DTD type (for example,
         <mrcb>) and their required "format-type" attribute are
         discarded.  They do not produce any output, they do not
         affect any processing, and they are not used in any
         cross-checking with any other tags or with the MARC
         Description File.

5.3.3    The descriptions within this specification of other tags
         include a "dtd_type" portion.  This portion of tag names has
         the same format as the top-level tag name, and it, too, does
         not affect any processing or output.


5.4 Leader Elements and Subelements

5.4.1    The start-tag format of the Leader element is:

         <[dtd_type]ldr-[format_type]>

5.4.2    The start-tag format of the subelements of the Leader element
         is:

         <[dtd_type]ldr-[format_type]-[cp] value="[cp value]">

         where "cp" indicates the character position of the data in
         the "value" attribute.

5.4.4    The contents of the "value" attributes of the elements for
         Leader character positions 06 and 07 are used to determine,
         from the MARC Description File, the document type and format
         type, and are used when selecting which portions of the MARC
         Description File to use.

5.4.5    It is an error if the MARC Description File does not contain
         a value corresponding to the values in the tagged-text for
         Leader character positions 06 and 07.

5.4.6    The data from each "value" attribute of each subelement of
         the Leader element is inserted in the Leader of the output
         MARC record in the character position indicated by the "cp"
         portion of the subelement tag name.

5.4.7    If the "value" attribute contains the word "fill" or
         "blank", then the character position is output containing
         the fill character ("|") or the blank character (" " or
         0x20), respectively.

5.4.8    The contents of the "value" attribute will otherwise be
         treated as a string.

5.4.9    If the start-tag does not contain a "value" attribute, the
         character position is output containing the fill character
         ("|").

5.4.10   Character positions within the Leader (that are not constants
         or the record length) for which no data is included in the
         SGML instance will be output containing the fill
         character ("|").


5.5 Grouping Tags

5.5.1    The SGML instance contains "grouping tags" that are
         not present in the output MARC records.  The start-tag format
         is:

         <[dtd_type]-[group_name]>

5.5.2    The valid grouping tags are listed in the MARC Description
         File.

5.5.3    The MARC Description File may contain multiple lists of valid
         grouping tags.  The grouping tag list is selected based upon
         the values of the contents of the "value" attributes of the
         elements for Leader character positions 06 and 07.  The value
         of these character positions determines a document type, and
         the document type determines which list is used.

5.5.4    It is an error if any grouping tags are present in the tagged
         text input but are not in the selected list in the MARC
         Description File.

5.5.5    No output is generated for the grouping tags.


5.6 Positionally-defined Fields

5.6.1    The start-tag format is:

         <[dtd_type][marc_tag]-[subtype]>

5.6.2    The only information from the start-tag that is used is the
         "marc_tag" portion, which is the field number.


5.7 Clusters Within Positionally-defined Fields

5.7.1    The start-tag format is:

         <[dtd_type][marc_tag]-[subtype]-[cluster]
                value="[content of position]">

5.7.2    The contents of the "value" attribute are inserted into the
         data for the field at the character offset specified by the
         "cluster" portion of the tag name.

5.7.3    The "cluster" portion of the tag name may be a number or a
         number range.

5.7.4    The range of numbers in the "cluster" portion determines how
         many character positions in the field data are used for the
         field data.  If the data in the "value" attribute does not
         use all of the available character positions, it will be
         padded with fill ("|") characters, and if it exceeds the
         number of available character positions, it will be
         truncated.

5.7.5    If the "value" attribute contains the word "fill" or
         "blank", then all of the character positions are output
         containing the fill character ("|") or the blank character
         (" " or 0x20), respectively.

5.7.6    The contents of the "value" attribute will otherwise be
         treated as a string, so, for example, the values "010" and
         "10" are different strings: one is three characters long,
         and the other, two.  If three character positions are used
         for the field data, the two values would appear as "010" and
         "10|", respectively.  If two character positions are
         available, the two values would appear as "01" and "10",
         respectively.

5.7.7    If the start-tag does not contain a "value" attribute, the
         character positions are output containing the fill character
         ("|").

5.7.8    Character positions within the positionally-defined field for
         which no data is included in the SGML instance will
         be output containing the fill character ("|").

5.7.9    Information in the MARC Description File determines the
         maximum length of the positionally-defined field, but no
         match is made between the cluster numbers and ranges
         specified in the MARC Description File and the cluster
         numbers and ranges in the tag names in the SGML instance.
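
The padding and truncation rules in 5.7.4 to 5.7.8 are easy to get
wrong, so a short Python sketch may help (illustration only; the
function names are hypothetical and the program itself is Perl).

```python
def cluster_chars(value, width):
    """Return exactly `width` characters for one cluster."""
    if value is None or value == "fill":
        return "|" * width  # missing attribute or the word "fill"
    if value == "blank":
        return " " * width
    # Otherwise a plain string: truncate, then pad with fill characters.
    return value[:width].ljust(width, "|")


def build_positional_field(length, clusters):
    """clusters is a list of (start, end, value), positions inclusive.
    Positions not covered by any cluster keep the fill character."""
    data = ["|"] * length
    for start, end, value in clusters:
        data[start:end + 1] = cluster_chars(value, end - start + 1)
    return "".join(data)
```

This reproduces the examples in 5.7.6: in three positions "010" and
"10" become "010" and "10|", and in two positions they become "01" and
"10".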


5.8 Other Fields Without Indicators or Subfields

5.8.1    The start-tag format is:

         <[dtd_type][marc_tag]>

5.8.2    Fields without indicators or subfields are listed as such in
         the MARC Description File.  They also do not have "i1" and
         "i2" attributes.

5.9 Fields with Subfields

5.9.1    The start-tag format is:

         <[dtd_type][marc_tag] i1="i1-[1st ind. value]"
                i2="i2-[2nd ind. value]">

         where "marc_tag" is the MARC field number.

5.9.2    Start tags for fields with subfields have two indicator
         attributes, "i1" and "i2", and are not listed in the MARC
         Description File as fields that lack indicators and
         subfields.

5.9.3    The attributes may appear in the start-tag in any order.

5.9.4    The indicators are output as the first two characters of the
         field.

5.9.5    The "i1" and "i2" attribute values start with "i1-" and
         "i2-" prefixes, respectively.  These prefixes are ignored
         when processing and do not appear in the output.

5.9.6    "i1" or "i2" attribute values (without prefix) of "fill" are
         output as the fill character ("|"), and values of "blank" are
         output as the blank character (" " or 0x20).

5.9.7    Alphabetic values of "i1" and "i2" attributes (without
         prefix) are output as lowercase letters.

5.9.8    "i1" and "i2" attribute values (without prefix and other than
         "blank" or "fill") that are longer than one character will be
         truncated to a single character.

5.9.9    An End Of Field character (0x1E) is output after the field
         contents.

5.10 Subfields in Fields with Subfields

5.10.1   The start-tag format is:

         <[dtd_type][marc_tag]-[subfield_code]>

5.10.2   The subfield code is extracted from the tag name.

5.10.3   Subfield codes that are longer than one character will be
         truncated to a single character.

5.10.4   The data between the start and end tags is the subfield data.

5.10.5   The subfield code is output preceded by a subfield delimiter
         (0x1F) and followed by the subfield data.

5.10.6   Subfields will be output in order of subfield code.
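
Sections 5.9 and 5.10 together describe how one field with subfields
is serialized.  A Python sketch of that step follows (not the Perl
implementation).  The function names are hypothetical, the end of
field character is 0x1E as defined in 5.1, and rejecting an empty
indicator value is an assumption, since the text does not cover that
case.

```python
SUBFIELD_DELIM = "\x1f"
EOF = "\x1e"  # end of field character, per 5.1


def indicator_char(attr_value, prefix):
    """Convert an "i1"/"i2" attribute value to one indicator character."""
    value = attr_value
    if value.startswith(prefix):
        value = value[len(prefix):]  # prefix is ignored
    if value == "fill":
        return "|"
    if value == "blank":
        return " "
    if not value:
        raise ValueError("empty indicator value")  # assumed behaviour
    return value[:1].lower()  # truncate, lowercase letters


def build_field(i1, i2, subfields):
    """subfields is a list of (code, data); codes longer than one
    character are truncated, and output is ordered by subfield code."""
    parts = [indicator_char(i1, "i1-"), indicator_char(i2, "i2-")]
    for code, data in sorted(subfields, key=lambda s: s[0][:1]):
        parts.append(SUBFIELD_DELIM + code[:1] + data)
    parts.append(EOF)
    return "".join(parts)
```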

5.11 Directory

5.11.1   The Directory is not present in the SGML instance
         of the MARC record and must be generated on output.

5.11.2   The Directory contains one 12-character entry for each
         numbered field element in the record.

5.11.3   The 12 characters of the Directory entry comprise three
         characters for the field number, four characters for the
         field length, and five characters for the relative starting
         position of the field.

5.11.4   The field number is extracted from the start-tag for the
         field.

5.11.5   The field length, for fields with indicators and subfields,
         is the total of the indicators (2 character positions) and
         the subfield delimiter, subfield code, and subfield data of
         each subfield plus the EOF delimiter; and, for fields without
         indicators and subfields, is the length of the field data
         plus the EOF delimiter.

5.11.6   The relative starting position is calculated based on the
         starting position of the previous field (if any).  In the
         output MARC record, the first character of the first field is
         at offset 00000, so that field's starting position is 00000.
         The second and subsequent fields begin immediately following
         the previous field, so their starting positions are the
         starting position of the previous field plus the length of
         the previous field.

5.11.7   All numbers in the Directory entry are right-justified with
         unused positions containing zeroes.
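
The Directory rules in 5.11.2 through 5.11.7 amount to a running sum of
field lengths.  The following sketch illustrates the arithmetic only;
the function name is hypothetical, and the lengths passed in are
assumed to already include the End Of Field delimiter as described in
5.11.5.

```python
def build_directory(fields):
    """Build Directory entries from (field_number, field_length) pairs.

    The pairs must be given in the order the fields are output.
    """
    entries = []
    start = 0  # 5.11.6: the first field starts at offset 00000
    for tag, length in fields:
        if length > 9999 or start > 99999:
            raise ValueError(f"field {tag} does not fit in a Directory entry")
        # 5.11.3: 3 characters of field number, 4 of length, 5 of start.
        # 5.11.7: numbers are right-justified and zero-filled.
        entries.append(f"{tag:0>3}{length:04d}{start:05d}")
        # 5.11.6: the next field begins right after this one.
        start += length
    return "".join(entries)
```

A 9-character field 001 followed by a 30-character field 245 produces
the entries 001000900000 and 245003000009.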


6. ERROR HANDLING
-----------------

6.1      All error messages will be written to the log file.

6.2      SGML records with errors will be skipped and not
         written to the output.


7.  CHARACTER CONVERSION
------------------------

7.1      The program provides three built-in levels of character
         conversion -- minimal, upper register to lower register, and
         all special characters.  The user will have the option of
         specifying one of those three levels or of supplying the
         filename of a user-supplied conversion specification.

7.2      User-supplied character conversion specifications must be
         formatted as "well-formed" SGML data with specific tag names.



8.  Incompatibilities Between Program Assumptions and USMARC
------------------------------------------------------------

8.1      Every field (except the Leader and the Directory) and every
         subfield is assumed to be repeatable, although USMARC
         explicitly specifies the repeatability or nonrepeatability of
         each field and subfield.

8.2      The number of, and field numbers of, the variable control
         fields (which are the only fields that lack indicators and
         subfields) can vary from USMARC and must therefore be
         specified in the MARC Description File.  (USMARC explicitly
         specifies that fields 001 to 009 are the only variable
         control fields.)

8.3      The number of, and field numbers of, the fields that are
         positionally-defined can vary from USMARC and must be
         specified in the MARC Description File, although USMARC
         explicitly specifies that the Leader and fields 006, 007, and
         008 are the only positionally-defined fields.


9.  REQUIRED AUXILIARY DATA FILES
---------------------------------

9.1 MARC Description File
-------------------------

This conversion utility is controlled by a single SGML file containing
a description of the MARC record format.  The program itself is not
hard-wired for any implementation of MARC records or existing MARC
DTDs, and it reads the description file to find out what to expect in
the SGML text and what to output for the MARC records.  An
error is signaled if an input tagged record does not conform to the
description, and SGML records with errors are not converted
to MARC records.

9.2 Character to Entity Conversion Files
----------------------------------------

These files control the conversion of entities in the SGML
to characters or sequences of characters in the MARC data.  Two
conversion files are required by the program -- an entity to
upper-register character conversion file and an entity to character
conversion file.  An additional file may be specified by the user.
The mapping in the selected conversion file is converted into program
code executed by the program to perform the entity to character
conversion.

The same conversion specification file format is used to specify
character to entity conversion for the MARC record to SGML
conversion program.


10.   Process Flow
-----------------

The program will operate based on the hard-wired knowledge of MARC and
SGML data listed previously.

10.1  Main Process Flow
----------------------

 - Initialize and read command-line arguments

 - Open log file

 - Read control data

 - Open input file

 - Open output file

 - Get additional user input (if required)

 - Process file, one SGML record element at a time

 - If anything is left at the end of the input, warn about junk at
   end of file

 - Close input file

 - Close output file

 - Write log file end message

 - Close log file


10.2  Per-record-element processing
-----------------------------------

For each record element:

 - Process each start and end tag as an "event" and build up MARC
   record data structure

   If the tag is not a top-level or Leader tag, then the "type" of the
   tag and the action taken is determined by the information in the
   MARC Description File for the field number in the "marc_tag"
   portion of the tag name.

 - Discontinue processing if the SGML record element has an
   error

 - Output MARC record
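
The tag-name patterns used in sections 10.4 through 10.10 can be told
apart as sketched below.  This is an illustration under stated
assumptions: the function name, the "mrcb" prefix, and the sample
subtype and group names are hypothetical, the DTD type is assumed to
be known from the enclosing record element, and the set of
positionally-defined field numbers stands in for the information read
from the MARC Description File.  Only the tag forms listed in 10.4
through 10.10 are handled.

```python
def classify_tag(name, dtd_type, positional_fields):
    """Classify a tag name into one of the event types of 10.4 to 10.10."""
    if not name.startswith(dtd_type):
        raise ValueError(f"{name!r} does not start with {dtd_type!r}")
    rest = name[len(dtd_type):]
    if rest == "ldr":
        return ("leader",)                       # <[dtd_type]ldr>
    if rest.startswith("-"):
        return ("group", rest[1:])               # <[dtd_type]-[group_name]>
    parts = rest.split("-", 2)
    marc_tag = parts[0]
    if not (len(marc_tag) == 3 and marc_tag.isdigit()):
        raise ValueError(f"no field number in tag name {name!r}")
    if len(parts) == 1:
        # Whether the field has indicators and subfields is decided later
        # from the MARC Description File (10.8.1).
        return ("field", marc_tag)
    if marc_tag in positional_fields:
        if len(parts) == 2:
            return ("positional", marc_tag, parts[1])
        return ("cluster", marc_tag, parts[1], parts[2])
    # Otherwise the suffix is a subfield code, truncated to one character.
    return ("subfield", marc_tag, parts[1][:1])
```

For instance, with "008" in the set of positionally-defined fields,
the name "mrcb008-bk" is classified as a positionally-defined field
and "mrcb245-a" as a subfield.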


10.3 Top-level tag
-----------------

10.3.1 Start tag

 - Reset per-record variables

10.3.2 End tag

If an error has not occurred:

 - Calculate Directory

 - Insert fixed portions of Leader data into correct character
   positions in Leader data structure

 - Calculate total record length and output Leader

 - Output Directory

 - Output variable control and data fields in ascending numerical
   order

 - Output end of record character
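
The end-tag steps above can be sketched as one assembly function.
This is an illustration, not the program's code.  The function name is
hypothetical, and the following details are taken from the MARC 21
record structure rather than from this specification: the Leader is 24
characters, positions 00-04 hold the record length, positions 12-16
hold the base address of data, the Directory ends with a field
terminator (0x1E), and the record ends with a record terminator
(0x1D).  One character is assumed to occupy one byte.

```python
FIELD_TERMINATOR = "\x1e"   # assumption: MARC 21 field terminator
RECORD_TERMINATOR = "\x1d"  # assumption: MARC 21 record terminator
LEADER_LENGTH = 24


def assemble_record(leader, fields):
    """Assemble a MARC record from a Leader template and (tag, data) pairs.

    Each data string must already end with the field terminator.
    """
    if len(leader) != LEADER_LENGTH:
        raise ValueError("Leader must be 24 characters")
    # Output fields in ascending numerical order (stable for repeated tags).
    ordered = sorted(fields, key=lambda f: f[0])
    directory = ""
    start = 0
    for tag, data in ordered:
        directory += f"{tag}{len(data):04d}{start:05d}"
        start += len(data)
    directory += FIELD_TERMINATOR
    base_address = LEADER_LENGTH + len(directory)
    record_length = base_address + start + len(RECORD_TERMINATOR)
    if record_length > 99999:
        raise ValueError("record exceeds 99,999 characters")
    # Insert the calculated values into their Leader character positions.
    leader = (f"{record_length:05d}" + leader[5:12]
              + f"{base_address:05d}" + leader[17:])
    body = "".join(data for _, data in ordered)
    return leader + directory + body + RECORD_TERMINATOR
```

The length check corresponds to the first test file listed in section
11, where the generated record exceeds 99,999 characters.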


10.4 Grouping tag: <[dtd_type]-[group_name]>
-------------------------------------------

10.4.1 Start tag

 - Extract group name from tag name

 - Check group name in list from MARC Description File

10.4.2 End tag

 - No action


10.5 Leader tag: <[dtd_type]ldr>
-------------------------------

10.5.1 Start tag

 - No action

10.5.2 End tag

 - Check that Leader 06 and 07 character positions have valid data


10.6 Other positionally-defined field: <[dtd_type][marc_tag]-[subtype]>
-----------------------------------------------------------------------

10.6.1 Start tag

 - Extract field number from tag name

 - Check that field number is listed in MARC Description File

 - Create data structure for field data with length from MARC
   Description File and pre-fill with fill characters

10.6.2 End tag

 - Save field number and field data in MARC record data structure


10.7 Positionally-defined field data cluster:
--------------------------------------------

     <[dtd_type][marc_tag]-[subtype]-[cluster] value="">

10.7.1 Start tag

 - Extract cluster number and cluster contents from start tag

 - Convert "blank" or "fill" strings to blank or fill characters,
   respectively

 - Insert cluster contents into field data structure for the
   positionally-defined field.  Cluster data will be truncated or
   padded as defined previously.
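
The handling of a positionally-defined field and its clusters (10.6
and 10.7) can be sketched as follows.  The truncation and padding
rules themselves are defined earlier in this specification and are not
repeated here, so this sketch rests on assumptions: the function names
are hypothetical, each cluster's start position and length are assumed
to come from the MARC Description File, a "blank" or "fill" value is
assumed to apply to every position of the cluster, and a value shorter
than the cluster is assumed to be left-aligned, leaving the remaining
positions as pre-filled.

```python
FILL_CHAR = "|"
BLANK_CHAR = " "


def new_positional_field(length):
    """10.6.1: create field data of the given length, pre-filled with fill."""
    return [FILL_CHAR] * length


def insert_cluster(field, start, length, value):
    """10.7.1: insert a cluster's "value" attribute into the field data."""
    if start + length > len(field):
        raise ValueError("cluster extends past the end of the field")
    # Convert the keywords to blank or fill characters.
    if value == "blank":
        value = BLANK_CHAR * length
    elif value == "fill":
        value = FILL_CHAR * length
    # Assumption: values that are too long are cut to the cluster length.
    value = value[:length]
    # Overwrite only as many positions as the value supplies.
    field[start:start + len(value)] = list(value)
```

Joining the list with "".join(field) at the end tag gives the field
data to be saved in the MARC record data structure.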

10.7.2 End tag

 - No action; cluster tags have no end tags


10.8 Other field without indicators or subfields: <[dtd_type][marc_tag]>
-----------------------------------------------------------------------

10.8.1 Start tag

 - Extract field number from tag name

 - Check that field number is listed in MARC Description File

   If not listed as field without indicators and subfields, process as
   field with indicators and subfields

10.8.2 End tag

 - Save field number and element contents in MARC record data structure


10.9 Fields with indicators and subfields:
-----------------------------------------

     <[dtd_type][marc_tag] i1="" i2="">

10.9.1 Start tag

 - Extract field number from tag name

 - Extract indicator attributes

 - Convert indicator attribute values into correct MARC values

10.9.2 End tag

 - Save field number, indicator values, and subfield data in MARC
   record data structure

10.10 Subfield of field with indicators and subfields:
----------------------------------------------------

      <[dtd_type][marc_tag]-[subfield_code]>

10.10.1 Start tag

 - Extract subfield code from tag name

10.10.2 End tag

 - Save subfield code and element content for later inclusion in MARC
   record data structure

 
11. Test Files
--------------

 - SGML text such that generated MARC record exceeds 99,999
   characters

 - One clean SGML record that would parse by one of the MARC
   DTDs

 - Three or more clean SGML records that would parse by one
   of the MARC DTDs

 - Junk at end of file

 - Junk at beginning of file

 - Junk between record elements

 - Junk at both ends of file

 - Corrupt Leader markup

 - Corrupt Directory markup

 - Corrupt markup for field without indicators or subfields

 - Corrupt positionally-defined field markup

 - Corrupt field with indicators and subfields

 - Missing "value" attribute on cluster tag

 - Too long "value" attribute on cluster tag

 - Too short "value" attribute on cluster tag

 - "blank" as "value" attribute data on cluster tag

 - "fill" as "value" attribute data on cluster tag

 - "blank" as "i1" and "i2" attribute data

 - "fill" as "i1" and "i2" attribute data

 - Invalid entity reference

 - Whitespace at beginning of file

 - Whitespace at end of file


Library of Congress

Library of Congress Help Desk (08/14/1998)


ⁱ Note that the fields without indicators or subfields specified in the MARC Description File are not limited to values specified in valid USMARC.

ⁱⁱⁱ Note that the clustering of Leader character positions specified in the MARC Description File is not limited to values specified in valid USMARC.