The ALTO v.2.0 element set
This document lists the elements of ALTO version 2.0 and their related attributes, with values or value sources where applicable. It is an "outline" of the schema, organized into the sections below.
ALTO requires use of the <Layout> element as a child under the root <alto> element. The <Layout> element requires use of a child <Page> element, which must carry a valid ID attribute value and a PHYSICAL_IMG_NR attribute value.
The 2.0 schema now has a target namespace URI: http://www.loc.gov/standards/alto/ns-v2#, to reflect that the standard is now maintained by the Library of Congress. The previous namespace URI reflected maintenance by CCS.
- View a model of the schema.
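The structural requirements and the namespace described above can be illustrated with a small sketch. It uses Python's standard xml.etree.ElementTree module; the sample document, its ID value "P1", and its PHYSICAL_IMG_NR value "1" are made up for illustration, and the sample has not been validated against the schema.

```python
import xml.etree.ElementTree as ET

# Namespace URI quoted in the text above.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"

# Hypothetical minimal document: <alto> root, <Layout> child, one <Page>
# carrying the two attributes named above. All values are illustrative.
MINIMAL_ALTO = f"""<alto xmlns="{ALTO_NS}">
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1"/>
  </Layout>
</alto>"""

root = ET.fromstring(MINIMAL_ALTO)
ns = {"alto": ALTO_NS}

# ElementTree stores qualified names as "{namespace}localname",
# so lookups need the namespace mapping.
layout = root.find("alto:Layout", ns)
page = layout.find("alto:Page", ns)

print(root.tag)
print(page.get("ID"), page.get("PHYSICAL_IMG_NR"))
```

A lookup without the namespace mapping, such as root.find("Layout"), returns None for this document, which is a common source of confusion when reading namespaced ALTO files.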
Root element in ALTO element set
- alto
- Required: Yes.
- Usage: Root Element for bundling text layout technical metadata.
- Attributes: None.
- Contains AS SEQUENCE: Description, Styles, Layout.
- Contained by: None.
Top-level ALTO elements
These elements are direct children of the <alto> root element. The sorting is based on the accepted sequence in which they may be used.
- Description
- Required: No.
- Usage: Describes general settings of the ALTO file, such as measurement units and metadata.
- Attributes: None.
- Contains AS SEQUENCE: MeasurementUnit, sourceImageInformation, OCRProcessing.
- Contained by: alto.
- Styles
- Required: No.
- Usage: Styles define properties of layout elements. A style defined in a parent element is used as default style for all related children elements.
- Attributes: None.
- Contains AS SEQUENCE: TextStyle, ParagraphStyle.
- Contained by: alto.
- Layout
- Required: Yes.
- Usage: The root Layout element.
- Attributes: STYLEREFS.
- Contains AS SEQUENCE: Page.
- Contained by: alto.
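The sequence rule for the top-level elements can be checked with a short script. This is a sketch rather than a substitute for schema validation: the function name check_top_level and the wording of its messages are hypothetical, and it only looks at order and at the presence of the required <Layout> element.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"

# Accepted sequence of the children of <alto>, as listed above.
EXPECTED_ORDER = ["Description", "Styles", "Layout"]


def local_name(element):
    """Strip the "{namespace}" prefix that ElementTree puts on tags."""
    return element.tag.rsplit("}", 1)[-1]


def check_top_level(root):
    """Return a list of problems found among the direct children of <alto>."""
    problems = []
    names = [local_name(child) for child in root]

    unknown = [n for n in names if n not in EXPECTED_ORDER]
    if unknown:
        problems.append(f"unexpected children: {unknown}")

    # The known children must already be in the accepted sequence.
    known = [n for n in names if n in EXPECTED_ORDER]
    if known != sorted(known, key=EXPECTED_ORDER.index):
        problems.append(f"children out of sequence: {known}")

    if "Layout" not in names:
        problems.append("missing required <Layout>")
    return problems


good = ET.fromstring(f'<alto xmlns="{ALTO_NS}"><Description/><Layout/></alto>')
bad = ET.fromstring(f'<alto xmlns="{ALTO_NS}"><Layout/><Styles/></alto>')

print(check_top_level(good))  # []
print(check_top_level(bad))
```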
<Description> elements
These elements are contained by the <Description> element underneath <alto>. The sorting is based on the accepted sequence in which they may be used.
- MeasurementUnit
- Required: No.
- Usage: All measurement values inside the ALTO file, except font size, are related to this unit. The default is 1/10 of mm.
- Attributes: None.
- Contains ENUMERATED VALUES: dpi, pixel, mm10, inch1200.
- Contained by: Description.
- sourceImageInformation
- Required: No.
- Usage: Information to identify the image file from which the OCR text was created.
- Attributes: None.
- Contains AS SEQUENCE: fileName, fileIdentifier.
- Contained by: Description.
- OCRProcessing
- Required: No.
- Usage: Information on how the text was created, including preprocessing, OCR processing, and postprocessing steps. Where possible, this draws from MIX's change history.
- Attributes: ID.
- Contains: preProcessingStep, ocrProcessingStep, postProcessingStep.
- Contained by: Description.
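Because every coordinate in an ALTO file depends on the MeasurementUnit value, a consumer usually normalizes measurements before using them. The sketch below converts values to millimetres. It assumes that mm10 means 1/10 mm (stated above as the default), that inch1200 means 1/1200 inch (inferred from the name), and that pixel values need an externally supplied resolution; the function name to_millimetres and the dpi parameter are hypothetical. The enumerated value "dpi" is not handled because its interpretation is not described here.

```python
MM_PER_INCH = 25.4


def to_millimetres(value, unit="mm10", dpi=None):
    """Convert an ALTO measurement to millimetres.

    unit is the MeasurementUnit value; dpi is only needed for "pixel".
    """
    if unit == "mm10":
        # Default unit: tenths of a millimetre.
        return value / 10
    if unit == "inch1200":
        # Assumption: 1/1200 of an inch.
        return value / 1200 * MM_PER_INCH
    if unit == "pixel":
        if dpi is None:
            raise ValueError("pixel values need the image resolution (dpi)")
        return value / dpi * MM_PER_INCH
    raise ValueError(f"unsupported MeasurementUnit: {unit!r}")


print(to_millimetres(2100))                   # 210.0
print(to_millimetres(1200, "inch1200"))       # 25.4
print(to_millimetres(300, "pixel", dpi=300))  # 25.4
```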
<Styles> elements
These elements are contained by the <Styles> element underneath <alto>. The sorting is based on the accepted sequence in which they may be used.
- TextStyle
- Required: No.
- Usage: A text style defines font properties of text.
- Attributes: ID, FONTWIDTH, FONTTYPE, FONTSTYLE, FONTFAMILY, FONTCOLOR, FONTSIZE.
- Contains: EMPTY ELEMENT.
- Contained by: Styles.
- ParagraphStyle
- Required: No.
- Usage: A paragraph style defines formatting properties of text blocks.
- Attributes: ID, RIGHT, LEFT, ALIGN, LINESPACE, FIRSTLINE.
- Contains: EMPTY ELEMENT.
- Contained by: Styles.
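As an illustration of how the style elements might be read, the sketch below loads TextStyle entries into small records. The sample attribute values ("TXT_0", "Times New Roman", "10") are made up, the TextStyle dataclass and parse_text_styles function are hypothetical helpers, and treating FONTSIZE as a number is an assumption rather than something stated in this outline.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"
NS = {"alto": ALTO_NS}

# Hypothetical sample; attribute values are illustrative only.
SAMPLE = f"""<alto xmlns="{ALTO_NS}">
  <Styles>
    <TextStyle ID="TXT_0" FONTFAMILY="Times New Roman" FONTSIZE="10"/>
  </Styles>
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1"/>
  </Layout>
</alto>"""


@dataclass
class TextStyle:
    id: str
    font_family: Optional[str]
    font_size: Optional[float]  # assumption: FONTSIZE is numeric


def parse_text_styles(alto_xml):
    """Collect the <TextStyle> children of <Styles> as TextStyle records."""
    root = ET.fromstring(alto_xml)
    styles = []
    for el in root.findall("alto:Styles/alto:TextStyle", NS):
        size = el.get("FONTSIZE")
        styles.append(
            TextStyle(
                id=el.get("ID"),
                font_family=el.get("FONTFAMILY"),
                font_size=float(size) if size is not None else None,
            )
        )
    return styles


print(parse_text_styles(SAMPLE))
```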
<Layout> elements
These elements are contained by the <Layout> element underneath <alto>. The sorting is based on the accepted sequence in which they may be used.
- Page
- Required: Yes.
- Usage: One page of a book or journal.
- Attributes: ID, PHYSICAL_IMG_NR, PRINTED_IMG_NR, PAGECLASS, PROCESSING, STYLEREFS, HEIGHT, WIDTH, QUALITY, POSITION.
- Contains AS SEQUENCE: TopMargin, LeftMargin, RightMargin, BottomMargin, PrintSpace.
- Contained by: Layout.
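The two attributes that every <Page> must carry, ID and PHYSICAL_IMG_NR, can be checked before a file is handed to downstream tools. The following is a minimal sketch; the function name missing_page_attributes and the shape of its return value are hypothetical, and it does not check that ID is a valid XML identifier.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"
REQUIRED_PAGE_ATTRS = ("ID", "PHYSICAL_IMG_NR")


def missing_page_attributes(alto_xml):
    """Map the position of each <Page> to the required attributes it lacks.

    Pages that carry both attributes are left out of the result.
    """
    root = ET.fromstring(alto_xml)
    report = {}
    for index, page in enumerate(root.iter(f"{{{ALTO_NS}}}Page")):
        missing = [name for name in REQUIRED_PAGE_ATTRS if not page.get(name)]
        if missing:
            report[index] = missing
    return report


# Hypothetical input: the page lacks PHYSICAL_IMG_NR.
doc = f'<alto xmlns="{ALTO_NS}"><Layout><Page ID="P1"/></Layout></alto>'
print(missing_page_attributes(doc))  # {0: ['PHYSICAL_IMG_NR']}
```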
ALTO attributes
These attributes may appear on given elements within ALTO. The sorting is alphabetical.
- ID
- Usage: A valid identifier as defined by the XML Schema specification.
- Contained by: OCRProcessing, TextStyle, ParagraphStyle, Page.
- STYLEREFS
- Usage: Holds IDREFs that bind the element to the IDs of style elements such as TextStyle.
- Contained by: Layout, Page.
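How ID and STYLEREFS work together can be sketched as a lookup from references to style definitions. The example assumes that STYLEREFS holds a whitespace-separated list of IDs and that it may point at both TextStyle and ParagraphStyle entries; the sample values ("TXT_0", "PAR_0", "MISSING_1", "Left") are made up for illustration.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"
NS = {"alto": ALTO_NS}

# Hypothetical sample; "MISSING_1" deliberately has no matching style.
SAMPLE = f"""<alto xmlns="{ALTO_NS}">
  <Styles>
    <TextStyle ID="TXT_0" FONTFAMILY="Times New Roman" FONTSIZE="10"/>
    <ParagraphStyle ID="PAR_0" ALIGN="Left"/>
  </Styles>
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1" STYLEREFS="TXT_0 PAR_0 MISSING_1"/>
  </Layout>
</alto>"""

root = ET.fromstring(SAMPLE)

# Index every style definition by its ID attribute.
styles_by_id = {el.get("ID"): el for el in root.find("alto:Styles", NS)}

page = root.find("alto:Layout/alto:Page", NS)
refs = page.get("STYLEREFS", "").split()

# Split references into those that resolve and those that dangle.
resolved = {r: styles_by_id[r].tag.rsplit("}", 1)[-1] for r in refs if r in styles_by_id}
dangling = [r for r in refs if r not in styles_by_id]

print(resolved)  # {'TXT_0': 'TextStyle', 'PAR_0': 'ParagraphStyle'}
print(dangling)  # ['MISSING_1']
```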