ALTO Technical Metadata for Optical Character Recognition (OCR)

The ALTO v.2.0 element set

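This document lists the elements of ALTO version 2.0 and their related attributes, with values or value sources where applicable. It is an "outline" of the schema: each element is detailed by whether it is required, its usage, its attributes, what it contains, and what it is contained by.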

ALTO requires use of the <Layout> element as a child under the root <alto> element. The <Layout> element requires use of a child <Page> element, which must carry a valid ID attribute value and a PHYSICAL_IMG_NR attribute value.

The 2.0 schema now has a target namespace URI: http://www.loc.gov/standards/alto/ns-v2#, to reflect that the standard is now maintained by the Library of Congress. The previous namespace URI reflected maintenance by CCS.
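The required structure described above can be sketched with Python's standard xml.etree.ElementTree module. This is a minimal illustration, not a reference implementation: the function name build_minimal_alto and the sample values "P1" and 1 are assumptions made for the example, while the namespace URI, the element names, and the two Page attributes come from the text above.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"  # namespace URI given above


def build_minimal_alto(page_id: str, physical_img_nr: int) -> ET.Element:
    """Return an <alto> root holding the required Layout/Page chain.

    The function name and arguments are hypothetical; only the element and
    attribute names are taken from the ALTO outline.
    """
    # Serialize the ALTO namespace as the default (unprefixed) namespace.
    ET.register_namespace("", ALTO_NS)
    root = ET.Element(f"{{{ALTO_NS}}}alto")
    layout = ET.SubElement(root, f"{{{ALTO_NS}}}Layout")
    ET.SubElement(
        layout,
        f"{{{ALTO_NS}}}Page",
        {"ID": page_id, "PHYSICAL_IMG_NR": str(physical_img_nr)},
    )
    return root


root = build_minimal_alto("P1", 1)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The optional Description and Styles elements are left out here; a real file would normally carry them before Layout.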


Root element in ALTO element set

alto
Required: Yes.
Usage: Root Element for bundling text layout technical metadata.
Attributes: None.
Contains AS SEQUENCE: Description, Styles, Layout.
Contained by: None.

Top-level ALTO elements

These elements are direct children of the <alto> root element. The sorting is based on the accepted sequence in which they may be used.

Description
Required: No.
Usage: Describes general settings of the ALTO file, such as measurement units and metadata.
Attributes: None.
Contains AS SEQUENCE: MeasurementUnit, sourceImageInformation, OCRProcessing.
Contained by: alto.
Styles
Required: No.
Usage: Styles define properties of layout elements. A style defined in a parent element is used as the default style for all of its child elements.
Attributes: None.
Contains AS SEQUENCE: TextStyle, ParagraphStyle.
Contained by: alto.
Layout
Required: Yes.
Usage: The root Layout element.
Attributes: STYLEREFS.
Contains AS SEQUENCE: Page.
Contained by: alto.
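The ordering rule for the top-level elements can be checked mechanically. The sketch below assumes a parsed document and reports children of <alto> that are unknown, out of sequence, or missing; the helper names local_name and check_top_level are hypothetical, and the sample documents omit the namespace for brevity.

```python
import xml.etree.ElementTree as ET

TOP_LEVEL_ORDER = ["Description", "Styles", "Layout"]  # sequence listed above
REQUIRED = {"Layout"}  # the only top-level element marked "Required: Yes"


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def check_top_level(root: ET.Element) -> list[str]:
    """Return a list of problems with the children of <alto> (empty if none)."""
    names = [local_name(child.tag) for child in root]
    problems = [f"unexpected element: {n}" for n in names if n not in TOP_LEVEL_ORDER]
    known = [n for n in names if n in TOP_LEVEL_ORDER]
    # The children must appear in the same relative order as TOP_LEVEL_ORDER.
    if known != sorted(known, key=TOP_LEVEL_ORDER.index):
        problems.append("children are not in Description, Styles, Layout order")
    for name in sorted(REQUIRED - set(names)):
        problems.append(f"missing required element: {name}")
    return problems


good = ET.fromstring("<alto><Description/><Layout/></alto>")
bad = ET.fromstring("<alto><Layout/><Styles/></alto>")
print(check_top_level(good))
print(check_top_level(bad))
```

This is only a structural spot check; validating against the ALTO XML Schema itself remains the authoritative test.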

<Description> elements

These elements are contained by the <Description> element underneath <alto>. The sorting is based on the accepted sequence in which they may be used.

MeasurementUnit
Required: No.
Usage: All measurement values inside the ALTO file, except font size, are expressed in this unit. The default is 1/10 of a millimetre (mm10).
Attributes: None.
Contains ENUMERATED VALUES: dpi, pixel, mm10, inch1200.
Contained by: Description.
sourceImageInformation
Required: No.
Usage: Information to identify the image file from which the OCR text was created.
Attributes: None.
Contains AS SEQUENCE: fileName, fileIdentifier.
Contained by: Description.
OCRProcessing
Required: No.
Usage: Information on how the text was created, including preprocessing, OCR processing, and postprocessing steps. Where possible, this draws from MIX's change history.
Attributes: ID.
Contains: preProcessingStep, ocrProcessingStep, postProcessingStep.
Contained by: Description.
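Because every coordinate and size in a file depends on the MeasurementUnit value, a consumer usually normalizes measurements before using them. The following sketch converts a value to millimetres under stated assumptions: mm10 is 1/10 of a millimetre as described above, inch1200 is read as 1/1200 of an inch, and pixel values need an image resolution that the caller must supply. The function name to_millimetres and its dpi parameter are hypothetical, and the "dpi" enumeration value is deliberately left unhandled because this outline does not define how it is interpreted.

```python
from typing import Optional

MM_PER_INCH = 25.4


def to_millimetres(value: float, unit: str = "mm10", dpi: Optional[float] = None) -> float:
    """Convert an ALTO measurement to millimetres.

    unit is the MeasurementUnit value; dpi is an assumed, caller-supplied
    image resolution that is only needed for pixel measurements.
    """
    if unit == "mm10":
        return value / 10  # tenths of a millimetre
    if unit == "inch1200":
        return value / 1200 * MM_PER_INCH  # 1/1200 of an inch
    if unit == "pixel":
        if not dpi:
            raise ValueError("pixel measurements need a resolution (dpi)")
        return value / dpi * MM_PER_INCH
    raise ValueError(f"unsupported MeasurementUnit: {unit!r}")


print(to_millimetres(2100))               # a WIDTH of 2100 in the default mm10 unit
print(to_millimetres(1200, "inch1200"))   # one inch expressed in inch1200
print(to_millimetres(300, "pixel", dpi=300))
```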

<Styles> elements

These elements are contained by the <Styles> element underneath <alto>. The sorting is based on the accepted sequence in which they may be used.

TextStyle
Required: No.
Usage: A text style defines font properties of text.
Attributes: ID, FONTWIDTH, FONTTYPE, FONTSTYLE, FONTFAMILY, FONTCOLOR, FONTSIZE.
Contains: EMPTY ELEMENT.
Contained by: Styles.
ParagraphStyle
Required: No.
Usage: A paragraph style defines formatting properties of text blocks.
Attributes: ID, RIGHT, LEFT, ALIGN, LINESPACE, FIRSTLINE.
Contains: EMPTY ELEMENT.
Contained by: Styles.

<Layout> elements

These elements are contained by the <Layout> element underneath <alto>. The sorting is based on the accepted sequence in which they may be used.

Page
Required: Yes.
Usage: One page of a book or journal.
Attributes: ID, PHYSICAL_IMG_NR, PRINTED_IMG_NR, PAGECLASS, PROCESSING, STYLEREFS, HEIGHT, WIDTH, QUALITY, POSITION.
Contains AS SEQUENCE: TopMargin, LeftMargin, RightMargin, BottomMargin, PrintSpace.
Contained by: Layout.
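As an illustration of reading a Page entry, the sketch below checks for the two required attributes (ID and PHYSICAL_IMG_NR) and lists which of the page regions named above are present. The function name summarize_page and the sample attribute values are assumptions for the example, and the sample omits the namespace for brevity.

```python
import xml.etree.ElementTree as ET

REQUIRED_PAGE_ATTRS = ("ID", "PHYSICAL_IMG_NR")
PAGE_REGIONS = ("TopMargin", "LeftMargin", "RightMargin", "BottomMargin", "PrintSpace")


def summarize_page(page: ET.Element) -> dict:
    """Return the required attributes and the page regions found on a Page."""
    missing = [name for name in REQUIRED_PAGE_ATTRS if page.get(name) is None]
    if missing:
        raise ValueError("Page is missing required attribute(s): " + ", ".join(missing))
    # Drop any '{namespace}' prefix so the check works with or without one.
    children = [child.tag.rsplit("}", 1)[-1] for child in page]
    return {
        "id": page.get("ID"),
        "physical_img_nr": page.get("PHYSICAL_IMG_NR"),
        "regions": [name for name in children if name in PAGE_REGIONS],
    }


sample_page = ET.fromstring(
    '<Page ID="P1" PHYSICAL_IMG_NR="1" HEIGHT="2970" WIDTH="2100">'
    "<TopMargin/><PrintSpace/></Page>"
)
print(summarize_page(sample_page))
```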

ALTO attributes

These attributes may appear on given elements within ALTO. The sorting is alphabetical.

ID
Usage: A valid identifier as defined by the XML Schema specification.
Contained by: OCRProcessing, TextStyle, ParagraphStyle, Page.
STYLEREFS
Usage: Binds an element to one or more styles by listing the IDs (as IDREFS) of TextStyle and ParagraphStyle elements.
Contained by: Layout, Page.
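The link between STYLEREFS and the ID attribute can be sketched as a simple lookup. The example assumes that STYLEREFS holds a whitespace-separated list of IDs pointing at children of <Styles>; the function name resolve_stylerefs and all sample IDs and attribute values are hypothetical, and the sample omits the namespace for brevity.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<alto>
  <Styles>
    <TextStyle ID="TXT_0" FONTFAMILY="Times" FONTSIZE="10"/>
    <ParagraphStyle ID="PAR_0" ALIGN="Left"/>
  </Styles>
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1" STYLEREFS="TXT_0 PAR_0"/>
  </Layout>
</alto>
"""


def resolve_stylerefs(root: ET.Element, element: ET.Element) -> list[ET.Element]:
    """Return the style elements referenced by element's STYLEREFS attribute.

    Raises KeyError if a reference does not match any style ID.
    """
    styles = {style.get("ID"): style for style in root.iterfind("Styles/*")}
    refs = element.get("STYLEREFS", "").split()  # assumed whitespace-separated
    return [styles[ref] for ref in refs]


root = ET.fromstring(SAMPLE.strip())
page = root.find("Layout/Page")
for style in resolve_stylerefs(root, page):
    print(style.tag, style.attrib)
```

A dangling reference surfaces as a KeyError here; a production tool would more likely collect such errors and report them together.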

ALTO (Analyzed Layout and Text Object) is an XML Schema that details technical metadata for describing the layout and content of physical text resources, such as pages of a book or a newspaper. It most commonly serves as an extension schema used within the administrative metadata section of the Metadata Encoding and Transmission Schema (METS). However, ALTO instances can also exist as standalone documents used independently of METS.