ALTO Technical Metadata for Optical Character Recognition (OCR)

The ALTO v.2.0 element set

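This document lists the elements of ALTO version 2.0 and their related attributes, with values or value sources where applicable. It is an "outline" of the schema, organized into the following sections: the root element, the top-level elements, the <Description>, <Styles>, and <Layout> elements, and the attributes.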

ALTO requires use of the <Layout> element as a child under the root <alto> element. The <Layout> element requires use of a child <Page> element, which must carry a valid ID attribute value and a PHYSICAL_IMG_NR attribute value.

The 2.0 schema now has a target namespace URI: http://www.loc.gov/standards/alto/ns-v2#, to reflect that the standard is now maintained by the Library of Congress. The previous namespace URI reflected maintenance by CCS.
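
The requirements above can be illustrated with a minimal sketch that builds the smallest document this outline describes: an <alto> root in the 2.0 namespace with one <Layout> and one <Page>. The function name build_minimal_alto and the sample values "P1" and 1 are hypothetical, and the result is not checked against the schema itself, which may require more than this outline states.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"  # namespace URI given above


def build_minimal_alto(page_id, physical_img_nr):
    """Build <alto><Layout><Page/></Layout></alto> in the ALTO 2.0 namespace.

    Hypothetical helper: it covers only the requirements named in this
    outline (a Layout, a Page, and the Page's ID and PHYSICAL_IMG_NR).
    """
    ET.register_namespace("", ALTO_NS)  # serialize ALTO as the default namespace
    alto = ET.Element(f"{{{ALTO_NS}}}alto")
    layout = ET.SubElement(alto, f"{{{ALTO_NS}}}Layout")
    ET.SubElement(
        layout,
        f"{{{ALTO_NS}}}Page",
        {"ID": page_id, "PHYSICAL_IMG_NR": str(physical_img_nr)},
    )
    return alto


root = build_minimal_alto("P1", 1)
print(ET.tostring(root, encoding="unicode"))
```

Validating the output against the published schema remains the only reliable test of conformance.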

Root element in ALTO element set

alto
Required: Yes.
Usage: Root Element for bundling text layout technical metadata.
Attributes: None.
Contains AS SEQUENCE: Description, Styles, Layout.
Contained by: None.

Top-level ALTO elements

These elements are direct children of the <alto> root element. They are listed in the order in which they may appear.

Description
Required: No.
Usage: Describes general settings of the ALTO file, such as measurement units and metadata.
Attributes: None.
Contains AS SEQUENCE: MeasurementUnit, sourceImageInformation, OCRProcessing.
Contained by: alto.

Styles
Required: No.
Usage: Styles define properties of layout elements. A style defined on a parent element is used as the default style for all of its child elements.
Attributes: None.
Contains AS SEQUENCE: TextStyle, ParagraphStyle.
Contained by: alto.

Layout
Required: Yes.
Usage: The root Layout element.
Attributes: STYLEREFS.
Contains AS SEQUENCE: Page.
Contained by: alto.
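
The sequence rule for the top-level elements can be checked with a short sketch like the one below. It assumes only what this outline states (the order Description, Styles, Layout, with Layout required); the names check_top_level and local_name are hypothetical, and the sketch does not replace validation against the schema.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"
TOP_LEVEL_ORDER = ["Description", "Styles", "Layout"]  # sequence listed above


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def check_top_level(alto):
    """Return a list of problems with the children of <alto>; empty means OK."""
    problems = []
    names = [local_name(child.tag) for child in alto]
    unknown = [n for n in names if n not in TOP_LEVEL_ORDER]
    if unknown:
        problems.append(f"unexpected children: {unknown}")
    known = [n for n in names if n in TOP_LEVEL_ORDER]
    if known != sorted(known, key=TOP_LEVEL_ORDER.index):
        problems.append(f"children out of sequence: {known}")
    if "Layout" not in names:
        problems.append("missing required <Layout>")
    return problems


good = ET.fromstring(
    f'<alto xmlns="{ALTO_NS}"><Description/><Styles/><Layout/></alto>'
)
bad = ET.fromstring(f'<alto xmlns="{ALTO_NS}"><Styles/><Description/></alto>')
print(check_top_level(good))
print(check_top_level(bad))
```

The first document reports no problems; the second reports both an ordering problem and the missing <Layout>.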

<Description> elements

These elements are contained by the <Description> element underneath <alto>. They are listed in the order in which they may appear.

MeasurementUnit
Required: No.
Usage: All measurement values in the ALTO file, except font size, are expressed in this unit. The default is 1/10 of a millimeter (mm10).
Attributes: None.
Contains ENUMERATED VALUES: dpi, pixel, mm10, inch1200.
Contained by: Description.

sourceImageInformation
Required: No.
Usage: Information to identify the image file from which the OCR text was created.
Attributes: None.
Contains AS SEQUENCE: fileName, fileIdentifier.
Contained by: Description.

OCRProcessing
Required: No.
Usage: Information on how the text was created, including preprocessing, OCR processing, and postprocessing steps. Where possible, this draws from MIX's change history.
Attributes: ID.
Contains: preProcessingStep, ocrProcessingStep, postProcessingStep.
Contained by: Description.
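
Because every measurement depends on MeasurementUnit, a consumer usually reads the unit first and then normalizes values. The sketch below converts a value to millimeters, assuming mm10 means 1/10 mm and inch1200 means 1/1200 inch, as the names suggest. The functions measurement_unit and to_millimeters are hypothetical; pixel values need an image resolution supplied by the caller, and the dpi value is left unsupported because this outline does not describe how to interpret it.

```python
import xml.etree.ElementTree as ET
from typing import Optional

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"
NS = {"alto": ALTO_NS}
MM_PER_INCH = 25.4


def measurement_unit(alto):
    """Return the declared MeasurementUnit, or 'mm10' (the stated default)."""
    unit = alto.findtext(
        "alto:Description/alto:MeasurementUnit", default="", namespaces=NS
    )
    return unit.strip() or "mm10"


def to_millimeters(value, unit, image_dpi: Optional[float] = None):
    """Convert one ALTO measurement value to millimeters."""
    if unit == "mm10":
        return value / 10.0
    if unit == "inch1200":
        return value / 1200.0 * MM_PER_INCH
    if unit == "pixel":
        if image_dpi is None:
            raise ValueError("pixel values need the image resolution (dpi)")
        return value / image_dpi * MM_PER_INCH
    raise ValueError(f"unsupported MeasurementUnit: {unit!r}")


doc = ET.fromstring(
    f'<alto xmlns="{ALTO_NS}"><Description>'
    "<MeasurementUnit>inch1200</MeasurementUnit>"
    "</Description><Layout/></alto>"
)
unit = measurement_unit(doc)
print(unit, to_millimeters(2400, unit))
```

Here 2400 units of 1/1200 inch equal 2 inches, or 50.8 mm.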

<Styles> elements

These elements are contained by the <Styles> element underneath <alto>. They are listed in the order in which they may appear.

TextStyle
Required: No.
Usage: A text style defines font properties of text.
Attributes: ID, FONTWIDTH, FONTTYPE, FONTSTYLE, FONTFAMILY, FONTCOLOR, FONTSIZE.
Contains: EMPTY ELEMENT.
Contained by: Styles.

ParagraphStyle
Required: No.
Usage: A paragraph style defines formatting properties of text blocks.
Attributes: ID, RIGHT, LEFT, ALIGN, LINESPACE, FIRSTLINE.
Contains: EMPTY ELEMENT.
Contained by: Styles.
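
Since both style elements are empty and carry all their information in attributes, they can be gathered into a simple lookup table keyed by ID. The sketch below does that for the attributes listed above. The function collect_styles is hypothetical, and the attribute values in the sample document are illustrative only, not taken from the schema.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"
NS = {"alto": ALTO_NS}

# Attribute names as listed in this outline (ID is handled separately).
STYLE_ATTRS = {
    "TextStyle": ("FONTFAMILY", "FONTTYPE", "FONTWIDTH",
                  "FONTSIZE", "FONTCOLOR", "FONTSTYLE"),
    "ParagraphStyle": ("ALIGN", "LEFT", "RIGHT", "LINESPACE", "FIRSTLINE"),
}


def collect_styles(alto):
    """Map each style ID to (element name, {attribute: value})."""
    styles = {}
    for kind, attrs in STYLE_ATTRS.items():
        for el in alto.findall(f"alto:Styles/alto:{kind}", NS):
            props = {a: el.get(a) for a in attrs if el.get(a) is not None}
            styles[el.get("ID")] = (kind, props)
    return styles


# Sample values below are illustrative assumptions.
doc = ET.fromstring(
    f'<alto xmlns="{ALTO_NS}"><Styles>'
    '<TextStyle ID="TXT_0" FONTFAMILY="Times New Roman" FONTSIZE="10"/>'
    '<ParagraphStyle ID="PAR_0" ALIGN="Left"/>'
    "</Styles><Layout/></alto>"
)
styles = collect_styles(doc)
for style_id, (kind, props) in styles.items():
    print(style_id, kind, props)
```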

<Layout> elements

These elements are contained by the <Layout> element underneath <alto>. They are listed in the order in which they may appear.

Page
Required: Yes.
Usage: One page of a book or journal.
Attributes: ID, PHYSICAL_IMG_NR, PRINTED_IMG_NR, PAGECLASS, PROCESSING, STYLEREFS, HEIGHT, WIDTH, QUALITY, POSITION.
Contains AS SEQUENCE: TopMargin, LeftMargin, RightMargin, BottomMargin, PrintSpace.
Contained by: Layout.
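
The <Page> rules in this outline (an ID and a PHYSICAL_IMG_NR on every page, and page regions in the listed sequence) can be checked with a sketch such as the following. The function check_pages is hypothetical, and it tests only what this outline states, not the full schema.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"
NS = {"alto": ALTO_NS}
REQUIRED_PAGE_ATTRS = ("ID", "PHYSICAL_IMG_NR")  # as stated in this outline
PAGE_REGIONS = ["TopMargin", "LeftMargin", "RightMargin",
                "BottomMargin", "PrintSpace"]


def check_pages(alto):
    """Return a list of problems found in the <Page> elements; empty means OK."""
    problems = []
    pages = alto.findall("alto:Layout/alto:Page", NS)
    if not pages:
        problems.append("<Layout> contains no <Page>")
    for number, page in enumerate(pages, start=1):
        for attr in REQUIRED_PAGE_ATTRS:
            if not page.get(attr):
                problems.append(f"Page {number}: missing {attr}")
        # Compare the regions found with the sequence listed above.
        regions = [child.tag.rsplit("}", 1)[-1] for child in page]
        known = [r for r in regions if r in PAGE_REGIONS]
        if known != sorted(known, key=PAGE_REGIONS.index):
            problems.append(f"Page {number}: regions out of sequence: {known}")
    return problems


doc = ET.fromstring(
    f'<alto xmlns="{ALTO_NS}"><Layout>'
    '<Page ID="P1" PHYSICAL_IMG_NR="1"><TopMargin/><PrintSpace/></Page>'
    '<Page ID="P2"><PrintSpace/><TopMargin/></Page>'
    "</Layout></alto>"
)
for problem in check_pages(doc):
    print(problem)
```

In this sample the first page passes, while the second is reported for its missing PHYSICAL_IMG_NR and for placing PrintSpace before TopMargin.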

ALTO attributes

These attributes may appear on given elements within ALTO. The sorting is alphabetical.

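ID
Usage: A valid identifier as defined by the XML Schema specification.
Contained by: OCRProcessing, TextStyle, ParagraphStyle, Page.

STYLEREFS
Usage: A list of IDREFs that binds an element to the IDs of the style elements (TextStyle, ParagraphStyle) that apply to it.
Contained by: Layout, Page.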
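
Resolving STYLEREFS amounts to looking up each referenced ID among the children of <Styles>. The sketch below assumes the attribute holds a whitespace-separated list of IDs, as the XML Schema IDREFS type does; the function resolve_stylerefs and the sample IDs are hypothetical.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"
NS = {"alto": ALTO_NS}


def resolve_stylerefs(alto, element):
    """Return the style elements that `element` references via STYLEREFS.

    Assumes STYLEREFS is a whitespace-separated list of style IDs.
    Raises KeyError if a referenced ID is not defined under <Styles>.
    """
    by_id = {style.get("ID"): style for style in alto.findall("alto:Styles/*", NS)}
    refs = element.get("STYLEREFS", "").split()
    missing = [ref for ref in refs if ref not in by_id]
    if missing:
        raise KeyError(f"STYLEREFS point to undefined IDs: {missing}")
    return [by_id[ref] for ref in refs]


# Sample IDs and attribute values are illustrative assumptions.
doc = ET.fromstring(
    f'<alto xmlns="{ALTO_NS}"><Styles>'
    '<TextStyle ID="TXT_0" FONTSIZE="10"/>'
    '<ParagraphStyle ID="PAR_0" ALIGN="Left"/>'
    "</Styles><Layout>"
    '<Page ID="P1" PHYSICAL_IMG_NR="1" STYLEREFS="TXT_0 PAR_0"/>'
    "</Layout></alto>"
)
page = doc.find("alto:Layout/alto:Page", NS)
resolved = resolve_stylerefs(doc, page)
print([style.get("ID") for style in resolved])
```

An element without a STYLEREFS attribute resolves to an empty list, which leaves any inheritance from parent elements to the caller.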

ALTO (Analyzed Layout and Text Object) is an XML Schema that details technical metadata for describing the layout and content of physical text resources, such as pages of a book or a newspaper. It most commonly serves as an extension schema used within the administrative metadata section of the Metadata Encoding and Transmission Schema (METS). However, ALTO instances can also exist as standalone documents, used independently of METS.