ALTO Technical Metadata for Optical Character Recognition (OCR)

Structure of ALTO Files

An ALTO file consists of three major sections as children of the root <alto> element:

  • <Description>
  • <Styles>
  • <Layout>

The <Description> section contains metadata about the ALTO file itself and processing information on how the file was created.

The <Styles> section contains the text and paragraph styles with their individual descriptions:

  • <TextStyle> has font descriptions
  • <ParagraphStyle> has paragraph descriptions, e.g. alignment information

The <Layout> section contains the content information. It is subdivided into <Page> elements.

A page consists of margins and a print space, all of which are non-intersecting rectangular areas within the page area. Each of these areas can contain any number of objects, such as lines, images and textblocks. A textblock is divided into textlines, which are in turn divided into strings and spaces.
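
The nesting described above (textblock, textline, string, space) can be walked with a short script. The following is a minimal sketch rather than a reference implementation. It assumes the element names TextBlock, TextLine, String and SP and the CONTENT attribute on String, which come from the ALTO schema and are not spelled out on this page. The sample document is invented for illustration, and namespace prefixes are simply stripped so the same code works whichever ALTO version namespace a file declares.

```python
import xml.etree.ElementTree as ET

# Invented sample for illustration only; real ALTO files normally declare
# a version-specific namespace and carry coordinates on every element.
SAMPLE = """<alto>
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="Hello"/>
            <SP/>
            <String CONTENT="world"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Second"/>
            <SP/>
            <String CONTENT="line"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def page_text_lines(alto_xml):
    """Return one plain-text string per TextLine, in document order."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        parts = []
        for child in elem:
            name = local_name(child.tag)
            if name == "String":
                # Assumption: the recognized word is held in CONTENT.
                parts.append(child.get("CONTENT", ""))
            elif name == "SP":
                # A space element separates two strings on the line.
                parts.append(" ")
        lines.append("".join(parts))
    return lines


if __name__ == "__main__":
    for line in page_text_lines(SAMPLE):
        print(line)
```

Matching on local names keeps the sketch short. Code that must distinguish ALTO versions would match on the full namespace-qualified tag instead.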

The global structure of the ALTO file is as follows:

<alto>
  <Description>
    <MeasurementUnit/>
    <sourceImageInformation/>
    <Processing/>
  </Description>
  <Styles>
    <TextStyle/>
    <ParagraphStyle/>
  </Styles>
  <Layout>
    <Page>
      <TopMargin/>
      <LeftMargin/>
      <RightMargin/>
      <BottomMargin/>
      <PrintSpace/>
    </Page>
  </Layout>
</alto>
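
As a rough illustration of the skeleton above, the sketch below reports which of the three top-level sections are absent from a document. The function names and the SKELETON constant are hypothetical, chosen for this example. The check is purely structural and is not schema validation: whether a given section is mandatory is decided by the ALTO schema version in use, so an absent section is reported here without being treated as an error.

```python
import xml.etree.ElementTree as ET

# The three sections named in the text, in document order.
EXPECTED_SECTIONS = ("Description", "Styles", "Layout")

# Reduced copy of the skeleton, used only to exercise the functions.
SKELETON = """<alto>
  <Description/>
  <Styles/>
  <Layout>
    <Page/>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def top_level_sections(alto_xml):
    """Return the local names of the direct children of the root element."""
    root = ET.fromstring(alto_xml)
    if local_name(root.tag) != "alto":
        raise ValueError("root element is not <alto>")
    return [local_name(child.tag) for child in root]


def missing_sections(alto_xml):
    """Return the expected sections that are not direct children of <alto>."""
    present = set(top_level_sections(alto_xml))
    return [name for name in EXPECTED_SECTIONS if name not in present]


if __name__ == "__main__":
    print(top_level_sections(SKELETON))
    print(missing_sections("<alto><Layout/></alto>"))
```

For real validation, the published XSD for the relevant ALTO version is the authority. This sketch only mirrors the three-section outline given in the text.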

ALTO (Analyzed Layout and Text Object) is an XML Schema that defines technical metadata for describing the layout and content of physical text resources, such as pages of a book or a newspaper. It most commonly serves as an extension schema used within the administrative metadata section of the Metadata Encoding and Transmission Schema (METS). However, ALTO instances can also exist as standalone documents used independently of METS.