ALTO Technical Metadata for Optical Character Recognition (OCR)

About ALTO


The Analyzed Layout and Text Object (ALTO) XML Schema was initially developed by the METAe project group for use with the Library of Congress' Metadata Encoding and Transmission Standard (METS). While METS excels at describing the structure of objects, it lacked a schema for the content and layout information of each piece of the object. Claus Gravenhorst, who helped create ALTO for the METAe project, states that:

"During the METAe project, we learned that there is no standard to handle word positions and physical layout information (print space, margins, etc.), an essential feature for high performance repositories that are able to highlight elements within documents. Therefore, the ALTO schema has been developed. In the METS file, there are file pointers to the ALTO files that contain the text, other elements (illustrations, etc.), and word positions. We would like ALTO or a similar schema to become a standard as we do not see an alternative right now." [1]

CCS Content Conversion Specialists GmbH, which has played a crucial role in ALTO's development since its creation during the METAe project, maintained the ALTO standard until August 2009, when the Library of Congress (LC) Network Development and MARC Standards Office became the official maintenance agency for the ALTO XML Schema. At that time, LC set up an Editorial Board to help shape and advocate for ALTO. The Board oversees maintenance of the ALTO XML Schema and helps foster its usage in the digital library community.

ALTO (Analyzed Layout and Text Object) is an XML Schema that defines technical metadata for describing the layout and content of physical text resources, such as the pages of a book or a newspaper. It most commonly serves as an extension schema within the administrative metadata section of a Metadata Encoding and Transmission Standard (METS) document. However, an ALTO instance can also exist as a standalone document, used independently of METS.
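
To make the idea of "layout and content" metadata more concrete, the following is a minimal sketch of reading word text and word positions from a standalone ALTO document with Python's standard library. The sample XML is hand-written for illustration and has not been validated against the schema. The element and attribute names (String, CONTENT, HPOS, VPOS, WIDTH, HEIGHT) and the version 3 namespace are taken from the published ALTO schema rather than from the text above. The function names local_name and extract_words are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, illustrative ALTO fragment (assumption: ALTO v3 namespace;
# not validated against the schema and not taken from a real OCR run).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="P1" WIDTH="2400" HEIGHT="3200">
      <PrintSpace HPOS="200" VPOS="300" WIDTH="2000" HEIGHT="2600">
        <TextBlock ID="TB1" HPOS="200" VPOS="300" WIDTH="2000" HEIGHT="120">
          <TextLine ID="TL1" HPOS="200" VPOS="300" WIDTH="900" HEIGHT="60">
            <String CONTENT="Analyzed" HPOS="200" VPOS="300" WIDTH="420" HEIGHT="60"/>
            <SP HPOS="620" VPOS="300" WIDTH="30"/>
            <String CONTENT="Layout" HPOS="650" VPOS="300" WIDTH="450" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml):
    """Return (text, hpos, vpos, width, height) for every String element."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) != "String":
            continue  # skip layout containers and SP (space) elements
        words.append((
            element.get("CONTENT", ""),
            float(element.get("HPOS", "0")),
            float(element.get("VPOS", "0")),
            float(element.get("WIDTH", "0")),
            float(element.get("HEIGHT", "0")),
        ))
    return words


if __name__ == "__main__":
    for text, hpos, vpos, width, height in extract_words(SAMPLE_ALTO):
        print(f"{text!r} at ({hpos}, {vpos}), size {width} x {height}")
```

Matching on the local element name instead of a fixed namespace is a design choice that lets the same sketch read files written against different ALTO versions. A viewer that highlights search hits, as described in the quotation above, could use the returned coordinates to draw a box over each word on the page image.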