ALTO Technical Metadata for Optical Character Recognition (OCR)

ALTO <Styles> & <Layout> Usage

TextStyles

Text styles have no content. The attributes are:

  • FONTFAMILY
  • FONTSIZE
  • FONTCOLOR
  • FONTWEIGHT
  • FONTSTYLE
  • FONTPITCH
  • FONTCHARSET

Only FONTFAMILY and FONTSIZE are required.
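
As a minimal sketch of the attribute list above, the following Python snippet builds a TextStyle element with the standard library and checks for the two required attributes. The font values and the helper name `missing_required` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical style values; only the attribute names come from the list above.
style = ET.Element("TextStyle", {
    "FONTFAMILY": "Times New Roman",
    "FONTSIZE": "10",
    "FONTSTYLE": "bold",
})

# The two attributes the text above names as required.
REQUIRED = ("FONTFAMILY", "FONTSIZE")

def missing_required(element):
    """Return the required attribute names that the element lacks."""
    return [name for name in REQUIRED if name not in element.attrib]

print(ET.tostring(style, encoding="unicode"))
print(missing_required(style))
```

An empty list from `missing_required` means both required attributes are present.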

Paragraph Styles

Paragraph styles have no content. The attributes are:

  • ALIGN (Left, Right, Center, Block)
  • LEFT [numeric]
  • RIGHT [numeric]
  • LINESPACE [numeric]
  • FIRSTLINE [numeric]
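
The value constraints listed above can be checked with a short script. This is only a sketch: the function name `check_paragraph_style` and the sample values are assumptions, and a real workflow would more likely validate against the ALTO schema itself.

```python
import xml.etree.ElementTree as ET

ALIGN_VALUES = {"Left", "Right", "Center", "Block"}
NUMERIC_ATTRS = ("LEFT", "RIGHT", "LINESPACE", "FIRSTLINE")

def check_paragraph_style(element):
    """Return a list of problems found in a ParagraphStyle element."""
    problems = []
    align = element.get("ALIGN")
    if align is not None and align not in ALIGN_VALUES:
        problems.append(f"ALIGN has unexpected value {align!r}")
    for name in NUMERIC_ATTRS:
        value = element.get(name)
        if value is None:
            continue  # attribute not present, nothing to check
        try:
            float(value)
        except ValueError:
            problems.append(f"{name} is not numeric: {value!r}")
    return problems

# Hypothetical sample element for illustration.
sample = ET.fromstring('<ParagraphStyle ALIGN="Block" LEFT="0" LINESPACE="12.5"/>')
print(check_paragraph_style(sample))
```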

Attributes of a Page Element

  • PAGECLASS
  • STYLEREFS
  • HEIGHT
  • WIDTH
  • PHYSICAL_IMG_NR
  • PRINTED_IMG_NR
  • QUALITY (OK, Damaged, Missing)
  • POSITION (Left, Right, Foldout, Single)
  • PROCESSING (A link to processing information)

Page Areas

Each page is divided into different areas (TopMargin, LeftMargin, RightMargin, BottomMargin and PrintSpace). The margins may contain text or other objects that are not part of the main body.

The positions are given as HPOS, VPOS, WIDTH and HEIGHT.
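
To illustrate how a page and its areas fit together, the sketch below parses a hypothetical Page element and collects the position of each area. All coordinate values are invented, and the sample omits the XML namespace that real ALTO files normally declare.

```python
import xml.etree.ElementTree as ET

# Hypothetical page; every coordinate value is made up for illustration.
PAGE_XML = """
<Page HEIGHT="2970" WIDTH="2100" PHYSICAL_IMG_NR="1">
  <TopMargin HPOS="0" VPOS="0" WIDTH="2100" HEIGHT="200"/>
  <LeftMargin HPOS="0" VPOS="200" WIDTH="150" HEIGHT="2570"/>
  <RightMargin HPOS="1950" VPOS="200" WIDTH="150" HEIGHT="2570"/>
  <BottomMargin HPOS="0" VPOS="2770" WIDTH="2100" HEIGHT="200"/>
  <PrintSpace HPOS="150" VPOS="200" WIDTH="1800" HEIGHT="2570"/>
</Page>
"""

def page_areas(page):
    """Map each page area name to its (HPOS, VPOS, WIDTH, HEIGHT) tuple."""
    areas = {}
    for child in page:
        areas[child.tag] = tuple(
            float(child.get(name)) for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
        )
    return areas

areas = page_areas(ET.fromstring(PAGE_XML))
for name, box in areas.items():
    print(name, box)
```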

Margins

TopMargin
The area between the top line of print and the upper edge of the leaf. It may contain a page number or running title.
InnerMargin
The margin of a page adjacent to the binding edge of a book.
OuterMargin
The space between the text and the outer extremity of the leaf of a book. May contain margin notes.
BottomMargin
The area between the bottom line of letterpress or writing and the bottom edge of the leaf. It may contain a page number, a signature number or a catch word.
PrintSpace
Rectangle surrounding the printed area of a page. Page number and running title are not part of the print space.

The position of the margins on a page is illustrated in this picture:

[Figure: Example of margins on a printed page.]

Structure of a Page Area (PageSpace) Element

The page area elements have the attributes:

HPOS
Horizontal position of the upper-left corner (1/10 mm)
VPOS
Vertical position of the upper-left corner (1/10 mm)
WIDTH
Width (1/10 mm)
HEIGHT
Height (1/10 mm)
ROTATION
In degrees as floating point number (optional)

All subelements carry these same attributes with the same meaning, except SP, which has no HEIGHT.
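
Because positions and sizes are given in 1/10 mm, mapping them onto a scanned image requires the scan resolution. The following sketch shows the arithmetic; the function name and the 300 dpi figure are assumptions for illustration.

```python
MM_PER_INCH = 25.4

def tenth_mm_to_pixels(value, dpi):
    """Convert a measurement in 1/10 mm to pixels at the given resolution."""
    millimetres = value / 10.0
    return millimetres / MM_PER_INCH * dpi

# 254 tenths of a millimetre is 25.4 mm, i.e. one inch,
# so at an assumed 300 dpi this corresponds to 300 pixels.
print(tenth_mm_to_pixels(254, 300))
```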

Each page area may contain any number of elements, each of which is one of the following:

TextBlock
A block of text
ComposedBlock
A block that consists of other blocks
Illustration
A picture or image
GraphicalElement
A graphic used to separate blocks. Mostly a line or rectangle.

Each of them may have the following attributes:

ID
Unique ID
STYLEREFS
Reference to text or paragraph styles
HPOS
Horizontal position of the upper-left corner (1/10 mm)
VPOS
Vertical position of the upper-left corner (1/10 mm)
WIDTH
Width (1/10 mm)
HEIGHT
Height (1/10 mm)
ROTATION
In degrees as floating point number (optional)
IDNEXT
Reference to the next element related to reading order.

If the element is not rectangular, a SHAPE element may be added:

[Figure: Sample SHAPE XML fragment.]

Polygons are coded as x,y x,y ... with the coordinate pairs separated by spaces.

Circles and ellipses are allowed in principle but are not supported by some vendor tools such as docWORKS. Such tools instead represent these shapes as polygons with sufficient accuracy.
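
The polygon coding described above can be read back with a few lines of Python. This is a sketch only: the function names and the sample triangle are hypothetical, and the input is assumed to be the bare coordinate string rather than a full XML fragment.

```python
def parse_polygon_points(points):
    """Parse 'x,y x,y ...' into a list of (x, y) float tuples."""
    pairs = []
    for token in points.split():
        x_text, y_text = token.split(",")
        pairs.append((float(x_text), float(y_text)))
    return pairs

def bounding_box(pairs):
    """Return (HPOS, VPOS, WIDTH, HEIGHT) of the rectangle enclosing the polygon."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))

# Hypothetical triangle, coordinates in 1/10 mm.
triangle = parse_polygon_points("100,100 300,100 200,250")
print(triangle)
print(bounding_box(triangle))
```

The bounding box gives the rectangular HPOS, VPOS, WIDTH and HEIGHT values that would enclose the non-rectangular shape.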

A TextBlock is divided into lines and those are divided into strings, spaces and hyphens:

<TextBlock>
<TextLine>
<String/>
<SP/>
<HYP/>
</TextLine>
</TextBlock>

Meaning of those tags:

TextBlock
A paragraph of text
TextLine
A line of text
String
A single word
SP
White space
HYP
Hyphenation characteristics
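
The nesting above lends itself to simple text extraction. The sketch below rebuilds each line from its String, SP and HYP children; it relies on the CONTENT attribute of String, and it assumes that HYP should be rendered as a plain hyphen. The sample block is hypothetical and omits the namespace used in real ALTO files.

```python
import xml.etree.ElementTree as ET

# Hypothetical block for illustration.
BLOCK_XML = """
<TextBlock>
  <TextLine>
    <String CONTENT="Technical"/>
    <SP/>
    <String CONTENT="meta"/>
    <HYP/>
  </TextLine>
  <TextLine>
    <String CONTENT="data"/>
  </TextLine>
</TextBlock>
"""

def line_text(text_line):
    """Concatenate the words, spaces and hyphens of one TextLine."""
    parts = []
    for child in text_line:
        if child.tag == "String":
            parts.append(child.get("CONTENT", ""))
        elif child.tag == "SP":
            parts.append(" ")
        elif child.tag == "HYP":
            parts.append("-")  # assumption: show the hyphen as "-"
    return "".join(parts)

block = ET.fromstring(BLOCK_XML)
lines = [line_text(line) for line in block.iter("TextLine")]
print(lines)  # ['Technical meta-', 'data']
```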

Additional Attributes of the tags

Element         Attribute name    Description

TextBlock       language          ISO 639-2 language code

String          CONTENT           String content (word)
                SUBS_TYPE         HypPart1: the content is the first part of a hyphenated word; applies only to the last word of a line, if it is hyphenated
                                  HypPart2: the content is the second part of a hyphenated word; applies only to the first word of a line, if it is hyphenated
                SUBS_CONTENT      Complete content of a hyphenated word
                WC                Word confidence: confidence level of the OCR result for this string. A float value between 0 (unsure) and 1 (confident)
                CC                Confidence level of each character in the string. A list of numbers, one per character, each between 0 (confident) and 9 (unsure)
                STYLEREFS         Text style used for this string, if it differs from the parent text block style
                STYLE             Any combination of font styles (italics, bold, …)
                ALTERNATIVE       (element) Any number of alternative strings to be used instead

Illustration    TYPE              A user-defined description of the type of the illustration
                FILEID            A link to a separate file that contains just the illustration

ComposedBlock   TYPE              A user-defined description of the type of the composed block
                FILEID            A link to a separate file that contains just the composed block
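
As a sketch of how the String attributes above might be used together, the following Python snippet rejoins hyphenated words via SUBS_TYPE and SUBS_CONTENT and flags words with a low WC value. The sample XML, the function names and the 0.5 threshold are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical block; the confidence values are invented.
LINES_XML = """
<TextBlock>
  <TextLine>
    <String CONTENT="Technical" WC="0.98"/>
    <SP/>
    <String CONTENT="meta" SUBS_TYPE="HypPart1" SUBS_CONTENT="metadata" WC="0.91"/>
    <HYP/>
  </TextLine>
  <TextLine>
    <String CONTENT="data" SUBS_TYPE="HypPart2" SUBS_CONTENT="metadata" WC="0.40"/>
  </TextLine>
</TextBlock>
"""

def words(block):
    """Return the words of a block, joining hyphenated halves via SUBS_CONTENT."""
    result = []
    for string in block.iter("String"):
        subs_type = string.get("SUBS_TYPE")
        if subs_type == "HypPart1":
            result.append(string.get("SUBS_CONTENT"))
        elif subs_type == "HypPart2":
            continue  # the full word was already emitted with the first half
        else:
            result.append(string.get("CONTENT"))
    return result

def low_confidence(block, threshold=0.5):
    """Return CONTENT values whose WC is below the (assumed) threshold."""
    return [
        s.get("CONTENT")
        for s in block.iter("String")
        if s.get("WC") is not None and float(s.get("WC")) < threshold
    ]

block = ET.fromstring(LINES_XML)
print(words(block))           # ['Technical', 'metadata']
print(low_confidence(block))  # ['data']
```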

ALTO (Analyzed Layout and Text Object) is an XML Schema that details technical metadata for describing the layout and content of physical text resources, such as pages of a book or a newspaper. It most commonly serves as an extension schema used within the administrative metadata section of the Metadata Encoding and Transmission Schema (METS). However, ALTO instances can also exist as standalone documents used independently of METS.