Sustainability of Digital Formats: Planning for Library of Congress Collections


Microsoft Outlook PST 97-2002 (ANSI)


Identification and description

Full name Microsoft Outlook 97-2002 Personal Folders File (ANSI)
Description

The Personal Folders File or PST is an open proprietary data file format used to store local copies of messages, calendar events, and other items within Microsoft software including Microsoft Office Outlook. PST files are used to store archived items and to maintain off-line availability of the items.

PST shares the same Personal Folders File format (PFF) structure as Offline Storage Table (OST) and Personal Address Book (PAB).

PST is a stand-alone, self-contained, structured binary file format that has no external dependencies. Each PST file represents a message store that contains an arbitrary hierarchy of Folder objects, which contain Message objects, which can in turn contain Attachment objects. Information about Folder objects, Message objects, and Attachment objects is stored in properties, which collectively contain all of the information about the particular item.

A PST file is organized as two B-trees with 512-byte nodes and leaves. Its architecture is based on three logical layers:

  • An NDB (Node Database) layer that allocates physical blocks of storage. The NDB layer consists of the header, file allocation information, blocks, nodes, and two BTrees: the Node BTree (NBT) and the Block BTree (BBT). The Block BTree implements storage allocation within the PST file, based on data blocks of up to 8 KB.
  • An LTP (Lists, Tables, and Properties) layer that implements higher-level concepts on top of the NDB construct and contains the core elements Property Context (PC) and Table Context (TC). A PC represents a collection of properties. A TC represents a two-dimensional table in which each row is a collection of properties and the columns identify which properties the rows contain.
  • A Messaging layer (sometimes referred to as the PST layer) that implements folder objects, message objects, etc. as structures of lists, tables, and properties.

For example, a message object consists logically of a set of properties, a recipients table, the message content, an optional attachment table, and attachments (which have their own sets of properties and content). A message node connects the message object to its parent folder, to the data block in which its properties are stored, and to the sub-nodes representing the recipients table, attachment table, etc.
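
The logical model described above can be pictured with a few plain Python classes. This is only an illustrative sketch: the class names (PropertyContext, TableContext, Attachment, Message, Folder) and the helper count_messages are hypothetical stand-ins chosen for readability, not the on-disk structures or identifiers defined in the specification.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Hypothetical in-memory model of the Messaging layer. A real PST file
# stores these objects as nodes and blocks, not as Python objects.

PropertyContext = Dict[str, Any]  # a PC: one collection of properties


@dataclass
class TableContext:
    """A TC: columns name the properties, each row holds one set of values."""
    columns: List[str]
    rows: List[PropertyContext] = field(default_factory=list)

    def add_row(self, **values: Any) -> None:
        # Keep only the properties that the table declares as columns.
        self.rows.append({c: values.get(c) for c in self.columns})


@dataclass
class Attachment:
    properties: PropertyContext
    content: bytes = b""


@dataclass
class Message:
    properties: PropertyContext
    recipients: TableContext
    content: str = ""
    attachment_table: Optional[TableContext] = None  # optional, per the text
    attachments: List[Attachment] = field(default_factory=list)


@dataclass
class Folder:
    properties: PropertyContext
    subfolders: List["Folder"] = field(default_factory=list)
    messages: List[Message] = field(default_factory=list)


def count_messages(folder: Folder) -> int:
    """Count messages in a folder and, recursively, in its subfolders."""
    return len(folder.messages) + sum(count_messages(f) for f in folder.subfolders)
```

The sketch keeps properties as simple dictionaries so that the nesting of folders, messages, and attachments stays visible; it says nothing about how the NDB layer lays those objects out in blocks.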

The two versions of PST, PST_ANSI and PST_Unicode, are differentiated primarily by software implementation versions, character sets, maximum file size constraints and bit values.

Now considered a legacy format and replaced by PST_Unicode, PST_ANSI was used by Microsoft Outlook 97-2002. It employs the American National Standards Institute (ANSI) character set and has an overall size limit of 2 gigabytes (GB). PST_ANSI uses 32-bit values to represent block IDs (BIDs) and byte indexes (IBs).

Production phase PST files provide a mechanism for the centralized storage of email folders, email messages, their attachments, contacts, calendar items, etc.
Relationship to other formats
    Has later version PST_Unicode, Microsoft Outlook PST 2003 (Unicode)
    Affinity to TNEF, Transport Neutral Encapsulation Format

Local use

LC experience or existing holdings The Library of Congress includes PST_Unicode and PST_ANSI files in its collections, especially in the Manuscripts and Music Divisions as well as other personal papers repositories.
LC preference The Library of Congress Recommended Formats Statement (RFS) lists PST as an acceptable format for Email: For aggregated groups of messages. The RFS does not specify a version of PST.

Sustainability factors

Disclosure Fully documented. Proprietary file format developed by Microsoft.
    Documentation Microsoft [MS-PST]: Outlook Personal Folders (.pst) File Format specification available from Microsoft. See Format Specifications below.
Adoption

The Outlook .pst files are used for POP3, IMAP, and HTTP accounts and are supported by several Microsoft client applications, including Microsoft Exchange Client, Windows Messaging, and Microsoft Office Outlook.

PST_ANSI was implemented in Office Outlook versions 97-2002. Outlook 2003, Outlook 2007, and Outlook 2010 can read, write, and create both ANSI and Unicode PST files. By 2010 (when the specification was made public by Microsoft), PST_ANSI was considered a legacy format with a recommendation that it not be used to create new PST files. The default format was declared to be PST_Unicode.

At least two open-source software libraries have been developed to examine and manipulate PST files: libpff, a C library (with Python bindings partially implemented as of late 2013) for accessing PST and related formats; and PST File Format SDK, a cross-platform C++ library for reading PST files, developed under Microsoft auspices through a 2009-2010 project.

    Licensing and patents

Covered by the Microsoft Open Specification Promise, whereby Microsoft "irrevocably promises" not to assert any claims against those making, using, and selling implementations of any specification covered by the promise (so long as those accepting the promise refrain from suing Microsoft for patent infringement in relation to Microsoft's implementation of the covered specification).

Transparency

Depends upon algorithms and tools to read; building such tools requires sophistication. Text in messages may be encrypted and, even when not encrypted, cannot be rendered by a simple text viewer. Even Microsoft, in its "Understanding the Outlook MS-PST Binary File Format," admits that "understanding and working with binary file formats in general, and the MS-PST file format in particular, can be a challenge. Fortunately, the PST File Format SDK exists to make this easier."

Joachim Metz, in "Personal Folder File (PFF) Forensics: Analyzing the Horrible Reference File Format," says "the actual data of an item within a PFF is scattered over different data structures...the bad news for forensic analysis is that PFF obfuscates the information in the data structures which makes a basic text string search impossible."

Self-documentation

The PST format version is declared in the file header. According to the specification, the wVer field for a PST_ANSI file must have a value of 14 or 15. Folder objects, message objects, and attachment objects all have properties which include the header fields users typically see in an email application as well as many properties relating to the status, management, and history of the object in an Outlook application. A message object also has a recipients table that identifies each recipient and may have an attachments table that lists and identifies attachments.
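
As a rough illustration of how the version declaration can be checked, the Python sketch below inspects the first 12 bytes of a file. It relies only on values given in this description: the magic number !BDN at the start of the file, the bytes 53 4D at offset 8, and a wVer value of 14 or 15 stored immediately after them in little-endian order (which is what the signatures 53 4D 0E 00 and 53 4D 0F 00 imply). The function and constant names are hypothetical, and bytes 4-7 are not examined.

```python
import struct
from typing import Optional

PFF_MAGIC = b"!BDN"        # bytes 0-3, shared by PST, OST and PAB
CLIENT_MAGIC_PST = b"SM"   # bytes 8-9 (hex 53 4D)
ANSI_VERSIONS = {14, 15}   # wVer values the specification gives for PST_ANSI


def pst_wver(header: bytes) -> Optional[int]:
    """Return wVer if the bytes look like a PST header, otherwise None."""
    if len(header) < 12 or header[0:4] != PFF_MAGIC:
        return None
    if header[8:10] != CLIENT_MAGIC_PST:
        return None
    # wVer follows the client magic as a little-endian 16-bit integer.
    (wver,) = struct.unpack_from("<H", header, 10)
    return wver


def is_pst_ansi(header: bytes) -> bool:
    """True only when the header declares a PST_ANSI version (14 or 15)."""
    return pst_wver(header) in ANSI_VERSIONS
```

In practice the header bytes would come from something like open(path, "rb").read(12), where path is a placeholder for a local file. A result of False means only that the file does not declare itself as PST_ANSI; the sketch does not try to classify other versions.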

External dependencies None
Technical protection considerations

PST files may have encoding applied to data blocks (although not to header and BTree "pages") to provide data obfuscation. A field in the file header indicates one of three levels/types for "encryption", defined in the specification as None; Compressible encryption; and High encryption. Although the term "encryption" is used in these definitions, the specification documents two keyless cipher algorithms used to encode the data blocks in the PST. The specification (in section 4) indicates that these algorithms "can be conveniently decoded once the exact encoding algorithm is understood." The compilers of this description are unable to confirm whether the open-source software libraries are able to decode "encrypted" PST files fully. Comments welcome.

Another form of protection is password protection, which is available as an option in PST files. However, the Microsoft specification admits that the password functionality in PST is not very robust, calling it "a superficial mechanism" that "does not provide any security benefit to preventing the PST data to be read by unauthorized parties." In essence, if a password is used, a CRC-32 hash of the password text is stored in the PidTagPstPassword property. If this property exists and is non-zero, implementations are expected to prompt the end user for a password, compute the CRC-32 hash of the user password, and verify it against the value stored in PidTagPstPassword. Because CRC-32 is a 32-bit checksum rather than a cryptographic hash, a password stored this way provides little functional security.
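
The password check described above can be sketched in a few lines of Python. This is an approximation under stated assumptions: zlib.crc32 is used as a stand-in for the CRC-32 routine in the specification, which may differ in its initial value or final step, and the cp1252 encoding of the password text is likewise assumed. The function names are hypothetical.

```python
import zlib
from typing import Optional


def password_hash(password: str) -> int:
    """CRC-32 of the password text.

    Assumption: zlib's CRC-32 and a cp1252 encoding stand in for the
    routine and encoding that an actual implementation would use.
    """
    return zlib.crc32(password.encode("cp1252")) & 0xFFFFFFFF


def may_open(stored_hash: Optional[int], supplied_password: str = "") -> bool:
    """Mimic the expected client behaviour for PidTagPstPassword."""
    # Property absent or zero: no password prompt is expected.
    if not stored_hash:
        return True
    # Otherwise compare the 32-bit checksum of what the user typed.
    return password_hash(supplied_password) == stored_hash
```

Note that the comparison involves only a 32-bit value, so any string that happens to produce the same checksum would be accepted, and nothing in this check alters the stored data itself. That is consistent with the specification's description of the mechanism as superficial.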


Quality and functionality factors

Text
Normal rendering PST_ANSI can represent text only in ANSI character encoding.
Integrity of document structure

At the physical level, the file starts with a header, followed by an optional density list, and then a series of mapping structures interspersed at set intervals between blocks of data. The mapping structures are of fixed size, and repeat as often as needed to encapsulate areas of data as the file grows.

At the logical level, a .pst file has three layers: the Node Database (NDB) layer, the Lists, Tables, and Properties (LTP) layer, and the Messaging layer.

An important structural issue of PST_ANSI files is their size limit (2 GB), which arises because PST_ANSI files contain only the initial FPMap in the header and no additional FPMap pages.

The semantic structure of messages (with their headers) in folders and attachments linked to messages is represented in the Messaging layer.

Since this format is designed for active use in an email system as a stand-alone message store, the full semantics required and/or observed in the system that generated the file are represented.


File type signifiers and format identifiers

Tag | Value | Note
Filename extension | pst |
Internet Media Type | application/vnd.ms-outlook | This is not registered with IANA but appears in Forensic Wiki's page on Personal Folder File (PAB, PST, OST).
Magic numbers | Hex: 21 42 44 4E; ASCII: !BDN | From specification. This specification applies to the three Microsoft content types (PST, OST and PAB) that share the general PFF structure.
File signature | Hex: 53 4D 0E 00 or Hex: 53 4D 0F 00 | Offset 8 bytes from start of file. In conjunction with the magic number at the beginning of the file, this identifies that the file is a PST file using the PST_ANSI version. The first value is more frequently found.
Pronom PUID | x-fmt/248 | PRONOM entry for Microsoft Outlook Personal Folders (ANSI). Identification based on internal signifier.
Wikidata Title ID | Q1480633 | See https://www.wikidata.org/wiki/Q1480633. Wikidata does not distinguish between versions of PST.

Notes

General

The data models used for message objects in MS Outlook and in the message format used for Internet mail are significantly different. Microsoft has specified a detailed mapping in each direction: for "MIME analysis" and for "MIME generation." [MS-OXCMAIL]: RFC 2822 and MIME to Email Object Conversion Algorithm describes this complex mapping in detail. An important aspect is the use of an extra "MIME skeleton" property to store all incoming MIME message content that cannot be mapped cleanly to Microsoft's message object properties (often referred to as MAPI properties) so that the message can be regenerated. As well as using the mapping in Outlook as the basis for email sent between Outlook application systems and the Internet (as opposed to message exchanges within and among Outlook systems), Microsoft provides a MAPI-MIME Conversion API as part of Outlook.

History  

Format specifications


Useful references

URLs


Last Updated: 02/28/2024