Sustainability of Digital Formats: Planning for Library of Congress Collections


FASTA Database Format

Format Description Properties Explanation of format description terms

Identification and description

Full name FASTA Database Format
Description

FASTA is a text-based bioinformatics data format used to store nucleotide sequences (e.g. Deoxyribonucleic Acid [DNA] or Ribonucleic Acid [RNA]) or amino acid (protein) sequences. Each file can store a single sequence or multiple sequences.

FASTA is pronounced "Fast A" ("fast-aye") because the name is a shortening of "FAST-All". The name reflects that FASTA evolved from the earlier tools "FAST-P" (protein) and "FAST-N" (nucleotide) and combines their abilities to work with "all" sequence types (both nucleotides and proteins).

FASTA was created by David J. Lipman and William R. Pearson and debuted in their 1985 paper Rapid and sensitive protein similarity searches. Pearson maintains a relationship with FASTA, but the format does not have any formal governing body or specifications.

Sequences in FASTA are represented by single-letter alphabetical codes outlined in IUPAC (International Union of Pure and Applied Chemistry) notation for nucleic acids and amino acids. ASCII characters outside of the IUPAC codes are not accepted. If such characters exist in a sequence, it is recommended that they be removed or replaced by an appropriate letter code, such as "N", which represents an unknown nucleic acid residue, or "X", which represents an unknown amino acid residue. Because FASTA files are structured text, they can be viewed and edited in any basic text editor. They may be further analyzed with DNA analysis software or bioinformatics software.
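The replacement recommended above can be sketched in Python. This is a minimal illustration rather than part of any FASTA specification: the function name `replace_unknown` and both alphabets are assumptions of this sketch, and individual tools may accept a different set of characters.

```python
# Assumed alphabets for this sketch: IUPAC nucleotide codes plus the gap
# hyphen, and an amino acid alphabet that also allows U and *.
NUCLEOTIDE_CODES = frozenset("ACGTURYSWKMBDHVN-")
AMINO_ACID_CODES = frozenset("ABCDEFGHIKLMNPQRSTUVWXYZ*-")


def replace_unknown(sequence, is_protein=False):
    """Replace characters outside the assumed alphabet with N (or X for protein).

    Case is preserved for accepted characters, because some software gives
    lowercase letters a meaning of their own.
    """
    allowed = AMINO_ACID_CODES if is_protein else NUCLEOTIDE_CODES
    placeholder = "X" if is_protein else "N"
    return "".join(
        ch if ch.upper() in allowed else placeholder for ch in sequence
    )
```

For example, `replace_unknown("ACGT?ac1")` keeps the accepted letters and substitutes "N" for the question mark and the digit.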

A software suite of tools developed by the FASTA format creators, Lipman and Pearson, is also named FASTA. This software suite is the evolution of a previously used software named FASTP. The FASTA software accepts the FASTA format. However, the FASTA format extends beyond the FASTA software suite. See "Support for Software Interface" for more details.

Because the FASTA format was born from an academic paper and has no formal oversight, the standards regarding alphanumeric characters are ambiguous, and some software may adhere to stricter rules than others. The FASTA algorithm is open source; however, access to the journal article is behind a paywall.

Subsequent usage of FASTA by others has been documented with additional structure and rules. These rules may not apply to the format as a whole, only to each organization or group's usage of the format.

Production phase

May be used in production, as an exchange format, and as a storage format.

Relationship to other formats
    Has extension FASTQ appends a quality score to the FASTA data. Not described separately at this time. See Notes for details.
    Has extension A2M is a truncated version of FASTA. Not described separately at this time. See Notes for details.
    Has extension A3M is a truncated version of FASTA. Not described separately at this time. See Notes for details.

Local use

LC experience or existing holdings

The Library of Congress, as of August 2023, has close to 500,000 FASTA files in its collections.

LC preference

The Library of Congress has not defined preferences for bioinformatic data. See the Recommended Formats Statement for information about dataset formats.

Sustainability factors

Disclosure

FASTA is a text-based format that originated with an algorithm developed by David J. Lipman and William R. Pearson and first published in their 1985 paper Rapid and Sensitive Protein Similarity Searches. This paper is not available via open access. Additional documentation and rules are available from the University of Michigan's Zhang Group and the National Center for Biotechnology Information.

Documentation

Original format concept, not open access:

  • Lipman, D. J. & Pearson, W. R. (1985). Rapid and sensitive protein similarity searches. Science, 227(4693), 1435–1441.

FASTA software defining paper:

  • Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, 85(8), 2444–2448.

National Library of Medicine (NLM) National Center for Biotechnology Information (NCBI) FASTA documentation:

  • Guidelines for Import to GenBank: U.S. National Library of Medicine. (n.d.). FASTA format for nucleotide sequences. National Center for Biotechnology Information.
  • BLAST software topics: U.S. National Library of Medicine. (n.d.-b). Query input and database selection - BLASTTOPICS 0.1.1 documentation. National Center for Biotechnology Information.

University of Michigan Zhang Lab FASTA documentation:

  • What is FASTA format?. Zhang Lab. (n.d.).

Adoption

FASTA has been adopted as a general-use format by many bioinformatics and DNA analysis software packages. Many software suites, scripts, toolboxes, online databases, and servers can read, edit, write, analyze, and visualize this format.

Notable adopters of FASTA are the National Library of Medicine's (NLM) National Center for Biotechnology Information (NCBI) and the University of Michigan's Zhang Lab. NCBI maintains the Basic Local Alignment Search Tool (BLAST) program and algorithm. BLAST is used for comparing biological sequence data and accepts the FASTA format. An additional, non-exhaustive, list of software that can read and analyze FASTA:

Licensing and patents

None. Comments welcome.

Transparency

This is a transparent, human-readable, structured text file accessible in any basic text editor.

Relationships and other context require the use of DNA analysis or bioinformatics software.

Self-documentation

This is a human-readable structured text file.

Self-documentation identifying each sequence is limited to the description line for that sequence, which holds fewer than 80 characters.

FASTA does not contain dedicated space for authorship or additional metadata fields. The format functions essentially as a database.

External dependencies

External specialized software is not required to read, write, or edit this format. However, the use of specialized bioinformatics software is required to support analysis and visualization.

Technical protection considerations

None. There is no in-built mechanism for encryption, compression, or intellectual property protection.

Users may compress or encrypt their FASTA files with external tools, but neither is required.
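As an illustration of externally applied compression, the sketch below opens a FASTA file whether or not it has been compressed. gzip is used here only as one common example of an external tool; the helper name `open_fasta` is hypothetical, and the check relies on the standard two-byte gzip signature rather than on the filename.

```python
import gzip

GZIP_SIGNATURE = b"\x1f\x8b"  # first two bytes of any gzip stream


def open_fasta(path):
    """Open a FASTA file as ASCII text, decompressing gzip data if present."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_SIGNATURE
    if is_gzip:
        return gzip.open(path, "rt", encoding="ascii")
    return open(path, "r", encoding="ascii")
```

The returned object can be used in a `with` statement and read line by line like any other text file.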


Quality and functionality factors

Dataset
Normal functionality

This format is ASCII-based text in which each record is composed of two structural segments. The first segment is a single description line that begins with a greater-than (">") symbol and defines the sequence; it must be only one line and fewer than 80 characters. The second segment is the sequence itself, a series of letters that each represent a single amino acid or nucleic acid. The sequence may span multiple lines, but each line may not exceed 80 characters. Multiple sequences in one file are allowed. Blank lines are not allowed.
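The two-segment structure described above can be read with a few lines of Python. This is a minimal sketch under stated assumptions, not a reference parser: the name `parse_fasta` is hypothetical, blank lines are skipped rather than rejected, and any text before the first ">" line is ignored.

```python
def parse_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: tolerate blank lines instead of failing
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            # Sequence data may span several lines; join them later.
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)
```

Calling `list(parse_fasta(text))` on a file's contents returns one pair per sequence, with multi-line sequences joined into a single string.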

The letter codes used in sequence lines are standard IUB/IUPAC code values, with some exceptions:

  • Lowercase letters are mapped into uppercase
  • A single hyphen can represent a gap of indeterminate length
  • In amino acid sequences, U and * are acceptable
  • N for unknown nucleic acid residue
  • X for unknown amino acid residue
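The rules in the list above can be checked programmatically. The following sketch reports characters that fall outside an accepted alphabet; the name `invalid_positions` and the exact alphabets are assumptions made for illustration, and an organization's own rules may differ.

```python
# Assumed alphabets: IUPAC nucleic acid codes, and amino acid codes that
# include U and * as noted in the list. The hyphen stands for a gap.
NUCLEIC_ACID_CODES = frozenset("ACGTURYSWKMBDHVN-")
AMINO_ACID_CODES = frozenset("ABCDEFGHIKLMNPQRSTUVWXYZ*-")


def invalid_positions(sequence, is_protein=False):
    """Return (index, character) pairs for characters outside the alphabet.

    Lowercase letters are mapped to uppercase before checking, mirroring
    the first rule in the list.
    """
    allowed = AMINO_ACID_CODES if is_protein else NUCLEIC_ACID_CODES
    return [
        (index, ch)
        for index, ch in enumerate(sequence)
        if ch.upper() not in allowed
    ]
```

An empty result means every character was accepted under these assumed alphabets.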

The limitation of 80 characters per line is a strong recommendation, but not strictly required. See NCBI and Zhang Lab.
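When writing FASTA, the 80-character recommendation can be followed by wrapping the sequence at a fixed width. The sketch below is one simple way to do this; `format_record` and its `width` parameter are hypothetical names, and the description line is written as given without being shortened.

```python
def format_record(description, sequence, width=80):
    """Return one FASTA record with sequence lines wrapped at `width`."""
    if width < 1:
        raise ValueError("width must be a positive integer")
    lines = [">" + description]
    # Slice the sequence into consecutive chunks of at most `width` letters.
    lines.extend(
        sequence[start:start + width]
        for start in range(0, len(sequence), width)
    )
    return "\n".join(lines) + "\n"
```

Several records can be concatenated directly, since each returned string ends with a newline.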

Individual organizations may have their own standards for normal functionality.

See Useful references for more information.

Support for software interfaces (APIs, etc.)

FASTA Sequence Comparison software was developed and is maintained by FASTA format creator W. R. Pearson. This software was originally released in 1988 and is currently maintained (last updated May 2023, as of June 2023). The most recent version is called fasta36. Code for FASTA Sequence Comparison is available under an Apache License (Version 2.0), Copyright (c) 1996, 1997, 1998, 1999, 2002, 2014, 2015 by William R. Pearson and The Rector & Visitors of the University of Virginia.

See Official FASTA (software) website and fasta36 on GitHub.

FASTA format is also used and accessible by many other bioinformatic and DNA analysis software.

Data documentation (quality, provenance, etc.)

See Normal functionality.

Beyond normal functionality

NCBI uses lowercase letters to indicate masking of repetitive sequences in eukaryotic data. This is considered "soft" masking, in contrast to replacing the masked letters with the generic letter N to denote an undefined value. By default, all FASTA lowercase letters are mapped directly to uppercase letters, but some software (e.g. BLAST) can distinguish between upper- and lowercase. See NCBI's How do alignment programs treat the lower-case masking in genomic FASTA files?.
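Soft masking can be inspected with a short Python sketch. The two helper names are hypothetical, and the conversion to N assumes a nucleotide sequence; it illustrates the convention described above rather than how any particular tool handles masking.

```python
import re


def soft_masked_regions(sequence):
    """Return (start, end) pairs, end exclusive, for each lowercase run."""
    return [match.span() for match in re.finditer(r"[a-z]+", sequence)]


def hard_mask(sequence):
    """Replace every soft-masked (lowercase) letter with N."""
    return re.sub(r"[a-z]", "N", sequence)
```

Applying `hard_mask` discards the original letters, so the soft-masked form is the one to keep when the underlying sequence still matters.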

File type signifiers and format identifiers

Tag: Filename extension
Value: fasta, fas, fa, seq, fsa, fna, faa
Note: See NCBI file extensions, RNAstructure Command Line Help. Shortened FASTA filename. See https://www.seqanswers.com/forum/bioinformatics/bioinformatics-aa/17807-difference-between-fas-and-faa-files Used for nucleic acid sequences and for amino acid sequences.

Tag: Internet Media Type
Value: text/x-fasta
Note: Not officially listed by the IANA. text/x-fasta is based on popular accepted usage. application/x-fasta may be used in the context of a software application. text/plain may also be used. Wikidata cites text/plain as well as chemical/seq-aa-fasta and chemical/seq-na-fasta. Comments welcome.

Tag: Magic numbers
Value: ASCII: >
Value: Hex: 3e
Note: According to the NCBI and Zhang Group, files are expected to be ASCII-based and start with the greater-than character (">"), which is 3e in hexadecimal. In the original format, a semicolon (";") could also be used to begin a comment line, but this practice is generally no longer accepted and is considered legacy.

Tag: Pronom PUID
Value: See note.
Note: PRONOM has no corresponding entry as of July 2023.

Tag: Wikidata Title ID
Value: Q1593782
Note: See https://www.wikidata.org/wiki/Q1593782. Not to be confused with Q1111641, which is the entry that defines the "DNA and protein sequence alignment software package".
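The magic number listed above can serve as a quick heuristic check. The sketch below tests whether a file starts with the byte 3e (">") and whether a leading sample is pure ASCII. The name `looks_like_fasta` and the `sample_size` parameter are hypothetical, and a positive result is only a hint, since other text files can also begin with ">".

```python
def looks_like_fasta(path, sample_size=4096):
    """Heuristically check for the '>' magic number and ASCII content."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    # An empty file has no first byte and therefore fails the check.
    return sample[:1] == b"\x3e" and sample.isascii()
```

A FASTQ file fails this check because its first byte is "@" rather than ">".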

Notes

General

There have been some variations on the FASTA format used in the bioinformatics industry:

FASTQ appends a quality score to the FASTA data. In FASTQ, a record typically has four lines:

  • The first line is the sequence identifier and begins with "@" instead of ">".
  • The second line contains the sequence data, as in FASTA.
  • The third line begins with "+" to indicate the start of the quality data and may repeat the sequence identifier.
  • The fourth line contains ASCII characters that encode the quality score for each character of the sequence on the second line.

See Quality Score Encoding for more.
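Because FASTQ carries the same sequence data as FASTA, a FASTQ record can be reduced to FASTA by dropping the quality lines. The sketch below assumes strictly four-line records (no wrapped sequences), and `fastq_to_fasta` is a hypothetical name used for illustration.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records to FASTA text, dropping quality data."""
    lines = [line for line in fastq_text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four non-blank lines")
    output = []
    for start in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[start:start + 4]
        record_number = start // 4 + 1
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record {record_number}")
        if len(quality) != len(sequence):
            raise ValueError(f"quality length mismatch in record {record_number}")
        output.append(">" + header[1:])  # swap the "@" marker for ">"
        output.append(sequence)
    return ("\n".join(output) + "\n") if output else ""
```

The conversion is one-way: quality scores cannot be recovered from the resulting FASTA text.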

A2M is used to truncate FASTA data in order to make components easier for analysis software to read. A2M adds supplemental characters to equalize the length of all data lines. A2M uses "." and "~" to fill in leading positions, trailing positions, and gap positions in the line sequences. See "Description of A2M Alignment Format".

A3M is used to truncate FASTA data in order to make components easier for analysis software to read. A3M adds supplemental characters to equalize the length of all data lines. A3M also allows gaps that align with inserted characters to be omitted for conciseness. See trRosetta.

History

This format debuted in 1985 in a paper titled "Rapid and sensitive protein similarity searches" by David J. Lipman and William R. Pearson. The paper outlines the use of an algorithm for condensed bioinformatic data that would support faster digital analysis. From the paper's abstract: FASTA "facilitates the search for similarities between newly determined amino acid sequences and sequences already available in databases".

See for more information:

  • Pearson, William R. (1990). [5] Rapid and sensitive sequence comparison with FASTP and FASTA. Methods in Enzymology, 63–98. https://doi.org/10.1016/0076-6879(90)83007-v
  • Wikimedia Foundation. (2023, April 3). FASTA. Wikipedia. https://en.wikipedia.org/wiki/FASTA
  • UK National HPC Service. Software - FASTA. (n.d.). http://www.csar.cfs.ac.uk/user_information/software/bio-infomatics/fasta.shtml


Format specifications


Useful references

URLs


Last Updated: 06/21/2024