Sequence Alignment/Map (SAM) is a TAB-delimited, text-based format developed by Heng Li, Bob Handsaker, and colleagues, originally for storing biological sequences aligned to a reference sequence. It is widely used for data such as nucleotide sequences generated by next-generation sequencing technologies, and the standard has since been broadened to include unmapped sequences. The format supports short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is used to hold mapped data within the Genome Analysis Toolkit (GATK), and it is in use across the Broad Institute, the Wellcome Sanger Institute, and the 1000 Genomes Project. Each record is a single line of tab-delimited ASCII columns.

The SAM format consists of an optional header section and an alignment section. The binary equivalent of a SAM file is a Binary Alignment Map (BAM) file, which stores the same data in a compressed binary representation that can be indexed. SAM files can be analyzed and edited with SAMtools. If the header section is present, it must come before the alignment section. Header lines begin with the ‘@’ symbol, which distinguishes them from alignment lines. Each alignment line has 11 mandatory fields, followed by a variable number of optional fields.

  1. QNAME: Query template NAME. Reads/segments having identical QNAME are regarded to come from the same template. 
  2. FLAG: Combination of bitwise FLAGs
  3. RNAME: Reference sequence NAME of the alignment. If @SQ header lines are present, RNAME (if not ‘*’) must be present in one of the SQ-SN tags. 
  4. POS: 1-based leftmost mapping POSition of the first matching base. 
  5. MAPQ: MAPping Quality. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. A value of 255 indicates that the mapping quality is not available.
  6. CIGAR: Concise Idiosyncratic Gapped Alignment Report (CIGAR) string.
  7. RNEXT: Reference sequence name of the primary alignment of the NEXT read in the template. 
  8. PNEXT: Position of the primary alignment of the NEXT read in the template. Set as 0 when the information is unavailable. 
  9. TLEN: signed observed Template LENgth. 
  10. SEQ: segment SEQuence. 
  11. QUAL: ASCII of Phred-scaled base QUALity plus 33.
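
To make the field layout concrete, here is a minimal Python sketch that splits one alignment line into the 11 mandatory fields, decodes the FLAG bits, and converts MAPQ back into an error probability. The names `SamRecord`, `parse_alignment_line`, `decode_flag`, and `mapq_to_error_probability` are hypothetical helpers made up for this illustration, not part of SAMtools or any other library. The sketch assumes a well-formed line and does no validation beyond counting columns.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# FLAG bit meanings as listed in the SAM specification.
# The short labels on the right are illustrative names, not official ones.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_template",
    0x80: "last_in_template",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


@dataclass
class SamRecord:
    """One alignment line: 11 mandatory fields plus optional tags."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str] = field(default_factory=list)


def parse_alignment_line(line: str) -> SamRecord:
    """Split a TAB-delimited alignment line into typed fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(cols)}")
    return SamRecord(
        qname=cols[0], flag=int(cols[1]), rname=cols[2], pos=int(cols[3]),
        mapq=int(cols[4]), cigar=cols[5], rnext=cols[6], pnext=int(cols[7]),
        tlen=int(cols[8]), seq=cols[9], qual=cols[10],
        tags=cols[11:],  # optional fields are kept as raw TAG:TYPE:VALUE strings
    )


def decode_flag(flag: int) -> List[str]:
    """Return the labels of all bits set in FLAG, lowest bit first."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def mapq_to_error_probability(mapq: int) -> Optional[float]:
    """Invert MAPQ = -10 log10 Pr{wrong}; 255 means 'not available'."""
    if mapq == 255:
        return None
    return 10 ** (-mapq / 10)
```

For example, parsing a line whose FLAG is 99 and MAPQ is 30 gives a read that is paired, properly paired, first in its template, with its mate on the reverse strand, and an estimated probability of about 0.001 that the mapping position is wrong.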

Currently, most SAM data is output by aligners that read FASTQ files and assign each sequence to a position relative to a known reference genome. SAM can also be used to archive unaligned sequence data generated directly by sequencing machines. The header section may contain information about the entire file as well as additional information for the alignments, which individual alignment records can then refer to. The header is optional, but when present it holds generic information about the SAM file, such as the format version and whether and how the file is sorted.

The header also contains supplemental information for alignment records, such as the reference sequences, the processing used to generate the reads in the file, and the programs that have been used to process them. Alignment records may then point to this supplemental information, identifying which entries a given alignment is associated with.
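
The header layout described above can be sketched in a few lines of Python. The function name `parse_header` is a hypothetical helper chosen for this example. The sketch assumes that every header line other than `@CO` consists of TAB-separated `TAG:VALUE` pairs after the two-letter record type, which is how header lines such as `@HD`, `@SQ`, `@RG`, and `@PG` are laid out in the SAM specification.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_header(lines: Iterable[str]) -> Dict[str, List]:
    """Collect header lines (those starting with '@') by record type.

    Stops at the first non-header line, since the header must come
    before the alignment section.
    """
    header = defaultdict(list)
    for line in lines:
        if not line.startswith("@"):
            break  # first alignment line reached
        fields = line.rstrip("\n").split("\t")
        record_type = fields[0][1:]  # e.g. "HD", "SQ", "RG", "PG", "CO"
        if record_type == "CO":
            # Comment lines are free text, not TAG:VALUE pairs.
            header["CO"].append("\t".join(fields[1:]))
        else:
            # Split on the first ':' only, because values may contain ':'.
            header[record_type].append(
                dict(f.split(":", 1) for f in fields[1:])
            )
    return dict(header)
```

With the header parsed this way, the version and sort order can be read from the `VN` and `SO` tags of the `@HD` entry, and an alignment's RNAME can be matched against the `SN` tag of the `@SQ` entries to look up the reference sequence it refers to.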
