BAM (Binary Alignment Map) is a binary format for storing sequence alignment data. A BAM file (*.bam) is the compressed binary version of a SAM file and is used to represent aligned sequences up to 128 Mbp long. A BAM index file (*.bam.bai) provides an index of the corresponding BAM file.
The BAM and SAM formats are designed to contain the same information. The SAM format is more human-readable and easier to process with conventional text-based tools such as awk, sed, cut, and Python. The BAM format provides binary versions of most of the same data and is designed to compress reasonably well. The libStatGen library reads both SAM and BAM files. A BAM file contains a header section and an alignment section. The header may also contain comments, which are free-form text lines that can hold any information.
The header section may contain information about the entire file as well as supplemental information for the alignments. Each header line starts with the @ symbol. The alignment section describes, for each sequence, where and how it aligns to the reference genome. The header is not required, but when it is present it holds generic information for the BAM file: the format version, whether and how the file is sorted, the reference sequences, the processing that produced the various reads in the file, and the programs that have been used to process them. Alignment records can then point to this supplemental information to identify which entries they are associated with.
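To make the header layout concrete, here is a minimal Python sketch that groups header lines by record type. It assumes the header is available as SAM-style text, with two-letter record types (HD, SQ, RG, PG, CO) and tab-separated TAG:VALUE fields; the function name `parse_sam_header` and the example values are illustrative only.

```python
def parse_sam_header(text):
    """Group header lines by their two-letter record type (HD, SQ, RG, PG, CO)."""
    header = {}
    for line in text.splitlines():
        if not line.startswith("@"):
            continue  # header lines always start with '@'
        record_type, _, rest = line[1:].partition("\t")
        if record_type == "CO":
            # Comment lines are free-form text and are kept unchanged.
            entry = rest
        else:
            # Other lines hold TAG:VALUE fields separated by tabs.
            entry = dict(
                field.split(":", 1) for field in rest.split("\t") if ":" in field
            )
        header.setdefault(record_type, []).append(entry)
    return header


# Illustrative header text (values are made up for the example).
example = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@PG\tID:aligner\tPN:aligner\n"
    "@CO\tany free-form text\n"
)
header = parse_sam_header(example)
print(header["HD"][0]["SO"])
```

The sort order is read from the SO field of the @HD line, and each @SQ line describes one reference sequence that alignment records can refer to.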
Each alignment has:
- a query name, QNAME (SAM) or read_name (BAM). It is used to group and identify alignments that belong together, such as paired alignments or a read that appears in multiple alignments.
- a bitwise set of flags describing the alignment, FLAG.
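Because FLAG is a bit field, each property of an alignment is tested with a bitwise AND. The sketch below uses the bit values defined in the SAM specification; the short names in `FLAG_BITS` and the helper `decode_flag` are illustrative choices, not part of any library.

```python
# Bit values from the SAM specification; the names are informal labels.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits that are set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


flags = decode_flag(99)  # 99 = 0x1 + 0x2 + 0x20 + 0x40
print(flags)
```

A FLAG of 99 therefore describes the first read of a properly paired alignment whose mate maps to the reverse strand.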
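Fields in BAM format are:
- List of reference information
- List of alignments
  - l_read_name (length of the read name, including the terminating NUL)
  - n_cigar_op (number of operations in the CIGAR)
  - l_seq (length of the sequence)
- List of auxiliary data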
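The length fields above tell a reader how many bytes each variable-length part of an alignment occupies. The following Python sketch decodes one alignment block under the layout given in the SAM/BAM specification (a 32-byte fixed part, then read name, CIGAR, 4-bit packed sequence, qualities, and auxiliary data). The function name `parse_alignment` and the hand-built example record are assumptions made for illustration.

```python
import struct

# Fixed-length part: refID, pos, l_read_name, mapq, bin, n_cigar_op,
# flag, l_seq, next_refID, next_pos, tlen (32 bytes, little-endian).
FIXED = struct.Struct("<iiBBHHHIiii")
CIGAR_OPS = "MIDNSHP=X"
SEQ_CODES = "=ACMGRSVTWYHKDBN"


def parse_alignment(block):
    """Decode one alignment block (the bytes that follow block_size)."""
    (ref_id, pos, l_read_name, mapq, _bin, n_cigar_op, flag,
     l_seq, _next_ref_id, _next_pos, _tlen) = FIXED.unpack_from(block, 0)
    offset = FIXED.size
    # read_name is NUL-terminated; l_read_name counts the NUL.
    read_name = block[offset:offset + l_read_name - 1].decode("ascii")
    offset += l_read_name
    # Each CIGAR operation is a uint32: length << 4 | operation code.
    raw_ops = struct.unpack_from("<%dI" % n_cigar_op, block, offset)
    cigar = "".join("%d%s" % (v >> 4, CIGAR_OPS[v & 0xF]) for v in raw_ops)
    offset += 4 * n_cigar_op
    # Bases are packed two per byte, high nibble first.
    n_seq_bytes = (l_seq + 1) // 2
    packed = block[offset:offset + n_seq_bytes]
    seq = "".join(SEQ_CODES[b >> 4] + SEQ_CODES[b & 0xF] for b in packed)[:l_seq]
    offset += n_seq_bytes
    qual = list(block[offset:offset + l_seq])
    offset += l_seq
    return {
        "ref_id": ref_id, "pos": pos, "mapq": mapq, "flag": flag,
        "read_name": read_name, "cigar": cigar, "seq": seq,
        "qual": qual, "aux": block[offset:],  # auxiliary data runs to the end
    }


# Hand-built example record: read "r1", CIGAR 4M, sequence ACGT.
block = (
    FIXED.pack(0, 99, 3, 60, 0, 1, 0, 4, -1, -1, 0)
    + b"r1\x00"
    + struct.pack("<I", (4 << 4) | 0)
    + bytes([0x12, 0x48])
    + bytes([30] * 4)
    + b"NMC\x00"
)
rec = parse_alignment(block)
print(rec["read_name"], rec["cigar"], rec["seq"])
```

Everything left after the qualities belongs to the auxiliary data, which is why that list has no explicit length field of its own.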
Each alignment can also carry additional optional information in the form of TAGs, which store various kinds of information as key/value pairs.
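As an illustration of the key/value layout, the sketch below walks through binary auxiliary data, where each entry is a two-character tag, a one-character value type, and the value itself. It handles only the single-character (A), numeric (c, C, s, S, i, I, f), and NUL-terminated string (Z) types and raises an error for the others; `parse_aux` and the example tags are illustrative.

```python
import struct

# struct formats for the numeric value types (all little-endian).
_NUMERIC = {
    "c": "<b", "C": "<B", "s": "<h", "S": "<H",
    "i": "<i", "I": "<I", "f": "<f",
}


def parse_aux(data):
    """Decode binary auxiliary data into a {tag: value} dictionary."""
    tags = {}
    offset = 0
    while offset < len(data):
        key = data[offset:offset + 2].decode("ascii")
        val_type = chr(data[offset + 2])
        offset += 3
        if val_type in _NUMERIC:
            fmt = _NUMERIC[val_type]
            (value,) = struct.unpack_from(fmt, data, offset)
            offset += struct.calcsize(fmt)
        elif val_type == "A":
            value = chr(data[offset])
            offset += 1
        elif val_type == "Z":
            end = data.index(b"\x00", offset)  # strings are NUL-terminated
            value = data[offset:end].decode("ascii")
            offset = end + 1
        else:
            # Hex strings (H) and arrays (B) are left out of this sketch.
            raise ValueError("unsupported value type: %r" % val_type)
        tags[key] = value
    return tags


aux = b"NMC\x01" + b"RGZgrp1\x00" + b"ASi" + struct.pack("<i", -5)
tags = parse_aux(aux)
print(tags)
```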
BAM is compressed in the BGZF format. All multi-byte numbers in BAM are little-endian, regardless of the machine endianness. The specification lays out the format as a table, in which values in brackets are the defaults used when the corresponding information is not available.
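Since BGZF blocks are valid gzip members, a general gzip reader can decompress a BAM stream, and the little-endian integers can then be unpacked explicitly. The sketch below reads the magic string, the header text, and the reference list following the layout in the SAM/BAM specification. `read_bam_header` is a hypothetical helper, it decompresses the whole input in memory for simplicity, and the example payload is compressed with ordinary gzip rather than real BGZF blocks.

```python
import gzip
import struct


def read_bam_header(raw):
    """Return (header_text, [(reference_name, reference_length), ...])."""
    data = gzip.decompress(raw)  # handles concatenated gzip members
    if data[:4] != b"BAM\x01":
        raise ValueError("missing BAM magic string")
    # "<" forces little-endian decoding regardless of the host machine.
    (l_text,) = struct.unpack_from("<I", data, 4)
    offset = 8
    text = data[offset:offset + l_text].rstrip(b"\x00").decode("utf-8")
    offset += l_text
    (n_ref,) = struct.unpack_from("<I", data, offset)
    offset += 4
    references = []
    for _ in range(n_ref):
        (l_name,) = struct.unpack_from("<I", data, offset)
        offset += 4
        # The name is NUL-terminated; l_name counts the NUL.
        name = data[offset:offset + l_name - 1].decode("ascii")
        offset += l_name
        (l_ref,) = struct.unpack_from("<I", data, offset)
        offset += 4
        references.append((name, l_ref))
    return text, references


# Build a tiny in-memory example instead of reading a real file.
text = b"@HD\tVN:1.6\tSO:unsorted\n@SQ\tSN:chr1\tLN:1000\n"
payload = (
    b"BAM\x01"
    + struct.pack("<I", len(text)) + text
    + struct.pack("<I", 1)
    + struct.pack("<I", 5) + b"chr1\x00"
    + struct.pack("<I", 1000)
)
header_text, references = read_bam_header(gzip.compress(payload))
print(references)
```

For real files, a dedicated BGZF reader is preferable because it can seek to individual blocks, which is what the index file relies on.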