
VCF (Variant Call Format)

The Variant Call Format (VCF) was designed as part of the 1000 Genomes Project as a standardized way to report genetic variation from SNP, indel, genomic rearrangement, and structural variant detection programs. Ownership has since been transferred to the file format team of the Global Alliance for Genomics and Health (GA4GH) Data Working Group.

The format can represent all kinds of genomic variation. It specifies the layout of a text file used in bioinformatics for storing sequence variations, and it was developed with the advent of large-scale genotyping and DNA sequencing projects.

VCF is a text file format that is usually stored compressed. It contains meta-information lines, a header line, and then data lines, each describing one position in the genome. The format can also carry genotype information for each sample at each position.

The text files in VCF are saved with .vcf or .vcf.gz extensions.
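
Because both extensions occur in practice, a reader usually has to handle plain and compressed files alike. The following is a minimal sketch using only the Python standard library; the function name open_vcf is hypothetical, and it assumes that compression can be inferred from the file extension alone.

```python
import gzip


def open_vcf(path):
    """Open a plain (.vcf) or gzip-compressed (.vcf.gz) VCF file as text."""
    # Assumption: the extension reliably indicates whether the file is compressed.
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

Both branches return a text-mode file object, so the calling code can iterate over lines without caring how the file was stored.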

Every VCF file has three parts in the following order:

  1. Meta-information lines 
  2. One header line 
  3. Data lines containing marker and genotype data (one variant per line). A data line is called a VCF record.

Each VCF record has the same number of tab-separated fields as the header line. The symbol “.” is used to denote missing data.
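
The three-part layout can be illustrated with a small sketch that splits a VCF text into meta-information lines, the header line, and records. The file content in EXAMPLE_VCF is made up for illustration, and split_vcf is a hypothetical helper, not part of any library.

```python
# Illustrative VCF content (invented for this example): two meta-information
# lines, one header line, and two data lines.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">\n'
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tDP=14\n"
    "20\t17330\t.\tT\tA\t3\tq10\tDP=11\n"
)


def split_vcf(text):
    """Split VCF text into (meta lines, header columns, record field lists)."""
    meta, header, records = [], None, []
    for line in text.splitlines():
        if line.startswith("##"):
            meta.append(line)                      # meta-information line
        elif line.startswith("#"):
            header = line[1:].split("\t")          # the single header line
        elif line:
            records.append(line.split("\t"))       # one VCF record
    return meta, header, records


meta, header, records = split_vcf(EXAMPLE_VCF)
# Each record should have as many tab-separated fields as the header line.
assert all(len(rec) == len(header) for rec in records)
```

Note that the second record uses "." in the ID column to mark a missing identifier.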

Meta-information lines

Each meta-information line must have the form ##KEY=VALUE and, apart from quoted strings such as Description text, does not contain white-space. The first meta-information line must specify the VCF version number (for example, ##fileformat=VCFv4.2). Additional meta-information lines are optional, but are often included to describe terms used in the FILTER, INFO, and FORMAT fields.
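
As a rough sketch of the ##KEY=VALUE rule, the helpers below split a meta-information line at the first "=" and check that the first line declares the file format. The names parse_meta_line and check_fileformat are hypothetical, and structured values such as <ID=...> are deliberately left as raw strings.

```python
def parse_meta_line(line):
    """Split a '##KEY=VALUE' line into (key, value) at the first '='."""
    if not line.startswith("##"):
        raise ValueError("not a meta-information line: %r" % line)
    key, sep, value = line[2:].rstrip("\n").partition("=")
    if not sep or not key:
        raise ValueError("meta-information line lacks KEY=VALUE: %r" % line)
    return key, value


def check_fileformat(meta_lines):
    """Return the version string declared by the first meta-information line."""
    if not meta_lines:
        raise ValueError("no meta-information lines")
    key, value = parse_meta_line(meta_lines[0])
    if key != "fileformat":
        raise ValueError("first meta-information line must give the VCF version")
    return value
```

Splitting only at the first "=" matters because structured values (INFO, FILTER, and FORMAT definitions) contain further "=" characters of their own.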

The header line syntax

The header line begins with a single “#” and names the 8 fixed, mandatory columns. These columns are as follows:

  1. CHROM
  2. POS
  3. ID
  4. REF
  5. ALT
  6. QUAL
  7. FILTER
  8. INFO
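
A header line can be checked against the eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) with a short sketch like the one below. The function name parse_header is hypothetical. It also relies on one fact from the VCF specification that is not covered above: when genotype data is present, the fixed columns are followed by a FORMAT column and then one column per sample.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_header(line):
    """Validate a VCF header line; return (all column names, sample names)."""
    if not line.startswith("#") or line.startswith("##"):
        raise ValueError("not a header line")
    columns = line[1:].rstrip("\n").split("\t")
    if columns[:8] != FIXED_COLUMNS:
        raise ValueError("header does not start with the 8 fixed columns")
    # With genotype data, column 9 is FORMAT and the rest are sample names.
    if len(columns) > 8:
        if columns[8] != "FORMAT":
            raise ValueError("ninth column must be FORMAT")
        samples = columns[9:]
    else:
        samples = []
    return columns, samples
```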

Data lines

Fixed fields

There are 8 fixed fields per record. All data lines are tab-delimited. In all cases, missing values are specified with a dot (“.”). Fixed fields are:

  1. CHROM chromosome: an identifier from the reference genome. All entries for a specific CHROM should form a contiguous block within the VCF file. (Alphanumeric String, Required)
  2. POS position: the reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. (Integer, Required)
  3. ID semicolon-separated list of unique identifiers where available. If this is a dbSNP variant, using the rs number(s) is encouraged. (Alphanumeric String)
  4. REF reference base(s): each base must be one of A,C,G,T,N. Bases should be in uppercase. Multiple bases are permitted. The value in the POS field gives the position of the first base in the String. For indels, the reference String must include the base before the event. (String, Required)
  5. ALT comma-separated list of alternate non-reference alleles called on at least one of the samples. Options are base Strings made up of the bases A,C,G,T,N, or an angle-bracketed ID String (“<ID>”). If there are no alternative alleles, then the missing value should be used. Bases should be in uppercase. (Alphanumeric String; no whitespace, commas, or angle-brackets are permitted in the ID String itself)
  6. QUAL phred-scaled quality score for the assertion made in ALT, i.e. -10log_10 p(call in ALT is wrong). If ALT is “.” (no variant) then this is -10log_10 p(variant), and if ALT is not “.” this is -10log_10 p(no variant). (Numeric)
  7. FILTER filter: PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for the filters that failed. (Alphanumeric String)
  8. INFO additional information: INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: <key>=<data>[,data]. (Alphanumeric String)
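
The field descriptions above can be sketched as a small record parser. This is an illustration under stated assumptions rather than a complete implementation: Record, parse_record, and qual_to_error_probability are hypothetical names, INFO is kept as a raw string, and any columns after the eighth are ignored. The QUAL helper simply inverts the phred scale, so that QUAL = -10log_10 p gives p = 10^(-QUAL/10).

```python
from typing import List, NamedTuple, Optional


class Record(NamedTuple):
    chrom: str
    pos: int                 # 1-based reference position
    ids: List[str]           # empty list when ID is "."
    ref: str
    alts: List[str]          # empty list when ALT is "."
    qual: Optional[float]    # None when QUAL is "."
    filters: List[str]       # ["PASS"], failed filter codes, or [] when "."
    info: str                # raw INFO string, left unparsed here


def _split_or_missing(value, sep):
    """Split a list-valued field, mapping the missing value '.' to []."""
    return [] if value == "." else value.split(sep)


def parse_record(line):
    """Parse the 8 fixed fields of one tab-delimited VCF data line."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 8:
        raise ValueError("a VCF record needs at least 8 fields")
    return Record(
        chrom=f[0],
        pos=int(f[1]),
        ids=_split_or_missing(f[2], ";"),
        ref=f[3],
        alts=_split_or_missing(f[4], ","),
        qual=None if f[5] == "." else float(f[5]),
        filters=_split_or_missing(f[6], ";"),
        info=f[7],
    )


def qual_to_error_probability(qual):
    """Invert the phred scale: QUAL = -10*log10(p)  =>  p = 10**(-QUAL/10)."""
    return 10 ** (-qual / 10)
```

For instance, a QUAL of 30 corresponds to an error probability of about 0.001, and a QUAL of 10 to about 0.1.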

Common INFO fields:

  1. AA ancestral allele
  2. AC allele count 
  3. AF allele frequency 
  4. AN total number of alleles in called genotypes
  5. BQ RMS base quality 
  6. CIGAR cigar string describing how to align an alternate allele to the reference allele
  7. DB dbSNP membership
  8. DP combined depth across samples
  9. END end position of the variant 
  10. H2 membership in hapmap2
  11. MQ RMS mapping quality
  12. MQ0 number of reads with MAPQ == 0
  13. NS Number of samples with data
  14. SB strand bias at this position
  15. SOMATIC indicates somatic mutation
  16. VALIDATED validated by follow-up experiment
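
An INFO string built from keys like these can be unpacked with a sketch such as the following. The function name parse_info is hypothetical. Values are deliberately kept as lists of strings, because the type of each key is declared in the file's meta-information lines rather than in the data line itself; keys without a value (such as DB or H2) are treated as boolean flags.

```python
def parse_info(info):
    """Parse an INFO string such as 'NS=3;DP=14;AF=0.5;DB' into a dict."""
    if info == ".":
        return {}                             # missing INFO field
    result = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if sep:
            result[key] = value.split(",")    # <key>=<data>[,data]
        else:
            result[key] = True                # flag: key present, no value
    return result
```

With the input "NS=3;DP=14;AF=0.5,0.25;DB", this returns the NS, DP, and AF values as lists of strings and sets DB to True.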

Common FORMAT fields:

  • AD Read depth for each allele
  • ADF Read depth for each allele on the forward strand
  • ADR Read depth for each allele on the reverse strand
  • DP Read depth
  • EC Expected alternate allele counts
  • FT Filter
  • GL Genotype likelihoods
  • GP Genotype posterior probabilities
  • GQ Conditional genotype quality
  • GT Genotype
  • HQ Haplotype quality
  • MQ RMS mapping quality
  • PL Phred-scaled genotype likelihoods
  • PQ Phasing quality
  • PS Phase set
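
To show how these keys are used, the sketch below pairs a FORMAT column with one sample column and decodes the GT value. It relies on conventions from the VCF specification that are not spelled out above: FORMAT keys and sample values are colon-separated, and GT lists allele indices (0 for REF, 1 for the first ALT, and so on) separated by "/" for unphased or "|" for phased calls. The names parse_sample and parse_genotype are hypothetical, and the sketch assumes a genotype does not mix both separators.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with one sample's colon-separated values."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # Trailing values may be omitted for a sample; treat them as missing.
    values += ["."] * (len(keys) - len(values))
    return dict(zip(keys, values))


def parse_genotype(gt):
    """Decode a GT string into (allele indices, phased flag).

    A missing allele '.' becomes None. Assumes a single separator type.
    """
    phased = "|" in gt
    sep = "|" if phased else "/"
    alleles = [None if a == "." else int(a) for a in gt.split(sep)]
    return alleles, phased
```

For example, a FORMAT of "GT:GQ:DP" with the sample value "0|1:48:8" yields the genotype string "0|1", which decodes to the allele indices 0 and 1 with the phased flag set.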
