The arrival of massively parallel sequencing technologies has fundamentally changed the study of genetics. New platforms such as the Illumina HiSeq2000 produce unprecedented levels of sequencing throughput. The analysis and interpretation of data from next-generation sequencing (NGS) platforms pose a substantial informatics challenge.
Massively parallel sequencing technologies hold incredible promise for the study of DNA sequence variation, particularly the identification of variants affecting human disease. The unprecedented throughput and relatively short read lengths of Roche/454, Illumina/Solexa, and other platforms have spurred development of a new generation of sequence alignment algorithms. Yet detection of sequence variants based on short read alignments remains challenging, and most currently available tools are limited to a single platform or aligner type. VarScan is a platform-independent software tool, developed at the Genome Institute at Washington University, that detects variants in NGS data and is compatible with several short read aligners. It detects SNPs and indels with high sensitivity and specificity in both Roche/454 sequencing of individuals and deep Illumina/Solexa sequencing of pooled samples.
VarScan is a tool that detects variants (SNPs and indels) in next-generation sequencing data. The new release is implemented in Java, and includes several new features:
- SAM/BAM compatibility. VarScan now takes SAMtools pileup as input, so it’s compatible with most SAM-friendly short read aligners. For a list of SAM-friendly aligners on which VarScan has been tested, see below.
- Java implementation, which improves performance and lets VarScan run on any operating system.
- SNP, indel, and consensus calling. In addition to detecting variants, VarScan calls consensus genotypes based on read counts and allele frequency.
- Somatic variant detection. Given input from a tumor sample and matched control, VarScan identifies variants and determines their somatic status (Germline, Somatic, or LOH) by comparing the read counts.
- Exome-based copy number alteration detection.
VarScan is a platform-independent mutation caller for targeted, exome, and whole-genome resequencing data generated on Illumina, SOLiD, Life/PGM, Roche/454, and similar instruments. The newest version, VarScan 2, is written in Java, so it runs on most operating systems.
It can be used to detect different types of variation:
- Germline variants (SNPs and indels) in individual samples or pools of samples.
- Multi-sample variants (shared or private) in multi-sample datasets (with mpileup).
- Somatic mutations, LOH events, and germline variants in tumor-normal pairs.
- Somatic copy number alterations (CNAs) in tumor-normal exome data.
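The status labels used for tumor-normal pairs (Germline, Somatic, LOH) can be illustrated with a minimal Python sketch. This is not VarScan's actual implementation: the function names, the frequency cutoffs (`min_var_freq`, `min_hom_freq`), and the purely frequency-based genotype call are assumptions made for illustration, and no statistical test is applied here.

```python
def call_genotype(ref_reads, var_reads, min_var_freq=0.20, min_hom_freq=0.75):
    """Crude genotype call from read counts (illustrative cutoffs, not VarScan defaults)."""
    depth = ref_reads + var_reads
    if depth == 0:
        return "unknown"
    freq = var_reads / depth
    if freq >= min_hom_freq:
        return "hom_var"
    if freq >= min_var_freq:
        return "het"
    return "hom_ref"


def somatic_status(normal_counts, tumor_counts):
    """Classify one position from (ref_reads, var_reads) tuples for normal and tumor."""
    normal = call_genotype(*normal_counts)
    tumor = call_genotype(*tumor_counts)
    if "unknown" in (normal, tumor):
        return "Unknown"
    if normal == "hom_ref" and tumor == "hom_ref":
        return "Reference"   # no variant in either sample
    if normal == tumor:
        return "Germline"    # same variant genotype in both samples
    if normal == "hom_ref":
        return "Somatic"     # variant appears only in the tumor
    if normal == "het" and tumor in ("hom_ref", "hom_var"):
        return "LOH"         # heterozygosity lost in the tumor
    return "Unknown"
```

For example, `somatic_status((30, 0), (20, 10))` compares a normal sample with no variant reads against a tumor with one third variant reads, which this sketch labels as Somatic.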
Most of the published variant callers for next-generation sequencing data employ a probabilistic framework, such as Bayesian statistics, to detect variants and assess confidence in them. These approaches generally work quite well, but can be confounded by numerous factors such as extreme read depth, pooled samples, and contaminated or impure samples. In contrast, VarScan employs a robust heuristic/statistical approach to call variants that meet desired thresholds for read depth, base quality, variant allele frequency, and statistical significance.
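A threshold-based call of this kind might look like the following Python sketch. It is an illustration under stated assumptions, not VarScan's code: the function names, all default threshold values, the assumed sequencing error rate, and the choice of a one-sided Fisher's exact test against the reads expected from error alone are hypothetical choices made for this example.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, upper + 1))
    return tail / total


def passes_thresholds(ref_reads, var_reads, avg_base_qual,
                      min_coverage=8, min_var_reads=2, min_avg_qual=15,
                      min_var_freq=0.20, max_p_value=0.05, error_rate=0.001):
    """Apply depth, quality, frequency, and significance filters (assumed defaults)."""
    depth = ref_reads + var_reads
    if depth == 0 or depth < min_coverage or var_reads < min_var_reads:
        return False
    if avg_base_qual < min_avg_qual:
        return False
    if var_reads / depth < min_var_freq:
        return False
    # Compare observed counts with the counts expected from sequencing error alone.
    expected_var = round(depth * error_rate)
    p_value = fisher_right_tail(var_reads, ref_reads,
                                expected_var, depth - expected_var)
    return p_value <= max_p_value
```

The cheap heuristic filters run first, so the significance test is only computed for positions that already meet the depth, quality, and frequency cutoffs.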
VarScan is under continued development and improvement at a leading genome center with early access to new sequencing technologies, substantial computing resources, immense public/private datasets, and established expertise in sequencing, genetics, and genomics.
In somatic mode, VarScan reads the pileup files from normal and tumor simultaneously. Only positions that are present in both files, and meet the minimum coverage in both files, will be compared. VarScan tries to obtain the maximum number of comparisons, even if it means closing and reopening the normal file to try to match contig and position. This can lead to looping errors.
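The rule that only positions present in both files, with sufficient coverage in both, are compared can be sketched in Python as follows. This sketch deliberately swaps VarScan's simultaneous streaming for a simpler approach: it loads the normal depths into a dictionary and then streams the tumor file, trading memory for simplicity. The function names and the `min_coverage` default are assumptions, and the input is assumed to be tab-separated pileup with contig, position, reference base, and depth in the first four columns.

```python
def read_pileup(handle):
    """Yield (contig, position, depth) from tab-separated pileup lines."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        yield fields[0], int(fields[1]), int(fields[3])


def comparable_positions(normal_handle, tumor_handle, min_coverage=8):
    """Yield (contig, position, normal_depth, tumor_depth) for comparable positions."""
    # Index the normal sample by (contig, position); assumes it fits in memory.
    normal_depth = {(contig, pos): depth
                    for contig, pos, depth in read_pileup(normal_handle)}
    for contig, pos, tumor_d in read_pileup(tumor_handle):
        normal_d = normal_depth.get((contig, pos))
        if normal_d is None:
            continue  # position is absent from the normal file
        if normal_d >= min_coverage and tumor_d >= min_coverage:
            yield contig, pos, normal_d, tumor_d
```

Because the lookup is keyed on contig and position, this version does not depend on the two files listing contigs in the same order, which is the situation that forces a streaming reader to reopen the normal file.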
This command expects both a normal and a tumor file in SAMtools pileup format, generated from sequence alignments in binary alignment/map (BAM) format. To build a pileup file, you will need:
- A SAM/BAM file (“myData.bam”) that has been sorted using the sort command of SAMtools.
- The reference sequence (“reference.fasta”) to which reads were aligned, in FASTA format.
- The SAMtools software package.
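With those three pieces in place, the pileup can be generated by calling SAMtools. The Python sketch below wraps that call; the helper names and the output file name are hypothetical, and it assumes a `samtools` executable on the PATH that provides the `mpileup` subcommand (older SAMtools releases used `pileup` instead).

```python
import subprocess


def build_mpileup_command(reference_fasta, bam_path):
    """Return the argument list for `samtools mpileup -f <reference> <bam>`."""
    return ["samtools", "mpileup", "-f", reference_fasta, bam_path]


def write_pileup(reference_fasta, bam_path, pileup_path):
    """Run samtools and write its standard output to pileup_path."""
    with open(pileup_path, "w") as out:
        subprocess.run(build_mpileup_command(reference_fasta, bam_path),
                       stdout=out, check=True)
```

Calling `write_pileup("reference.fasta", "myData.bam", "myData.pileup")` would produce one pileup file; for somatic mode the same step would be repeated for the normal and the tumor BAM files.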
VarScan creates two output files by default, one for SNVs (.snp) and one for indels (.indel). If the validation flag is turned on, a third file (.validation) will also be generated containing all positions that were called. Output files have headers, and all share the same format.