Pindel uses a pattern growth approach to detect the breakpoints of large deletions, medium-sized insertions, inversions, tandem duplications and other structural variants at single-base resolution from paired-end short reads produced by next-generation sequencing.
The Pindel workflow consists of the following major steps:
- Align reads against reference genome using BWA
- Call insertions and deletions using Pindel
- Filter variants by regions of interest (optional)
- Filter variants by variant properties (e.g. variant frequency, depth)
The workflow takes the following inputs:
- Samples and/or BAMs (paired-end reads required)
- Targeted region manifest (optional)
- Targeted region bed file (optional)
The workflow produces the following output files:
- Sample Name.bam: aligned reads in BAM format
- Sample Name.vcf: variant calls in VCF format
- Sample Name.summary.csv: summary statistics and parameter settings
- Sample Name.sv.summary.csv: variant summary statistics
- Other files can be ignored
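The two filtering steps of the workflow (by regions of interest, and by variant properties such as frequency and depth) can be illustrated with a minimal Python sketch. It is not the filter the workflow actually runs: the dictionary keys (`chrom`, `pos`, `ref_reads`, `alt_reads`), the function names and the default thresholds are hypothetical and chosen only for illustration. Positions are assumed to be 0-based and regions half-open, as in a BED file.

```python
def in_regions(chrom, pos, regions):
    """True if the 0-based position falls inside any (chrom, start, end) region."""
    return any(c == chrom and start <= pos < end for c, start, end in regions)


def filter_calls(calls, regions=None, min_depth=10, min_freq=0.05):
    """Keep calls inside the regions of interest that pass depth and frequency.

    `calls` is a list of dicts with hypothetical keys:
    chrom, pos, ref_reads (reads supporting the reference),
    alt_reads (reads supporting the variant).
    The default thresholds are illustrative, not recommended values.
    """
    kept = []
    for call in calls:
        # Optional step: restrict to the targeted regions.
        if regions is not None and not in_regions(call["chrom"], call["pos"], regions):
            continue
        depth = call["ref_reads"] + call["alt_reads"]
        # Guard against division by zero and apply the depth threshold.
        if depth == 0 or depth < min_depth:
            continue
        # Variant frequency = supporting reads / total reads at the site.
        if call["alt_reads"] / depth < min_freq:
            continue
        kept.append(call)
    return kept
```

Passing `regions=None` skips the region filter, which mirrors the optional nature of that step.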
Pindel aims to compute the precise breakpoints, as well as the fragments inserted or deleted relative to the reference genome, from paired-end reads. In the preprocessing step, SSAHA2 is used to map all reads to the reference genome. The mapping results are then examined to keep only those read pairs in which exactly one end can be mapped. For each such pair, the mapped end must be located uniquely in the genome with no mismatched bases, while the other end cannot be mapped anywhere in the genome at the given alignment score threshold. Pindel uses the mapped end to determine the anchor point on the reference genome and the direction of the unmapped read. From the anchor point, the search direction and the user-defined Maximum Deletion Size (Max_D_Size) parameter, Pindel locates a sub-region of the reference genome in which it breaks the unmapped read into two (deletion) or three (short insertion) fragments and maps the two terminal fragments separately.
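The selection of read pairs and the choice of anchor point can be sketched in Python as below. This is an illustration under stated assumptions, not Pindel's own code: the `End` record and the `pick_anchor` function are hypothetical names, coordinates are assumed to be 0-based, and the anchor is taken to be the 3′ end of the mapped read (its rightmost base on the forward strand, its leftmost base on the reverse strand).

```python
from typing import NamedTuple, Optional, Tuple


class End(NamedTuple):
    """One end of a read pair (hypothetical record, for illustration only)."""
    mapped: bool
    unique: bool = False   # mapped to exactly one location
    mismatches: int = 0    # mismatched bases in the alignment
    pos: int = 0           # 0-based leftmost reference coordinate
    length: int = 0        # read length
    reverse: bool = False  # True if aligned to the reverse strand


def pick_anchor(end1: End, end2: End) -> Optional[Tuple[int, int]]:
    """Return (anchor, direction) for a usable pair, otherwise None.

    A pair is usable when one end is uniquely mapped without mismatches
    and the other end is unmapped. direction is +1 when the unmapped
    read should be searched at higher coordinates, -1 at lower ones.
    """
    def is_perfect(e: End) -> bool:
        return e.mapped and e.unique and e.mismatches == 0

    if is_perfect(end1) and not end2.mapped:
        m = end1
    elif is_perfect(end2) and not end1.mapped:
        m = end2
    else:
        return None

    if m.reverse:
        # 3' end of a reverse-strand read is its leftmost reference base.
        return m.pos, -1
    # 3' end of a forward-strand read is its rightmost reference base.
    return m.pos + m.length - 1, +1
```

Pairs in which both ends map, or in which the mapped end has mismatches or multiple locations, return `None` and are skipped.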
The computational procedure for each unmapped read is as follows:
1. Read the location and direction of the mapped read from the mapping result obtained in the preprocessing step.
2. Define the 3′ end of the mapped read as the anchor point.
3. Use the pattern growth algorithm to search for the minimum and maximum unique substrings from the 3′ end of the unmapped read within a range of twice the insert size from the anchor point.
4. Use pattern growth to search for the minimum and maximum unique substrings from the 5′ end of the unmapped read within a range of read length + Max_D_Size, starting from the already mapped 3′ end of the unmapped read obtained in step 3.
5. Check whether the complete unmapped read can be reconstructed by combining the unique substrings from the 5′ and 3′ ends found in steps 3 and 4. If so, store it in the database U.
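The procedure above can be sketched in Python for the simplest case, a deletion. This is a simplified illustration, not Pindel's implementation. It assumes that the anchor lies on the forward strand, that the unmapped read is non-empty and already written in reference orientation (so its 3′ end is the part nearest the anchor, here the left side of the string), and that matching is exact. The function names and the return format are hypothetical.

```python
def grow_prefix(read, ref, lo, hi):
    """Grow the read's prefix base by base over start positions in ref[lo:hi].

    Returns (start, min_len, max_len): the single surviving start position,
    the shortest prefix length that is unique in the window, and the longest
    prefix length that still matches there. Returns None if never unique.
    """
    lo, hi = max(lo, 0), min(hi, len(ref))
    candidates = [p for p in range(lo, hi) if ref[p] == read[0]]
    length, min_len = 1, None
    while candidates:
        if min_len is None and len(candidates) == 1:
            min_len = length
        if length == len(read):
            break
        longer = [p for p in candidates
                  if p + length < len(ref) and ref[p + length] == read[length]]
        if not longer:
            break
        candidates, length = longer, length + 1
    return None if min_len is None else (candidates[0], min_len, length)


def grow_suffix(read, ref, lo, hi):
    """Mirror of grow_prefix: grow the read's suffix backwards.

    Returns (end, min_len, max_len), where end is the reference index
    aligned to the last base of the read, or None if never unique.
    """
    lo, hi = max(lo, 0), min(hi, len(ref))
    candidates = [p for p in range(lo, hi) if ref[p] == read[-1]]
    length, min_len = 1, None
    while candidates:
        if min_len is None and len(candidates) == 1:
            min_len = length
        if length == len(read):
            break
        longer = [p for p in candidates
                  if p - length >= 0 and ref[p - length] == read[-1 - length]]
        if not longer:
            break
        candidates, length = longer, length + 1
    return None if min_len is None else (candidates[0], min_len, length)


def find_deletion(read, ref, anchor, insert_size, max_d_size):
    """Return (first_deleted_index, deletion_length) or None."""
    # Step 3: near end, within twice the insert size from the anchor.
    near = grow_prefix(read, ref, anchor, anchor + 2 * insert_size)
    if near is None:
        return None
    start, min_l, max_l = near
    # Step 4: far end, within read length + Max_D_Size of the mapped near end.
    far = grow_suffix(read, ref, start, start + len(read) + max_d_size)
    if far is None:
        return None
    end, min_r, max_r = far
    # Step 5: the two fragments must add up to the complete read.
    for left_len in range(max_l, min_l - 1, -1):
        right_len = len(read) - left_len
        if min_r <= right_len <= max_r:
            del_start = start + left_len
            right_start = end - right_len + 1
            if right_start > del_start:
                return del_start, right_start - del_start
    return None
```

When the bases flanking the breakpoint are identical, several splits reconstruct the read equally well; this sketch simply reports the one with the longest near-end fragment. Short insertions, which require a third, unmatched middle fragment, are not handled here.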