Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation causes reference bias and confounds downstream results. Other approaches replace the linear reference with structures like graphs that can include genetic variation, at the cost of major computational overhead. The “reference flow” alignment method uses information from multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner, reference flow exhibits a similar level of accuracy and bias avoidance, but with 13% of the memory footprint and 6 times the speed.
Reference bias occurs when sample reads that differ substantially from the reference do not map correctly and are either mapped to the wrong position in the genome or left unmapped altogether. Incorrect read mapping in turn leads to false negative or false positive variant calls. For example, when reads are aligned to a linear reference genome, only a small proportion of the sample reads containing an insertion (dark blue in the figure) are mapped to the correct position.
Lunter and Goodson tested several mapping tools (BWA, MAQ, and Stampy) on reads from an individual in the 1000 Genomes Project, focusing on a known heterozygous indel in the genome. The mappers showed a consistent bias toward mapping reads containing the reference allele, leading to underestimation of the proportion of reads with the non-reference allele. Degner and colleagues sequenced mRNA from two HapMap cell lines. When reads containing heterozygous SNPs were aligned to the reference genome using MAQ, reads with the reference allele were mapped significantly more often than reads with the non-reference allele.
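One common way to quantify the kind of bias these studies describe is the fraction of mapped reads that carry the reference allele at a heterozygous site, which should sit near 0.5 when mapping is unbiased. The following is a minimal sketch of that idea, not the procedure used in either study: the function names and the read counts are hypothetical, and the exact two-sided binomial test is an assumed choice of statistic.

```python
from math import comb


def ref_allele_fraction(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a heterozygous site that carry the reference allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return ref_reads / total


def binomial_two_sided_p(ref_reads: int, alt_reads: int) -> float:
    """Exact two-sided binomial p-value against the unbiased expectation of 0.5.

    Sums the probability of every outcome that is no more likely than the
    observed one. With p = 0.5 each outcome k has weight comb(n, k) / 2**n,
    so the binomial coefficients can be compared directly.
    """
    n = ref_reads + alt_reads
    observed = comb(n, ref_reads)
    extreme = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return extreme / 2 ** n


# Hypothetical counts: 60 reads with the reference allele, 40 with the alternative.
fraction = ref_allele_fraction(60, 40)
p_value = binomial_two_sided_p(60, 40)
print(f"reference allele fraction = {fraction:.2f}, p = {p_value:.3f}")
```

A fraction well above 0.5 across many heterozygous sites, rather than at a single site, is what would suggest systematic reference bias; a single site can deviate by chance or because of genuine allele-specific expression.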
Reference bias affects clinically significant regions of the genome. For example, because human leukocyte antigen (HLA) genes have the highest diversity of any region in the genome, HLA genotyping is susceptible to reference bias. When Brandt and colleagues compared HLA genotypes from the 1000 Genomes Project as measured by next-generation sequencing methods and by gold-standard Sanger sequencing, they found that 18.6% of SNPs identified through next-generation sequencing variant calling were inaccurate. They also found evidence of reference bias: reads with an alternative allele at a SNP were less likely to map to the reference genome than were reads carrying the reference allele. Because reference bias results in inaccurate HLA typing by next-generation sequencing, genomic research is limited in the many clinical areas where HLA genes play an important role, from autoimmune diseases to organ transplant rejection.
Another area of clinical research impacted by reference bias is structural variant discovery in cancer genomes. Structural variants are relatively large genome alterations (tens to hundreds of bases or longer) and include deletions, insertions, tandem duplications, inversions, and translocations. Algorithms to identify structural variants largely rely on detecting patterns of discordant read pairs or split reads, an approach that depends on the accuracy of read mapping. Because of this, reference bias limits the detection of novel structural variants via next-generation sequencing, impacting the characterization of structural variants as drivers of tumor development.
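To make the dependence on read mapping concrete, the sketch below labels a read pair by the structural variant signature its mapped positions and orientations would conventionally suggest. It is an illustration under stated assumptions, not any particular caller's algorithm: the `ReadPair` type, the `classify_pair` function, and the insert-size thresholds are hypothetical, and the distance between leftmost positions stands in for the true insert size.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical, simplified view of one aligned read pair."""
    chrom1: str
    pos1: int       # leftmost mapped position of the first mate
    reverse1: bool  # True if the first mate maps to the reverse strand
    chrom2: str
    pos2: int       # leftmost mapped position of the second mate
    reverse2: bool  # True if the second mate maps to the reverse strand


def classify_pair(pair: ReadPair, min_insert: int = 200, max_insert: int = 600) -> str:
    """Label a pair as concordant or with the signature it resembles.

    The thresholds are assumed values; real tools estimate them from the
    library's observed insert-size distribution.
    """
    if pair.chrom1 != pair.chrom2:
        return "translocation-like"

    # Order the mates by position so orientation can be read left to right.
    left, right = sorted([(pair.pos1, pair.reverse1), (pair.pos2, pair.reverse2)])
    left_reverse, right_reverse = left[1], right[1]

    if left_reverse == right_reverse:
        return "inversion-like"    # both mates on the same strand
    if left_reverse and not right_reverse:
        return "duplication-like"  # mates face away from each other

    # Expected orientation (forward mate on the left); check the spacing.
    distance = right[0] - left[0]
    if distance > max_insert:
        return "deletion-like"     # mates map farther apart than expected
    if distance < min_insert:
        return "insertion-like"    # mates map closer together than expected
    return "concordant"


# A pair whose mates map 4,000 bases apart on the same chromosome.
print(classify_pair(ReadPair("chr1", 1000, False, "chr1", 5000, True)))
```

Every branch depends on where and in which orientation the mates were mapped, so a read placed at the wrong position, or left unmapped, either hides a real signature or creates a spurious one.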