Pairwise whole genome alignments can be used to determine conservation or differences between pairs of species or to match up regions between species, so as to study the same genomic region in multiple species. LastZ and its predecessor BlastZ are used to align the genome sequences at the DNA level.
CNEr identifies conserved noncoding elements (CNEs) from pairwise whole genome alignments of two species. UCSC provides alignments between many species on its downloads pages, so using those alignments when they are available is highly recommended. When alignments for new assemblies or species are not yet available from UCSC, this vignette describes a pipeline for generating them from soft-masked 2bit or FASTA files alone. This vignette is based on the UCSC genome wiki.
Classical whole genome alignment (WGA) tools use a four-phase, anchor-based strategy consisting of:
- Similarity detection (P1): computes pairs of genomic regions sharing sequence similarity, usually short, exact (or nearly exact) matches, e.g. MUMs, MEMs.
- Chaining (P2): selects a maximal subset of non-overlapping matches (computed in P1) that form the backbone of the alignment, i.e. the anchors; the maximization criterion depends mostly on length and similarity.
- Recursion (P3): any two facing regions located between adjacent anchors on each genome are treated as smaller sequences and aligned with the same procedure, i.e. by applying the first two phases (P1 + P2) recursively with adapted parameters, which completes the backbone with a second, complementary set of anchors.
- “Last chance alignment” (P4): uses classical alignment tools to compute global alignments between as yet unaligned facing regions. Alignments are performed and incorporated in the WGA based on different criteria depending on the aligner.
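The chaining phase (P2) can be illustrated with a small dynamic program. This is a minimal sketch, not the method of any particular aligner: the `Match` tuple layout, the function name `chain_matches`, and the choice of total match length as the score are assumptions made for the example, and real tools also weigh similarity and gap penalties.

```python
from typing import List, Tuple

# Hypothetical layout: (start in genome A, start in genome B, match length)
Match = Tuple[int, int, int]


def chain_matches(matches: List[Match]) -> List[Match]:
    """Return the colinear, non-overlapping subset with the largest total length."""
    if not matches:
        return []
    ordered = sorted(matches)            # sort by start position in genome A
    n = len(ordered)
    best = [m[2] for m in ordered]       # best chain score ending at match i
    prev = [-1] * n                      # back-pointer for traceback
    for i, (a_i, b_i, len_i) in enumerate(ordered):
        for j in range(i):
            a_j, b_j, len_j = ordered[j]
            # j may precede i only if it ends before i starts on both genomes
            if a_j + len_j <= a_i and b_j + len_j <= b_i:
                if best[j] + len_i > best[i]:
                    best[i] = best[j] + len_i
                    prev[i] = j
    # Trace back from the highest-scoring chain end
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    chain.reverse()
    return chain


matches = [(0, 0, 10), (12, 50, 5), (15, 14, 20), (40, 38, 8), (5, 100, 30)]
anchors = chain_matches(matches)
print(anchors)
```

The quadratic loop keeps the sketch short; with many matches a sparse dynamic programming formulation would be the more usual choice.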
Two types of pairwise genome alignment are available in Ensembl Genomes, based on LastZ (or its predecessor BlastZ) and translated BLAT (tBLAT). LastZ is typically used for closely related species, and tBLAT for more distant species. The method of alignment affects the coverage of the genomes, with tBLAT expected to mostly find homologies in coding regions.
The raw results from LastZ or tBLAT are alignment blocks, which are ‘chained’ according to their location in both genomes, then ‘netted’ to choose the best sub-chain in each region for the reference species. The resultant LastZ-net and tBLAT-net alignments are displayed in Ensembl Genomes for selected fungal, metazoan, protist and plant species.
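The netting idea can be sketched as a greedy pass over chains in order of decreasing score, where each chain keeps only the reference positions that no better chain has already claimed. This is a strong simplification under stated assumptions: the tuple layout `(score, ref_start, ref_end, name)` and the function `net_chains` are hypothetical, each chain is treated as one solid reference interval, and the hierarchical gap filling that real netting tools perform is not modelled.

```python
from typing import List, Tuple

# Hypothetical layout: (score, ref_start, ref_end, name), half-open coordinates
Chain = Tuple[int, int, int, str]


def net_chains(chains: List[Chain]) -> List[Tuple[int, int, str]]:
    """Assign each reference region to the highest-scoring chain covering it."""
    covered: List[Tuple[int, int, str]] = []   # non-overlapping claimed regions
    for score, start, end, name in sorted(chains, key=lambda c: -c[0]):
        pieces = [(start, end)]
        # Subtract every region already claimed by a better chain
        for c_start, c_end, _ in covered:
            remaining = []
            for s, e in pieces:
                if c_end <= s or e <= c_start:
                    remaining.append((s, e))           # no overlap
                else:
                    if s < c_start:
                        remaining.append((s, c_start))  # part left of the claim
                    if c_end < e:
                        remaining.append((c_end, e))    # part right of the claim
            pieces = remaining
        covered.extend((s, e, name) for s, e in pieces)
    return sorted(covered)


net = net_chains([(100, 0, 50, "c1"), (60, 40, 90, "c2"), (30, 10, 20, "c3")])
print(net)
```

In this toy input, chain `c3` lies entirely inside the better-scoring `c1` and therefore contributes nothing to the result.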
WGA methods generally start by finding local alignments between, and perhaps within, the genomes. The Smith–Waterman algorithm is the classical solution to the pairwise local alignment problem, but it is generally not used for WGA because its running time is quadratic in the size of the genomes, which can be large. Instead, most methods adopt a “seed-and-extend” approach for discovering high-scoring local alignments, much like BLAST. This approach first identifies short ungapped matches (seeds) between the sequences using one of a variety of data structures. It then extends each seed from both ends using a variant of the Smith–Waterman algorithm, stopping the extension when the score of the alignment drops below a specified threshold. In some cases, nearby local alignments that are consistent in order and orientation are “chained” together to form larger alignments.
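A seed-and-extend search can be sketched in a few lines. The example below is an illustration under simplifying assumptions, not the implementation of BLAST or LastZ: seeds are exact k-mers found through a hash index, extension is ungapped rather than a gapped Smith–Waterman variant, and the function names, scoring values, and `x_drop` threshold are all chosen for the example.

```python
from collections import defaultdict


def build_index(seq, k):
    """Map every k-mer of seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def extend(query, target, q, t, k, match=1, mismatch=-1, x_drop=3):
    """Ungapped extension of a seed; stop once the score falls x_drop below the best."""
    score = best = k * match
    best_right = i = 0
    while q + k + i < len(query) and t + k + i < len(target):
        score += match if query[q + k + i] == target[t + k + i] else mismatch
        i += 1
        if score > best:
            best, best_right = score, i
        elif best - score > x_drop:
            break
    score = best                      # continue leftwards from the best right end
    best_left = i = 0
    while q - 1 - i >= 0 and t - 1 - i >= 0:
        score += match if query[q - 1 - i] == target[t - 1 - i] else mismatch
        i += 1
        if score > best:
            best, best_left = score, i
        elif best - score > x_drop:
            break
    # (query start, target start, length, score)
    return (q - best_left, t - best_left, k + best_left + best_right, best)


def seed_and_extend(query, target, k=4, min_score=8):
    """Return the distinct extended matches scoring at least min_score."""
    index = build_index(target, k)
    hits = set()
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            hsp = extend(query, target, q, t, k)
            if hsp[3] >= min_score:
                hits.add(hsp)
    return sorted(hits)


hsps = seed_and_extend("TTACGTGCAATT", "GGGACGTGCAAGGG", k=4, min_score=8)
print(hsps)
```

Several seeds on the same diagonal extend to the same segment here, which is why the results are collected in a set before being returned.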
There are a number of techniques used for discovering seeds at the genomic scale for the “seed-and-extend” approach to local alignment. A first distinction between the techniques is whether they find exact or inexact matching seeds. Exact seed discovery is often faster and easier to implement, whereas inexact seeds offer better sensitivity.
Seed-finding techniques can often be improved by taking advantage of DNA evolutionary models. Spaced seeds require matches only at selected positions of a fixed pattern; their generalization, “subset seeds”, allows subsets of bases to be considered equivalent when determining whether there is a match at a given position. Subset seeds are particularly useful for taking into account that transitions are often more common than transversions in genome comparisons. The “translated” seed takes biologically informed substitution patterns further: it is a match at the amino acid level after translating the genomic sequences in all six possible reading frames. Translated seeds enable increased sensitivity in comparisons of more diverged genomes. Lastly, when aligning a genome to a set of genomes for which a multiple WGA has already been constructed, one can take into account the substitution patterns and ancestral sequences inferred from the WGA to devise more sensitive seeds.
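A subset seed that tolerates transitions can be checked position by position. The sketch below assumes an illustrative pattern alphabet in which `1` demands an exact match, `0` accepts any pair of bases, and `T` accepts a match or a transition (A/G or C/T); the function name `seed_matches` and this encoding are assumptions for the example rather than the interface of a specific tool.

```python
# Transitions are purine-purine (A<->G) or pyrimidine-pyrimidine (C<->T) changes
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def seed_matches(pattern, x, y):
    """Return True if words x and y match under the given seed pattern."""
    if not (len(pattern) == len(x) == len(y)):
        raise ValueError("pattern and both words must have the same length")
    for p, a, b in zip(pattern, x, y):
        if p == "1" and a != b:
            return False                      # exact match required here
        if p == "T" and a != b and (a, b) not in TRANSITIONS:
            return False                      # only a transition is tolerated
        # p == "0": any pair of bases is accepted at this position
    return True


print(seed_matches("1T01", "ACGT", "ATAT"))   # C/T transition at a 'T' position
print(seed_matches("1T01", "ACGT", "AAGT"))   # C/A transversion at a 'T' position
print(seed_matches("1111", "ACGT", "ATAT"))   # exact seed rejects any mismatch
```

A pattern made only of `1` symbols behaves as an exact seed, and a pattern of `1` and `0` symbols behaves as a plain spaced seed, so the same function covers the simpler cases.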
Our company, BioinfoLytics, is affiliated with BioCode and is a project covering many topics in genomics and proteomics and their analysis with a range of tools: sequence alignment and analysis, bioinformatics scripting and software development, phylogenetic and phylogenomic analysis, functional analysis, biological data analysis and visualization, custom analysis, biological database analysis, molecular docking, protein structure prediction, and molecular dynamics. These services are intended for BioCode users who want to develop their interests further and obtain the results their work requires. We provide a platform where one can find opportunities to learn, carry out research projects, run analyses, and get help and knowledge in molecular, computational, and analytical biology.
We offer a “Pairwise Genome Alignments” service to our customers for studying conservation or differences between pairs of closely related species, with the aim of supporting high-quality research and advancing science in the domain of genome analysis.