BLAST has many modes of operation. One of them, tBLASTn, aligns an amino acid query sequence to a database of nucleotide sequences, which are often either fragments of a genome or cDNAs representing expressed genes.
tBLASTn (protein sequence searched against translated nucleotide sequences) compares a protein query sequence against the six-frame translations of a database of nucleotide sequences. tBLASTn is useful for finding homologous protein-coding regions in unannotated nucleotide sequences such as expressed sequence tags (ESTs) and draft genome records (high-throughput genomic sequences, HTG), present in the BLAST databases ESTs and HTGs, respectively. ESTs are short, single-read cDNA sequences. They comprise the largest pool of sequence data for many organisms and contain portions of transcripts from many uncharacterized genes. Since ESTs have no annotated coding sequences, there are no corresponding protein translations in the BLAST protein databases. Hence a tBLASTn search is the only way to search for these potential coding regions at the protein level. The HTG sequences, which are draft sequences from various genome projects or large genomic clones, are another large source of unannotated coding regions.
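To make the idea of a six-frame translation concrete, here is a minimal Python sketch. It is not BLAST's implementation: the function names are hypothetical, the standard genetic code is assumed, codons containing ambiguous bases are rendered as `X`, and the frame labels (+1 to +3 for the given strand, -1 to -3 for its reverse complement) are a labeling choice made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame label (+1..+3, -1..-3) to the hypothetical protein string."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for label in (1, 2, 3, -1, -2, -3):
    print(f"{label:+d}: {frames[label]}")
```

A protein query is then compared against each of the six strings; `*` marks a stop codon.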
tBLASTn searches take hundreds of times longer to run than BLASTn searches. Searching all sequences in this database would take 10 minutes or more per query sequence, so searches are restricted to single strains.
There are three critical differences between using a translated nucleotide database and using a database of known proteins or protein fragments. First, each sequence stored in a nucleotide database may contain more than one coding region, in the same or in different translation frames. Second, most of the hypothetical amino acid sequence data generated by translating a nucleotide sequence does not correspond to any protein at all, because the locations of open reading frames (ORFs) in the nucleotide database are not provided to tBLASTn. One is therefore often translating either a noncoding region or a coding region in the wrong frame. Third, a genome is split into distinct database sequences at locations that are sometimes chosen arbitrarily, for example when bacterial artificial chromosomes (BACs) are used to obtain the sequence data.
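The second point can be illustrated with a small sketch that measures, for each of the six frames, the longest stretch of codons uninterrupted by a stop codon. Frames that are noncoding, or coding but read in the wrong frame, tend to be broken up by stops. The function names are hypothetical and the standard stop codons (TAA, TAG, TGA) are assumed; this is an illustration, not part of tBLASTn.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def longest_open_run(strand):
    """Longest run of consecutive non-stop codons, reading from position 0."""
    best = current = 0
    for i in range(0, len(strand) - 2, 3):
        if strand[i:i + 3] in STOP_CODONS:
            current = 0  # a stop codon ends the current open stretch
        else:
            current += 1
            best = max(best, current)
    return best


def open_run_by_frame(dna):
    """Map frame label (+1..+3, -1..-3) to its longest stop-free run in codons."""
    dna = dna.upper()
    rc = dna.translate(COMPLEMENT)[::-1]
    runs = {}
    for offset in range(3):
        runs[offset + 1] = longest_open_run(dna[offset:])
        runs[-(offset + 1)] = longest_open_run(rc[offset:])
    return runs


print(open_run_by_frame("ATGGCCTAA"))
```

On real genomic sequence, a long stop-free run in one frame is only a hint that the region may be coding; it does not by itself identify a gene.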
For tBLASTn, composition-based statistics are applied to “windows” of hypothetical amino acid data. Each window covers part of a hypothetical coding region for a protein and is specified by a substring of nucleotide data and a translation frame. BLAST identifies windows that contain likely coding regions and then uses them to compute the composition of the hypothetical proteins. Focusing on the smaller regions of the database, and the frames, most likely to encode real amino acid sequence captures enough information about the composition of the hypothetical protein to assess the significance of an alignment accurately.
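As a rough illustration of computing composition over a window, the sketch below takes one translated frame and an aligned region within it, expands the region to the nearest flanking stop codons, and reports amino acid frequencies inside that window. The expansion rule, the function names, and the coordinates (amino acid positions within a single frame) are assumptions made for this example; BLAST's actual window selection is more involved.

```python
from collections import Counter


def window_around_hit(translated, hit_start, hit_end):
    """Expand the half-open range [hit_start, hit_end) to the flanking stops."""
    left = translated.rfind("*", 0, hit_start) + 1  # 0 if no stop to the left
    right = translated.find("*", hit_end)
    if right == -1:
        right = len(translated)  # no stop to the right: run to the end
    return left, right


def composition(translated, start, end):
    """Amino acid frequencies in translated[start:end], ignoring '*' and 'X'."""
    residues = [aa for aa in translated[start:end] if aa not in "*X"]
    total = len(residues)
    if total == 0:
        return {}
    return {aa: n / total for aa, n in Counter(residues).items()}


frame = "GL*MKKAL*RP"               # one hypothetical translated frame
start, end = window_around_hit(frame, 4, 6)  # hit covers the two K residues
print(frame[start:end], composition(frame, start, end))
```

Restricting the count to the window keeps residues from neighboring noncoding translation out of the composition estimate.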
Typical uses of tBLASTn include:
- Identifying transcripts, potentially from multiple organisms, similar to a given protein
- Mapping a protein to genomic DNA