In BLASTp, the query sequence is broken into all possible 3-letter words using a moving window. A numerical score is calculated for each word by adding up the values for its amino acids from the BLOSUM62 matrix. Words with a score of 12 or more, that is, words with more highly conserved amino acids, are collected into the initial BLASTp search set. BLASTp next widens the search set by adding synonyms that differ from the words at one position. Only synonyms with scores above a threshold value are added to the search set. NCBI BLASTp uses a default threshold of 10 for synonyms, but this can be adjusted by the user. Using this search set, BLAST rapidly scans a database and identifies protein sequences that contain two or more words or synonyms from the search set. These sequences are set aside for the next phase of the BLASTp process, in which the short matches serve as seeds for longer alignments that are extended in both directions from the original match. BLAST keeps a running raw score as it extends the matches. Each new amino acid either increases or decreases the raw score. Penalties are given for mismatches and for gaps in the alignment. In the NCBI default settings, opening a gap brings an initial penalty of 11, which increases by 1 for each missing amino acid. Once the score falls below a set level, the extension stops. Raw scores are then converted into bit scores, which are normalized for the scoring matrix and gap penalties used in the search so that scores from different searches can be compared. The size of the database search space is taken into account when the E-value is calculated from the bit score.
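The word and synonym steps described above can be sketched in Python. This is a minimal illustration under stated assumptions, not BLAST's implementation: the function names are hypothetical, the table holds only the BLOSUM62 entries for four amino acids (I, L, V, M), each synonym is scored against the original word, and a synonym is kept when its score is at or above the threshold.

```python
# BLOSUM62 entries for four hydrophobic amino acids only; a real search
# uses the full 20 x 20 matrix. Each unordered pair is stored once.
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("L", "L"): 4, ("V", "V"): 4, ("M", "M"): 5,
    ("I", "L"): 2, ("I", "V"): 3, ("I", "M"): 1,
    ("L", "V"): 1, ("L", "M"): 2, ("V", "M"): 1,
}


def pair_score(a, b):
    """Look up a substitution score regardless of the order of the pair."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def query_words(query, size=3):
    """Return every overlapping word of `size` letters (moving window)."""
    return [query[i:i + size] for i in range(len(query) - size + 1)]


def word_score(word, other):
    """Sum the matrix values position by position for two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(word, other))


def synonyms(word, alphabet, threshold):
    """Return one-position variants of `word` scoring >= threshold.

    Assumption: the variant is scored against the original word, and a
    score equal to the threshold is accepted.
    """
    found = []
    for pos, original in enumerate(word):
        for aa in alphabet:
            if aa == original:
                continue
            variant = word[:pos] + aa + word[pos + 1:]
            score = word_score(word, variant)
            if score >= threshold:
                found.append((variant, score))
    return found


print(query_words("MILVL"))      # ['MIL', 'ILV', 'LVL']
print(word_score("ILV", "ILV"))  # 12
print(synonyms("ILV", "ILVM", threshold=10))
# [('LLV', 10), ('VLV', 11), ('IIV', 10), ('IMV', 10), ('ILI', 11)]
```

With the threshold of 10 quoted above, five of the nine possible one-position variants of ILV in this reduced alphabet are kept; raising the threshold shrinks the search set and makes the database scan more selective.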
In some cases, the alignment may not extend along the entire length of the protein, or there may be gaps between aligned regions of the sequences. “Max score” is the bit score for the aligned region with the highest score. “Total score” adds the bit scores for all aligned regions. When there are no gaps between aligned regions, that is, when there is a single aligned region, the total and max scores are the same. “Query cover” is the fraction of the query sequence where the alignment score is above the threshold value. BLASTp also reports the percentage of aligned amino acids that are identical in the two sequences as “Ident.”
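As a rough illustration of how these summary values relate to the individual aligned regions, the following Python sketch computes them from a list of regions. The `AlignedRegion` record, its field names, and the `summarize` function are hypothetical, and query cover is assumed here to be the fraction of query positions that fall inside at least one aligned region.

```python
from dataclasses import dataclass


@dataclass
class AlignedRegion:
    query_start: int   # first aligned query position (1-based, inclusive)
    query_end: int     # last aligned query position (inclusive)
    bit_score: float   # bit score of this region
    identities: int    # number of identical aligned amino acids
    length: int        # number of aligned positions in the region


def summarize(regions, query_length):
    """Compute max score, total score, query cover, and identity."""
    covered = set()
    for region in regions:
        # Overlapping regions are counted once because positions go in a set.
        covered.update(range(region.query_start, region.query_end + 1))
    return {
        "max_score": max(r.bit_score for r in regions),
        "total_score": sum(r.bit_score for r in regions),
        "query_cover": len(covered) / query_length,
        "ident": sum(r.identities for r in regions)
                 / sum(r.length for r in regions),
    }


# Two made-up regions on a 200-residue query, overlapping at positions 41-50.
example = [
    AlignedRegion(query_start=1, query_end=50, bit_score=80.0,
                  identities=40, length=50),
    AlignedRegion(query_start=41, query_end=100, bit_score=60.0,
                  identities=30, length=60),
]
summary = summarize(example, query_length=200)
print(summary["max_score"], summary["total_score"], summary["query_cover"])
# 80.0 140.0 0.5
```

With a single region in the list, `max_score` and `total_score` come out equal, which matches the statement above.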
Blast Protein performs protein sequence searches using a BLAST web service hosted by the UCSF Resource for Biocomputing, Visualization, and Informatics (RBVI).
- pdb (default): sequences of structures in the Protein Data Bank (PDB).
- nr: all non-redundant GenBank CDS translations + RefSeq Proteins + PDB + SwissProt + PIR + PRF.
- E-value (1e-X): significance cutoff; only matches with E-values ≤ 10^-X will be returned
- Matrix (BLOSUM45/BLOSUM62/BLOSUM80/PAM30/PAM70): amino acid substitution matrix to use for alignment scoring
- Passes (default 1): number of PSI-BLAST iterations; 1 pass is equivalent to BLAST, whereas multiple passes will find more distantly related sequences
- List only best-matching chain per PDB entry (default): whether multiple hits with the same PDB identifier (matches to different chains in that PDB entry) or only the best-matching one should be included in the results
- Identifying common regions between proteins
- Collecting related proteins for phylogenetic analyses
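The effect of the E-value cutoff and the best-matching-chain option described above can be illustrated with a short Python sketch. The hit records, their field names, and the `filter_hits` function are hypothetical and are not part of the Blast Protein tool; "best-matching" is assumed here to mean the hit with the lowest E-value.

```python
def filter_hits(hits, x, best_chain_only=True):
    """Keep hits with E-value <= 10^-x, optionally one chain per PDB entry."""
    cutoff = 10.0 ** -x
    kept = [hit for hit in hits if hit["evalue"] <= cutoff]
    if not best_chain_only:
        return kept

    # Assumption: the best-matching chain is the one with the lowest E-value.
    best = {}
    for hit in kept:
        current = best.get(hit["pdb_id"])
        if current is None or hit["evalue"] < current["evalue"]:
            best[hit["pdb_id"]] = hit
    return list(best.values())


# Made-up hits: two chains of one PDB entry and one weaker hit.
hits = [
    {"pdb_id": "1abc", "chain": "A", "evalue": 1e-20},
    {"pdb_id": "1abc", "chain": "B", "evalue": 1e-10},
    {"pdb_id": "2xyz", "chain": "A", "evalue": 1e-2},
]

for hit in filter_hits(hits, x=3):
    print(hit["pdb_id"], hit["chain"], hit["evalue"])
# 1abc A 1e-20
```

With `best_chain_only=False`, both chains of 1abc are returned; the 2xyz hit is dropped in either case because its E-value is larger than 10^-3.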