Motifs are important sequence patterns with many applications, including the identification of transcription factors and their binding sites, composite regulatory patterns, and similarities between families of proteins. Motif search helps solve several crucial biological problems. For example, finding DNA motifs is important for determining open reading frames, identifying gene promoter elements, locating RNA degradation signals, and identifying alternative splicing sites. For more than 15 years, motif search has attracted considerable interest from researchers in different areas.
CLC Genomics Workbench offers advanced and versatile options to search for known motifs represented either by a simple sequence or a more advanced regular expression. These advanced search capabilities are available for use in both DNA and protein sequences.
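To illustrate the difference between a simple sequence motif and a regular-expression motif, here is a minimal Python sketch. It is not how CLC Genomics Workbench implements its search; the function name `find_motif` and the example sequences are hypothetical, and only the standard `re` module is used.

```python
import re

def find_motif(sequence, pattern):
    """Return (start, matched_text) for every match, including overlapping ones."""
    # Wrapping the pattern in a lookahead lets finditer report overlapping matches,
    # because the lookahead itself consumes no characters.
    regex = re.compile(f"(?=({pattern}))", re.IGNORECASE)
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]

# DNA example: a TATA-box-like pattern with one ambiguous position (made-up sequence).
dna = "ATGTATAAAGGCTATATAGC"
dna_hits = find_motif(dna, "TATA[AT]A")

# Protein example: the N-glycosylation site pattern N-{P}-[ST]-{P},
# written as a regular expression (made-up sequence).
protein = "MKNGSAANPTQ"
protein_hits = find_motif(protein, "N[^P][ST][^P]")

print(dna_hits)
print(protein_hits)
```

The same function handles both sequence types because a plain sequence such as `TATAAA` is itself a valid regular expression.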
MotifSearch uses a set of profiles, which represent similarities within a family of sequences, either to search a database for new sequences similar to the original family or to annotate the members of the original family with details of the matches between the profiles and each member. Normally, the profiles are created with the program MEME.
MotifSearch searches a set of sequences for one or more ungapped profiles, using the method of Bailey and Gribskov. The profiles and the sequences should all be of the same type, protein or nucleotide.
The algorithm calculates position scores for each profile at each possible position within a sequence. These scores are translated into p-values, each of which represents the probability that the given profile would score at least that well against a randomly generated sequence. The best position p-value for each profile is then adjusted to take into account the length of the sequence. These adjusted p-values are then used to calculate a combined p-value, which is the p-value of the product of the adjusted p-values. MotifSearch never tries to introduce gaps in the profiles or in the search sequence. Any gapping information in the profiles is ignored.
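The last two steps, the length adjustment and the combined p-value, can be sketched in Python as follows. This is an illustration under stated assumptions, not MotifSearch's own code: the function names are hypothetical, the length adjustment assumes that the possible positions score independently, and the combined p-value uses the standard closed form for the product of n independent, uniformly distributed p-values, which is assumed here to correspond to the method described above.

```python
import math

def adjust_for_length(position_p, seq_len, motif_width):
    """P-value of the best position score, given all placements in the sequence.

    Assumption: the (seq_len - motif_width + 1) placements are independent.
    """
    n_positions = seq_len - motif_width + 1
    if n_positions < 1 or position_p >= 1.0:
        return 1.0
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(n_positions * math.log1p(-position_p))

def combined_p_value(p_values):
    """P-value of the product of n independent, uniformly distributed p-values.

    Uses P(product <= x) = x * sum_{i=0}^{n-1} (-ln x)^i / i!
    """
    n = len(p_values)
    product = math.prod(p_values)
    if product <= 0.0:
        return 0.0
    neg_log = -math.log(product)
    term, total = 1.0, 1.0
    for i in range(1, n):
        term *= neg_log / i      # term is now (-ln x)^i / i!
        total += term
    return min(1.0, product * total)

# Hypothetical best position p-values for three profiles on a 300-residue sequence
best_position_p = [1e-6, 5e-5, 2e-4]
widths = [12, 8, 15]
adjusted = [adjust_for_length(p, 300, w) for p, w in zip(best_position_p, widths)]
print(combined_p_value(adjusted))
```

With a single profile the combined p-value reduces to the adjusted p-value itself, which is a quick sanity check on the formula.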
MotifSearch reports its results as a sorted list file of the best-scoring sequences, or alternatively the subsequences that correspond to individual profile hits. It can also generate an RSF file in which profile hits appear as annotated features.
By default, MotifSearch generates a list file as output. This file contains all the sequences whose combined p-value was below the threshold specified during the run. Each entry includes the number of different motifs that hit the sequence as well as the total number of hits. For example, if two of three profiles scored hits and one of those two scored twice, the list reports 2 motifs with hits and 3 hits in total.
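The two counts in each list entry can be reproduced with a few lines of Python. The hit representation below (a list of profile-name and position pairs) and the function name `summarize_hits` are hypothetical and chosen only for illustration; they do not reflect the actual list file format.

```python
from collections import Counter

def summarize_hits(hits):
    """Return (number of distinct motifs with hits, total number of hits).

    hits: list of (profile_name, position) tuples for one sequence.
    """
    per_profile = Counter(name for name, _ in hits)
    return len(per_profile), sum(per_profile.values())

# Three profiles were searched; motif1 hit twice, motif3 once, motif2 not at all.
hits = [("motif1", 12), ("motif1", 87), ("motif3", 40)]
motifs_with_hits, total_hits = summarize_hits(hits)
print(motifs_with_hits, total_hits)
```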
MotifSearch supports the following tasks:
- Search with a protein query sequence against motif libraries such as PROSITE, NCBI-CDD and Pfam
- Calculate two types of profile data, in PROSITE or Pfam format, from multiple sequence alignments using PFMake or HMMBuild, respectively
- Align a protein sequence with a profile supplied by the user
- Search with a profile against protein sequence databases
- Search with a protein sequence pattern (regular expression) against sequence databases
- Generate a profile from a set of multiple aligned sequences
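As a rough illustration of the last item, the sketch below builds a simple ungapped log-odds profile from aligned sequences and slides it along a target sequence. It is not the algorithm used by PFMake, HMMBuild or MEME: the function names, the uniform background, the pseudocount of 1 and the example sequences are all assumptions made for the example.

```python
import math
from collections import Counter

def build_profile(aligned, alphabet="ACGT", pseudocount=1.0):
    """Build a per-column log-odds profile from equal-length aligned sequences."""
    width = len(aligned[0])
    if any(len(s) != width for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)            # assumed uniform background
    total = len(aligned) + pseudocount * len(alphabet)
    profile = []
    for col in range(width):
        counts = Counter(s[col] for s in aligned)
        profile.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in alphabet
        })
    return profile

def best_hit(profile, sequence):
    """Score every ungapped placement and return (position, score) of the best.

    Assumes the sequence is at least as long as the profile and contains
    only characters from the profile's alphabet.
    """
    width = len(profile)
    scores = [
        (sum(profile[i][sequence[pos + i]] for i in range(width)), pos)
        for pos in range(len(sequence) - width + 1)
    ]
    score, pos = max(scores)
    return pos, score

# Made-up alignment of a short promoter-like element
aligned = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
profile = build_profile(aligned)
pos, score = best_hit(profile, "GGCTATAATGC")
print(pos, round(score, 2))
```

Because the profile is scored without gaps, each placement costs one table lookup per column, which keeps a database scan linear in the total sequence length.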