Motifs are short, biologically significant sequence patterns in DNA, RNA, or protein sequences. Discovering such motifs is an important task in molecular biology. Characterizing and localizing motifs is fundamental to understanding the structure, function, and evolutionary relationships of the corresponding genes or proteins.
The Motif Finder, which consists of the class MotifFinder and the function findMotif, provides four motif-finding algorithms: two heuristic algorithms, PROJECTION and ePatternBranching, and two exact algorithms, PMS1 and PMSP.
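To make the idea of an exact algorithm concrete, here is a minimal sketch of the neighborhood-intersection approach that exact planted-motif methods such as PMS1 are built around. It assumes the common (l, d) formulation: find every string of length l that occurs in each input sequence with at most d mismatches. The names `neighbors` and `planted_motifs` are hypothetical and chosen for illustration only; this is not the MotifFinder class or the findMotif function, and it makes no attempt at the optimizations a real implementation would need.

```python
from itertools import combinations, product

ALPHABET = "ACGT"  # assumption: plain DNA input, no ambiguity codes


def neighbors(kmer, d):
    """Return every string within Hamming distance d of kmer."""
    result = set()
    for k in range(d + 1):
        # Choose which k positions to mutate.
        for positions in combinations(range(len(kmer)), k):
            # At each chosen position, try every letter except the original.
            choices = [[c for c in ALPHABET if c != kmer[p]] for p in positions]
            for letters in product(*choices):
                chars = list(kmer)
                for p, c in zip(positions, letters):
                    chars[p] = c
                result.add("".join(chars))
    return result


def planted_motifs(sequences, l, d):
    """Return all l-mers that occur in every sequence with at most d mismatches."""
    candidates = None
    for seq in sequences:
        # Collect the d-neighborhood of every l-mer in this sequence.
        seq_neighbors = set()
        for i in range(len(seq) - l + 1):
            seq_neighbors |= neighbors(seq[i:i + l], d)
        # A motif must appear in the neighborhood of every sequence.
        candidates = seq_neighbors if candidates is None else candidates & seq_neighbors
    return sorted(candidates or [])


if __name__ == "__main__":
    seqs = ["TTACGTTT", "GGACCTGG", "CCAGGTCC"]
    print(planted_motifs(seqs, l=4, d=1))
```

The neighborhood of each l-mer grows quickly with l and d, which is why heuristic methods trade the guarantee of finding every motif for speed on larger instances.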
The protein motif extraction algorithm (MotifFinder) has previously been implemented in Java. It has since been investigated with the aim of improving its speed and its performance in identifying new protein motifs.
The algorithm’s objective is to discover motif(s) or consensus patterns from within sequences belonging to the same protein family. The algorithm consists of four main steps: Sequence Preprocessing, Motif Generation, Motif Selection and Motif Optimisation.
The results are displayed as features in two new tracks. By default, the results from the positive strand are displayed in blue, and results from the negative strand in red. To change the color, right-click on the track and select Change Track Color from the pop-up menu.
HOMER includes certain motif databases that are used to help annotate results and conduct searches for known motifs. Each database is composed of a set of HOMER-formatted motif files.
1. Bring up the motif finder dialog, via Tools>Find Motif.
2. Enter the sequence to search for, using one of the following three formats:
- A sequence of nucleotides.
- A sequence of nucleotides with IUPAC ambiguity codes.
For example, let’s say we want to find bacterial promoter upstream elements consisting of 6 adenines (A), followed by a purine (A or G), then any nucleotide (A, C, G, or T) and finally another purine (A or G). We would enter the sequence “AAAAAARNR”.
- A regular expression that follows Java regex syntax.
For example, to find occurrences of the canonical TATA box sequences TATAAAAA, TATATAAA, and TATAAATA, we can enter the regular expression “TATA[AT]A[AT]A”. Regular expressions are particularly useful for finding variable length sequences. For example, to search for the sequence TATAAA, optionally followed by any number of additional adenines, enter the regular expression “TATAAA+”.
3. Enter names for the feature tracks that will show where the sequence matches the positive and negative strands of the reference genome.
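The search described in the steps above can be approximated outside the dialog. The following is a rough sketch, not the tool's actual implementation: it expands IUPAC ambiguity codes into a regular expression and scans both strands of a reference sequence. The helper names (`iupac_to_regex`, `reverse_complement`, `find_motif`) are hypothetical. The sketch assumes an uppercase reference containing only A, C, G and T, reports 0-based half-open coordinates on the positive strand, and uses a lookahead so that overlapping occurrences are all reported, which is a design choice of this example. The patterns shown in this section behave the same in Python's `re` module as in Java regex syntax.

```python
import re

# Standard IUPAC nucleotide ambiguity codes, expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def iupac_to_regex(pattern):
    """Translate an IUPAC nucleotide pattern into an equivalent regex."""
    return "".join(IUPAC[ch] for ch in pattern.upper())


def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(reference, regex):
    """Return (positive, negative) lists of (start, end) hits.

    Coordinates are 0-based, half-open, and always given on the positive strand.
    """
    # Wrapping the pattern in a lookahead lets finditer report overlapping hits.
    scanner = re.compile("(?=(" + regex + "))")
    positive = [(m.start(1), m.end(1)) for m in scanner.finditer(reference)]
    # Search the reverse complement, then map hits back to forward coordinates.
    n = len(reference)
    rc = reverse_complement(reference)
    negative = sorted((n - m.end(1), n - m.start(1)) for m in scanner.finditer(rc))
    return positive, negative


if __name__ == "__main__":
    ref = "CCAAAAAAGTACC"
    print(find_motif(ref, iupac_to_regex("AAAAAARNR")))
    print(find_motif("GGTATAAAAAGG", "TATAAA+"))
```

The two returned lists correspond to the two feature tracks named in step 3, one for the positive strand and one for the negative strand.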
Because the sequence we entered is short, the search returns a large number of hits. Looking at the results directly upstream of the gene GBP4, we see one match on the positive strand and two on the negative strand. Note that by default, the search result tracks are displayed in Expanded mode, so we can see overlapping matches.