The ab initio method for predicting eukaryotic promoters and regulatory elements also relies on searching the input sequences for matches to consensus patterns of known promoters and regulatory elements. The consensus patterns are derived from experimentally determined DNA binding sites, which are compiled into profiles and stored in a database; an unknown sequence is then scanned against these profiles to find similar conserved patterns. However, this approach tends to produce a very high rate of false positives because the short sequence patterns match nonspecifically. Furthermore, because transcription factor binding sites are highly variable, simple sequence matching often misses true promoter sites, producing false negatives.
To increase the specificity of prediction, a characteristic feature of many eukaryotic promoters is exploited: the presence of CpG islands. Many vertebrate genes are characterized by a high density of CG dinucleotides near the promoter region, overlapping the transcription start site. Once CpG islands are identified, promoters can be sought in the region immediately upstream of the islands. Combining CpG islands with other promoter signals improves the accuracy of prediction. Several programs have been developed that use these combined features, in particular to predict transcription start sites.
The initiation of eukaryotic transcription requires the cooperation of a large number of transcription factors. This cooperativity means that promoter regions tend to contain a high density of protein-binding sites. Thus, finding a cluster of transcription factor binding sites often increases confidence in the prediction of each individual binding site.
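
The false-positive problem with short consensus patterns can be made concrete with a small sketch. The code below is an illustration only: the consensus `TATAWAW`, the toy sequence, and the function names are assumptions introduced here, not taken from any of the programs discussed. It scans a sequence for a degenerate (IUPAC) consensus and estimates how often the same pattern would match by chance in a uniform random sequence.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_consensus(sequence, consensus):
    """Return 0-based start positions where the consensus matches (overlaps allowed)."""
    pattern = "".join(IUPAC[base] for base in consensus.upper())
    # A zero-width lookahead lets overlapping matches be reported.
    return [m.start() for m in re.finditer(f"(?={pattern})", sequence.upper())]

def expected_chance_hits(consensus, length):
    """Expected matches on one strand of a uniform random sequence of this length."""
    p = 1.0
    for base in consensus.upper():
        # Number of nucleotides the code allows, divided by four.
        p *= len(IUPAC[base].strip("[]")) / 4
    return p * (length - len(consensus) + 1)

# Illustrative TATA-box-like consensus (an assumption for this sketch).
TATA_CONSENSUS = "TATAWAW"

seq = "GGCTATAAAAGGCGCGTATATATCC"
print(find_consensus(seq, TATA_CONSENSUS))                     # [3, 16]
print(round(expected_chance_hits(TATA_CONSENSUS, 1_000_000)))  # 244
```

Under the uniform-base assumption, this 7-base pattern is expected to match about 244 times per megabase on a single strand purely by chance, which is why consensus matching alone yields so many false positives.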
CpGProD is a web-based program that predicts promoters associated with CpG islands in mammalian genomic sequences. It calculates moving averages of GC% and CpG ratios (observed/expected) over a window of a specified size. When the values exceed certain thresholds, the region is identified as a CpG island.
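
The sliding-window calculation can be sketched as follows. This is a minimal illustration, not CpGProD's implementation: the function names are hypothetical, and the defaults (200-base window, GC fraction of at least 0.5, observed/expected CpG ratio of at least 0.6) are commonly cited CpG island criteria used here as assumptions rather than the program's own settings.

```python
def cpg_window_stats(window):
    """Return (GC fraction, CpG observed/expected ratio) for one window."""
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")
    gc_fraction = (c + g) / n
    # Expected CpG count if C and G were positioned independently.
    expected = c * g / n
    ratio = cpg / expected if expected > 0 else 0.0
    return gc_fraction, ratio

def find_cpg_islands(sequence, window=200, step=1, min_gc=0.5, min_ratio=0.6):
    """Merge overlapping above-threshold windows into (start, end) regions."""
    sequence = sequence.upper()
    islands = []
    for start in range(0, len(sequence) - window + 1, step):
        gc, ratio = cpg_window_stats(sequence[start:start + window])
        if gc >= min_gc and ratio >= min_ratio:
            end = start + window
            if islands and start <= islands[-1][1]:
                # Extend the previous region instead of starting a new one.
                islands[-1] = (islands[-1][0], end)
            else:
                islands.append((start, end))
    return islands

# Toy sequence: a CG-rich core flanked by AT repeats.
toy = "AT" * 150 + "CG" * 150 + "AT" * 150
print(find_cpg_islands(toy))  # [(200, 700)]
```

Because whole windows are merged, the reported boundaries extend into the flanking sequence (the CG-rich core above spans positions 300 to 600); a real tool would refine the edges.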
Eponine is a web-based program that predicts transcription start sites based on a series of preconstructed PSSMs of certain regulatory sites, such as the TATA box, the CCAAT box, and CpG islands. A query sequence from a mammalian source is scanned with the PSSMs. Sequence stretches that yield high-scoring matches to all the PSSMs, with matching spacing between the elements, are declared transcription start sites. A Bayesian method is also used in the decision making.
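
Scanning with a single PSSM can be sketched as below. This covers only the matrix-scoring step; it does not model the spacing constraints or the Bayesian decision step mentioned above, and it is not Eponine's code. The aligned toy sites, the pseudocount, the uniform background, and the threshold are all assumptions chosen for illustration, and the input is assumed to contain only uppercase A, C, G, and T.

```python
import math

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM (one dict per column) from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pssm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pssm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pssm

def scan(sequence, pssm, threshold):
    """Return (position, score) for every window scoring at or above the threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(pssm, window))
        if score >= threshold:
            hits.append((start, round(score, 2)))
    return hits

# Hypothetical aligned TATA-like sites used to build the matrix.
sites = ["TATAAA", "TATATA", "TATAAA", "TATAAG"]
pssm = build_pssm(sites)
print(scan("GGGTATAAAGGG", pssm, threshold=5.0))  # [(3, 7.29)]
```

Lowering the threshold recovers more divergent sites at the cost of more chance matches, which is the trade-off between false negatives and false positives described earlier.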
Cluster-Buster is an HMM-based, web-based program designed to find clusters of regulatory binding sites. It works by detecting regions with a high concentration of known transcription factor binding sites and regulatory motifs. A query sequence is scanned with a window size of 1 kb for putative regulatory motifs using motif HMMs. If multiple motifs are detected within a window, a positive score is assigned to each motif found. The total score of the window is the sum of the motif scores minus a gap penalty, which is proportional to the distances between motifs.
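
The window-scoring rule described above can be sketched in simplified form. This is not Cluster-Buster's HMM; it only illustrates "sum of motif scores minus a distance-proportional gap penalty" on precomputed motif hits. The function names, the penalty of 0.01 per base, and the example hits are assumptions made for illustration.

```python
def window_score(hits, gap_penalty=0.01):
    """Score one window: motif scores summed, minus a penalty per base between motifs.

    hits is a list of (position, score) tuples sorted by position.
    """
    if len(hits) < 2:
        return 0.0  # a single motif does not form a cluster
    motif_total = sum(score for _, score in hits)
    gap_total = sum(b[0] - a[0] for a, b in zip(hits, hits[1:]))
    return motif_total - gap_penalty * gap_total

def best_cluster(hits, window=1000, gap_penalty=0.01):
    """Anchor a window at each hit; return (score, first_pos, last_pos) of the best one."""
    hits = sorted(hits)
    best = (0.0, None, None)
    for i, (start, _) in enumerate(hits):
        in_window = [h for h in hits[i:] if h[0] < start + window]
        score = window_score(in_window, gap_penalty)
        if score > best[0]:
            best = (score, in_window[0][0], in_window[-1][0])
    return best

# Hypothetical motif hits as (position, score); three are close together, one is isolated.
hits = [(100, 5.0), (150, 4.0), (220, 6.0), (5000, 7.0)]
score, first, last = best_cluster(hits)
print(first, last)  # 100 220
```

In this example the three nearby hits score roughly 13.8 together (15.0 minus 0.01 times 120 bases of gap), while the isolated hit at position 5000 contributes nothing despite having the highest individual score.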
FirstEF (First Exon Finder) is a web-based program that finds promoters in human DNA.
McPromoter is a web-based program that applies a neural network to make promoter predictions. It has a unique promoter model containing six scoring segments. The program scans a window of 300 bases and estimates the likelihood that it lies in a coding, noncoding, or promoter region. The input for the neural network includes parameters for physical properties of the sequence, such as DNA bendability, plus signals such as the TATA box, initiator box, and CpG islands. The hidden layer combines all the features to derive an overall likelihood of a site being a promoter.
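
How a hidden layer combines several features into one likelihood can be illustrated with a toy forward pass. This is only a generic one-hidden-layer network, not McPromoter's model: the four input features, their scaling to the range 0 to 1, and every weight and bias below are made-up values for illustration, whereas a real network learns its weights from training data.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def promoter_likelihood(features, hidden_weights, hidden_biases,
                        output_weights, output_bias):
    """Forward pass of a one-hidden-layer network; returns a value in (0, 1)."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)

# Hypothetical inputs: [bendability, TATA score, initiator score, CpG score].
strong_site = [0.4, 0.9, 0.7, 0.8]
weak_site = [0.0, 0.0, 0.0, 0.0]

# Made-up parameters for two hidden units and one output unit.
hidden_weights = [[0.5, 2.0, 1.0, 1.5], [-1.0, 0.5, 0.5, 2.0]]
hidden_biases = [-2.0, -1.0]
output_weights = [2.5, 1.5]
output_bias = -2.0

for site in (strong_site, weak_site):
    print(promoter_likelihood(site, hidden_weights, hidden_biases,
                              output_weights, output_bias))
```

With these illustrative parameters, the site with strong signals receives a clearly higher output than the site with none, which is the behavior a trained network is meant to capture.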
TSSW is a web-based program that distinguishes promoter sequences from non-promoter sequences on the basis of a combination of content information, such as hexamer/trimer frequencies, and signal information, such as the TATA box in the promoter region.
CONPRO is a web-based program that implements a consensus method to find promoter elements in human DNA.
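
The content information that TSSW draws on, such as hexamer frequencies, can be illustrated with a simple log-likelihood ratio score. This sketch is not TSSW's algorithm: the function names, the pseudocount, and the tiny training sets are assumptions for illustration. It compares how often each hexamer in a query occurs in promoter training sequences versus non-promoter training sequences.

```python
import math
from collections import Counter

def kmer_frequencies(sequences, k=6, pseudocount=1.0):
    """Return a function giving the smoothed frequency of any k-mer in the training set."""
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    # The pseudocount keeps unseen k-mers from having zero frequency.
    total = sum(counts.values()) + pseudocount * 4 ** k
    return lambda kmer: (counts[kmer] + pseudocount) / total

def content_score(sequence, promoter_freq, background_freq, k=6):
    """Sum of log-likelihood ratios over all k-mers; positive favors the promoter model."""
    return sum(
        math.log(promoter_freq(sequence[i:i + k]) / background_freq(sequence[i:i + k]))
        for i in range(len(sequence) - k + 1)
    )

# Toy training sets (hypothetical): GC-rich "promoters" and AT-rich "non-promoters".
promoter_freq = kmer_frequencies(["GCGCGCGCGCGC", "CGCGCGCGCGCG"])
background_freq = kmer_frequencies(["ATATATATATAT", "TATATATATATA"])

print(content_score("GCGCGCGC", promoter_freq, background_freq) > 0)  # True
print(content_score("ATATATAT", promoter_freq, background_freq) < 0)  # True
```

A content score of this kind is typically combined with signal scores, for example from a TATA box matrix, because neither source of evidence is reliable on its own.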