Predicting promoter regions is important for studying gene function and regulation. The widely used position weight matrix method for this purpose depends on predefined motifs, which hinders its application across different species.
Promoters are essential elements in the initiation and regulation of transcription in both prokaryotes and eukaryotes. In bacteria, RNA polymerase (RNAP) and its associated sigma factors must recognize and bind specific regions of the promoter in order to initiate transcription. The binding of RNAP to promoters is tightly regulated in bacteria and is the main mechanism modulating gene expression. Therefore, accurate annotation of promoter regions in the genome is essential for studying the regulation and expression of bacterial genes. Moreover, in bacteria, functionally related genes are usually clustered into a single transcriptional unit known as an operon. Recognizing promoters can therefore also facilitate the identification of operons, which is useful for discovering the functions of unknown genes.
The position weight matrix (PWM) method is the best-known approach for identifying consensus elements in a promoter sequence. Such elements include the −10 and −35 hexamers as well as binding sites for transcriptional regulators surrounding the core promoter. In addition to the sequence of these elements, the distance between them is another important indicator for identifying promoters. Because the PWM method is limited by high false-positive rates, other computational methods have been used for motif recognition, including hidden Markov models (HMMs). As more transcription factor-binding sites are discovered, the precision of these motif-based promoter prediction methods has greatly improved.
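The PWM idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the aligned sites in `toy_sites` are made-up −10-like hexamers, and the pseudocount of 1 and the uniform background frequency of 0.25 are assumptions chosen for simplicity.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from aligned sites of equal length."""
    pwm = []
    for pos in range(len(sites[0])):
        # Start every base at the pseudocount so unseen bases avoid log(0).
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(column[base] for column, base in zip(pwm, window))

def scan(pwm, sequence):
    """Score every window of the PWM's width along the sequence."""
    width = len(pwm)
    return [(i, score(pwm, sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]

# Hypothetical aligned -10-like hexamers (illustrative data, not real sites).
toy_sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT", "GATAAT"]
pwm = build_pwm(toy_sites)

sequence = "GGCCTATAATGCGC"
best_pos, best_score = max(scan(pwm, sequence), key=lambda hit: hit[1])
print(best_pos, round(best_score, 2))
```

In practice a score threshold decides which windows are reported as candidate sites. Setting it low enough to catch weak sites is one source of the false positives mentioned above.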
Machine-learning methods can extract information from experimentally characterized transcription start sites (TSSs), such as those derived from RNA-Seq data, and then apply it to uncharacterized sequences.
Deep-learning methods can also be used for promoter prediction, resulting in a dramatic improvement in recognition accuracy. Machine-learning methods are powerful for classifying different types of data. However, rather than sequences of letters, these models accept numerical vectors as input, so nucleotide sequences must first be encoded numerically.
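One common way to turn a nucleotide sequence into numerical vectors is one-hot encoding, sketched below. The A, C, G, T column order and the choice to map any other symbol (such as N) to an all-zero vector are assumptions made for this example; the text does not specify which encoding a given tool uses.

```python
# Each base maps to a 4-element indicator vector in A, C, G, T order.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

def one_hot_encode(sequence):
    """Return one 4-element list per base; unknown symbols become all zeros."""
    return [list(ONE_HOT.get(base, (0, 0, 0, 0))) for base in sequence.upper()]

encoded = one_hot_encode("acgtn")
for base, vector in zip("ACGTN", encoded):
    print(base, vector)
```

The result is a matrix of shape (sequence length, 4), which is a form that convolutional networks can take as input.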
Current algorithms for predicting promoters and regulatory elements fall into three categories: ab initio methods, which make de novo predictions by scanning individual sequences; similarity-based methods, which make predictions from alignments of homologous sequences; and expression profile-based methods, which use profiles constructed from a number of coexpressed gene sequences from the same organism. The similarity-based type of prediction is also called phylogenetic footprinting. The conventional approach to detecting a promoter or regulatory site is to match a consensus sequence pattern represented by a regular expression, or to match a position-specific scoring matrix constructed from well-characterized binding sites. In either case, the consensus sequences or matrices are relatively short, covering 6 to 10 bases.
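Consensus matching with a regular expression can be illustrated as follows. This is a simplified sketch under stated assumptions: it uses the E. coli sigma-70 consensus hexamers TTGACA (−35) and TATAAT (−10), requires exact matches to both, and allows a spacer of 15 to 19 bases. The spacer range and the function name `find_candidate_promoters` are choices made for this example, not taken from any tool listed here.

```python
import re

MINUS_35 = "TTGACA"   # assumed -35 consensus hexamer
MINUS_10 = "TATAAT"   # assumed -10 consensus hexamer

# The lookahead lets overlapping candidates be reported;
# the spacer between the two hexamers may be 15 to 19 bases long.
PATTERN = re.compile(rf"(?=({MINUS_35}[ACGT]{{15,19}}{MINUS_10}))")

def find_candidate_promoters(sequence):
    """Return (start, matched_text) for each exact consensus match."""
    return [(m.start(), m.group(1)) for m in PATTERN.finditer(sequence.upper())]

# Synthetic example: both hexamers separated by a 17-base spacer.
example = "CC" + MINUS_35 + "G" * 17 + MINUS_10 + "AA"
for start, match in find_candidate_promoters(example):
    print(start, match)
```

Real promoters rarely match both hexamers exactly, so an exact pattern like this one misses many true sites. Loosening it with degenerate positions finds more of them at the cost of more spurious hits, which is part of the motivation for the matrix-based scoring mentioned above.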
Promoter prediction tools for Bacteria
- Promoter Prediction by Neural Networks
- Deep Learning Recognition using Convolutional Neural Networks
- Virtual Footprint
Promoter prediction tools for Eukaryotes
- Neural Network Promoter Prediction
- Promoter 2.0 Prediction Server
- Promoter and gene expression regulatory motifs search