GeneMark is a generic name for a family of ab initio gene prediction programs developed at the Georgia Institute of Technology in Atlanta. The original GeneMark, developed in 1993, was used in 1995 as the primary gene prediction tool for annotation of the first completely sequenced bacterial genome, that of Haemophilus influenzae, and in 1996 for the first archaeal genome, that of Methanococcus jannaschii. The algorithm introduced inhomogeneous three-periodic Markov chain models of protein-coding DNA sequence, which became standard in gene prediction, as well as a Bayesian approach to gene prediction in two DNA strands simultaneously. GeneMark can be used for whole-genome analysis as well as for local analysis of a particular gene and its surrounding regions.
The task of gene identification, which frequently challenges researchers working with both novel and well-studied genomes, can be solved easily and reliably with the help of the GeneMark web software. The website provides interfaces to the GeneMark family of programs, which are designed and tuned for gene prediction in prokaryotic, eukaryotic and viral genomic sequences.
The programs of the GeneMark family are ab initio gene finders. Such programs are the only means to identify genes with no homologues in current databases. The GeneMark web software includes two major programs, called GeneMark and GeneMark.hmm. Both programs apply inhomogeneous (three-periodic) Markov chain models describing protein-coding DNA and homogeneous Markov chain models describing non-coding DNA. GeneMark uses a Bayesian formula to calculate the posterior probability of the presence of the genetic code (in at least one of six possible frames) in a short DNA sequence fragment, which makes it a local approach.
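The Bayesian step described above can be illustrated with a minimal sketch. It is not GeneMark's implementation: it assumes a zero-order three-periodic model (one nucleotide distribution per codon position) in place of the higher-order Markov chains, and the frequency tables, the function names and the prior `prior_noncoding` are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical toy parameters (NOT trained GeneMark models):
# one nucleotide distribution per codon position for coding DNA.
CODING = [
    {"A": 0.30, "C": 0.20, "G": 0.35, "T": 0.15},  # codon position 1
    {"A": 0.35, "C": 0.25, "G": 0.15, "T": 0.25},  # codon position 2
    {"A": 0.15, "C": 0.40, "G": 0.30, "T": 0.15},  # codon position 3
]
# Homogeneous (position-independent) model for non-coding DNA.
NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def log_lik_coding(seq, frame):
    """Log-likelihood of seq if its first base sits at codon position frame+1."""
    return sum(math.log(CODING[(i + frame) % 3][nt]) for i, nt in enumerate(seq))


def log_lik_noncoding(seq):
    """Log-likelihood of seq under the homogeneous non-coding model."""
    return sum(math.log(NONCODING[nt]) for nt in seq)


def frame_posteriors(seq, prior_noncoding=0.5):
    """Posterior probability of each of seven hypotheses for one fragment:
    coding in one of six frames (three per strand) or non-coding."""
    rc = reverse_complement(seq)
    log_lik = {"noncoding": log_lik_noncoding(seq)}
    for f in range(3):
        log_lik[f"+{f + 1}"] = log_lik_coding(seq, f)
        log_lik[f"-{f + 1}"] = log_lik_coding(rc, f)
    # Assumed prior: the coding mass is split evenly over the six frames.
    prior_frame = (1.0 - prior_noncoding) / 6.0
    weighted = {
        h: ll + math.log(prior_noncoding if h == "noncoding" else prior_frame)
        for h, ll in log_lik.items()
    }
    # Normalise in log space to avoid underflow on long fragments.
    top = max(weighted.values())
    total = sum(math.exp(v - top) for v in weighted.values())
    return {h: math.exp(v - top) / total for h, v in weighted.items()}


post = frame_posteriors("GAC" * 32)  # a 96 nt fragment
best = max(post, key=post.get)
```

Summing the six frame posteriors gives the probability that the fragment is coding in at least one frame; because both strands are scored in the same normalisation, the two strands are evaluated simultaneously.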
For the GeneMark program, there are several specific options. The window size and step size parameters (96 nt and 12 nt, respectively, by default) define the size of the sliding window and how far this window is moved along the sequence in one step. The threshold parameter determines the minimal average coding potential for an open reading frame (ORF) to be predicted as a gene. There are several options which allow fine-tuning of the GeneMark graphical output. In addition, there are options supporting the analysis of eukaryotic DNA sequences by GeneMark including the ability to provide lists of putative splice sites and protein translations of predicted exons.
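How the window size, step size and threshold interact can be sketched as follows. Only the defaults of 96 nt and 12 nt come from the text; the function names, the rule of averaging only windows that lie entirely inside the ORF, and the threshold value used in the example are assumptions made for illustration and may differ from what GeneMark actually does.

```python
def window_starts(seq_len, window=96, step=12):
    """0-based start offsets of every full window along a sequence."""
    if seq_len < window:
        return []
    return list(range(0, seq_len - window + 1, step))


def average_potential(starts, scores, orf_start, orf_end, window=96):
    """Mean coding potential over windows lying entirely inside
    the half-open interval [orf_start, orf_end).

    Assumption: only fully contained windows count; returns 0.0 if none do.
    """
    inside = [
        score
        for start, score in zip(starts, scores)
        if start >= orf_start and start + window <= orf_end
    ]
    return sum(inside) / len(inside) if inside else 0.0


def is_predicted_gene(starts, scores, orf_start, orf_end, threshold, window=96):
    """True if the ORF's average coding potential reaches the threshold."""
    return average_potential(starts, scores, orf_start, orf_end, window) >= threshold


# A 120 nt sequence yields three windows with the default settings.
starts = window_starts(120)
scores = [0.2, 0.8, 0.6]  # hypothetical per-window coding potentials
called = is_predicted_gene(starts, scores, 12, 120, threshold=0.5)
```

A smaller step gives a finer-grained coding potential profile at the cost of more windows to score, while a larger window smooths the profile but blurs gene boundaries.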
The output of the GeneMark program consists of a list of ORFs predicted as genes, that is, those with average coding potential above the selected threshold. Because each predicted gene can have more than one potential start, additional data is provided to help the researcher annotate one of the alternatives as the ‘true’ one. The start probability (abbreviated ‘Start Prob’) is derived from the sequences in the windows immediately upstream and downstream of each potential start. In addition to the list of predicted genes, GeneMark provides a list of ‘regions of interest’: spans of significant length between in-frame stop codons in which spikes of coding potential are wide enough to warrant further analysis, even if no genes are predicted there by automatic comparison with the threshold.
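The notions of a span between in-frame stop codons and of alternative starts within it can be made concrete with a short sketch. It is an illustration only: it scans the forward strand alone, assumes the standard stop codons TAA, TAG and TGA and the common prokaryotic start codons ATG, GTG and TTG, and uses hypothetical function names and a hypothetical `min_len` cut-off. It does not compute GeneMark's coding potential or ‘Start Prob’ values.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed prokaryotic start codons


def stop_free_spans(seq, min_len=90):
    """Return (frame, start, end) for forward-strand spans between in-frame
    stop codons that are at least min_len nt long; end is exclusive."""
    spans = []
    for frame in range(3):
        span_start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if i - span_start >= min_len:
                    spans.append((frame, span_start, i))
                span_start = i + 3  # next span begins after the stop
        # Trailing span that runs to the last complete codon of this frame.
        end = frame + ((len(seq) - frame) // 3) * 3
        if end - span_start >= min_len:
            spans.append((frame, span_start, end))
    return spans


def candidate_starts(seq, start, end):
    """Offsets of in-frame start codons inside the span [start, end)."""
    return [i for i in range(start, end, 3) if seq[i:i + 3] in STARTS]


# Toy sequence: stop, ATG, three codons, GTG, three codons, stop.
seq = "TAA" + "ATG" + "GCA" * 3 + "GTG" + "GCA" * 3 + "TAG"
frame0 = [s for s in stop_free_spans(seq, min_len=9) if s[0] == 0]
starts = candidate_starts(seq, *frame0[0][1:])
```

In the toy sequence the single frame-0 span contains two candidate starts, which is the situation in which per-start evidence helps the annotator choose between alternatives.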
Future directions for GeneMark web software development include detection of several genomic elements that are currently not predicted by either GeneMark or GeneMark.hmm, such as rRNA and tRNA genes.