Program |
Operating Principle |
Technical Data and URL |
Reference |
AlignACE |
Gibbs sampling algorithm that returns a series of motifs as weight matrices that are over-represented in the input set. |
Judges alignments sampled during the course of the algorithm using a maximum a priori likelihood score, which gauges the degree of over-representation. Provides an adjunct measure (group specificity score) that takes into account the sequence of the entire genome and highlights those motifs found preferentially in association with the genes under consideration. |
1 |
ANN-Spec |
Models the DNA binding specificity of a transcription factor using a weight matrix. |
Objective function based on log likelihood that a transcription factor binds at least once in each sequence of the positive training data compared with the number of times it is estimated to bind in the background training data. Parameter fitting is accomplished with a gradient descent method, which includes Gibbs sampling of the positive training examples. |
2 |
Consensus |
Models motifs using weight matrices, searching for the matrix with maximum information content. |
Uses a greedy method, first finding the pair of sequences that share the motif with greatest information content, then finding the third sequence that can be added to the motif, resulting in greatest information content. |
3 |
GLAM |
Gibbs sampling-based algorithm that automatically optimizes the alignment width and evaluates the statistical significance of its output. |
Since the basic algorithm cannot find multiple motif instances per sequence, long sequences are fragmented into shorter ones, and the alignment is transformed into a weight matrix and used to scan the sequences to obtain the final site predictions. |
4 |
Improbizer |
Uses expectation maximization to determine weight matrices of DNA motifs that occur improbably often in the input sequences. |
As a background (null) model it uses up to a second-order Markov model of background sequence. Optionally, Improbizer constructs a Gaussian model of motif placement so that motifs that occur in similar positions in the input sequences are more likely to be found. |
5 |
MEME |
Optimizes the E-value of a statistic related to the information content of the motif. |
Rather than sum of information content of each motif column, the statistic used is the product of the p-values of column information contents. The motif search consists of performing expectation maximization from starting points derived from each subsequence occurring in the input sequences. MEME differs from MEME3 mainly in using a correction factor to improve the accuracy of the objective function. |
6 |
MITRA |
Uses an efficient data structure to traverse the space of IUPAC patterns. |
For each pattern, MITRA computes the hypergeometric score of the occurrences in the target sequences relative to the background sequence and reports the highest scoring patterns. |
7 |
MotifSampler |
Matrix-based motif finding algorithm that extends Gibbs sampling by modeling the background with a higher order Markov model. |
The probabilistic framework is further exploited to estimate the expected number of motif instances in the sequence. |
8 |
Oligo/dyad analysis |
Detects over-represented oligonucleotides with oligo analysis and spaced motifs with dyad analysis. |
These algorithms detect statistically significant motifs by counting the number of occurrences of each word or dyad and comparing these with expectation. The most crucial parameter is the choice of an appropriate probabilistic model for the estimation of occurrence significance. |
9, 10 |
QuickScore |
Based on an exhaustive searching algorithm that estimates probabilities of rare or frequent words in genomic texts. |
Incorporates an extended consensus method allowing well-defined mismatches and uses mathematical expressions for efficiently computing z-scores and p-values depending on the statistical models used in their range of applicability. Special attention is paid to the drawbacks of numerical instability. The background model is Markovian, with order up to 3. |
11 |
SeSiMCMC |
Modification of the Gibbs sampler algorithm that models the motif as a weight matrix, optionally with the symmetry of a palindrome or of a direct repeat and, optionally, with spacers. |
Includes two alternating stages. The first one optimizes the weight matrix for a given motif and spacer length. The algorithm changes the positions of the motif occurrences in the sequences and infers the motif model from the current occurrences. These changes are used to optimize the likelihood of sequences as being segmented into the (Bernoulli) background and the motif occurrences. The optimization is organized via a Gibbs-like Markov chain that samples positions in sequences one-by-one until the Markov chain converges. The second stage looks for the best motif and spacer lengths for the obtained motif positions. It optimizes the common information content of the motif and of the distributions of motif occurrence positions. |
12 |
Weeder |
Consensus-based method that exhaustively enumerates all oligos up to a maximum length and collects their occurrences (with substitutions) from the input sequences. |
Each motif is evaluated according to the number of sequences in which it appears and how well conserved it is in each sequence, with respect to expected values derived from the oligo frequency analysis of all the available upstream sequences of the same organism. Different combinations of canonical motif parameters derived from the analysis of known instances of yeast transcription factor binding sites (length ranging from 6 to 12, number of substitutions from 1 to 4) are automatically tried by the algorithm in different runs. It also analyzes and compares the top-scoring motifs of each run with a simple clustering method to detect which ones could be more likely to correspond to transcription factor binding sites. Best instances of each motif are selected from sequences using a weight matrix built with sites found by consensus-based algorithms. |
13 |
YMF |
Uses an exhaustive search algorithm to find motifs with the greatest z-scores. |
A p-value for the z-score is used to assess the significance of the motif. Motifs themselves are short sequences over the IUPAC alphabet with spacers ("N"s) constrained to occur in the middle of the sequence. |
14 |
Composite Module Analyst (CMA) |
Uses a multi-component fitness function for selection of the promoter model which fits best to the observed gene expression profile. |
Defines a promoter model based on composition of transcription factor binding sites and their pairs. Adjusts the results of the fitness function using a genetic algorithm for the analysis of functionally related or coexpressed genes. |
15 |
REDUCE |
Motif-based regression method for microarray analysis. |
The only required inputs are (i) a single genome-wide set of absolute or relative mRNA abundances and (ii) the DNA sequence of the regulatory region associated with each gene that is probed. REDUCE uses unbiased statistics to identify oligonucleotide motifs whose occurrence in the regulatory region of a gene correlates with the level of mRNA expression. Regression analysis is used to infer the activity of the transcriptional module associated with each motif. |
16 |
MotifRegressor |
Combines the advantages of matrix-based motif finding and oligomer motif-expression regression analysis. |
MotifRegressor first constructs candidate motifs and then applies regression analysis to select motifs that are strongly correlated with changes in gene expression. It is particularly effective in discovering expression-mediating motifs of medium-to-long width with multiple degenerate positions. MotifRegressor relies in part on MDScan, a software package developed by the Brutlag Lab at Stanford University. |
17 |
CisModule |
Employs a hierarchical mixture approach to model the cis-regulatory module structure. |
It is based on the hierarchical mixture model, followed by a de novo motif-module discovery algorithm using the Bayesian inference of module locations and within-module motif sites. Dynamic programming-like recursions are developed to reduce the computational complexity from exponential to linear in sequence length. |
18 |
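Many of the programs in the table represent a motif as a weight matrix, and Consensus searches for the matrix with maximum information content. The following is a minimal Python sketch of that shared idea, not the implementation of any program listed above. The function names `weight_matrix`, `information_content`, and `scan`, the pseudocount, and the uniform background are assumptions made for illustration only.

```python
import math
from collections import Counter

BASES = "ACGT"


def weight_matrix(sites, pseudocount=1.0):
    """Build per-column base frequencies from equal-length aligned sites.

    The pseudocount is an illustrative assumption; it keeps every
    frequency non-zero so that log-odds scores stay finite.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        matrix.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return matrix


def information_content(matrix, background=None):
    """Sum over columns of the relative entropy (in bits) vs. the background."""
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumed uniform background
    return sum(
        p * math.log2(p / background[b])
        for column in matrix
        for b, p in column.items()
        if p > 0  # a zero frequency contributes nothing to the sum
    )


def scan(sequence, matrix, background=None):
    """Return (offset, log-odds score) of the best-scoring window.

    Assumes the sequence uses only A, C, G, T and that the matrix was
    built with a positive pseudocount. Returns None if the sequence is
    shorter than the motif.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    width = len(matrix)
    best = None
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = sum(
            math.log2(matrix[i][base] / background[base])
            for i, base in enumerate(window)
        )
        if best is None or score > best[1]:
            best = (offset, score)
    return best
```

With four identical sites `ACGT` and no pseudocount, each column is fully conserved and contributes 2 bits against a uniform background, so `information_content` returns 8 bits. Scanning with the matrix, as GLAM does to obtain its final site predictions, amounts to calling `scan` on each input sequence.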
References