Promoter Analysis Tools

Program

Operating Principle

Technical Data and URL

Reference

AlignACE

Gibbs sampling algorithm that returns a series of motifs as weight matrices that are over-represented in the input set.

Judges alignments sampled during the course of the algorithm using a maximum a priori likelihood score, which gauges the degree of over-representation. Provides an adjunct measure (group specificity score) that takes into account the sequence of the entire genome and highlights those motifs found preferentially in association with the genes under consideration.

http://atlas.med.harvard.edu

1

ANN-Spec

Models the DNA binding specificity of a transcription factor using a weight matrix.

Objective function based on log likelihood that a transcription factor binds at least once in each sequence of the positive training data compared with the number of times it is estimated to bind in the background training data. Parameter fitting is accomplished with a gradient descent method, which includes Gibbs sampling of the positive training examples.

http://www.cbs.dtu.dk/services/DNAarray/ann-spec.php

2

Consensus

Models motifs using weight matrices, searching for the matrix with maximum information content.

Uses a greedy method, first finding the pair of sequences that share the motif with the greatest information content, then finding the third sequence that can be added to the motif, resulting in the greatest information content.

http://bifrost.wustl.edu/consensus/

3
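The information-content criterion and the first greedy step described for Consensus can be sketched as follows. This is a minimal illustration, not the Consensus program itself: the function names `information_content` and `best_starting_pair`, the uniform background, and the restriction to ungapped windows of a fixed width are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations

ALPHABET = "ACGT"

def information_content(sites, background=None):
    """Sum over columns of sum_b f_b * log2(f_b / p_b) for aligned sites.

    `sites` are equal-length strings over ACGT; `background` defaults to a
    uniform distribution (an assumption of this sketch).
    """
    if background is None:
        background = {b: 0.25 for b in ALPHABET}
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    total = 0.0
    for col in range(width):
        counts = Counter(s[col] for s in sites)
        for b in ALPHABET:
            f = counts[b] / len(sites)
            if f > 0:
                total += f * math.log2(f / background[b])
    return total

def best_starting_pair(sequences, width):
    """First greedy step: the pair of windows, taken from two different
    sequences, whose two-site alignment has the greatest information content.

    Returns (score, (seq_index, offset), (seq_index, offset)).
    """
    best = None
    for (i, s), (j, t) in combinations(enumerate(sequences), 2):
        for a in range(len(s) - width + 1):
            for b in range(len(t) - width + 1):
                score = information_content([s[a:a + width], t[b:b + width]])
                if best is None or score > best[0]:
                    best = (score, (i, a), (j, b))
    return best
```

Under a uniform background, a column in which both sites agree contributes 2 bits and a column in which they differ contributes 1 bit, so the starting pair is the pair of windows with the most matching positions. Later greedy steps would add one window at a time from the remaining sequences and rescore in the same way.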

GLAM

Gibbs sampling-based algorithm that automatically optimizes the alignment width and evaluates the statistical significance of its output.

Since the basic algorithm cannot find multiple motif instances per sequence, long sequences are fragmented into shorter ones, and the alignment is transformed into a weight matrix and used to scan the sequences to obtain the final site predictions.

http://zlab.bu.edu/glam/

4

Improbizer

Uses expectation maximization to determine weight matrices of DNA motifs that occur improbably often in the input sequences.

As its background (null) model, it uses a Markov model of the background sequence of up to second order. Optionally, Improbizer constructs a Gaussian model of motif placement so that motifs that occur in similar positions in the input sequences are more likely to be found.

http://www.soe.ucsc.edu/~kent/improbizer

5

MEME

Optimizes the E-value of a statistic related to the information content of the motif.

Rather than the sum of the information content of each motif column, the statistic used is the product of the p-values of column information contents. The motif search consists of performing expectation maximization from starting points derived from each subsequence occurring in the input sequences. MEME differs from MEME3 mainly in using a correction factor to improve the accuracy of the objective function.

http://meme.sdsc.edu

6

MITRA

Uses an efficient data structure to traverse the space of IUPAC patterns.

For each pattern, MITRA computes the hypergeometric score of the occurrences in the target sequences relative to the background sequence and reports the highest scoring patterns.

http://www.ccls.columbia.edu/compbio/mitra/

7
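The hypergeometric scoring of an IUPAC pattern described for MITRA can be illustrated with a small sketch. It is not MITRA's implementation: the function names are hypothetical, each sequence is counted at most once (presence or absence of the pattern), and the pattern is matched with a regular expression instead of MITRA's tree-based traversal.

```python
import math
import re

# Standard IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_regex(pattern):
    """Compile an IUPAC pattern into a regular expression."""
    return re.compile("".join(IUPAC[c] for c in pattern))

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n items are drawn without replacement from N items,
    K of which are successes."""
    upper = min(n, K)
    numerator = sum(
        math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return numerator / math.comb(N, n)

def pattern_pvalue(pattern, target, background):
    """Hypergeometric p-value for the enrichment of `pattern` in the target
    sequences relative to the background sequences (smaller is more enriched)."""
    rx = iupac_regex(pattern)
    k = sum(1 for s in target if rx.search(s))      # target sequences with a hit
    b = sum(1 for s in background if rx.search(s))  # background sequences with a hit
    return hypergeom_tail(k, len(target), k + b, len(target) + len(background))
```

Ranking candidate patterns by this p-value and reporting the smallest values corresponds to reporting the highest-scoring patterns; how the candidate patterns are enumerated is left out of the sketch.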

MotifSampler

Matrix-based motif finding algorithm that extends Gibbs sampling by modeling the background with a higher order Markov model.

The probabilistic framework is further exploited to estimate the expected number of motif instances in the sequence.

http://www.esat.kuleuven.ac.be/~dna/Biol/Software.html

8
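A higher-order Markov background model of the kind MotifSampler relies on can be sketched as below. The function names `train_markov` and `log_prob`, the add-one pseudocount, and the toy training data are assumptions for illustration; the sketch covers only the background model, not the Gibbs sampler built on top of it.

```python
import math
from collections import defaultdict

def train_markov(sequences, order, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from background sequences.

    Returns a function prob(context, base). Sequences are assumed to be
    uppercase strings over ACGT.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + 4 * pseudocount
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob

def log_prob(seq, prob, order):
    """Log-probability of `seq` under the background model, conditioning on
    its first `order` bases (which are not scored themselves)."""
    return sum(
        math.log(prob(seq[i - order:i], seq[i])) for i in range(order, len(seq))
    )
```

A candidate site whose log-probability under the motif model exceeds its log-probability under this background is more likely to be a motif instance; raising `order` lets the background capture short repetitive patterns that a single-nucleotide model would mistake for motifs.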

Oligo/dyad analysis

Detects over-represented oligonucleotides with oligo analysis and spaced motifs with dyad analysis.

These algorithms detect statistically significant motifs by counting the number of occurrences of each word or dyad and comparing these with expectation. The most crucial parameter is the choice of an appropriate probabilistic model for the estimation of occurrence significance.

http://rsat.ulb.ac.be/rsat/

9, 10
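The idea of counting word occurrences and comparing them with expectation can be sketched as follows. This uses the simplest possible background, an independent (Bernoulli) model with fixed base frequencies, and a normal-approximation z-score; both are assumptions of the example, and corrections for self-overlapping words are ignored. The function name `kmer_zscores` is hypothetical and is not part of the oligo-analysis tool.

```python
import math
from collections import Counter

def kmer_zscores(sequences, k, base_freqs):
    """Z-score of each observed k-mer count against a Bernoulli background.

    `base_freqs` maps each of A, C, G, T to its background frequency.
    Sequences are assumed to be uppercase strings over ACGT.
    """
    counts = Counter()
    positions = 0  # number of k-mer start positions across all sequences
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
            positions += 1

    scores = {}
    for word, observed in counts.items():
        p = math.prod(base_freqs[b] for b in word)  # probability of the word at one position
        expected = positions * p
        sd = math.sqrt(positions * p * (1 - p))
        scores[word] = (observed - expected) / sd
    return scores
```

Swapping the Bernoulli model for frequencies estimated from all upstream regions of the organism, or for a Markov model, changes only how `p` is computed, which is where the choice of probabilistic model enters.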

QuickScore

Based on an exhaustive searching algorithm that estimates probabilities of rare or frequent words in genomic texts.

Incorporates an extended consensus method allowing well-defined mismatches and uses mathematical expressions for efficiently computing z-scores and p-values depending on the statistical models used in their range of applicability. Special attention is paid to the drawbacks of numerical instability. The background model is Markovian, with order up to 3.

http://algo.inria.fr/dolley/QuickScore/

11

SeSiMCMC

Modification of the Gibbs sampler algorithm that models the motif as a weight matrix, optionally with the symmetry of a palindrome or of a direct repeat and, optionally, with spacers.

Includes two alternating stages. The first one optimizes the weight matrix for a given motif and spacer length. The algorithm changes the positions of the motif occurrence in the sequences and infers the motif model from the current occurrences. These changes are used to optimize the likelihood of sequences as being segmented into the (Bernoulli) background and the motif occurrences. The optimization is organized via a Gibbs-like Markov chain that samples positions in sequences one-by-one until the Markov chain converges. The second stage looks for the best motif and spacer lengths for the obtained motif positions. It optimizes the common information content of the motif and of the distributions of motif occurrence positions.

http://favorov.imb.ac.ru/SeSiMCMC/

12

Weeder

Consensus-based method that exhaustively enumerates all the oligos up to a maximum length and collects their occurrences (with substitutions) from input sequences.

Each motif is evaluated according to the number of sequences in which it appears and how well conserved it is in each sequence, with respect to expected values derived from the oligo frequency analysis of all the available upstream sequences of the same organism. Different combinations of canonical motif parameters derived from the analysis of known instances of yeast transcription factor binding sites (length ranging from 6 to 12, number of substitutions from 1 to 4) are automatically tried by the algorithm in different runs. It also analyzes and compares the top-scoring motifs of each run with a simple clustering method to detect which ones could be more likely to correspond to transcription factor binding sites. Best instances of each motif are selected from sequences using a weight matrix built with sites found by consensus-based algorithms.

http://159.149.109.9/modtools/

13

YMF

Uses an exhaustive search algorithm to find motifs with the greatest z-scores.

A p-value for the z-score is used to assess the significance of the motif. Motifs themselves are short sequences over the IUPAC alphabet with spacers ("N"s) constrained to occur in the middle of the sequence.

http://bio.cs.washington.edu/software.html#yms

14

Composite Module Analyst (CMA)

Uses a multi-component fitness function to select the promoter model that best fits the observed gene expression profile.

Defines a promoter model based on composition of transcription factor binding sites and their pairs. Adjusts the results of the fitness function using a genetic algorithm for the analysis of functionally related or coexpressed genes.

http://www.gene-regulation.com/cgi-bin/CMA/cma.cgi

15

REDUCE

Motif-based regression method for microarray analysis.

The only required inputs are (i) a single genome-wide set of absolute or relative mRNA abundances and (ii) the DNA sequence of the regulatory region associated with each gene that is probed. REDUCE uses unbiased statistics to identify oligonucleotide motifs whose occurrence in the regulatory region of a gene correlates with the level of mRNA expression. Regression analysis is used to infer the activity of the transcriptional module associated with each motif.

http://bussemaker.bio.columbia.edu/reduce/

16
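The core idea behind REDUCE, correlating motif occurrence in each regulatory region with that gene's expression level, can be illustrated with a single-motif least-squares fit. This is a simplified sketch, not the REDUCE tool: the function names are hypothetical, only one motif is fitted, and the iterative multi-motif selection and significance testing are omitted.

```python
def count_occurrences(motif, seq):
    """Number of (possibly overlapping) exact occurrences of `motif` in `seq`."""
    m = len(motif)
    return sum(1 for i in range(len(seq) - m + 1) if seq[i:i + m] == motif)

def fit_motif(motif, promoters, expression):
    """Least-squares fit of expression = intercept + slope * motif_count.

    `promoters` and `expression` are parallel lists, one entry per gene;
    expression values are assumed to be log ratios or similar.
    Returns (slope, r_squared). The slope plays the role of the inferred
    activity of the motif, and r_squared is the fraction of variance explained.
    """
    x = [count_occurrences(motif, s) for s in promoters]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(expression) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in expression)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, expression))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # motif count or expression does not vary across genes
    return sxy / sxx, (sxy * sxy) / (sxx * syy)
```

Running `fit_motif` over all oligonucleotides of a given length and keeping those with the largest explained variance gives a rough analogue of the motif ranking; a positive slope suggests an activating element and a negative slope a repressing one under this simple model.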

MotifRegressor

Combines the advantages of matrix-based motif finding and oligomer motif-expression regression analysis.

MotifRegressor first constructs candidate motifs and then applies regression analysis to select motifs that are strongly correlated with changes in gene expression. It is particularly effective in discovering expression-mediating motifs of medium-to-long width with multiple degenerate positions. MotifRegressor relies in part on MDScan, a software package developed by the Brutlag Lab at Stanford University.

http://www.math.umass.edu/~conlon/mr.html

17

CisModule

Employs a hierarchical mixture approach to model the cis-regulatory module structure.

It is based on the hierarchical mixture model, followed by a de novo motif-module discovery algorithm using the Bayesian inference of module locations and within-module motif sites. Dynamic programming-like recursions are developed to reduce the computational complexity from exponential to linear in sequence length.

http://www.stat.ucla.edu/~zhou/CisModule/index.html

18

 

References

  1. Hughes JD, Estep PW, Tavazoie S, Church GM (2000). Computational identification of cis-regulatory elements associated with functionally coherent groups of genes in Saccharomyces cerevisiae. J Mol Biol 296:1205–1214.
  2. Workman CT and Stormo GD (2000). ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pacific Symposium on Biocomputing (ed. Altman R, Dunker AK, Hunter L, Klein TE). 467–478 (Stanford University, Stanford, CA).
  3. Hertz GZ and Stormo GD (1999). Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15:563–577.
  4. Frith MC, Hansen U, Spouge JL, Weng Z (2004). Finding functional sequence elements by multiple local alignment. Nucleic Acids Res 32:189–200.
  5. Ao W, Gaudet J, Kent WJ, Muttumu S, Mango SE (2004). Environmentally induced foregut remodeling by PHA-4/FoxA and DAF-12/NHR. Science 305:1743–1746.
  6. Bailey TL and Elkan C (1995). The value of prior knowledge in discovering motifs with MEME. Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology. 21–29 (AAAI Press, Menlo Park, CA).
  7. Eskin E and Pevzner P (2001). Finding composite regulatory patterns in DNA sequences. Bioinformatics (Supplement 1) 18:S354–S363.
  8. Thijs G, et al (2001). A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 17:1113–1122.
  9. van Helden J, Andre B, Collado-Vides J (1998). Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 281:827–842.
  10. van Helden J, Rios AF, Collado-Vides J (2000). Discovering regulatory elements in noncoding sequences by analysis of spaced dyads. Nucleic Acids Res 28:1808–1818.
  11. Régnier M and Denise A (2004). Rare events and conditional events on random strings. Discrete Math Theor Comput Sci 6:191–214.
  12. Favorov AV, Gelfand MS, Gerasimova AV, Mironov AA, Makeev VJ (2004). Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length and its validation on the ArcA binding sites. Proceedings of BGRS 2004 (BGRS, Novosibirsk).
  13. Pavesi G, Mereghetti P, Mauri G, Pesole G (2004). Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res 32:W199–W203.
  14. Sinha S and Tompa M (2003). YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res 31:3586–3588.
  15. Konovalova T, Valeev T, Cheremushkin E, Kel AE (2005). Composite Module Analyst: Tool for Prediction of DNA Transcription Regulation. Testing on Simulated Data. ICNC 2:1202–1205.
  16. Roven C and Bussemaker HJ (2003). REDUCE: an online tool for inferring cis-regulatory elements and transcriptional module activities from microarray data. Nucleic Acids Res 31:3487–3490.
  17. Conlon EM, Liu XS, Lieb JD, Liu JS (2003). Proc Natl Acad Sci USA 100:3339.
  18. Zhou Q and Wong WH (2004). CisModule: De novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc Natl Acad Sci USA 101:12114–12119.