Promoter Analysis and Extraction

The PhenoGen website allows you to perform oPOSSUM and MEME promoter analysis, and upstream sequence extraction.

Choose Genes List Analysis Tools in the main menu.
Click Analyze a gene list. A page displays the gene lists to which you have access.
Click the gene list for which you want to perform a promoter analysis.
Click the Promoter tab. The Promoter page displays.

oPOSSUM Overview

oPOSSUM is a tool for determining the over-representation of transcription factor binding sites (TFBSs) within a set of (co-expressed) genes as compared with a pre-compiled background set (Ho Sui et al., 2005, Nucleic Acids Res 33(10):3154-64). The input is a set of gene identifiers and analysis parameters. The system compares the number of hits for each selected TFBS in the target gene set against the number in the background set. Two different measures of statistical significance are applied to determine which TFBSs are over-represented in the target set. The results of the analysis are displayed in tabular form.

Notes:

The PhenoGen website uses a customized version of oPOSSUM featuring a subset of input parameters.
All matrices in the oPOSSUM database with a given minimum specificity are selected. These matrices are obtained from the JASPAR database.

Selection Criteria

Search Regional Level

This refers to the size of the region around the transcription start site (TSS) that was analyzed for TFBSs. The background set was computed using a region extending a maximum of 5000 bp upstream and 5000 bp downstream of the TSS. During the background computation, the upstream region was truncated to less than 5000 bp if it overlapped an upstream exon from another gene.

Conservation Level

To limit spurious TFBS predictions, conservation with the aligned orthologous mouse sequence was used as a filter, and only sites that fell within non-coding conserved regions were kept. A conserved region was defined as a span of some minimum length L within the human sequence that had a percent identity with the aligned mouse sequence of at least some minimum value X. The background set was pre-computed with three levels of conservation filter. Level 1 corresponds to the top 10 percent of non-coding conserved regions with an absolute minimum percent identity of 70%. Level 2 corresponds to the top 20 percent with a minimum percent identity of 65%, and level 3 corresponds to the top 30 percent with a minimum percent identity of 60%.
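
The idea of a conserved region, a span of minimum length L with percent identity of at least X, can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the function names, the fixed-width sliding window, and the choice to count alignment gaps as mismatches are hypothetical and are not taken from the oPOSSUM pipeline, whose exact procedure is not described here.

```python
from typing import List, Tuple


def percent_identity(human: str, mouse: str) -> float:
    """Percent of aligned columns with identical bases.

    Assumption: both strings come from the same pairwise alignment, so they
    have equal length, and a gap character '-' never counts as a match.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    if not human:
        return 0.0
    matches = sum(1 for h, m in zip(human, mouse) if h == m and h != "-")
    return 100.0 * matches / len(human)


def conserved_windows(
    human: str, mouse: str, min_length: int, min_identity: float
) -> List[Tuple[int, float]]:
    """Return (start, identity) for each window of min_length columns whose
    percent identity is at least min_identity."""
    windows = []
    for start in range(len(human) - min_length + 1):
        identity = percent_identity(
            human[start:start + min_length], mouse[start:start + min_length]
        )
        if identity >= min_identity:
            windows.append((start, identity))
    return windows


# Hypothetical aligned fragments that differ at a single column.
human_aln = "ACGTACGTAC"
mouse_aln = "ACGTTCGTAC"
strict = conserved_windows(human_aln, mouse_aln, min_length=5, min_identity=100.0)
relaxed = conserved_windows(human_aln, mouse_aln, min_length=5, min_identity=80.0)
```

Lowering the identity cutoff admits more windows, which mirrors how level 3 (60%) is more permissive than level 1 (70%).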

Matrix Match Threshold

Sequences are scanned for TFBSs by sliding the corresponding position weight matrix (PWM) along the sequence and scoring it at each position. The threshold is the minimum relative score required to report a position as a putative binding site. The background set was computed using a threshold of 70%.
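
The scan described above can be sketched in Python as follows. This is a minimal sketch, not the oPOSSUM implementation: the PWM values are made up, the function name is hypothetical, and the relative score is assumed to be the raw score rescaled between the lowest and highest scores the matrix can produce, which is one common definition.

```python
from typing import Dict, List, Tuple

# Hypothetical 4-position PWM; each entry maps a base to its score at that position.
PWM: List[Dict[str, float]] = [
    {"A": 1.2, "C": -0.8, "G": -0.5, "T": -1.0},
    {"A": -1.0, "C": 1.5, "G": -0.7, "T": -0.9},
    {"A": -0.6, "C": -0.9, "G": 1.4, "T": -1.1},
    {"A": -1.2, "C": -0.4, "G": -0.8, "T": 1.3},
]


def scan(
    sequence: str, pwm: List[Dict[str, float]], threshold: float = 0.70
) -> List[Tuple[int, str, float]]:
    """Slide the PWM along the sequence and report (start, site, relative score)
    for every window whose relative score reaches the threshold."""
    max_score = sum(max(column.values()) for column in pwm)
    min_score = sum(min(column.values()) for column in pwm)
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        raw = sum(column[base] for column, base in zip(pwm, window))
        # Assumed definition: 0.0 is the worst possible match, 1.0 the best.
        relative = (raw - min_score) / (max_score - min_score)
        if relative >= threshold:
            hits.append((start, window, relative))
    return hits


hits = scan("TTACGTAA", PWM, threshold=0.70)
```

Raising the threshold reports fewer, higher-confidence sites; lowering it reports more candidates at the cost of more spurious matches.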

Statistical measure for over-representation

Two measures of statistical over-representation are available: a one-tailed Fisher exact probability and a Z-score.

One-tailed Fisher Exact Probability

The one-tailed Fisher exact probability compares the proportion of co-expressed genes containing a particular TFBS to the proportion of the background set that contains the site to determine the probability of a non-random association between the co-expressed gene set and the TFBS of interest. It is calculated using the hypergeometric probability distribution that describes sampling without replacement from a finite population consisting of two types of elements. Therefore, the number of times a TFBS occurs in the promoter of an individual gene is disregarded, and instead, the TFBS is considered as either present or absent.
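
As an illustration, the one-tailed probability can be computed directly from the hypergeometric distribution. The sketch below assumes the target and background gene sets are treated as two disjoint groups in a 2x2 table (genes with and without the site); the function name and the gene counts in the usage example are hypothetical, and the actual oPOSSUM code may set up the table differently.

```python
from math import comb


def fisher_one_tailed(
    target_with: int, target_without: int,
    background_with: int, background_without: int,
) -> float:
    """Probability of seeing at least target_with genes containing the site in
    the target set, under sampling without replacement (hypergeometric)."""
    n_target = target_with + target_without
    n_with = target_with + background_with
    total = n_target + background_with + background_without
    upper = min(n_target, n_with)
    # Sum the hypergeometric probabilities for k = target_with .. upper.
    favorable = sum(
        comb(n_with, k) * comb(total - n_with, n_target - k)
        for k in range(target_with, upper + 1)
    )
    return favorable / comb(total, n_target)


# Hypothetical counts: 12 of 20 target genes and 300 of 2000 background genes
# contain at least one predicted site for the TFBS of interest.
p_value = fisher_one_tailed(12, 8, 300, 1700)
```

Each gene contributes only presence or absence of the site, so a gene with ten predicted sites counts the same as a gene with one.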

Z-score

The Z-score uses a simple binomial distribution model to compare the rate of occurrence of a TFBS in the target set of genes to the expected rate estimated from the pre-computed background set.

For a given TFBS, let the random variable x denote the number of predicted binding site nucleotides in the conserved non-coding regions of the target (co-expressed) gene set, and let B be the number of predicted binding site nucleotides in the conserved non-coding regions of the background gene set. Let n be the total number of nucleotides examined (i.e., the total number of nucleotides in the conserved non-coding regions) from the co-expressed genes, and let N be the total number of nucleotides examined from the background genes. Under a binomial model with n events, the expected value of x is u = B * C, where C = n / N (i.e., C is the ratio of sample sizes). Taking p = B / N as the probability of success, so that u = n * p, the standard deviation is s = sqrt(n * p * (1 - p)).

By applying the Central Limit Theorem and using the normal approximation to the binomial distribution with a continuity correction, the z-score for the observed value of x is calculated as z = (x - u - 0.5) / s. The probability of observing x or more binding site nucleotides in the conserved non-coding regions of the target genes, given that the TFBS is not truly over-represented in the target genes, is then the p-value Pr(Z >= z).
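
The calculation above translates into a few lines of Python. This is a sketch that follows the formulas in this section using the same symbols (x, n, B, N, u, s, z); the function name and the nucleotide counts in the usage example are hypothetical.

```python
from math import erfc, sqrt
from typing import Tuple


def z_score(x: float, n: int, B: int, N: int) -> Tuple[float, float]:
    """Return (z, p_value) for x observed binding site nucleotides.

    x: binding site nucleotides observed in the target set
    n: total nucleotides examined in the target set
    B: binding site nucleotides in the background set
    N: total nucleotides examined in the background set
    """
    p = B / N                       # probability of success
    u = B * (n / N)                 # expected value of x, equal to n * p
    s = sqrt(n * p * (1 - p))       # binomial standard deviation
    z = (x - u - 0.5) / s           # continuity-corrected z-score
    p_value = 0.5 * erfc(z / sqrt(2))  # upper tail of the standard normal, Pr(Z >= z)
    return z, p_value


# Hypothetical counts: 150 binding site nucleotides among 10,000 target
# nucleotides, against 1,000 among 100,000 background nucleotides.
z, p_value = z_score(x=150, n=10_000, B=1_000, N=100_000)
```

Unlike the Fisher measure, this score is sensitive to how many binding site nucleotides occur, so genes with several predicted sites contribute more than genes with one.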

MEME Overview

Unlike oPOSSUM, whose search is based on occurrences of known motifs (transcription factor binding sites), MEME (Multiple EM for Motif Elicitation) searches for previously uncharacterized motifs. Many other software options are available for this kind of motif discovery. Although most of these have not been directly incorporated within the PhenoGen website as oPOSSUM has, they can easily be applied using other publicly available web servers.

A comprehensive review of such programs (Tompa et al., 2005, Nature Biotechnology 23:137) found that MEME (Bailey and Elkan, 1995, Proc. Int Conf Intell Syst Mol Biol 3:21) was one of the best performing algorithms on mouse data. Methods like MEME work best on sequences shorter than 2 kb, and longer sequences are not recommended for such tools. Furthermore, many motif software web servers restrict the input data size. In addition to being available on the PhenoGen website, MEME can be accessed at http://meme.sdsc.edu/meme/meme.html.

Upstream Sequence Extraction Overview

An important step in understanding the mechanisms that regulate the expression of genes is the ability to identify regulatory elements, i.e., the binding sites in DNA for transcription factors. Transcription factors are DNA-binding proteins that modulate the expression of a gene by activating or repressing the transcription machinery. Their binding sites are typically upstream from, and close to, the transcription start site (TSS) of the gene.

Because there is limited information about the majority of transcription factors, and especially about their target binding sites (even in well-characterized organisms), you can focus on computational tools designed for the discovery of novel regulatory elements, where nothing is known a priori about the transcription factor or its preferred binding sites. If you provide a collection of sequences that correspond to the regulatory regions of genes that are believed to be co-regulated, the computational tool identifies short DNA sequence 'motifs' that are statistically over- or under-represented in these regulatory regions. Accurate identification of these motifs is very difficult because they are short signals (typically about 10 bp long) in the midst of a great amount of statistical noise (a typical input being one regulatory region of 1,000 bp upstream of each gene). Also, there is marked sequence variability among the binding sites of a given transcription factor, and the nature of the variability itself is not well understood.

There are numerous tools available for this task of motif prediction. They differ from each other mainly in their definition of what represents a motif and in what they accept as a model for statistical over-representation of a motif. A comprehensive list of tools that could be used (table adapted from Tompa et al., 2005, Nature Biotechnology 23(1):137-144) is presented in "Supplementary Information". The extracted upstream sequence information can be used to carry out TFBS analysis outside the PhenoGen website using any of these tools. See "Promoter Analysis Tools".