Home Genome Blast / Blat Batch Genes Batch Sequences Markers Genetic Maps Submit Searches Site Map
Find:
WormBase Banner
Introduction Methods Download Prediction



We implement a simple 5-step pipeline for the detection of transcription factor binding sites conserved between C. elegans and C. briggsae. Here are the details of the five steps outlined in the introduction:

  1. Build a Position Weight Matrix (PWM)    The user submits a binding site alignment file. This file is input to the hmmbuild command from the HMMER software suite, to build the position weight matrix. Only the emission parameters in the file are used in the search. These are the four numbers in lines starting with 1,2,3 etc. The sixth column in these lines is the position in the multiple alignment which these parameters represent. For example, in the following position weight matrix:
    PWM_column  A      C      G      T     alignment_column
    
         1   -231   -514   1472  -1392     1
         2    568    158    834  -4321     2
         3  -4321    273   2016  -1959     3
         4  -1106   2016   -304  -4321     4
         5   1357  -4323   -370  -4321     5
         6    -93    560   1305  -4321     6
         7  -4321   1915    690  -2407     7
         8   -300   1801  -1414  -2189     9
         9  -1229   2152  -1046  -4321    10
        10  -4321   2486  -4323  -4321    11
        11   1555  -4323  -4323  -4321    12
        12   -126  -4323   -551    802    13
        13  -1859  -4323   2181  -1399    14
        14  -4321   2486  -4323  -4321    15
        15   1555  -4323  -4323  -4321    16
        16  -1612   1168  -4323    602    18
        17  -2636   1020     83    348    19
    
    generated from the alignment
    GAGCAAGTCCCATGCAATT
    TAGCAGGTCCCATGCACTT
    AAGCAAGTCCCATGCAATG
    GGGCAAGTCCCATGCAATA
    AAGCAGTTCCCATGCAGAC
    AAGCAGTTCCCATGCAGAC
    AAGCACCTCCCATGCAGAC
    GAGAAGCGCCCAAGCATTG
    GCGCAACCACCAGTCATTT
    GCGCAACCACCAGTCATTT
    GGGCAGCGAACAAGCATCT
    CAGCAGGCGCCATGCAACC
    GCGCAACCACCAGTCAGCT
    GGTGAGCCCCCAAGCATCT
    ACGCAACACCCATACACTC
    AGCAGCCGCCCATGCATTG
    CAGCAGGATACATGCAACC
    TGCGGCCGCCCATGCACTC
    GACCACCAAGCAAGCACCT
    
    we can see that there are 19 columns in the alignment but only 17 columns (displayed inverted as rows) in the position weight matrix. Columns 8 and 17 HMMER judged too close to backgound frequency to extract a log-odds scoring column. If this position weight matrix were used in an actual search, the nucleotides in positions 8 and 17 would be ignored and merely used as placeholders.

    There can be poorly conserved columns anywhere in the alignment, but for a successful search, the alignment must contain at least 6 well conserved columns. hmmbuild then uses a tree-based sequence weighting scheme to calculate the weighted counts of nucleotides at each position in the multiple alignment. Background frequency priors are added to each count at each position. These are the frequencies of nucleotides in the non-exonic regions of C. elegans and C. briggsae genomes to be searched. (Using current GFF files to determine these regions, the frequencies are 34.23% G/C and 65.77% A/T content.) Then the ratio with the background nucleotide frequency of the sequence being searched is taken, and the log2 is taken of this ratio, giving the 'bits'. The resulting PWM is the 4 x n matrix in which a cell i,j is the bits score of nucleotide i at sequence position j. If the bits are too low in flanking positions (resulting from a position in the alignment having close to background frequency), hmmbuild automatically excludes these positions from the PWM. Because of this, if the user is unsure where the region of conservation ends, it is best to err on the side of including unconserved flanking regions, as this won't affect the results. For details, see the HMMER UserGuide.

  2. Isolate and Classify non-exonic sequence   Using GFF  (General Feature Format) exon annotation files, the program snip.plx isolates all non-exonic sequence from either C. elegans or C. briggsae genomic sequence. (See this illustration). Sequences are classified as

    • 3' intergenic   between the 3' ends of two genes on opposite strands
    • 5' intergenic   between the 5' ends of two genes on opposite strands
    • 3'/5' intergenic   between two genes on the same strand
    • intronic#   between exons of the same gene (though not necessarily the same splice variant)
      Note: if the gene has only one splice form, these fragments correspond to actual numbered introns
    • BEGIN or END   at the beginning or ending of a chromosome (C. elegans) or sequence read (C. briggsae)
    • other    all segments not in any above category (such as those bounded by exons from two different genes.

    In addition, any segment will be prefixed by alt_ if either of its bounding exons belongs to a gene with more than one splice variant. Note that no attempt is made to correctly number introns for each splice variant.

  3. Search non-exonic regions using the PWM   The program SimpleSearch is used as a scanning window to find up to N (user-defined) top-scoring hits in each genome. Technically, the program chooses the score cutoff to produce as many top-scoring hits as possible less than N. It estimates this cutoff by sampling 1% of the windows and sorting the resulting list of scores. Therefore, the number of hits retrieved will be slightly less than N in most cases. If the PWM has very little information (low bits in most positions) then there will be many windows achieving the same score, and the number of hits retrieved could be substantially less than N. This can arise if the binding sites are incorrectly aligned or poorly conserved.

  4. Filter out 'orphan' hits   We then use a 1-1 mapping of C. elegans to C. briggsae orthologs provided in the file orthologs-2.00 to find only those hits in one species for which there is a hit in the matching ortholog of the other species. The filter traverses all ortholog pairs, and retrieves the top-scoring hit for each ortholog, called a 'hit-pair'. The percentage of top-scoring hits filtered out in each species during this step will depend on the number of hits (N) the user chooses to retrieve. A permissive (i.e. 20000) value for N will include many more hit-pairs during this step, but since the hit-pairs passing the filter are then sorted by either average or maximum or minimum score (see next step), often the value of N chosen won't affect which hits appear at the top of the list. The only case where this can affect the top-ranked hit-pairs is the results sorted by mismatch first, then by some score criterion (minimum, average, or maximum).

  5. Output sorted Hit-Pairs in HTML tables   Finally, the resulting hit-pairs are sorted by either minimum, average, or maximum score in hit-pair, and by number of mismatches between the hit sequences. All six possible combinations of primary/secondary sortings are provided. Ranking by average score first gives the most statistically correct measure of rank order in the context of our protocol. The other sortings are provided for more specialized analysis of the results.

webmaster@www.wormbase.org Copyright Statement
Send comments or questions to WormBase Privacy Statement