C. elegans/C. briggsae Transcription Factor Binding Site Search

Introduction	Methods	Download	Prediction

We implement a simple 5-step pipeline for the detection of transcription factor binding sites conserved between C. elegans and C. briggsae. Here are the details of the five steps outlined in the introduction:

Build a Position Weight Matrix (PWM) The user submits a binding site alignment file. This file is input to the hmmbuild command from the HMMER software suite, to build the position weight matrix. Only the emission parameters in the file are used in the search. These are the four numbers in lines starting with 1,2,3 etc. The sixth column in these lines is the position in the multiple alignment which these parameters represent. For example, in the following position weight matrix:
```
PWM_column  A      C      G      T     alignment_column

     1   -231   -514   1472  -1392     1
     2    568    158    834  -4321     2
     3  -4321    273   2016  -1959     3
     4  -1106   2016   -304  -4321     4
     5   1357  -4323   -370  -4321     5
     6    -93    560   1305  -4321     6
     7  -4321   1915    690  -2407     7
     8   -300   1801  -1414  -2189     9
     9  -1229   2152  -1046  -4321    10
    10  -4321   2486  -4323  -4321    11
    11   1555  -4323  -4323  -4321    12
    12   -126  -4323   -551    802    13
    13  -1859  -4323   2181  -1399    14
    14  -4321   2486  -4323  -4321    15
    15   1555  -4323  -4323  -4321    16
    16  -1612   1168  -4323    602    18
    17  -2636   1020     83    348    19
```
generated from the alignment
```
GAGCAAGTCCCATGCAATT
TAGCAGGTCCCATGCACTT
AAGCAAGTCCCATGCAATG
GGGCAAGTCCCATGCAATA
AAGCAGTTCCCATGCAGAC
AAGCAGTTCCCATGCAGAC
AAGCACCTCCCATGCAGAC
GAGAAGCGCCCAAGCATTG
GCGCAACCACCAGTCATTT
GCGCAACCACCAGTCATTT
GGGCAGCGAACAAGCATCT
CAGCAGGCGCCATGCAACC
GCGCAACCACCAGTCAGCT
GGTGAGCCCCCAAGCATCT
ACGCAACACCCATACACTC
AGCAGCCGCCCATGCATTG
CAGCAGGATACATGCAACC
TGCGGCCGCCCATGCACTC
GACCACCAAGCAAGCACCT
```
we can see that there are 19 columns in the alignment but only 17 columns (displayed inverted as rows) in the position weight matrix. Columns 8 and 17 HMMER judged too close to backgound frequency to extract a log-odds scoring column. If this position weight matrix were used in an actual search, the nucleotides in positions 8 and 17 would be ignored and merely used as placeholders.

There can be poorly conserved columns anywhere in the alignment, but for a successful search, the alignment must contain at least 6 well conserved columns. hmmbuild then uses a tree-based sequence weighting scheme to calculate the weighted counts of nucleotides at each position in the multiple alignment. Background frequency priors are added to each count at each position. These are the frequencies of nucleotides in the non-exonic regions of C. elegans and C. briggsae genomes to be searched. (Using current GFF files to determine these regions, the frequencies are 34.23% G/C and 65.77% A/T content.) Then the ratio with the background nucleotide frequency of the sequence being searched is taken, and the log₂ is taken of this ratio, giving the 'bits'. The resulting PWM is the 4 x n matrix in which a cell i,j is the bits score of nucleotide i at sequence position j. If the bits are too low in flanking positions (resulting from a position in the alignment having close to background frequency), hmmbuild automatically excludes these positions from the PWM. Because of this, if the user is unsure where the region of conservation ends, it is best to err on the side of including unconserved flanking regions, as this won't affect the results. For details, see the HMMER UserGuide.
Isolate and Classify non-exonic sequence Using GFF (General Feature Format) exon annotation files, the program snip.plx isolates all non-exonic sequence from either C. elegans or C. briggsae genomic sequence. (See this illustration). Sequences are classified as
- 3' intergenic between the 3' ends of two genes on opposite strands
- 5' intergenic between the 5' ends of two genes on opposite strands
- 3'/5' intergenic between two genes on the same strand
- intronic# between exons of the same gene (though not necessarily the same splice variant)
  Note: if the gene has only one splice form, these fragments correspond to actual numbered introns
- BEGIN or END at the beginning or ending of a chromosome (C. elegans) or sequence read (C. briggsae)
- other all segments not in any above category (such as those bounded by exons from two different genes.
In addition, any segment will be prefixed by alt_ if either of its bounding exons belongs to a gene with more than one splice variant. Note that no attempt is made to correctly number introns for each splice variant.
Search non-exonic regions using the PWM The program SimpleSearch is used as a scanning window to find up to N (user-defined) top-scoring hits in each genome. Technically, the program chooses the score cutoff to produce as many top-scoring hits as possible less than N. It estimates this cutoff by sampling 1% of the windows and sorting the resulting list of scores. Therefore, the number of hits retrieved will be slightly less than N in most cases. If the PWM has very little information (low bits in most positions) then there will be many windows achieving the same score, and the number of hits retrieved could be substantially less than N. This can arise if the binding sites are incorrectly aligned or poorly conserved.
Filter out 'orphan' hits We then use a 1-1 mapping of C. elegans to C. briggsae orthologs provided in the file orthologs-2.00 to find only those hits in one species for which there is a hit in the matching ortholog of the other species. The filter traverses all ortholog pairs, and retrieves the top-scoring hit for each ortholog, called a 'hit-pair'. The percentage of top-scoring hits filtered out in each species during this step will depend on the number of hits (N) the user chooses to retrieve. A permissive (i.e. 20000) value for N will include many more hit-pairs during this step, but since the hit-pairs passing the filter are then sorted by either average or maximum or minimum score (see next step), often the value of N chosen won't affect which hits appear at the top of the list. The only case where this can affect the top-ranked hit-pairs is the results sorted by mismatch first, then by some score criterion (minimum, average, or maximum).
Output sorted Hit-Pairs in HTML tables Finally, the resulting hit-pairs are sorted by either minimum, average, or maximum score in hit-pair, and by number of mismatches between the hit sequences. All six possible combinations of primary/secondary sortings are provided. Ranking by average score first gives the most statistically correct measure of rank order in the context of our protocol. The other sortings are provided for more specialized analysis of the results.

webmaster@www.wormbase.org	Copyright Statement
Send comments or questions to WormBase	Privacy Statement