- Build a Position Weight Matrix (PWM) The user
submits a binding site alignment file. This file is input to the hmmbuild
command from the HMMER software suite, to build the position weight matrix. Only the emission parameters
in the file are used in the search. These are the four numbers in lines
starting with 1,2,3 etc. The sixth column in these lines is the position
in the multiple alignment which these parameters represent. For example,
in the following position weight matrix:
PWM_column A C G T alignment_column
1 -231 -514 1472 -1392 1
2 568 158 834 -4321 2
3 -4321 273 2016 -1959 3
4 -1106 2016 -304 -4321 4
5 1357 -4323 -370 -4321 5
6 -93 560 1305 -4321 6
7 -4321 1915 690 -2407 7
8 -300 1801 -1414 -2189 9
9 -1229 2152 -1046 -4321 10
10 -4321 2486 -4323 -4321 11
11 1555 -4323 -4323 -4321 12
12 -126 -4323 -551 802 13
13 -1859 -4323 2181 -1399 14
14 -4321 2486 -4323 -4321 15
15 1555 -4323 -4323 -4321 16
16 -1612 1168 -4323 602 18
17 -2636 1020 83 348 19
generated from the alignment GAGCAAGTCCCATGCAATT
TAGCAGGTCCCATGCACTT
AAGCAAGTCCCATGCAATG
GGGCAAGTCCCATGCAATA
AAGCAGTTCCCATGCAGAC
AAGCAGTTCCCATGCAGAC
AAGCACCTCCCATGCAGAC
GAGAAGCGCCCAAGCATTG
GCGCAACCACCAGTCATTT
GCGCAACCACCAGTCATTT
GGGCAGCGAACAAGCATCT
CAGCAGGCGCCATGCAACC
GCGCAACCACCAGTCAGCT
GGTGAGCCCCCAAGCATCT
ACGCAACACCCATACACTC
AGCAGCCGCCCATGCATTG
CAGCAGGATACATGCAACC
TGCGGCCGCCCATGCACTC
GACCACCAAGCAAGCACCT
we can see that there are 19 columns in the alignment but only 17
columns (displayed inverted as rows) in the position weight matrix.
Columns 8 and 17 HMMER judged too close to backgound frequency to
extract a log-odds scoring column. If this position weight matrix were
used in an actual search, the nucleotides in positions 8 and 17 would be
ignored and merely used as placeholders.
There can be poorly
conserved columns anywhere in the alignment, but for a successful
search, the alignment must contain at least 6 well conserved columns.
hmmbuild then uses a tree-based sequence weighting scheme to
calculate the weighted counts of nucleotides at each position in the
multiple alignment. Background frequency priors are added to each count
at each position. These are the frequencies of nucleotides in the
non-exonic regions of C. elegans and C. briggsae genomes
to be searched. (Using current GFF files to determine these regions, the
frequencies are 34.23% G/C and 65.77% A/T content.) Then
the ratio with the background nucleotide frequency of the sequence being
searched is taken, and the log2 is taken of this ratio,
giving the 'bits'. The resulting PWM is the 4 x n matrix in which
a cell i,j is the bits score of nucleotide i at sequence
position j. If the bits are too low in flanking positions
(resulting from a position in the alignment having close to background
frequency), hmmbuild automatically excludes these positions from
the PWM. Because of this, if the user is unsure where the region of
conservation ends, it is best to err on the side of including
unconserved flanking regions, as this won't affect the results. For
details, see the HMMER UserGuide.
- Isolate and Classify non-exonic sequence Using GFF (General Feature Format) exon annotation
files, the program snip.plx isolates all non-exonic sequence from
either C. elegans or C. briggsae genomic sequence. (See
this illustration). Sequences are classified as
- 3' intergenic between the 3' ends of two
genes on opposite strands
- 5' intergenic between the 5' ends of two
genes on opposite strands
- 3'/5' intergenic between two genes on
the same strand
- intronic# between exons of the same gene
(though not necessarily the same splice variant)
Note: if
the gene has only one splice form, these fragments correspond to
actual numbered introns
- BEGIN or END at the
beginning or ending of a chromosome (C. elegans) or sequence
read (C. briggsae)
- other all segments not in any above
category (such as those bounded by exons from two different genes.
In addition, any segment will be prefixed by
alt_ if either of its bounding exons belongs to a gene
with more than one splice variant. Note that no attempt is made to
correctly number introns for each splice variant.
- Search non-exonic regions using the PWM The
program SimpleSearch is used as a scanning window to find up to N
(user-defined) top-scoring hits in each genome. Technically, the program
chooses the score cutoff to produce as many top-scoring hits as possible
less than N. It estimates this cutoff by sampling 1% of the windows and
sorting the resulting list of scores. Therefore, the number of hits
retrieved will be slightly less than N in most cases. If the PWM has
very little information (low bits in most positions) then there will be
many windows achieving the same score, and the number of hits retrieved
could be substantially less than N. This can arise if the binding sites
are incorrectly aligned or poorly conserved.
- Filter out 'orphan' hits We then use a 1-1
mapping of C. elegans to C. briggsae orthologs provided in
the file orthologs-2.00 to find only those hits in one species
for which there is a hit in the matching ortholog of the other species.
The filter traverses all ortholog pairs, and retrieves the top-scoring
hit for each ortholog, called a 'hit-pair'. The percentage of
top-scoring hits filtered out in each species during this step will
depend on the number of hits (N) the user chooses to retrieve. A
permissive (i.e. 20000) value for N will include many more hit-pairs
during this step, but since the hit-pairs passing the filter are then
sorted by either average or maximum or minimum score (see next step),
often the value of N chosen won't affect which hits appear at the top of
the list. The only case where this can affect the top-ranked hit-pairs
is the results sorted by mismatch first, then by some score criterion
(minimum, average, or maximum).
- Output sorted Hit-Pairs in HTML tables Finally,
the resulting hit-pairs are sorted by either minimum, average, or
maximum score in hit-pair, and by number of mismatches between the hit
sequences. All six possible combinations of primary/secondary sortings
are provided. Ranking by average score first gives the most
statistically correct measure of rank order in the context of our
protocol. The other sortings are provided for more specialized analysis
of the results.
|