BioPerl DocBook

Bio::SearchIO HOWTO

Jason Stajich
[1]Duke University
[2]University Program in Genetics
[3]Center for Genome Technology
Duke University Medical Center
Box 3568
Durham, North Carolina 27710-3568
USA
<[4]jason-at-bioperl.org>

Brian Osborne
[5]Cognia Corporation
NYC, NY 10022
USA
<[6]brian-at-cognia.com>

This document is copyright Jason Stajich, Brian Osborne, 2002-2004. It can be copied and distributed under the terms of the Perl Artistic License.

2002-07-14

Revision History
Revision 0.1  2002-07-14  js
    first draft
Revision 0.2  2002-10-11  js
    added info on extending Search objects
Revision 0.3  2003-02-13  BIO
    added table and text to Parsing section
Revision 0.4  2003-09-10  BIO
    updated Parsing section
Revision 0.5  2004-01-30  JES and BIO
    Fixed some missing table entries for Hit functions, typo fixes.
Revision 0.6  2004-06-21  JES
    Made document redistributable under Perl Artistic.

Abstract

This is a HOWTO written in DocBook (SGML) that explains the reasoning behind the creation of the Bio::SearchIO system, how to use it, and how one goes about writing new adaptors to different output formats. We will also describe how the Bio::SearchIO::Writer modules work for outputting various formats from Bio::Search objects.
_________________________________________________________________

Table of Contents

1. [7]Background
2. [8]Design
3. [9]Parsing with Bio::SearchIO
4. [10]Creating Reports for SearchIO
5. [11]Implementation
6. [12]Writing and formatting output
7. [13]Extending SearchIO
8. [14]Lightweight Objects
9. [15]Useful links

1. Background

One of the most common and necessary tasks in bioinformatics is parsing analysis reports so that one can write programs which help interpret the sheer volume of data that can be produced by processing many sequences. To this end the Bioperl project has produced a number of parsers for the ubiquitous BLAST report. Steve Chervitz wrote one of the first Bioperl modules for BLAST called Bio::Tools::Blast.
Ian Korf allowed us to import his BPlite (Blast Parser) module into Bioperl and modify it as Bio::Tools::BPlite. These are of course only two in a sea of BLAST parsers that have been written by numerous people, but we will only cover the ones associated directly with the Bioperl project in this document.

One of the reasons for writing yet another BLAST parser in the form of Bio::SearchIO is that even though both Bio::Tools::Blast and Bio::Tools::BPlite did their job correctly, and could parse WU-BLAST and NCBI-BLAST output, they did not adequately genericize what they were doing. By this we mean everything was written around the BLAST format and was not easily applicable to parsing, say, FastA alignments or a new alignment format. One of the powerful features of the Object-Oriented framework in Bioperl is the ability to read in, say, a sequence file in different formats or from different data sources like a database or XML-flatfile, and have the program code process the sequence objects in the same manner. We wanted to have this capability in place for analysis reports as well, hence the generic design of the Bio::SearchIO module.

2. Design

The Bio::SearchIO system was designed with the assumption that all reports parsed with it could be separated into a hierarchy of components. The Result is the entire analysis for a single query sequence, and multiple Results can be concatenated together into a single file (e.g. by running blastall with a fasta database as the input file rather than a single sequence). Each Result is a set of Hits for the query sequence. Hits are sequences in the searched database which could be aligned to the query sequence and met the minimal search parameters, such as the e-value threshold. Each Hit has one or more High-scoring Segment Pairs (HSPs) which are the alignments of the query and hit sequence.
Each Result has a set of one or more Hits and each Hit has a set of one or more HSPs, and this relationship can be used to describe results from all pairwise alignment programs including BLAST, FastA, and implementations of the Smith-Waterman and Needleman-Wunsch algorithms.

A design pattern, called Factory, is utilized in object oriented programming to separate the entity which processes data from the objects which will hold the information produced. In the same manner that the [16]Bio::SeqIO module is used to parse different file formats and produces objects which are [17]Bio::PrimarySeqI compliant, we have written [18]Bio::SearchIO to produce the Bio::Search objects. Sequences are a little less complicated so there is only one primary object (Bio::PrimarySeqI), whereas Search results need three main components to represent the data processed in a file: [19]Bio::Search::Result::ResultI (top level results), [20]Bio::Search::Hit::HitI (hits) and [21]Bio::Search::HSP::HSPI (HSPs). The [22]Bio::SearchIO object is then a factory which produces [23]Bio::Search::Result::ResultI objects, and the [24]Bio::Search::Result::ResultI objects contain information about the query, the database searched, and the full collection of Hits found for the query.

The generality of the SearchIO approach is demonstrated by the large number of report formats that have appeared since its introduction. These formats include AXT format reports (BLAT, BLASTZ), NCBI tabular output (-m 8 or -m 9 options), NCBI Blast XML, chadosxpr format flat databases, Exonerate output, FASTA output, hmmsearch output (HMMER), megablast output, PSL format output (BLAT), sim4 output, WABA output, and output from Wise.

3. Parsing with Bio::SearchIO

This section describes how to use the SearchIO system to process reports. We'll describe BLAST reports but the idea is that once you understand the methods associated with the objects you won't need to know anything special about other SearchIO parsers.
Before we get into the details we should admit that there is some confusion about the names and functions of the objects for historical reasons. Both Steve Chervitz and Jason Stajich have implemented parsers in this system. Steve created the psiblast parser (which does parse regular BLAST files too) and a host of objects named Bio::Search::XXX::BlastXXX where XXX is HSP, Hit, and Result. These objects are created by his Bio::SearchIO::psiblast implementation. The objects Jason has created are called Bio::Search::XXX::GenericXXX where, again, XXX is HSP, Hit, and Result. Because of some of the assumptions made in Steve's implementation and his utilization of what is known as 'lazy parsing', it is probably not going to be very easy to maintain his system without his help. On the other hand Jason has tried to make his implementations much easier to follow because all the parsing is done in one module.

The important take home message is that you cannot assume that methods in the BlastXXX objects are in fact implemented by the GenericXXX objects. More likely than not the BlastXXX objects will be deprecated and dismantled as their functionality is ported to the GenericXXX objects. For this reason we'll only be discussing the Generic* objects, though we'll use the terms 'hit', 'HSP', and 'result'.

Here's example code which processes a BLAST report, finding all the hits where the HSPs are longer than 100 residues and the percent identity is at least 75 percent. This code demonstrates that a result, in this case from a BLAST report, contains one or more hits, and a hit contains one or more HSPs.
    use strict;
    use Bio::SearchIO;

    my $in = Bio::SearchIO->new(-format => 'blast',
                                -file   => 'report.bls');
    while( my $result = $in->next_result ) {
        while( my $hit = $result->next_hit ) {
            while( my $hsp = $hit->next_hsp ) {
                if( $hsp->length('total') > 100 ) {
                    if ( $hsp->percent_identity >= 75 ) {
                        print "Hit= ",        $hit->name,
                              ",Length=",     $hsp->length('total'),
                              ",Percent_id=", $hsp->percent_identity, "\n";
                    }
                }
            }
        }
    }

The example above shows just a few of the many methods available in SearchIO. In order to display all these methods and what they return let's use a report as input, a simple BLASTX result:

    BLASTX 2.2.4 [Aug-26-2002]

    Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
    Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
    "Gapped BLAST and PSI-BLAST: a new generation of protein database search
    programs", Nucleic Acids Res. 25:3389-3402.

    Query= gi|20521485|dbj|AP004641.2 Oryza sativa (japonica
    cultivar-group) genomic DNA, chromosome 1, BAC clone:B1147B04, 3785
    bases, 977CE9AF checksum.
             (3059 letters)

    Database: test.fa
               5 sequences; 1291 total letters

                                                               Score     E
    Sequences producing significant alignments:                (bits)  Value

    gb|443893|124775 LaForas sequence                              92  2e-022

    >gb|443893|124775 LaForas sequence
              Length = 331

     Score = 92.0 bits (227), Expect = 2e-022
     Identities = 46/52 (88%), Positives = 48/52 (91%)
     Frame = +1

    Query: 2896 DMGRCSSGCNRYPEPMTPDTMIKLYREKEGLGAYIWMPTPDMSTEGRVQMLP 3051
                D+ + SSGCNRYPEPMTPDTMIKLYRE EGL AYIWMPTPDMSTEGRVQMLP
    Sbjct: 197  DIVQNSSGCNRYPEPMTPDTMIKLYRE-EGL-AYIWMPTPDMSTEGRVQMLP 246

      Database: test.fa
        Posted date:  Feb 12, 2003  9:51 AM
      Number of letters in database: 1291
      Number of sequences in database:  5

    Lambda     K      H
       0.318    0.135    0.401

    Gapped
    Lambda     K      H
       0.267   0.0410    0.140

    Matrix: BLOSUM62
    Gap Penalties: Existence: 11, Extension: 1
    Number of Hits to DB: 7140
    Number of Sequences: 5
    Number of extensions: 180
    Number of successful extensions: 2
    Number of sequences better than 10.0: 2
    Number of HSP's better than 10.0 without gapping: 1
    Number of HSP's successfully gapped in prelim test: 0
    Number of HSP's that attempted gapping in prelim test: 0
    Number of HSP's gapped (non-prelim): 1
    length of database: 1291
    effective HSP length: 46
    effective length of database: 1061
    effective search space used: 1032353
    frameshift window, decay const: 50,  0.1
    T: 12
    A: 40
    X1: 16 ( 7.3 bits)
    X2: 38 (14.6 bits)
    X3: 64 (24.7 bits)
    S1: 32 (17.6 bits)

Table 1 shows all the data returned by methods used by the Result, Hit, and HSP objects when the report shown above is used as input. Note that many of the methods shown can be used to either get or set values, but we're just showing what they get.

Object  Method                          Example                               Description
------  ------------------------------  ------------------------------------  -----------
Result  algorithm                       BLASTX                                algorithm
Result  algorithm_version               2.2.4 [Aug-26-2002]                   algorithm version
Result  query_name                      gi|20521485|dbj|AP004641.2            query name
Result  query_accession                 AP004641.2                            query accession
Result  query_length                    3059                                  query length
Result  query_description               Oryza sativa ... 977CE9AF checksum.   query description
Result  database_name                   test.fa                               database name
Result  database_letters                1291                                  number of residues in database
Result  database_entries                5                                     number of database entries
Result  available_statistics            effectivespaceused ... dbletters      statistics used
Result  available_parameters            gapext matrix allowgaps gapopen       parameters used
Result  num_hits                        1                                     number of hits
Result  hits                            Search::Hit::GenericHit object
Hit     name                            gb|443893|124775                      hit name
Hit     length                          331                                   Length of the Hit sequence
Hit     accession                       443893                                accession
Hit     description                     LaForas sequence                      hit description
Hit     algorithm                       BLASTX                                algorithm
Hit     raw_score                       92                                    hit raw score
Hit     significance                    2e-022                                hit significance
Hit     bits                            92.0                                  hit bits
Hit     hsps                            Search::HSP::GenericHSP object
Hit     num_hsps                        1                                     number of HSPs in hit
Hit     locus                           124775                                locus name
Hit     accession_number                443893                                accession number
HSP     algorithm                       BLASTX                                algorithm
HSP     evalue                          2e-022                                e-value
HSP     expect                          2e-022                                alias for evalue()
HSP     frac_identical                  0.884615384615385                     Fraction identical
HSP     frac_conserved                  0.923076923076923                     Fraction conserved
HSP     gaps                            2                                     number of gaps
HSP     query_string                    DMGRCSSG ...                          string from alignment
HSP     hit_string                      DIVQNSS ...                           string from alignment
HSP     homology_string                 D+ + SSGCN ...                        string from alignment
HSP     length('total')                 52                                    length of HSP (including gaps)
HSP     length('hit')                   50                                    length of hit participating in alignment minus gaps
HSP     length('query')                 156                                   length of query participating in alignment minus gaps
HSP     hsp_length                      52                                    Length of the HSP (including gaps), alias for length('total')
HSP     frame                           0                                     $hsp->query->frame,$hsp->hit->frame
HSP     num_conserved                   48                                    number of conserved residues
HSP     num_identical                   46                                    number of identical residues
HSP     rank                            1                                     rank of HSP
HSP     seq_inds('query','identical')   (966,971,972,973,974,975 ...)         identical positions as array
HSP     seq_inds('query','conserved')   (967,969)                             conserved positions as array
HSP     seq_inds('hit','identical')     (197,202,203,204,205 ...)             identical positions as array
HSP     seq_inds('hit','conserved')     (198,200)                             conserved positions as array
HSP     score                           227                                   score
HSP     bits                            92.0                                  bits
HSP     range('query')                  (2896,3051)                           start and end as array
HSP     range('hit')                    (197,246)                             start and end as array
HSP     percent_identity                88.4615384615385                      % identical
HSP     strand('hit')                   1                                     strand of the hit
HSP     strand('query')                 1                                     strand of the query
HSP     start('query')                  2896                                  start position from alignment
HSP     end('query')                    3051                                  end position from alignment
HSP     start('hit')                    197                                   start position from alignment
HSP     end('hit')                      246                                   end position from alignment
HSP     matches('hit')                  (46,48)                               number of identical and conserved as array
HSP     matches('query')                (46,48)                               number of identical and conserved as array
HSP     alignment                       Bio::SimpleAlign object
HSP     hsp_group                                                             Group field from WU-BLAST reports run with -topcomboN or -topcomboE specified
HSP     links                                                                 Links field from WU-BLAST reports run with -links showing consistent HSP linking

Table 1. SearchIO Methods

Table 1 shows that a method can return a string, an array, or an object. When an object is returned some additional code will probably be needed to get the data of interest. For example, if you wanted a printable alignment after you'd parsed BLAST output you could use the get_aln() method, retrieve a [25]Bio::SimpleAlign object and use it like this:

    use Bio::AlignIO;

    # $aln will be a Bio::SimpleAlign object
    my $aln = $hsp->get_aln;

    # write_aln() prints to a filehandle rather than returning the text,
    # so capture the output through an in-memory filehandle
    my $alignment_as_string = '';
    open(my $fh, '>', \$alignment_as_string) or die "Cannot open string: $!";
    my $alnIO = Bio::AlignIO->new(-format => "msf", -fh => $fh);
    $alnIO->write_aln($aln);
    close($fh);

On one hand this appears to be a complication, but by entering the worlds of the AlignIO and SimpleAlign objects you now have access to their functionality and flexibility. This is the beauty of Bioperl!

Some of these methods deserve a bit more explanation since they do more than simply extract data directly from the output.
For example, the ambiguous_aln() method is designed to tell us whether two or more HSPs from a given hit overlap, and whether the overlap refers to the queries or the hits, or both. One situation where overlaps would be found in one but not the other arises when there are repeats in the query or hit. The ambiguous_aln() method will return one of these 4 values:

    q     query sequence contains overlapping sub-sequences while hit sequence does not
    s     hit sequence contains overlapping sub-sequences while query does not
    qs    query and hit sequences contain overlapping sub-sequences relative to each other
    -     query and hit sequence do not contain multiple domains relative to each other OR both contain the same distribution of similar domains

Another method that's useful in dissecting an HSP is the seq_inds() method of the HSP object. This method tells us the positions of all the identical or conserved residues in an alignment, for the query or the hit. It could be used like this:

    # put all the conserved matches in query strand into an array
    my @conserved;
    my @str_array = split "",$hsp->query_string;
    foreach ( $hsp->seq_inds('query','conserved') ){
        push @conserved,$str_array[$_ - 1];
    }

seq_inds() can be very useful for extracting the mismatch bases in an alignment. If you wanted to figure out which bases are not matching in an alignment you could use seq_inds() to get these positions and then extract these specific bases from the alignment.

In most cases the SearchIO methods extract data directly from output but there's one important exception, the frame() method of the HSP object. Instead of using the values in the BLAST report it converts them to values according to the GFF specification, which is a format used by many Bioperl modules involved in gene annotation (for more on GFF see [26]http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml). Specifically, the frame() method returns 0, 1, or 2 instead of the expected -3, -2, -1, +1, +2, or +3 in BLAST.
GFF frame values are meaningful relative to the strand of the hit or query sequence so in order to reconstruct the BLAST frame you need both the strand, 1 or -1, and the GFF frame value:

    my $blast_frame = ($hsp->query->frame + 1) * $hsp->query->strand;

Our simple table of methods does not show all available arguments or returned values for all the SearchIO methods. The best place to explore any method in detail is [27]http://doc.bioperl.org which provides the HTML versions of the Perl POD (Plain Old Documentation) that is embedded in every well-written Perl module - there's also a list of modules at the bottom of this HOWTO. Other sources of code are the examples/searchio/ and scripts/searchio directories in the Bioperl package.

4. Creating Reports for SearchIO

One note on creating reports that can be parsed by SearchIO: the developers haven't attempted to parse all the possible reports that could be created by programs with many command-line options, like blastall. Certainly you should be able to parse reports created using the default settings, but if you're running blastall, say, using some special set of options and you've encountered a parsing problem this may be the explanation. For example, one can currently parse BLAST output created with the default settings as well as the reports created when using the "-m 8" or "-m 9" options (use format "blasttable") or the XML-formatted reports, but it's still possible to find sets of options that SearchIO can't parse.

5. Implementation

This section describes how the SearchIO system was implemented. It is probably not necessary to understand all of this unless you are curious or want to implement your own Bio::SearchIO parser.

We have utilized an event-based system to process these reports. This is analogous to the SAX (Simple API for XML) system used to process XML documents. Event-based parsing can be thought of simply as a series of start and end events.
When you hit the beginning of a report a start event is thrown, when you hit the end of the report an end event is thrown. So the report events are paired, and everything else that is thrown in between the paired start and end events is related to that report. Another way to think of it is as if you pick a number and color for a card in a standard deck. Let's say you pick red and 2. Then you start dealing cards from the deck and pile them one on top of each other. When you see your first red 2 you start a new pile, and keep dealing cards onto that pile until you see the next red 2. Everything in your pile that happened between the beginning red 2 and the ending red 2 is data you'll want to keep and process. In the same way all the events you see between a pair of start and end events (like 'report' or 'hsp') are data associated with that object or a child object in its hierarchy.

A listener object processes all of these events. In our example the listener is the table where the stack of cards is sitting, and later it is the hand which moves the pile of cards when a new stack is started. The listener will take the events and process them. We've neglected to tell you of a third event that is thrown and caught. This is the characters event in SAX terminology, which is simply data. So one sends a start event, then some data, then an end event. This process is analogous to a finite state machine in computer science (and I'm sure the computer scientists reading this right now are already yawning) where what we do with data received is dependent on the state we're in. The state that the listener is in is affected by the events that are processed.

A small caveat: in an ideal situation a processor would throw events and not need to maintain any state information. It would just be processing data and the listener would manage the information and state.
However, a lot of the parsing of these human-readable reports requires contextual information to apply the correct regular expressions. So in fact the event thrower has to know what state it is in and apply different methods based on this. In contrast the XML parsers simply keep track of what state they are in, but can process all the data with the same system of reading the tag and sending the data that is in between the XML start and end tags.

All of this framework has been built up so that to implement a new parser one only needs to write a module that produces the appropriate start and end events, and the existing framework will do the work of creating the objects for you.

Here's how we've implemented event-based parsing for [28]Bio::SearchIO. Bio::SearchIO is just the front-end to this process; in fact the processing of these reports is done by different modules in the Bio/SearchIO directory. So if you look at your bioperl distribution and the modules in Bio/SearchIO you'll see modules in there like blast.pm, fasta.pm, blastxml.pm, SearchResultEventBuilder.pm, EventHandlerI.pm (depending on your version of the toolkit there may be more modules in there). There is also a SearchWriterI.pm and a Writer directory in there but we'll save that for later. If you don't have the distribution handy you can navigate this at the bioperl [29]CVS web page.

Let's use the blast.pm module as an example to describe the relationship of the modules in this directory (we could have substituted any of the other format parsers like fasta.pm or blastxml.pm - these are always lowercase for historical reasons). The module has some features you should look for - the first is the hash in the BEGIN block called %MAPPING. The key/value pairs here are the shorthand for how we map events from this module to general event names.
This is only necessary because if we have an XML processor (see the blastxml.pm module) the event names will be the same as the XML tag names (like in the NCBI BLAST XML DTD). So to make this general we make sure all of the events inside our parser map to the values in the %MAPPING hash - we can call them whatever we want inside this module. Some of the events map to hash references (like Statistics_db-len). This is so we can map multiple values to the same top-level attribute field while knowing they will be stored as a hash value in the subsequent object (in this example, keyed by the name 'dbentries'). The capital "RESULT", "HSP", or "HIT" in the value name allows us to encode the event state in the event so we don't have to pass in two values. It is also easy for someone to quickly read the list of events and know which ones are related to Hits and which ones are related to HSPs.

The listener in our architecture is the [30]Bio::SearchIO::SearchResultEventBuilder. This object is attached as a listener through the [31]Bio::SearchIO method attach_EventHandler. In fact you could have multiple event listeners and they could do different things. In our case we want to create Bio::Search objects, but an event listener could just as easily be writing data directly into a database or writing to a file, based on the events. The SearchResultEventBuilder takes the events thrown by the SearchIO classes and builds the appropriate Bio::Search objects from them.

Sometimes special objects are needed that are extensions beyond what the GenericHSP or GenericHit objects are meant to represent. For this case we have implemented Bio::SearchIO::SearchResultEventBuilder so that it can use factories for creating its resulting Bio::Search objects - see the Bio::SearchIO::hmmer::_initialize method for an example of how this can be set.

6. Writing and formatting output

Often people want to write a BLAST report back out, either for users who are most comfortable with that output or to visualize the context of a weakly aligned region and use human intuition to score the confidence of a putative homologue. The Bio::SearchIO::Writer modules create output from the information held in the Bio::Search objects. Bio::SearchIO::Writer currently creates output in a few different formats: text (recreating something like the BLAST report itself, in part or entirely), HTML, BSML, "ResultTable" (tab-delimited format), "HSPTable" (tab-delimited, for HSPs), and Gbrowse GFF.

The simplest way to output data in HTML format is as follows.

    use Bio::SearchIO;
    use Bio::SearchIO::Writer::HTMLResultWriter;

    my $writerhtml = Bio::SearchIO::Writer::HTMLResultWriter->new();
    my $outhtml = Bio::SearchIO->new(-writer => $writerhtml,
                                     -file   => ">searchio.html");
    # get a result from Bio::SearchIO parsing or build it up in memory
    $outhtml->write_result($result);

If you wanted to get the output as a string rather than write it out to a file, simply use the following.

    $writerhtml->to_string($result);

The HTMLResultWriter supports setting your own remote database url for the sequence links in the event you'd like to point to your own SRS or local HTTP-based connection to the sequence data. Simply use the remote_database_url method which accepts a sequence type as input (protein or nucleotide). You can also override the id_parser() method to define what the unique IDs are from these sequence ids in the event you would like to use something other than the accession number that is gleaned from the sequence string.

If your data is instead stored in a database you could build the Bio::Search objects up in memory directly from your database and then use the Writer object to output the data.

7. Extending SearchIO

The framework for [32]Bio::SearchIO is just a starting point for parsing these reports and creating objects which represent the information.
If you would like to create your own set of objects which extend the current functionality, we have built the system so that it will support this. For example, you may have built your own HSP object which supports a special operation like realign_with_sw(), which might realign the HSP via a Smith-Waterman algorithm, pulling extra bases from the flanking sequence. You might call your module Bio::Search::HSP::RealignHSP and put it in a file called Bio/Search/HSP/RealignHSP.pm. Note that you don't have to put this file directly in the bioperl source directory - you can create your own local directory structure that is in parallel to the bioperl release source code as long as you have updated your PERL5LIB to contain your local directory or you are using the 'use lib' directive in your script. Also, you don't have to use the namespace Bio::Search::HSP as namespaces don't mean anything to perl with respect to object inheritance, but we do recommend you name things in a logical manner so that others might read and understand your code (and if you feel encouraged to donate your code to the project it might be easily integrated with existing modules).

So, when you write your new special module, you do need to make sure it inherits from the base [33]Bio::Search::HSP::HSPI object. Additionally, unless you want to reimplement all the initialization state in the current [34]Bio::Search::HSP::GenericHSP, you should just plan to extend that object. You need to follow the chained constructor system that we have set up so that the arguments are properly processed. Here is a sample of what your code might look like (don't forget to write your own POD so that it will be documented, we've left it off here to keep things simple).
    package Bio::Search::HSP::RealignHSP;
    use strict;
    use Bio::Search::HSP::GenericHSP;
    use vars qw(@ISA);    # for inheritance

    # RealignHSP inherits from GenericHSP
    @ISA = qw(Bio::Search::HSP::GenericHSP);

    sub new {
        my ($class,@args) = @_;
        my $self = $class->SUPER::new(@args);    # chained constructor
        # process the 1 additional argument this object supports
        # (the name is quoted so that it compiles under 'use strict')
        my ($ownarg1) = $self->_rearrange([qw(OWNARG1)],@args);
        return $self;    # remember to pass the object reference back out
    }

    sub realign_with_sw {
        my ($self) = @_;
        # implement my special realign method here
    }

    1;    # a module file must end with a true value

The above code gives you a skeleton of how to start to implement your object. To register it so that it is used when the SearchIO system makes HSPs you just need to call a couple of functions. The code below outlines them.

    use Bio::SearchIO;
    use Bio::Search::HSP::HSPFactory;
    use Bio::Search::Hit::HitFactory;

    # setup the blast parser, you can do this with any SearchIO parser however
    my $searchio = Bio::SearchIO->new(-file   => $blastfile,
                                      -format => 'blast');

    # build HSP factory with a certain type of HSPs to make
    # the default is Bio::Search::HSP::GenericHSP
    my $hspfact = Bio::Search::HSP::HSPFactory->new(
        -type => 'Bio::Search::HSP::RealignHSP');

    # if you wanted to replace the Hit factory you can do this as well
    # additionally there is an analogous
    # Bio::Search::Result::ResultFactory for setting custom Result objects
    my $hitfact = Bio::Search::Hit::HitFactory->new(
        -type => 'Bio::Search::Hit::SUPERDUPER_Hit');

    $searchio->_eventHandler->register_factory('hsp', $hspfact);
    $searchio->_eventHandler->register_factory('hit', $hitfact);

We have to register the HSPFactory, which is the object which will create HSPI objects. By allowing these to be built by a factory rather than by a hard-coded Bio::Search::HSP::GenericHSP->new(...) call, we are allowing the user to take advantage of the whole parsing structure and to slot their own object into the process rather than re-implementing very much.
We think this is very powerful and worth the system overhead, but it may not permit parsing to be as efficient as we would like. Future work will hopefully address speed and memory issues with this parser. Volunteers and improvement code are always welcome.

8. Lightweight Objects

The approaches described above will create a lot of objects, one for each of the components of a report. When you have 2000 hits in a BLASTX result there will be quite a few objects built, and a lot of memory consumed. It may be that you will need to use an approach that's less memory-intensive if your result sets are large. One option is to use the tabular output from BLAST when dealing with large datasets. There are other workarounds depending on what kind of data you want.

We designed SearchIO to be a modular system which separates parsing the data from instantiating objects by throwing events (like SAX) and having a listener build objects from these events. One can instantiate a different listener which builds simpler objects and throws away the data you don't want. Here is an example of how you can use an existing, lightweight listener - FastHitEventBuilder - which just throws away the HSPs and only builds Result and Hit objects.

    use Bio::SearchIO;
    use Bio::SearchIO::FastHitEventBuilder;

    my $searchio = Bio::SearchIO->new(-format => $format, -file => $file);
    $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);
    while( my $r = $searchio->next_result ) {
        while( my $h = $r->next_hit ) {
            # Hits will NOT have HSPs
            print $h->significance,"\n";
        }
    }

You could also build your own listener object - SearchResultEventBuilder and FastHitEventBuilder are 2 example implementations that specify the type of Result/Hit/HSP objects that are created by the listeners. You could create some lightweight Hit and HSP objects and have SearchResultEventBuilder create these instead of the default full-fledged ones.
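To make the listener idea more concrete, here is a toy sketch in plain Perl that does not use Bioperl at all. The package name ToyHitListener, the function parse_toy_report, the event method names and the line-based "report" format are all invented for this illustration and are not part of the Bioperl API; only the general pattern (start, characters and end events, with a listener that ignores everything inside an HSP) follows the description above.

```perl
use strict;
use warnings;

# Hypothetical listener (not a Bioperl class): it keeps Result and Hit
# data as plain hashes and builds nothing for HSP events.
package ToyHitListener;

sub new {
    my ($class) = @_;
    return bless { results => [], stack => [] }, $class;
}

sub start_element {
    my ($self, $type) = @_;
    push @{ $self->{stack} }, $type;          # remember the current state
    if    ($type eq 'result') { $self->{result} = { hits => [] }; }
    elsif ($type eq 'hit')    { $self->{hit}    = {}; }
    # a 'hsp' start event only changes the state, no object is created
}

sub characters {
    my ($self, $name, $value) = @_;
    my $state = $self->{stack}[-1] // '';
    if    ($state eq 'result') { $self->{result}{$name} = $value; }
    elsif ($state eq 'hit')    { $self->{hit}{$name}    = $value; }
    # data received while in the 'hsp' state is thrown away
}

sub end_element {
    my ($self, $type) = @_;
    pop @{ $self->{stack} };
    if ($type eq 'hit') {
        push @{ $self->{result}{hits} }, delete $self->{hit};
    }
    elsif ($type eq 'result') {
        push @{ $self->{results} }, delete $self->{result};
    }
}

sub results { return @{ $_[0]{results} }; }

package main;

# Hypothetical event thrower: turns "start X", "end X" and "name=value"
# lines into events for whatever listener it is given.
sub parse_toy_report {
    my ($text, $listener) = @_;
    for my $line (split /\n/, $text) {
        if    ($line =~ /^start (\w+)$/) { $listener->start_element($1); }
        elsif ($line =~ /^end (\w+)$/)   { $listener->end_element($1); }
        elsif ($line =~ /^(\w+)=(.*)$/)  { $listener->characters($1, $2); }
    }
    return $listener->results;
}

my $report = join "\n",
    'start result', 'query_name=AP004641.2',
    'start hit', 'name=gb|443893|124775', 'significance=2e-022',
    'start hsp', 'bits=92.0', 'end hsp',
    'end hit',
    'end result';

my @results = parse_toy_report($report, ToyHitListener->new);
for my $r (@results) {
    for my $h (@{ $r->{hits} }) {
        print join("\t", $r->{query_name}, $h->{name}, $h->{significance}), "\n";
    }
}
```

Run as is, the sketch prints one tab-separated line with the query name, hit name and significance, while the bits value sent inside the HSP events is never stored. Swapping in a different listener class with the same three methods would change what gets built without touching the event thrower, which is the design choice the section above describes.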
The whole parser/listener design assumes that you want to process all the data for a result before moving on to the next one. From the listener's standpoint this means you have to store all the data you just got from the parser. Whether this is in memory, or potentially stored in a temporary file or database, would be up to the implementation.

9. Useful links

SearchIO Modules

[35]SearchIO.pm
[36]SearchIO/axt.pm
[37]SearchIO/blast.pm
[38]SearchIO/blasttable.pm
[39]SearchIO/blastxml.pm
[40]SearchIO/EventHandlerI.pm
[41]SearchIO/exonerate.pm
[42]SearchIO/fasta.pm
[43]SearchIO/FastHitEventBuilder.pm
[44]SearchIO/hmmer.pm
[45]SearchIO/IteratedSearchResultEventBuilder.pm
[46]SearchIO/megablast.pm
[47]SearchIO/psl.pm
[48]SearchIO/SearchResultEventBuilder.pm
[49]SearchIO/SearchWriterI.pm
[50]SearchIO/sim4.pm
[51]SearchIO/waba.pm
[52]SearchIO/wise.pm
[53]SearchIO/Writer/BSMLResultWriter.pm
[54]SearchIO/Writer/GbrowseGFF.pm
[55]SearchIO/Writer/HitTableWriter.pm
[56]SearchIO/Writer/HSPTableWriter.pm
[57]SearchIO/Writer/HTMLResultWriter.pm
[58]SearchIO/Writer/ResultTableWriter.pm
[59]SearchIO/Writer/TextResultWriter.pm
[60]Search/BlastUtils.pm
[61]Search/DatabaseI.pm
[62]Search/GenericDatabase.pm
[63]Search/Hit/BlastHit.pm
[64]Search/Hit/Fasta.pm
[65]Search/Hit/GenericHit.pm
[66]Search/Hit/HitFactory.pm
[67]Search/Hit/HitI.pm
[68]Search/Hit/HMMERHit.pm
[69]Search/Hit/PsiBlastHit.pm
[70]Search/HSP/BlastHSP.pm
[71]Search/HSP/FastaHSP.pm
[72]Search/HSP/GenericHSP.pm
[73]Search/HSP/HMMERHSP.pm
[74]Search/HSP/HSPFactory.pm
[75]Search/HSP/HSPI.pm
[76]Search/HSP/PsiBlastHSP.pm
[77]Search/HSP/PSLHSP.pm
[78]Search/HSP/WABAHSP.pm
[79]Search/Iteration/GenericIteration.pm
[80]Search/Iteration/IterationI
[81]Search/Processor.pm
[82]Search/Result/BlastResult.pm
[83]Search/Result/GenericResult.pm
[84]Search/Result/HMMERResult.pm
[85]Search/Result/ResultFactory.pm
[86]Search/Result/ResultI.pm
[87]Search/Result/WABAResult.pm
[88]Search/SearchUtils.pm
[89]Search/BlastStatistics.pm
[90]Search/GenericStatistics.pm
[91]Search/StatisticsI.pm

References

1. http://www.duke.edu/
2. http://upg.duke.edu/
3. http://cgt.genetics.duke.edu/
4. mailto:jason-at-bioperl.org
5. http://www.cognia.com/
6. mailto:brian-at-cognia.com
7. file://localhost/cygdrive/c/docbook/deploy/SearchIO/SearchIO.html#background
8. file://localhost/cygdrive/c/docbook/deploy/SearchIO/SearchIO.html#design
9. file://localhost/cygdrive/c/docbook/deploy/SearchIO/SearchIO.html#use
10. file://localhost/cygdrive/c/docbook/deploy/SearchIO/SearchIO.html#creating-reports
11. file://localhost/cygdrive/c/docbook/deploy/SearchIO/SearchIO.html#implementation
12. file://localhost/cygdrive/c/docbook/deploy/SearchIO/SearchIO.html#outputting
13. file://localhost/cygdrive/c/docbook/deploy/SearchIO/SearchIO.html#extending
14. file://localhost/cygdrive/c/docbook/deploy/SearchIO/SearchIO.html#lightweight
15. file://localhost/cygdrive/c/docbook/deploy/SearchIO/SearchIO.html#links
16. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SeqIO.html
17. http://doc.bioperl.org/releases/bioperl-1.4/Bio/PrimarySeq.html
18. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO.html
19. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Result/ResultI.html
20. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Hit/HitI.html
21. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/HSP/HSPI.html
22. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO.html
23. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Result/ResultI.html
24. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Result/ResultI.html
25. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SimpleAlign.html
26. http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
27. http://doc.bioperl.org/
28. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO.html
29. http://cvs.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/Bio/SearchIO/?cvsroot=bioperl
30. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/SearchResultEventBuilder.html
31. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO.html
32. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO.html
33. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/HSP/HSPI.html
34. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/HSP/GenericHSP.html
35. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO.html
36. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/axt.html
37. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/blast.html
38. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/blasttable.html
39. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/blastxml.html
40. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/EventHandlerI.html
41. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/exonerate.html
42. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/fasta.html
43. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/FastHitEventBuilder.html
44. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/hmmer.html
45. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/IteratedSearchResultEventBuilder.html
46. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/megablast.html
47. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/psl.html
48. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/SearchResultEventBuilder.html
49. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/SearchWriterI.html
50. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/sim4.html
51. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/waba.html
52. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/wise.html
53. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/Writer/BSMLResultWriter.html
54. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/Writer/GbrowseGFF.html
55. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/Writer/HitTableWriter.html
56. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/Writer/HSPTableWriter.html
57. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/Writer/HTMLResultWriter.html
58. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/Writer/ResultTableWriter.html
59. http://doc.bioperl.org/releases/bioperl-1.4/Bio/SearchIO/Writer/TextResultWriter.html
60. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/BlastUtils.html
61. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/DatabaseI.html
62. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/GenericDatabase.html
63. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Hit/BlastHit.html
64. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Hit/Fasta.html
65. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Hit/GenericHit.html
66. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Hit/HitFactory.html
67. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Hit/HitI.html
68. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Hit/HMMERHit.html
69. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Hit/PsiBlastHit.html
70. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/HSP/BlastHSP.html
71. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/HSP/FastaHSP.html
72. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/HSP/GenericHSP.html
73. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/HSP/HMMERHSP.html
74. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/HSP/HSPFactory.html
75. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/HSP/HSPI.html
76. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/HSP/PsiBlastHSP.html
77. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/HSP/PSLHSP.html
78. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/HSP/WABAHSP.html
79. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Iteration/GenericIteration.html
80. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Iteration/IterationI.html
81. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Processor.html
82. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Result/BlastResult.html
83. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Result/GenericResult.html
84. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Result/HMMERResult.html
85. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Result/ResultFactory.html
86. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Result/ResultI.html
87. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/Result/WABAResult.html
88. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/SearchUtils.html
89. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/BlastStatistics.html
90. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/GenericStatistics.html
91. http://doc.bioperl.org/releases/bioperl-1.4/Bio/Search/StatisticsI.html