-------------------------------------------------------
        FTP resources for the dbSNP database
-------------------------------------------------------

***************************************
         CURRENT ANNOUNCEMENTS
***************************************

Rev: Oct 25, 2003
Redistribution of platform-specific backup files is no longer supported.
Please see DATABASE STRUCTURE (below) for a description of the traditional
format (schema files and tab-delimited tables for data) and tips for
maintaining a local copy of the database.

****************************************

This document describes the FTP repository of dbSNP data with the
following sections:

   I.    CONTENT OVERVIEW
   II.   DIRECTORY STRUCTURE
   III.  FORMAT DESCRIPTIONS
   IV.   ADDITIONAL FORMATS FOR BATCH QUERY SERVICE
   V.    FLANKING SEQUENCE IN SYBASE, ASN.1 and XML REPORT FORMATS
   VI.   HISTORY OF CHANGES TO THE FTP SITE
   VII.  DATA PROCESSING AND SUMMARY MEASURES
   VIII. XML GENOTYPE EXCHANGE FORMAT

---------------------------------------------------
dbSNP is available in several download formats
---------------------------------------------------

I. CONTENT OVERVIEW: Description and Update Frequency

NCBI supports the public redistribution of dbSNP by providing
gzip-compressed data dumps in four data formats.

dbSNP is in a state of growth, both in terms of the rate of submissions
and in terms of the relational schema used by NCBI to efficiently represent
both the submitted data and the results of post-submission computation.
Given this dynamic environment, we have elected to periodically generate
complete dumps of the database rather than quarterly dumps and
weekly/monthly updates. These dumps are now refreshed weekly, usually on
Sunday nights.

NCBI reserves the right to change the frequency of these updates as our
computational production cycle matures. Furthermore, NCBI reserves the
right to change the structure of these data formats at any time. The
documentation sources for our data formats are noted below.
These sources will also be updated when changes are made to a data format.

Announcements for the release of new builds and notification of corrections
to existing database content will be posted to a public mailing list. Please
subscribe at http://www.ncbi.nlm.nih.gov/mailman/listinfo/dbsnp-announce
to receive these notifications.

II. DIRECTORY STRUCTURE

***** PLEASE NOTE THAT THE DIRECTORY STRUCTURE HAS ******
***** CHANGED TO SEGREGATE DATA BY ORGANISM AND    ******
***** MAP POSITION                                 ******

Access to the NCBI FTP site is available via the web or anonymous FTP.
The current URL/host addresses are:

   World Wide Web:  ftp://ftp.ncbi.nih.gov/snp/
   Anonymous FTP:   host ftp.ncbi.nih.gov
                    cd snp

FILE COMPRESSION:
All files on the site are compressed with the standard gzip compression
utility. Further information may be found at
http://www.gnu.org/software/gzip/gzip.html

Directories and subdirectories:

   /bin             software tools for using ASN.1 binaries
   /specs           ASN.1 and XML specifications for dbSNP docsum data
                    structure
   /MSSQL           database dump of all dbSNP tables
   /ss_fasta        fasta format for all submissions in dbSNP
   /{organism}      organism-specific data in multiple report formats

Top-level Organism-specific Directories:

   human
   mouse
   rat
   chimpanzee
   plasmodium

--subdirectories of /{organism} by format--

   /ASN1_bin        RefSNP docsum in ASN.1 binary format.
   /ASN1_flat       RefSNP docsum from ASN.1 binary in human readable
                    flatfile format
   /chr_rpts        RefSNPs per chromosome sorted by chromosome location
   /rs_fasta        fasta format for non-redundant refSNP clusters by
                    chromosome
   /XML             submission format and XML exchange format for dbSNP
                    refSNP clusters including: submissions (ss#'s) in
                    cluster, mapping information, gene function information
                    computed from analysis of reference genome sequence,
                    snp-links, accessions, submitter comments, comments on
                    meth-failure, submitter defined gene contexts, flanking
                    sequence and alleles, population definitions and allele
                    frequencies.
                    XML DTD available in /specs directory (above).
   /genome_reports  Summary reports on SNPs in genes, SNP density on the
                    genome, and intervals of genome sequence with little or
                    no SNP content.

Reports are generated with chXX appended to the report style name to
designate the chromosome for the respective data. Data are mapped as follows:

   ch1-ch22, chX, chY   chromosomes 1-22, X, Y respectively
   chMulti              variations that mapped to multiple chromosomes
   chNotOn              variations that did not map to any current chromosome

Mapping is defined by BLAST analysis of variation flanking sequence to the
current NCBI genome assembly (NT_ contigs).

Download formats      FTP Subdirectory    Data structure
----------------      ------------------  -----------------------
submission flatfile   submit_format       submitted data.
                                          Data are subdivided by year and
                                          quarter.

FASTA                 ss_fasta            flanking sequence for BLAST.
                                          Data are subdivided by year and
                                          quarter.
                      rs_fasta            flanking sequence for BLAST.
                                          Data for human are subdivided by
                                          chromosome location. Other
                                          organisms are currently in a
                                          single file.

ASN.1 binary          ASN1_bin            Refsnp docsums for data exchange
                                          and map summaries (binary version).
                                          Data are divided by chromosome
                                          assignment.

ASN.1 flatfile        ASN1_flat           Refsnp docsums for data exchange
                                          and map summaries (flatfile
                                          version). Data are divided by
                                          chromosome assignment.

XML data exchange     XML/XML_brief       RefSNP mapping+locus information
                                          and submitter batch/sequence
                                          information for all submissions in
                                          the set. Data are divided by
                                          chromosome assignment.

III. FORMAT DESCRIPTIONS

-------------------
Submission flatfile
-------------------

Documentation Source:
http://www.ncbi.nlm.nih.gov/SNP/get_html.cgi?whichHtml=how_to_submit

The submission flatfile format reports database contents in the format of
the original database submissions. The dbSNP "how to submit" document is the
reference that defines this format. Only the successfully validated sections
of a submission that are loaded into dbSNP are dumped back into this report
format.
The submission format divides submitted data into "TYPE" sections, each of
which is dumped into one of the following files:

   File             Notes
   ---------------  ------------------------------
   contact.rep      Handle definitions and submitter contact information
   publicat.rep     Publications cited in the database
   method.rep       Assay methods defined by submitters
   populatn.rep     Population descriptions defined by submitters
   snpassay.rep     Assay reports for all SNPs in the database. These
                    reports use HANDLE, PUBLICATION, METHOD and POPULATION
                    IDs defined in the above files.
   popuse.rep       Population frequency data. These reports use HANDLE,
                    METHOD, POPULATION and ASSAY IDs defined in the above
                    files.
   induse.rep       Individual genotype data. These reports use HANDLE,
                    METHOD, CITATION, POPULATION and ASSAY IDs defined in
                    the above files.
   novar.rep        Reports of STS/sequences with no variation detected.

**** NOTE ****
This format only dumps data defined in the submission document. It does NOT
include post-submission data objects or links computationally derived by
NCBI, such as RefSNP IDs (rs#) or links between SNPs and other resources
like GenBank accession numbers or LocusIDs.

-------------------
FASTA sequence
-------------------

Documentation Source: http://www.ncbi.nlm.nih.gov/BLAST/fasta.html

This format provides the flanking sequence for each report of variation in
dbSNP, as well as all submitted sequences that have no variation. This data
format is typically used for BLAST applications (see following section).
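The ss and rs deflines documented in this FASTA section are pipe-delimited, so they can be split mechanically. The following is only a minimal Python sketch, assuming real deflines look exactly like the ss and rs examples given in this document; the function name parse_defline, the dictionary keys, and the "extra" list for unlabeled fields (submitter handle and local SNP ID) are illustrative choices, not part of any NCBI tool.

```python
import re

# Header pattern modeled on the example deflines in this README:
#   >gnl|dbSNP|ss271_allelePos=51totallen=101
# Optional whitespace is tolerated because actual files may differ (assumption).
_HEAD = re.compile(
    r"^>gnl\|dbSNP\|(?P<id>(?:ss|rs)\d+)[_\s]*"
    r"allelePos=(?P<pos>\d+)\s*totallen=(?P<len>\d+)$"
)


def parse_defline(defline):
    """Split an ss or rs defline into a dict (hypothetical helper)."""
    parts = defline.strip().split("|")
    # The first three pipe-separated pieces are: >gnl, dbSNP, id plus offsets.
    head = _HEAD.match("|".join(parts[:3]))
    if head is None:
        raise ValueError("unrecognized defline: %r" % defline)
    record = {
        "id": head.group("id"),
        "allele_pos": int(head.group("pos")),
        "total_len": int(head.group("len")),
    }
    for field in parts[3:]:
        if "=" in field:
            # Labeled fields such as taxid=9606 or alleles='G/A'.
            key, value = field.split("=", 1)
            record[key] = value.strip("'\"")
        else:
            # Unlabeled fields (ss records only): submitter handle, local SNP ID.
            record.setdefault("extra", []).append(field)
    return record
```

No-variation deflines use a different layout and are deliberately rejected by this sketch with a ValueError rather than guessed at.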
Two dumps are available:

   File             Notes
   ---------------  ------------------------------
   ss.fas           contains all submitted SNP sequences in FASTA format
   rs.fas           all the reference SNP sequences in FASTA format

*** FASTA format and data structure for an ss record ***

defline:
>gnl|dbSNP|ss271_allelePos=51totallen=101|DEBNICK|lp03022|taxid=9606|mol=Genomic|subsnpClass=1|alleles='G/A'

Fields of the defline, in order:

   >                 deflines for FASTA records start with ">"
   gnl               object-type=general
   dbSNP             database name
   ss271             dbSNP ss#
   allelePos=51      offset of SNP in sequence
   totallen=101      total length of sequence
   DEBNICK           Submitter
   lp03022           SubmitterSNPID
   taxid=9606        organism
   mol=Genomic       molecule type
   subsnpClass=1     class of variation
   alleles='G/A'     list of alleles

5' sequence: CTGCATCACA TGTACTGATT CTGTCCATTG GAACAGAGAT GATGACTGGT
variation:   R
3' sequence: TTACTAAACC CTGAGCCCTG GTGTTTCTGT TGATAGGGGG TTGCATTGAT

Notes:
1. If "variation" is a single nucleotide polymorphism (SNP) then the
   appropriate IUPAC nucleotide ambiguity letter is selected to represent
   the reported possible allele states.
2. In all other cases (microsatellite, insertion/deletion) "variation" is
   represented as a single "N" on the variation line of the FASTA report.
3. If the string of alleles is more than 30 characters, the list of alleles
   is replaced by the tag "lengthTooLong".

*** FASTA format and data structure for an rs record ***

defline:
>gnl|dbSNP|rs271_allelePos=51totallen=101|taxid=9606|snpClass=1|alleles='G/A'

Fields of the defline, in order:

   >                 deflines for FASTA records start with ">"
   gnl               object-type=general
   dbSNP             database name
   rs271             rs#
   allelePos=51      offset
   totallen=101      length
   taxid=9606        taxID
   snpClass=1        SNP class
   alleles='G/A'     list of alleles

5' sequence: CTGCATCACA TGTACTGATT CTGTCCATTG GAACAGAGAT GATGACTGGT
variation:   R
3' sequence: TTACTAAACC CTGAGCCCTG GTGTTTCTGT TGATAGGGGG TTGCATTGAT

Notes:
1. rs FASTA records do not have submitter/local SNP ID on the defline,
   since they are clustered data objects constructed from one or more ss
   records.
2. SNP class is defined in /specs/docsum.asn to classify variations as
   strict single nucleotide polymorphism (1); insertion/deletion (2);
   unclassified heterozygous (3); microsatellite (4); named without allele
   sequence (5); or no variation (6).

*** FASTA format and data structure for a no variation record ***

defline:
>gnl|dbSNP|rs16598 NoVariation total len = 241

Fields of the defline, in order:

   >                 deflines for FASTA records start with ">"
   gnl               object-type=general
   dbSNP             database name
   rs16598           dbSNP ss#/rs#
   NoVariation       No variation indicator
   total len = 241   total length of sequence

sequence:
cacctccaac acccttcTTT TCTTTGAACA AGATTTTTCC TTAATTCCCC AATACTCCCT
TTGAATATAT GATTTTAGCC ACCATCATAG CGAATTGCAT CGTCCTCGCA CTGGAGCAGC
ATCTGCCTGA TGATGACAAG ACCCCGATGT CTGAACGGCT GGTGAGTGAT GTCTTTTCTC
AGGGTCTTCT CCTTGGCTTT AGCAGGACAT TAATTTTTGG GGGAGTggag cagggcacag

Notes:
1. No variation records receive a dbSNP "ss" accession number just like
   reports of variation in SNPASSAY sections. They participate in the
   clustering algorithm we use to construct unique RefSNP sets. As a
   consequence, no variation records in the RefSNP set (rs.fas) indicate
   unique sequence regions that appear to be deficient of variation.

**** Making a local BLAST database with dbSNP FASTA files ****

To create a local BLAST database of dbSNP sequences, you must do three
things:

1. download a local copy of the NCBI BLAST executables and install them in
   a Unix or Windows environment
2. download the desired ss.fas or rs.fas file from the dbSNP FTP site
3. convert the dbSNP FASTA file into a BLAST database by running the
   "formatdb" program on the desired FASTA file

Instructions for performing these steps may be found online at:
   http://www.ncbi.nlm.nih.gov/BLAST/newblast.html#standalone
   http://genome.nhgri.nih.gov/blastall/blast_install/

-------------------
DATABASE STRUCTURE
-------------------

Documentation Source:
The current schema of the dbSNP database is defined by SQL DDL (Data
Definition Language) suitable for regenerating both tables and procedures in
the schema. The dbSNP DDL file can be found in the mssql subdirectory as:
   ftp://ftp.ncbi.nlm.nih.gov/snp/mssql/schema/dbSNP_table.sql.gz
Table indexing is provided in the files dbSNP_foreign_key.sql.gz and
dbSNP_index_contraint.sql.gz.

Additional documentation:
The schema subdirectory also contains a data dictionary and entity
relationship diagram for the database (DataDictionary_b116.html and
erd_dbSNP.pdf, respectively).

Contents:
The entire contents of dbSNP are dumped table-wise into the "mssql/data"
subdirectory as files named "tablename.bcp". Fields of data within each file
are tab delimited. (Note: the delimiter was changed on 11/22/99. It used to
be '|'.)

All table definitions and bcp dumps are now provided in gzip compression
format due to the large size of the database.

**** Maintaining a local Sybase copy of dbSNP ****

Local copies of the dbSNP database should be re-created with each build,
using the latest dbSNP_table.sql file discussed above. This will ensure that
all tables are loaded correctly.

*** NOTE ***
A sample Unix C shell script, cmd.create_local_dbSNP, has been provided on
the FTP site: ftp://ftp.ncbi.nih.gov/snp/mssql/loadscript.
cmd.create_local_dbSNP shows how to use dbSNP_table.sql and *.bcp to create
a local SQL database. Users of other database platforms can use the same
method with a few modifications of platform-specific commands.
cmd.create_local_dbSNP first creates all tables without indexes.
This allows fast loading of bcp data. It then loads the table data and, as
an error check, compares the number of lines in each data file with the
table row count.

**** Importing sections of dbSNP into a local spreadsheet ****

The ".bcp" files in the "mssql/data" subdirectory may be loaded into most
spreadsheet programs by setting the field delimiter character to "tab".

-------------------
ASN.1 data exchange
-------------------

Documentation Source: ftp://ftp.ncbi.nih.gov/snp/specs/docsum.asn

"docsum.asn" defines the denormalized summary data for each refSNP
(non-redundant) variation in the dbSNP database. The docsum provides a
denormalized view of dbSNP, with information about variation, genes and map
position(s) provided for variation records.

We have provided three data files to support the download and parsing of
these data:

1. Binary version of the structured ASN.1 records
   A complete database dump is also provided in compressed binary form in
   the files in the directory ASN1_bin. The data are organized by
   chromosome. Submissions that map to multiple chromosomes are reported in
   the file "chMulti", and submissions that do not map to the current
   reference sequence are in the file "chNotOn". The binary dumps can be
   read and extracted with any standard ASN.1 tool.

-------------------
FLATFILE DOCSUM
-------------------

A flat-file report has been generated from the ASN.1 datafiles, and is
provided in the files '/ASN1_flat/ds_flat_chXX.flat'. As with all of the
large report dumps, files are generated per chromosome (chXX in file name).
These files have been constructed with the following format:

1. Rows start with one of the keywords listed in the field table below
   (rs, ss, SNP, VAL, MAP, CTG, GP, LOC, GBL, SEQ).
2. Fields are delimited by the pipe '|' character.
3. Each refSNP will have a 'rs', 'ss', 'SNP', 'VAL' and 'MAP' line with
   summary information about submitter ID, SNP alleles, variation,
   validation and map information, respectively.
4. Each refSNP will have a set of contig (CTG) and locus (LOC) lines
   reporting each genomic position predicted for the variation by in silico
   sequence analysis.
5. Each refSNP will have a set of locus (LOC) lines reporting the IDs of
   genes that fall within 2 kb of the SNP position. Functional class has
   been implemented to report if a variation is in a locus region,
   transcript, or coding region. The latter is designated as
   contig-reference, coding-synonymous, coding-nonsynonymous,
   coding-undetermined, or coding-exception.
6. Each refSNP may have additional GenBank locus (GBL) lines to define
   functional classification based on alignment to non-contig sequence
   (usually mRNA sequences).
7. Each refSNP will have a set of sequence (SEQ) lines reporting the
   position of the variation on finished and/or draft sequence.
8. Fields with no value are reported with "?" as the value.

The lines and fields reported in the flatfile format are:

KEYWORD  docsum.asn FIELDS

rs       1. NSE-rs.refsnp-id
         2. NSE-rs.organism
         3. NSE-rs.taxid
         4. NSE-rs.snp-class
         5. NSE-rs.genotype
         6. NSE-rs.linkout
         7. NSE-rs.last-action

ss       1. NSE-ss.subsnp-id
         2. NSE-ss.handle
         3. NSE-ss.loc-snp-id
         4. NSE-ss.orient            (+ =forward, - =reverse)

SNP      1. NSE-rs.observed
         2. NSE-rs.het
         3. NSE-rs.het-SE

VAL      1. NSE-rs.validated
         2. NSE-rs.valid-prob-min
         3. NSE-rs.valid-prob-max
         4. NSE-rs.snp-type

MAP      1. NSE-rs.ncbi-num-chr-hits  number of chromosomes hit during NCBI
                                      mapping
         2. NSE-rs.ncbi-num-ctg-hits  number of contigs hit during NCBI
                                      mapping
         3. NSE-rs.ncbi-num-seq-loc   total number of hits to NCBI genome
                                      assembly
         4. NSE-rs.ncbi-mapweight     (1=unique in genome, 2=hit twice,
                                      3=hit 3-9 times, 10=10+ times)
         5. NSE-rs.ucsc-num-chr-hits  number of chromosomes hit during UCSC
                                      mapping
         6. NSE-rs.ucsc-num-ctg-hits  number of contigs hit during UCSC
                                      mapping
         7. NSE-rs.ucsc-num-seq-loc   total number of hits to UCSC genome
                                      assembly
         8. NSE-rs.ucsc-mapweight     (1=unique in genome, 2=hit twice,
                                      3=hit 3-9 times, 10=10+ times)

A variation will have a CTG line for each map location in NCBI contig (CTG)
coordinates.

CTG      1. NSE-rsContigHit.chromosome
         2. NSE-rsMapLoc.physmap-int
         3. NSE-rsContigHit.contig-id:NSE-rsContigHit.version
         4. NSE-rsMapLoc.asn-from     beginning map location in contig
                                      coordinates
         5. NSE-rsMapLoc.asn-to       ending map location in contig
                                      coordinates
         6. NSE-rsMapLoc.loctype      (1=range '..'; 2=exact;
                                      3=between '^' adjacent bases)
         7. NSE-rsMapLoc.orient       (+ =forward, - =reverse)

A variation will have a GP line for each map location in UCSC contig
(golden path) coordinates.

GP       1. NSE-rsUCSCContigHit.chromosome
         2. NSE-rsMapLoc.physmap-int
         3. NSE-rsUCSCContigHit.contig-id:NSE-rsContigHit.version
         4. NSE-rsMapLoc.asn-from     beginning map location in contig
                                      coordinates
         5. NSE-rsMapLoc.asn-to       ending map location in contig
                                      coordinates
         6. NSE-rsMapLoc.loctype      (1=range '..'; 2=exact;
                                      3=between '^' adjacent bases)
         7. NSE-rsMapLoc.orient       (+ =forward, - =reverse)

A variation will have a locus (LOC) line for each gene locus feature defined
on NCBI contigs (CTG).

LOC      1. NSE-FxnSet.symbol
         2. NSE-FxnSet.locus-id
         3. NSE-FxnSet.fxn-class-contig
         If the variation is determined to be in a coding region, the
         following additional fields may be defined:
         4. NSE-FxnSet.allele         the allele for the variation
         5. NSE-FxnSet.reading-frame  the position in codon (1,2,3) if
                                      applicable
         6. NSE-FxnSet.residue        the translated amino acid residue for
                                      this allele
         7. NSE-FxnSet.aa-position    the position of the amino acid in
                                      peptide sequence
         In these cases, the LOC line may refer to the functional context of
         individual alleles instead of the variation as a single entity.

A variation may have additional GenBank locus (GBL) lines that define
functional classification based on the alignment of the variation with an
mRNA sequence instead of contig sequence.

GBL      1. NSE-rsLocusID.symbol
         2. NSE-rsLocusID.locus-id
         3. NSE-rsLocusID.fxn-class-mrna
         4. NSE-FxnSet.allele
         5. NSE-FxnSet.reading-frame
         6. NSE-FxnSet.residue

A variation will have a SEQ line for each map location in sequence component
coordinates. Variations that map to multiple bases will have a location
range denoted with '..'. Variations that map as an insertion between
adjacent bases will have a location denoted with '^'.

SEQ      1. NSE-SeqLoc.source-db
            where ref-mrna   is the NCBI RefSeq mRNA collection,
                  gb-mrna    is the set of organism-specific mRNAs in
                             GenBank,
                  gb-small   is likewise the set of GenBank DNA sequences
                             <30kb in length,
                  hgs-finish is the set of finished genome sequences,
                  hgs-draft  is the set of draft genome sequences, and
                  bes        is the set of BAC-end sequences
         2. NSE-rsSeqHit.accession
         3. NSE-rsSeqHit.version
         4. NSE-SeqLoc.asn-from [../^][NSE-SeqLoc.asn-to]
         5. NSE-SeqLoc.loc-type       (1=range '..'; 2=exact;
                                      3=between '^' adjacent bases)
         6. NSE-SeqLoc.orient         (+ =forward, - =reverse, ? =unknown)

------------------------
XML Data Exchange Format
------------------------

Documentation Source:
   ftp://ftp.ncbi.nih.gov/snp/specs/NSE.mod
   ftp://ftp.ncbi.nih.gov/snp/specs/NCBI_Entity.mod
   ftp://ftp.ncbi.nih.gov/snp/specs/docsum.asn

Note: NSE.mod is the XML tagging-style equivalent of the ASN.1 data
definition in docsum.asn. NCBI_Entity.mod is the DTD specification of the
common data types.

The XML formats provide query-specific information about refSNP clusters and
their members in the NCBI SNP Exchange (NSE) format.
This format currently has five modules:

   NSE-ExchangeSet    (the attached XML report if appropriate)
   NSE-BaseURLSet     NCBI resource IDs, and the link ID within the resource
   NSE-SubmitterList  Contact information for all handles assigned in dbSNP
   NSE-AssayList      Set of all batch-level information on assay conditions
                      (methods, sample sizes, populations, strains,
                      citations, submitter linkouts, and comments) for
                      submissions (ss#'s) in a refSNP cluster
   NSE-PopList        Set of all batch-level information (methods, comments,
                      and citations) for allele frequency estimation.

Some tags in the NSE-ExchangeSet reference data in these other XML modules,
as noted in NSE.mod (above). These modules are available via anonymous ftp
from ftp://ftp.ncbi.nih.gov/snp/human/XML/

The XML exchange format is prepared in two versions, brief and full, with
the full version including additional information about each submission in
the refSNP cluster as described above. Both versions provide RefSNP summary
information including:

   - the set of hits to reference genome sequence
   - functional relationships to annotated genes on reference sequence
   - submitter information (contact and batch) for all batches in dbSNP
   - flanking sequence information for each submission in a refSNP cluster

The full version includes allele frequency information by ss# and
population.

VIEWING XML DATA FILES WITH INTERNET EXPLORER

The Microsoft Internet Explorer Web Browser can be used to view the (plain
text) XML data files with the following two steps:

1. Save the desired XML files to a local folder and extract them with an
   uncompression utility. Make sure the extracted file has the extension
   ".xml".
2. Save the following XML DTD/MOD files to the same local folder. Make sure
   they have their original filename extensions. These are plain text files
   that define the data structure.
      /snp/specs/NSE.dtd
      /snp/specs/NSE.mod
      /snp/specs/NCBI_Entity.mod

Comments on this spec can be sent to snp-admin@ncbi.nlm.nih.gov

--------------------
CHROMOSOME REPORTS
--------------------

Chromosome reports provide an ordered list of RefSNPs in approximate
chromosome coordinates (the same coordinate system used for the NCBI genome
MapViewer). Each line gives the following information for a single RefSNP in
tab-delimited columns:

Column  Data
  1     RefSNP id (rs#)
  2     mapweight where
           1  = mapped to single position in genome
           2  = mapped to 2 positions on a single chromosome
           3  = mapped to 3-10 positions in genome (possible paralog hits)
           10 = mapped to >10 positions in genome
  3     snp_type where
           0 = not withdrawn
           1 = withdrawn
        A RefSNP may be withdrawn for several reasons; the withdrawn status
        is fully defined in the ASN.1, flatfile, and XML descriptions of the
        RefSNP. See /specs/docsum.asn for the full definition of snp-type
        values.
  4     total number of chromosomes hit by this RefSNP during mapping
  5     total number of contigs hit by this RefSNP during mapping
  6     total number of hits to genome by this RefSNP during mapping
  7     chromosome for this hit to genome
  8     contig accession for this hit to genome
  9     version number of contig accession for this hit to genome
 10     contig ID for this hit to genome
 11     position of RefSNP in contig coordinates
 12     position of RefSNP in chromosome coordinates (used to order report)
        Locations are specified in NCBI sequence location convention, where:
           x      a single number indicates a feature at base position x
           x..y   a feature that spans from x to y inclusive
           x^y    a feature that is inserted between bases x and y
 13     genes at this same position on the chromosome
 14     average heterozygosity of this RefSNP
 15     standard error of average heterozygosity
 16     maximum reported probability that RefSNP is real.
        (For computationally-predicted submissions)
 17     validated status
           0 = no validation information
           1 = cluster has 2+ submissions, with 1+ submission assayed with
               a non-computational method
           2 = at least one subsnp in cluster has frequency data submitted
           3 = non-computational method in cluster and frequency data
               present
           4 = at least one subsnp in cluster has been experimentally
               validated by submitter
 18     genotypes available in dbSNP for this RefSNP
           1 = yes, 0 = no
 19     linkout available to submitter website for further data on the
        RefSNP
           1 = yes, 0 = no
 20     dbSNP build ID when the refSNP was first created (i.e. create date)
 21     dbSNP build ID of most recent change to the refSNP cluster (update
        date), where dates are reckoned in dbSNP build IDs

IV. ADDITIONAL FORMAT DESCRIPTIONS FOR BATCH QUERY SERVICE

Users may request the following additional report formats from the dbSNP
Batch Query service (http://www.ncbi.nlm.nih.gov/SNP, select Batch Search).
Small result sets are returned via email. Large result sets are available by
ftp. In this case users will be notified by email when the report is ready,
and the email will provide a link to retrieve the data.

-----------------
RS CLUSTER REPORT
-----------------

This report format generates a table of refSNP (rs) clusters, the
submissions assigned to the cluster (ss#'s) in the current database build,
and the submitter's local ID for each ss# in the cluster. Input data can be
a list of either ss#'s or rs#'s. Output is a tab-delimited report with the
format:

   rs#     ss#     HANDLE    LOCAL_SNP_ID
   -----   -----   -------   ----------------

V. RECONSTRUCTING FLANKING SEQUENCE IN SYBASE, ASN.1 and XML REPORT FORMATS

dbSNP is a sequence-based database that relies on the flanking unique
sequence of a variation to define its map location and cluster neighbors.
If a particular experimental method only interrogates a small number of
nucleotide bases during a survey for variation, the flanking sequence may be
insufficient in length to accurately map the variation onto genome sequence.
dbSNP recognizes this fact, and encourages submitters to append known
flanking sequence to the surveyed regions to provide a minimum of 100 b.p.
of flanking sequence. However, dbSNP distinguishes these two regions as
'flank' for cut-and-paste regions of adjacent, unsurveyed sequence, and
'assay' for regions of sequence directly flanking the submitted variation
and actually surveyed for variation in some number of chromosomes.

Furthermore, all flanking sequence is stored in the dbSNP database as an
ordered set of string fragments of less than 255 characters, which are
reported as a sequence of 'seq_E' tagged strings. In general, the flanking
sequence of an individual submission (subsnp, or ss#) is made up of 5 parts
in the following order:

     5' flank            observed                3' flank
   1   2   3   4 1   2  3        1   2   3   1   2   3   4   5  <- seq_E fragments
   |   |   |   | |   |  |        |   |   |   |   |   |   |   |
5' --- --- --- - +++ ++ ++ [A/G] ~~~ ~~~ ~~~ === === === === = 3'
    (optional)   5' assay         3' assay      (optional)

The sequence of a refSNP cluster is instantiated as the sequence of its
longest subsnp member, or exemplar. At the refsnp level, the distinction is
no longer preserved between flank and assay, and the sequence is simply
reported as 5' and 3'. Again, however, the sequences are stored in the
database and reported in the ASN.1 and XML formats as an ordered list of
sequence fragments that must be concatenated.

In cases where no assay sequence was provided, the flanking sequence is
constructed solely from the 'flank' fields themselves:

     5' flank  observed    3' flank
   1   2   3   4       1   2   3   4   5  <- seq_E fragments
   |   |   |   |       |   |   |   |   |
5' --- --- --- - [A/G] === === === === = 3'

The flanking sequence segments are stored in the Sybase tables as follows:

   5' flank:
      SubSNP5Flank.line, segments ordered by column 'line_num': 0,1,2,etc.
   5' assay (250 bp. nearest to variation):
      SubSNP.seq_5
   5' assay (all additional assay sequence):
      SubSNP5primExtra.line, ordered by 'line_num'
   3' flank:
      SubSNP3Flank.line, ordered by 'line_num'
   3' assay (250 bp. nearest to variation):
      SubSNP.seq_3
   3' assay (all additional assay sequence):
      SubSNP3primExtra.line, ordered by 'line_num'

VI. HISTORY OF REVISIONS TO FTP SITE:

Rev: Dec 10, 2002
Updates for directory structure.

Rev: October 9, 2002
The "Getting Started Using the dbSNP FTP site" document was added at NCBI to
provide a guide for new users of the dbSNP FTP site.
http://www.ncbi.nlm.nih.gov/About/outreach/gettingstarted/snpftp/index.html

Rev: October 1, 2002
Added section VII. DATA PROCESSING AND SUMMARY MEASURES

Rev: May 31, 2002
The ftp address for dbSNP data (and NCBI resources in general) has changed
from ncbi.nlm.nih.gov to ftp.ncbi.nih.gov

Rev: April 30, 2002
A data dictionary for the dbSNP schema is available online in the
Documentation section or as a Rich Text File report in the file
/Sybase/schema/dbSNPdataDictionary.rtf.gz

Rev: March 21, 2002
A dbsnp-announce mailing list has been created to report the release of new
builds, announce new features, and report corrections or problems with past
or present builds. Follow the 'Announcements' link on the dbSNP home page to
subscribe to this mailing list or visit the web page at:
http://www.ncbi.nlm.nih.gov/mailman/listinfo/dbsnp-announce.

Rev: April 29, 2003
XML genotype report description added.

-- 6/7: added 2 columns to chromosome reports to track create/update build
   IDs for refSNP clusters, and expanded the FASTA defline to include the
   full set of alleles for the variation.

-- 5/31/02: ftp addresses updated to new NCBI ftp server address.

-- 3/21/02: fixed typo in documentation of GP tag lines for flatfile format.
   GoldenPath coordinates are indicated by GP tag lines.

-- 11/29/01: UCSC contig coordinates added to MAP (for summary data) and GP
   lines added for variation positions on the golden path. Amino acid
   position in peptide coordinates added to functional data (LOC).

-- 7/12/01: LOC lines in FLATFILE format were extended to include
   allele-specific functional data for variations in coding regions (allele,
   reading frame, amino acid residue).

-- 4/04/01: Additional report formats available via the dbSNP web-based
   Batch Query service introduced. RS Cluster Report format defined.

-- 1/22/01: exchange directory is moved to organism-specific XML
   subdirectories where brief and full versions of the docsum will be dumped
   by chromosome. The asn.1 and XML DTD files have been moved to the /specs
   subdirectory. The chromosome report format is fully described.

-- 10/30/00: ftp site is reorganized to keep files to a manageable size
   (< 2 Gb) and provide more information in the structure of the directories
   themselves.

-- 10/20/00: Only the fasta file is updated. Due to file size, only
   compressed files are on the ftp site.

-- 10/20/00: fasta file is chunked. The file name convention is
   [rs]fas.#m:
      rfas - rs fasta; sfas - ss fasta
      1m   - fasta sequence for rs# or ss# under 1,000,000.
      2m   - fasta sequence for rs# or ss# between 1,000,001 and 2,000,000.

-- 10/04/00: position of variation on a contig is changed from a single
   value (asn-from) to a two-value begin/end format. The field 'loctype'
   indicates if the two-value format represents a range, single position, or
   insertion between adjacent bases.

-- 7/28/00: Due to the rapid recent growth of the database, we no longer
   dump the SYBASE format table dumps in the uncompressed directory. Full
   Sybase table dumps are still available in the PC_compressed and
   Unix_compressed subdirectories.

-- 7/28/00: the docsum data has been removed from the SNP table. RefSNP
   docsum data is now kept in a packed binary format in the table SNPDocsum.

   FORMAT OF SNPDocsum Table in SYBASE DUMP:
      'id'           corresponds to 'snp_id' in table SNP
      'size'         size of docsum data in bytes
      'vars'         number of completely filled varbinary columns
      'tail'         number of bytes in the final partially filled
                     varbinary column
      'var1 - var7'  seven varbinary columns used to store the docsum
      'blob'         image type column used to store docsums that are too
                     large to fit in the seven varbinary columns

-- 6/7/00: bcp data for the SNP table comes in two flavors:
      SNP.bcp      does not contain the docsum column and cannot be used to
                   bcp data into the SNP table. Because it lacks the docsum
                   column, it is much smaller and easy to read into a
                   spreadsheet.
      SNP_bcp.bcp  has the docsum column and can be used to bcp data into
                   the SNP table.

-- 6/06/00: FASTA defline redefined for rs set. The new defline includes tax
   ID and variation class.

-- 6/05/00: ASN.1 flatfile format now includes information on all
   submissions in the refSNP cluster. This information is grouped in lines
   starting with "ss" that give the ss# (dbSNP accession for the
   submission), submitter's handle, and submitter's local ID for the
   variation.

-- 6/01/00: ASN.1 flatfile format changed slightly. LOC, CTG, SEQ lines are
   no longer interleaved. Instead, there are separate blocks for locus
   information (LOC), contig coordinates (CTG), and finished/draft sequence
   coordinates (SEQ). Lines of each type will be dumped as a set.

-- 5/22/00: refSNP docsum definition extended to include map positions in
   coordinates of contig components (finished and/or draft sequence). New
   coordinates are available in the RefSNPSeqHit data structure in ASN.1
   format, and as SEQ lines in flatfile format.
-- 4/6/00: refSNP docsum files are now available. These denormalized summaries of each refSNP cluster provide summary information on each variation in the dbSNP database. Summary information is detailed in the ASN.1 definition file 'docsum.asn', which can be found in the 'specs' subdirectory of the dbSNP ftp site.

VII. DATA PROCESSING AND SUMMARY MEASURES

"How variations are mapped to genome sequence"

When reference genome assemblies are available, we use them as anchor sequence to place refSNP clusters into a genomic context. We clean dbSNP flanking sequences with RepeatMasker and then remap them to the most current build of each genome using MegaBLAST. The mapping results then define a new non-redundant set of variations for the genome. In general a word size of 28 is used in MegaBLAST computations, but a small subset of our data has a half-flank size of 25 bases (a half flank is the 5' or 3' flanking sequence taken individually as a MegaBLAST query sequence), and this subset is blasted with a word size of 22.

To map a deletion we require that both flanking sequences are returned in the alignment, and furthermore that both penultimate bases flanking the allele are returned. In other words, the gap as defined in the alignment exactly matches the deletion as defined in dbSNP.

The complete command line for MegaBLAST is:

     megablast -U T -F m -J F -X 180 -r 10 -q -20 -P 1000 -R T -W 28

In addition, we filter the MegaBLAST results into two classes in the dbSNP database:

     0) better than 95% sequence alignment with fewer than 6 mismatches
     1) better than 75% alignment with less than 3% mismatches

Anything below the lower threshold is discarded. A non-negligible subset of dbSNP refSNPs (RS) fails to map because of heavy repeat masking of the flanking sequence.

When variations align at more than one site on the genome, they sometimes map with distinct mismatch counts. dbSNP does not make any effort to reduce map redundancy by comparing individual quality scores at each site.
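The two filter classes described above can be sketched as a small Python function. This is only an illustration, not dbSNP's production code. The function name classify_hit and its parameters are hypothetical, and the sketch assumes that "sequence alignment" means the fraction of the query flank that aligned and that the mismatch percentage is taken relative to the aligned length; the text above does not define either measure.

```python
from typing import Optional


def classify_hit(query_len: int, aligned_len: int, mismatches: int) -> Optional[int]:
    """Return filter class 0 or 1, or None when the hit would be discarded.

    Assumptions (not stated in the documentation):
      - percent alignment = aligned_len / query_len
      - percent mismatch  = mismatches / aligned_len
    """
    if query_len <= 0 or aligned_len <= 0:
        return None

    coverage = aligned_len / query_len
    mismatch_rate = mismatches / aligned_len

    # Class 0: better than 95% alignment with fewer than 6 mismatches.
    if coverage > 0.95 and mismatches < 6:
        return 0
    # Class 1: better than 75% alignment with less than 3% mismatches.
    if coverage > 0.75 and mismatch_rate < 0.03:
        return 1
    # Anything below the lower threshold is discarded.
    return None


if __name__ == "__main__":
    # Hypothetical hits: (query length, aligned length, mismatch count).
    for hit in [(100, 98, 2), (100, 80, 2), (100, 70, 0)]:
        print(hit, classify_hit(*hit))
```

Under these assumptions a hit is tested against the stricter class first, so a hit that fails class 0 only on its mismatch count can still fall into class 1.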
Generally, dbSNP reports all mapping results against the current assembly. These results do not, however, cover everything in dbSNP. There are three major cases where we do not map and/or annotate:

a. Submissions that are completely masked as repetitive elements. These are dropped from any further computations. This set of refSNPs is dumped in chromosome "rs_chMasked" on our ftp site.

b. Submissions that are defined in a cDNA context with extensive splicing. These SNPs are typically annotated on refseq mRNAs through a separate annotation process. We are working to reverse map these variations back to contig coordinates, but that has not been implemented. For now, you can find this set of variations in "rs_chNotOn" on the ftp site.

c. Variations with excessive hits to the genome. Variations with 3+ hits to the genome are neither annotated as variation features on contigs nor included in variation tracks for either the NCBI or UCSC map viewer resources. These data are in "rs_chMulti" on our ftp site.

Furthermore, the heuristics for non-SNP variations (i.e. named elements and STRs) are probably a bit too conservative, so some of these are consequently lost. While we prefer to err on the side of caution to avoid false annotation of variation in inappropriate locations, we are working to improve the hit rate for these variations as well.

"Why do the functional classifications for some variations change when a genome is re-assembled?"

Functional annotation varies from build to build because the underlying substrate, namely the reference genome sequence, is itself changing from assembly to assembly. During each assembly, the algorithms used to define 'genes' are refined to improve accuracy. Since gene features can be defined by various classes of evidence that vary in their certainty, there is currently some thrashing in estimates of gene numbers and their precise exon structure on the genome.
Duplicates are identified and merged, spurious annotations are removed, and new evidence is included as the annotation pipelines are developed. The net result for the dbSNP user is that a SNP may be in an exon (or gene more generally) in one build, and in an intron or UTR (or intergenic DNA) in the next build if the exon (gene) is subsequently removed. The genome sequencing community is converging on a stable reference sequence. When it is finished, annotation (including SNP function) will be much more stable.

"How average heterozygosity is computed"

Average heterozygosity is computed for each refSNP cluster as described here (http://www.ncbi.nlm.nih.gov/SNP/Hetfreq.html).

VIII. XML Genotype Exchange Format
------------------------

Documentation Source: ftp://ncbi.nlm.nih.gov/snp/specs/genoex.xsd

The files in this directory are intended to provide a data structure (with data types) that can be read or manipulated by other computer programs. Because the "by chromosome" genotype exchange files are very large, opening the XML download files in an internet browser is inadvisable; doing so may cause your computer to crash.

ANNOTATION

To view the data structure definition for the genotype exchange format, download the XML schema file, genoex.xsd, located at ftp://ncbi.nlm.nih.gov/snp/specs/genoex.xsd. You can use the annotation tags () located throughout the schema to find a description of the data contained in the genotype exchange structure, since an annotation tag precedes each data description. For general information about XML schema, please see http://www.w3.org/2001/XMLSchema.

TOOLS FOR VIEWING XML FILES

If you require XML tools that can either read or manipulate the SNP genotype XML files, there are several proprietary and open source XML tools that can be used for this purpose.
Examples of such tools can be found at: http://www.google.com/search?q=XML+parser+tools+editor

Although the SNP development team has rectified any obvious cases of redundancy or overlap in dbSNP's genotype data, we will continue to curate for such cases and correct them as they arise.

Please send comments concerning dbSNP genotypes to: snpdev@ncbi.nlm.nih.gov
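
Because the "by chromosome" genotype exchange files are large, a streaming parser is one way to read them without loading a whole file into memory. The following Python sketch uses the standard library's xml.etree.ElementTree.iterparse to count elements by tag name. The function name count_elements is hypothetical, and the sample document and its tag names (Root, Item) are invented for illustration; they are not taken from genoex.xsd.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(source) -> Counter:
    """Stream through an XML file object and count elements by local tag name."""
    counts = Counter()
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Drop any "{namespace}" prefix that ElementTree puts on the tag.
        local_name = elem.tag.rsplit("}", 1)[-1]
        counts[local_name] += 1
        # Release the element's text, attributes, and children once counted.
        elem.clear()
    return counts


if __name__ == "__main__":
    # Invented sample standing in for a real genotype exchange file.
    sample = io.BytesIO(b"<Root><Item id='1'/><Item id='2'/></Root>")
    print(count_elements(sample))
```

For a downloaded file, pass an open binary file object in place of the in-memory sample, for example one returned by gzip.open(path, "rb") for a gzip-compressed download, where path is whatever local filename you chose.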