-------------------------------------------------------
        FTP resources for the dbSNP database
-------------------------------------------------------

***************************************
       CURRENT ANNOUNCEMENTS
***************************************
Rev: Oct 25, 2003
Redistribution of platform-specific backup files is no longer
supported. Please see DATABASE STRUCTURE (below) for a description
of the traditional format (schema files and tab-delimited tables
for data) and tips for maintaining a local copy of the database.
****************************************

This document describes the FTP repository of dbSNP data with the
following sections:

   I. CONTENT OVERVIEW
  II. DIRECTORY STRUCTURE
 III. FORMAT DESCRIPTIONS
  IV. ADDITIONAL FORMAT DESCRIPTIONS FOR BATCH QUERY SERVICE
   V. RECONSTRUCTING FLANKING SEQUENCE IN SYBASE, ASN.1 and XML REPORT FORMATS
  VI. HISTORY OF REVISIONS TO FTP SITE
 VII. DATA PROCESSING AND SUMMARY MEASURES
VIII. XML GENOTYPE EXCHANGE FORMAT

     ---------------------------------------------------
       dbSNP is available in several download formats
     ---------------------------------------------------

I. CONTENT OVERVIEW: Description and Update Frequency

NCBI supports the public redistribution of dbSNP by providing
gzip-compressed data dumps in four data formats.
 
dbSNP is in a state of growth, both in terms of the rate of 
submissions, and in terms of the relational schema used by NCBI to 
efficiently represent both the submitted data and the results of
post-submission computation. Given this current dynamic environment, 
we have elected to periodically generate complete dumps of the 
database rather than quarterly dumps and weekly/monthly updates.

These dumps are now refreshed weekly, usually on Sunday nights. 

NCBI reserves the right to change the frequency of these updates as
our computational production cycle matures. Furthermore, NCBI reserves
the right to change the structure of these data formats at any time. 
The documentation sources for our data formats are noted below. These
sources will be updated as well when changes are made to a data format.

Announcements for the release of new builds and notification of corrections
to existing database content will be posted to a public mailing list. Please
subscribe at http://www.ncbi.nlm.nih.gov/mailman/listinfo/dbsnp-announce
to receive these notifications.


II. DIRECTORY STRUCTURE

***** PLEASE NOTE THAT THE DIRECTORY STRUCTURE HAS ******
*****   CHANGED TO SEGREGATE DATA BY ORGANISM AND  ******
*****                 MAP POSITION                 ******

Access to the NCBI FTP site is available via the web or anonymous FTP. 
The current URL/host addresses are:

World Wide Web:     ftp://ftp.ncbi.nih.gov/snp/
 Anonymous FTP:     host ftp.ncbi.nih.gov
                      cd snp


FILE COMPRESSION:
All files on the site are compressed with the standard gzip compression utility.
Further information may be found at http://www.gnu.org/software/gzip/gzip.html


Directories and subdirectories:

/bin               software tools for using ASN.1 binaries
/specs             ASN.1 and XML specifications for dbSNP docsum
                   data structure
/MSSQL             database dump of all dbSNP tables

/ss_fasta          fasta format for all submissions in dbSNP
/{organism}        organism-specific data in multiple report formats
                   Top-level Organism-specific Directories:
                       human
                       mouse
                       rat
                       chimpanzee
                       plasmodium

--subdirectories of /{organism} by format--
   /ASN1_bin       RefSNP docsum in ASN.1 binary format.
   /ASN1_flat      RefSNP docsum from ASN.1 binary in human readable flatfile format
   /chr_rpts       RefSNPs per chromosome sorted by chromosome location
   /rs_fasta       fasta format for non-redundant refSNP clusters by chromosome
   /XML            submission format and XML exchange format for
                   dbSNP refSNP clusters including:
                           submissions (ss#'s) in cluster, mapping
                           information, gene function information
                           computed from analysis of reference genome
                           sequence, snp-links, accessions, submitter comments, 
                           comments on meth-failure, submitter defined 
                           gene contexts, flanking sequence and alleles,
                           population definitions and allele frequencies.
                           XML DTD available in /specs directory (above).
   /genome_reports Summary reports on SNPs in genes, SNP density on the genome,
                   and intervals of genome sequence with little or no SNP content.




Reports are generated with chXX appended to the report style name to designate
the chromosome for the respective data. Data are mapped as follows:
ch1-ch22, chX, chY         chromosomes 1-22,X,Y  respectively
chMulti                    variations that mapped to multiple chromosomes
chNotOn                    variations that did not map to any current chromosome


Mapping is defined by BLAST analysis of variation flanking sequence to the current
NCBI genome assembly (NT_ contigs).


Download formats        FTP Subdirectory      Data structure
----------------        ------------------    -----------------------
submission flatfile     submit_format         submitted data. Data are
                                              subdivided by year and quarter.

FASTA                   ss_fasta              flanking sequence for BLAST
                                              Data are subdivided by year
                                              and quarter.

                        rs_fasta              flanking sequence for BLAST. 
                                              Data for human are subdivided by
                                              chromosome location. Other
                                              organisms are currently in 
                                              single file.

ASN.1 binary            ASN1_bin              Refsnp docsums for data exchange
                                              and map summaries (binary version)
                                              Data are divided by chromosome
                                              assignment. 

ASN.1 flatfile          ASN1_flat             Refsnp docsums for data exchange
                                              and map summaries (flatfile version)
                                              Data are divided by chromosome
                                              assignment. 

XML data exchange       XML/XML_brief         RefSNP mapping+locus information and
                                              submitter batch/sequence information
                                              for all submissions in the set.
                                              Data are divided by chromosome
                                              assignment. 


III. FORMAT DESCRIPTIONS

-------------------
Submission flatfile
------------------- 

Documentation Source:
http://www.ncbi.nlm.nih.gov/SNP/get_html.cgi?whichHtml=how_to_submit

The submission flatfile format reports database contents in the format 
of the original database submissions. The dbSNP "how to submit" 
document is the reference that defines this format. Only the successfully
validated sections of a submission that are loaded into dbSNP are dumped
back into this report format. The submission format divides submitted 
data into "TYPE" sections; each of which is dumped into one of the 
following files:

File                 Notes
---------------      ------------------------------
contact.rep          Handle definitions and submitter contact information
publicat.rep         Publications cited in the database
method.rep           Assay methods defined by submitters
populatn.rep         Population descriptions defined by submitters

snpassay.rep         Assay reports for all SNPs in the database. These
                     reports use HANDLE,PUBLICATION,METHOD and POPULATION
                     IDs defined in the above files.

popuse.rep           Population frequency data. These reports use HANDLE,
                     METHOD, POPULATION and ASSAY IDs defined in the
                     above files.

induse.rep           Individual genotype data. These reports use HANDLE,
                     METHOD, CITATION, POPULATION and ASSAY IDs defined
                     in the above files.

novar.rep            Reports of STS/sequences with no variation detected.


**** NOTE ****
This format only dumps data defined in the submission document. It does NOT 
include post-submission data objects or links computationally derived by 
NCBI, such as RefSNP IDs (rs#) or links between SNPs and other resources 
like GenBank accession numbers or LocusIDs.




-------------------
FASTA sequence
------------------- 

Documentation Source:
http://www.ncbi.nlm.nih.gov/BLAST/fasta.html

This format provides the flanking sequence for each report of variation in 
dbSNP, as well as all submitted sequences that have no variation. This data 
format is typically used for BLAST applications (see following section). 
Two dumps are available:

File                 Notes
---------------      ------------------------------
ss.fas               contains all submitted snp sequences in FASTA format
rs.fas               all the reference snp sequences in FASTA format


*** FASTA format and data structure for an ss record  ***

             defline for FASTA records start with ">" 
             | object-type=general
             | |                                 total length
             | |   database name                 of sequence                                                list of
             | |    |                   offset of SNP| Submitter      organism  molecule       class of      alleles
             | |    |    dbSNP ss#      in sequence  |   | SubmitterSNPID  |        type       variation       |
             | |    |    |              |            |   |      |          |          |           |            |
    defline: >gnl|dbSNP|ss271_allelePos=51totallen=101|DEBNICK|lp03022|taxid=9606|mol=Genomic|subsnpClass=1|alleles='G/A'
5' sequence: CTGCATCACA TGTACTGATT CTGTCCATTG GAACAGAGAT GATGACTGGT 
  variation: R
3' sequence: TTACTAAACC CTGAGCCCTG GTGTTTCTGT TGATAGGGGG TTGCATTGAT 

Notes:

1. If "variation" is a single nucleotide polymorphism (SNP) then the 
   appropriate IUPAC nucleotide ambiguity letter is selected to represent 
   the reported possible allele states.

2. In all other cases (microsatellite, insertion/deletion) "variation" is 
   represented as a single "N" on the variation line of the FASTA report.

3. If the string of alleles is more than 30 characters, the list of alleles
   is replaced by the tag "lengthTooLong"
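
The rule in notes 1 and 2 can be sketched in Python as follows. This is only
an illustration, not the code NCBI uses: the function name variation_letter
is hypothetical, and the table is the standard IUPAC nucleotide ambiguity
code. Any allele list that is not a set of single bases falls back to "N".

```python
# IUPAC nucleotide ambiguity codes keyed by the set of observed bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def variation_letter(alleles):
    """Return the letter expected on the FASTA 'variation' line.

    `alleles` is the slash-separated allele string from the defline,
    for example "G/A". Hypothetical helper, written for illustration.
    """
    parts = [a.strip().upper() for a in alleles.split("/")]
    # Anything other than two or more single bases (indels,
    # microsatellites, the "lengthTooLong" tag) is reported as N.
    if len(parts) < 2 or any(len(p) != 1 or p not in "ACGT" for p in parts):
        return "N"
    return IUPAC[frozenset(parts)]
```

For the ss271 example above, variation_letter("G/A") gives "R", which matches
the variation line shown in the record.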



*** FASTA format and data structure for an rs record  ***

             defline for FASTA records start with ">" 
             | object-type=general
             | |                    
             | |   database name   
             | |    |                offset                taxID                  list of
             | |    |  rs#              |        length      |      SNP class      alleles
             | |    |    |              |           |        |          |            |
    defline: >gnl|dbSNP|rs271_allelePos=51totallen=101|taxid=9606|snpClass=1|alleles='G/A'
5' sequence: CTGCATCACA TGTACTGATT CTGTCCATTG GAACAGAGAT GATGACTGGT 
  variation: R
3' sequence: TTACTAAACC CTGAGCCCTG GTGTTTCTGT TGATAGGGGG TTGCATTGAT 

Notes:

1. rs FASTA records do not have submitter/local SNP ID on the defline, since
   they are clustered data objects constructed from one or more ss records.

2. SNP class is defined in /specs/docsum.asn to classify variations as strict
   single nucleotide polymorphism (1); insertion/deletion (2);
   unclassified heterozygous (3); microsatellite (4);
   named without allele sequence (5); or no variation (6).
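
A minimal Python sketch of reading the ss and rs deflines shown above. It
assumes the defline layout is exactly as in the two examples (including the
fused "allelePos=51totallen=101" token) and that submitter handles and local
SNP IDs contain no '|' characters. The function name and the dictionary keys
are hypothetical. No-variation deflines use a different layout and are
rejected here.

```python
import re

# SNP class codes as listed in note 2 above.
SNP_CLASS = {
    1: "single nucleotide polymorphism",
    2: "insertion/deletion",
    3: "unclassified heterozygous",
    4: "microsatellite",
    5: "named without allele sequence",
    6: "no variation",
}

# Third defline field, e.g. "ss271_allelePos=51totallen=101".
_HEAD = re.compile(r"^(ss|rs)(\d+)_allelePos=(\d+)totallen=(\d+)$")


def parse_defline(line):
    """Split a dbSNP ss/rs FASTA defline into a dictionary (illustrative)."""
    fields = line.strip().lstrip(">").split("|")
    if fields[:2] != ["gnl", "dbSNP"] or len(fields) < 3:
        raise ValueError("not a dbSNP defline: %r" % line)
    match = _HEAD.match(fields[2])
    if match is None:
        raise ValueError("unrecognized ID field: %r" % fields[2])
    rec = {
        "type": match.group(1),
        "id": int(match.group(2)),
        "allele_pos": int(match.group(3)),
        "total_len": int(match.group(4)),
    }
    rest = fields[3:]
    if rec["type"] == "ss":
        # ss records carry the submitter handle and local SNP ID.
        rec["handle"], rec["local_snp_id"] = rest[0], rest[1]
        rest = rest[2:]
    for item in rest:
        key, _, value = item.partition("=")
        rec[key] = value.strip("'")
    if "snpClass" in rec:
        rec["class_name"] = SNP_CLASS.get(int(rec["snpClass"]), "unknown")
    return rec
```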


*** FASTA format and data structure for a no variation record  ***

             defline for FASTA records start with ">" 
             | object-type=general
             | |                                       total length
             | |   database name                       of sequence
             | |    |                    No variation    |      
             | |    |    dbSNP ss#/rs#   indicator       |      
             | |    |    |               |               |
    defline: >gnl|dbSNP|rs16598 NoVariation total len = 241
   sequence: cacctccaac acccttcTTT TCTTTGAACA AGATTTTTCC TTAATTCCCC AATACTCCCT 
             TTGAATATAT GATTTTAGCC ACCATCATAG CGAATTGCAT CGTCCTCGCA CTGGAGCAGC 
             ATCTGCCTGA TGATGACAAG ACCCCGATGT CTGAACGGCT GGTGAGTGAT GTCTTTTCTC 
             AGGGTCTTCT CCTTGGCTTT AGCAGGACAT TAATTTTTGG GGGAGTggag cagggcacag 
Notes:

1. No variation records receive a dbSNP "ss" accession number just like reports
   of variation in SNPASSAY sections. They participate in the clustering algorithm
   we use to construct unique RefSNP sets. As a consequence, no variation 
   records in the RefSNP set (rs.fas) indicate unique sequence regions that
   appear to be deficient of variation.


**** Making a local BLAST database with dbSNP FASTA files ***

To create a local BLAST database of dbSNP sequences, three steps are 
required:

1. download a local copy of the NCBI BLAST executables and install them in a 
   Unix or Windows environment

2. download the desired ss.fas or rs.fas file from the dbSNP FTP site

3. convert the dbSNP FASTA file into a blast database by running the 
   "formatdb" program on the desired FASTA file

Instructions for performing these steps may be found online at:

http://www.ncbi.nlm.nih.gov/BLAST/newblast.html#standalone
http://genome.nhgri.nih.gov/blastall/blast_install/
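
Steps 2 and 3 can be sketched in Python as shown below, assuming the FASTA
file has already been downloaded in gzip-compressed form and that the legacy
"formatdb" program is on the PATH. The helper names and the database name are
hypothetical. The formatdb options used are -i (input file), -p F (nucleotide
rather than protein) and -n (database name); consult the BLAST documentation
linked above for the full option list.

```python
import gzip
import shutil


def gunzip(src, dest):
    """Decompress a gzip file `src` into the plain file `dest`."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def formatdb_command(fasta_path, db_name):
    """Build the argument list for formatting a nucleotide database."""
    # -i input FASTA, -p F = nucleotide sequences, -n output database name
    return ["formatdb", "-i", fasta_path, "-p", "F", "-n", db_name]
```

Once the BLAST executables are installed, the command can be run with
subprocess.run(formatdb_command("rs.fas", "dbsnp_rs"), check=True).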




-------------------
DATABASE STRUCTURE
------------------- 

Documentation Source:
The current schema of the dbSNP database is defined by SQL DDL (Data 
Definition Language) suitable for regenerating both tables and procedures
in the schema. The dbSNP DDL file can be found in the mssql subdirectory as: 

ftp://ftp.ncbi.nlm.nih.gov/snp/mssql/schema/dbSNP_table.sql.gz

Table indexing is provided in the files dbSNP_foreign_key.sql.gz and
dbSNP_index_contraint.sql.gz.

Additional documentation:

The schema subdirectory also contains a data dictionary and entity relationship
diagram for the database (DataDictionary_b116.html and erd_dbSNP.pdf, respectively).

Contents:
The entire contents of dbSNP are dumped table-wise into the "mssql/data"
subdirectory as files named "tablename.bcp". Fields of data within each file 
are tab delimited. (Note: the delimiter was changed on 11/22/99; it was
previously '|'.)

All table definitions and bcp dumps are now provided in gzip (GNU zip) compressed
format due to the large size of the database.
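
As an illustration, a gzip-compressed, tab-delimited table dump can be read
row by row with the Python standard library. This is a sketch only: the
function name is hypothetical, the text encoding is an assumption, and
quoting is disabled because the dumps are described as plain tab-delimited
fields.

```python
import csv
import gzip


def read_bcp(path):
    """Yield each row of a gzip-compressed .bcp dump as a list of strings."""
    # The encoding is an assumption; undecodable bytes are replaced.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace",
                   newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row
```

The column order of each row follows the table definition in
dbSNP_table.sql.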

**** Maintaining a local Sybase copy of dbSNP ****

Local copies of the dbSNP database should be re-created with each build, using the 
latest dbSNP_table.sql file discussed above. This will ensure that all tables 
will be correctly loaded. 

*** NOTE ***
A sample Unix C shell script, cmd.create_local_dbSNP, is provided on the FTP site:
ftp://ftp.ncbi.nih.gov/snp/mssql/loadscript.

cmd.create_local_dbSNP shows how to use dbSNP_table.sql and *.bcp to create a local 
SQL database. Users of other database platforms can use the same method with a few 
modifications to platform-specific commands.

cmd.create_local_dbSNP first creates all tables without indexes. This allows fast 
loading of bcp data. It then loads the table data and, as an error check, compares 
the number of lines in each data file with the row count of the loaded table.
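
The same error check can be sketched in Python for users who load the data
by other means. This is not the logic of cmd.create_local_dbSNP itself: the
function names are hypothetical, and the sketch assumes one table row per
line (no embedded newlines) and that the caller has already obtained the row
count from the database, for example with SELECT COUNT(*).

```python
import gzip


def count_data_lines(path):
    """Count lines in a .bcp dump, transparently handling .gz files."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        return sum(1 for _ in handle)


def check_load(path, table_row_count):
    """Return (ok, lines_in_file) for a loaded table.

    `table_row_count` is the row count reported by the database.
    """
    lines = count_data_lines(path)
    return lines == table_row_count, lines
```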


**** Importing sections of dbSNP into a local spreadsheet ****

The ".bcp" files in the "mssql/data" subdirectory may be loaded into most
spreadsheet programs by setting the field delimiter character to "tab".



-------------------
ASN.1 data exchange
------------------- 
Documentation Source
ftp://ftp.ncbi.nih.gov/snp/specs/docsum.asn

"docsum.asn" defines the denormalized summary data for each refSNP 
(non-redundant) variation in the dbSNP database. The docsum provides a
de-normalized view of dbSNP, with information about variation, genes 
and map position(s) provided for variation records.

We have provided three data files to support the download and parsing of
these data:

1. Binary version of the structured ASN.1 records

     A complete database dump is also provided in compressed binary
form in the files in the directory ASN1_bin. The data are organized by
chromosome. Submissions that map to multiple chromosomes are reported in
the file "chMulti", and submissions that do not map to the current
reference sequence are in the file "chNotOn". The binary dumps can be read
and extracted with any standard ASN.1 tool.

-------------------
FLATFILE DOCSUM
-------------------

     A flat-file report has been generated from the ASN.1 datafiles,
and is provided in the files '/ASN1_flat/ds_flat_chXX.flat'. As with all of
the large report dumps, files are generated per chromosome (chXX in file name).

These files have been constructed with the following format:

     1. Rows start with one of the keywords listed below (rs, ss, SNP,
        VAL, MAP, CTG, GP, LOC, GBL or SEQ).
     2. Fields are delimited by the pipe '|' character.
     3. Each refSNP will have a 'rs', 'ss', 'SNP', 'VAL' and 'MAP' line
        with summary information about submitter ID, SNP alleles, variation,
        validation and map information, respectively.
     4. Each refSNP will have a set of contig (CTG) and locus (LOC)
        lines reporting each genomic position predicted for the
        variation by in silico sequence analysis
     5. Each refSNP will have a set of locus (LOC) lines reporting the
        IDs of genes that fall within 2 kb of the SNP position. 
        Functional class has been implemented to report if a
        variation is in a locus region, transcript, or coding region.
        The latter is designated as contig-reference, coding-synonymous, 
        coding-nonsynonymous, coding-undetermined, or coding-exception.
     6. Each refSNP may have additional GenBank locus (GBL) lines to define
        functional classification based on alignment to non-contig sequence
        (usually mRNA sequences).
     7. Each refSNP will have a set of sequence (SEQ) lines reporting
        the position of the variation on finished and/or draft
        sequence.
     8. Fields with no value are reported with "?" as the value.

The lines and fields reported in the flatfile format are:

KEYWORD    docsum.asn FIELDS     
rs   1. NSE-rs.refsnp-id
     2. NSE-rs.organism
     3. NSE-rs.taxid
     4. NSE-rs.snp-class
     5. NSE-rs.genotype
     6. NSE-rs.linkout
     7. NSE-rs.last-action

ss   1. NSE-ss.subsnp-id
     2. NSE-ss.handle
     3. NSE-ss.loc-snp-id
     4. NSE-ss.orient (+ =forward, - =reverse)

SNP  1. NSE-rs.observed
     2. NSE-rs.het
     3. NSE-rs.het-SE

VAL  1. NSE-rs.validated
     2. NSE-rs.valid-prob-min
     3. NSE-rs.valid-prob-max
     4. NSE-rs.snp-type

MAP  1. NSE-rs.ncbi-num-chr-hits      number of chromosomes hit during NCBI mapping
     2. NSE-rs.ncbi-num-ctg-hits      number of contigs hit during NCBI mapping
     3. NSE-rs.ncbi-num-seq-loc       total number of hits to NCBI genome assembly
     4. NSE-rs.ncbi-mapweight (1=unique in genome, 2=hit twice, 3=hit 3-9 times, 10=10+ times)
     5. NSE-rs.ucsc-num-chr-hits      number of chromosomes hit during UCSC mapping
     6. NSE-rs.ucsc-num-ctg-hits      number of contigs hit during UCSC mapping
     7. NSE-rs.ucsc-num-seq-loc       total number of hits to UCSC genome assembly
     8. NSE-rs.ucsc-mapweight (1=unique in genome, 2=hit twice, 3=hit 3-9 times, 10=10+ times)


A variation will have a CTG line for each map location in NCBI contig (CTG) coordinates
CTG  1. NSE-rsContigHit.chromosome
     2. NSE-rsMapLoc.physmap-int
     3. NSE-rsContigHit.contig-id:NSE-rsContigHit.version     
     4. NSE-rsMapLoc.asn-from          beginning map location in contig coordinates
     5. NSE-rsMapLoc.asn-to            ending map location in contig coordinates
     6. NSE-rsMapLoc.loctype (1=range '..'; 2=exact; 3=between '^' adjacent bases)
     7. NSE-rsMapLoc.orient  (+ =forward, - =reverse) 

A variation will have a GP line for each map location in UCSC contig (golden path) coordinates
GP   1. NSE-rsUCSCContigHit.chromosome
     2. NSE-rsMapLoc.physmap-int
     3. NSE-rsUCSCContigHit.contig-id:NSE-rsContigHit.version     
     4. NSE-rsMapLoc.asn-from         beginning map location in contig coordinates
     5. NSE-rsMapLoc.asn-to           ending map location in contig coordinates
     6. NSE-rsMapLoc.loctype (1=range '..'; 2=exact; 3=between '^' adjacent bases)
     7. NSE-rsMapLoc.orient  (+ =forward, - =reverse) 

A variation will have a locus (LOC) line for each gene locus feature defined on NCBI contigs (CTG)
LOC  1. NSE-FxnSet.symbol
     2. NSE-FxnSet.locus-id
     3. NSE-FxnSet.fxn-class-contig
 
        If the variation is determined to be in a coding region, the following additional fields
        may be defined:

     4. NSE-FxnSet.allele              the allele for the variation
     5. NSE-FxnSet.reading-frame       the position in codon (1,2,3) if applicable
     6. NSE-FxnSet.residue             the translated amino acid residue for this allele
     7. NSE-FxnSet.aa-position         the position of the amino acid in peptide sequence

        In these cases, the LOC line may refer to the functional context of individual alleles
        instead of the variation as a single entity.

A variation may have additional GenBank locus (GBL) lines that define functional classification 
based on the alignment of the variation with an mRNA sequence instead of contig sequence. 
GBL  1. NSE-rsLocusID.symbol
     2. NSE-rsLocusID.locus-id
     3. NSE-rsLocusID.fxn-class-mrna
     4. NSE-FxnSet.allele
     5. NSE-FxnSet.reading-frame
     6. NSE-FxnSet.residue

A variation will have a SEQ line for each map location in sequence component coordinates
Variations that map to multiple bases will have a location range denoted with ..
Variations that map as an insertion between adjacent bases will have a location denoted with ^
SEQ  1. NSE-SeqLoc.source-db     
        where ref-mrna is the NCBI RefSeq mRNA collection,
              gb-mrna is the set of organism-specific mRNAs in GenBank,
              gb-small is likewise the set of GenBank DNA sequences <30kb in length,
              hgs-finish is the set of finished genome sequences,
              hgs-draft is the set of draft genome sequences, and
              bes is the set of BAC-end sequences
     2. NSE-rsSeqHit.accession
     3. NSE-rsSeqHit.version
     4. NSE-SeqLoc.asn-from [../^][NSE-SeqLoc.asn-to]
     5. NSE-SeqLoc.loc-type (1=range '..'; 2=exact; 3=between '^' adjacent bases)
     6. NSE-SeqLoc.orient   (+ =forward, - =reverse, ? =unknown)
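
The format rules above (keyword at the start of each row, '|' as the field
delimiter, "?" for missing values) can be sketched as a small Python reader.
This is an illustration under stated assumptions, not a reference parser: it
assumes the keyword is the leading text of the first field, that every record
begins with an 'rs' line, and the function names are hypothetical.

```python
KEYWORDS = ("rs", "ss", "SNP", "VAL", "MAP", "CTG", "GP", "LOC", "GBL", "SEQ")


def parse_flat_line(line):
    """Return (keyword, values) for one flatfile row.

    Fields equal to "?" are returned as None. The keyword is None when
    the row does not start with a known keyword.
    """
    fields = [f.strip() for f in line.rstrip("\n").split("|")]
    values = [None if f == "?" else f for f in fields]
    keyword = next((k for k in KEYWORDS if fields[0].startswith(k)), None)
    return keyword, values


def group_records(lines):
    """Group rows into per-refSNP records, starting a new one at each 'rs' row."""
    record = None
    for line in lines:
        if not line.strip():
            continue
        keyword, values = parse_flat_line(line)
        if keyword == "rs":
            if record is not None:
                yield record
            record = []
        if record is not None:
            record.append((keyword, values))
    if record is not None:
        yield record
```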

------------------------
XML Data Exchange Format
------------------------


Documentation Source:
ftp://ftp.ncbi.nih.gov/snp/specs/NSE.mod
ftp://ftp.ncbi.nih.gov/snp/specs/NCBI_Entity.mod
ftp://ftp.ncbi.nih.gov/snp/specs/docsum.asn

note: NSE.mod is the XML tagging-style equivalent to the ASN.1 data 
definition defined in docsum.asn. NCBI_Entity.mod is the DTD specification
of the common data types.


The XML formats provide query-specific information about refSNP clusters
and their members in the NCBI SNP Exchange (NSE) format. This format currently
has five modules:
    NSE-ExchangeSet    (the attached XML report if appropriate)
    NSE-BaseURLSet     NCBI resource ID's, and the link ID within the resource
    NSE-SubmitterList  Contact information for all handles assigned in dbSNP
    NSE-AssayList      Set of all batch-level information on assay conditions
                       (methods, sample sizes, populations, strains, citations,
                       submitter linkouts, and comments for submissions (ss#'s)
                       in a refSNP cluster
    NSE-PopList        Set of all batch-level information (methods, comments,
                       and citations) for allele frequency estimation.

Some tags in the NSE-ExchangeSet reference data in these other XML modules as
noted in the NSE.mod (above). These modules are available via anonymous ftp 
from ftp://ftp.ncbi.nih.gov/snp/human/XML/

The XML exchange format is prepared in two versions: brief and full, with
the full version including additional information about each submission
in the refSNP cluster as described above.

Both versions provide RefSNP summary information including:
  -  the set of hits to reference genome sequence
  -  functional relationships to annotated genes on reference sequence
  -  submitter information (contact and batch) for all batches in dbSNP
  -  flanking sequence information for each submission in a refSNP cluster
The full version includes allele frequency information x ss# x population.


VIEWING XML DATA FILES WITH INTERNET EXPLORER

The Microsoft Internet Explorer Web Browser can be used to 
view the (plain text) XML data files with the following two steps:

1. Save the desired XML files to a local folder and extract with
an uncompression utility. Make sure the extracted file has the
extension ".xml"

2. Save the following XML DTD/MOD files to the same local folder. Make
sure they have their original filename extensions. These are plain text
files that define the data structure.
  /snp/specs/NSE.dtd
  /snp/specs/NSE.mod
  /snp/specs/NCBI_Entity.mod


Comments on this spec can be sent to snp-admin@ncbi.nlm.nih.gov

--------------------
CHROMOSOME REPORTS
--------------------

Chromosome reports provide an ordered list of RefSNPs in approximate
chromosome coordinates (the same coordinate system used for the
NCBI genome MapViewer). Each line gives the following information
for a single RefSNP in tab-delimited columns:

Column   Data
  1      RefSNP id (rs#)
  2      mapweight where
            1 = mapped to single position in genome
            2 = mapped to 2 positions on a single chromosome
            3 = mapped to 3-10 positions in genome (possible paralog hits)
           10 = mapped to >10 positions in genome
  3      snp_type where
            0 = not withdrawn
            1 = withdrawn. There are several reasons for withdrawal; the
                withdrawn status is fully defined in the ASN.1, flatfile,
                and XML descriptions of the RefSNP. See /specs/docsum.asn
                for the full definition of snp-type values.
  4      total number of chromosomes hit by this RefSNP during mapping
  5      total number of contigs hit by this RefSNP during mapping
  6      total number of hits to genome by this RefSNP during mapping
  7      chromosome for this hit to genome
  8      contig accession for this hit to genome
  9      version number of contig accession for this hit to genome
 10      contig ID for this hit to genome
 11      position of RefSNP in contig coordinates
 12      position of RefSNP in chromosome coordinates (used to order report)
           Locations are specified in NCBI sequence location convention where;
               x, a single number indicates a feature at base position x
            x..y, a feature that spans from x to y inclusive
             x^y, a feature that is inserted between bases x and y
 13      genes at this same position on the chromosome
 14      average heterozygosity of this RefSNP
 15      standard error of average heterozygosity
 16      maximum reported probability that RefSNP is real. (For computationally-
             predicted submissions)
 17      validated status
             0 = no validation information
             1 = cluster has 2+ submissions, with 1+ submission assayed 
                 with a non-computational method
             2 = at least one subsnp in cluster has frequency data submitted
             3 = non-computational method in cluster and frequency data present
             4 = at least one subsnp in cluster has been experimentally 
                 validated by submitter
 18      genotypes available in dbSNP for this RefSNP
             1 = yes, 0 = no
 19      linkout available to submitter website for further data on the RefSNP
             1 = yes, 0 = no
 20      dbSNP build ID when the refSNP was first created (i.e. create date)
 21      dbSNP build ID of most recent change to the refSNP cluster (update date)
             where dates are reckoned in dbSNP build IDs 


IV. ADDITIONAL FORMAT DESCRIPTIONS FOR BATCH QUERY SERVICE

Users may request the following additional report formats from the dbSNP
Batch Query service (http://www.ncbi.nlm.nih.gov/SNP  select Batch Search).
Small result sets are returned via email. Large result sets are available
by ftp. In this case users will be notified by email when the report is ready,
and the email will provide a link to retrieve the data.


-----------------
RS CLUSTER REPORT
-----------------

This report format generates a table of refSNP (rs) clusters, the submissions
assigned to the cluster (ss#'s) in the current database build, and the
submitter's local ID for each ss# in the cluster. Input data can be a list of
either ss#'s or rs#'s. Output is a tab-delimited report with the format:

rs#     ss#     HANDLE  LOCAL_SNP_ID
-----   -----   ------- ----------------
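
As an illustration, the report can be collected into one list of submissions
per refSNP cluster with a few lines of Python. The function name is
hypothetical, and the sketch assumes that any header or ruler lines start
with "rs#" or "-" as in the layout above.

```python
from collections import defaultdict


def group_by_rs(lines):
    """Return {rs#: [(ss#, HANDLE, LOCAL_SNP_ID), ...]} from report lines."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the header/ruler lines, if present.
        if not line or line.startswith("rs#") or line.startswith("-"):
            continue
        rs, ss, handle, local_id = line.split("\t")[:4]
        clusters[rs].append((ss, handle, local_id))
    return dict(clusters)
```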



V. RECONSTRUCTING FLANKING SEQUENCE IN SYBASE, ASN.1 and XML REPORT FORMATS


dbSNP is a sequence-based database that relies on the flanking unique sequence
of a variation to define its map location and cluster neighbors. If a particular
experimental method only interrogates a small number of nucleotide bases during a
survey for variation, the flanking sequence may be insufficient in length to accurately
map the variation onto genome sequence. 

dbSNP recognizes this fact, and encourages submitters to append known flanking sequence 
to the surveyed regions to provide a minimum of 100 bp of flanking sequence. However,
dbSNP distinguishes these two regions as 'flank' for cut-and-paste regions of adjacent,
unsurveyed sequence, and 'assay' for regions of sequence directly flanking the submitted
variation and actually surveyed for variation in some number of chromosomes. Furthermore,
all flanking sequence is stored in the dbSNP database as an ordered set of string fragments
of less than 255 characters and is reported as a sequence of seq_E-tagged strings.

In general, the flanking sequence of an individual submission (subsnp, or ss#) is
made up of 5 parts in the following order:

       5' flank               observed                 3' flank
      1   2   3  4    1   2  3        1   2   3    1   2   3   4  5  <-  seq_E fragments
      |   |   |  |    |   |  |        |   |   |    |   |   |   |  |
  5' --- --- --- -   +++ ++ ++ [A/G] ~~~ ~~~ ~~~  === === === === =  3'
       (optional)    5' assay        3' assay        (optional)


The sequence of a refSNP cluster is instantiated as the sequence of its longest subsnp
member, or exemplar. At the refsnp level, the distinction is no longer preserved
between flank and assay, and the sequence is simply reported as 5' and 3'. Again, however,
the sequences are stored in the database and reported in the ASN.1 and XML formats 
as an ordered list of sequence fragments that must be concatenated.

In cases where no assay sequence was provided, the flanking sequence is constructed
solely from the 'flank' fields themselves:

       5' flank     observed      3' flank
      1   2   3  4        1   2   3   4  5  <-  seq_E fragments
      |   |   |  |        |   |   |   |  |
  5' --- --- --- - [A/G] === === === === =  3'
       
The flanking sequence segments are stored in the Sybase tables as follows:

5' flank:  SubSNP5Flank.line    segments ordered by column 'line_num': 0,1,2,etc.
5' assay (250 bp. nearest to variation): SubSNP.seq_5
5' assay (all additional assay sequence): SubSNP5primExtra.line  ordered by 'line_num'
3' flank: SubSNP3Flank.line                                      ordered by 'line_num'
3' assay (250 bp. nearest to variation): SubSNP.seq_3
3' assay (all additional assay sequence): SubSNP3primExtra.line  ordered by 'line_num'
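
As an illustration only, the following Python sketch shows one way the
fragments listed above could be concatenated into a single subsnp
sequence. The function names and the (line_num, line) row layout are
hypothetical. The placement of the "Extra" assay fragments (before
SubSNP.seq_5 on the 5' side, after SubSNP.seq_3 on the 3' side) is an
assumption inferred from the "nearest to variation" wording and is not
confirmed by this document.

```python
def join_fragments(rows):
    """Concatenate (line_num, line) rows in ascending line_num order."""
    return "".join(line for _num, line in sorted(rows, key=lambda r: r[0]))


def assemble_subsnp(flank5, extra5, seq_5, observed, seq_3, extra3, flank3):
    """Build 5' flank + 5' assay + [observed] + 3' assay + 3' flank.

    flank5, extra5, extra3, flank3 are lists of (line_num, line) rows
    (SubSNP5Flank, SubSNP5primExtra, SubSNP3primExtra, SubSNP3Flank).
    seq_5 and seq_3 are the SubSNP.seq_5 and SubSNP.seq_3 strings.
    ASSUMPTION: extra assay sequence lies farther from the variation
    than seq_5 / seq_3.
    """
    five_prime = join_fragments(flank5) + join_fragments(extra5) + seq_5
    three_prime = seq_3 + join_fragments(extra3) + join_fragments(flank3)
    return f"{five_prime}[{observed}]{three_prime}"
```

When no assay sequence was provided, passing empty lists and empty
strings for the assay arguments leaves only the two flank segments
around the observed alleles.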





VI. HISTORY OF REVISIONS TO FTP SITE:

Rev: Dec 10, 2002
Updates for directory structure

Rev: October 9, 2002
A "Getting Started Using the dbSNP FTP site" guide for new users of the
dbSNP FTP site was added at:
http://www.ncbi.nlm.nih.gov/About/outreach/gettingstarted/snpftp/index.html

Rev: October 1, 2002
Added section VII. DATA PROCESSING AND SUMMARY MEASURES

Rev: May 31, 2002
The ftp address for dbSNP data (and NCBI resources in general)
has changed from ncbi.nlm.nih.gov to ftp.ncbi.nih.gov

Rev: April 30, 2002
A data dictionary for the dbSNP schema is available online in the 
Documentation section or as a Rich Text File report in the file 
/Sybase/schema/dbSNPdataDictionary.rtf.gz


Rev: March 21, 2002
A dbsnp-announce mailing list has been created to report the release of new
builds, announce new features, and report corrections or problems with past or
present builds.

Follow the 'Announcements' link on the dbSNP home page to subscribe to this
mailing list or visit the web page at:
http://www.ncbi.nlm.nih.gov/mailman/listinfo/dbsnp-announce

Rev: April 29, 2003
XML genotype report description added


-- 6/7: added 2 columns to chromosome reports to track create/update build IDs for refSNP
   clusters, and expanded the FASTA defline to include the full set of alleles for the variation.

-- 5/31/02: ftp addresses updated to new NCBI ftp server address.

-- 3/21/02: fixed typo in documentation of GP tag lines for flatfile format. GoldenPath
   coordinates are indicated by GP tag lines.

-- 11/29/01: UCSC contig coordinates added to MAP (for summary data) and GP lines added
   for variation positions on the golden path. Amino acid position in peptide coordinates
   added to functional data (LOC)

-- 7/12/01: LOC lines in FLATFILE format were extended to include allele-specific
   functional data for variations in coding regions (allele, reading frame, amino acid residue)

-- 4/04/01: Additional report formats available via the dbSNP web-based
   Batch Query service introduced. RS Cluster Report format defined.

-- 1/22/01: exchange directory is moved to organism-specific XML
   subdirectories where brief and full versions of the docsum
   will be dumped by chromosome. The asn.1 and XML DTD files have
   been moved to the /specs subdirectory. The chromosome report
   format is fully described.

-- 10/30/00: ftp site is reorganized to keep files to a manageable
   size (< 2 Gb) and provide more information in the structure of 
   the directories themselves.

-- 10/20/00: Only the fasta file is updated. Because of file size, only
     compressed files are on the ftp site.

-- 10/20/00 fasta file is chunked. The file name convention is:
     [rs]fas.#m:
          rfas - rs fasta; sfas - ss fasta
          1m - fasta sequence for rs# or ss# under 1,000,000.
          2m - fasta sequence for rs# or ss# between 1,000,001 and 2,000,000.


-- 10/04/00 position of variation on a contig is changed from
   a single value (asn-from) to a two-value begin/end format.
   The field 'loctype' indicates if the two-value format
   represents a range, single position, or insertion between
   adjacent bases.

-- 7/28/00 Due to the rapid recent growth of the database, we
   no longer dump the SYBASE format table dumps in the 
   uncompressed directory. Full Sybase table dumps are still
   available in the PC_compressed and Unix_compressed 
   subdirectories.

-- 7/28/00 the docsum data has been removed from the SNP table.
   RefSNP docsum data is now kept in a packed binary format in 
   the table SNPDocsum. 

   FORMAT OF SNPDocsum Table in SYBASE DUMP:
     'id'   corresponds to 'snp_id' in table SNP
     'size' size of docsum data in bytes
     'vars' number of completely filled varbinary columns
     'tail' is the number of bytes in the final
               partially filled varbinary column
     'var1 - var7' seven varbinary columns used to store the 
               docsum
     'blob' image type column used to store docsums that
               are too large to fit in the seven varbinary columns
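
As an illustration only, the following Python sketch shows one reading
of how the columns above could be recombined into a single docsum byte
string. The function name and argument layout are hypothetical. It
assumes that the first 'vars' columns are used in full, that the next
column contributes its first 'tail' bytes, that a non-empty 'blob'
holds the whole docsum, and that 'size' equals the length of the
result. None of these details is confirmed by this document.

```python
def unpack_docsum(size, n_full, tail, var_cols, blob=None):
    """Rebuild docsum bytes from one SNPDocsum row (hypothetical layout).

    size     -- value of the 'size' column
    n_full   -- value of the 'vars' column (completely filled columns)
    tail     -- value of the 'tail' column (bytes in the partial column)
    var_cols -- list holding var1..var7 as bytes (unused entries may be None)
    blob     -- value of the 'blob' column, or None
    """
    if blob:
        # ASSUMPTION: an oversized docsum is stored entirely in 'blob'.
        data = bytes(blob)
    else:
        data = b"".join(var_cols[:n_full])
        if tail and n_full < len(var_cols):
            # ASSUMPTION: the partial column directly follows the full ones.
            data += var_cols[n_full][:tail]
    if len(data) != size:
        raise ValueError(f"expected {size} bytes, rebuilt {len(data)}")
    return data
```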


-- 6/7/00 bcp data for the SNP table comes in two flavors:

   SNP.bcp does not contain the docsum column and cannot be used to bcp
     data into the SNP table. Without the docsum column it is much smaller
     and easy to read into a spreadsheet.
   SNP_bcp.bcp has the docsum column and can be used to bcp data
     into the SNP table.

-- 6/06/00 FASTA defline redefined for rs set. The new
   defline includes tax ID and variation class.

-- 6/05/00 ASN.1 flatfile format now includes information on
   all submissions in the refSNP cluster. This information
   is grouped in lines starting with "ss" and gives the ss#
   (dbSNP accession for the submission), submitter's handle,
   and submitter's local ID for the variation.

-- 6/01/00 ASN.1 flatfile format changed slightly. LOC, CTG, and
   SEQ lines are no longer interleaved. Instead, the lines of each
   type are dumped as a set, in separate blocks for locus
   information (LOC), contig coordinates (CTG), and finished/draft
   sequence coordinates (SEQ).

-- 5/22/00 refSNP docsum definition extended to include map
   positions in coordinates of contig components (finished and/or
   draft sequence). New coordinates are available in RefSNPSeqHit
   data structure in ASN.1 format, and as SEQ lines in flatfile
   format.

-- 4/6/00 refSNP docsum files are now available. These denormalized
   summaries of each refSNP cluster provide summary information
   on each variation in the dbSNP database. Summary information
   is detailed in the ASN.1 definition file 'docsum.asn'
   which can be found in the 'specs' subdirectory of the dbSNP
   ftp site.


VII. DATA PROCESSING AND SUMMARY MEASURES

"How variations are mapped to genome sequence"

When reference genome assemblies are available, we use them as anchor
sequence to place refSNP clusters into a genomic context. We clean the
dbSNP flanking sequences with RepeatMasker and then remap them to the
most current build of each genome using MegaBLAST. The mapping results
then define a new non-redundant set of variations for the genome.

In general a word size of 28 is used in MegaBLAST computations, but a
small subset of our data has a half flank (i.e. the 5' or 3' flanking
sequence taken individually as a MegaBLAST query sequence) of only 25
bases, and these are blasted with a word size of 22. To map a deletion we
require that both flanking sequences are returned in the alignment and,
furthermore, that both penultimate bases flanking the allele are
returned. In other words, the gap as defined in the alignment exactly
matches the deletion as defined in dbSNP.

The complete command line for MegaBLAST is:

megablast -U T -F m -J F -X 180 -r 10 -q -20 -P 1000 -R T -W 28
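
As an illustration only, the following Python sketch builds the
argument list shown above and switches the word size for short half
flanks. The function names are hypothetical, and the "25 bases or
fewer" cutoff is an assumption; this document states only that half
flanks of 25 bases are run with a word size of 22. Query and database
arguments are deliberately left out because they are not given here.

```python
# Options copied from the command line above, minus the -W word size.
BASE_ARGS = ["megablast", "-U", "T", "-F", "m", "-J", "F", "-X", "180",
             "-r", "10", "-q", "-20", "-P", "1000", "-R", "T"]


def word_size(half_flank_len, short_flank_len=25):
    """Return 22 for short half flanks, otherwise the general value 28.

    ASSUMPTION: 'short' means short_flank_len bases or fewer.
    """
    return 22 if half_flank_len <= short_flank_len else 28


def megablast_args(half_flank_len):
    """Return the argument list with -W chosen for this query length."""
    return BASE_ARGS + ["-W", str(word_size(half_flank_len))]
```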

In addition, we filter the MegaBLAST results into two classes in the
dbSNP database:

0) better than 95% sequence alignment with fewer than 6 mismatches
1) better than 75% alignment with fewer than 3% mismatches.

Anything below the lower threshold is discarded.
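
As an illustration only, the two classes and the discard rule can be
sketched in Python as below. The function and parameter names are
hypothetical. The sketch assumes that "alignment" means aligned length
as a percentage of query length and that the mismatch percentage is
taken over the aligned length; this document does not define either
denominator.

```python
def classify_hit(query_len, aligned_len, mismatches):
    """Return 0, 1, or None (discarded) for one MegaBLAST hit.

    ASSUMPTION: alignment % = aligned_len / query_len,
                mismatch %  = mismatches / aligned_len.
    """
    if query_len <= 0 or aligned_len <= 0:
        return None
    aligned_pct = 100.0 * aligned_len / query_len
    mismatch_pct = 100.0 * mismatches / aligned_len
    if aligned_pct > 95.0 and mismatches < 6:
        return 0  # class 0: >95% aligned, fewer than 6 mismatches
    if aligned_pct > 75.0 and mismatch_pct < 3.0:
        return 1  # class 1: >75% aligned, fewer than 3% mismatches
    return None   # below the lower threshold
```

A hit is tested against class 0 first, so a hit that satisfies both
rules is reported as class 0.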

A non-negligible subset of dbSNP refSNPs (rs) fails to map because of
heavy repeat masking on the flanking sequence. When variations align at
more than one site on the genome, they sometimes map with distinct
mismatch counts. dbSNP does not make any effort to reduce map redundancy
by comparing individual quality scores at each site.

Generally, dbSNP reports all mapping results against the current
assembly. This is certainly not, however, everything in dbSNP. There are
three major cases where we do not map and/or annotate:

	a. Submissions that are completely masked as repetitive
elements. These are dropped from any further computations. This set of
refSNPs is dumped in chromosome "rs_chMasked" on our ftp site.

	b. Submissions that are defined in a cDNA context with extensive
splicing. These SNPs are typically annotated on refseq mRNAs through a
separate annotation process. We are working to reverse map these
variations back to contig coordinates, but that has not been
implemented. For now, you can find this set of variations in
"rs_chNotOn" on the ftp site.

	c. Variations with excessive hits to the genome. Variations with
3+ hits to the genome are neither annotated as variation features on
contigs nor included in variation tracks for either the NCBI or UCSC
map viewer resource. These data are in "rs_chMulti" on our ftp site.
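
As an illustration only, the three cases can be sketched in Python as
a lookup of the ftp report name for a refSNP. The function and
parameter names are hypothetical, and the order in which the cases are
tested is an assumption; this document does not say which case wins
when more than one applies.

```python
def unmapped_report(fully_masked, cdna_context, genome_hits):
    """Return the report name for a refSNP left out of mapping/annotation.

    fully_masked -- flanking sequence completely masked as repeats (case a)
    cdna_context -- defined in a cDNA context with extensive splicing (case b)
    genome_hits  -- number of hits to the genome (case c applies at 3+)
    Returns None when none of the three cases applies.
    """
    if fully_masked:
        return "rs_chMasked"
    if cdna_context:
        return "rs_chNotOn"
    if genome_hits >= 3:
        return "rs_chMulti"
    return None
```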

Furthermore, the heuristics for non-SNP variations (i.e. named elements
and STRs) are probably a bit too conservative so some of these are
consequently lost. While we prefer to err on the side of caution to
avoid false annotation of variation in inappropriate locations, we are
working to improve the success hit-rate for these variations as well.



"Why do the functional classifications for some variations change when a
genome is re-assembled?"

Functional annotation varies from build to build because the underlying
substrate, namely the reference genome sequence, is itself changing from
assembly to assembly. During each assembly, the algorithms used to
define 'genes' are refined to improve accuracy. Since gene features can
be defined by various classes of evidence that vary in their certainty,
there is currently some thrashing in estimates of gene numbers and their
precise exon structure on the genome. Duplicates are identified and
merged, spurious annotations are removed, and new evidence is included
as the annotation pipelines are developed.

The net result for the dbSNP user is that SNPs may be in an exon (or
gene more generally) in one build, and in an intron or UTR (or
intergenic DNA) in the next build if the exon (gene) is subsequently
removed. The genome sequencing community is converging on a stable
reference sequence. When it is finished, annotation (including SNP
function) will be much more stable.

"How average heterozygosity is computed"
 
Average heterozygosity is computed for each refSNP cluster as described
here (http://www.ncbi.nlm.nih.gov/SNP/Hetfreq.html).


VIII. XML GENOTYPE EXCHANGE FORMAT
------------------------
Documentation Source:
ftp://ncbi.nlm.nih.gov/snp/specs/genoex.xsd

The files in this directory are intended to provide a data structure (with 
data types) that can be read or manipulated by other computer programs.  
Because of the large size of the "by chromosome" genotype exchange files,
it is inadvisable to open the XML download files in your web browser; 
doing so may cause your computer to crash.
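
As an illustration only, the following Python sketch reads a large XML
file as a stream instead of loading it whole, using the standard
library's xml.etree.ElementTree.iterparse. It only counts element
names, so it makes no assumption about the genoex.xsd structure; the
function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(source):
    """Count element names in an XML file or file object, streaming.

    Each element is cleared once its end tag has been read, so its
    text and children do not accumulate in memory.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Drop any "{namespace}" prefix from the tag name.
        counts[elem.tag.rsplit("}", 1)[-1]] += 1
        elem.clear()
    return counts
```

Running this on a downloaded file gives a quick inventory of the
element names it contains, which can then be looked up in the schema.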

ANNOTATION
To view the data structure definition for the genotype exchange format, 
download the XML schema file, genoex.xsd, located at
ftp://ncbi.nlm.nih.gov/snp/specs/genoex.xsd
The annotation tags (<annotation>) located throughout the schema describe
the data contained in the genotype exchange structure; an annotation tag
precedes each data description.

For general information about XML schema, please see http://www.w3.org/2001/XMLSchema.


TOOLS FOR VIEWING XML FILES
If you require XML tools that can read or manipulate the SNP genotype 
XML files, several proprietary and open source XML tools can 
be used for this purpose. Examples of such tools can be found at: 
http://www.google.com/search?q=XML+parser+tools+editor


Although the SNP development team has rectified any obvious cases of 
redundancy or overlap in dbSNP's genotype data, we will continue to 
curate for such cases, and correct them as they arise.

Please send comments concerning dbSNP genotypes to: snpdev@ncbi.nlm.nih.gov