GNU Info

Info Node: (gawkinet.info)PROTBASE

(gawkinet.info)PROTBASE


Prev: STOXPRED Up: Some Applications and Techniques
Enter node , (file) or (file)node

PROTBASE: Searching Through A Protein Database
==============================================

     Hoare's Law of Large Problems: Inside every large problem is a
     small    problem struggling to get out.

   Yahoo's database of stock market data is just one among the many
large databases on the Internet. Another one is located at NCBI
(National Center for Biotechnology Information). Established in 1988 as
a national resource for molecular biology information, NCBI creates
public databases, conducts research in computational biology, develops
software tools for analyzing genome data, and disseminates biomedical
information. In this section, we look at one of NCBI's public services,
which is called BLAST (Basic Local Alignment Search Tool).

   You probably know that the information necessary for reproducing
living cells is encoded in the genetic material of the cells. The
genetic material is a very long chain of four base nucleotides. It is
the order of appearance (the sequence) of nucleotides which contains
the information about the substance to be produced. Scientists in
biotechnology often find a specific fragment, determine the nucleotide
sequence, and need to know where the sequence at hand comes from. This
is where the large databases enter the game. At NCBI, databases store
the knowledge about which sequences have ever been found and where they
have been found.  When the scientist sends his sequence to the BLAST
service, the server looks for regions of genetic material in its
database which look the most similar to the delivered nucleotide
sequence. After a search time of some seconds or minutes the server
sends an answer to the scientist. In order to make access simple, NCBI
chose to offer their database service through popular Internet
protocols. There are four basic ways to use the so-called BLAST
services:

   * The easiest way to use BLAST is through the web. Users may simply
     point their browsers at the NCBI home page and link to the BLAST
     pages.  NCBI provides a stable URL that may be used to perform
     BLAST searches without interactive use of a web browser. This is
     what we will do later in this section.  A demonstration client and
     a `README' file demonstrate how to access this URL.

   * Currently, `blastcl3' is the standard network BLAST client.  You
     can download `blastcl3' from the anonymous FTP location.

   * BLAST 2.0 can be run locally as a full executable and can be used
     to run BLAST searches against private local databases, or
     downloaded copies of the NCBI databases. BLAST 2.0 executables may
     be found on the NCBI anonymous FTP server.

   * The NCBI BLAST Email server is the best option for people without
     convenient access to the web. A similarity search can be performed
     by sending a properly formatted mail message containing the
     nucleotide or protein query sequence to <blast@ncbi.nlm.nih.gov>.
     The query sequence is compared against the specified database
     using the BLAST algorithm and the results are returned in an email
     message. For more information on formulating email BLAST searches,
     you can send a message consisting of the word "HELP" to the same
     address, <blast@ncbi.nlm.nih.gov>.

   Our starting point is the demonstration client mentioned in the
first option.  The `README' file that comes along with the client
explains the whole process in a nutshell. In the rest of this section,
we first show what such requests look like. Then we show how to use
`gawk' to implement a client in about 10 lines of code. Finally, we
show how to interpret the result returned from the service.

   Sequences are expected to be represented in the standard IUB/IUPAC
amino acid and nucleic acid codes, with these exceptions:  lower-case
letters are accepted and are mapped into upper-case; a single hyphen or
dash can be used to represent a gap of indeterminate length; and in
amino acid sequences, `U' and `*' are acceptable letters (see below).
Before submitting a request, any numerical digits in the query sequence
should either be removed or replaced by appropriate letter codes (e.g.,
`N' for unknown nucleic acid residue or `X' for unknown amino acid
residue).  The nucleic acid codes supported are:

     A --> adenosine               M --> A C (amino)
     C --> cytidine                S --> G C (strong)
     G --> guanine                 W --> A T (weak)
     T --> thymidine               B --> G T C
     U --> uridine                 D --> G A T
     R --> G A (purine)            H --> A C T
     Y --> T C (pyrimidine)        V --> G C A
     K --> G T (keto)              N --> A G C T (any)
                                   -  gap of indeterminate length

   Now you know the alphabet of nucleotide sequences. The last two lines
of the following example query show you such a sequence, which is
obviously made up only of elements of the alphabet just described.
Store this example query into a file named `protbase.request'. You are
now ready to send it to the server with the demonstration client.

     PROGRAM blastn
     DATALIB month
     EXPECT 0.75
     BEGIN
     >GAWK310 the gawking gene GNU AWK
     tgcttggctgaggagccataggacgagagcttcctggtgaagtgtgtttcttgaaatcat
     caccaccatggacagcaaa

   The actual search request begins with the mandatory parameter
`PROGRAM' in the first column followed by the value `blastn' (the name
of the program) for searching nucleic acids.  The next line contains
the mandatory search parameter `DATALIB' with the value `month' for the
newest nucleic acid sequences.  The third line contains an optional
`EXPECT' parameter and the value desired for it. The fourth line
contains the mandatory `BEGIN' directive, followed by the query
sequence in FASTA/Pearson format.  Each line of information must be
less than 80 characters in length.

   The "month" database contains all new or revised sequences released
in the last 30 days and is useful for searching against new sequences.
There are five different blast programs, `blastn' being the one that
compares a nucleotide  query  sequence  against a nucleotide sequence
database.

   The last server directive that must appear in every request is the
`BEGIN' directive. The query sequence should immediately follow the
`BEGIN' directive and must appear in FASTA/Pearson format.  A sequence
in FASTA/Pearson format begins with a single-line description.  The
description line, which is required, is distinguished from the lines of
sequence data that follow it by having a greater-than (`>') symbol in
the first column.  For the purposes of the BLAST server, the text of
the description is arbitrary.

   If you prefer to use a client written in `gawk', just store the
following 10 lines of code into a file named `protbase.awk' and use
this client instead. Invoke it with `gawk -f protbase.awk
protbase.request'.  Then wait a minute and watch the result coming in.
In order to replicate the demonstration client's behaviour as closely
as possible, this client does not use a proxy server. We could also
have extended the client program in Note: Retrieving Web Pages,
to implement the client request from `protbase.awk' as a special case.

     { request = request "\n" $0 }
     
     END {
       BLASTService     = "/inet/tcp/0/www.ncbi.nlm.nih.gov/80"
       printf "POST /cgi-bin/BLAST/nph-blast_report HTTP/1.0\n" |& BLASTService
       printf "Content-Length: " length(request) "\n\n"         |& BLASTService
       printf request                                           |& BLASTService
       while ((BLASTService |& getline) > 0)
           print $0
       close(BLASTService)
     }

   The demonstration client from NCBI is 214 lines long (written in C)
and it is not immediately obvious what it does. Our client is so short
that it _is_ obvious what it does. First it loops over all lines of the
query and stores the whole query into a variable. Then the script
establishes an Internet connection to the NCBI server and transmits the
query by framing it with a proper HTTP request. Finally it receives and
prints the complete result coming from the server.

   Now, let us look at the result. It begins with an HTTP header, which
you can ignore. Then there are some comments about the query having been
filtered to avoid spuriously high scores. After this, there is a
reference to the paper that describes the software being used for
searching the data base. After a repitition of the original query's
description we find the list of significant alignments:

     Sequences producing significant alignments:                        (bits)  Value
     
     gb|AC021182.14|AC021182 Homo sapiens chromosome 7 clone RP11-733...    38  0.20
     gb|AC021056.12|AC021056 Homo sapiens chromosome 3 clone RP11-115...    38  0.20
     emb|AL160278.10|AL160278 Homo sapiens chromosome 9 clone RP11-57...    38  0.20
     emb|AL391139.11|AL391139 Homo sapiens chromosome X clone RP11-35...    38  0.20
     emb|AL365192.6|AL365192 Homo sapiens chromosome 6 clone RP3-421H...    38  0.20
     emb|AL138812.9|AL138812 Homo sapiens chromosome 11 clone RP1-276...    38  0.20
     gb|AC073881.3|AC073881 Homo sapiens chromosome 15 clone CTD-2169...    38  0.20

   This means that the query sequence was found in seven human
chromosomes.  But the value 0.20 (20%) means that the probability of an
accidental match is rather high (20%) in all cases and should be taken
into account.  You may wonder what the first column means. It is a key
to the specific database in which this occurence was found.  The unique
sequence identifiers reported in the search results can be used as
sequence retrieval keys via the NCBI server. The syntax of sequence
header lines used by the NCBI BLAST server depends on the database from
which each sequence was obtained.  The table below lists the
identifiers for the databases from which the sequences were derived.

     Database Name                     Identifier Syntax
     ============================      ========================
     GenBank                           gb|accession|locus
     EMBL Data Library                 emb|accession|locus
     DDBJ, DNA Database of Japan       dbj|accession|locus
     NBRF PIR                          pir||entry
     Protein Research Foundation       prf||name
     SWISS-PROT                        sp|accession|entry name
     Brookhaven Protein Data Bank      pdb|entry|chain
     Kabat's Sequences of Immuno...    gnl|kabat|identifier
     Patents                           pat|country|number
     GenInfo Backbone Id               bbs|number

   For example, an identifier might be `gb|AC021182.14|AC021182', where
the `gb' tag indicates that the identifier refers to a GenBank sequence,
`AC021182.14' is its GenBank ACCESSION, and `AC021182' is the GenBank
LOCUS.  The identifier contains no spaces, so that a space indicates
the end of the identifier.

   Let us continue in the result listing. Each of the seven alignments
mentioned above is subsequently described in detail. We will have a
closer look at the first of them.

     >gb|AC021182.14|AC021182 Homo sapiens chromosome 7 clone RP11-733N23, WORKING DRAFT SEQUENCE, 4
                  unordered pieces
               Length = 176383
     
      Score = 38.2 bits (19), Expect = 0.20
      Identities = 19/19 (100%)
      Strand = Plus / Plus
     
     Query: 35    tggtgaagtgtgtttcttg 53
                  |||||||||||||||||||
     Sbjct: 69786 tggtgaagtgtgtttcttg 69804

   This alignment was located on the human chromosome 7. The fragment
on which part of the query was found had a total length of 176383. Only
19 of the nucleotides matched and the matching sequence ran from
character 35 to 53 in the query sequence and from 69786 to 69804 in the
fragment on chromosome 7.  If you are still reading at this point, you
are probably interested in finding out more about Computational Biology
and you might appreciate the following hints.

  1. There is a book called `Introduction to Computational Biology' by
     Michael S. Waterman, which is worth reading if you are seriously
     interested. You can find a good book review on the Internet.

  2. While Waterman's book can explain to you the algorithms employed
     internally in the database search engines, most practicioners
     prefer to approach the subject differently. The applied side of
     Computational Biology is called Bioinformatics, and emphasizes the
     tools available for day-to-day work as well as how to actually
     _use_ them. One of the very few affordable books on Bioinformatics
     is `Developing Bioinformatics Computer Skills'.

  3. The sequences _gawk_ and _gnuawk_ are in widespread use in the
     genetic material of virtually every earthly living being. Let us
     take this as a clear indication that the divine creator has
     intended `gawk' to prevail over other scripting languages such as
     `perl', `tcl', or `python' which are not even proper sequences.
     (:-)


automatically generated by info2www version 1.2.2.9