SA-Search
A program to search for 3D fragments similar
to (fragments of) a query structure.
Introduction
This program compresses the 3D structures of proteins into sequences of
letters of a structural alphabet, which makes it possible to apply the
classical amino acid sequence methods to the search for 3D similarities
in proteins. The score matrix quantifying the similarities between the
letters of the structural alphabet has been derived from the
probabilities of observing each letter at each position when encoding
series of protein structures. An example of a possible scoring matrix is
reported here.
SA-Search does not currently perform as well as true 3D similarity
search methods, but it provides a very fast way of mining collections of
structures, especially for medium-range fragments, where true 3D
similarity search methods face a large number of comparisons.
Also, in our experience, the differences between true 3D similarity
search methods and SA-Search come more often from the lengths of the
structural alignments than from the structures identified as "hits".
Some other search services are listed here.
How to use SA-Search?
SA-Search expects as input a "query" and an "against" target, i.e. what
the query will be searched against.
The "query" as well as the "against" can be specified using
several formats that can be selected using the radio buttons.
Make sure that each entry is specified in the format of the selected
radio button.
- If you choose the PDBid format, simply enter a 4-character PDB code
such as "1tim". You can specify a chain by entering "1timA", A denoting
the chain selected in the 1tim file.
- If you upload a file, please note that the encoding into the SA
space cannot work across missing residues in the structure. Each
fragment longer than 4 residues will be encoded, but the searches will
be performed separately for each fragment. Also, note that the quality
of the structure has an impact on the quality of the encoding.
SA-Search will check that the geometry is suitable for encoding. In
some cases, this may lead to the rejection of structures of poor
stereochemical quality.
- If you choose a "query" against "bank" search, you can select
among several banks extracted from the PDB.
The proteins contained in each bank were selected from the complete
Protein Data Bank using a local version of an algorithm similar to that
employed to generate the culledPDB.
However, proteins with poor geometry, or proteins made of several
fragments, were discarded from the PDB before defining the lists. Also,
the lists are nested: all the chains in the 30% list are included in
the 50% list, and so on.
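The handling of uploaded files with missing residues, described above,
can be illustrated with a minimal sketch. It assumes that missing
residues show up as gaps in the residue numbering, which may differ from
the test actually used by SA-Search; the function name split_fragments
and its parameters are hypothetical and chosen only for this example.

```python
def split_fragments(residue_numbers, min_length=5):
    """Split a chain into continuous fragments at numbering gaps.

    Assumption: a missing residue appears as a jump in the residue
    numbering. Only fragments longer than 4 residues are kept, since
    shorter ones are not encoded.
    """
    fragments = []
    current = []
    for number in residue_numbers:
        # A break in the numbering closes the current fragment.
        if current and number != current[-1] + 1:
            fragments.append(current)
            current = []
        current.append(number)
    if current:
        fragments.append(current)
    return [frag for frag in fragments if len(frag) >= min_length]
```

Each fragment returned this way would then be encoded and searched
separately.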
Parameters of the search:
The search parameters mostly control the depth of the search that is
performed.
- Search method: currently, the search can be performed using a
standard Smith & Waterman algorithm, a faster version of it that does
not allow gaps, or a suffix tree (denoted as "exact words").
For the Smith & Waterman algorithm, the gap penalty is an internal
parameter of SA-Search. Its value has been adjusted to be compatible
with the substitution matrix.
- Minimal Match Length: the minimal size (in letters of the structural
alphabet) for a match to be returned.
- RMSd max: the maximal alpha-carbon best-fit RMSd. All matches with an
RMSd greater than this value will be discarded. Note that specifying a
value of 10. or more will disable the computation of RMSd.
- Minimal Score (0. - 1.): the minimal score for a match to be kept for
further analysis by RMSd computation. This score is normalized so that
it equals 1 for a perfect match of the fragment.
- Maximal number of matches: matches found beyond this number will not
be listed.
- New: Search using amino acid sequence
(instead of HMM encoding sequence): You can now alternatively apply the
different search algorithms to amino acid sequences instead of the HMM
encoding of the structures. Using this facility is similar to
performing a classical amino acid sequence similarity search.
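To make the interplay of these parameters more concrete, here is a
minimal sketch of a Smith & Waterman local alignment over strings of
letters, followed by filtering on match length, normalized score and
number of matches. It is not the SA-Search implementation: the scoring
function toy_score, the gap penalty of -2, the normalization by the
score of the query aligned with itself, and all function names are
assumptions made for this example. The RMSd filter is left out.

```python
def toy_score(a, b):
    """Hypothetical substitution score: +2 if identical, -1 otherwise."""
    return 2 if a == b else -1


def smith_waterman(query, target, score=toy_score, gap=-2):
    """Best local alignment with a linear gap penalty.

    Returns (raw_score, (q_start, q_end), (t_start, t_end)), with
    0-based, end-exclusive positions.
    """
    n, m = len(query), len(target)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, end = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + score(query[i - 1], target[j - 1]),
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            if H[i][j] > best:
                best, end = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = end
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(query[i - 1], target[j - 1]):
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, end[0]), (j, end[1])


def search(query, bank, min_length=4, min_score=0.5, max_matches=10):
    """Return (name, normalized_score, start, end) hits, best first.

    Positions in the target are 1-based and inclusive.
    """
    if not query:
        raise ValueError("empty query")
    # Assumed normalization: 1 for the query aligned with itself.
    perfect = sum(toy_score(c, c) for c in query)
    hits = []
    for name, letters in bank.items():
        raw, (qs, qe), (ts, te) = smith_waterman(query, letters)
        normalized = raw / perfect
        if qe - qs >= min_length and normalized >= min_score:
            hits.append((name, normalized, ts + 1, te))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:max_matches]
```

In this sketch, bank is a plain dictionary mapping a name to its letter
string, for example search("ABCDEF", {"x": "ZZABCDEFZZ"}). Raising
min_score or lowering max_matches shortens the list of hits, in the
same spirit as the parameters above.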
Results:
The results are returned either in the NBRF/PIR format (since it
allows the location of the matches in the structures to be specified,
which might be of interest if you want to get the superimposed
structures) or in a raw format (one match per line). Note that the
returned residue numbers currently start from 1 for the first residue
of the chain.
For each match, we return the PDB and SCOP Ids, the name of the
compound, the amino-acid and HMM-SA sequence identities, the
alpha-carbon best-fit RMS deviation, and the aligned sequences. For
fragmented PDB files, the fragments are numbered from 1, and the PDB id
will be of the form: PDBid+PDBChn+_+Fragment number (1qmnA_1 for
example).
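When post-processing the results, the identifiers can be split back
into their parts. The sketch below assumes a 4-character PDB code
starting with a digit, an optional one-character chain, and an optional
fragment number after an underscore; the pattern and the function name
parse_match_id are hypothetical and not part of SA-Search.

```python
import re

# Assumed layout: PDBid (4 characters) + optional chain + optional _N.
ID_PATTERN = re.compile(r"([0-9][A-Za-z0-9]{3})([A-Za-z0-9])?(?:_(\d+))?")


def parse_match_id(identifier):
    """Return (pdb_id, chain, fragment) from e.g. '1tim' or '1qmnA_1'.

    chain and fragment are None when absent; fragment is numbered
    from 1, as in the returned results.
    """
    match = ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError("unrecognized identifier: %r" % identifier)
    pdb_id, chain, fragment = match.groups()
    return pdb_id, chain, int(fragment) if fragment else None
```

The same function also accepts the query forms "1tim" and "1timA"
described earlier.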