Swelfe - Help

SIM algorithm applied to the search for internal repeats in DNA and amino acid sequences and in 3D structures

See also the README distributed with Swelfe.

Your command line must follow this pattern :
./swelfe -[s|a|b] file -P [0|1|2] options ...

Warning : options are case-sensitive; the upper-case and lower-case forms of a letter are different options !!
For amino acid sequences, Swelfe reads BLOSUM62 as the default matrix, so check that this file is in the same directory as the binary.

Mandatory input

You have to provide a file with sequence(s) or structure(s).
-s : for sequence(s) (fasta format) or for structure(s) in PDB format

Files can contain several structures or sequences.
For a PDB file, the program does not report whether there are several chains; it computes repeats as if there were only one chain (all chains are concatenated).
Specify the type of input :
- -P 0 for structures
- -P 1 for amino acid sequences
- -P 2 for DNA sequences

Example for structure :
./swelfe -s 1MP9A.pdb -P 0
Example for Amino Acid sequence :
./swelfe -s 1MP9A.prt -P 1
Example for DNA sequence :
./swelfe -s 1MP9A.dna -P 2
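
When Swelfe is launched from a script, the three examples above can be produced by one small helper. The following Python sketch only assembles the argument list from the documented -s and -P options; the function name, the type labels and the extra_options parameter are illustrative assumptions, not part of Swelfe.

```python
# Input-type codes documented above for the -P option.
INPUT_TYPES = {"structure": 0, "protein": 1, "dna": 2}


def build_swelfe_command(input_file, input_type, binary="./swelfe", extra_options=()):
    """Return the argument list for one Swelfe run (nothing is executed here).

    'input_type' is one of the hypothetical labels in INPUT_TYPES;
    'extra_options' is passed through unchanged, e.g. ["-l"].
    """
    if input_type not in INPUT_TYPES:
        raise ValueError(f"unknown input type: {input_type!r}")
    return [binary, "-s", str(input_file), "-P", str(INPUT_TYPES[input_type]), *extra_options]


if __name__ == "__main__":
    command = build_swelfe_command("1MP9A.pdb", "structure")
    print(" ".join(command))
```

The returned list can be handed to subprocess.run(command, check=True) once the swelfe binary is present in the working directory.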

Optional parameters for both sequences and structures

Important information : if you do not select p-value for sequences, it will not be computed !!

Minimum score
-m minimum_score
This option is useful for filtering structural repeats and is applied by default with a value of 350,
which is equivalent to a match of 7 or 8 perfectly aligned residues. You can set a different minimum score.

-z
The program uses the default minimum score (for sequences, this replaces the p-value; for structures, nothing changes)
Default values : structures : 350 ; amino acid sequences : 30 ; DNA sequences : 15
Number of repeats
-n number
You can choose how many repeats you want to obtain. This option makes it possible to detect small repeats that are not significant or whose score is lower than the statistical threshold.
Minimum length of repeats
-L length
You can choose the minimum length for repeats.
Gap opening and extension penalty
-G gap opening penalty
-g gap extension penalty
Default values (opening, extension) : structures (200, 50) ; amino acid sequences (8, 3) ; DNA sequences (4, 1)
Overlap
-r value
Tolerated overlap ratio between the two copies A and A' of a repeat (between 0 and 1, default 0)
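
To make the idea of an overlap ratio concrete, here is a minimal Python sketch. It assumes that each copy is described by a start position and a length, and that the ratio is the shared length divided by the length of the shorter copy; the exact definition used inside Swelfe is not stated here, so this normalisation is an assumption.

```python
def overlap_ratio(start_a, length_a, start_b, length_b):
    """Fraction of the shorter segment that lies inside the other segment.

    Segments are half-open intervals [start, start + length).
    Normalising by the shorter segment is an assumption of this sketch.
    """
    end_a = start_a + length_a
    end_b = start_b + length_b
    shared = max(0, min(end_a, end_b) - max(start_a, start_b))
    shortest = min(length_a, length_b)
    return shared / shortest if shortest > 0 else 0.0
```

With this definition, two copies of length 10 starting at positions 0 and 5 share 5 positions and give a ratio of 0.5, which the default value 0 would reject.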
Output
-i file1 : standard output (default screen)
-j file2 : error output (default screen)
-l : concise output (one line)

Optional parameters for sequences only

P-value
-N p-value (between 0 and 1, 1 is a bad p-value)
This option checks that the scores of repeats are statistically significant, using the method of Waterman and Vingron (1994).
The smaller the p-value, the more significant the repeat. The default value is 0.01.
Random sequences
-D number
How many random sequences should be used to compute significance of scores?
Matrix
-x matrix
For amino acid sequences, you can supply the scoring matrix you prefer. By default, BLOSUM62 is used. Several matrices are provided for download.

Optional parameters for structures only

Maximum Relative RMSd
-u MaxRRMSd
The Relative RMSd (RRMSd) is calculated after the SIM alignment and after the best superimposition of the two repeats. It checks that the two repeats superimpose well. The RRMSd is independent of the length of the repeat (Betancourt & Skolnick 2001).
By default the Maximum RRMSd is 0.5. Repeats above this threshold are not shown.
Extend alignments
-e size in Angstrom
Extend alignments to residues that are less than the given distance (in Angstrom) apart.

Remove overlapping repeats

For structures with long or many alpha helices, or for very repetitive sequences, successive repeats can overlap each other (A overlaps B and A' overlaps B'). A Python program is available that removes repeats that overlap each other by more than X %.
This program works on Swelfe results (concise output only, -l option). You can download it with the binaries.
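
The filtering idea can be sketched as follows. This is not the program distributed with Swelfe, only an illustration under stated assumptions: each repeat is given by the start and length of its two copies plus a score, the highest-scoring repeats are kept first, and a repeat is dropped when both of its copies overlap the corresponding copies of an already kept repeat by more than the chosen fraction (measured against the shorter segment). The names Repeat and remove_overlapping are hypothetical.

```python
from typing import NamedTuple


class Repeat(NamedTuple):
    start1: int    # start of copy A
    length1: int   # length of copy A
    start2: int    # start of copy A'
    length2: int   # length of copy A'
    score: float


def _overlap_fraction(start_a, length_a, start_b, length_b):
    """Shared length divided by the shorter segment (assumed definition)."""
    shared = max(0, min(start_a + length_a, start_b + length_b) - max(start_a, start_b))
    shortest = min(length_a, length_b)
    return shared / shortest if shortest > 0 else 0.0


def remove_overlapping(repeats, max_overlap=0.5):
    """Greedily keep the best-scoring repeats and drop those that overlap them."""
    kept = []
    for rep in sorted(repeats, key=lambda r: r.score, reverse=True):
        clash = any(
            _overlap_fraction(rep.start1, rep.length1, k.start1, k.length1) > max_overlap
            and _overlap_fraction(rep.start2, rep.length2, k.start2, k.length2) > max_overlap
            for k in kept
        )
        if not clash:
            kept.append(rep)
    return kept
```

Requiring both copies to overlap mirrors the situation described above (A overlaps B and A' overlaps B'); a stricter filter could drop a repeat as soon as one copy overlaps.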

How to read results with concise output (-l option) ?

Lines beginning with '@' : reminder of the parameters used / short explanation of the columns
Lines beginning with '#' : sequence or angles (for structures) alignment
Lines beginning with '==' : sequence alignment for structures
Other lines : details of repeats :
- column 1 : sequence/structure name
- column 2 : sequence/structure name
- column 3 : numbering of repeat
- column 4 : beginning of repeat 1 (numbering starts at 0)
- column 5 : beginning of repeat 2 (numbering starts at 0)
- column 6 : length of protein
- column 7 : score
- column 8 : RMS (for structures), p-value (for sequences)
- column 9 : overlap of the 2 repeats
- column 10 : length of alignment
- column 11 : length of repeat 1 (without gap)
- column 12 : length of repeat 2 (without gap)
- column 13 : blosum score for structures
- column 14 : Gerstein and Levitt score (1998)
- column 15 : compatible groups (2 repeats A and B are compatible if superimposing A on A' also superimposes B on B')
- column 16 : beginning of repeat 1 with PDB numbering
- column 17 : beginning of repeat 2 with PDB numbering
- column 18 : end of repeat 1 with PDB numbering
- column 19 : end of repeat 2 with PDB numbering
- column 20 : Relative RMSD (Betancourt & Skolnick 2001)
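
The column list above can be turned into a small reader for the concise output. The Python sketch below skips the lines beginning with '@', '#' and '==' and maps the remaining fields to names. It assumes that fields are separated by whitespace and that no field contains a space, which is not stated in this help; the field names are illustrative labels chosen for this sketch.

```python
# Illustrative names for columns 1 to 20, in the order listed above.
COLUMNS = [
    "name_1", "name_2", "repeat_number", "start_1", "start_2",
    "protein_length", "score", "rms_or_pvalue", "overlap", "alignment_length",
    "length_1", "length_2", "blosum_score", "gerstein_levitt_score",
    "compatible_groups", "pdb_start_1", "pdb_start_2", "pdb_end_1",
    "pdb_end_2", "relative_rmsd",
]


def parse_concise_output(lines):
    """Return one dict per repeat line; values are kept as strings.

    Lines starting with '@', '#' or '==' (parameters and alignments)
    and empty lines are skipped. Whitespace-separated fields are an
    assumption of this sketch.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("@", "#", "==")):
            continue
        # zip() stops at the shorter sequence, so lines with fewer
        # columns simply yield fewer keys.
        records.append(dict(zip(COLUMNS, line.split())))
    return records
```

Values are left as strings so that the caller decides how to convert them, for example float(record["score"]) for the score.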
