Swelfe - Help
SIM algorithm applied to the search for internal repeats in DNA and amino acid sequences and in 3D structures
See also the README distributed with Swelfe.
The command line should look like this:
./swelfe -[s|a|b] file -P [0|1|2] options ...
Warning: options are case-sensitive (the same letter in upper and lower case denotes two different options).
For amino acid sequences, Swelfe reads Blosum62 as the default matrix, so check that this file is in the same directory as the binary.
You have to provide a file with sequence(s) or structure(s).
-s : file containing sequence(s) in FASTA format or structure(s) in PDB format
Files can contain several structures or sequences.
For PDB files, the program does not report the presence of several chains: all chains are concatenated and repeats are computed as if there were a single chain.
Specify the type of input :
- -P 0 for structures
- -P 1 for amino acid sequences
- -P 2 for DNA sequences
- Example for structure :
./swelfe -s 1MP9A.pdb -P 0
- Example for Amino Acid sequence :
./swelfe -s 1MP9A.prt -P 1
- Example for DNA sequence :
./swelfe -s 1MP9A.dna -P 2
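When Swelfe is run on many files, it can help to assemble the command line in a script. The following is a minimal sketch, assuming the binary is at ./swelfe; the function names and the INPUT_TYPES mapping are made up for illustration, and only options documented on this page (-s, -P, -l, -i, -j) are used.

```python
import subprocess

# Values of the -P option as documented above.
INPUT_TYPES = {"structure": 0, "protein": 1, "dna": 2}


def build_swelfe_command(input_file, input_type, binary="./swelfe",
                         concise=False, stdout_file=None, stderr_file=None):
    """Assemble an argument list using only options listed on this page."""
    if input_type not in INPUT_TYPES:
        raise ValueError(f"unknown input type: {input_type!r}")
    cmd = [binary, "-s", input_file, "-P", str(INPUT_TYPES[input_type])]
    if concise:
        cmd.append("-l")              # concise (one line) output
    if stdout_file is not None:
        cmd += ["-i", stdout_file]    # file for standard output
    if stderr_file is not None:
        cmd += ["-j", stderr_file]    # file for error output
    return cmd


def run_swelfe(*args, **kwargs):
    """Run Swelfe; the binary (and Blosum62 for proteins) must be present."""
    return subprocess.run(build_swelfe_command(*args, **kwargs), check=True)
```

For example, build_swelfe_command("1MP9A.pdb", "structure") produces the same arguments as the structure example above.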
Optional parameters for both sequences and structures
Important: for sequences, the p-value is computed only if you select the p-value option.
Minimum score
A minimum score is useful to filter structural repeats and is applied by default for structures (350), which is equivalent to a match of 7 or 8 perfectly aligned residues.
You can choose the minimum score yourself, or let the program use the default minimum score (for sequences this replaces the p-value; for structures nothing changes).
Default values: structures: 350; amino acid sequences: 30; DNA sequences: 15
Number of repeats
You can choose how many repeats you want to obtain. This option makes it possible to detect small repeats that are not significant or whose score is lower than the statistical threshold.
Minimum length of repeats
You can choose the minimum length for repeats.
Gap opening and extension penalty
-G gap opening penalty
-g gap extension penalty
Default values: structures (200, 50); amino acid sequences (8, 3); DNA sequences (4, 1)
Ratio of tolerated overlap between the two repeats A and A' (between 0 and 1, default 0)
-i file1 : file to which standard output is written (default: screen)
-j file2 : file to which error output is written (default: screen)
-l : concise output (one line)
Optional parameters for sequences only
-N p-value (between 0 and 1; values close to 1 are not significant)
This option checks that the scores of repeats are statistically significant, using the method of Waterman and Vingron (1994).
The smaller the p-value, the more significant the repeat.
The default value is 0.01.
How many random sequences should be used to compute significance of scores?
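To give an idea of why the number of random sequences matters, here is a simplified sketch of an empirical p-value. It is not the Waterman and Vingron method used by Swelfe: it simply counts how many shuffled sequences score at least as high as the real one. The function names are made up for illustration, and the scores of the shuffled sequences are assumed to come from whatever alignment scoring you apply to them.

```python
import random


def shuffled_copies(sequence, n_random, seed=0):
    """Yield n_random shuffled versions of sequence (same composition)."""
    rng = random.Random(seed)
    letters = list(sequence)
    for _ in range(n_random):
        rng.shuffle(letters)
        yield "".join(letters)


def empirical_p_value(observed_score, random_scores):
    """Fraction of random scores at least as high as the observed score.

    The +1 terms prevent the estimate from being exactly 0.
    """
    random_scores = list(random_scores)
    hits = sum(1 for score in random_scores if score >= observed_score)
    return (hits + 1) / (len(random_scores) + 1)
```

With this estimator the smallest reachable value is 1 / (n + 1) for n random sequences, so at least 99 random sequences would be needed before any repeat could reach 0.01.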
For amino acid sequences, you can supply the scoring matrix you prefer. By default, BLOSUM62 is used. Several matrices are provided for download.
Optional parameters for structures only
Maximum Relative RMSd
The Relative RMSd is calculated after the SIM alignment and after the best superimposition of the two repeats. It checks that the two repeats superimpose well. The RRMSd is independent of the length of the repeat (Betancourt & Skolnick 2001).
By default, the maximum RRMSd is 0.5. Repeats above this threshold are not shown.
-e size in Angstrom
Extend alignments to residues that are less than x Angstrom apart.
Remove overlapping repeats
For structures with long or numerous alpha helices, or for very repetitive sequences, successive repeats can overlap each other (A overlaps B and A' overlaps B'). A Python program is available that removes repeats overlapping each other by more than X %.
This program operates on Swelfe results produced with the -l option (concise output) only. You can download it with the binaries.
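The idea behind this filtering can be sketched as follows. This is not the program distributed with Swelfe, only an illustration under stated assumptions: a repeat is a dictionary with the hypothetical keys start1, len1, start2 and len2, the overlap is measured as a fraction of the shorter segment, and repeats are examined in the order given, so that an earlier repeat is kept in preference to a later one.

```python
def overlap_fraction(start_a, len_a, start_b, len_b):
    """Overlap of two segments as a fraction of the shorter one (assumed definition)."""
    end_a, end_b = start_a + len_a, start_b + len_b
    shared = max(0, min(end_a, end_b) - max(start_a, start_b))
    shortest = min(len_a, len_b)
    return shared / shortest if shortest > 0 else 0.0


def repeats_overlap(rep, other, max_fraction):
    """True if A overlaps B and A' overlaps B' by more than max_fraction."""
    first = overlap_fraction(rep["start1"], rep["len1"],
                             other["start1"], other["len1"])
    second = overlap_fraction(rep["start2"], rep["len2"],
                              other["start2"], other["len2"])
    return first > max_fraction and second > max_fraction


def filter_overlapping(repeats, max_fraction):
    """Keep each repeat unless it overlaps an already kept repeat too much."""
    kept = []
    for rep in repeats:
        if not any(repeats_overlap(rep, k, max_fraction) for k in kept):
            kept.append(rep)
    return kept
```

Here max_fraction plays the role of X, expressed between 0 and 1 rather than as a percentage.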
How to read results with concise output (-l option)
- Lines beginning with '@' : reminder of the parameters used / short explanation of the columns
- Lines beginning with '#' : alignment of sequences, or of angles for structures
- Lines beginning with '==' : sequence alignment for structures
- Other lines : details of repeats :
- column 1 : sequence/structure name
- column 2 : sequence/structure name
- column 3 : numbering of repeat
- column 4 : beginning of repeat 1 (numbering starts at 0)
- column 5 : beginning of repeat 2 (numbering starts at 0)
- column 6 : length of protein
- column 7 : score
- column 8 : RMS (for structure), p-value (for sequences)
- column 9 : overlap between the 2 repeats
- column 10 : length of alignment
- column 11 : length of repeat 1 (without gap)
- column 12 : length of repeat 2 (without gap)
- column 13 : blosum score for structures
- column 14 : Gerstein and Levitt score (1998)
- column 15 : compatible groups (2 repeats A and B are compatible if superimposing A on A' also superimposes B on B')
- column 16 : beginning of repeat 1 with PDB numbering
- column 17 : beginning of repeat 2 with PDB numbering
- column 18 : end of repeat 1 with PDB numbering
- column 19 : end of repeat 2 with PDB numbering
- column 20 : Relative RMSD (Betancourt & Skolnick 2001)
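The column list above can be turned into a small reader for the concise output. This is a minimal sketch, assuming the columns are separated by whitespace (the separator is not stated on this page); the key names in COLUMNS are made up for illustration. Values are kept as strings, and a line with fewer than 20 columns simply yields fewer keys.

```python
# One illustrative key per column, in the order listed above.
COLUMNS = [
    "name1", "name2", "repeat_number", "start1", "start2", "protein_length",
    "score", "rms_or_pvalue", "overlap", "alignment_length", "len1", "len2",
    "blosum_score", "gerstein_levitt_score", "compatible_group",
    "pdb_start1", "pdb_start2", "pdb_end1", "pdb_end2", "rrmsd",
]


def parse_concise_output(lines):
    """Return one dict per repeat line; skip '@', '#' and '==' lines."""
    repeats = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("@", "#", "==")):
            continue
        fields = line.split()  # assumption: whitespace-separated columns
        repeats.append(dict(zip(COLUMNS, fields)))
    return repeats
```

A file written with -l can then be read with parse_concise_output(open(path)), after which each repeat can be filtered on, for instance, float(repeat["score"]).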