_______________________________________________

              THREADLIZE 2000.1

       F. Pazos; A. Valencia; B. Rost
            Protein Design Group
       CNB-CSIC - EMBL/CUBIC/Lion
_______________________________________________


 0. Installation.
 1. Configuration.
 2. Using "threadlize".
    2.1. Command line.
    2.2. The interface.
 3. Interactive alignment representation.
 4. Direct opening from within the web browser.
 5. Interface to other threading programs.
    5.1. External filters.
    5.2. One single protein.
    5.3. A list of hits.
 6. References.
 7. Bugs/problems.


0. Installation.
----------------

- Uncompress and unpack the distribution:
  "gunzip threadlize.tar.gz; tar xvf threadlize.tar".

- In your shell configuration file ("~/.cshrc" for csh/tcsh users;
  "~/.bash_profile" for bash users):

  - Define the $THREADLIZE environment variable with the path where you
    have unpacked the files. Examples:

      setenv THREADLIZE /directory/threadlize     (for csh/tcsh).
      THREADLIZE=/directory/threadlize            (for bash).

  - Add this path to your $PATH environment variable. Examples:

      set path = ( $path /directory/threadlize)   (for csh/tcsh).
      PATH=$PATH:/directory/threadlize            (for bash).

  - In some shells (bash, for example) you also have to export the
    variable (export THREADLIZE).


1. Configuration.
-----------------

Each user of the package has to copy "$THREADLIZE/dot_threadlize" to
"$HOME/.threadlize" and edit it to configure the package. The same can
be done by running the "threadlize_conf" program, which asks about your
local environment, tests it and creates the "$HOME/.threadlize" file.
Take a look at this file.

To change the configuration later, run "threadlize_conf" again or edit
"$HOME/.threadlize" by hand, changing the values of the variables
according to your local environment and needs. The first 4 characters
of a line are the variable identifier. The rest of the line contains
the value. Every line whose first 4 characters don't match a variable
name is ignored.

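The line format described above can be illustrated with a small Python
sketch. It is not part of the package and only shows one way to read
the file. The variable names are the ones mentioned in this document
(the real file may define more), the sample values are hypothetical,
and stripping blanks around the value is an assumption.

```python
from typing import Dict, Iterable, List

# Variable names mentioned in this document; the real file may define more.
KNOWN_NAMES = {"PDBF", "GENV", "WEBB", "DICC", "RACL", "RACM", "RACC"}


def parse_threadlize_config(lines: Iterable[str],
                            known_names=KNOWN_NAMES) -> Dict[str, List[str]]:
    """Collect the values found for each 4-character variable identifier."""
    values: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        name, rest = line[:4], line[4:]
        if name not in known_names:
            continue  # lines that do not start with a variable name are ignored
        # Assumption: blanks around the value are not significant.
        values.setdefault(name, []).append(rest.strip())
    return values


if __name__ == "__main__":
    # Hypothetical sample content, for illustration only.
    sample = [
        "This line is ignored because it starts with no variable name",
        "GENV rasmol",
        "WEBB netscape",
    ]
    for name, vals in parse_threadlize_config(sample).items():
        print(name, vals)
```

A variable that appears on several lines (the "RACL", "RACM" and "RACC"
entries can be repeated) keeps all its values, in file order.
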
For database file retrieval you can use either locally installed
databases (plain or compressed) or access them remotely using an
external web fetching program ("webget", for example). See the examples
in the configuration file for retrieving PDB files (PDBF variable).
When files are accessed remotely, a copy of each retrieved file is kept
in the working directory.

See point "4." for instructions on how to configure the "open from
within the web browser" capability.

If the background of the interfaces has a strange colour or you cannot
distinguish letters and graphics, add the following lines to your
$HOME/.Xdefaults file:

    ! Threadlize background settings
    rasctrl*background: white
    threadlize*background: white


2. Using "threadlize".
----------------------

2.1. Command line.
------------------

This program is intended to be a front-end to interactively analyse the
results of fold-recognition programs.

** Usage: threadlize file[s]

- PHD/TOPITS (http://cubic.bioc.columbia.edu/predictprotein/)
    Usage: threadlize file.hssp_topits 2Dfile(hssp_topits||phd) file.strip_topits
       or  threadlize file

- 3D-PSSM (http://www.bmm.icnet.uk/~3dpssm/)
    Usage: threadlize file

- SAM(HMM) (http://www.cse.ucsc.edu/research/compbio/HMM-apps/model-library-search.html)
    Usage: threadlize file [sec_str_pred(PHD_or_HSSP_format)]

- Threader2 (http://globin.bio.warwick.ac.uk/~jones/threader.html)
    Usage: threadlize file.out(sorted&&cut) file.aln [sec_str_pred(PHD_or_HSSP_format)]

- UCLA-DOE (http://fold.doe-mbi.ucla.edu/)
    Usage: threadlize file(HTML)

For 3D-PSSM and SAM(HMM) just use as input the file returned by e-mail
by those servers.

For UCLA-DOE the results are returned as a web page. Save that page as
"source", not "text", and use that file as input for "threadlize".

For TOPITS, the program takes as input the topits "hssp" and "strip"
files and, optionally, a "phd" file.
It can also take as input, on its own, the file returned by e-mail by
the PHD server upon a "Prediction based threading" request. In this
case the 2 options "TOPITS: return topits hssp" and "TOPITS: return
topits strip" must be checked in the server, and neither "return HTML
formatted" nor "return multiple sequence alignment in HSSP format" may
be checked. So, use:

    threadlize file.hssp_topits 2Dfile(hssp_topits||phd) file.strip_topits
 or
    threadlize emailed_file

With the first option, the second argument is the file from which the
2D prediction is read. It can be the "hssp_topits" file (same as the
first argument) or a "phd" file.

For Threader2 two files must be provided: the one containing the
alignments (file.aln, see Threader2 documentation) and the one
containing the hits (file.out). The latter must be sorted in descending
order (by the zscore value, for example) and contain only the hits you
want to inspect. For example, if you want to analyse the best 20 hits,
you can enter:

    sort -n -r +12 file.out |head -20 >threader.out.sort.cut

and use "threader.out.sort.cut" as input for "threadlize". Many current
versions of "sort" no longer accept the "+12" form; the equivalent key
specification is "sort -n -r -k 13 file.out". See Threader2
documentation.

For files produced by external filters use:

    threadlize file

For Threader2, SAM(HMM) and files coming from external filters, you can
include another (optional) argument with the secondary structure
prediction in PHD, HSSP or TOPITS format.

You can find additional external filters to convert files from other
threading programs in the "threadlize" web page
(http://www.cnb.uam.es/~pazos/threadlize/). See "5.1. External filters".

NOTE: When taking as input files coming from different
servers/programs (for example, visualizing Threader2 results in
combination with a PHD secondary structure prediction), the program
could stop with a message similar to "both sequences don't match".
Sometimes this is due to insertions in the PHD file.
To avoid it, give as input to PHD a sequence without insertions or a
multiple sequence alignment without insertions in the MASTER SEQUENCE.
This will be fixed in the next versions.


2.2. The interface.
-------------------

Once started, the X-display shows a list with the hits (up to 400) and
their "zscore" values. When you click on a hit, its alignment
parameters (%IDE, %SIM, fold type, etc., depending on the program) and
the header of its PDB file are shown.

- "Quit" button: Exits the program after asking for confirmation.

- "View Structure" button: Shows the structure of the selected hit
  using the PDB viewer specified by the "GENV" variable in the
  configuration file.

- "More" menu: Intended to extend the program's capabilities. At the
  moment it contains:

  - "SCOP entry": Shows the SCOP record for this hit via the WWW, using
    the WWW browser specified by the "WEBB" variable in the
    configuration file.

  - "FSSP entry": Shows the FSSP record for this hit via the WWW.

  - "What is": Calls a WWW cell biology dictionary. It's intended to be
    used for unknown words (protein function, etc.) in the PDB header.
    To use it: (1) select the word you want to search for (by dragging
    with the mouse); (2) select "What is" in the "More" menu; (3) paste
    the word in the dialog box and press "Ok". This will search the
    dictionary for the word. The URL of the dictionary is controlled by
    the "DICC" variable in the configuration file.

Display options:

- "acc" toggle button: Turns the representation of predicted
  accessibility ON and OFF in the "Interactive Alignment" (see below).

- "Regions" menu: Exports information on PROSITE or PRODOM regions to
  the "Interactive Alignment" (see below). Only for TOPITS files
  containing such information.

- "Interactive Alignment" button: Starts the interactive alignment
  representation (see next point).

- "Sublist": Intended to work with a set of hits separately, for
  example, to look at structural coincidences among them.
  - The ">>" and "<<" buttons add and remove the selected items from
    the sublist.

  - "Clear": Clears the entire sublist.

  - "SCOP str. class": Displays the structural class of the proteins in
    the list according to the SCOP classification, so that the classes
    can be compared. The SCOP classes of the PDB proteins are listed in
    the file "scop.table", which comes with the distribution. This file
    includes a WWW URL to download an updated version of it.

  - "FSSP str. class": Displays the structural class of the proteins in
    the list according to the FSSP classification.

  - "FSSP clusters": Displays a matrix showing FSSP data (zscore,
    % aligned, number of segments) for the FSSP alignment of all the
    possible pairs among the proteins.

  - "FSSP str. alignment": Graphically displays, using "RasMol", the
    structural alignment (if it exists) among the proteins in the list.

  - "Print/Save alignments" button: Generates representations of the
    threading alignments for all the proteins in the sublist and sends
    them to a printer or to PostScript files. The format of this
    representation is the same as in the "interactive alignment
    representation" (see below). You will be asked whether to include
    the 2DRel and/or additional data representations; answer with a
    2-letter code. For example: "BB" shows both as bars; "L-" shows
    only 2DRel, as lines; etc. If you select "save" instead of "print",
    the alignments are saved as PostScript files named "xxxx_aln.ps",
    where "xxxx" is the PDB code of each protein in the sublist.


3. Interactive alignment representation.
----------------------------------------

The control window shows an interactive representation of the threading
alignment between the master sequence and the selected hit. The upper
sequence is the master sequence (unknown structure) and the lower
sequence is the selected threading hit (crystallized structure).
The numbering of both sequences is shown too; the numbering of the
crystallized structure includes the chain identifier. Matching residues
in the alignment are marked. The predicted secondary structure of the
master sequence and the secondary structure of the hit are shown. The
scroll-bar at the bottom is used to move along the alignment.

The "RasMol 2.6" window shows the structure of the hit and a
representation of the master sequence aligned on it.

- Thin backbone: regions of the crystallized structure not covered by
  the alignment.

- Thick backbone: covered regions.

- Very thick bonds between two residues (like capsules): regions of the
  alignment with an insertion in the master sequence. In other words,
  they mark the beginning and the end of a piece of the master sequence
  not aligned to the structure.

- Little points: atoms of residues predicted to be accessible.

- Yellow region: predicted beta-strands.

- Red region: predicted alpha-helix.

- Other features like "prosite" patterns or "prodom" regions can be
  shown, usually marked in blue and labeled with the name of the
  pattern/region.

- Additionally, the user can mark other features.

Both windows, the control one and the "RasMol 2.6" one, can communicate
events to each other:

- Clicking a residue in the "RasMol 2.6" window marks it in the control
  window.

- Clicking a residue in the control window (or selecting a set of them
  by dragging with the mouse) marks it in the "RasMol 2.6" window.

Controls in the alignment window:

- "Quit" button: Exits the representation.

- "Reset" button: Redraws the original representation, undoing all
  changes.

- "Save/Export" menu: Exports either of the two representations (the
  RasMol one and/or the alignment one) to GIF, PostScript, RasMol
  script, ASCII files, etc. A saved RasMol script lets you display the
  model later outside the program: enter "rasmol -script saved_script".

- "Num" toggle button: Turns the numbering ON and OFF.
- "AddData" menu:

  - "2DRel": Turns the representation of the secondary structure
    prediction reliability ON and OFF, and switches it between lines
    and bars representation.

  - Additional data elements: The program can read up to 4 files with
    residue scales (hydrophobicity, beta-sheet propensity, etc.). It
    looks for files called "1.add", "2.add", "3.add" and "4.add" in the
    working directory and, if they exist, incorporates their
    information. The format of these files is very simple (see the file
    "x.add" that comes with the distribution): the first line is a
    label to identify the scale in the "AddData" menu; each of the
    remaining lines contains a residue number and its value. The
    "AddData" menu will contain as many elements as were read. You can
    turn the representation of each element ON and OFF and also switch
    between bars and lines. Each element has a different colour in
    order to identify it. The "1,2,3,4.add" files can be shared with
    the "plotcorr" program, and they can be written by hand or
    generated by automatic programs/www-servers that take the sequence
    as input.

- "Pairs" menu: If a file called "tmp_pairs" containing a set of
  residue pairs is present in the working directory, the pairs are read
  and their equivalent residues in the structure are joined with lines.
  The "tmp_pairs" file can be generated with the "plotcorr" program, so
  you can represent the pairs equivalent to the correlated ones. See
  the file "tmp_pairs.example" for the format. The first column is a
  value associated with each pair that is not read by this program;
  it's included to maintain the format of the file and share it with
  applications where it has a meaning, like "plotcorr", where it's the
  correlation coefficient.

- "Residues" button: The same for residues: it searches for a file
  called "tmp_residues" (see "tmp_residues.example" for the format) and
  marks these residues in the control window. Again, "tmp_residues" can
  be generated by "plotcorr" (conserved residues).
- "Regions" button: The same for regions: it searches for a file called
  "tmp_regions" (see "tmp_regions.example" for the format). These
  regions are marked and labeled in the RasMol window and in the
  alignment window. "threadlize" can generate this file with the
  PROSITE or PRODOM regions if they are written in the input file (for
  TOPITS).

- "RasMol comm." menu: You can send commands to the running RasMol with
  this menu. Its items are user-configurable: you can add commands you
  use often by adding "RACL", "RACM" and "RACC" lines to the
  configuration file "$HOME/.threadlize". These new commands will
  appear in the menu.

- The black string entry widget below the scroll bar is for entering
  "RasMol" commands which aren't in the "RasMol comm." menu. Just type
  a valid "RasMol" instruction and press [Enter].

NOTE. Always use the "Quit" button in the alignment window to exit.
This will close the alignment window and send "RasMol" a command to
close it. DON'T close the windows with the window control button.
DON'T close "RasMol" alone.


4. Direct opening from within the web browser.
----------------------------------------------

The PHD server can return an active HTML page, and you can open the
"Interactive Alignment" program (+RasMol) from your web browser just by
clicking on the TOPITS hit you want to inspect. For that you have to
add the following to your "$HOME/.mailcap" file:

    aplication/topitsview; rasctrl_fcgi %s

You have to close and re-open your web browser for this to take effect.


5. Interface to other threading programs.
-----------------------------------------

5.1. External filters.
----------------------

Some filters written by other people are available. Take a look at the
updated list of available filters at
http://www.cnb.uam.es/~pazos/threadlize.

The package can also be used to interactively represent the results of
threading programs for which neither built-in nor external filters
exist, using the formats described in points "5.2." and "5.3.".


5.2. One single protein.
------------------------

The program can read any alignment between a problem sequence and a
crystallized protein provided it is written in a file with the
following format (see the file "to_rasmol.aln.example" in the "DOCs"
directory):

    +--------- Length of the problem sequence (5 characters justified (%5d)).
    |
    |   +----- PDB file of the aligned 3D structure.
    |   |
_________________________________
  286 example.pdb
    1 M   |     0:B     | 0 0 9
    2 G H |     0:B     | 7 0 2
    3 Q H |   340:B K H | 9 0 0
    4 A H |   341:B A H | 9 0 0
    5 L H |   342:B L G | 9 0 0
....
   16 V H |   365:B V   | 8 0 1
   17 D H |   365:B .   | 7 0 1
   18 E H |   365:B .   | 8 0 1
   19 E H |   365:B .   | 8 0 1
   20 G H |   366:B G   | 8 0 1
   21 L H |   367:B L   | 8 0 1
   22 F H |   368:B P   | 4 2 3
....
   32 R E |   391:B A H | 0 4 4
   33 Q E |   392:B R H | 1 5 3
   34 V E |   393:B R H | 1 5 3
....
_________________________________
    | | |       | | | |   |
    | | |       | | | |   +------ Predicted 2D reliabilities (0-9). Use 0 if there is
    | | |       | | | |           no secondary structure prediction.
    | | |       | | | +---------- Secondary structure of the 3D structure (or space).
    | | |       | | +------------ Aminoacid in the 3D structure ('.' if no residue in
    | | |       | |               the 3D structure is aligned by the threading program
    | | |       | |               for this position in the problem sequence).
    | | |       | +-------------- Chain of the aligned 3D structure.
    | | |       +---------------- Residue number of the aligned 3D structure.
    | | |                         For insertions ("." as residue in the 3D structure)
    | | |                         use the number of the last aligned residue. See in
    | | |                         the example the insertion 17-19.
    | | +------------------------ Predicted secondary structure of the problem sequence.
    | |                           Use space if there is no 2D prediction.
    | +-------------------------- Aminoacid in the problem sequence.
    +---------------------------- Residue number in the problem sequence.

All residues in the problem sequence must be present. For the ones not
aligned use "." in the corresponding residue of the 3D structure.

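As an illustration of the record layout described above, the following
Python sketch reads the header line and one alignment record. It is not
part of the package. It assumes that the three fields of a record are
separated by "|" and that blanks separate the items inside each field;
the real file may be fixed-width, so check "to_rasmol.aln.example"
before relying on it. The names "AlignedResidue", "parse_header" and
"parse_record" are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AlignedResidue:
    seq_num: int              # residue number in the problem sequence
    seq_aa: str               # aminoacid in the problem sequence
    pred_ss: str              # predicted secondary structure (" " if none)
    pdb_num: int              # residue number in the 3D structure
    chain: str                # chain of the 3D structure
    pdb_aa: str               # aminoacid in the 3D structure ("." if none)
    pdb_ss: str               # secondary structure of the 3D structure
    reliabilities: List[int]  # predicted 2D reliabilities (0-9)


def parse_header(line: str) -> Tuple[int, str]:
    """Return (length of the problem sequence, name of the PDB file)."""
    length_text, pdb_file = line.split(None, 1)
    return int(length_text), pdb_file.strip()


def parse_record(line: str) -> AlignedResidue:
    """Split one record on "|" and then on blanks (an assumption)."""
    left, middle, right = (part.split() for part in line.split("|"))
    pdb_num_text, _, chain = middle[0].partition(":")
    return AlignedResidue(
        seq_num=int(left[0]),
        seq_aa=left[1],
        pred_ss=left[2] if len(left) > 2 else " ",
        pdb_num=int(pdb_num_text),
        chain=chain,
        # Assumption: an empty aminoacid column is treated like ".".
        pdb_aa=middle[1] if len(middle) > 1 else ".",
        pdb_ss=middle[2] if len(middle) > 2 else " ",
        reliabilities=[int(value) for value in right],
    )


if __name__ == "__main__":
    print(parse_header("  286 example.pdb"))
    for text in ("    3 Q H |   340:B K H | 9 0 0",
                 "   17 D H |   365:B .   | 7 0 1"):
        print(parse_record(text))
```

Records whose "pdb_aa" is "." are the positions of the problem sequence
that are not aligned to the structure, like the insertion 17-19 in the
example.
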
Once this file is created, save it with the name "to_rasmol.aln" and
run:

    rasctrl &

Then you can work as described in "Interactive alignment
representation".

Files of this kind can be generated from the alignment window with
"Save/Export" -> "Alignment/vertical" and then edited.


5.3. A list of hits.
--------------------
- test version -

The program can also read a list of sequence/structure alignments: the
same sequence threaded onto a number of structures, which is the
typical output of threading programs. The input file must have the
following format, which is based on the 3D-PSSM one (L. Kelley,
B. Maccallum, M. Sternberg). There is an example in the file
"thread_list.example" ("DOCs" directory).

_________________________________________________________
3D-PSSM                                         <- label to identify format

Conf: 000005555550000000000000000007777770000   <- secondary str. pred. confidence
Pred:     EEEEEEEE                HHHHHHH       <- secondary str. pred.
  AA: GQALKNLLTLLNLEKIEEGLFRGQSEDLGLRQVFGGQVV   <- sequence.
                                                   Use ' ' (space) and '0' if you
                                                   don't have sec. str. pred.
Conf: 000022222000000000000000000037373740000
Pred:    HHHHHHHH                 EEEEEEE
  AA: PGDSKKPIIYDVETLRDGNSFSARRVAAIQNGKPIFYMT

PDB_chain score other_scores&parameters /domaincoreclass/
: 1forL -29.140 3.21 alpha+beta                 <- List of hits. The first line is
: 1nqbA -16.040 1.52 alpha                         the header containing the names
: 1mcpL -15.210 6.27 .....                         of the parameters, etc.
: 1mak- -14.660 6.22                               The label "/domaincoreclass/"
: 1bbdL -13.160 2.18                               is required.
: 1mlbA -11.380 6.22                               The first number is the score
: 8fabA  -9.740 1.42                               considered by the program; the
: 8fabC  -9.740 0.22                               rest of the parameters are read
: 1bjmA  -8.970 0.42                               and shown as simple text.
: 1ryt-  -8.410 3.28
(...rest of the hits...)

MSF: Type:                                      <- Sequence to structure
Name: xxxxx__Seq Len: 286                          alignments in (a kind of)
Name: 1forL__Seq Len: 286                          MSF format. The label "_Seq"
                                                   is required.
//
xxxxx__Seq GQALKNLLTLLNLEKIEEGLFRGQSEDLGLRQVFGGQV
------------
1forL__Seq GSGTSYSLTISRMEAEDAATYYCQQ.RSSYPITFGS..

xxxxx__Seq PGDSKKPIIYDVETLRDGNSFSARRVAAIQNGKPIFYM
------------
1forL__Seq ......................................

MSF: Type:
Name: xxxxx__Seq Len: 286
Name: 1nqbA__Seq Len: 286
//
xxxxx__Seq MGQALKNLLTLLNLEKIEEGLFRGQSEDLGLRQVFGGQ
------------
1nqbA__Seq QV..........QLQQSGAELVK.PGASVKLSCKASGY

xxxxx__Seq RPGDSKKPIIYDVETLRDGNSFSARRV...AAIQNGKP
------------
1nqbA__Seq SGGTKYNEKFKSKATLTVDKPSSTAYMQLSSLTSEDSA

(...rest of the alignments...)
_________________________________________________________


6. References.
--------------

Florencio Pazos, Burkhard Rost and Alfonso Valencia (1999). A platform
for integrating threading results with other information.
Bioinformatics. 15(12):1062-1063.

On-line version of the paper:
  http://www.cnb.uam.es/~pazos/papers/thread_int.html

Updated information about the program:
  http://www.cnb.uam.es/~pazos/threadlize/


7. Bugs/problems.
-----------------

4/11/2000
For some proteins the program can give the message "** Can't locate
aligned region. Perhaps the aligned chain is not in HSSP" when, in the
input file, there are residues of the 3D protein which are not in the
HSSP file. This can happen when the 3D coordinates for a residue (the
first residue, for example) are not in the PDB file. The problem can be
solved by editing the input file and replacing those residues with "-"
or ".".

"Interface to other threading programs."/"One single protein."
The name of the PDB file (first line of "to_rasmol.aln") cannot start
with a digit.

======================================================================

These programs are still test versions. Please report any suggestion,
question or bug to me (pazos@gredos.cnb.uam.es).

_____________________________________________________

Florencio Pazos Cabaleiro.
Protein Design Group.
Centro Nacional de Biotecnologia (C.N.B. - C.S.I.C.)
Campus Universidad Autonoma. Cantoblanco.
28049 Madrid.
Tlf: +34-91-5854669. Fax: +34-91-5854506.
mailto:pazos@gredos.cnb.uam.es
http://gredos.cnb.uam.es/pazos
_____________________________________________________