This work took place in the summer of 2011. The detailed timeline is presented below in reverse chronological order.
Phase 4: Publication of Workflows as Linked Data and Demo application
- Publish workflow provenance
- Text analysis
- Quantifying reproducibility
- Estimate the effort in creating the workflow
- 29 August:
- Public SPARQL endpoint running on the Amazon EC2 cloud. (Update August 2013: replaced with http://www.opmw.org/sparql)
- Load 5 abstract templates and 5 execution results (Ongoing)
- Example queries from the templates. (Ongoing)
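As a rough illustration of how an example query could be sent to the public endpoint, the Python sketch below builds a GET request URL in the style of the SPARQL protocol. The query is a generic triple pattern and deliberately avoids any workflow-specific vocabulary; the function name, the constant names and the LIMIT value are assumptions made for this example only.

```python
from urllib.parse import urlencode

# Endpoint URL taken from the August 2013 update noted above.
ENDPOINT = "http://www.opmw.org/sparql"

# Generic query: list a few triples, whatever vocabulary the store uses.
EXAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_query_url(sparql: str, endpoint: str = ENDPOINT) -> str:
    """Return a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": sparql})


if __name__ == "__main__":
    # Only prints the URL; no network request is made here.
    print(build_query_url(EXAMPLE_QUERY))
```

The resulting URL can be pasted into a browser or passed to any HTTP client; whether the endpoint is still reachable is not something this sketch checks.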
- 28 August:
- Write documentation.
Phase 3: Mapping of the workflow template and execution to OPM
- 24 August:
- Mapper rationale: link
- 23 August:
- Mapper finished. I have to test it further with other templates, upload the rationale and modify the wiki
- 22 August:
- Template mapping finished.
- Execution mapping rationale finished. Mapper ongoing
- 19 August:
- Meeting with Ralph Bergmann to see how we could reuse his matcher.
- Still developing the mapping between WINGS and OPM. (Ongoing). I will upload the rationale next week.
- 18 August:
- Started Mapping between Wings and OPM (Ongoing)
- 17 August:
- Developed a small Allegro module from Java to connect to the repository and upload and retrieve data (independently from the test tutorials).
- 16 August:
- Meeting with Yolanda and Varun to discuss some of the mappings.
- Read the paper "Retrieval of semantic workflows with Knowledge Intensive Similarity Measures" by Ralph Bergmann and Yolanda Gil to see if the matcher could reuse the future export of WINGS results as OPM.
- To do: schedule a future meeting with Ralph to see if the transformation is possible (it looks like it is).
- Template conversion necessary as process recipe.
Phase 2: Implementation of Five Core Workflow Steps
- 25 August:
- Finished abstract workflow implementation. Tested in Marbles, but will put it in Wind too. Updated the wiki with all the workflow abstraction and instances.
- 15 August:
- Fixed the docking issue detected on Friday. Now the results are copied to the right folder.
- Started the mapping between WINGS ontologies and OPMV.
- Final version of the workflow:
In purple we can see the components that have been added to ensure the reproducibility of the workflow by avoiding manual steps. I have avoided adding one additional component (getSMAPAlignementFolder), because it is not necessary for the global understanding of the workflow.
- 12 August:
- Docking checker for the ligand preparation done. It fixes some atomic numbers that the tool does not recognize to hydrogens.
- Checked SMAP subWF (Marbles):
- Checked FATCAT subWF (Marbles):
- Checked MakeGraphNetwork subWF (Marbles):
- Checked Docking subWF (Marbles): In this case, an additional component is needed: either it fixes the PDB files of the ideal ligands, or these files have to be provided manually. (Edit: dock checker finished and added)
- 11 August:
- Autodock Vina uploaded to Marbles successfully.
- First steps with the Allegro database, installed on Amazon. Matheus and I have successfully tested loading and querying data remotely, using the RDF produced by WINGS (.ttl file).
- Next steps: understand how WINGS generates the RDF and make it possible to upload it to the triple store.
- Finished fixing the components so that they no longer write to the BASEDIR directory. All components work individually (the global connection is still to be checked).
- Edited the subworkflows connecting the components and the global subworkflow. To test tomorrow with a minimal set.
- 10 August:
- Fixed the SMAP component with the homology models (there was an issue when reading them). Now working on Marbles.
- Finished uploading all the components to Marbles. Still to test Autodock Vina, because it needs MGLTools installed.
- New issue: I have to stop writing to the BASEDIR directory, since it is not good practice (many of the components have to be changed).
- Fixed an error in the result sorter script that made other components work incorrectly.
- Added CountDrugs component.
- Updated diagram: (All but the Autodock Vina component working on Marbles)
- 9 August:
- Finished the URL checker. It doesn't just check that the packages are right; it also fixes the list by replacing those that have been superseded. (Working in WINGS)
- Migrated SMAP components, FATCAT components and Visualization components to Marbles.
- Detected an error in the SMAP sorter (TO BE FIXED)
- Detected that SMAP does not produce the same results for some of the inputs, though they are similar. I don't know why this could be (configuration settings??)
- 8 August:
- Uploaded getClip files component.
- Uploaded getIdealLig component (requires LWP::Simple).
- Fixed some issues with the naming of the input files.
- Started the URL checker for FATCAT.
- Started migration to Marbles.
- New diagram updated:
- 5 August:
- Finished the docking and it is running on WINGS.
- Fixed some problems with the format of the outputs.
- Tested ALL the subworkflows in WINGS, and they all work properly.
- The SMAP sub wf does not produce the right results because of the computer I'm running it on.
- The complete FatCat step still needs a URL checker for the results, or otherwise it will fail (to be fixed on Monday).
- I still need to complete the steps for creating the ideal ligands and the clip files. Also the count drugs.
- 4 August:
- I finished a script to run Vina automatically. I tested it with a subset and it worked fine (after fixing the T44 and T3 pdb files).
- The command used to call the script:
./autodock_vina -i1 /Users/danielgarijo/WORK_ISI/DOCKING/Steps_Before_Docking/TESTSCRIPT/sig_results_1e-5_fatcat_filtered -i2 /Users/danielgarijo/WORK_ISI/DOCKING/Steps_Before_Docking/TESTSCRIPT/align_struct_output/ -i3 /Users/danielgarijo/WORK_ISI/DOCKING/Steps_Before_Docking/TESTSCRIPT/ideal_ligs/ -i4 /Users/danielgarijo/WORK_ISI/DOCKING/Steps_Before_Docking/TESTSCRIPT/749_mtb_structures/ -o output
- 3 August:
- Autodock Vina works! The error I was obtaining was solved by changing some hydrogens in the file. I have manually done this change, but we will have to add a component to fix it dynamically in the future.
- Still to develop the script to automate the call from the FatCat output list.
- 2 August:
- Tried to reproduce all the steps in the docking subworkflow. There are remaining errors:
1, 2.- We execute the clip file scripts and the lig scripts to have all the inputs available.
3.- Initial line: 1K0R_A nusA 1HK3_LIG.T44 SERUM ALBUMIN P02768 T44 2.586E-08 80.39 0.75 2.23A
4.- We extract the coordinates using the script and the SMAP aligned output:
./ligand_center.csh 1HK3_LIG.T44-1K0R_A.pdb T44
Results (Dani's example): center_x = -42.625 center_y = 62.5417 center_z = 16.125
We redirect the output to a file:
csh ./ligand_center.csh 1HK3_LIG.T44-1K0R_A.pdb T44 > COORD_1HK3_LIG.T44-1K03_A
4a, b.- Test whether the prepare receptor and prepare ligand scripts work (Sarah's example):
./prepare_ligand4.py -l VDN.pdb
./prepare_receptor4.py -r 1BVR_F.pdb
./vina --receptor 1BVR_F.pdbqt --ligand VDN.pdbqt --center_x 20.34 --center_y -16.11 --center_z 53.21 --size_x 20 --size_y 20 --size_z 20
Dani's example:
./prepare_ligand4.py -l T44.pdb
OUTPUT: error. It looks like the PDB file is not right. I cannot do anything else here.
./prepare_receptor4.py -r 1K0R_A.pdb
Actual call: /Users/danielgarijo/Downloads/mgltools_i86Darwin9_1.5.4/bin/pythonsh /Users/danielgarijo/Downloads/mgltools_i86Darwin9_1.5.4/MGLToolsPckgs/AutoDockTools/Utilities24/prepare_receptor4.py -r 1K0R.pdb (NOTE: there is no _A file)
OUTPUT: 1K0R.pdbqt
- 1 August:
- SMAP step completely finished. The SMAP component can be reused for the homology models :)
- Built a script to add the homology models to the local PDB file in WINGS. It should be an additional component.
- Successfully deployed the FATCAT step getting the sigResults. Some preprocessing of the SMAP output files may be needed, though.
- Successfully invoked and created FATCAT component in Wings.
- Successfully deployed the FATCAT post filtering script and component in WINGS.
- Linked all the steps of the FATCAT wf and they work. I have only obtained the error with the superseded package, so we will need a URL checker for the packages in PDB. It would go between the fatCatList and the fatcat component.
- Updated figure with the current progress (Legend: steps that are working shown in green, steps having problems shown in blue, manual interventions required shown in red):
Legend: Green boxes are the steps already executed properly in WINGS. The FatCat step needs a URL checker to make sure the packages in the list generated in the previous step have not been superseded. Red boxes mean that there is something missing or that there is no script currently.
- 29 July:
- Successfully modified and executed the sorting part of the SMAP step on WINGS
- Successfully modified and executed the execution part of the SMAP step on WINGS (for solved structures; it is the same for the homology models). Since the tool does not work successfully, it does not reproduce the same results, but after moving everything to Linux there will be no problems (I have already executed everything there and it worked).
- Note: in both cases the results have been reproduced and zipped, but the zip contains the whole path of folders from the root directory. Still to be fixed.
- I have forgotten to zip the alignment directory as an additional output. Since on the Mac it does not align well, I'll do some tests with an additional component/folder.
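One possible way to address the zip path issue noted in this entry is to store each entry under a path relative to the results folder instead of its absolute path. The Python sketch below illustrates the idea; it is not the component used in WINGS, and the function name and folder layout are assumptions.

```python
import os
import zipfile


def zip_directory(src_dir: str, zip_path: str) -> None:
    """Zip src_dir so that entries are stored relative to src_dir's parent."""
    src_dir = os.path.abspath(src_dir)
    parent = os.path.dirname(src_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # arcname drops every folder above the results directory,
                # so the archive no longer mirrors the path from the root.
                archive.write(full_path, arcname=os.path.relpath(full_path, parent))
```

The archive should be written outside `src_dir`, otherwise the walk may pick up the zip file itself.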
- 28 July:
- I checked the run_smap script in Linux, reproducing a subset of results instead of simply testing the SMAP tool.
- I talked to Sarah and finally clarified the process that takes place in the docking step. It is as follows:
- We start from the filtered results of Fatcat. An example line is:
1BVR_C-D-E-F inhA 1UHO_LIG.VDN CGMP-SPECIFIC 3',5'-CYCLIC PHOSPHODIESTERASE ....
- We extract 1UHO_LIG.VDN, which is the binding site.
- We download 1UHO from PDB.
- We download the ideal ligand from a special URL, using the script get_ideal_ligs.
- We extract the coordinates using the script and the SMAP aligned output:
./ligand_center.csh 1UHO_LIG.VDN-1BVR_CDEF.pdb VDN
- We execute the next steps with the python scripts:
./prepare_ligand4.py -l VDN.pdb (file obtained from step 4)
./prepare_receptor4.py -r 1BVR_F.pdb
- We execute Vina
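The first steps of the process above (reading a filtered FatCat line and deriving the names needed later) can be sketched in Python as follows. This is only an illustration: it assumes that the fields are separated by whitespace, that the binding site is always the third field, and that the aligned file name is the binding site joined to the receptor name with the hyphens between chain letters removed, as in the two examples in this log. The function name and the dictionary keys are hypothetical.

```python
def parse_filtered_line(line: str) -> dict:
    """Extract the docking inputs from one filtered FatCat result line."""
    fields = line.split()
    receptor = fields[0]        # e.g. "1BVR_C-D-E-F"
    binding_site = fields[2]    # e.g. "1UHO_LIG.VDN"
    template_pdb = binding_site.split("_")[0]   # PDB entry to download
    ligand = binding_site.split(".")[-1]        # ideal ligand code
    # Aligned file name: binding site + "-" + receptor without chain hyphens.
    aligned_pdb = f"{binding_site}-{receptor.replace('-', '')}.pdb"
    return {
        "receptor": receptor,
        "binding_site": binding_site,
        "template_pdb": template_pdb,
        "ligand": ligand,
        "aligned_pdb": aligned_pdb,
    }


if __name__ == "__main__":
    example = "1BVR_C-D-E-F inhA 1UHO_LIG.VDN CGMP-SPECIFIC"
    info = parse_filtered_line(example)
    # Arguments that would be passed to ligand_center.csh
    print(info["aligned_pdb"], info["ligand"])
```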
- 27 July:
- First subworkflow completed: openBabel Step. It works in WINGS.
- Tested the visualization script. It works properly. Sarah added the missing 2 files in the data flow.
- Second subworkflow completed: Visualization subworkflow running on Wings!
- Successfully executed the first 2 scripts of the Docking step (create_ehits_clip_files, get_ligs_for_docking). The second was executed in Linux, since it doesn't work on the mac computer.
- Updated the flow figure (Legend: steps that are working shown in green, steps having problems shown in blue, manual interventions required shown in red):
The subworkflows already working in WINGS are shown in green; the missing (or to-do) scripts are shown in red. Note that some of the tools will not work in WINGS on the Mac (due to some compatibility issues), but the final version will be tested in Linux, where I have already tested all the scripts.
- 26 July:
- Started to create the workflows in WINGS. OpenBabel Step. I had to change the script in order to require input and output.
- Small progress in the docking part. I have managed to extract some coordinates from a protein file, but I still need information about the names of the ligands for each protein (in the example I have extracted them manually from the PDB file).
sudo csh ligand_center.csh 1BVR.pdb THT
center_x = 27.1258 center_y = -1.97632 center_z = 28.3574
This output is not written to a file, maybe additional scripts are needed?
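One way to bridge the gap between the printed coordinates and the Vina call is a small helper that parses the text and builds the argument list. The Python sketch below assumes the output format shown above (`center_x = ... center_y = ... center_z = ...`) and uses only the Vina options that appear elsewhere in this log. The function names, the box size of 20 (borrowed from Sarah's example) and the .pdbqt file names in the usage example are assumptions.

```python
import re

# Matches "center_x = -42.625" style fragments, with an optional exponent.
CENTER_PATTERN = re.compile(
    r"center_([xyz])\s*=\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def parse_center(output: str) -> dict:
    """Turn the text printed by ligand_center.csh into {'x':..,'y':..,'z':..}."""
    center = {axis: float(value) for axis, value in CENTER_PATTERN.findall(output)}
    if set(center) != {"x", "y", "z"}:
        raise ValueError("could not find all three center coordinates")
    return center


def vina_arguments(receptor: str, ligand: str, center: dict, box_size: int = 20) -> list:
    """Build the Vina argument list for a cubic search box around the center."""
    args = ["./vina", "--receptor", receptor, "--ligand", ligand]
    for axis in "xyz":
        args += [f"--center_{axis}", str(center[axis])]
    for axis in "xyz":
        args += [f"--size_{axis}", str(box_size)]
    return args


if __name__ == "__main__":
    text = "center_x = 27.1258 center_y = -1.97632 center_z = 28.3574"
    # Hypothetical file names, for illustration only.
    print(" ".join(vina_arguments("1BVR.pdbqt", "THT.pdbqt", parse_center(text))))
```

The argument list can be handed to `subprocess.run` once the receptor and ligand files have been prepared.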
- 25 July:
- Installed the mgltools (necessary for the preprocessing of Vina). Example of use with a receptor:
./pythonsh /Users/danielgarijo/Downloads/mgltools_i86Darwin9_1.5.4/MGLToolsPckgs/AutoDockTools/Utilities24/prepare_receptor.py -r /Users/danielgarijo/Downloads/1BVR.pdb
- Checked compatibility with Cytoscape, and it works properly. I haven't reproduced the output graph yet, but the .gml file provided by Sarah can be displayed from the console with the following command:
java -Xmx512M -jar cytoscape.jar -N /Users/danielgarijo/Desktop/graph_1e-5.gml
- Successfully installed Autodock Vina (it is free software).
- Issues when trying to follow Lei's instructions for obtaining the solved protein file (not a crucial step right now):
- Installed Psi-blast, though blastpgp is not available. (The package does not exist).
- There are 5 different M.tb genomes at http://www.tbdb.org/.
- I don't know how to refer to the PDB sequences.
- Updated the figure completing some steps and highlighting in red the missing ones:
- 22 July: The solution to the missing package issue in FatCat is to search for the packages in the PDB and see whether each one has been superseded by another. So, when building the workflow we will need an additional component to check the availability of the packages and download the replacements if they have been superseded. This will be a final step, once the rest of the workflow is running correctly.
The package that solves the problem with 3i3k is 3ld6. (Thanks to Sarah for her help with this issue)
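The replacement part of such a checker could look like the Python sketch below. It does not query the PDB: how superseded entries are discovered is left out, and the mapping is filled by hand with the single pair reported in this entry. The function and constant names are hypothetical.

```python
# Known replacements, keyed by lowercase PDB code.
# Only the pair reported in this log is included.
SUPERSEDED = {"3i3k": "3ld6"}


def replace_superseded(pdb_ids, superseded=SUPERSEDED):
    """Return the list of PDB codes with superseded entries replaced."""
    updated = []
    for pdb_id in pdb_ids:
        replacement = superseded.get(pdb_id.lower())
        if replacement is None:
            updated.append(pdb_id)          # still current, keep as written
        elif pdb_id.isupper():
            updated.append(replacement.upper())  # keep the caller's casing
        else:
            updated.append(replacement)
    return updated


if __name__ == "__main__":
    print(replace_superseded(["1bvr", "3i3k"]))
```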
All the results of this step have been reproduced
I've spent some time gathering and connecting all the scripts, inputs and outputs to understand how the different steps are connected to each other, and to identify the missing scripts between steps. The file can be viewed below:
- What we need:
- How is the file 962drug_binding_sites obtained from the PDB + tanimotos file?
- How is the TB protein info file obtained?
- No scripts creating the Solved structures and homology models files are provided
- Drug key and drug counts files and scripts to obtain them are missing.
- No additional information for the docking part (last step)
- 21 July: FatCat works, as long as the PDB FTP server from which it downloads the data works properly. If a single resource is not available, the program crashes (e.g. ftp://ftp.wwpdb.org/pub/pdb/data/structures/all/pdb/pdb3i3k.ent.gz). Therefore I've asked Sarah to provide the proper PDB files, in order to avoid this error. An alternative would be to delete the conflicting entries from the list.
FatCat is run with the following command:
sudo sh ./runFatcat.sh -alignPairs fatcat_list -pdbFilePath pdb_files -autoFetch true -outFile fatcat_output -printFatCat true
Input: fatcat_list: I have to investigate how the smap_output gets converted into this list.
Output: fatcat_output: I assume that this list is the one to go to the eHITS step
I think there are some necessary preprocessing steps. For the moment, the subworkflow looks like this figure:
SMAP tested on Windows and Linux, successfully reproducing the first results in table S2 (2VBW_A-B with 1SN5_LIG.T3 and 2VBX_A-B with 1SN5_LIG.T3). On Linux it is necessary to install the qhull package and create soft links in the SMAP folder replacing the current ones (which don't work). On the Mac I've tried to do the same, but it is still not working. However, since WINGS on Marbles runs on Linux, we don't need this step to run on a Mac. Thus, we can conclude that the SMAP step is done.
- 20 July: Open Babel works. I've repeated the step taking as input the ligand SMILES from Sarah and executing her Perl script.
sudo perl calc_lig_tanimotos
- The file ligand_smiles_Sarah must be in the same folder. If necessary we can enter it as input for the script.
- The output is a file called 'tanimotos'. How is this file connected to the SMAP input??? Still to find out
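For orientation, the Tanimoto coefficient between two fingerprints is the number of shared features divided by the number of features present in either. The Python sketch below shows that formula on plain sets. It is not the calc_lig_tanimotos script, and it assumes (from the file name only) that the 'tanimotos' output holds coefficients of this kind; the fingerprints in the example are made up.

```python
def tanimoto(fingerprint_a: set, fingerprint_b: set) -> float:
    """Tanimoto coefficient |A & B| / |A | B| between two feature sets."""
    union = fingerprint_a | fingerprint_b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(fingerprint_a & fingerprint_b) / len(union)


if __name__ == "__main__":
    # Made-up feature indices, for illustration only.
    print(tanimoto({1, 2, 3}, {2, 3, 4}))
```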
- 19 July: Tested whether the results obtained from SMAP correspond to the ones in table S2. I've taken the first 10 drugs and proteins, but the results are not the same. For some of them a p-value is not even produced by the SMAP software.
M.tb PDB code: 2VBW
Drug target PDB code: 1SN5
SMAP P-value: 6.08E-011
SMAP result: No hit found. Log:
SCORE_MATRIX = McLACHLAN
CONFORMER_UNIT_DIR = /Users/danielgarijo/WORK_ISI/smap_v2_0/conformerUnit
LIGAND_CONTACT_DISTANCE_CUTOFF = 10.0
Warn: PVALUE_CUTOFF is unknown. Set default value of 0.02
PRINT_PDB = false
Warn: PRINT_TEMPLATE_LIGAND is unknown. Set default true
Warn: PRINT_QUERY_LIGAND is unknown. Set default false
TEMPLATE_LIGAND_SITE_ONLY = true
Warn: QUERY_LIGAND_SITE_ONLY is unknown. Set default true
/Users/danielgarijo/WORK_ISI/smap_v2_0/conformerUnit/2VBW_.object
PdbLogger setting log level to: warn
/Users/danielgarijo/WORK_ISI/smap_v2_0/conformerUnit/1SN5_.object
No hit found
sudo sh ./smap_comp.sh 2VBW 1SN5 outputprueba
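When many pairs are compared, it may help to summarize each log automatically. The Python sketch below looks for the two kinds of message visible in the log above: the "No hit found" line and the "Warn: ... is unknown" lines that indicate a setting fell back to its default. It relies only on the wording shown in this entry; other SMAP messages are not handled, and the function name and dictionary keys are hypothetical.

```python
import re

# Settings reported as "Warn: NAME is unknown" were replaced by a default.
UNKNOWN_SETTING = re.compile(r"Warn: (\w+) is unknown")


def summarize_smap_log(log_text: str) -> dict:
    """Report whether the log says no hit was found and which settings defaulted."""
    return {
        "no_hit": "No hit found" in log_text,
        "defaulted_settings": UNKNOWN_SETTING.findall(log_text),
    }


if __name__ == "__main__":
    sample = (
        "Warn: PVALUE_CUTOFF is unknown. Set default value of 0.02 "
        "PRINT_PDB = false No hit found"
    )
    print(summarize_smap_log(sample))
```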
- 18 July: Varun fixed my WINGS installation during the weekend (there was an incompatibility between the new version of modx and the WINGS package).
I separated and fixed the sorting part from the SMAP perl script. Now the processing occurs according to the figure below.
The call to the script is made through the command line this way:
sudo perl ./testDaniSORT.pl -l InputFromSarah/solved_tb_structures_listREDUCED.txt -d InputFromSarah/drug_binding_sites_listREDUCED.txt -o Output/smap_output_REDUCED -p 1e-5
Note: It can be optimized by removing the drug binding sites list.
OpenBabel 2.1.1 successfully installed: I just need the input parameters and datasets. Until I get more information about the process, inputs and outputs, the next figure explains this step.
- 15 July: Meeting with Varun to update my local WINGS installation. The execution of SMAP takes too long, so for the moment I'll use a small subset. Li says that we can pick any proteins and drugs randomly, so I'll take the first 20 from each input table. After a short discussion with Yolanda, we agreed that it would be better to separate the postprocessing (sort) from the Perl script.
Therefore, the SMAP subworkflow would be modified as follows:
In WINGS it would be translated to 3 different components: SMAP for files, SMAP for chains and sorter.
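The reduced input subset mentioned in this entry (the first 20 rows of each table) can be produced with a few lines of Python. The sketch assumes one entry per line and no header row, which the log does not confirm; the function name is hypothetical, and the file names in the usage example follow the pattern of the REDUCED files used on 18 July.

```python
from itertools import islice


def write_reduced_list(source_path: str, reduced_path: str, limit: int = 20) -> int:
    """Copy the first `limit` lines of source_path into reduced_path."""
    with open(source_path) as source:
        lines = list(islice(source, limit))
    with open(reduced_path, "w") as reduced:
        reduced.writelines(lines)
    return len(lines)  # number of entries actually written


if __name__ == "__main__":
    written = write_reduced_list(
        "solved_tb_structures_list.txt",
        "solved_tb_structures_listREDUCED.txt",
    )
    print(written)
```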
- 14 July: Unfortunately, my WINGS installation setup is not complete. Received additional information about the scripts used for SMAP. Created the SMAP component in WINGS, but we still have to test it.
SMAP seems to work with the command ./smap_comp.sh 1M44 1M44 output, but when running the Perl script included in the file (after correcting a couple of things in the syntax and libraries) it produces an exception. Still to be fixed. Also, the tables provided by Sarah look like the output of some preprocessing step. I still have to get the scripts from her.
SMAP works. The script must be executed with admin permissions.
Therefore, the complete SMAP step is as follows:
- How to invoke SMAP:
sudo perl ./run_smap.pl -l solved_tb_structures_list.txt -d drug_binding_sites_list.txt -o smap_output -p 1e-5
- Work to be done:
- Fix SMAP settings to work with the lists of proteins and drugs provided.
- Get scripts for the preprocessing steps
- Get scripts for the combination of both tables into the Raw Network table file.
- Create WINGS components for the scripts using files and tables.
- 13 July: Successfully installed SMAP. Tried the application with a single call from the command line, also successfully. After configuring everything according to the installation instructions, I've tested that it works with the following command (taken from one of the PDB codes of table S4):
./smap_comp.sh 1M44 1M44 output
1M44 are the codes to be compared, and "output" is the name of the output file where the solution is stored. There is a Perl script to run many comparisons, but I haven't figured out yet how I should pass the lists of codes.
- 12 July: WINGS local setup on Daniel's machine. Thanks to Matheus for his help.
Phase 1: Project Scoping
- 11 July: After the first meeting, we now know how all the subworkflows are connected, and their inputs and outputs. Resulting file: File:SanDiegosmeetingresearchAFTERMEETING.odt
We will focus on 5 steps right now:
- Comparison using SMAP
- Comparison of drug chemical similarity
- Filtering using FATCAT
- Docking using eHiTS
- Visualization using Cytoscape
The next figure illustrates the modified subworkflow connection after the meeting:
- First week: 4-11 July: Preparation for the meeting on the 11th. Resulting document: File:SanDiegosmeetingresearch.odt
- Analysis of the paper method section. That is, analysis of the tools and datasets that participate.
- What are the inputs, outputs and middle steps of the workflow?
- Can the software be run in WINGS as part of a workflow?
- Project Formulation: 12 April: File:ProjectDraft.pdf
- First Discussions: 19-21 January, 2011: Beyond the PDF workshop