Welcome to the Wings Drugome Workflow Wiki
The goal of this work is to extend articles with scientific workflows that 1) represent the computations carried out to obtain the published results, explicitly capturing the data analysis pipelines, and 2) represent an abstraction of those computations that captures the semantics of the data analysis method in an execution-independent manner. This would make scientific results more reproducible: articles would contain not just a textual description of the computational process but also a workflow that, as a computational artifact, could be analyzed and re-run automatically in other labs with different software and execution infrastructure.
The initial project objectives are:
- Recreate the workflow used in the TB drugome paper (a preprint is available, see also the project web site), repeat it with their data (to test that the workflow is correct), then run it with new data.
- Describe the workflow as a method, abstracting from the software used to execute it. We use the Wings workflow system to create semantic descriptions of the workflow with abstract steps, so that it can be run with different components available at different execution sites. The method should also be such that one could run it years from now with a different set of software.
- Model the provenance of the results, capture workflow provenance with the Open Provenance Model (OPM), and publish the provenance records so they can be browsed and queried as Linked Data. Since the W3C published a new standard for provenance representation (PROV) in 2013, we have extended our model to comply with it.
This article is the main reference for this work:
- "Quantifying Reproducibility in Computational Biology: The Case of the Tuberculosis Drugome." Daniel Garijo, Sarah Kinnings, Li Xie, Lei Xie, Yinliang Zhang, Philip E. Bourne, and Yolanda Gil. PLOS ONE, Nov 27, 2013. DOI: 10.1371/journal.pone.0080278
A comprehensive description of the workflow publication model with examples:
- The OPMW model specification. Garijo, D., and Gil, Y. January 2012 (latest update September 2012).
A shorter, earlier report on workflow publication:
- "A New Approach for Publishing Workflows: Abstractions, Standards, and Linked Data". Garijo, D., and Gil, Y. Proceedings of the Sixth Workshop on Workflows in Support of Large-Scale Science (WORKS'11), held in conjunction with Supercomputing 2011, Seattle, Washington, November 2011.
- Motivation and background for the project are in the Initial Project Description.
- An overview and summary of the inputs, tools and results used and obtained when reproducing the TB-Drugome.
Recreating the Drugome Workflow
Our goal is to publish, as a reusable computational workflow, the method to derive the drug-target network of an organism (i.e., its drugome) published in (Kinnings et al. 2011) (a preprint is available; see also the project web site). The original work did not use a workflow system; instead, the computational steps were run separately and manually.
The Computational Method Described in the Original Article
The article describes a computational pipeline that accesses data from the Protein Data Bank (PDB) and carries out a systematic analysis of the proteome of Mycobacterium tuberculosis (TB) against all approved drugs. The process uncovers protein receptors in the organism that could be targeted by drugs currently in use for other purposes. The result is a drug-target network (a “drugome”) that includes all known approved drugs. Although the article focuses on a particular organism (TB), the method itself can be used for other pathogens or pathways and has the potential to be a key resource for developing new, more comprehensive treatments for other diseases of interest.
With a workflow, the method could be reproduced as new drugs become available. It could also be reused to create drugomes for many other organisms. In essence, the paper represents a novel method that takes a comprehensive and systematic approach to drug discovery, moving away from current practice, which is neither comprehensive nor systematic.
With the help of the authors of the article, we created a workflow that reflects the steps described in the original article and ran it with the data used in the original experiments.
We used the “methods” section, which describes conceptually what computations were carried out, as is usual in computational biology. However, we needed clarifications from the authors in order to reproduce the computations. Moreover, some of the software originally used in the experiments is no longer available in the lab, so some steps already had to be done differently.
The inputs to the workflow are: 1) a list of binding sites of approved drugs that can be associated with protein crystal structures in PDB, 2) a list of proteins of the TB proteome that have solved structures in PDB, and 3) homology models of annotated comparative protein structure models for TB. First, both the binding sites of protein structures and the homology models are compared against the drug binding sites. Next, the overall similarity of the global protein structures is compared, and only significant pairs are retained. A graph of the resulting interaction network is generated, which can be visualized in tools such as Cytoscape. Finally, molecular docking is performed to predict the affinity of drug molecules with the proteins.
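The data flow described above can be sketched as a minimal pipeline. This is only an illustration of the structure of the method: the function names, the placeholder similarity score, and the threshold are invented here and do not correspond to the actual SMAP, FATCAT, or AutoDock Vina interfaces.

```python
# Illustrative sketch of the TB drugome pipeline described above.
# Function names, scores, and thresholds are placeholders, not the real
# SMAP/FATCAT/AutoDock Vina interfaces.

def compare_binding_sites(structures, drug_sites):
    """Compare each protein structure against each drug binding site."""
    return [(p, d) for p in structures for d in drug_sites]  # all candidate pairs

def score(pair):
    # Placeholder similarity score; the real pipeline derives scores
    # from SMAP and FATCAT output.
    return 1.0

def filter_significant(pairs, threshold=0.5):
    """Keep only pairs whose (placeholder) similarity passes a cutoff."""
    return [pair for pair in pairs if score(pair) >= threshold]

def drugome_pipeline(solved_structures, homology_models, drug_sites):
    # 1. Compare solved structures and homology models against drug binding sites.
    candidates = compare_binding_sites(solved_structures + homology_models, drug_sites)
    # 2. Retain only significant pairs after global structure comparison.
    significant = filter_significant(candidates)
    # 3. The significant pairs form the drug-target network (drugome);
    #    molecular docking would then estimate binding affinities.
    return significant

network = drugome_pipeline(["prot1", "prot2"], ["model1"], ["drugA", "drugB"])
print(len(network))  # 6 candidate pairs with the placeholder score
```

The point of the sketch is the shape of the computation: a cross-product of comparisons, a significance filter, and a resulting network that is then passed to visualization and docking.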
Sketch of the overall workflow
An initial product of the work is a high-level sketch of the overall workflow in the TB drugome paper. This kind of overall view is very useful to someone wanting to reproduce the results of the work, and would be useful to include as supplemental material of a publication.
We started working on these core steps of the workflow:
- Comparison of ligand binding sites
- Comparison of drug chemical similarity
- Comparison of global protein structures and filtering
- Molecular docking
Other steps not included in our initial work:
- Retrieval of M.tb proteome from PDB
- Retrieval of FDA-approved drug binding sites
Schematic of the computational workflow
This sketch gives an overview of all the computational steps implemented in the final workflow. Five core steps are included, each corresponding to a subsection of the methods section in the article. Some intermediate components were added that were not needed in the original work (URL checker and Docking checker, in purple). Grey components were left out of the scope of the initial work.
Implementing the Drugome Workflow in WINGS
We use the Wings workflow system to represent both abstract and executable workflows.
We created workflow components using existing open source software packages:
- Comparison of ligand binding sites using SMAP
- Comparison of drug chemical similarity (using custom scripts)
- Comparison of global protein structures and filtering using FATCAT
- Molecular docking using Autodock Vina. The original work used eHiTS, which is proprietary commercial software. Autodock Vina is an open source package with similar functionality.
- Visualization using yEd, Circos and Gephi. The original work used Cytoscape.
Initial Implementation of Workflows
We started by creating Wings workflows that expose the codes and data for every step of the method. The overall initial executable workflow is shown here:
Initial Abstract Reusable Workflows
Based on the initial implementation of the workflows, we created abstract workflows. The abstract steps of these workflows make them independent of particular code implementations, which makes them more reusable by groups that use different implementations of the steps and more resilient to code changes over time. The system automatically specializes the abstract steps in the workflow into executable codes. The overall abstract workflow is shown here:
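The specialization idea can be illustrated with a toy example, in the spirit of what Wings does automatically: each abstract step is bound to a concrete component that is installed at the local execution site. The registry contents and function names below are invented for illustration and are not the Wings API.

```python
# Toy illustration of specializing abstract workflow steps into concrete
# components, in the spirit of what Wings does automatically.
# The registry and the "available" set are invented for illustration.

ABSTRACT_TO_CONCRETE = {
    # abstract step        -> known concrete implementations
    "BindingSiteComparison": ["SMAP"],
    "StructureComparison":   ["FATCAT"],
    "MolecularDocking":      ["AutodockVina", "eHiTS"],
}

def specialize(abstract_workflow, available=("FATCAT", "SMAP", "AutodockVina")):
    """Bind each abstract step to the first concrete component installed locally."""
    executable = []
    for step in abstract_workflow:
        candidates = ABSTRACT_TO_CONCRETE.get(step, [])
        chosen = next((c for c in candidates if c in available), None)
        if chosen is None:
            raise ValueError(f"No local component implements {step}")
        executable.append((step, chosen))
    return executable

plan = specialize(["BindingSiteComparison", "StructureComparison", "MolecularDocking"])
print(plan)  # each abstract step bound to an installed component
```

This is what makes the abstract workflow resilient: a site that has eHiTS but not AutoDock Vina would get a different, equally valid binding for the docking step without changing the workflow itself.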
After refining the initial version of the workflow, a final version was released:
- The final workflow template can be seen here:
- A run of the workflow can be browsed here.
- The final abstract workflow template:
- The workflow was parallelized and re-executed to make it more efficient. Further details can be found here.
Mapping Workflows to the Open Provenance Model: The OPMW Profile
The abstract workflow and the executed workflow are both mapped to the Open Provenance Model (OPM), maintaining links between them.
We created OPMW, a profile that extends OPM and PROV to accommodate the publication of abstract workflows and the provenance of their executions.
We mapped terms in Wings to OPMW. The design decisions and some examples are discussed in the Wings to OPM and PROV mapping rationale.
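The general shape of the mapping can be sketched as the triples emitted for one executed step: the step is typed with both an OPMW class and the corresponding PROV class, and its inputs and outputs are linked with PROV properties. The OPMW class names below follow our reading of the OPMW specification; treat the exact terms and URIs as illustrative and check the published mapping rationale.

```python
# Sketch of the kind of RDF triples emitted for one executed workflow step.
# OPMW class names are our reading of the spec; the URIs of the step and
# artifacts ("ex:...") are invented for illustration.

OPMW = "http://www.opmw.org/ontology/"
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def step_triples(step_uri, input_uri, output_uri):
    """Return (subject, predicate, object) triples for one executed step."""
    return [
        (step_uri,   RDF_TYPE,               OPMW + "WorkflowExecutionProcess"),
        (step_uri,   RDF_TYPE,               PROV + "Activity"),
        (output_uri, RDF_TYPE,               OPMW + "WorkflowExecutionArtifact"),
        (output_uri, RDF_TYPE,               PROV + "Entity"),
        (step_uri,   PROV + "used",          input_uri),
        (output_uri, PROV + "wasGeneratedBy", step_uri),
    ]

triples = step_triples("ex:SMAPStep", "ex:bindingSites", "ex:similarityScores")
for s, p, o in triples:
    print(s, p, o)
```

Typing each resource with both vocabularies is what lets the same provenance record be queried by OPMW-aware tools and by generic PROV consumers.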
Publishing Workflows as Linked Data
We publish workflows as Linked Data. A version of the workflow as Linked Data can be accessed here. We also developed a simple application to browse the published workflows, which allows you to navigate through the templates and explore their metadata.
Publishing Input Data and Workflow Results
We also published the input data and output data (workflow results) with permanent URLs using Figshare. See the pointers to the individual datasets above where the input data and output data are described.
Processing Data more Efficiently through Parallel Computations
We created a version of the workflow that has parallel execution so that it can run faster.
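The parallelization opportunity comes from the structure of the method: the pairwise comparisons are independent of each other, so they can run concurrently. The sketch below shows the idea with Python's standard thread pool; `compare` is a placeholder for an actual SMAP or FATCAT invocation, not the real component interface.

```python
# Minimal sketch of the parallelization idea: the pairwise comparisons are
# independent, so they can run concurrently. compare() is a placeholder
# for an actual SMAP/FATCAT invocation.
from concurrent.futures import ThreadPoolExecutor

def compare(pair):
    protein, drug_site = pair
    # Placeholder: a real step would invoke SMAP or FATCAT here.
    return (protein, drug_site, len(protein) + len(drug_site))

pairs = [(p, d) for p in ["prot1", "prot2", "prot3"] for d in ["siteA", "siteB"]]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(compare, pairs))

print(len(results))  # one result per (protein, binding-site) pair
```

In the actual workflow the same effect is achieved by the workflow system fanning the comparison step out over collections of inputs rather than by hand-written threading code.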
Augmenting the original article
As a result of this work, the method of the original article can be fully documented and reproduced by augmenting it with explicit and citable information about the data, workflow, software, and figures.
The original paper included supplementary data, but that is not the raw data in the format used by the codes. The authors provided the following datasets that were used in the original work:
- A drug key file with the ids for the drugs used in the workflow. (Permanent link to the file and data citation).
- The drug binding sites file. (Permanent link to the file and data citation).
- The solved structures file, containing the solved structures for the proteins and homology models being compared against the drug binding sites. (Permanent link to the file and data citation).
- The homology list file with the ids of the homology models associated to the proteins to be compared to the drug binding sites. (Permanent link to the file and data citation).
- File with additional information about the proteins of the TB Drugome. (Permanent link to the file and data citation).
- Template file from the Protein Data Bank with the ids of the homology models. (Permanent link to the file and data citation).
- List of solved structures with the ids of the proteins being compared in the experiment. (Permanent link to the file and data citation).
- Homology model information file used to filter the homology models. (Permanent link to the file and data citation).
- The configuration file for the SMAP tool (used for ligand binding sites comparison). (Permanent link to the file and data citation).
A bundle with ALL the input datasets is available at the following permanent URL: http://dx.doi.org/10.6084/m9.figshare.776910. If you reuse this dataset, please use the DOI for attribution.
- A highly connected drug file, with the resultant drug-protein pairs after the execution of the workflow (Permanent link to the file).
- The docking results obtained after further filtering the highly connected drugs (a permanent identifier is not provided due to the size of the dataset).
- A list of non-relevant results, with the pairs that did not match after executing the workflow (Permanent link to the file).
- The visualization file (in GML format) representing the highly connected drug network (Permanent link to the file).
A bundle with ALL the output datasets resulting from the workflow is available at the following permanent URL: http://dx.doi.org/10.6084/m9.figshare.776891. If you reuse this dataset, please use the DOI for attribution.
The workflows are available to run through the Wings Drugome workflow portal (password required, please contact us).
Different visualizations of the results obtained can be accessed here. This is an example:
A diagram of the workflow is:
A run of the workflow can be browsed here.
A diagram of a general version of the workflow is:
Getting started with the TB-Drugome
The TB-Drugome workflow is difficult to use without the proper background in bioinformatics and computer science. This section provides initial pointers and references for anyone trying to reuse the TB-Drugome workflow.
- It is recommended to first understand the flow of the data by looking at the published run schema.
- Then look at the original Drugome paper for further information about the method and the tools used.
- Once you are familiar with the workflow, have a look at the published workflow run, decompose the workflow into its different modules, and test them separately.
- In order to test the different steps of the workflow, you can access http://wind.isi.edu/marbles/ and import the "Drugome" domain. Different workflows already encode the sub-parts of the overall workflow and are ready to be run. Note that you will need to ask for a username and password (contact Yolanda Gil, Varun Ratnakar or Daniel Garijo).
- You will find suitable inputs to test the workflow in the input data section of this document. If you reuse the inputs, the outputs should look like the ones described here.
- If you want to reuse the workflow with other data, it is very important to follow the format of the input files. If you have a problem with any particular piece of software, check our detailed timeline. We may have experienced your problem before!
- Are you stuck at some point of the workflow? Do you have problems accessing some data? Please contact me and I will be happy to help you.
Detailed Timeline of Reproducibility Work
We documented the effort to reproduce the workflow. See a Detailed Timeline of our progress.
Summary and Future Work
Summary of work to date
- Initial extraction of the subworkflow steps from the paper
- Check how the subworkflows are connected to each other
- Build the subworkflows
- Comparison using SMAP
- Comparison of drug chemical similarity
- Filtering using FATCAT
- Docking using Autodock Vina (instead of eHits)
- Visualization using yEd instead of Cytoscape; GML (the format used by yEd) is compatible with Cytoscape
- Connect the subworkflows into one end-to-end workflow.
- Create an abstract workflow by describing semantically the components and datatypes in a way that is independent of software tools and execution details.
- Track the provenance of the executions.
- Publish provenance in OPM as Linked Data
- Develop a simple application to browse the published workflows as Linked Data
- Text analysis
- Published the workflow and associated computations and datasets as Linked Data using PROV and OPMW
- Quantifying reproducibility
- Estimate the effort in creating the workflow
- Parallelizing the workflow execution to make it more efficient
Interesting areas for future work include:
- Link workflow to paper and its discourse annotation as context
- Include initial steps to retrieve protein and drug data
- Run the workflows with other organisms (e.g., malaria) or protein datasets
- Link workflow data to other linked data, particularly the PDB dataset and others
- Import the published workflows to other workflow systems
Project Team

- Yolanda Gil (ISI)
- Daniel Garijo (ISI and UPM)
- Phil Bourne (UCSD)
- Li Xie (UCSD)
- Sarah Kinnings (UCSD)
- Lei Xie (UCSD)
This project is sponsored by Elsevier Labs, the National Science Foundation under award number IIS-0948429, the Air Force Office of Scientific Research under award number FA9550-11-1-0104, and by internal funds from the University of Southern California's Information Sciences Institute and from the University of California, San Diego.