The goal of this work is to extend articles with scientific workflows that 1) represent the computations carried out to obtain the published results, explicitly capturing data analysis pipelines, and 2) represent an abstraction of those computations that captures the semantics of the data analysis method in an execution-independent manner. This would make scientific results more reproducible, because articles would include not just a textual description of the computational process but also a workflow that, as a computational artifact, could be analyzed and re-run automatically.
In recent years, a variety of systems have been developed that
export the workflows used to analyze data and make them part of
published articles. The workflows that are published
in current approaches are dependent on the specific codes used for
execution, the specific workflow system used, and the specific
workflow catalogs where they are published.
In this work,
we take a new approach that addresses these shortcomings and
makes workflows more reusable through: 1) the use of abstract
workflows to complement executable workflows, so that they remain
reusable when the execution environment is different; 2) the
publication of both abstract and executable workflows using
standards such as the Open Provenance Model, so that they can be
imported by other workflow systems; and 3) the publication of
workflows as Linked Data, which results in open, web-accessible
workflow repositories. Our initial focus is a
complex workflow that we re-created from an influential
drug discovery publication that describes the generation of ‘drugomes’.
The TB Drugome Workflow
Our initial focus is a reusable computational workflow for the method to derive the drug-target network of an organism (i.e., its drugome) published in (Kinnings et al 11).
The article is available online; see also the project web site.
The article describes a computational pipeline that accesses data from the Protein Data Bank (PDB) and carries out a systematic analysis of the proteome of Mycobacterium tuberculosis (TB) against all approved drugs. The process uncovers protein receptors in the organism that could be targeted by drugs currently in use for other purposes. The result is a drug-target network (a “drugome”) that includes all known approved drugs. Although the article focuses on a particular organism (TB), the method itself can be used for other pathogens or pathways and has the potential to be a key resource for developing new, more comprehensive treatments for other diseases of interest.
With the help of the authors of the article, we created an executable workflow that reflects the steps described in the original article and ran it with the data used in the original experiments.
The final executable workflow can be seen here:
To export the workflows we developed OPMW as an extension of OPM that can represent abstract workflows.
OPM is a widely used, domain-independent provenance model that resulted from the Provenance Challenge Series
and from years of workflow provenance exchange and standardization in the scientific workflow community.
There are several reasons to use OPM. First, OPM has already been used successfully in many scientific workflow systems, which makes our published workflows more reusable. Second, the core definitions in OPM are domain independent and extensible to accommodate other purposes, in our case workflow representations. Third, OPM can be considered the basis of the emerging W3C Provenance Interchange Language (PROV), which is currently being developed by the W3C Provenance Working Group
as a standard for representing and publishing provenance on the Web.
OPM offers several core concepts and relationships to represent provenance. OPM models resources as artifacts (immutable pieces of state, such as datasets), processes (actions or series of actions performed on artifacts), and agents (controllers of processes). Their relationships are modeled in a provenance graph with five causal edges: used (a process used some artifact), wasControlledBy (an agent controlled some process), wasGeneratedBy (a process generated an artifact), wasDerivedFrom (an artifact was derived from another artifact) and wasTriggeredBy (a process was triggered by another process). OPM also introduces the concept of roles, which state the function that an artifact, process or agent played when interacting with another, and the notions of accounts and provenance graphs, which group sets of OPM assertions into different subgraphs. An account represents a particular view on the provenance of an artifact based on what was executed.
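The concepts above can be illustrated with a small in-memory model. The following is only a minimal sketch, not the OPM reference implementation: the class `ProvenanceGraph`, its methods, and the node names (`input.dat`, `align`, `scientist`) are assumptions introduced for illustration. It encodes the three node kinds, checks that each of the five causal edges links the node kinds described above, and uses accounts to select a subgraph.

```python
from dataclasses import dataclass, field

# The three OPM node kinds named in the text.
NODE_KINDS = {"artifact", "process", "agent"}

# Each causal edge maps to (kind of the effect node, kind of the cause node).
EDGE_SIGNATURES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasControlledBy": ("process", "agent"),
    "wasDerivedFrom": ("artifact", "artifact"),
    "wasTriggeredBy": ("process", "process"),
}


@dataclass
class ProvenanceGraph:
    """Hypothetical container for OPM-style assertions."""
    nodes: dict = field(default_factory=dict)  # node id -> kind
    edges: list = field(default_factory=list)  # (effect, edge, cause, role, account)

    def add_node(self, node_id, kind):
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = kind

    def add_edge(self, effect, edge, cause, role=None, account="default"):
        # Reject edges whose endpoints do not match the edge definition.
        effect_kind, cause_kind = EDGE_SIGNATURES[edge]
        if self.nodes[effect] != effect_kind or self.nodes[cause] != cause_kind:
            raise TypeError(f"{edge} must link a {effect_kind} to a {cause_kind}")
        self.edges.append((effect, edge, cause, role, account))

    def account_view(self, account):
        """Return only the assertions that belong to one account."""
        return [e for e in self.edges if e[4] == account]


graph = ProvenanceGraph()
graph.add_node("input.dat", "artifact")
graph.add_node("output.dat", "artifact")
graph.add_node("align", "process")
graph.add_node("scientist", "agent")

graph.add_edge("align", "used", "input.dat", role="query")
graph.add_edge("output.dat", "wasGeneratedBy", "align", role="result")
graph.add_edge("align", "wasControlledBy", "scientist", role="operator")
# A coarser view of the same run, kept in a separate account.
graph.add_edge("output.dat", "wasDerivedFrom", "input.dat", account="summary")
```

Keeping the account on every edge, rather than on nodes, mirrors the idea that an account is a view over assertions: the same artifact can appear in several accounts with different levels of detail.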
We mapped Wings ontologies to the OPM core model, extending OPM core concepts and relationships according to our needs in a new profile called OPMW.
We use two OPM ontologies for our mapping. OPMV
is a lightweight RDF vocabulary implementation of the OPM model. It covers only a subset of the concepts in OPM, but it facilitates modeling and query formulation. OPMO
covers the full functionality of the OPM model, and we use it for mapping to OPM concepts that are not in OPMV, such as Account or OPM Graph.
Figure 1 shows a high-level diagram of the mappings to OPM of an abstract workflow on the left and a specific execution on the right. The workflow shown here has one step (executionNode1), which runs the workflow component (specComp1) that has one input (execInput1) and one output (executionOutput1). For some of the concepts there is a straightforward mapping: datasets are a subtype of Artifacts, while workflow steps, also called nodes, map to OPM Processes. Notice that each node has a link to the component that is run in that step; for example, the workflow in Figure 1 has two nodes that run the same component SMAPV2. There is no OPM term that can be mapped to components, so we used our own terms (represented with the ac prefix in Figure 1).
In the figure, the terms taken from OPMO and OPMV are indicated using their namespaces. The new terms that we defined in our extension profile use the OPMW prefix.
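As a rough illustration of this mapping, the one-step execution described above can be written as RDF triples. The sketch below makes several assumptions: the OPMV namespace URI should be verified against the published vocabulary, the instance namespace `http://example.org/run/` is a placeholder, and the property `hasComponent` under `http://example.org/vocab#` is a hypothetical stand-in for the extension terms, whose actual names are not given here.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OPMV = "http://purl.org/net/opmv/ns#"  # OPMV namespace (verify against the ontology)
EX = "http://example.org/run/"         # hypothetical namespace for instance data
HYP = "http://example.org/vocab#"      # hypothetical stand-in for extension terms

# Datasets become artifacts, the workflow step becomes a process, and the
# step keeps a link to the component it ran.
triples = [
    (EX + "execInput1", RDF_TYPE, OPMV + "Artifact"),
    (EX + "executionOutput1", RDF_TYPE, OPMV + "Artifact"),
    (EX + "executionNode1", RDF_TYPE, OPMV + "Process"),
    (EX + "executionNode1", OPMV + "used", EX + "execInput1"),
    (EX + "executionOutput1", OPMV + "wasGeneratedBy", EX + "executionNode1"),
    (EX + "executionNode1", HYP + "hasComponent", EX + "specComp1"),
]


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


ntriples = to_ntriples(triples)
print(ntriples)
```

The serializer handles only URI nodes, which is enough for this structure; literals and blank nodes would need escaping rules that a real RDF library provides.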
The ontology can be browsed here
WINGS workflow system
Wings is a workflow system that assists scientists with the design of computational experiments. A computational experiment specifies how selected datasets are to be processed by a series of software components in a particular configuration. Earth scientists use computational experiments to estimate seismic hazard through simulations of earthquake forecasts. Biologists use computational experiments for analysis of gene expression microarray data or molecular interaction networks and pathways. Social scientists analyze large social networks to discover structural regularities based on mining relations among individuals.
We use workflows to represent computational experiments. Workflows represent application components and their dependencies in terms of dataflow among them. Workflow systems have been developed to assist users with some aspect of the process, for example to assemble workflows out of large component libraries, to optimize execution performance, and for workflow sharing. None of these systems provides comprehensive support for workflow design and exploration. To learn more about the state of the art in workflow systems, please visit http://www.isi.edu/nsf-workflows06.
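The idea of a workflow as components linked by dataflow can be sketched as a dependency graph. This is a minimal illustration and not how Wings represents workflows internally: the step and dataset names are hypothetical, and the functions `step_dependencies` and `execution_order` are introduced only for this example. A step depends on another step when it consumes a dataset that the other step produces.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical three-step workflow: each step lists the datasets it
# consumes and the datasets it produces.
steps = {
    "extract_sites": {"inputs": ["structures"], "outputs": ["binding_sites"]},
    "compare_sites": {"inputs": ["binding_sites", "drug_sites"], "outputs": ["matches"]},
    "build_network": {"inputs": ["matches"], "outputs": ["network"]},
}


def step_dependencies(steps):
    """Map each step to the set of steps that produce its inputs."""
    producers = {d: name for name, s in steps.items() for d in s["outputs"]}
    return {
        name: {producers[d] for d in s["inputs"] if d in producers}
        for name, s in steps.items()
    }


def execution_order(steps):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(step_dependencies(steps)).static_order())


order = execution_order(steps)
print(order)
```

Datasets without a producer, such as `structures` and `drug_sites` here, are treated as external inputs that must be supplied before the workflow runs.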
Jena semantic framework
Jena is a Java framework for building Semantic Web applications. It provides a programmatic environment for RDF, RDFS, OWL, and SPARQL, and includes a rule-based inference engine.
Jena is open source and grew out of work with the HP Labs Semantic Web Program.
AllegroGraph database
AllegroGraph is a modern, high-performance, persistent graph database. It combines efficient memory utilization with disk-based storage, which enables it to scale to billions of quads while maintaining performance. AllegroGraph supports SPARQL, RDFS++, and Prolog reasoning from numerous client applications.
Pubby can be used to add Linked Data interfaces to SPARQL endpoints. Much Semantic Web data lives inside triple stores and can be accessed only by sending SPARQL queries to a SPARQL endpoint. It is hard to connect information in these stores with other external data sources.
Pubby makes it easy to turn a SPARQL endpoint into a Linked Data server. It is implemented as a Java web application.
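Conceptually, a Linked Data frontend answers a request for a resource URI by asking the SPARQL endpoint to describe that resource. The sketch below illustrates that idea only; it is not Pubby's code or configuration, and the function `describe_request`, the endpoint URL, and the resource URI are all hypothetical. It builds the URL of a SPARQL protocol GET request carrying a `DESCRIBE` query.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Characters that are not allowed inside a SPARQL IRI reference.
FORBIDDEN = set('<>"{}|\\^` ')


def describe_request(resource_uri, endpoint):
    """Build a GET URL that asks `endpoint` to DESCRIBE `resource_uri`."""
    if FORBIDDEN & set(resource_uri):
        raise ValueError("resource URI contains characters not allowed in an IRI")
    query = f"DESCRIBE <{resource_uri}>"
    return endpoint + "?" + urlencode({"query": query})


# Hypothetical endpoint and resource, for illustration only.
url = describe_request(
    "http://example.org/resource/run1",
    "http://example.org/sparql",
)

# Decoding the URL recovers the query that the endpoint would receive.
sent_query = parse_qs(urlsplit(url).query)["query"][0]
print(sent_query)
```

Rejecting characters such as `>` before building the query avoids injecting extra SPARQL into the request when the resource URI comes from an incoming HTTP path.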
This project is sponsored by Elsevier Labs, the National Science Foundation with award number CCF-0725332, the Air Force Office of Scientific Research with award number FA9550-11-1-0104, and by internal funds from the University of Southern California's Information Sciences Institute and from the University of California, San Diego.