Making the workflows more efficient
Designing workflow templates for parallel execution
A parallelized version of the workflow template supports more efficient execution on large datasets. Some steps of the overall workflow are very time-consuming: the SMAP step, for example, requires about 5 seconds per comparison. If the comparisons are made sequentially, the execution can last for days:
749 solved structures * 962 drug binding sites = 720,538 comparisons * 5 secs per comparison ≈ 3.6 million seconds ≈ 41 days!!
Parallelization is important: since every solved structure must be compared against all the drug binding sites, we can divide the solved structure list and the homology model list among the number of threads our machine supports. In the extreme case of 749 parallel threads (one per solved structure), the time drops to:
962 comparisons * 5 secs per comparison = 4,810 seconds ≈ 1.33 hours!!
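The two estimates above can be reproduced with a few lines of arithmetic; the dataset sizes and the 5-second cost per comparison come from the text, and the one-thread-per-structure split is the idealized case described:

```python
# Back-of-the-envelope timings for the SMAP step, as given in the text.
SOLVED_STRUCTURES = 749
BINDING_SITES = 962
SECS_PER_COMPARISON = 5

# Sequential execution: every structure against every binding site.
serial_comparisons = SOLVED_STRUCTURES * BINDING_SITES
serial_secs = serial_comparisons * SECS_PER_COMPARISON
print(serial_comparisons)        # 720538
print(serial_secs // 86400)      # 41 (days)

# Idealized parallel execution: one thread per solved structure,
# so each thread performs only the 962 binding-site comparisons.
parallel_secs = BINDING_SITES * SECS_PER_COMPARISON
print(parallel_secs)             # 4810 seconds, about 1.33 hours
```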
The same applies to the docking step. Of course, the time reduction stated above is an idealized upper bound: it is rarely feasible to run more than 700 threads on a single machine, but even with far fewer threads the reduction is significant.
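In practice the structures are distributed over however many workers the machine supports, rather than one thread per structure. A minimal sketch of that partitioning with a thread pool follows; `compare` is a hypothetical stand-in for one SMAP comparison, and the small list sizes stand in for the 749 structures and 962 binding sites:

```python
from concurrent.futures import ThreadPoolExecutor

def compare(structure, site):
    # Placeholder for a single SMAP comparison (~5 s in practice);
    # a real implementation would invoke the SMAP software on this pair.
    return f"{structure}:{site}"

def compare_all_sites(structure, sites):
    # One task per solved structure: compare it against every binding site.
    return [compare(structure, site) for site in sites]

structures = [f"pdb{i}" for i in range(8)]  # stands in for the 749 structures
sites = [f"site{j}" for j in range(4)]      # stands in for the 962 sites

# max_workers reflects what the machine actually supports; the pool
# spreads the structures across the available threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(compare_all_sites, structures,
                            [sites] * len(structures)))

print(len(results), len(results[0]))  # 8 4
```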
Parallel execution in practice
Parallelizing the workflow required modifying some of the code (e.g., the SMAP software and its inputs) so that each step is fully encapsulated and supports concurrent execution of process instances.
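One common way to achieve that encapsulation is to give each process instance its own scratch directory so concurrent runs cannot interfere with each other's files. The sketch below assumes this pattern; `echo` stands in for the real SMAP command line, which is not shown in the text:

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor

def run_comparison(pair):
    structure, site = pair
    # Each invocation runs in its own scratch directory, so concurrent
    # process instances never clobber each other's input/output files --
    # the kind of encapsulation the modified SMAP code needed.
    with tempfile.TemporaryDirectory() as workdir:
        # `echo` is a stand-in for the real SMAP invocation (an assumption);
        # a real run would stage the input files into workdir first.
        result = subprocess.run(
            ["echo", f"{structure} vs {site}"],
            cwd=workdir, capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

pairs = [(f"struct{i}", f"site{j}") for i in range(2) for j in range(3)]
with ThreadPoolExecutor(max_workers=3) as pool:
    outputs = list(pool.map(run_comparison, pairs))

print(len(outputs))  # 6
```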
The workflow was run in two parts in order to test the improvements to each part separately.
Part 1 is the workflow without the docking part. The parallelized abstract workflow template can be seen below:
The parallelized workflow took 14 days to run, compared with three to four months for the original (non-parallelized) workflow.
Part 2 is the docking workflow, which did not need to be parallelized because it takes as input the filtered results produced by Part 1. The abstract workflow template can be seen below:
This docking workflow took 2 days to run.