Evaluate CWL jobs that should be skipped on the leader #5507
CWL jobs that have a when condition that evaluates to False should not be executed. Currently anything that is not a Workflow or ExpressionTool will always run on a worker node, which means that the check on the Conditional is only done when the worker node is already allocated. If the job is instead run on the leader (local), the step won't be executed. Checking at the instantiation level makes it possible to determine dynamically if the step should be run on the leader or the worker. This prevents unnecessary overhead in scheduling systems.
# If not using the Toil file store, output files just go directly to
# their final homes their space doesn't need to be accounted per-job.

options_dict: dict = {} # type: ignore
Is the # type: ignore here intentional? As far as I can tell dict = {} is valid Python and shouldn't produce a type error. Would dict[str, Any] be a more precise annotation, and would that remove the need for the ignore comment entirely?
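A minimal sketch of the more precise annotation, assuming the dictionary only carries keyword arguments for the job constructor. The `run_local` placeholder and the example values are illustrative and not taken from the PR. If the project's mypy configuration enables `disallow_any_generics`, the bare `dict` is what would be flagged, and the parameterized form would make the ignore comment unnecessary.

```python
from typing import Any

# Parameterized annotation: keys are keyword names, values are left open.
options_dict: dict[str, Any] = {}

run_local = False  # placeholder for the result of the conditional check
if not run_local:
    # Example values only; the real requirements come from the CWL tool.
    options_dict.update({"cores": 1, "memory": "1GiB", "disk": "1MiB"})
```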
# their final homes their space doesn't need to be accounted per-job.

options_dict: dict = {} # type: ignore
run_local: bool = self.conditional.is_false(cwljob)
cwljob may still have unresolved Promise objects at init time if when references an output from an upstream step. Since Conditional.is_false resolves promises without a file store, could this either crash or return the wrong result in that case? The worst case I can think of is is_false incorrectly returning True here, setting local=True with no resources, but then the fully-resolved condition at run() time returning False, meaning real work runs on the leader with no reserved resources. Would wrapping this in a try/except that falls back to run_local = False be a safe way to handle that?
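A minimal sketch of that fallback, written against stand-ins rather than Toil's real classes. `decide_run_local`, `UnresolvedPromiseError`, and `strict_is_false` are hypothetical names introduced for illustration; only the idea of wrapping the `is_false` call comes from the comment above.

```python
from typing import Any, Callable, Mapping


class UnresolvedPromiseError(Exception):
    """Hypothetical error raised when an input is still a promise."""


def decide_run_local(
    is_false: Callable[[Mapping[str, Any]], bool],
    cwljob: Mapping[str, Any],
) -> bool:
    """Return True only if the condition can be evaluated and is false.

    Any failure to evaluate (for example, unresolved upstream outputs)
    falls back to False, so the job keeps its worker resources.
    """
    try:
        return bool(is_false(cwljob))
    except Exception:
        # Conservative fallback: schedule on a worker and re-check in run().
        return False


def strict_is_false(cwljob: Mapping[str, Any]) -> bool:
    """Hypothetical condition check that refuses unresolved inputs."""
    if any(value is None for value in cwljob.values()):
        raise UnresolvedPromiseError("input not resolved yet")
    return not cwljob["flag"]
```

This guards against a crash, but not against `is_false` silently returning a wrong `True`. That case would need an explicit check for unresolved promises before the condition is evaluated.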
isinstance(tool, cwltool.command_line_tool.ExpressionTool)
or run_local
),
**options_dict,
When run_local is True, options_dict is empty so cores, memory, disk, accelerators, and preemptible all fall back to Job defaults. CWLJobWrapper, which also runs locally, explicitly passes cores=1, memory="1GiB", disk="1MiB" for its local run. Would it be worth doing the same here for consistency, rather than relying on the defaults being equivalent?
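One way this could look, as a sketch only. `build_job_options` and `LOCAL_RESOURCES` are hypothetical names; the three values are the ones quoted above for CWLJobWrapper, and the real tool requirements would come from the existing code in the PR.

```python
from typing import Any

# Values quoted in the review comment for CWLJobWrapper's local run.
LOCAL_RESOURCES: dict[str, Any] = {"cores": 1, "memory": "1GiB", "disk": "1MiB"}


def build_job_options(
    run_local: bool, tool_requirements: dict[str, Any]
) -> dict[str, Any]:
    """Hypothetical helper: choose keyword arguments for the job constructor."""
    if run_local:
        # Explicit minimal resources instead of relying on Job defaults.
        return dict(LOCAL_RESOURCES)
    # Copy so callers cannot mutate the tool's own requirements.
    return dict(tool_requirements)
```

Passing the values explicitly keeps the two local code paths visibly aligned even if the Job defaults change later.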
annagiroti left a comment
The overall approach looks clean, and the options_dict pattern for conditionally passing resources is a nice solution. My main concern is that is_false is called at init time, before promises are fully resolved. It is worth making sure this can't cause issues for when conditions that reference upstream step outputs. Would it also be worth adding test cases? For example, one where the when condition is false (verifying the job isn't submitted to the batch system) and one where it references an output from a previous step (to confirm it doesn't crash or mis-schedule).
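The two suggested cases could be sketched roughly as follows. Everything here is a toy stand-in: `FakeBatchSystem`, `schedule`, and `when_flag_is_false` are hypothetical and do not reflect Toil's test harness or batch system API. Real tests would run actual CWL workflows.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class FakeBatchSystem:
    """Hypothetical stand-in that records which jobs reach the batch system."""

    submitted: list[str] = field(default_factory=list)

    def issue(self, name: str) -> None:
        self.submitted.append(name)


def schedule(
    name: str,
    cwljob: dict[str, Any],
    is_false: Callable[[dict[str, Any]], bool],
    batch: FakeBatchSystem,
) -> str:
    """Toy scheduler: use the leader only when the condition is known false."""
    try:
        run_local = is_false(cwljob)
    except Exception:
        run_local = False  # cannot decide yet, so keep the worker path
    if run_local:
        return "leader"
    batch.issue(name)
    return "worker"


def when_flag_is_false(cwljob: dict[str, Any]) -> bool:
    """Toy condition: fails loudly if the flag is not a resolved boolean."""
    value = cwljob["flag"]
    if not isinstance(value, bool):
        raise TypeError("flag is not resolved yet")
    return not value


def test_false_condition_is_not_submitted() -> None:
    batch = FakeBatchSystem()
    assert schedule("step", {"flag": False}, when_flag_is_false, batch) == "leader"
    assert batch.submitted == []


def test_unresolved_upstream_output_goes_to_worker() -> None:
    batch = FakeBatchSystem()
    unresolved = object()  # placeholder for a promise from an upstream step
    assert schedule("step", {"flag": unresolved}, when_flag_is_false, batch) == "worker"
    assert batch.submitted == ["step"]
```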
Resolves #3990.
Changelog Entry
To be copied to the draft changelog by merger:
when conditional on the leader.
Reviewer Checklist
issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
camelCase that want to be in snake_case.
docs/running/{cliOptions,cwl,wdl}.rst
Merger Checklist