class documentation

class ParallelRunner(AbstractRunner): (source)

ParallelRunner is an AbstractRunner implementation. It can be used to run the Pipeline in parallel groups formed by toposort. Please note that this runner implementation validates the catalog using the _validate_catalog method, which checks whether any of the datasets are single-process only via the _SINGLE_PROCESS dataset attribute.
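
As a quick orientation, the sketch below shows one way this runner is typically driven: a Pipeline and a DataCatalog are handed to run(), and any unregistered intermediate datasets are created on the fly via create_default_data_set. It is a minimal, illustrative example (a toy one-node pipeline rather than a real project) and assumes a Kedro version in which run() can be called without passing an explicit PluginManager.

    from kedro.io import DataCatalog, MemoryDataSet
    from kedro.pipeline import Pipeline, node
    from kedro.runner import ParallelRunner

    def double(x):
        return x * 2

    # One node reads "numbers" and produces "doubled"; "doubled" is not
    # registered in the catalog, so the runner backs it with a default
    # (shared-memory) dataset via create_default_data_set.
    pipeline = Pipeline([node(double, inputs="numbers", outputs="doubled")])
    catalog = DataCatalog({"numbers": MemoryDataSet(21)})

    if __name__ == "__main__":  # guard is required when worker processes are spawned
        runner = ParallelRunner(max_workers=2)
        runner.run(pipeline, catalog)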

Method __del__ Undocumented
Method __init__ Instantiates the runner by creating a Manager.
Method create_default_data_set Factory method for creating the default dataset for the runner.
Class Method _validate_catalog Ensure that all datasets are serialisable and that no non-proxied memory datasets are used as outputs, as their content will not be synchronised across processes.
Class Method _validate_nodes Ensure all tasks are serialisable.
Method _get_required_workers_count Calculate the maximum number of processes required for the pipeline, limited to the number of CPU cores.
Method _run The abstract interface for running pipelines.
Instance Variable _manager Undocumented
Instance Variable _max_workers Undocumented

Inherited from AbstractRunner:

Method run Run the Pipeline using the datasets provided by catalog and save results back to the same objects.
Method run_only_missing Run only the missing outputs from the Pipeline using the datasets provided by catalog, and save results back to the same objects.
Method _suggest_resume_scenario Suggest a command to the user to resume a run after it fails. The run should be started from the point closest to the failure for which persisted input exists.
Instance Variable _is_async Undocumented
Property _logger Undocumented
def __del__(self): (source)

Undocumented

def __init__(self, max_workers: int = None, is_async: bool = False): (source)

Instantiates the runner by creating a Manager.

Parameters
max_workers : int
    Number of worker processes to spawn. If not set, it is calculated automatically based on the pipeline configuration and the CPU core count. On Windows machines, max_workers cannot be larger than 61 and will be set to min(61, max_workers).
is_async : bool
    If True, the node inputs and outputs are loaded and saved asynchronously with threads. Defaults to False.
Raises
ValueError
    If bad parameters are passed.
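
To make the max_workers fallback and the Windows cap concrete, here is a rough, hypothetical sketch of how such a value could be resolved; the helper name and the validation are illustrative, not the library's code.

    import os
    import sys

    def _resolve_max_workers(max_workers=None):
        # Hypothetical helper mirroring the behaviour described above.
        if max_workers is not None and max_workers <= 0:
            raise ValueError("max_workers should be a positive integer")
        if max_workers is None:
            # Fall back to the CPU core count; the per-pipeline requirement is
            # applied separately (see _get_required_workers_count below).
            max_workers = os.cpu_count() or 1
        if sys.platform == "win32":
            # Windows process pools cannot exceed 61 workers, so cap the value.
            max_workers = min(61, max_workers)
        return max_workers
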
def create_default_data_set(self, ds_name: str) -> _SharedMemoryDataSet: (source)

Factory method for creating the default dataset for the runner.

Parameters
ds_name : str
    Name of the missing dataset.
Returns
_SharedMemoryDataSet
    An instance of _SharedMemoryDataSet to be used for all unregistered datasets.
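
In practice this means every dataset name that the pipeline references but the catalog does not contain gets backed by a shared-memory dataset, so worker processes and the parent can exchange its contents. A hypothetical sketch of that resolution step (the helper below is illustrative, not Kedro's internal code):

    def add_default_datasets(runner, catalog, pipeline):
        # Hypothetical helper: register a runner-specific default dataset for
        # every name the pipeline references but the catalog does not know about.
        unregistered = pipeline.data_sets() - set(catalog.list())
        for ds_name in unregistered:
            catalog.add(ds_name, runner.create_default_data_set(ds_name))
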
@classmethod
def _validate_catalog(cls, catalog: DataCatalog, pipeline: Pipeline): (source)

Ensure that all datasets are serialisable and that no non-proxied memory datasets are used as outputs, as their content will not be synchronised across processes.
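
A simplified illustration of what such a check involves is sketched below; only the _SINGLE_PROCESS attribute name is taken from the documentation above, and everything else (including the use of catalog internals) is illustrative rather than the library's actual code.

    from kedro.io import DataCatalog, MemoryDataSet

    def validate_catalog_sketch(catalog: DataCatalog, pipeline) -> None:
        # Illustrative only: reject datasets flagged as single-process, and plain
        # MemoryDataSets used as outputs, whose contents written in a worker
        # process would never reach the parent process.
        single_process = [
            name for name in catalog.list()
            if getattr(catalog._get_dataset(name), "_SINGLE_PROCESS", False)
        ]
        if single_process:
            raise AttributeError(
                f"These datasets cannot be used with multiprocessing: {single_process}"
            )
        memory_outputs = [
            name for name in catalog.list()
            if name in pipeline.all_outputs()
            and type(catalog._get_dataset(name)) is MemoryDataSet
        ]
        if memory_outputs:
            raise AttributeError(
                f"Non-proxied MemoryDataSets used as outputs: {memory_outputs}"
            )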

@classmethod
def _validate_nodes(cls, nodes: Iterable[Node]): (source)

Ensure all tasks are serialisable.
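
"Serialisable" here means the node, including its underlying function and arguments, can be pickled and sent to a worker process. A rough stand-in for such a check (not the library's exact code):

    import pickle
    from multiprocessing.reduction import ForkingPickler

    def validate_nodes_sketch(nodes) -> None:
        # Try to pickle each node the same way multiprocessing would;
        # anything that fails cannot be shipped to a worker process.
        unserialisable = []
        for node in nodes:
            try:
                ForkingPickler.dumps(node)
            except (AttributeError, pickle.PicklingError):
                unserialisable.append(node)
        if unserialisable:
            raise AttributeError(
                f"The following nodes cannot be serialised: {unserialisable}"
            )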

def _get_required_workers_count(self, pipeline: Pipeline): (source)

Calculate the maximum number of processes required for the pipeline, limited to the number of CPU cores.
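
One way to reason about this bound: after toposorting, nodes fall into groups whose members can run concurrently, so the widest group is an upper limit on useful parallelism. A hedged sketch of that idea (the exact formula in the library may differ):

    import os

    def required_workers_sketch(pipeline, max_workers=None) -> int:
        # The widest toposorted group is the largest number of nodes that can
        # ever run at the same time, so spawning more processes than that is wasted.
        widest_group = max(len(group) for group in pipeline.grouped_nodes)
        return min(widest_group, max_workers or os.cpu_count() or 1)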

def _run(self, pipeline: Pipeline, catalog: DataCatalog, hook_manager: PluginManager, session_id: str = None): (source)

The abstract interface for running pipelines.

Parameters
pipeline : Pipeline
    The Pipeline to run.
catalog : DataCatalog
    The DataCatalog from which to fetch data.
hook_manager : PluginManager
    The PluginManager to activate hooks.
session_id : str
    The id of the session.
Raises
AttributeError
    When the provided pipeline is not suitable for parallel execution.
RuntimeError
    If the runner is unable to schedule the execution of all pipeline nodes.
Exception
    In case of any downstream node failure.
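
For ParallelRunner this is where the parallel scheduling happens. The sketch below illustrates only the core idea described at the top of this page, running each toposorted group in a process pool; the real method also wires in the catalog, hooks and session id, all omitted here.

    from concurrent.futures import ProcessPoolExecutor, as_completed

    def run_in_groups_sketch(grouped_nodes, run_node, max_workers):
        # Illustrative scheduling loop: nodes in the same toposorted group have
        # no dependencies on one another, so a whole group can be submitted at once.
        with ProcessPoolExecutor(max_workers=max_workers) as pool:
            for group in grouped_nodes:
                futures = [pool.submit(run_node, node) for node in group]
                for future in as_completed(futures):
                    future.result()  # re-raises any worker failure in the parent
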
_manager = (source)

Undocumented

_max_workers = (source)

Undocumented