kedro.io.DataCatalog

class documentation

class DataCatalog:

DataCatalog stores instances of AbstractDataSet implementations to provide load and save capabilities from anywhere in the program. To use a DataCatalog, you need to instantiate it with a dictionary of data sets. Then it will act as a single point of reference for your calls, relaying load and save functions to the underlying data sets.
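
The relay idea can be sketched without Kedro itself. The classes below (`InMemoryStore`, `ToyCatalog`) are hypothetical stand-ins, not Kedro's implementation; they only illustrate how a catalog might forward `load` and `save` calls to whichever data set is registered under a name.

```python
from typing import Any, Dict


class InMemoryStore:
    """Hypothetical stand-in for a data set: keeps one object in memory."""

    def __init__(self) -> None:
        self._data: Any = None

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data


class ToyCatalog:
    """Hypothetical stand-in for a catalog: a name -> data set registry."""

    def __init__(self, data_sets: Dict[str, InMemoryStore]) -> None:
        # Copy the mapping so later changes to the caller's dict do not leak in.
        self._data_sets = dict(data_sets)

    def load(self, name: str) -> Any:
        # Relay the call to the data set registered under `name`.
        return self._data_sets[name].load()

    def save(self, name: str, data: Any) -> None:
        # Relay the call the same way for writes.
        self._data_sets[name].save(data)


catalog = ToyCatalog({"cars": InMemoryStore()})
catalog.save("cars", [{"make": "Saab", "doors": 4}])
cars = catalog.load("cars")
```

Callers only ever need the catalog and a name; where and how the data is stored stays inside the registered data set.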

Class Method	`from_config`	Create a `DataCatalog` instance from configuration. This is a factory method used to provide developers with a way to instantiate `DataCatalog` with configuration parsed from configuration files.
Method	`__eq__`	Undocumented
Method	`__init__`	`DataCatalog` stores instances of `AbstractDataSet` implementations to provide `load` and `save` capabilities from anywhere in the program. To use a `DataCatalog`, you need to instantiate it with a dictionary of data sets...
Method	`add`	Adds a new `AbstractDataSet` object to the `DataCatalog`.
Method	`add_all`	Adds a group of new data sets to the `DataCatalog`.
Method	`add_feed_dict`	Adds instances of `MemoryDataSet`, containing the data provided through feed_dict.
Method	`confirm`	Confirm a dataset by its name.
Method	`exists`	Checks whether a registered data set exists by calling its `exists()` method. Issues a warning and returns `False` if `exists()` is not implemented.
Method	`list`	List of all `DataSet` names registered in the catalog. This can be filtered by providing an optional regular expression which will only return matching keys.
Method	`load`	Loads a registered data set.
Method	`release`	Release any cached data associated with a data set.
Method	`save`	Save data to a registered data set.
Method	`shallow_copy`	Returns a shallow copy of the current object.
Instance Variable	`datasets`	Undocumented
Instance Variable	`layers`	Undocumented
Method	`_get_dataset`	Undocumented
Instance Variable	`_data_sets`	Undocumented
Property	`_logger`	Undocumented

@classmethod
def from_config(cls: Type, catalog: Optional[Dict[str, Dict[str, Any]]], credentials: Dict[str, Dict[str, Any]] = None, load_versions: Dict[str, str] = None, save_version: str = None) -> DataCatalog:

Create a DataCatalog instance from configuration. This is a factory method used to provide developers with a way to instantiate DataCatalog with configuration parsed from configuration files.

Example:

>>> from kedro.io import DataCatalog
>>>
>>> config = {
>>>     "cars": {
>>>         "type": "pandas.CSVDataSet",
>>>         "filepath": "cars.csv",
>>>         "save_args": {
>>>             "index": False
>>>         }
>>>     },
>>>     "boats": {
>>>         "type": "pandas.CSVDataSet",
>>>         "filepath": "s3://aws-bucket-name/boats.csv",
>>>         "credentials": "boats_credentials",
>>>         "save_args": {
>>>             "index": False
>>>         }
>>>     }
>>> }
>>>
>>> credentials = {
>>>     "boats_credentials": {
>>>         "client_kwargs": {
>>>             "aws_access_key_id": "<your key id>",
>>>             "aws_secret_access_key": "<your secret>"
>>>         }
>>>     }
>>> }
>>>
>>> catalog = DataCatalog.from_config(config, credentials)
>>>
>>> df = catalog.load("cars")
>>> catalog.save("boats", df)

Parameters
catalog:`Optional[Dict[str, Dict[str, Any]]]`	A dictionary whose keys are the data set names and whose values are dictionaries with the constructor arguments for classes implementing `AbstractDataSet`. The data set class to be loaded is specified with the key `type` and its fully qualified class name. All `kedro.io` data sets can be specified by their class name only, i.e. their module name can be omitted.
credentials:`Dict[str, Dict[str, Any]]`	A dictionary containing credentials for different data sets. Use the `credentials` key in an `AbstractDataSet` configuration to refer to the appropriate credentials, as shown in the example above.
load_versions:`Dict[str, str]`	A mapping between dataset names and versions to load. Has no effect on data sets without enabled versioning.
save_version:`str`	Version string to be used for `save` operations by all data sets with enabled versioning. It must: a) be a case-insensitive string that conforms with operating system filename limitations, b) always return the latest version when sorted in lexicographical order.
Returns
`DataCatalog`	An instantiated `DataCatalog` containing all specified data sets, created and ready to use.
Raises
`DataSetError`	When the method fails to create any of the data sets from their config.
`DataSetNotFoundError`	When `load_versions` refers to a dataset that doesn't exist in the catalog.
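
Requirement (b) on `save_version` is easiest to satisfy with a zero-padded UTC timestamp, because such strings sort lexicographically in chronological order. The sketch below uses only the standard library; the helper name `make_version` and the exact format string are assumptions for illustration, not necessarily what Kedro generates.

```python
from datetime import datetime, timedelta, timezone


def make_version(moment: datetime) -> str:
    """Hypothetical helper: format a timestamp as a filename-safe version string."""
    # Zero-padded fields, largest unit first, and no ':' (invalid in Windows filenames).
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")


start = datetime(2023, 1, 9, 23, 59, 59, tzinfo=timezone.utc)

# Build three versions out of chronological order on purpose.
versions = [make_version(start + timedelta(seconds=offset)) for offset in (2, 0, 1)]

# Lexicographic order matches chronological order, so max() is the latest version.
latest = max(versions)
```

A format that puts the day before the year, or that omits zero padding, would break this property.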

def __eq__(self, other):

Undocumented

def __init__(self, data_sets: Dict[str, AbstractDataSet] = None, feed_dict: Dict[str, Any] = None, layers: Dict[str, Set[str]] = None):

Example:

>>> from kedro.io import DataCatalog
>>> from kedro.extras.datasets.pandas import CSVDataSet
>>>
>>> cars = CSVDataSet(filepath="cars.csv",
>>>                   load_args=None,
>>>                   save_args={"index": False})
>>> io = DataCatalog(data_sets={'cars': cars})

Parameters
data_sets:`Dict[str, AbstractDataSet]`	A dictionary of data set names and data set instances.
feed_dict:`Dict[str, Any]`	A feed dict with data to be added in memory.
layers:`Dict[str, Set[str]]`	A dictionary of data set layers. It maps a layer name to a set of data set names, according to the data engineering convention. For more details, see https://kedro.readthedocs.io/en/stable/faq/faq.html#what-is-data-engineering-convention
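
The shape of `layers` may be easier to see in a small example. The sketch below groups data set names by a per-data-set layer label; the input dictionary `layer_of` and the layer names (`raw`, `primary`) are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Set

# Hypothetical input: each data set name mapped to the layer it belongs to.
layer_of = {
    "cars": "raw",
    "boats": "raw",
    "vehicles": "primary",
}

# Invert it into the structure `layers` expects: layer name -> set of names.
grouped: Dict[str, Set[str]] = defaultdict(set)
for data_set_name, layer_name in layer_of.items():
    grouped[layer_name].add(data_set_name)

# Convert to a plain dict so missing layers no longer spring into existence.
layers = dict(grouped)
```

The resulting dictionary has the type `Dict[str, Set[str]]` described for the `layers` parameter.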

def add(self, data_set_name: str, data_set: AbstractDataSet, replace: bool = False):

Adds a new AbstractDataSet object to the DataCatalog.

Example:

>>> from kedro.io import DataCatalog
>>> from kedro.extras.datasets.pandas import CSVDataSet
>>>
>>> io = DataCatalog(data_sets={
>>>                   'cars': CSVDataSet(filepath="cars.csv")
>>>                  })
>>>
>>> io.add("boats", CSVDataSet(filepath="boats.csv"))

Parameters
data_set_name:`str`	A unique data set name which has not been registered yet.
data_set:`AbstractDataSet`	A data set object to be associated with the given data set name.
replace:`bool`	Specifies whether replacing an existing `DataSet` with the same name is allowed.
Raises
`DataSetAlreadyExistsError`	When a data set with the same name has already been registered and `replace` is `False`.
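
How `replace` interacts with the error can be sketched with a plain dictionary as the registry. The names `register` and `NameTakenError` are hypothetical stand-ins for `add` and `DataSetAlreadyExistsError`; the sketch assumes the error is raised only when the name is taken and `replace` is left at `False`.

```python
from typing import Any, Dict


class NameTakenError(Exception):
    """Hypothetical stand-in for DataSetAlreadyExistsError."""


def register(registry: Dict[str, Any], name: str, data_set: Any, replace: bool = False) -> None:
    """Add `data_set` under `name`, refusing to overwrite unless `replace` is True."""
    if name in registry and not replace:
        raise NameTakenError(f"'{name}' has already been registered")
    registry[name] = data_set


registry: Dict[str, Any] = {}
register(registry, "cars", "first version")

# A second registration under the same name is rejected by default.
try:
    register(registry, "cars", "second version")
    rejected = False
except NameTakenError:
    rejected = True

# With replace=True the existing entry is overwritten instead.
register(registry, "cars", "second version", replace=True)
```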

def add_all(self, data_sets: Dict[str, AbstractDataSet], replace: bool = False):

Adds a group of new data sets to the DataCatalog.

Example:

>>> from kedro.io import DataCatalog
>>> from kedro.extras.datasets.pandas import CSVDataSet, ParquetDataSet
>>>
>>> io = DataCatalog(data_sets={
>>>                   "cars": CSVDataSet(filepath="cars.csv")
>>>                  })
>>> additional = {
>>>     "planes": ParquetDataSet("planes.parq"),
>>>     "boats": CSVDataSet(filepath="boats.csv")
>>> }
>>>
>>> io.add_all(additional)
>>>
>>> assert io.list() == ["cars", "planes", "boats"]

Parameters
data_sets:`Dict[str, AbstractDataSet]`	A dictionary of `DataSet` names and data set instances.
replace:`bool`	Specifies whether replacing an existing `DataSet` with the same name is allowed.
Raises
`DataSetAlreadyExistsError`	When a data set with the same name has already been registered and `replace` is `False`.

def add_feed_dict(self, feed_dict: Dict[str, Any], replace: bool = False):

Adds instances of MemoryDataSet, containing the data provided through feed_dict.

Example:

>>> import pandas as pd
>>>
>>> from kedro.io import DataCatalog
>>>
>>> df = pd.DataFrame({'col1': [1, 2],
>>>                    'col2': [4, 5],
>>>                    'col3': [5, 6]})
>>>
>>> io = DataCatalog()
>>> io.add_feed_dict({
>>>     'data': df
>>> }, replace=True)
>>>
>>> assert io.load("data").equals(df)

Parameters
feed_dict:`Dict[str, Any]`	A feed dict with data to be added in memory.
replace:`bool`	Specifies whether replacing an existing `DataSet` with the same name is allowed.

def confirm(self, name: str):

Confirm a dataset by its name.

Parameters
name:`str`	Name of the dataset.
Raises
`DataSetError`	When the dataset does not have a `confirm` method.

def exists(self, name: str) -> bool:

Checks whether a registered data set exists by calling its exists() method. Issues a warning and returns False if exists() is not implemented.

Parameters
name:`str`	Name of the data set to be checked.
Returns
`bool`	Whether the data set output exists.
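
The fallback described above can be illustrated with a small stand-alone sketch. The classes `WithExists` and `WithoutExists` and the function `output_exists` are hypothetical; how Kedro implements the check internally is not shown in this page, so treat this only as one way to get the documented behaviour.

```python
import logging

logger = logging.getLogger(__name__)


class WithExists:
    """Hypothetical data set that can report whether its output exists."""

    def exists(self) -> bool:
        return True


class WithoutExists:
    """Hypothetical data set that does not implement `exists()`."""


def output_exists(data_set: object) -> bool:
    """Return the data set's own answer, or False (with a warning) if it has none."""
    check = getattr(data_set, "exists", None)
    if not callable(check):
        logger.warning(
            "exists() not implemented for %s; assuming output does not exist.",
            type(data_set).__name__,
        )
        return False
    return bool(check())


results = [output_exists(WithExists()), output_exists(WithoutExists())]
```

Returning `False` rather than raising keeps callers simple: a data set that cannot answer is treated as if its output were missing.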

def list(self, regex_search: Optional[str] = None) -> List[str]:

List of all DataSet names registered in the catalog. This can be filtered by providing an optional regular expression which will only return matching keys.

Example:

>>> from kedro.io import DataCatalog
>>>
>>> io = DataCatalog()
>>> # get data sets where the substring 'raw' is present
>>> raw_data = io.list(regex_search='raw')
>>> # get data sets which start with 'prm' or 'feat'
>>> feat_eng_data = io.list(regex_search='^(prm|feat)')
>>> # get data sets which end with 'time_series'
>>> models = io.list(regex_search='.+time_series$')

Parameters
regex_search:`Optional[str]`	An optional regular expression used to limit the returned data sets to those whose names match the pattern.
Returns
`List[str]`	A list of `DataSet` names available which match the `regex_search` criteria (if provided). All data set names are returned by default.
Raises
`SyntaxError`	When an invalid regex filter is provided.
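
To see what the three patterns in the example select, the filter can be imitated with the standard `re` module. The function `filter_names` and the sample names are hypothetical; the sketch assumes the pattern is applied with search semantics (a match anywhere in the name), which is what the "substring" comment in the example suggests.

```python
import re
from typing import List, Optional

# Made-up data set names for illustration.
names = ["raw_cars", "prm_cars", "feat_speed", "sales_time_series"]


def filter_names(all_names: List[str], regex_search: Optional[str] = None) -> List[str]:
    """Hypothetical stand-in for the name filter."""
    if regex_search is None:
        # No pattern: every name is returned.
        return list(all_names)
    try:
        pattern = re.compile(regex_search)
    except re.error as exc:
        # Mirror the documented behaviour: an invalid pattern is a SyntaxError.
        raise SyntaxError(f"Invalid regular expression provided: '{regex_search}'") from exc
    # search() matches anywhere in the name, so anchors (^, $) must be explicit.
    return [name for name in all_names if pattern.search(name)]


raw = filter_names(names, "raw")
feat_eng = filter_names(names, "^(prm|feat)")
series = filter_names(names, ".+time_series$")
```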

def load(self, name: str, version: str = None) -> Any:

Loads a registered data set.

Example:

>>> from kedro.io import DataCatalog
>>> from kedro.extras.datasets.pandas import CSVDataSet
>>>
>>> cars = CSVDataSet(filepath="cars.csv",
>>>                   load_args=None,
>>>                   save_args={"index": False})
>>> io = DataCatalog(data_sets={'cars': cars})
>>>
>>> df = io.load("cars")

Parameters
name:`str`	Name of the data set to be loaded.
version:`str`	Optional version of the data to load. Works only with versioned data sets.
Returns
`Any`	The loaded data as configured.
Raises
`DataSetNotFoundError`	When a data set with the given name has not yet been registered.

def release(self, name: str):

Release any cached data associated with a data set.

Parameters
name:`str`	Name of the data set whose cached data should be released.
Raises
`DataSetNotFoundError`	When a data set with the given name has not yet been registered.

def save(self, name: str, data: Any):

Save data to a registered data set.

Example:

>>> import pandas as pd
>>>
>>> from kedro.io import DataCatalog
>>> from kedro.extras.datasets.pandas import CSVDataSet
>>>
>>> cars = CSVDataSet(filepath="cars.csv",
>>>                   load_args=None,
>>>                   save_args={"index": False})
>>> io = DataCatalog(data_sets={'cars': cars})
>>>
>>> df = pd.DataFrame({'col1': [1, 2],
>>>                    'col2': [4, 5],
>>>                    'col3': [5, 6]})
>>> io.save("cars", df)

Parameters
name:`str`	Name of the data set to be saved to.
data:`Any`	A data object to be saved as configured in the registered data set.
Raises
`DataSetNotFoundError`	When a data set with the given name has not yet been registered.

def shallow_copy(self) -> DataCatalog:

Returns a shallow copy of the current object.

Returns
`DataCatalog`	Copy of the current object.
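
What "shallow" means here can be illustrated with plain dictionaries standing in for a catalog's registry. This is a sketch of the general idea under that assumption, not Kedro's code: the copy gets its own name-to-data-set mapping, while the data set objects themselves are shared.

```python
import copy

# Plain dict standing in for a catalog's registry; the inner dicts stand in
# for data set objects.
registry = {"cars": {"filepath": "cars.csv"}}

duplicate = copy.copy(registry)

# Adding a name to the copy leaves the original registry untouched...
duplicate["boats"] = {"filepath": "boats.csv"}

# ...but both registries still point at the very same "cars" object.
shares_cars = duplicate["cars"] is registry["cars"]
```

Under this reading, registering extra data sets on a copy is safe, whereas mutating a shared data set object would be visible through both catalogs.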

datasets

Undocumented

layers

Undocumented

def _get_dataset(self, data_set_name: str, version: Version = None, suggest: bool = True) -> AbstractDataSet:

Undocumented

_data_sets

Undocumented

@property
_logger

Undocumented