class documentation

class DataCatalog: (source)

DataCatalog stores instances of AbstractDataSet implementations to provide load and save capabilities from anywhere in the program. To use a DataCatalog, you need to instantiate it with a dictionary of data sets. Then it will act as a single point of reference for your calls, relaying load and save functions to the underlying data sets.

Class Method from_config: Create a DataCatalog instance from configuration. This is a factory method used to provide developers with a way to instantiate DataCatalog with configuration parsed from configuration files.
Method __eq__: Undocumented
Method __init__: DataCatalog stores instances of AbstractDataSet implementations to provide load and save capabilities from anywhere in the program. To use a DataCatalog, you need to instantiate it with a dictionary of data sets...
Method add: Adds a new AbstractDataSet object to the DataCatalog.
Method add_all: Adds a group of new data sets to the DataCatalog.
Method add_feed_dict: Adds instances of MemoryDataSet, containing the data provided through feed_dict.
Method confirm: Confirm a dataset by its name.
Method exists: Checks whether registered data set exists by calling its exists() method. Raises a warning and returns False if exists() is not implemented.
Method list: List of all DataSet names registered in the catalog. This can be filtered by providing an optional regular expression which will only return matching keys.
Method load: Loads a registered data set.
Method release: Release any cached data associated with a data set.
Method save: Save data to a registered data set.
Method shallow_copy: Returns a shallow copy of the current object.
Instance Variable datasets: Undocumented
Instance Variable layers: Undocumented
Method _get_dataset: Undocumented
Instance Variable _data_sets: Undocumented
Property _logger: Undocumented
@classmethod
def from_config(cls: Type, catalog: Optional[Dict[str, Dict[str, Any]]], credentials: Dict[str, Dict[str, Any]] = None, load_versions: Dict[str, str] = None, save_version: str = None) -> DataCatalog: (source)

Create a DataCatalog instance from configuration. This is a factory method used to provide developers with a way to instantiate DataCatalog with configuration parsed from configuration files.

Example:

>>> config = {
>>>     "cars": {
>>>         "type": "pandas.CSVDataSet",
>>>         "filepath": "cars.csv",
>>>         "save_args": {
>>>             "index": False
>>>         }
>>>     },
>>>     "boats": {
>>>         "type": "pandas.CSVDataSet",
>>>         "filepath": "s3://aws-bucket-name/boats.csv",
>>>         "credentials": "boats_credentials",
>>>         "save_args": {
>>>             "index": False
>>>         }
>>>     }
>>> }
>>>
>>> credentials = {
>>>     "boats_credentials": {
>>>         "client_kwargs": {
>>>             "aws_access_key_id": "<your key id>",
>>>             "aws_secret_access_key": "<your secret>"
>>>         }
>>>      }
>>> }
>>>
>>> catalog = DataCatalog.from_config(config, credentials)
>>>
>>> df = catalog.load("cars")
>>> catalog.save("boats", df)
Parameters
catalog (Optional[Dict[str, Dict[str, Any]]]): A dictionary whose keys are the data set names and whose values are dictionaries with the constructor arguments for classes implementing AbstractDataSet. The data set class to be loaded is specified with the key type and its fully qualified class name. All kedro.io data sets can be specified by their class name only, i.e. their module name can be omitted.
credentials (Dict[str, Dict[str, Any]]): A dictionary containing credentials for different data sets. Use the credentials key in an AbstractDataSet to refer to the appropriate credentials as shown in the example above.
load_versions (Dict[str, str]): A mapping between dataset names and versions to load. Has no effect on data sets without enabled versioning.
save_version (str): Version string to be used for save operations by all data sets with enabled versioning. It must: a) be a case-insensitive string that conforms with operating system filename limitations, b) always return the latest version when sorted in lexicographical order.
Returns
DataCatalog: An instantiated DataCatalog containing all specified data sets, created and ready to use.
Raises
DataSetError: When the method fails to create any of the data sets from their config.
DataSetNotFoundError: When load_versions refers to a dataset that doesn't exist in the catalog.
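
The two requirements on save_version can be met by a UTC timestamp with fixed-width, zero-padded fields. The following is a minimal sketch under that assumption: the helper name make_save_version and the exact format string are illustrative and are not taken from the text above.

```python
from datetime import datetime, timedelta, timezone


def make_save_version(moment: datetime) -> str:
    """Hypothetical helper: format a timestamp as a filename-safe version string."""
    # Fixed-width, zero-padded fields make lexicographic order match time order.
    # Dots are used instead of colons, which some file systems reject in names.
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")


earlier = datetime(2023, 1, 9, 23, 59, 59, tzinfo=timezone.utc)
later = earlier + timedelta(seconds=2)

# Deliberately listed out of order; sorting must still put the newest last.
versions = [make_save_version(later), make_save_version(earlier)]
latest = sorted(versions)[-1]
```

Any scheme with the same two properties would do; a timestamp is simply a convenient way to get both at once.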
def __eq__(self, other): (source)

Undocumented

def __init__(self, data_sets: Dict[str, AbstractDataSet] = None, feed_dict: Dict[str, Any] = None, layers: Dict[str, Set[str]] = None): (source)

DataCatalog stores instances of AbstractDataSet implementations to provide load and save capabilities from anywhere in the program. To use a DataCatalog, you need to instantiate it with a dictionary of data sets. Then it will act as a single point of reference for your calls, relaying load and save functions to the underlying data sets.

Example:

>>> from kedro.extras.datasets.pandas import CSVDataSet
>>>
>>> cars = CSVDataSet(filepath="cars.csv",
>>>                   load_args=None,
>>>                   save_args={"index": False})
>>> io = DataCatalog(data_sets={'cars': cars})
Parameters
data_sets (Dict[str, AbstractDataSet]): A dictionary of data set names and data set instances.
feed_dict (Dict[str, Any]): A feed dict with data to be added in memory.
layers (Dict[str, Set[str]]): A dictionary of data set layers. It maps a layer name to a set of data set names, according to the data engineering convention. For more details, see https://kedro.readthedocs.io/en/stable/faq/faq.html#what-is-data-engineering-convention
def add(self, data_set_name: str, data_set: AbstractDataSet, replace: bool = False): (source)

Adds a new AbstractDataSet object to the DataCatalog.

Example:

>>> from kedro.extras.datasets.pandas import CSVDataSet
>>>
>>> io = DataCatalog(data_sets={
>>>                   'cars': CSVDataSet(filepath="cars.csv")
>>>                  })
>>>
>>> io.add("boats", CSVDataSet(filepath="boats.csv"))
Parameters
data_set_name (str): A unique data set name which has not been registered yet.
data_set (AbstractDataSet): A data set object to be associated with the given data set name.
replace (bool): Specifies whether replacing an existing DataSet with the same name is allowed.
Raises
DataSetAlreadyExistsError: When a data set with the same name has already been registered.
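
How replace interacts with DataSetAlreadyExistsError can be pictured with a small stand-in registry. This is a sketch of the documented behaviour, not Kedro's implementation: MiniCatalog and the locally defined DataSetAlreadyExistsError are hypothetical stand-ins, and plain strings take the place of data set objects.

```python
from typing import Any, Dict


class DataSetAlreadyExistsError(Exception):
    """Local stand-in for the error named above (not imported from Kedro)."""


class MiniCatalog:
    """Hypothetical stand-in that only models name registration."""

    def __init__(self) -> None:
        self._data_sets: Dict[str, Any] = {}

    def add(self, data_set_name: str, data_set: Any, replace: bool = False) -> None:
        # A name clash is an error unless the caller explicitly allows replacement.
        if data_set_name in self._data_sets and not replace:
            raise DataSetAlreadyExistsError(
                f"DataSet '{data_set_name}' has already been registered"
            )
        self._data_sets[data_set_name] = data_set


catalog = MiniCatalog()
catalog.add("boats", "first")

try:
    catalog.add("boats", "second")  # same name, replace defaults to False
    rejected = False
except DataSetAlreadyExistsError:
    rejected = True

catalog.add("boats", "third", replace=True)  # allowed: overwrites the entry
```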
def add_all(self, data_sets: Dict[str, AbstractDataSet], replace: bool = False): (source)

Adds a group of new data sets to the DataCatalog.

Example:

>>> from kedro.extras.datasets.pandas import CSVDataSet, ParquetDataSet
>>>
>>> io = DataCatalog(data_sets={
>>>                   "cars": CSVDataSet(filepath="cars.csv")
>>>                  })
>>> additional = {
>>>     "planes": ParquetDataSet("planes.parq"),
>>>     "boats": CSVDataSet(filepath="boats.csv")
>>> }
>>>
>>> io.add_all(additional)
>>>
>>> assert io.list() == ["cars", "planes", "boats"]
Parameters
data_sets (Dict[str, AbstractDataSet]): A dictionary of DataSet names and data set instances.
replace (bool): Specifies whether replacing an existing DataSet with the same name is allowed.
Raises
DataSetAlreadyExistsError: When a data set with the same name has already been registered.
def add_feed_dict(self, feed_dict: Dict[str, Any], replace: bool = False): (source)

Adds instances of MemoryDataSet, containing the data provided through feed_dict.

Example:

>>> import pandas as pd
>>>
>>> df = pd.DataFrame({'col1': [1, 2],
>>>                    'col2': [4, 5],
>>>                    'col3': [5, 6]})
>>>
>>> io = DataCatalog()
>>> io.add_feed_dict({
>>>     'data': df
>>> }, replace=True)
>>>
>>> assert io.load("data").equals(df)
Parameters
feed_dict (Dict[str, Any]): A feed dict with data to be added in memory.
replace (bool): Specifies whether replacing an existing DataSet with the same name is allowed.
def confirm(self, name: str): (source)

Confirm a dataset by its name.

Parameters
name (str): Name of the dataset.
Raises
DataSetError: When the dataset does not have a confirm method.
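
The error condition above amounts to a capability check on the registered object. The sketch below is illustrative only: the function confirm_by_name, both stand-in classes, and the locally defined DataSetError are hypothetical and are not part of Kedro.

```python
from typing import Any, Dict


class DataSetError(Exception):
    """Local stand-in for the error named above (not imported from Kedro)."""


class ConfirmableStandIn:
    """Hypothetical data set that supports confirmation."""

    def __init__(self) -> None:
        self.confirmed = False

    def confirm(self) -> None:
        self.confirmed = True


class PlainStandIn:
    """Hypothetical data set without a confirm method."""


def confirm_by_name(data_sets: Dict[str, Any], name: str) -> None:
    data_set = data_sets[name]
    # Only data sets that expose confirm() can be confirmed.
    if not hasattr(data_set, "confirm"):
        raise DataSetError(f"DataSet '{name}' does not have a 'confirm' method")
    data_set.confirm()


data_sets = {"incremental": ConfirmableStandIn(), "plain": PlainStandIn()}
confirm_by_name(data_sets, "incremental")

try:
    confirm_by_name(data_sets, "plain")
    failed = False
except DataSetError:
    failed = True
```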
def exists(self, name: str) -> bool: (source)

Checks whether registered data set exists by calling its exists() method. Raises a warning and returns False if exists() is not implemented.

Parameters
name (str): A data set to be checked.
Returns
bool: Whether the data set output exists.
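
The fallback described above (warn and return False when exists() is not implemented) can be sketched as follows. Everything here is a stand-in: catalog_exists, WithExists and WithoutExists are hypothetical names, and the use of the warnings module is an assumption about how the warning is surfaced.

```python
import warnings
from typing import Any, Dict


class WithExists:
    """Hypothetical data set that can report whether its output exists."""

    def __init__(self, present: bool) -> None:
        self._present = present

    def exists(self) -> bool:
        return self._present


class WithoutExists:
    """Hypothetical data set that does not implement exists()."""


def catalog_exists(data_sets: Dict[str, Any], name: str) -> bool:
    check = getattr(data_sets[name], "exists", None)
    if check is None:
        # No way to tell, so warn the caller and report the output as missing.
        warnings.warn(f"exists() is not implemented for data set '{name}'")
        return False
    return check()


data_sets = {"cars": WithExists(True), "boats": WithoutExists()}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cars_exist = catalog_exists(data_sets, "cars")
    boats_exist = catalog_exists(data_sets, "boats")
```

Treating "cannot check" as "does not exist" is the conservative choice: a caller that relies on the result will regenerate the output rather than assume it is present.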
def list(self, regex_search: Optional[str] = None) -> List[str]: (source)

List of all DataSet names registered in the catalog. This can be filtered by providing an optional regular expression which will only return matching keys.

Example:

>>> io = DataCatalog()
>>> # get data sets where the substring 'raw' is present
>>> raw_data = io.list(regex_search='raw')
>>> # get data sets which start with 'prm' or 'feat'
>>> feat_eng_data = io.list(regex_search='^(prm|feat)')
>>> # get data sets which end with 'time_series'
>>> models = io.list(regex_search='.+time_series$')
Parameters
regex_search (Optional[str]): An optional regular expression which can be provided to limit the data sets returned by a particular pattern.
Returns
List[str]: A list of DataSet names available which match the regex_search criteria (if provided). All data set names are returned by default.
Raises
SyntaxError: When an invalid regex filter is provided.
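
The three patterns in the example above can be tried against concrete names with the standard re module. This is a sketch of the documented filtering behaviour, not the catalog's own code: filter_names and the sample names are hypothetical, and case handling is left at the re default because the text does not specify it.

```python
import re
from typing import List, Optional


def filter_names(names: List[str], regex_search: Optional[str] = None) -> List[str]:
    """Hypothetical helper mirroring the documented filtering behaviour."""
    if regex_search is None:
        return list(names)  # no filter: every name is returned
    try:
        pattern = re.compile(regex_search)
    except re.error as exc:
        raise SyntaxError(f"Invalid regular expression: '{regex_search}'") from exc
    # search() matches anywhere in the name, so prefixes and suffixes need anchors.
    return [name for name in names if pattern.search(name)]


names = ["raw_cars", "prm_cars", "feat_speed", "sales_time_series"]

raw_data = filter_names(names, "raw")
feat_eng_data = filter_names(names, "^(prm|feat)")
models = filter_names(names, ".+time_series$")
```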
def load(self, name: str, version: str = None) -> Any: (source)

Loads a registered data set.

Example:

>>> from kedro.io import DataCatalog
>>> from kedro.extras.datasets.pandas import CSVDataSet
>>>
>>> cars = CSVDataSet(filepath="cars.csv",
>>>                   load_args=None,
>>>                   save_args={"index": False})
>>> io = DataCatalog(data_sets={'cars': cars})
>>>
>>> df = io.load("cars")
Parameters
name (str): A data set to be loaded.
version (str): Optional argument for concrete data version to be loaded. Works only with versioned datasets.
Returns
Any: The loaded data as configured.
Raises
DataSetNotFoundError: When a data set with the given name has not yet been registered.
def release(self, name: str): (source)

Release any cached data associated with a data set.

Parameters
name (str): A data set whose cached data is to be released.
Raises
DataSetNotFoundError: When a data set with the given name has not yet been registered.
def save(self, name: str, data: Any): (source)

Save data to a registered data set.

Example:

>>> import pandas as pd
>>>
>>> from kedro.extras.datasets.pandas import CSVDataSet
>>>
>>> cars = CSVDataSet(filepath="cars.csv",
>>>                   load_args=None,
>>>                   save_args={"index": False})
>>> io = DataCatalog(data_sets={'cars': cars})
>>>
>>> df = pd.DataFrame({'col1': [1, 2],
>>>                    'col2': [4, 5],
>>>                    'col3': [5, 6]})
>>> io.save("cars", df)
Parameters
name (str): A data set to be saved to.
data (Any): A data object to be saved as configured in the registered data set.
Raises
DataSetNotFoundError: When a data set with the given name has not yet been registered.
def shallow_copy(self) -> DataCatalog: (source)

Returns a shallow copy of the current object.

Returns
DataCatalog: Copy of the current object.
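
What "shallow" means in practice can be shown with a plain dictionary standing in for the catalog's registry. This is an illustration of general shallow-copy semantics in Python, assuming the registry behaves like a dict; StandInDataSet is a hypothetical class and nothing here is taken from the catalog's implementation.

```python
import copy


class StandInDataSet:
    """Hypothetical data set holding only a file path."""

    def __init__(self, filepath: str) -> None:
        self.filepath = filepath


original = {"cars": StandInDataSet("cars.csv")}
duplicate = copy.copy(original)  # new container, same underlying objects

# Registering a new name in the copy leaves the original untouched...
duplicate["boats"] = StandInDataSet("boats.csv")

# ...but both containers still point at the very same "cars" object.
shared = duplicate["cars"] is original["cars"]
```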
datasets = (source)

Undocumented

Undocumented

def _get_dataset(self, data_set_name: str, version: Version = None, suggest: bool = True) -> AbstractDataSet: (source)

Undocumented

_data_sets = (source)

Undocumented

Undocumented