kedro.extras.datasets.pandas.GenericDataSet

class documentation

class GenericDataSet(AbstractVersionedDataSet[pd.DataFrame, pd.DataFrame]):

pandas.GenericDataSet loads/saves data from/to a data file using an underlying filesystem (e.g. local, S3, GCS). It uses pandas to dynamically select the appropriate read/write method for the given file format on a best-effort basis.

Example usage for the YAML API:

cars:
  type: pandas.GenericDataSet
  file_format: csv
  filepath: s3://data/01_raw/company/cars.csv
  load_args:
    sep: ","
    na_values: ["#NA", NA]
  save_args:
    index: False
    date_format: "%Y-%m-%d"

This second example loads a SAS7BDAT file via the pd.read_sas method. Trying to save this dataset raises a DataSetError, since pandas does not provide an equivalent pd.DataFrame.to_sas write method.

flights:
   type: pandas.GenericDataSet
   file_format: sas
   filepath: data/01_raw/airplanes.sas7bdat
   load_args:
      format: sas7bdat

Example usage for the Python API:

>>> from kedro.extras.datasets.pandas import GenericDataSet
>>> import pandas as pd
>>>
>>> data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5],
...                      'col3': [5, 6]})
>>>
>>> data_set = GenericDataSet(filepath="test.csv", file_format='csv')
>>> data_set.save(data)
>>> reloaded = data_set.load()
>>> assert data.equals(reloaded)

Method	`__init__`	Creates a new instance of `GenericDataSet` pointing to a concrete data file on a specific filesystem. The appropriate pandas load/save methods are dynamically identified by string matching on a best effort basis.
Constant	`DEFAULT_LOAD_ARGS`	Undocumented
Constant	`DEFAULT_SAVE_ARGS`	Undocumented
Method	`_describe`	Undocumented
Method	`_ensure_file_system_target`	Undocumented
Method	`_exists`	Undocumented
Method	`_invalidate_cache`	Invalidate underlying filesystem caches.
Method	`_load`	Undocumented
Method	`_release`	Undocumented
Method	`_save`	Undocumented
Instance Variable	`_file_format`	Undocumented
Instance Variable	`_fs`	Undocumented
Instance Variable	`_fs_open_args_load`	Undocumented
Instance Variable	`_fs_open_args_save`	Undocumented
Instance Variable	`_load_args`	Undocumented
Instance Variable	`_protocol`	Undocumented
Instance Variable	`_save_args`	Undocumented

Inherited from AbstractVersionedDataSet:

Method	`exists`	Checks whether a data set's output already exists by calling the provided _exists() method.
Method	`load`	Loads data by delegation to the provided load method.
Method	`resolve_load_version`	Compute the version the dataset should be loaded with.
Method	`resolve_save_version`	Compute the version the dataset should be saved with.
Method	`save`	Saves data by delegation to the provided save method.
Method	`_fetch_latest_load_version`	Undocumented
Method	`_fetch_latest_save_version`	Generate and cache the current save version
Method	`_get_load_path`	Undocumented
Method	`_get_save_path`	Undocumented
Method	`_get_versioned_path`	Undocumented
Instance Variable	`_exists_function`	Undocumented
Instance Variable	`_filepath`	Undocumented
Instance Variable	`_glob_function`	Undocumented
Instance Variable	`_version`	Undocumented
Instance Variable	`_version_cache`	Undocumented

Inherited from AbstractDataSet (via AbstractVersionedDataSet):

Class Method	`from_config`	Create a data set instance using the configuration provided.
Method	`__str__`	Undocumented
Method	`release`	Release any cached data.
Method	`_copy`	Undocumented
Property	`_logger`	Undocumented

def __init__(self, filepath: str, file_format: str, load_args: Dict[str, Any] = None, save_args: Dict[str, Any] = None, version: Version = None, credentials: Dict[str, Any] = None, fs_args: Dict[str, Any] = None):

overrides kedro.io.AbstractVersionedDataSet.__init__

Creates a new instance of GenericDataSet pointing to a concrete data file on a specific filesystem. The appropriate pandas load/save methods are dynamically identified by string matching on a best effort basis.

Parameters
filepath:`str`	Filepath in POSIX format to a file prefixed with a protocol like `s3://`. If no prefix is provided, the `file` protocol (local filesystem) will be used. The prefix should be any protocol supported by `fsspec`. Key assumption: the first argument of either the load or the save method points to a filepath/buffer/io type location. Some read/write targets, such as 'clipboard' or 'records', will fail since they do not take a filepath-like argument.
file_format:`str`	String used to match the appropriate load/save method on a best-effort basis. For example, if 'csv' is passed in, `pandas.read_csv` and `pandas.DataFrame.to_csv` will be identified. An error will be raised unless at least one matching `read_{file_format}` or `to_{file_format}` method is identified.
load_args:`Dict[str, Any]`	Pandas options for loading files. Here you can find all available arguments: https://pandas.pydata.org/pandas-docs/stable/reference/io.html All defaults are preserved.
save_args:`Dict[str, Any]`	Pandas options for saving files. Here you can find all available arguments: https://pandas.pydata.org/pandas-docs/stable/reference/io.html All defaults are preserved, except "index", which is set to False.
version:`Version`	If specified, should be an instance of `kedro.io.core.Version`. If its `load` attribute is None, the latest version will be loaded. If its `save` attribute is None, save version will be autogenerated.
credentials:`Dict[str, Any]`	Credentials required to get access to the underlying filesystem. E.g. for `GCSFileSystem` it should look like `{"token": None}`.
fs_args:`Dict[str, Any]`	Extra arguments to pass into underlying filesystem class constructor (e.g. `{"project": "my-project"}` for `GCSFileSystem`), as well as to pass to the filesystem's `open` method through nested keys `open_args_load` and `open_args_save`. Here you can find all available arguments for `open`: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.open All defaults are preserved, except `mode`, which is set to `r` when loading and to `w` when saving.
Raises
`DataSetError`	Raised if no appropriate read or write method is identified for the given `file_format`.
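
The `filepath` and `fs_args` descriptions above can be illustrated with a short standard-library sketch. It assumes the protocol is simply the URL scheme of `filepath`, and that a caller-supplied `mode` takes precedence over the `r`/`w` defaults. The helpers `infer_protocol` and `split_fs_args` are hypothetical names for illustration, not Kedro functions, and Kedro's own parsing may differ in edge cases such as Windows drive letters.

```python
from copy import deepcopy
from typing import Any, Dict, Optional, Tuple
from urllib.parse import urlsplit


def infer_protocol(filepath: str) -> str:
    """Return the protocol prefix of filepath, or 'file' when there is none."""
    return urlsplit(filepath).scheme or "file"


def split_fs_args(
    fs_args: Optional[Dict[str, Any]] = None,
) -> Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any]]:
    """Split fs_args into constructor args, load open-args and save open-args."""
    # Work on a copy so the caller's dictionary is left untouched.
    remaining = deepcopy(fs_args) if fs_args else {}
    open_args_load = remaining.pop("open_args_load", {})
    open_args_save = remaining.pop("open_args_save", {})
    # Only `mode` gets a default; every other `open` default is preserved.
    open_args_load.setdefault("mode", "r")
    open_args_save.setdefault("mode", "w")
    return remaining, open_args_load, open_args_save


fs_args = {"project": "my-project", "open_args_save": {"mode": "wb"}}
constructor_args, load_open, save_open = split_fs_args(fs_args)

remote_protocol = infer_protocol("s3://data/01_raw/company/cars.csv")
local_protocol = infer_protocol("data/01_raw/airplanes.sas7bdat")
```

In this sketch, whatever remains in `constructor_args` is what would be forwarded to the filesystem class constructor, while the two nested dictionaries would be passed to the filesystem's `open` method when loading and saving respectively.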