class GenericDataSet(AbstractVersionedDataSet[
pandas.GenericDataSet loads/saves data from/to a data file using an underlying
filesystem (e.g. local, S3, GCS). It uses pandas to dynamically select the
appropriate type of read/write target on a best-effort basis.
Example usage for the YAML API:
    cars:
      type: pandas.GenericDataSet
      file_format: csv
      filepath: s3://data/01_raw/company/cars.csv
      load_args:
        sep: ","
        na_values: ["#NA", NA]
      save_args:
        index: False
        date_format: "%Y-%m-%d"
This second example is able to load a SAS7BDAT file via the pd.read_sas method. Trying to save this dataset will raise a DataSetError since pandas does not provide an equivalent pd.DataFrame.to_sas write method.
    flights:
      type: pandas.GenericDataSet
      file_format: sas
      filepath: data/01_raw/airplanes.sas7bdat
      load_args:
        format: sas7bdat
Example usage for the Python API:
>>> from kedro.extras.datasets.pandas import GenericDataSet
>>> import pandas as pd
>>>
>>> data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5],
...                      'col3': [5, 6]})
>>>
>>> data_set = GenericDataSet(filepath="test.csv", file_format='csv')
>>> data_set.save(data)
>>> reloaded = data_set.load()
>>> assert data.equals(reloaded)
Method | __init__ | Creates a new instance of GenericDataSet pointing to a concrete data file on a specific filesystem. The appropriate pandas load/save methods are dynamically identified by string matching on a best effort basis.
Constant | DEFAULT | Undocumented
Constant | DEFAULT | Undocumented
Method | _describe | Undocumented
Method | _ensure | Undocumented
Method | _exists | Undocumented
Method | _invalidate | Invalidate underlying filesystem caches.
Method | _load | Undocumented
Method | _release | Undocumented
Method | _save | Undocumented
Instance Variable | _file | Undocumented
Instance Variable | _fs | Undocumented
Instance Variable | _fs | Undocumented
Instance Variable | _fs | Undocumented
Instance Variable | _load | Undocumented
Instance Variable | _protocol | Undocumented
Instance Variable | _save | Undocumented
Inherited from AbstractVersionedDataSet:
Method | exists | Checks whether a data set's output already exists by calling the provided _exists() method.
Method | load | Loads data by delegation to the provided load method.
Method | resolve | Compute the version the dataset should be loaded with.
Method | resolve | Compute the version the dataset should be saved with.
Method | save | Saves data by delegation to the provided save method.
Method | _fetch | Undocumented
Method | _fetch | Generate and cache the current save version
Method | _get | Undocumented
Method | _get | Undocumented
Method | _get | Undocumented
Instance Variable | _exists | Undocumented
Instance Variable | _filepath | Undocumented
Instance Variable | _glob | Undocumented
Instance Variable | _version | Undocumented
Instance Variable | _version | Undocumented
Inherited from AbstractDataSet (via AbstractVersionedDataSet):
Class Method | from | Create a data set instance using the configuration provided.
Method | __str__ | Undocumented
Method | release | Release any cached data.
Method | _copy | Undocumented
Property | _logger | Undocumented
def __init__(self, filepath: str, file_format: str, load_args: Dict[str, Any] = None, save_args: Dict[str, Any] = None, version: Version = None, credentials: Dict[str, Any] = None, fs_args: Dict[str, Any] = None):
Creates a new instance of GenericDataSet pointing to a concrete data file on a specific filesystem. The appropriate pandas load/save methods are dynamically identified by string matching on a best effort basis.
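The string matching described above can be pictured with a short sketch. This is not Kedro's actual implementation: `resolve_io_methods` and `NoMatchingMethodError` are hypothetical names, and the two `SimpleNamespace` objects are stand-ins used so the snippet runs without pandas installed. With pandas available, one would presumably pass the `pandas` module as the reader namespace and `pandas.DataFrame` as the writer namespace.

```python
from types import SimpleNamespace
from typing import Any, Callable, Optional, Tuple


class NoMatchingMethodError(Exception):
    """Hypothetical stand-in for kedro's DataSetError."""


def resolve_io_methods(
    file_format: str, reader_ns: Any, writer_ns: Any
) -> Tuple[Optional[Callable], Optional[Callable]]:
    """Look up read_<file_format> and to_<file_format> by attribute name."""
    load_method = getattr(reader_ns, f"read_{file_format}", None)
    save_method = getattr(writer_ns, f"to_{file_format}", None)
    # Fail only when neither a reader nor a writer could be matched.
    if load_method is None and save_method is None:
        raise NoMatchingMethodError(
            f"No read_{file_format} or to_{file_format} method found."
        )
    return load_method, save_method


# Stand-in namespaces (assumption): a reader for csv and sas, a writer for csv only.
fake_reader = SimpleNamespace(read_csv=lambda path: "csv", read_sas=lambda path: "sas")
fake_writer = SimpleNamespace(to_csv=lambda frame, path: None)

csv_load, csv_save = resolve_io_methods("csv", fake_reader, fake_writer)
sas_load, sas_save = resolve_io_methods("sas", fake_reader, fake_writer)

try:
    resolve_io_methods("unknown", fake_reader, fake_writer)
    raised = False
except NoMatchingMethodError:
    raised = True
```

In this sketch the `sas` format resolves to a reader but no writer, which mirrors the load-only SAS7BDAT example earlier on this page.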
Parameters | |
filepath: str | Filepath in POSIX format to a file prefixed with a protocol like s3://. If no prefix is provided, the file protocol (local filesystem) will be used. The prefix should be any protocol supported by fsspec. Key assumption: the first argument of either load/save method points to a filepath/buffer/io type location. Some read/write targets, such as 'clipboard' or 'records', will fail since they do not take a filepath-like argument.
file_format: str | String used to match the appropriate load/save method on a best effort basis. For example, if 'csv' is passed in, pandas.read_csv and pandas.DataFrame.to_csv will be identified. An error will be raised unless at least one matching read_{file_format} or to_{file_format} method is identified.
load_args: Dict[str, Any] | Pandas options for loading files. Here you can find all available arguments: https://pandas.pydata.org/pandas-docs/stable/reference/io.html All defaults are preserved.
save_args: Dict[str, Any] | Pandas options for saving files. Here you can find all available arguments: https://pandas.pydata.org/pandas-docs/stable/reference/io.html All defaults are preserved, except "index", which is set to False.
version: Version | If specified, should be an instance of kedro.io.core.Version. If its load attribute is None, the latest version will be loaded. If its save attribute is None, the save version will be autogenerated.
credentials: Dict[str, Any] | Credentials required to get access to the underlying filesystem. E.g. for GCSFileSystem it should look like {"token": None}.
fs_args: Dict[str, Any] | Extra arguments to pass into the underlying filesystem class constructor (e.g. {"project": "my-project"} for GCSFileSystem), as well as to pass to the filesystem's open method through the nested keys open_args_load and open_args_save. Here you can find all available arguments for open: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.open All defaults are preserved, except mode, which is set to r when loading and to w when saving.
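As a rough illustration of how the nested `open_args_load` and `open_args_save` keys relate to the remaining filesystem arguments, the sketch below separates the three groups and applies the documented `mode` defaults. `split_fs_args` is a hypothetical helper, not part of Kedro, and it assumes that an explicitly supplied `mode` takes precedence over the default.

```python
from copy import deepcopy
from typing import Any, Dict, Optional, Tuple


def split_fs_args(
    fs_args: Optional[Dict[str, Any]] = None,
) -> Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any]]:
    """Separate filesystem constructor arguments from the nested open() arguments."""
    # Copy first so the caller's dictionary is never mutated.
    remaining = deepcopy(fs_args) if fs_args else {}
    open_args_load = remaining.pop("open_args_load", {})
    open_args_save = remaining.pop("open_args_save", {})
    # Apply the documented defaults; explicit values win (assumption).
    open_args_load.setdefault("mode", "r")
    open_args_save.setdefault("mode", "w")
    return remaining, open_args_load, open_args_save


user_fs_args = {"project": "my-project", "open_args_save": {"mode": "wb"}}
constructor_args, load_open_args, save_open_args = split_fs_args(user_fs_args)
```

Here `constructor_args` keeps only the `project` entry, the load side falls back to mode `r`, and the save side keeps the explicitly requested `wb`.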
Raises | |
DataSetError | Raised if no appropriate read or write method is identified.
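The filepath rule from the parameter description (a protocol prefix such as s3://, falling back to the local file protocol when none is given) can be sketched as follows. `split_protocol` is a hypothetical, simplified helper for illustration only; it is not the function Kedro or fsspec uses, and it ignores edge cases such as Windows drive letters.

```python
from typing import Tuple


def split_protocol(filepath: str) -> Tuple[str, str]:
    """Return (protocol, path), defaulting to the local "file" protocol."""
    protocol, separator, path = filepath.partition("://")
    if not separator:
        # No "://" prefix: treat the whole string as a local path.
        return "file", filepath
    return protocol, path


remote = split_protocol("s3://data/01_raw/company/cars.csv")
local = split_protocol("data/01_raw/airplanes.sas7bdat")
```

With the two filepaths used in the catalog examples on this page, the first resolves to the `s3` protocol and the second to `file`.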