kedro.extras.datasets.pandas.ExcelDataSet

class ExcelDataSet(AbstractVersionedDataSet[Union[pd.DataFrame, Dict[str, pd.DataFrame]], Union[pd.DataFrame, Dict[str, pd.DataFrame]]]):

ExcelDataSet loads data from and saves data to an Excel file using an underlying filesystem (e.g. local, S3, GCS). It uses pandas to handle the Excel file.

Example usage for the YAML API:

rockets:
  type: pandas.ExcelDataSet
  filepath: gcs://your_bucket/rockets.xlsx
  fs_args:
    project: my-project
  credentials: my_gcp_credentials
  save_args:
    sheet_name: Sheet1
  load_args:
    sheet_name: Sheet1

shuttles:
  type: pandas.ExcelDataSet
  filepath: data/01_raw/shuttles.xlsx
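
The two catalog entries above differ only in whether the filepath carries a protocol prefix. As a rough illustration of that rule (not Kedro's actual implementation), the sketch below uses a hypothetical helper, `split_protocol`, that falls back to the `file` protocol when no prefix is present. It deliberately ignores edge cases such as Windows drive letters.

```python
from typing import Tuple


def split_protocol(filepath: str) -> Tuple[str, str]:
    """Hypothetical helper: split 'protocol://path' into (protocol, path).

    A path without a prefix is treated as a local file ('file' protocol),
    matching the behaviour described for the filepath parameter.
    """
    if "://" in filepath:
        protocol, path = filepath.split("://", 1)
        return protocol, path
    return "file", filepath


# The two filepaths from the YAML examples above.
print(split_protocol("gcs://your_bucket/rockets.xlsx"))
print(split_protocol("data/01_raw/shuttles.xlsx"))
```

In practice the protocol must be one that `fsspec` supports, since the data set delegates filesystem access to it.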

Example usage for the Python API:

>>> from kedro.extras.datasets.pandas import ExcelDataSet
>>> import pandas as pd
>>>
>>> data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5],
...                      'col3': [5, 6]})
>>>
>>> data_set = ExcelDataSet(filepath="test.xlsx")
>>> data_set.save(data)
>>> reloaded = data_set.load()
>>> assert data.equals(reloaded)

To save a multi-sheet Excel file, no special save_args are required. Instead, return a dictionary of type Dict[str, pd.DataFrame] whose string keys are the sheet names.

Example usage for the YAML API for a multi-sheet Excel file:

trains:
  type: pandas.ExcelDataSet
  filepath: data/02_intermediate/company/trains.xlsx
  load_args:
    sheet_name: [Sheet1, Sheet2, Sheet3]

Example usage for the Python API for a multi-sheet Excel file:

>>> from kedro.extras.datasets.pandas import ExcelDataSet
>>> import pandas as pd
>>>
>>> dataframe = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5],
...                           'col3': [5, 6]})
>>> another_dataframe = pd.DataFrame({"x": [10, 20], "y": ["hello", "world"]})
>>> multiframe = {"Sheet1": dataframe, "Sheet2": another_dataframe}
>>> data_set = ExcelDataSet(filepath="test.xlsx", load_args={"sheet_name": None})
>>> data_set.save(multiframe)
>>> reloaded = data_set.load()
>>> assert multiframe["Sheet1"].equals(reloaded["Sheet1"])
>>> assert multiframe["Sheet2"].equals(reloaded["Sheet2"])

Method	`__init__`	Creates a new instance of `ExcelDataSet` pointing to a concrete Excel file on a specific filesystem.
Constant	`DEFAULT_LOAD_ARGS`	Undocumented
Constant	`DEFAULT_SAVE_ARGS`	Undocumented
Method	`_describe`	Undocumented
Method	`_exists`	Undocumented
Method	`_invalidate_cache`	Invalidate underlying filesystem caches.
Method	`_load`	Undocumented
Method	`_release`	Undocumented
Method	`_save`	Undocumented
Instance Variable	`_fs`	Undocumented
Instance Variable	`_load_args`	Undocumented
Instance Variable	`_protocol`	Undocumented
Instance Variable	`_save_args`	Undocumented
Instance Variable	`_storage_options`	Undocumented
Instance Variable	`_writer_args`	Undocumented

Inherited from AbstractVersionedDataSet:

Method	`exists`	Checks whether a data set's output already exists by calling the provided _exists() method.
Method	`load`	Loads data by delegation to the provided load method.
Method	`resolve_load_version`	Compute the version the dataset should be loaded with.
Method	`resolve_save_version`	Compute the version the dataset should be saved with.
Method	`save`	Saves data by delegation to the provided save method.
Method	`_fetch_latest_load_version`	Undocumented
Method	`_fetch_latest_save_version`	Generate and cache the current save version.
Method	`_get_load_path`	Undocumented
Method	`_get_save_path`	Undocumented
Method	`_get_versioned_path`	Undocumented
Instance Variable	`_exists_function`	Undocumented
Instance Variable	`_filepath`	Undocumented
Instance Variable	`_glob_function`	Undocumented
Instance Variable	`_version`	Undocumented
Instance Variable	`_version_cache`	Undocumented

Inherited from AbstractDataSet (via AbstractVersionedDataSet):

Class Method	`from_config`	Create a data set instance using the configuration provided.
Method	`__str__`	Undocumented
Method	`release`	Release any cached data.
Method	`_copy`	Undocumented
Property	`_logger`	Undocumented

def __init__(self, filepath: str, engine: str = 'openpyxl', load_args: Dict[str, Any] = None, save_args: Dict[str, Any] = None, version: Version = None, credentials: Dict[str, Any] = None, fs_args: Dict[str, Any] = None):

overrides kedro.io.AbstractVersionedDataSet.__init__

Creates a new instance of ExcelDataSet pointing to a concrete Excel file on a specific filesystem.

Parameters
filepath:`str`	Filepath in POSIX format to an Excel file, prefixed with a protocol like `s3://`. If no prefix is provided, the `file` protocol (local filesystem) is used. The prefix can be any protocol supported by `fsspec`. Note: `http(s)` doesn't support versioning.
engine:`str`	The engine used to write to Excel files. The default engine is 'openpyxl'.
load_args:`Dict[str, Any]`	Pandas options for loading Excel files. All available arguments are listed at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html. All defaults are preserved except "engine", which is set to "openpyxl". Multi-sheet Excel files are supported (include `sheet_name = None` in `load_args`).
save_args:`Dict[str, Any]`	Pandas options for saving Excel files. All available arguments are listed at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html. All defaults are preserved except "index", which is set to False. To specify options for the `ExcelWriter`, include them under the "writer" key. All available writer arguments are listed at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.ExcelWriter.html.
version:`Version`	If specified, should be an instance of `kedro.io.core.Version`. If its `load` attribute is None, the latest version will be loaded. If its `save` attribute is None, save version will be autogenerated.
credentials:`Dict[str, Any]`	Credentials required to get access to the underlying filesystem. E.g. for `GCSFileSystem` it should look like `{"token": None}`.
fs_args:`Dict[str, Any]`	Extra arguments to pass into underlying filesystem class constructor (e.g. `{"project": "my-project"}` for `GCSFileSystem`).
Raises
`DataSetError`	If versioning is enabled while in append mode.
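
The interplay between `save_args`, the "writer" key, and the `DataSetError` above can be sketched without Kedro installed. The code below is a dependency-free approximation, not the library's source: `split_save_args` is a hypothetical helper, the local `DataSetError` class stands in for Kedro's exception, and it assumes "append mode" means `mode: "a"` under the "writer" key (the `mode` option of `pandas.ExcelWriter`).

```python
from copy import deepcopy
from typing import Any, Dict, Optional, Tuple

# Documented default: "index" is set to False for to_excel.
DEFAULT_SAVE_ARGS: Dict[str, Any] = {"index": False}


class DataSetError(Exception):
    """Stand-in for Kedro's DataSetError so the sketch runs on its own."""


def split_save_args(
    save_args: Optional[Dict[str, Any]] = None,
    engine: str = "openpyxl",
    versioned: bool = False,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Hypothetical helper: return (to_excel options, ExcelWriter options)."""
    merged = deepcopy(DEFAULT_SAVE_ARGS)
    if save_args:
        # Copy so the caller's dictionary is never mutated below.
        merged.update(deepcopy(save_args))

    # Options under "writer" are meant for ExcelWriter, not to_excel.
    writer_args = merged.pop("writer", {})
    writer_args.setdefault("engine", engine)

    # Assumption: appending to an existing file conflicts with versioning.
    if versioned and writer_args.get("mode") == "a":
        raise DataSetError("Versioning is not supported in append mode.")
    return merged, writer_args


to_excel_args, writer_args = split_save_args(
    {"sheet_name": "Sheet1", "writer": {"mode": "a"}}
)
print(to_excel_args)  # options forwarded to DataFrame.to_excel
print(writer_args)    # options forwarded to pandas.ExcelWriter
```

Calling the helper with `versioned=True` and the same `save_args` raises the stand-in `DataSetError`, which mirrors the documented restriction.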