class ExcelDataSet(AbstractVersionedDataSet[
ExcelDataSet loads/saves data from/to an Excel file using an underlying filesystem (e.g. local, S3, GCS). It uses pandas to handle the Excel file.
Example usage for the YAML API:
rockets:
  type: pandas.ExcelDataSet
  filepath: gcs://your_bucket/rockets.xlsx
  fs_args:
    project: my-project
  credentials: my_gcp_credentials
  save_args:
    sheet_name: Sheet1
  load_args:
    sheet_name: Sheet1

shuttles:
  type: pandas.ExcelDataSet
  filepath: data/01_raw/shuttles.xlsx
Example usage for the Python API:
>>> from kedro.extras.datasets.pandas import ExcelDataSet
>>> import pandas as pd
>>>
>>> data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5], 'col3': [5, 6]})
>>>
>>> data_set = ExcelDataSet(filepath="test.xlsx")
>>> data_set.save(data)
>>> reloaded = data_set.load()
>>> assert data.equals(reloaded)
To save a multi-sheet Excel file, no special save_args are required. Instead, return a dictionary of type Dict[str, pd.DataFrame] whose string keys are the sheet names.
Example usage for the YAML API for a multi-sheet Excel file:
trains:
  type: pandas.ExcelDataSet
  filepath: data/02_intermediate/company/trains.xlsx
  load_args:
    sheet_name: [Sheet1, Sheet2, Sheet3]
Example usage for the Python API for a multi-sheet Excel file:
>>> from kedro.extras.datasets.pandas import ExcelDataSet
>>> import pandas as pd
>>>
>>> dataframe = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5], 'col3': [5, 6]})
>>> another_dataframe = pd.DataFrame({"x": [10, 20], "y": ["hello", "world"]})
>>> multiframe = {"Sheet1": dataframe, "Sheet2": another_dataframe}
>>> data_set = ExcelDataSet(filepath="test.xlsx", load_args={"sheet_name": None})
>>> data_set.save(multiframe)
>>> reloaded = data_set.load()
>>> assert multiframe["Sheet1"].equals(reloaded["Sheet1"])
>>> assert multiframe["Sheet2"].equals(reloaded["Sheet2"])
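As a complement to the examples above, the following is a minimal sketch of a function that produces the kind of Dict[str, pd.DataFrame] a multi-sheet save expects. The function name split_by_region and the column names region and units are hypothetical and chosen only for illustration; the sketch uses pandas alone and does not touch ExcelDataSet.

```python
from typing import Dict

import pandas as pd


def split_by_region(sales: pd.DataFrame) -> Dict[str, pd.DataFrame]:
    """Hypothetical node: build one sheet per distinct value of 'region'.

    The returned dictionary maps sheet names (str) to DataFrames, which is
    the shape described above for saving a multi-sheet Excel file.
    """
    return {
        str(region): group.drop(columns="region").reset_index(drop=True)
        for region, group in sales.groupby("region")
    }


# Illustrative input data, not taken from the documentation.
sales = pd.DataFrame({"region": ["north", "south", "north"], "units": [3, 5, 7]})
sheets = split_by_region(sales)
```

Under these assumptions, a dictionary like sheets could be returned from a pipeline node whose output is an ExcelDataSet, with each key becoming a sheet name.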
Method | __init__ | Creates a new instance of ExcelDataSet pointing to a concrete Excel file on a specific filesystem. |
Constant | DEFAULT | Undocumented |
Constant | DEFAULT | Undocumented |
Method | _describe | Undocumented |
Method | _exists | Undocumented |
Method | _invalidate | Invalidate underlying filesystem caches. |
Method | _load | Undocumented |
Method | _release | Undocumented |
Method | _save | Undocumented |
Instance Variable | _fs | Undocumented |
Instance Variable | _load | Undocumented |
Instance Variable | _protocol | Undocumented |
Instance Variable | _save | Undocumented |
Instance Variable | _storage | Undocumented |
Instance Variable | _writer | Undocumented |
Inherited from AbstractVersionedDataSet:
Method | exists | Checks whether a data set's output already exists by calling the provided _exists() method. |
Method | load | Loads data by delegation to the provided load method. |
Method | resolve | Compute the version the dataset should be loaded with. |
Method | resolve | Compute the version the dataset should be saved with. |
Method | save | Saves data by delegation to the provided save method. |
Method | _fetch | Undocumented |
Method | _fetch | Generate and cache the current save version |
Method | _get | Undocumented |
Method | _get | Undocumented |
Method | _get | Undocumented |
Instance Variable | _exists | Undocumented |
Instance Variable | _filepath | Undocumented |
Instance Variable | _glob | Undocumented |
Instance Variable | _version | Undocumented |
Instance Variable | _version | Undocumented |
Inherited from AbstractDataSet (via AbstractVersionedDataSet):
Class Method | from | Create a data set instance using the configuration provided. |
Method | __str__ | Undocumented |
Method | release | Release any cached data. |
Method | _copy | Undocumented |
Property | _logger | Undocumented |
def __init__(self, filepath: str, engine: str = 'openpyxl', load_args: Dict[str, Any] = None, save_args: Dict[str, Any] = None, version: Version = None, credentials: Dict[str, Any] = None, fs_args: Dict[str, Any] = None):
Creates a new instance of ExcelDataSet pointing to a concrete Excel file on a specific filesystem.
Parameters | |
filepath:str | Filepath in POSIX format to an Excel file, prefixed with a protocol like
s3://. If no prefix is provided, the file protocol (local filesystem) is used.
The prefix should be any protocol supported by fsspec.
Note: http(s) doesn't support versioning. |
engine:str | The engine used to write to Excel files. The default engine is 'openpyxl'. |
load_args:Dict[str, Any] | Pandas options for loading Excel files.
Here you can find all available arguments:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
All defaults are preserved except "engine", which is set to "openpyxl".
Supports multi-sheet Excel files (include sheet_name = None in load_args). |
save_args:Dict[str, Any] | Pandas options for saving Excel files.
Here you can find all available arguments:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html
All defaults are preserved except "index", which is set to False.
To specify options for the ExcelWriter, include them under the "writer" key.
Here you can find all available arguments:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.ExcelWriter.html |
version:Version | If specified, should be an instance of kedro.io.core.Version. If its load attribute is None, the latest version will be loaded. If its save attribute is None, save version will be autogenerated. |
credentials:Dict[str, Any] | Credentials required to get access to the underlying filesystem.
E.g. for GCSFileSystem it should look like {"token": None}. |
fs_args:Dict[str, Any] | Extra arguments to pass into the underlying filesystem class constructor
(e.g. {"project": "my-project"} for GCSFileSystem). |
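The filepath rule described above (an optional protocol prefix, falling back to the local filesystem) can be illustrated with a small standard-library sketch. The helper name split_protocol is hypothetical, and this is not how Kedro or fsspec actually parse paths; real implementations handle more cases, such as Windows drive letters.

```python
from typing import Tuple


def split_protocol(filepath: str) -> Tuple[str, str]:
    """Hypothetical helper: separate a protocol prefix from the rest of a path.

    Returns ("file", filepath) when no "://" prefix is present, mirroring the
    documented fallback to the local filesystem.
    """
    protocol, separator, path = filepath.partition("://")
    if not separator:
        return "file", filepath
    return protocol, path


remote = split_protocol("gcs://your_bucket/rockets.xlsx")
local = split_protocol("data/01_raw/shuttles.xlsx")
```

Here remote carries the gcs protocol taken from the prefix, while local falls back to file because the path has no prefix.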
Raises | |
DataSetError | If versioning is enabled while in append mode. |
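To make the interaction between save_args, the "writer" key, and the append-mode restriction more concrete, here is a minimal sketch of how such arguments might be separated. It is an illustration under stated assumptions, not the library's implementation: the names ASSUMED_DEFAULT_SAVE_ARGS and split_save_args are hypothetical, the default {"index": False} is taken from the save_args description above, and ValueError stands in for DataSetError so that the block runs without Kedro installed.

```python
from copy import deepcopy
from typing import Any, Dict, Optional, Tuple

# Assumption: the only non-pandas default is index=False, as documented above.
ASSUMED_DEFAULT_SAVE_ARGS: Dict[str, Any] = {"index": False}


def split_save_args(
    save_args: Optional[Dict[str, Any]],
    engine: str = "openpyxl",
    versioned: bool = False,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Hypothetical helper: split save_args into to_excel and ExcelWriter options."""
    # Start from the defaults and let user-supplied options override them.
    merged = deepcopy(ASSUMED_DEFAULT_SAVE_ARGS)
    merged.update(deepcopy(save_args or {}))

    # Options under the "writer" key are meant for ExcelWriter, not to_excel.
    writer_args = merged.pop("writer", {})
    writer_args.setdefault("engine", engine)

    # Stand-in for the documented DataSetError: versioning plus append mode.
    if versioned and writer_args.get("mode") == "a":
        raise ValueError("Versioning is not supported in append mode.")

    return merged, writer_args


to_excel_args, writer_args = split_save_args(
    {"sheet_name": "Sheet1", "writer": {"mode": "a"}}
)
```

In this sketch, calling split_save_args with versioned=True and a writer mode of "a" raises the stand-in error, which corresponds to the condition listed under Raises.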