
class ParquetDataSet(AbstractVersionedDataSet[pd.DataFrame, pd.DataFrame]): (source)


ParquetDataSet loads/saves data from/to a Parquet file using an underlying filesystem (e.g. local, S3, GCS). It uses pandas to handle the Parquet file.

Example usage for the YAML API:

boats:
  type: pandas.ParquetDataSet
  filepath: data/01_raw/boats.parquet
  load_args:
    engine: pyarrow
    use_nullable_dtypes: True
  save_args:
    file_scheme: hive
    has_nulls: False
    engine: pyarrow

trucks:
  type: pandas.ParquetDataSet
  filepath: abfs://container/02_intermediate/trucks.parquet
  credentials: dev_abs
  load_args:
    columns: [name, gear, disp, wt]
    index: name
  save_args:
    compression: GZIP
    partition_on: [name]

Example usage for the Python API:

>>> from kedro.extras.datasets.pandas import ParquetDataSet
>>> import pandas as pd
>>>
>>> data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5],
...                      'col3': [5, 6]})
>>>
>>> data_set = ParquetDataSet(filepath="test.parquet")
>>> data_set.save(data)
>>> reloaded = data_set.load()
>>> assert data.equals(reloaded)
Method __init__ Creates a new instance of ParquetDataSet pointing to a concrete Parquet file on a specific filesystem.
Constant DEFAULT_LOAD_ARGS Undocumented
Constant DEFAULT_SAVE_ARGS Undocumented
Method _describe Undocumented
Method _exists Undocumented
Method _invalidate_cache Invalidate underlying filesystem caches.
Method _load Undocumented
Method _load_from_pandas Undocumented
Method _release Undocumented
Method _save Undocumented
Instance Variable _fs Undocumented
Instance Variable _load_args Undocumented
Instance Variable _protocol Undocumented
Instance Variable _save_args Undocumented
Instance Variable _storage_options Undocumented

Inherited from AbstractVersionedDataSet:

Method exists Checks whether a data set's output already exists by calling the provided _exists() method.
Method load Loads data by delegation to the provided load method.
Method resolve_load_version Compute the version the dataset should be loaded with.
Method resolve_save_version Compute the version the dataset should be saved with.
Method save Saves data by delegation to the provided save method.
Method _fetch_latest_load_version Undocumented
Method _fetch_latest_save_version Generate and cache the current save version
Method _get_load_path Undocumented
Method _get_save_path Undocumented
Method _get_versioned_path Undocumented
Instance Variable _exists_function Undocumented
Instance Variable _filepath Undocumented
Instance Variable _glob_function Undocumented
Instance Variable _version Undocumented
Instance Variable _version_cache Undocumented

Inherited from AbstractDataSet (via AbstractVersionedDataSet):

Class Method from_config Create a data set instance using the configuration provided.
Method __str__ Undocumented
Method release Release any cached data.
Method _copy Undocumented
Property _logger Undocumented
def __init__(self, filepath: str, load_args: Dict[str, Any] = None, save_args: Dict[str, Any] = None, version: Version = None, credentials: Dict[str, Any] = None, fs_args: Dict[str, Any] = None): (source)

Creates a new instance of ParquetDataSet pointing to a concrete Parquet file on a specific filesystem.

Parameters
filepath (str): Filepath in POSIX format to a Parquet file, prefixed with a protocol like s3://. If no protocol prefix is given, the local file protocol is used. The prefix can be any protocol supported by fsspec. It can also be a path to a directory; in that case it is used to read partitioned Parquet files. Note: http(s) doesn't support versioning.
load_args (Dict[str, Any]): Additional options for loading Parquet file(s). All available arguments for reading a single file are listed at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html, and for reading partitioned datasets at https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset.read. All defaults are preserved.
save_args (Dict[str, Any]): Additional options for saving Parquet file(s). All available arguments are listed at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html. All defaults are preserved. partition_cols is not supported.
version (Version): If specified, should be an instance of kedro.io.core.Version. If its load attribute is None, the latest version will be loaded. If its save attribute is None, the save version will be autogenerated.
credentials (Dict[str, Any]): Credentials required to access the underlying filesystem. E.g. for GCSFileSystem it should look like {"token": None}.
fs_args (Dict[str, Any]): Extra arguments to pass to the underlying filesystem class constructor (e.g. {"project": "my-project"} for GCSFileSystem).
DEFAULT_LOAD_ARGS: Dict[str, Any] = (source)

Undocumented

Value
{}
DEFAULT_SAVE_ARGS: Dict[str, Any] = (source)

Undocumented

Value
{}
def _describe(self) -> Dict[str, Any]: (source)

Undocumented

def _exists(self) -> bool: (source)

Undocumented

def _invalidate_cache(self): (source)

Invalidate underlying filesystem caches.

def _load(self) -> pd.DataFrame: (source)

Undocumented

def _load_from_pandas(self): (source)

Undocumented

def _release(self): (source)

Undocumented

def _save(self, data: pd.DataFrame): (source)

Undocumented

_fs = (source)

Undocumented

_load_args = (source)

Undocumented

_protocol = (source)

Undocumented

_save_args = (source)

Undocumented

_storage_options = (source)

Undocumented