kedro.extras.datasets.spark.SparkHiveDataSet

class documentation

class SparkHiveDataSet(AbstractDataSet[DataFrame, DataFrame]):

SparkHiveDataSet loads and saves Spark DataFrames stored in Hive. It also handles storage formats that do not natively support upserts, such as partitioned Parquet on Hive, where existing data normally cannot be updated without completely replacing the existing file or partition.

This DataSet makes two key assumptions:

Schemas do not change during the pipeline run (defined PKs must be present for the duration of the pipeline).
Tables are not being externally modified during upserts. The upsert method is NOT ATOMIC with respect to external changes to the target table while executing.

The upsert methodology works by leveraging Spark DataFrame execution plan checkpointing.
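The row-level effect of an upsert keyed on primary-key columns can be sketched in plain Python. This is only an illustration of general upsert semantics under assumed names (`upsert_rows`, `Row`, and the sample rows are hypothetical); it is not the library's implementation, which operates on Spark DataFrames and relies on checkpointing.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]  # hypothetical stand-in for a DataFrame row


def upsert_rows(existing: List[Row], incoming: List[Row], pk: List[str]) -> List[Row]:
    """Merge incoming rows into existing rows, matching on the primary-key columns."""

    def key(row: Row) -> Tuple[object, ...]:
        return tuple(row[col] for col in pk)

    incoming_keys = {key(row) for row in incoming}
    # Keep only the existing rows that the incoming data does not supersede.
    retained = [row for row in existing if key(row) not in incoming_keys]
    # Incoming rows either replace a matching row or are added as new rows.
    return retained + incoming


existing = [{"name": "Alex", "age": 31}, {"name": "Bob", "age": 12}]
incoming = [{"name": "Bob", "age": 13}, {"name": "Clarke", "age": 65}]
merged = upsert_rows(existing, incoming, pk=["name"])
```

Here the row for `Bob` is replaced, `Alex` is kept, and `Clarke` is added. Because the merge reads the old rows before writing the result, a concurrent external write to the same table could be lost, which is why the non-atomicity assumption matters.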

Example usage for the YAML API:

hive_dataset:
  type: spark.SparkHiveDataSet
  database: hive_database
  table: table_name
  write_mode: overwrite

Example usage for the Python API:

>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.types import (StructField, StringType,
...                                IntegerType, StructType)
>>>
>>> from kedro.extras.datasets.spark import SparkHiveDataSet
>>>
>>> schema = StructType([StructField("name", StringType(), True),
...                      StructField("age", IntegerType(), True)])
>>>
>>> data = [('Alex', 31), ('Bob', 12), ('Clarke', 65), ('Dave', 29)]
>>>
>>> spark_df = SparkSession.builder.getOrCreate().createDataFrame(data, schema)
>>>
>>> data_set = SparkHiveDataSet(database="test_database", table="test_table",
...                             write_mode="overwrite")
>>> data_set.save(spark_df)
>>> reloaded = data_set.load()
>>>
>>> reloaded.take(4)

Method	`__getstate__`	Undocumented
Method	`__init__`	Creates a new instance of `SparkHiveDataSet`.
Constant	`DEFAULT_SAVE_ARGS`	Undocumented
Static Method	`_get_spark`	This method should only be used to get an existing SparkSession with valid Hive configuration. Configuration for Hive is read from hive-site.xml on the classpath. It supports running both SQL and HiveQL commands...
Method	`_create_hive_table`	Undocumented
Method	`_describe`	Undocumented
Method	`_exists`	Undocumented
Method	`_load`	Undocumented
Method	`_save`	Undocumented
Method	`_upsert_save`	Undocumented
Method	`_validate_save`	Undocumented
Instance Variable	`_database`	Undocumented
Instance Variable	`_eager_checkpoint`	Undocumented
Instance Variable	`_format`	Undocumented
Instance Variable	`_full_table_address`	Undocumented
Instance Variable	`_save_args`	Undocumented
Instance Variable	`_table`	Undocumented
Instance Variable	`_table_pk`	Undocumented
Instance Variable	`_write_mode`	Undocumented

Inherited from AbstractDataSet:

Class Method	`from_config`	Create a data set instance using the configuration provided.
Method	`__str__`	Undocumented
Method	`exists`	Checks whether a data set's output already exists by calling the provided _exists() method.
Method	`load`	Loads data by delegation to the provided load method.
Method	`release`	Release any cached data.
Method	`save`	Saves data by delegation to the provided save method.
Method	`_copy`	Undocumented
Method	`_release`	Undocumented
Property	`_logger`	Undocumented
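As a rough sketch of how the inherited `from_config` class method relates to the YAML example shown earlier, the same entry can be expressed as a Python dict. The helper name `build_from_config` is hypothetical, and the imports are deferred so the snippet can be loaded without kedro or pyspark; calling the helper assumes both are installed and a Hive-enabled Spark environment is available.

```python
from typing import Any, Dict

# Mirrors the YAML catalog entry; database and table names are placeholders.
HIVE_DATASET_CONFIG: Dict[str, Any] = {
    "type": "spark.SparkHiveDataSet",
    "database": "hive_database",
    "table": "table_name",
    "write_mode": "overwrite",
}


def build_from_config(name: str = "hive_dataset"):
    """Instantiate the data set from a config dict (needs kedro and pyspark)."""
    # Deferred import so that this snippet loads even where kedro is absent.
    from kedro.io import AbstractDataSet

    return AbstractDataSet.from_config(name, HIVE_DATASET_CONFIG)
```

The `type` key selects the data set class, and the remaining keys are passed to its constructor.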

def __getstate__(self):

Undocumented

def __init__(self, database: str, table: str, write_mode: str = 'errorifexists', table_pk: List[str] = None, save_args: Dict[str, Any] = None):

Creates a new instance of SparkHiveDataSet.

Note

For users leveraging the upsert functionality, a checkpoint directory must be set, e.g. using spark.sparkContext.setCheckpointDir("/path/to/dir") or directly in the Spark conf folder.

Parameters
database:`str`	The name of the hive database.
table:`str`	The name of the table within the database.
write_mode:`str`	`insert`, `upsert` or `overwrite` are supported. Note that the default in the signature above is `errorifexists`.
table_pk:`List[str]`	If performing an upsert, this identifies the primary key columns used to resolve preexisting data. Required when `write_mode="upsert"`.
save_args:`Dict[str, Any]`	Optional mapping of any options, passed to `DataFrameWriter.saveAsTable` as kwargs. A key example is `partitionBy`, which allows data partitioning on a list of column names. Other `HiveOptions` can be found here: https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#specifying-storage-format-for-hive-tables
Raises
`DataSetError`	Invalid configuration supplied
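The parameters above might be combined as in the following sketch. The database, table, and column names are hypothetical, as is the helper `make_dataset`; the import is deferred so the snippet loads without kedro or pyspark. For the upsert variant, a checkpoint directory must already be set on the Spark session before saving, as the note above describes.

```python
from typing import Any, Dict

# Upsert: table_pk names the primary-key columns used to match existing rows.
UPSERT_KWARGS: Dict[str, Any] = {
    "database": "test_database",
    "table": "test_table",
    "write_mode": "upsert",
    "table_pk": ["name"],  # required when write_mode="upsert"
}

# Overwrite with partitioning: save_args is forwarded to DataFrameWriter.saveAsTable.
PARTITIONED_KWARGS: Dict[str, Any] = {
    "database": "test_database",
    "table": "test_table",
    "write_mode": "overwrite",
    "save_args": {"partitionBy": ["age"]},
}


def make_dataset(kwargs: Dict[str, Any]):
    """Construct a SparkHiveDataSet from keyword arguments (needs kedro and pyspark)."""
    from kedro.extras.datasets.spark import SparkHiveDataSet

    return SparkHiveDataSet(**kwargs)
```

Calling `make_dataset(UPSERT_KWARGS)` would then give a data set whose `save` merges incoming rows by `name`, assuming the environment is configured for Hive.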

DEFAULT_SAVE_ARGS: Dict[str, Any] = {}

Undocumented

@staticmethod
def _get_spark() -> SparkSession:

This method should only be used to get an existing SparkSession with valid Hive configuration. Configuration for Hive is read from hive-site.xml on the classpath. It supports running both SQL and HiveQL commands. Additionally, users leveraging the upsert functionality must set a checkpoint directory, e.g. using spark.sparkContext.setCheckpointDir("/path/to/dir").
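Since this method expects a suitable session to exist already, one possible way to prepare it beforehand is sketched below. The helper name `get_hive_session` and the application name are assumptions made for illustration; the import is deferred so the snippet loads without pyspark installed.

```python
def get_hive_session(checkpoint_dir: str):
    """Create or reuse a Hive-enabled SparkSession and set its checkpoint directory."""
    # Deferred import so that this snippet loads even where pyspark is absent.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("kedro-hive-example")  # hypothetical application name
        .enableHiveSupport()            # Hive settings come from hive-site.xml
        .getOrCreate()
    )
    # The checkpoint directory is only needed for write_mode="upsert".
    spark.sparkContext.setCheckpointDir(checkpoint_dir)
    return spark
```

Creating the session once, early in the project (for example in a startup hook), means later `getOrCreate()` calls return this same configured session.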

def _create_hive_table(self, data: DataFrame, mode: str = None):

Undocumented

def _describe(self) -> Dict[str, Any]:

overrides kedro.io.AbstractDataSet._describe

Undocumented

def _exists(self) -> bool:

overrides kedro.io.AbstractDataSet._exists

Undocumented

def _load(self) -> DataFrame:

overrides kedro.io.AbstractDataSet._load

Undocumented

def _save(self, data: DataFrame):

overrides kedro.io.AbstractDataSet._save

Undocumented

def _upsert_save(self, data: DataFrame):

Undocumented

def _validate_save(self, data: DataFrame):

Undocumented

_database

Undocumented

_eager_checkpoint

Undocumented

_format

Undocumented

_full_table_address

Undocumented

_save_args

Undocumented

_table

Undocumented

_table_pk

Undocumented

_write_mode

Undocumented