SQLQueryDataSet loads data from a provided SQL query. It uses pandas.DataFrame internally, so it supports all allowed pandas options on read_sql_query. Since Pandas uses SQLAlchemy behind the scenes, when instantiating SQLQueryDataSet one needs to pass a compatible connection string either in credentials (see the example code snippet below) or in load_args. Connection string formats supported by SQLAlchemy can be found here: https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls
It does not support save method so it is a read only data set. To save data to a SQL server use SQLTableDataSet.
Example usage for the YAML API:
shuttle_id_dataset: type: pandas.SQLQueryDataSet sql: "select shuttle, shuttle_id from spaceflights.shuttles;" credentials: db_credentials
Advanced example using the stream_results and chunksize options to reduce memory usage:
shuttle_id_dataset: type: pandas.SQLQueryDataSet sql: "select shuttle, shuttle_id from spaceflights.shuttles;" credentials: db_credentials execution_options: stream_results: true load_args: chunksize: 1000
Sample database credentials entry in credentials.yml:
db_credentials: con: postgresql://scott:tiger@localhost/test
Example usage for the Python API:
>>> from kedro.extras.datasets.pandas import SQLQueryDataSet >>> import pandas as pd >>> >>> data = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], >>> "col3": [5, 6]}) >>> sql = "SELECT * FROM table_a" >>> credentials = { >>> "con": "postgresql://scott:tiger@localhost/test" >>> } >>> data_set = SQLQueryDataSet(sql=sql, >>> credentials=credentials) >>> >>> sql_data = data_set.load()
Class Method | create |
Given a connection string, create singleton connection to be used across all instances of SQLQueryDataSet that need to connect to the same source. |
Method | __init__ |
Creates a new SQLQueryDataSet. |
Class Variable | engines |
Undocumented |
Method | _describe |
Undocumented |
Method | _load |
Undocumented |
Method | _save |
Undocumented |
Instance Variable | _connection |
Undocumented |
Instance Variable | _execution |
Undocumented |
Instance Variable | _filepath |
Undocumented |
Instance Variable | _load |
Undocumented |
Inherited from AbstractDataSet
:
Class Method | from |
Create a data set instance using the configuration provided. |
Method | __str__ |
Undocumented |
Method | exists |
Checks whether a data set's output already exists by calling the provided _exists() method. |
Method | load |
Loads data by delegation to the provided load method. |
Method | release |
Release any cached data. |
Method | save |
Saves data by delegation to the provided save method. |
Method | _copy |
Undocumented |
Method | _exists |
Undocumented |
Method | _release |
Undocumented |
Property | _logger |
Undocumented |
Given a connection string, create singleton connection
to be used across all instances of SQLQueryDataSet
that
need to connect to the same source.
str
= None, credentials: Dict[ str, Any]
= None, load_args: Dict[ str, Any]
= None, fs_args: Dict[ str, Any]
= None, filepath: str
= None, execution_options: Optional[ Dict[ str, Any]]
= None):
(source)
¶
Creates a new SQLQueryDataSet.
Parameters | |
sql:str | The sql query statement. |
credentials:Dict[ | A dictionary with a SQLAlchemy connection string.
Users are supposed to provide the connection string 'con'
through credentials. It overwrites con parameter in
load_args and save_args in case it is provided. To find
all supported connection string formats, see here:
https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls |
loadDict[ | Provided to underlying pandas read_sql_query function along with the connection string. To find all supported arguments, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_query.html To find all supported connection string formats, see here: https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls |
fsDict[ | Extra arguments to pass into underlying filesystem class constructor
(e.g. {"project": "my-project"} for GCSFileSystem), as well as
to pass to the filesystem's open method through nested keys
open_args_load and open_args_save .
Here you can find all available arguments for open :
https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.open
All defaults are preserved, except mode , which is set to r when loading. |
filepath:str | A path to a file with a sql query statement. |
executionOptional[ | A dictionary with non-SQL advanced options for the connection to be applied to the underlying engine. To find all supported execution options, see here: https://docs.sqlalchemy.org/en/12/core/connections.html#sqlalchemy.engine.Connection.execution_options Note that this is not a standard argument supported by pandas API, but could be useful for handling large datasets. |
Raises | |
DataSetError | When either sql or con parameters is empty. |