lakehouse.etlloader
- class lakehouse.etlloader.ETLLoader(spark: SparkSession, **options: Dict[str, Any])
Bases:
Interface
A generic class integrating the loading of data.
Use load() to set the loader configuration. Use _load() to execute the loading process.
- Override these functions as required:
custom_load(self, table: str) -> DataFrame: Customizes how the source data is loaded. Required if load(mode="custom"), otherwise ignored.
custom_filter(self, sdf: DataFrame, table: str) -> DataFrame: Filters the loaded DataFrame so that predicate pushdown can be used. Required if load(filter="custom"), otherwise ignored.
- spark
Spark session used to process the data
- Type:
SparkSession
- **options
Keyword arguments: any options passed to the class
- Type:
Dict[str, Any]
- catalog
Name of the catalog as recognized by Spark, e.g. from Hive Metastore or Unity Catalog
- Type:
str
- source_schema
Name of the source_schema
- Type:
str
- target_schema
Name of the target_schema
- Type:
str
- __init__(spark: SparkSession, **options: Dict[str, Any]) -> None
Initializes the Loader class with user-provided options.
- Parameters:
spark (SparkSession) – existing Spark Session
**options (Dict[str, Any]) – Keyword arguments: any options passed to the class
- Kwargs options:
catalog (str): Name of the catalog as recognized by Spark, e.g. from Hive Metastore or Unity Catalog, required
source_schema (str): Name of the source schema, required
target_schema (str): Name of the target schema, required
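The three options above are marked as required, but the text does not say what happens when one is missing. The following is a minimal sketch of a caller-side check, assuming a hypothetical helper `build_loader_options` (not part of the library) and placeholder catalog and schema names:

```python
from typing import Any, Dict

# The three option names documented as required for ETLLoader.__init__.
REQUIRED_OPTIONS = ("catalog", "source_schema", "target_schema")


def build_loader_options(**options: Any) -> Dict[str, Any]:
    """Hypothetical caller-side helper: check required options before use."""
    missing = [name for name in REQUIRED_OPTIONS if not options.get(name)]
    if missing:
        raise ValueError(f"missing required loader options: {', '.join(missing)}")
    return dict(options)


options = build_loader_options(
    catalog="main",           # placeholder catalog name
    source_schema="bronze",   # placeholder schema names
    target_schema="silver",
)
# With a live SparkSession named `spark`, the documented call would then be:
# loader = ETLLoader(spark, **options)
```

Checking on the caller side keeps the error message close to the configuration code, regardless of how the loader itself handles missing options.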
- custom_filter(sdf: DataFrame, table: str) -> DataFrame
Abstract function that can be overridden to filter the loaded DataFrame and make use of predicate pushdown.
Filter out rows and columns that are not needed here, before any other transformations are applied in transform().
- Parameters:
sdf (DataFrame) – the loaded DataFrame
table (str) – name of the table
- Returns:
filtered DataFrame
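A possible override might look like the sketch below. It assumes a hypothetical `orders` table with made-up column names, and it is written as a standalone class so that it runs without the library installed; in real use the class would subclass ETLLoader. Only `DataFrame.filter` (with a SQL expression string) and `DataFrame.select` from PySpark are used.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints; avoids importing Spark at runtime
    from pyspark.sql import DataFrame


class OrdersLoaderSketch:  # in real use: class OrdersLoader(ETLLoader)
    """Hypothetical override; table and column names are made up."""

    def custom_filter(self, sdf: "DataFrame", table: str) -> "DataFrame":
        if table == "orders":
            # Filter rows first, then prune columns, so Spark can push both
            # down to the source where the source format supports it.
            return sdf.filter("order_date >= '2024-01-01'").select(
                "order_id", "customer_id", "order_date"
            )
        # Any other table passes through unchanged.
        return sdf
```

According to the load() options, this override only takes effect when the loader is configured with filter="custom".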
- custom_load(table: str) -> DataFrame
Abstract function to be overridden to load data based on a custom implementation and return a DataFrame.
- Parameters:
table (str) – name of the table
- Returns:
loaded data as DataFrame
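As an illustration, a custom load could read each table from a Parquet folder. This is only a sketch: the `landing_root` argument and the folder layout are assumptions, and the class is standalone so that it runs without the library; in real use it would subclass ETLLoader and rely on the `spark` attribute documented above.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints; avoids importing Spark at runtime
    from pyspark.sql import DataFrame, SparkSession


class LandingZoneLoaderSketch:  # in real use: subclass ETLLoader
    """Hypothetical custom_load reading one Parquet folder per table."""

    def __init__(self, spark: "SparkSession", landing_root: str) -> None:
        self.spark = spark
        # Hypothetical setting, not part of the documented options.
        self.landing_root = landing_root

    def custom_load(self, table: str) -> "DataFrame":
        # Build <landing_root>/<table> without a doubled slash.
        path = f"{self.landing_root.rstrip('/')}/{table}"
        return self.spark.read.format("parquet").load(path)
```

According to the load() options, this override only takes effect when the loader is configured with mode="custom".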
- load(**options)
Function to set the loader configs.
- Parameters:
**options (Dict[str, Any]) – Keyword arguments: any load options provided
- Kwargs options:
mode (str): "default" or "custom", default: "default". Loading mode: either the table is loaded from source_schema.table ("default") or as defined in the custom_load function ("custom"). In Bronze, a custom_load function is always needed, so the default there is "custom".
filter (str): "all" or "custom", default: "all". With "custom", filters are applied directly to the loaded data for predicate pushdown; with "all", all data is loaded. In Bronze, all data is always loaded based on the custom_load function.
source_tbl (str): Name of the source table, default: None. If provided, the source table is used instead of the target table to load data.
- Returns:
self
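Because load() returns self, configuration calls can be chained. The stand-in below mirrors only the documented option names, their defaults, and the return value; it is not the library's code. Rejecting unknown values and the `source_table_for` helper are assumptions added for illustration.

```python
from typing import Any, Dict, Optional


class LoadConfigSketch:
    """Stand-in that mirrors the documented load() options; not the library's code."""

    MODES = ("default", "custom")
    FILTERS = ("all", "custom")

    def __init__(self) -> None:
        self.load_options: Dict[str, Any] = {}

    def load(
        self,
        mode: str = "default",
        filter: str = "all",  # documented option name; shadows the builtin on purpose
        source_tbl: Optional[str] = None,
    ) -> "LoadConfigSketch":
        # Rejecting unknown values is an assumption; the source does not say
        # how the real loader reacts to them.
        if mode not in self.MODES:
            raise ValueError(f"mode must be one of {self.MODES}, got {mode!r}")
        if filter not in self.FILTERS:
            raise ValueError(f"filter must be one of {self.FILTERS}, got {filter!r}")
        self.load_options = {"mode": mode, "filter": filter, "source_tbl": source_tbl}
        return self  # returning self is what allows call chaining

    def source_table_for(self, target_table: str) -> str:
        """Hypothetical helper: source_tbl wins over the target table name."""
        return self.load_options.get("source_tbl") or target_table


cfg = LoadConfigSketch().load(mode="custom", filter="custom", source_tbl="orders_raw")
defaults = LoadConfigSketch().load()
```

Here `cfg.source_table_for("orders")` resolves to the explicit source table, while `defaults.source_table_for("orders")` falls back to the target table name.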