lakehouse.etlloader

class lakehouse.etlloader.ETLLoader(spark: SparkSession, **options: Dict[str, Any])

Bases: Interface

A generic class integrating the loading of data.

Use load() to set the loader configuration. Use _load() to execute the loading process.

Overwrite functions as required:

  • custom_load(self, table: str) -> DataFrame – Customizes how, or from which source, the data is loaded. Required if load(mode="custom"), otherwise ignored.

  • custom_filter(self, sdf: DataFrame, table: str) -> DataFrame – Filters the loaded DataFrame so that predicate pushdown can be used. Required if load(filter="custom"), otherwise ignored.
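The override pattern described above might look roughly like the following sketch. It deliberately avoids Spark so that it runs standalone: SketchLoader is a hypothetical stand-in for ETLLoader (not the library's code), rows are plain dicts rather than a DataFrame, and the dispatch inside _load() is an assumption inferred from the descriptions of mode and filter.

```python
from typing import Any, Dict, List

Rows = List[Dict[str, Any]]  # stand-in for a Spark DataFrame


class SketchLoader:
    """Hypothetical stand-in for ETLLoader; the real class works on Spark DataFrames."""

    def __init__(self) -> None:
        self.mode = "default"
        self.filter = "all"

    def load(self, **options: Any) -> "SketchLoader":
        # Only stores the configuration; nothing is loaded yet.
        self.mode = options.get("mode", "default")
        self.filter = options.get("filter", "all")
        return self

    def custom_load(self, table: str) -> Rows:
        raise NotImplementedError("required if load(mode='custom')")

    def custom_filter(self, sdf: Rows, table: str) -> Rows:
        raise NotImplementedError("required if load(filter='custom')")

    def _default_load(self, table: str) -> Rows:
        # Assumption: the real default reads source_schema.table via Spark.
        return [{"id": 1, "active": True}, {"id": 2, "active": False}]

    def _load(self, table: str) -> Rows:
        # Assumed dispatch: the hooks are only called when configured.
        if self.mode == "custom":
            sdf = self.custom_load(table)
        else:
            sdf = self._default_load(table)
        if self.filter == "custom":
            sdf = self.custom_filter(sdf, table)
        return sdf


class ActiveOnlyLoader(SketchLoader):
    def custom_filter(self, sdf: Rows, table: str) -> Rows:
        # Drop unneeded rows early, before any later transformation.
        return [row for row in sdf if row["active"]]


rows = ActiveOnlyLoader().load(filter="custom")._load("customers")
all_rows = ActiveOnlyLoader().load()._load("customers")
```

With filter="custom" the overridden hook is applied; with the default filter="all" it is never called, so a subclass only needs to implement the hooks it actually enables.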

spark

Spark Session as provided to process the data

Type:

SparkSession

**options

Keyword arguments; any options passed to the class

Type:

Dict[str, Any]

catalog

Name of the catalog recognized by Spark, e.g. from Hive Metastore or Unity Catalog

Type:

str

source_schema

Name of the source_schema

Type:

str

target_schema

Name of the target_schema

Type:

str

__init__(spark: SparkSession, **options: Dict[str, Any]) -> None

Initializes the Loader class with user-provided options.

Parameters:
  • spark (SparkSession) – existing Spark Session

  • **options (Dict[str, Any]) – Keyword arguments; any options passed to the class

Kwargs options:

  • catalog (str) – Name of the catalog recognized by Spark, e.g. from Hive Metastore or Unity Catalog. Required.

  • source_schema (str) – Name of the source schema. Required.

  • target_schema (str) – Name of the target schema. Required.
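Since all three options are required, a constructor presumably has to check for them. The sketch below shows one way such a check could be written; check_required_options is a hypothetical helper, and raising ValueError is an assumption, as the documentation does not say how the class reacts to a missing option.

```python
from typing import Any, Dict

REQUIRED_OPTIONS = ("catalog", "source_schema", "target_schema")


def check_required_options(options: Dict[str, Any]) -> Dict[str, str]:
    """Return the three required options, or raise if any is missing."""
    missing = [name for name in REQUIRED_OPTIONS if name not in options]
    if missing:
        # Assumption: the real class may signal this differently.
        raise ValueError(f"missing required options: {', '.join(missing)}")
    return {name: str(options[name]) for name in REQUIRED_OPTIONS}


# Hypothetical values, for illustration only.
opts = check_required_options(
    {"catalog": "main", "source_schema": "bronze", "target_schema": "silver"}
)

try:
    check_required_options({"catalog": "main"})
    error = ""
except ValueError as exc:
    error = str(exc)
```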

custom_filter(sdf: DataFrame, table: str) -> DataFrame

Abstract function which can be overwritten to filter the loaded DataFrame and make use of predicate pushdown.

Filter out rows and columns that are not needed here, before applying any other transformations in transform().

Parameters:
  • sdf (DataFrame) – the loaded DataFrame to filter

  • table (str) – name of the table

Returns:

filtered DataFrame

custom_load(table: str) -> DataFrame

Abstract function to be overwritten to load data based on a custom implementation and return a DataFrame.

Parameters:

table (str) – name of the table

Returns:

loaded data as DataFrame

load(**options)

Function to set the loader configs.

Parameters:

**options (Dict[str, Any]) – Keyword arguments; any load options provided

Kwargs options:

  • mode (str) – "default" or "custom"; default: "default". Loading mode: "default" loads the table from source_schema.table, "custom" loads as defined in the custom_load function. In Bronze a custom_load function is always needed, meaning the default there is "custom".

  • filter (str) – "all" or "custom"; default: "all". With "custom", filters are applied directly to the loaded data for predicate pushdown; otherwise all data is loaded. In Bronze all data is always loaded based on the custom_load function.

  • source_tbl (str) – Name of the source table; default: None. If provided, the source table is used instead of the provided target table to load data.

Returns:

self
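The defaults and the source_tbl fallback described for load() can be illustrated with a small standalone sketch. The helpers resolve_load_options and table_to_read are hypothetical names, and rejecting values outside the documented choices is an assumption; the documentation only lists the allowed values and defaults.

```python
from typing import Any, Dict, Optional

ALLOWED = {"mode": ("default", "custom"), "filter": ("all", "custom")}
DEFAULTS: Dict[str, Optional[str]] = {
    "mode": "default",
    "filter": "all",
    "source_tbl": None,
}


def resolve_load_options(**options: Any) -> Dict[str, Optional[str]]:
    """Merge user options over the documented defaults and validate them."""
    resolved = {**DEFAULTS, **options}
    for key, allowed in ALLOWED.items():
        if resolved[key] not in allowed:
            # Assumption: the real load() may not validate at all.
            raise ValueError(f"{key} must be one of {allowed}, got {resolved[key]!r}")
    return resolved


def table_to_read(target_table: str, resolved: Dict[str, Optional[str]]) -> str:
    # source_tbl, when given, replaces the target table as the table to read.
    return resolved["source_tbl"] or target_table


defaults = resolve_load_options()
custom = resolve_load_options(mode="custom", source_tbl="orders_raw")
```

Because load() returns self, a call such as loader.load(mode="custom") can be chained directly with the next method on the same loader instance.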