lakehouse.etl

class lakehouse.etl.ETL(spark: SparkSession, **options: Dict[str, Any])

Bases: ETLLoader, ETLTransformer, ETLWriter, Interface

A generic class providing a framework for processing data from one layer to the next in a Medallion architecture.

Use the functions load(), transform() and write() to specify the configuration of each step; use execute() to run the defined steps.

Overwrite functions as required:

  • custom_load(self, table: str) -> DataFrame – Customizes how, or from where, the source data is loaded. Required if load(mode="custom"), otherwise ignored.

  • custom_filter(self, sdf: DataFrame, table: str) -> DataFrame – Filters the loaded DataFrame, making use of predicate pushdown. Required if load(filter="custom"), otherwise ignored.

  • custom_transform(self, sdf: DataFrame, table: str) -> DataFrame – Can optionally be overwritten to add custom transformations. Only executed if transform() is defined.

  • custom_filter(self, sdf: DataFrame, table: str) -> DataFrame – Can be overwritten to add default transformations executed after the custom transformations. The default creates a timestamp column with the current timestamp of the transformation. Only executed if transform(ignore_defaults=False).

  • get_replace_condition(self, sdf: DataFrame, table: str) -> str – Defines the filter used for the replaceWhere overwrite operation. Required if write(mode="replace").

  • get_delta_merge_builder(self, sdf: DataFrame, delta_table: DeltaTable) -> DeltaMergeBuilder – Defines the merge builder for the merge write into Delta. Required if write(mode="merge").

  • custom_write(self, sdf: DataFrame, table: str) -> None – Defines a custom write operation. Required if write(mode="custom").
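A minimal sketch of a subclass overriding two of these hooks (the class, column and condition names are illustrative assumptions, not part of the library):

   from pyspark.sql import DataFrame, functions as F

   from lakehouse.etl import ETL


   class SilverETL(ETL):
       """Hypothetical bronze-to-silver job."""

       def custom_transform(self, sdf: DataFrame, table: str) -> DataFrame:
           # Only executed if transform() is defined; "is_deleted" is an assumed column.
           return sdf.filter(~F.col("is_deleted"))

       def get_replace_condition(self, sdf: DataFrame, table: str) -> str:
           # Required because write(mode="replace") is used later; the condition itself is an assumption.
           return "load_date >= date_sub(current_date(), 1)"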

spark

Spark session provided to process the data

Type:

SparkSession

**options

Keyword arguments; any options provided to the class

Type:

Dict[str, Any]

catalog

Name of the catalog recognized by Spark, e.g. from the Hive Metastore or Unity Catalog

Type:

str

source_schema

Name of the source schema

Type:

str

target_schema

Name of the target schema

Type:

str

data

Intermediate DataFrame per table, built from the specified options before execute() is called

Type:

Dict[str, DataFrame]

__init__(spark: SparkSession, **options: Dict[str, Any]) -> None

Initializes the class with the user-provided options.

Parameters:
  • spark (SparkSession) – existing Spark Session

  • **options (Dict[str, Any]) – Keyword arguments; any options provided to the class

Kwargs options:

  • catalog (str): Name of the catalog recognized by Spark, e.g. from the Hive Metastore or Unity Catalog. Required.

  • source_schema (str): Name of the source schema. Required.

  • target_schema (str): Name of the target schema. Required.
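For example (catalog and schema names are placeholders):

   from pyspark.sql import SparkSession

   from lakehouse.etl import ETL

   spark = SparkSession.builder.getOrCreate()

   # Only the three required options listed above are passed here.
   etl = ETL(
       spark,
       catalog="dev_catalog",
       source_schema="bronze",
       target_schema="silver",
   )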

execute(*tables) -> None

Executes the ETL process via _execute_one() for one or multiple tables, as specified.

Parameters:

*tables – Names of the tables to process, passed as positional arguments
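Putting it together with the hypothetical SilverETL subclass sketched above (table names and option values are placeholders):

   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()

   etl = SilverETL(spark, catalog="dev_catalog", source_schema="bronze", target_schema="silver")
   etl.load()                  # assumption: the default load mode needs no options
   etl.transform()             # enables custom_transform() and the default transformations
   etl.write(mode="replace")   # uses get_replace_condition() from the subclass
   etl.execute("orders", "customers")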