lakehouse.etltransformer
- class lakehouse.etltransformer.ETLTransformer(spark: SparkSession, **options: Dict[str, Any])
Bases:
Interface
A generic class that encapsulates the transformation of data.
Use transform() to set the transform configuration and _transform() to execute the transformation process.
- Overwrite functions as required:
custom_transform(self, sdf: DataFrame, table: str) -> DataFrame: Can optionally be overwritten to add custom transformations; only executed if transform() has been called.
default_transform(self, sdf: DataFrame, table: str) -> DataFrame: Can be overwritten to change the default transformations, which are executed after the custom transformations. By default, a timestamp column with the current transformation timestamp is created. Only executed if transform(ignore_defaults=False).
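The intended override pattern can be sketched with a minimal stand-in. The real class operates on Spark DataFrames; plain dicts are used here so the control flow is self-contained. Method names follow the documentation, but the internal flow of _transform() (custom step first, then defaults unless ignore_defaults is set) is an assumption:

```python
from typing import Any, Dict


class ETLTransformerSketch:
    """Minimal stand-in mimicking the documented hook order (not the real class)."""

    def __init__(self, **options: Any) -> None:
        self.options: Dict[str, Any] = options
        self._configured = False
        self._ignore_defaults = False

    def transform(self, **options: Any) -> "ETLTransformerSketch":
        # Stores the transform configuration and returns self, as documented.
        self._configured = True
        self._ignore_defaults = options.get("ignore_defaults", False)
        return self

    def custom_transform(self, sdf: Dict[str, Any], table: str) -> Dict[str, Any]:
        # Hook: overwritten by subclasses to add custom transformations.
        return sdf

    def default_transform(self, sdf: Dict[str, Any], table: str) -> Dict[str, Any]:
        # Hook: default transformations, executed after the custom ones.
        return sdf

    def _transform(self, sdf: Dict[str, Any], table: str) -> Dict[str, Any]:
        # Assumed flow: custom first, then defaults unless ignore_defaults=True.
        if self._configured:
            sdf = self.custom_transform(sdf, table)
            if not self._ignore_defaults:
                sdf = self.default_transform(sdf, table)
        return sdf


class MyTransformer(ETLTransformerSketch):
    def custom_transform(self, sdf: Dict[str, Any], table: str) -> Dict[str, Any]:
        # Example custom step: tag each row-dict with its source table.
        return {**sdf, "source_table": table}


t = MyTransformer(catalog="dev", source_schema="raw", target_schema="clean")
result = t.transform()._transform({"id": 1}, table="orders")
```

In the real library, MyTransformer would subclass ETLTransformer and receive an actual SparkSession and DataFrames instead.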
- spark
Spark session provided to process the data
- Type:
SparkSession
- \*\*options
Keyword arguments; any options passed into the class
- Type:
Dict[str, Any]
- catalog
Name of the created catalog recognized by Spark, e.g. from the Hive Metastore or Unity Catalog
- Type:
str
- source_schema
Name of the source_schema
- Type:
str
- target_schema
Name of the target_schema
- Type:
str
- __init__(spark: SparkSession, **options: Dict[str, Any]) → None
Initializes the Transformer class with user-provided options.
- Parameters:
spark (SparkSession) – existing Spark Session
**options (Dict[str, Any]) – Keyword arguments; any options passed into the class
- Kwargs options:
catalog (str): Name of the created catalog recognized by Spark, e.g. from the Hive Metastore or Unity Catalog, required
source_schema (str): Name of the source_schema, required
target_schema (str): Name of the target_schema, required
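Since all three options are required, a natural way to read the contract is a validation step at construction time. The check below is a sketch of that contract, not the library's actual implementation (how ETLTransformer reacts to missing options is not specified in the docs):

```python
from typing import Any, Dict

# Required kwargs per the documented contract.
REQUIRED_OPTIONS = ("catalog", "source_schema", "target_schema")


def validate_options(options: Dict[str, Any]) -> Dict[str, Any]:
    # Raises if any required option is missing; returns the required subset.
    missing = [name for name in REQUIRED_OPTIONS if name not in options]
    if missing:
        raise ValueError(f"Missing required options: {missing}")
    return {name: options[name] for name in REQUIRED_OPTIONS}


opts = validate_options(
    {"catalog": "dev", "source_schema": "raw", "target_schema": "clean"}
)
```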
- custom_transform(sdf: DataFrame, table: str) → DataFrame
Function that can optionally be overwritten to add custom transformations; only executed if transform() has been called.
- Parameters:
sdf (DataFrame) – input DataFrame
table (str) – name of the table
- Returns:
transformed DataFrame
- default_transform(sdf: DataFrame, table: str) → DataFrame
Can be overwritten to change the default transformations, which are executed after the custom transformations. By default, a timestamp column with the current transformation timestamp is created. Only executed if transform(ignore_defaults=False).
- Parameters:
sdf (DataFrame) – input DataFrame
table (str) – name of the table
- Returns:
transformed DataFrame with internal transformations
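The documented default behavior (stamping each row with the transformation time) can be illustrated with a pure-Python sketch. On a real Spark DataFrame this would typically be something like `sdf.withColumn(..., pyspark.sql.functions.current_timestamp())`; the column name `_transformed_at` is a placeholder, since the docs do not specify it:

```python
from datetime import datetime, timezone
from typing import Any, Dict


def default_transform_sketch(sdf: Dict[str, Any], table: str) -> Dict[str, Any]:
    # Adds a transformation-timestamp field to a row-dict. "_transformed_at"
    # is a hypothetical name; the real column name is not documented.
    return {**sdf, "_transformed_at": datetime.now(timezone.utc)}


row = default_transform_sketch({"id": 1}, table="orders")
```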
- test1(sdf: DataFrame, table: str) → DataFrame
- transform(**options)
Function to set the transformer configuration.
- Parameters:
**options (Dict[str, Any]) – Kwargs, Any transform options provided
- Kwargs options:
ignore_defaults (bool): default: False. If True, skips the default transformations defined in the default_transform function; usually used during debugging.
tbl_transformations (Dict[str, str]): default: {}. Allows defining custom transformations per table by specifying the table name as key and the function name as value. For tables without an entry, custom_transform is used.
- Returns:
self
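One plausible reading of tbl_transformations, mapping table names to method names that are resolved on the transformer instance, can be sketched as follows. The getattr-based dispatch and the method names on `Demo` are assumptions for illustration; only the fallback to custom_transform is stated in the docs:

```python
from typing import Any, Callable, Dict


def resolve_transform(
    table: str,
    tbl_transformations: Dict[str, str],
    transformer: Any,
) -> Callable[..., Any]:
    # Looks up a per-table transformation by function name; falls back to
    # custom_transform for tables without an entry, as documented.
    name = tbl_transformations.get(table, "custom_transform")
    return getattr(transformer, name)


class Demo:
    # Hypothetical transformer with one table-specific method.
    def custom_transform(self, sdf: Any, table: str) -> Any:
        return ("custom", table)

    def orders_fix(self, sdf: Any, table: str) -> Any:
        return ("orders_fix", table)


d = Demo()
fn = resolve_transform("orders", {"orders": "orders_fix"}, d)
fallback = resolve_transform("users", {"orders": "orders_fix"}, d)
```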