The Python Connector allows users to validate the data quality of their PySpark DataFrames from Python.
```python
# Validate a DataFrame against a collection of Rules
validation_results = (
    RuleSet(df)
    .add(myRules)
    .validate()
)
```

Currently, the Python Connector supports the following Rule types:
- List of Values (Strings only)
- Boolean Check
- User-defined Functions (must evaluate to a Boolean)
Validate that a column expression evaluates to True.

```python
from pyspark.sql import functions as F

# Ensure that the temperature is a valid reading
valid_temp_rule = Rule("valid_temperature", F.col("temperature") > -100.0)
```
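Any expression that evaluates to a Boolean column works here. For example, a hypothetical rule on an illustrative `humidity` column (not part of the examples above):

```python
# Hypothetical: require humidity readings to fall within the valid percentage range
valid_humidity_rule = Rule("valid_humidity", F.col("humidity").between(0.0, 100.0))
```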
Validate that a Column only contains values present in a List of Strings.

```python
# Create a List of Strings (LOS)
building_sites = ["SiteA", "SiteB", "SiteC"]

# Build a Rule that validates that a column only contains values from the LOS
building_name_rule = Rule("Building_LOV_Rule",
                          column=F.col("site_name"),
                          valid_strings=building_sites)
```
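A rule built this way can be applied on its own using the same RuleSet flow shown above. A minimal sketch, assuming `df` contains a `site_name` column and that `RuleSet.add` also accepts a single Rule:

```python
# Minimal sketch: validate just the LOV rule against an existing DataFrame
site_results = RuleSet(df).add(building_name_rule).validate()
```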
UDFs are great when you need to add custom business logic for validating dataset quality. You can use User-defined Functions that return a Boolean value with the DataFrame Rules Engine.

```python
# Create a UDF to validate date entry
def valid_date_udf(ts_column):
    return ts_column.isNotNull() & F.year(ts_column).isNotNull() & F.month(ts_column).isNotNull()

# Create a Rule that uses the UDF to validate data
valid_date_rule = Rule("valid_date_reading", valid_date_udf(F.col("reading_date")))
```
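Putting it together, the three rules above can be validated in a single pass. A minimal end-to-end sketch, assuming an active SparkSession named `spark` and that `RuleSet.add` accepts a list of Rules (as the snippet at the top of this section suggests):

```python
from pyspark.sql import functions as F

# Illustrative input data matching the columns used by the rules above
readings = spark.createDataFrame(
    [("SiteA", 21.5, "2023-01-15"),
     ("SiteD", -273.0, None)],
    ["site_name", "temperature", "reading_date"],
).withColumn("reading_date", F.to_date("reading_date"))

# The second row should fail all three rules: an unknown site, an impossible
# temperature, and a null reading date
my_rules = [valid_temp_rule, building_name_rule, valid_date_rule]
validation_results = RuleSet(readings).add(my_rules).validate()
```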
A Python .whl file can be generated by navigating to the /python directory and executing the following command:

```sh
$ python3 -m build
```
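The build writes the wheel to the dist/ directory, from which it can be installed with pip (the exact filename varies by version):

```sh
# Install the freshly built wheel (filename varies by version)
$ pip install dist/*.whl
```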