The Python Connector allows users to validate the data quality of their PySpark DataFrames from Python.
```python
# Validate a DataFrame against a collection of Rules
validation_results = (
    RuleSet(df)
    .add(myRules)
    .validate()
)
```

Currently, the Python Connector supports the following Rule types:
- List of Values (Strings only)
- Boolean Check
- User-defined Functions (must evaluate to a Boolean)
Validate that a column expression evaluates to True.

```python
from pyspark.sql import functions as F

# Ensure that the temperature is a valid reading
valid_temp_rule = Rule("valid_temperature", F.col("temperature") > -100.0)
```
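Any expression that evaluates to a Boolean column works here. For example, a hypothetical rule on an illustrative `humidity` column (not part of the examples above):

```python
# Hypothetical: require humidity readings to fall within the valid percentage range
valid_humidity_rule = Rule("valid_humidity", F.col("humidity").between(0.0, 100.0))
```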
Validate that a Column only contains values present in a List of Strings.

```python
# Create a List of Strings (LOS)
building_sites = ["SiteA", "SiteB", "SiteC"]

# Build a Rule that validates that a column only contains values from the LOS
building_name_rule = Rule("Building_LOV_Rule",
                          column=F.col("site_name"),
                          valid_strings=building_sites)
```
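A rule built this way can be applied on its own using the same RuleSet flow shown above. A minimal sketch, assuming `df` contains a `site_name` column and that `RuleSet.add` also accepts a single Rule:

```python
# Minimal sketch: validate just the LOV rule against an existing DataFrame
site_results = RuleSet(df).add(building_name_rule).validate()
```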
UDFs are great when you need to add custom business logic for validating dataset quality. You can use User-defined Functions that return a Boolean value with the DataFrame Rules Engine.

```python
# Create a UDF to validate date entry
def valid_date_udf(ts_column):
    return ts_column.isNotNull() & F.year(ts_column).isNotNull() & F.month(ts_column).isNotNull()

# Create a Rule that uses the UDF to validate data
valid_date_rule = Rule("valid_date_reading", valid_date_udf(F.col("reading_date")))
```
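Putting it together, the three rules above can be validated in a single pass. A minimal end-to-end sketch, assuming an active SparkSession named `spark` and that `RuleSet.add` accepts a list of Rules (as the snippet at the top of this section suggests):

```python
from pyspark.sql import functions as F

# Illustrative input data matching the columns used by the rules above
readings = spark.createDataFrame(
    [("SiteA", 21.5, "2023-01-15"),
     ("SiteD", -273.0, None)],
    ["site_name", "temperature", "reading_date"],
).withColumn("reading_date", F.to_date("reading_date"))

# The second row should fail all three rules: an unknown site, an impossible
# temperature, and a null reading date
my_rules = [valid_temp_rule, building_name_rule, valid_date_rule]
validation_results = RuleSet(readings).add(my_rules).validate()
```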
A Python .whl file can be generated by navigating to the /python directory and executing the following command:

```sh
$ python3 -m build
```
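The build writes the wheel to the dist/ directory, from which it can be installed with pip (the exact filename varies by version):

```sh
# Install the freshly built wheel (filename varies by version)
$ pip install dist/*.whl
```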