A lightweight, simple data quality testing tool for CSV files and databases.
DQ Tester allows you to define data quality checks in SQL and organize them into reusable test catalogs. Execute tests against CSV files or databases, and monitor results through an interactive Streamlit dashboard.
Key Features:
- 📝 Simple YAML configuration for tests and catalogs
- 🗄️ Support for CSV files and databases (via ODBC)
- 📊 Built-in Streamlit dashboard for monitoring test results
- 🔍 Customizable SQL-based data quality checks
- 💾 Test results stored in DuckDB for analysis
- Python 3.9+
- ODBC driver for your database (currently tested with PostgreSQL)
```bash
pip install dq_tester
```
To use DQ Tester with Claude:
- Upload `prompt_custom_instructions.txt` as a project instruction file
- Upload the following example files to your project: `examples/catalog.yaml`, `examples/connections.yaml`, `examples/db_test_plan.yaml`, `examples/file_test_plan.yaml`
Claude will use these files to help you create and manage data quality tests.
DQ Tester can be used directly in Python for custom workflows and integrations:
```python
import sys

import dq_tester

results = dq_tester.run_tests('examples/catalog.yaml', 'examples/file_test_plan.yaml')
failed = [r for r in results if r['status'] == 'FAIL']
if failed:
    print(f"{len(failed)} tests failed")
    sys.exit(1)
print("All tests passed")
```
Create a `connections.yaml` file:
```yaml
connections:
  - name: sample_db
    driver: "PostgreSQL"
    server: "myserver"
    database: "demo_source"
    username: "user"
    password: "password"
  - name: results_db                 # Required: results stored here
    driver: "DuckDB"
    database: "./results/test_results.duckdb"
```
Note: The `results_db` connection is required for storing test results.
Create a catalog.yaml file defining reusable data quality checks:
```yaml
dq_checks:
  - name: null_values
    type: sql
    sql: |
      select count(1)
      from {table_name}
      where {column_name} is null
  - name: duplicate_key
    type: sql
    sql: |
      select count(1)
      from (
        select {key_cols}
        from {table_name}
        group by {key_cols}
        having count(1) > 1
      ) t1
  - name: invalid_email_duckdb
    type: sql
    sql: |
      select count(1)
      from {table_name}
      where not regexp_full_match(
        {column_name},
        '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{{2,}}$'
      )
```
Parameters in SQL: Use `{parameter_name}` in your SQL to define parameters. These must be provided in your test plan.
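The placeholders appear to follow Python `str.format` semantics: the doubled braces (`{{2,}}`) in the email regex are the standard escape for emitting a literal `{2,}` through formatting. A minimal sketch of that substitution, with illustrative values rather than the tool's internals:

```python
# Illustrative only: filling a catalog check's {parameter} placeholders
# with values from a test plan, assuming str.format-style substitution.
sql_template = (
    "select count(1)\n"
    "from {table_name}\n"
    "where {column_name} is null"
)
params = {"table_name": "customers", "column_name": '"Email"'}
rendered = sql_template.format(**params)
print(rendered)

# Doubled braces survive formatting as literals, which is why the email
# regex writes its quantifier as {{2,}}:
print("[A-Za-z]{{2,}}".format())  # -> [A-Za-z]{2,}
```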
Quick Start with `csv-to-yaml`:
DQ Tester can generate the YAML structure for your CSV file automatically:
```bash
dq_tester -a csv-to-yaml --csv-path examples/datasets/customers.csv
```
This outputs the column definitions, which you can copy into your test plan:
```yaml
has_header: true
delimiter: ','
columns:
  - name: Customer Id
    type: VARCHAR
  - name: First Name
    type: VARCHAR
  - name: Email
    type: VARCHAR
  # ... etc
```
Complete Test Plan:
Create a test plan (e.g., `file_test_plan.yaml`) by adding the generated output plus `source`, `file_name`, and `dq_tests`:
```yaml
source: csv
file_name: examples/datasets/customers.csv
has_header: true
delimiter: ','
columns:
  - name: Customer Id
    type: VARCHAR
  - name: First Name
    type: VARCHAR
  - name: Email
    type: VARCHAR
dq_tests:
  - dq_test_name: null_values
    column_name: '"Customer Id"'
    threshold:
      type: count
      operator: "=="
      value: 0
  - dq_test_name: invalid_email_duckdb
    column_name: '"Email"'
    threshold:
      type: count
      operator: "=="
      value: 0
  - dq_test_name: total_records
    threshold:
      type: count
      operator: ">"
      value: 1000
```
CSV Requirements:
- `file_name`: Path to the CSV file
- `has_header`: Whether the CSV has a header row
- `delimiter`: Column delimiter (usually `,`)
- `columns`: Column definitions (types are optional)
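For intuition, here is a standard-library sketch (not DQ Tester's code) of how `has_header` and `delimiter` drive parsing, mirroring what a `null_values` check would count on a quoted column name like `"Customer Id"`:

```python
# Stdlib sketch only: apply a plan's has_header/delimiter settings and
# count empty values in one column, as the null_values check would.
import csv
import io

csv_text = (
    "Customer Id,First Name,Email\n"
    "C001,Ann,a@example.com\n"
    ",Bob,b@example.com\n"      # missing Customer Id
)
plan = {"has_header": True, "delimiter": ",", "column": "Customer Id"}

rows = list(csv.reader(io.StringIO(csv_text), delimiter=plan["delimiter"]))
header, data = (rows[0], rows[1:]) if plan["has_header"] else (None, rows)
idx = header.index(plan["column"])

null_count = sum(1 for row in data if row[idx] == "")
print(null_count)  # -> 1
```

With a threshold of `count == 0`, this file would fail the check, since one record has an empty `Customer Id`.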
Create a test plan (e.g., `db_test_plan.yaml`):
```yaml
source: database
connection_name: sample_db
table_name: sales.salesorderdetail
dq_tests:
  - dq_test_name: null_values
    column_name: salesorderid
    threshold:
      type: count
      operator: "=="
      value: 0
  - dq_test_name: duplicate_key
    key_cols: salesorderdetailid
    threshold:
      type: count
      operator: "=="
      value: 0
```
Database Requirements:
- `connection_name`: Name from `connections.yaml`
- `table_name`: Fully qualified table name (e.g., `schema.table`)
Execute a test plan against your data:
```bash
dq_tester -a run \
  -c examples/catalog.yaml \
  -t examples/file_test_plan.yaml \
  --connections-path connections.yaml
```
Required Options:
- `-a run` or `--action run`: Action to execute
- `-c` or `--catalog-path`: Path to catalog YAML file
- `-t` or `--test-plan-path`: Path to test plan YAML file
Optional:
- `--connections-path`: Path to connections YAML file (defaults to searching standard locations)
Validate your catalog and test plan files without running tests:
```bash
dq_tester -a validate \
  -c examples/catalog.yaml \
  -t examples/file_test_plan.yaml
```
Required Options:
- `-a validate` or `--action validate`: Action to execute
- `-c` or `--catalog-path`: Path to catalog YAML file
- `-t` or `--test-plan-path`: Path to test plan YAML file
This checks for:
- Valid YAML syntax
- Required fields present
- Proper test configuration
- Parameter matching between catalog and test plan
Automatically generate the YAML structure for a CSV file:
```bash
dq_tester -a csv-to-yaml --csv-path examples/datasets/customers.csv
```
Required Options:
- `-a csv-to-yaml` or `--action csv-to-yaml`: Action to execute
- `--csv-path`: Path to the CSV file to analyze
This analyzes your CSV file and outputs the column definitions with inferred data types. Copy this output into your test plan to get started quickly.
Output example:
```yaml
has_header: true
delimiter: ','
columns:
  - name: Index
    type: BIGINT
  - name: Customer Id
    type: VARCHAR
  - name: Email
    type: VARCHAR
  # ... additional columns
```
Start the interactive Streamlit dashboard:
```bash
# Use default connections.yaml locations
dq_dashboard

# Specify connections file
dq_dashboard connections.yaml

# Use custom port
dq_dashboard --port 8080 connections.yaml
```
Options:
- `connections_file` (positional, optional): Path to connections YAML file
- `--port`: Port number for the dashboard (default: 8501)
Default connections.yaml locations:
If no connections file is specified, the dashboard searches in order:
1. `./connections.yaml` (current directory)
2. `~/.dq_tester/connections.yaml` (user home directory)
3. `/etc/dq_tester/connections.yaml` (system-wide)
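The documented lookup order can be sketched as follows (a plain illustration, not the tool's actual code; the function name is made up):

```python
# Sketch of the documented connections.yaml search order; an explicitly
# supplied path always wins over the fallback locations.
from pathlib import Path


def find_connections_file(explicit=None):
    if explicit:
        return Path(explicit)
    candidates = [
        Path("connections.yaml"),                          # current directory
        Path.home() / ".dq_tester" / "connections.yaml",   # user home
        Path("/etc/dq_tester/connections.yaml"),           # system-wide
    ]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None  # nothing found in any standard location


print(find_connections_file("my_connections.yaml"))
```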
```bash
# Show help
dq_tester -h

# Available actions
dq_tester -a {validate,run,csv-to-yaml}

# Full options
dq_tester [-h]
          [-c CATALOG_PATH]
          [-t TEST_PLAN_PATH]
          [--csv-path CSV_PATH]
          [-a {validate,run,csv-to-yaml}]
          [--connections-path CONNECTIONS_PATH]
```
Built-in checks for CSV files:
- `invalid_records`: Count of records that fail to parse
- `total_records`: Total number of records
- `expected_delimiter`: Validates the delimiter used
- `valid_header`: Validates header structure
Built-in check for database tables:
- `total_records`: Total number of records in the table
Note: Thresholds for built-in checks can be configured in your test plan.
Tests use thresholds to determine PASS/FAIL status:
- `count`: Compare absolute count values
- `pct`: Compare percentage values (0-100)
- `==`: Equal to
- `!=`: Not equal to
- `<`: Less than
- `<=`: Less than or equal to
- `>`: Greater than
- `>=`: Greater than or equal to
```yaml
# Count-based threshold
threshold:
  type: count
  operator: "=="
  value: 0

# Percentage-based threshold
threshold:
  type: pct
  operator: "<="
  value: 5   # 5% or less
```
Each test produces one of three statuses:
- PASS: The test result meets the threshold criteria (threshold comparison returns TRUE)
- FAIL: The test result does not meet the threshold criteria
- ERROR: The test execution failed (SQL error, connection issue, etc.)
Results are stored in the `results_db` DuckDB database and can be viewed in the dashboard.
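The threshold semantics above can be sketched in a few lines of Python. This is an illustration of the documented PASS/FAIL rule, not DQ Tester's implementation; in particular, the assumption that a `pct` threshold divides the check's count by the total record count is mine:

```python
# Illustrative threshold evaluation: PASS when the configured operator
# comparison holds, FAIL otherwise. The pct handling (count as a
# percentage of total records) is an assumption, not documented code.
import operator

OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def evaluate(measured_count, threshold, total_records=None):
    value = measured_count
    if threshold["type"] == "pct":
        value = 100.0 * measured_count / total_records
    ok = OPS[threshold["operator"]](value, threshold["value"])
    return "PASS" if ok else "FAIL"


print(evaluate(0, {"type": "count", "operator": "==", "value": 0}))   # PASS
print(evaluate(3, {"type": "pct", "operator": "<=", "value": 5},
               total_records=100))                                    # PASS
```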
```
my-dq-project/
├── connections.yaml
├── catalog.yaml
├── test_plans/
│   ├── customers_test_plan.yaml
│   └── orders_test_plan.yaml
└── results/
    └── test_results.duckdb
```
```yaml
dq_checks:
  - name: check_name       # Unique name for the check
    type: sql              # Currently only 'sql' is supported
    sql: |                 # SQL query template
      select count(1)
      from {table_name}
      where {parameter_name} condition
```

```yaml
# For CSV
source: csv | database
file_name: path/to/file.csv      # CSV only
has_header: true | false         # CSV only
delimiter: ','                   # CSV only
columns:                         # CSV only
  - name: column_name
    type: data_type              # Optional

# For Database
connection_name: name            # Database only
table_name: schema.table         # Database only

dq_tests:
  - dq_test_name: check_name     # References catalog
    parameter_name: value        # Match parameters in SQL
    threshold:
      type: count | pct
      operator: "==|!=|<|<=|>|>="
      value: number
```

```yaml
connections:
  - name: connection_name
    driver: "PostgreSQL"         # ODBC driver name
    server: "hostname"
    database: "database_name"
    username: "user"
    password: "password"
  - name: results_db             # Required for storing results
    driver: "DuckDB"
    database: "./path/to/results.duckdb"
```
Linux:
```bash
sudo apt-get install unixodbc unixodbc-dev odbc-postgresql
```
macOS:
```bash
brew install unixodbc psqlodbc
```
Windows: Download and install the PostgreSQL ODBC driver from the official website.
DQ Tester should work with any database that has an ODBC driver. Install the appropriate ODBC driver for your database and configure it in `connections.yaml`.
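For reference, the fields in `connections.yaml` correspond to the usual parts of an ODBC connection string. The sketch below shows that mapping with a hypothetical helper (`build_odbc_conn_str` is not part of DQ Tester, and the tool's actual key names and assembly may differ):

```python
# Hypothetical helper: map connections.yaml fields onto a standard ODBC
# connection string. A library such as pyodbc would accept the result:
#   pyodbc.connect(build_odbc_conn_str(cfg))
def build_odbc_conn_str(cfg):
    return (
        f"DRIVER={{{cfg['driver']}}};"
        f"SERVER={cfg['server']};"
        f"DATABASE={cfg['database']};"
        f"UID={cfg['username']};"
        f"PWD={cfg['password']}"
    )


cfg = {
    "driver": "PostgreSQL",
    "server": "myserver",
    "database": "demo_source",
    "username": "user",
    "password": "password",
}
print(build_odbc_conn_str(cfg))
# -> DRIVER={PostgreSQL};SERVER=myserver;DATABASE=demo_source;UID=user;PWD=password
```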
The Streamlit dashboard provides:
- 📊 Key Metrics: Total tests, pass rate, recent failures, columns tested
- 🎯 Status Distribution: Visual breakdown of PASS/FAIL/ERROR
- 📈 Trends Over Time: Historical test results
- 🔗 Connection Health: Test results by connection
- 📋 Object Analysis: Test results by tested object
- 🏷️ Column-Level Health: Identify problematic columns
- 🔍 Cascading Filters: Filter by connection → object → column
- 📥 CSV Export: Download filtered results
Complete examples are available in the `examples/` directory:
- `catalog.yaml`: Sample DQ check definitions
- `file_test_plan.yaml`: CSV testing example
- `db_test_plan.yaml`: Database testing example
- `connections.yaml`: Connection configuration template
This project is licensed under the MIT License - see the LICENSE file for details.
For questions, issues, or feature requests, please contact: chad@kodda.io