---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# datadiff
Data Validation with YAML Rules
## Overview
`datadiff` is an R package for data validation that compares datasets using configurable rules defined in YAML files.
It supports exact matching, tolerance-based numeric comparisons, text normalization, and row count validation.
## Installation
```r
remotes::install_github("thinkr-open/datadiff")
```
## Quick Start
```r
library(datadiff)
# Create reference and candidate datasets
reference <- data.frame(
  id = 1:3,
  amount = c(100.00, 200.00, 300.00),
  category = c("A", "B", "C")
)

candidate <- data.frame(
  id = 1:3,
  amount = c(100.01, 200.01, 300.001), # Small differences within tolerance
  category = c("a", "b", "c")          # Different case
)

# Generate a rules template
write_rules_template(
  reference,
  key = "id",
  path = "validation_rules.yaml",
  numeric_abs = 0.01,
  character_case_insensitive = TRUE
)

# Edit the rules file to configure validation (e.g., set tolerances, case sensitivity)

# Compare datasets
result <- compare_datasets_from_yaml(reference, candidate, path = "validation_rules.yaml")
print(result$reponse)
```
## Key Features
- **YAML Configuration**: Define validation rules in human-readable YAML files
- **Flexible Comparisons**: Support for exact matching and tolerance-based comparisons
- **Text Normalization**: Case-insensitive comparison and whitespace trimming
- **Type-Aware Validation**: Different rules for different data types
- **Comprehensive Reporting**: Detailed validation reports powered by pointblank
- **Row Count Validation**: Ensure datasets have the expected number of rows
## Dependencies
This package is built on top of the excellent [pointblank](https://github.com/rstudio/pointblank) package, which provides the robust data validation engine.
Pointblank handles all the heavy lifting for validation logic, error reporting, and test execution.
## Configuration
`write_rules_template()` generates a starting YAML. You then edit `by_name` to override any rule for a specific column.
### Full example with column-level overrides (`by_name`)
```yaml
version: 1
defaults:
  na_equal: yes
ignore_columns:
  - documentation
keys: id
label: reference vs candidate
row_validation:
  check_count: yes
  expected_count: ~     # null = use reference row count
  tolerance: 0
by_type:
  numeric:
    abs: 1.0e-09        # near-exact by default for all numeric columns
    rel: 0
  integer:
    abs: 0
  character:
    equal_mode: exact
    case_insensitive: no
    trim: no
  date:
    equal_mode: exact
  datetime:
    equal_mode: exact
  logical:
    equal_mode: exact
by_name:
  id: []                # no override — inherits integer rule (abs: 0)
  amount:
    abs: 0.01           # override: accept differences up to 0.01 for this column
    rel: 0
  unit_price:
    abs: 0.001
    rel: 0.0001         # mixed mode: 0.001 + 0.01% of reference value
  category:
    case_insensitive: yes  # override: ignore case for this column only
    trim: yes
  created_at:
    equal_mode: exact   # redundant here, but explicit for documentation
```
**How `by_name` interacts with `by_type`**: rules are merged, with `by_name` taking precedence. A column not listed in `by_name` uses its `by_type` defaults unchanged.
| Column | Effective rule | Source |
|---|---|---|
| `id` | `abs: 0` | `by_type.integer` |
| `amount` | `abs: 0.01, rel: 0` | `by_name.amount` overrides `by_type.numeric` |
| `unit_price` | `abs: 0.001, rel: 0.0001` | `by_name.unit_price` overrides `by_type.numeric` |
| `category` | `case_insensitive: yes, trim: yes` | `by_name.category` overrides `by_type.character` |
| `created_at` | `equal_mode: exact` | `by_type.date` (by_name is redundant here) |
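The precedence rule can be sketched in plain R with `utils::modifyList()`. This is an illustration of the merge semantics only, not the package's actual internals (`resolve_rule` and `%||%` are made-up names for this sketch):

```r
# Sketch: by_name entries override by_type defaults key-by-key
# (illustrative only; not datadiff's implementation)
`%||%` <- function(x, y) if (is.null(x)) y else x

resolve_rule <- function(column, type, by_type, by_name) {
  base     <- by_type[[type]]   %||% list()
  override <- by_name[[column]] %||% list()
  utils::modifyList(base, override)  # override wins on shared keys
}

by_type <- list(numeric = list(abs = 1e-9, rel = 0))
by_name <- list(amount  = list(abs = 0.01))

resolve_rule("amount", "numeric", by_type, by_name)
# abs = 0.01 (from by_name), rel = 0 (inherited from by_type)
```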
### Numeric tolerance: formula, edge effects, and best practices
#### Threshold formula
For numeric columns, validation relies on a single **combined threshold**:
```
threshold = abs + rel × |reference_value|
result = OK if |candidate - reference| ≤ threshold
```
The two parameters **add up**—they are not two independent guards.
| Parameter | Role | Default value |
|-----------|------|---------------|
| `abs` | Absolute tolerance: fixed floor, independent of the magnitude of values | `1e-9` |
| `rel` | Relative tolerance: fraction of the reference value added to the threshold | `0` |
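In R, the combined check is a single vectorised comparison. The function below is a direct transcription of the formula above for illustration, not datadiff's internal code:

```r
# Combined tolerance check: threshold = abs_tol + rel_tol * |reference|
within_tolerance <- function(reference, candidate, abs_tol = 1e-9, rel_tol = 0) {
  abs(candidate - reference) <= abs_tol + rel_tol * abs(reference)
}

within_tolerance(100, 100.005, abs_tol = 0.01)     # TRUE:  diff 0.005 <= 0.01
within_tolerance(100, 100.02,  abs_tol = 0.01)     # FALSE: diff 0.02  >  0.01
within_tolerance(1e6, 1e6 + 5000, rel_tol = 0.01)  # TRUE:  diff 5000 <= 10000
```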
#### Pure absolute mode (recommended by default)
With `rel: 0`, the threshold is **constant** regardless of the magnitude of the values being compared:
```yaml
by_type:
  numeric:
    abs: 0.01 # fixed threshold: any difference > 0.01 is detected
    rel: 0
```
| Reference | Candidate | Difference | Threshold | Result |
|-----------|----------:|-----------:|----------:|--------|
| 1.00 | 1.005 | 0.005 | 0.01 | OK |
| 1 000 000.00 | 1 000 000.005 | 0.005 | 0.01 | OK |
| 0.000001 | 0.000001 + 0.005 | 0.005 | 0.01 | OK |
| 1.00 | 1.02 | 0.02 | 0.01 | **ERROR** |
| 1 000 000.00 | 1 000 000.02 | 0.02 | 0.01 | **ERROR** |
The same difference of `0.02` is detected whether you compare thousandths or millions.
#### Pure relative mode
With `abs: 0`, the threshold is **proportional** to the reference value:
```yaml
by_type:
  numeric:
    abs: 0    # no fixed floor
    rel: 0.01 # tolerance of 1% of the reference value
```
| Reference | Candidate | Difference | Threshold (1%) | Result |
|-----------|----------:|-----------:|---------------:|--------|
| 100.00 | 100.50 | 0.50 | 1.00 | OK |
| 1 000 000.00 | 1 005 000.00 | 5 000 | 10 000 | OK |
| 0.00 | 0.001 | 0.001 | **0** | **ERROR** (relative threshold is zero) |
> **Warning**: if the reference value is `0`, the relative threshold is `0`—any difference, however tiny, is flagged as an error.
> This is why `abs` acts as a safety floor.
#### Mixed mode: when and how to use it
Combining both parameters is useful when values can be close to zero **and** very large, and you want a tolerance that adapts in both cases:
```yaml
by_type:
  numeric:
    abs: 0.001 # floor: protects the ref ≈ 0 case
    rel: 0.01  # +1% for large values
```
For `ref = 1 000 000`: `threshold = 0.001 + 0.01 × 1 000 000 = 10 000.001`
> **Pitfall**: with `rel > 0` and large reference values, the threshold can become much wider than you intuitively expect.
> For example, `rel: 1e-9` on a value of `12 000 000` yields a threshold of `≈ 0.012`, so a difference of `0.000565` would pass undetected, even with `abs: 1e-9`.
>
> **Rule of thumb**: use `rel: 0` (the default) unless you explicitly need a tolerance proportional to the magnitude of the data.
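The pitfall can be checked numerically. The helper below is again a plain transcription of the threshold formula, not package code:

```r
# Transcription of: threshold = abs_tol + rel_tol * |reference|
within_tolerance <- function(reference, candidate, abs_tol = 1e-9, rel_tol = 0) {
  abs(candidate - reference) <= abs_tol + rel_tol * abs(reference)
}

ref <- 12e6  # 12 million: rel of 1e-9 widens the threshold to ~0.012

within_tolerance(ref, ref + 0.000565, abs_tol = 1e-9, rel_tol = 1e-9)
# TRUE — the 0.000565 difference slips through the widened threshold

within_tolerance(ref, ref + 0.000565, abs_tol = 1e-9, rel_tol = 0)
# FALSE — with rel: 0 the same difference is caught
```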
### Character comparison options
For character columns, you can configure text normalization and comparison behavior:
- **equal_mode**: Comparison mode (`"exact"` or `"normalized"`)
- **case_insensitive**: Whether to ignore case differences (`true`/`false`)
- **trim**: Whether to trim whitespace before comparison (`true`/`false`)
**Example:**
```yaml
by_type:
  character:
    equal_mode: normalized # Apply normalization before comparison
    case_insensitive: true # Ignore case differences
    trim: true             # Remove leading/trailing whitespace
```
For reference value `"Hello World"`:
- Candidate `"hello world"` **passes** with `case_insensitive: true`
- Candidate `" Hello World "` **passes** with `trim: true`
- Candidate `"HELLO WORLD"` **passes** with both options enabled
- Candidate `"Hello Universe"` **fails** regardless of normalization
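These semantics can be mimicked with base R (`trimws()` + `tolower()`). The snippet is a sketch of the comparison behavior, not datadiff's internals:

```r
# Sketch: normalize-then-compare for character columns
normalize_chr <- function(x, case_insensitive = FALSE, trim = FALSE) {
  if (trim) x <- trimws(x)              # drop leading/trailing whitespace
  if (case_insensitive) x <- tolower(x) # fold case
  x
}

chr_equal <- function(ref, cand, ...) {
  normalize_chr(ref, ...) == normalize_chr(cand, ...)
}

chr_equal("Hello World", "hello world", case_insensitive = TRUE)  # TRUE
chr_equal("Hello World", "  Hello World  ", trim = TRUE)          # TRUE
chr_equal("Hello World", "Hello Universe",
          case_insensitive = TRUE, trim = TRUE)                   # FALSE
```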
### Row count validation
You can validate that datasets have an expected number of rows using the `row_validation` section:
- **check_count**: Whether to enable row count validation (`true`/`false`)
- **expected_count**: Expected number of rows (if `null`, uses the reference dataset row count)
- **tolerance**: Allowed deviation from the expected count
**Example:**
```yaml
row_validation:
  check_count: true
  expected_count: 1000
  tolerance: 50
```
This validates that the candidate dataset has between 950 and 1050 rows (1000 ± 50).
If `expected_count` is not specified, the reference dataset's row count is used as the expected value.
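Expressed as a plain predicate (an illustration of the rule, not package code):

```r
# Row-count rule: candidate must be within expected_count ± tolerance
row_count_ok <- function(candidate_n, expected_n, tolerance = 0) {
  abs(candidate_n - expected_n) <= tolerance
}

row_count_ok(950,  expected_n = 1000, tolerance = 50)  # TRUE  (lower bound)
row_count_ok(1050, expected_n = 1000, tolerance = 50)  # TRUE  (upper bound)
row_count_ok(1051, expected_n = 1000, tolerance = 50)  # FALSE (just outside)
```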
## Comparing Parquet files (large datasets)
`datadiff` can compare Parquet files that are too large to fit in RAM.
The recommended approach uses `arrow::open_dataset()` — **do not call
`arrow::to_duckdb()` yourself** before passing to `datadiff`; the package
handles the Arrow → DuckDB conversion internally with a single connection.
### Recommended strategy: Arrow Dataset (lazy, out-of-core)
```r
library(datadiff)
library(arrow)
ds_ref <- arrow::open_dataset("path/to/reference/")
ds_cand <- arrow::open_dataset("path/to/candidate/")
# Generate a rules template from the schema (no data loaded)
write_rules_template(ds_ref, key = "ID", path = "rules.yaml")
# Compare — stays lazy until the final boolean slim table
result <- compare_datasets_from_yaml(
  data_reference = ds_ref,
  data_candidate = ds_cand,
  key = "ID",
  path = "rules.yaml"
)
```
Internally, `compare_datasets_from_yaml()`:
1. Opens a private DuckDB connection (`fresh_con`).
2. Materialises each Parquet dataset as a DuckDB physical temp table via
`read_parquet()` — all memory is managed by DuckDB's buffer pool, so disk
spilling works correctly.
3. Runs the full join + 125 boolean expressions as a single lazy SQL query.
4. `dplyr::compute()` materialises only the slim boolean result table
(~125 logical columns × N rows) — the wide source data is never loaded
into R.
5. `dplyr::collect()` brings the slim boolean table (~few GB) into R and
passes a plain `data.frame` to pointblank — no live DuckDB connection
needed during interrogation.
6. Disconnects and destroys `fresh_con` on exit.
### Memory tuning: `duckdb_memory_limit`
DuckDB's default memory cap (80 % of total RAM) can leave insufficient
headroom when R, Arrow, and the OS are already using significant memory.
The `duckdb_memory_limit` parameter controls how much RAM DuckDB may use
before spilling intermediate results to `tempdir()`:
```r
result <- compare_datasets_from_yaml(
  data_reference = ds_ref,
  data_candidate = ds_cand,
  key = "ID",
  path = "rules.yaml",
  duckdb_memory_limit = "8GB" # default — safe for machines with >= 16 GB
)
```
| Machine RAM | Recommended `duckdb_memory_limit` | Notes |
|-------------|-----------------------------------|-------|
| 8 GB | `"3GB"` | Leaves room for R + OS |
| 16 GB | `"6GB"` | Balanced |
| 32 GB | `"8GB"` (default) | Spills when needed |
| 64 GB+ | `"20GB"` | Reduces spilling, faster |
Increasing the limit reduces disk I/O and speeds up the comparison; decreasing
it protects against OOM on memory-constrained machines. The limit only applies
when Arrow datasets are used — it has no effect for `data.frame` or `tbl_lazy`
inputs.
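Under the hood this corresponds to DuckDB's `memory_limit` setting. If you manage your own DuckDB connection outside datadiff, the equivalent is a `SET` statement via the standard DBI/duckdb APIs — a sketch of the underlying mechanism, not a required step when using datadiff:

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())

# What duckdb_memory_limit controls: cap DuckDB's buffer pool,
# spilling intermediates to disk beyond this point
dbExecute(con, "SET memory_limit = '3GB'")

# Where spilled intermediates go (datadiff uses tempdir())
dbExecute(con, sprintf("SET temp_directory = '%s'", tempdir()))

dbGetQuery(con, "SELECT current_setting('memory_limit') AS memory_limit")

dbDisconnect(con, shutdown = TRUE)
```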
### Strategy comparison
| Strategy | Input type | RAM usage | Requires |
|----------|-----------|-----------|---------|
| **Arrow Dataset** ✅ recommended | `arrow::open_dataset()` | Slim boolean table only (~few GB) | arrow, duckdb |
| Arrow Table | `arrow::read_parquet(as_data_frame=FALSE)` | Same as Arrow Dataset | arrow, duckdb |
| Lazy table (dbplyr) | `tbl(con, "table_name")` | Slim boolean table only | DBI, dbplyr |
| `data.frame` | `read.csv()`, `readr::read_csv()`, etc. | Full data in RAM | — |
### What NOT to do
```r
# ❌ DO NOT convert to DuckDB yourself before passing to datadiff
ds_ref_duckdb <- arrow::to_duckdb(ds_ref) # creates Arrow's internal connection
ds_cand_duckdb <- arrow::to_duckdb(ds_cand) # different connection!
result <- compare_datasets_from_yaml(ds_ref_duckdb, ds_cand_duckdb, ...)
# → cross-connection join fails in pointblank
# ❌ DO NOT collect before passing
ref_df <- dplyr::collect(ds_ref) # loads full 4 GB into R RAM
```
### Disk spilling location
DuckDB spills to `tempdir()` when the memory limit is reached. On Windows
this is typically `C:\Users\<user>\AppData\Local\Temp`. Ensure that directory
has sufficient free disk space (up to ~2–3× the size of your Parquet files
in the worst case).
## Main Functions
- `write_rules_template()`: Generate a YAML rules template
- `read_rules()`: Load and validate rules from YAML
- `compare_datasets_from_yaml()`: Main validation function
- `detect_column_types()`: Infer column data types
- `preprocess_dataframe()`: Apply text normalization
- `analyze_columns()`: Compare column structures
## Testing
Run the test suite:
```r
devtools::test()
```
## Sponsor
The development of this package has been sponsored by:
<a href="https://www.atih.sante.fr/"><img src="man/figures/atih.png" alt="ATIH"></a>
## License
MIT License
## Development
This `README` was last compiled on:
```{r}
Sys.time()
```
Here are the test & coverage results:
```{r error = TRUE}
devtools::check(quiet = TRUE)
```
```{r echo = FALSE}
unloadNamespace("datadiff")
```
```{r error = TRUE}
Sys.setenv("NOT_CRAN" = TRUE)
covr::package_coverage()
```