From a7c224eb5d79de345c81edf22cafa9f15c96459f Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Wed, 20 May 2026 09:10:04 -0400 Subject: [PATCH 1/8] Start v04 documentation sprint --- DESCRIPTION | 2 +- README.md | 50 ++++++- _pkgdown.yml | 1 + development/v04-documentation-sprint.md | 57 +++++++ vignettes/getting-started-v04.qmd | 189 ++++++++++++++++++++++++ 5 files changed, 292 insertions(+), 7 deletions(-) create mode 100644 development/v04-documentation-sprint.md create mode 100644 vignettes/getting-started-v04.qmd diff --git a/DESCRIPTION b/DESCRIPTION index dbb0f20..b10b0fc 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,6 +1,6 @@ Package: MockData Title: Generate Mock Data from Metadata Specifications -Version: 0.3.0 +Version: 0.4.0.9000 Authors@R: c( person("Juan", "Li", role = "aut", email = "juli@ohri.ca"), person("Douglas", "Manuel", role = c("aut", "cre"), email = "dmanuel@ohri.ca"), diff --git a/README.md b/README.md index fef0d79..715f74a 100644 --- a/README.md +++ b/README.md @@ -3,7 +3,7 @@ [![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental) -[![Version: 0.3.0](https://img.shields.io/badge/version-0.3.0-blue.svg)](https://github.com/Big-Life-Lab/MockData) +[![Version: 0.4.0-dev](https://img.shields.io/badge/version-0.4.0--dev-blue.svg)](https://github.com/Big-Life-Lab/MockData) [![pkgdown](https://github.com/Big-Life-Lab/MockData/actions/workflows/pkgdown.yaml/badge.svg)](https://github.com/Big-Life-Lab/MockData/actions/workflows/pkgdown.yaml) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) @@ -12,10 +12,12 @@ **Status: Experimental, pre-release software** MockData is a work-in-progress R package for generating mock testing data from -small metadata specifications. 
It is useful -today for development and documentation workflows, especially when paired -with recodeflow-style metadata (see below), but it should be treated as experimental -infrastructure rather than a stable released package. +small metadata specifications. The `dev` branch now contains the v0.4 +`mock_spec` architecture: direct specification helpers, a recodeflow metadata +adapter, native generation, optional `simstudy` generation, and post-processing +diagnostics. It is useful today for development and documentation workflows, +especially when paired with recodeflow-style metadata (see below), but it should +be treated as experimental infrastructure rather than a stable released package. People are using MockData and reporting that it is helpful. We take that as an encouraging signal, not as evidence that the package is mature. Please review @@ -33,10 +35,45 @@ the generated data before using it in any workflow that matters. **Current development limitations:** - APIs may change before a formal release -- Error handling is too permissive and can fail with warnings instead of stopping +- Some legacy v0.3-compatible paths still fall back with warnings; the v0.4 + `mock_spec` path is stricter and records diagnostics - The test suite does not yet cover every important edge case - Generated data should be manually checked against your intended metadata rules +**v0.4 direct API example** + +The v0.4 API separates specification, baseline generation, and post-processing. +That makes the generated values easier to inspect and audit. 
+ +```r +library(MockData) + +spec <- mock_spec( + mock_spec_continuous( + "age", + range = c(18, 85), + distribution = "normal", + mean = 50, + sd = 12, + rtype = "integer" + ), + mock_spec_categorical( + "smoking", + levels = c("never", "former", "current"), + proportions = c(0.5, 0.3, 0.2), + rtype = "character", + missing_codes = "unknown", + missing_proportions = 0.05 + ) +) + +baseline <- generate_mock_data_native(spec, n = 100, seed = 1) +mock_data <- postprocess_mock_data(baseline, spec, seed = 2) + +head(mock_data) +attr(mock_data, "mockdata_diagnostics")$variables$smoking +``` + **30-second standalone example** For a quick numeric variable, `create_con_var()` can use two small @@ -221,6 +258,7 @@ devtools::install_local("~/github/mock-data") **Tutorials:** +- [v0.4 getting started](vignettes/getting-started-v04.qmd) - Direct `mock_spec`, recodeflow adapter, and diagnostics workflow - [Getting started](vignettes/getting-started.qmd) - Complete tutorial from single variables to full datasets - [For recodeflow users](vignettes/for-recodeflow-users.qmd) - Using MockData with existing metadata - [Survival data](vignettes/tutorial-survival-data.qmd) - Time-to-event data and temporal patterns diff --git a/_pkgdown.yml b/_pkgdown.yml index d17d53a..8c02821 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -100,6 +100,7 @@ articles: desc: Learning-oriented step-by-step guides navbar: Tutorials contents: + - getting-started-v04 - getting-started - tutorial-categorical-continuous - tutorial-dates diff --git a/development/v04-documentation-sprint.md b/development/v04-documentation-sprint.md new file mode 100644 index 0000000..ae78e00 --- /dev/null +++ b/development/v04-documentation-sprint.md @@ -0,0 +1,57 @@ +# MockData v0.4 Documentation Sprint + +This sprint treats documentation as implementation validation. The goal is not +only to explain the v0.4 API, but to run realistic user workflows during +vignette and pkgdown builds. 
+ +## Principles + +- Use Divio's four documentation needs: tutorials, how-to guides, reference, and + explanation. +- Keep vignette code executable unless the code genuinely depends on an + external package or private data. +- Prefer small, focused vignettes over one large tour. +- Use seeds in every stochastic example so rendered output is stable. +- Include at least one diagnostics example because the v0.4 pipeline's + auditability contract is a central design change. + +## First pass + +- `getting-started-v04.qmd`: tutorial for the v0.4 `mock_spec` workflow. +- README: update the top-level status and quick example so users see v0.4 + immediately. +- `_pkgdown.yml`: expose the new v0.4 tutorial in site navigation. + +## Follow-up vignettes + +Tutorial: + +- `getting-started-v04.qmd`: linear first-use path. + +How-to: + +- `recodeflow-metadata-v04.qmd`: use existing `variables.csv` and + `variable_details.csv`. +- `diagnostics-and-garbage-v04.qmd`: inspect missing-code and garbage + diagnostics. +- `choosing-a-backend-v04.qmd`: native vs optional `simstudy`. +- `migrating-from-v03-v04.qmd`: seed behavior, diagnostics attribute, + fallback conditions, and compatibility wrappers. + +Explanation: + +- `design-philosophy-v04.qmd`: distill the architecture review, hybrid backend + decision, and mock-data versus synthetic-data boundary. + +Reference: + +- Keep roxygen pages and `_pkgdown.yml` synchronized with exported functions. +- Keep `NEWS.md` as the release-note source of truth. + +## Review checklist + +- Does every vignette render locally? +- Does every code chunk either run or clearly justify `eval: false`? +- Does each vignette commit to one Divio purpose? +- Do examples use the public API exactly as users should use it? +- Are error messages and diagnostics understandable in rendered output? 
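The second checklist item can be partly automated. The sketch below is a hypothetical review helper (`find_eval_false_chunks()` is not part of MockData) that scans the lines of a Quarto file and reports the chunks that set `#| eval: false`, so a reviewer can confirm each one is justified. It assumes chunks open with a `{r` fence and use `#|` option comments, as the v0.4 vignettes do.

```r
# Hypothetical review helper (not part of MockData): list R chunks in a
# Quarto/R Markdown file that set `#| eval: false`.
find_eval_false_chunks <- function(lines) {
  fence <- strrep("`", 3)                              # three backticks
  starts <- grep(paste0("^", fence, "\\{r"), lines)    # chunk openers
  closes <- grep(paste0("^", fence, "\\s*$"), lines)   # bare closing fences
  flagged <- character(0)
  for (start in starts) {
    end <- closes[closes > start][1]
    if (is.na(end)) next                               # unterminated chunk
    body <- if (end - start > 1) lines[(start + 1):(end - 1)] else character(0)
    opts <- body[grepl("^#\\|", body)]                 # chunk option comments
    if (any(grepl("^#\\|\\s*eval:\\s*false\\s*$", opts))) {
      label <- sub("^#\\|\\s*label:\\s*", "", opts[grepl("^#\\|\\s*label:", opts)])
      # Fall back to the opening line number when the chunk has no label.
      flagged <- c(flagged, if (length(label) > 0) label[1] else paste0("line ", start))
    }
  }
  flagged
}

# Small in-memory document standing in for a vignette file.
fence <- strrep("`", 3)
doc <- c(
  paste0(fence, "{r}"),
  "#| label: setup",
  "x <- 1",
  fence,
  paste0(fence, "{r}"),
  "#| label: troubleshooting",
  "#| eval: false",
  "stop('not run')",
  fence,
  paste0(fence, "{r}"),
  "#| eval: false",
  "y <- 2",
  fence
)

find_eval_false_chunks(doc)
```

For a real file, the same function could be applied to `readLines()` output for each vignette; the reviewer still has to judge whether each flagged chunk is justified.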
diff --git a/vignettes/getting-started-v04.qmd b/vignettes/getting-started-v04.qmd new file mode 100644 index 0000000..ac95417 --- /dev/null +++ b/vignettes/getting-started-v04.qmd @@ -0,0 +1,189 @@ +--- +title: "Getting started with MockData v0.4" +format: html +vignette: > + %\VignetteIndexEntry{Getting started with MockData v0.4} + %\VignetteEngine{quarto::html} + %\VignetteEncoding{UTF-8} +--- + +```{r} +#| label: setup +#| include: false +input_file <- tryCatch(knitr::current_input(dir = TRUE), error = function(e) NULL) +candidate_roots <- unique(c( + ".", + "..", + if (!is.null(input_file)) file.path(dirname(input_file), "..") +)) +package_root <- NULL +for (candidate in candidate_roots) { + description <- file.path(candidate, "DESCRIPTION") + if (file.exists(description) && + any(grepl("^Package:\\s+MockData\\s*$", readLines(description, warn = FALSE)))) { + package_root <- candidate + break + } +} + +if (!is.null(package_root)) { + devtools::load_all(package_root, quiet = TRUE) +} else { + library(MockData) +} +``` + +::: {.vignette-about} +**About this vignette:** This tutorial introduces the v0.4 `mock_spec` +workflow. All code is executed when the vignette builds, so this page also +serves as a user-flow test for the public API. +::: + +## The v0.4 workflow + +MockData v0.4 separates data generation into three steps: + +1. Create a `mock_spec` +2. Generate baseline valid values +3. Apply missing-code and garbage-value post-processing + +That separation makes it easier to inspect what was requested and what was +changed after generation. + +## Specify variables directly + +Use the direct helpers when you want a small mock dataset without creating CSV +metadata files. 
+ +```{r} +spec <- mock_spec( + mock_spec_continuous( + "age", + range = c(18, 85), + distribution = "normal", + mean = 50, + sd = 12, + rtype = "integer" + ), + mock_spec_categorical( + "smoking", + levels = c("never", "former", "current"), + proportions = c(0.5, 0.3, 0.2), + rtype = "character", + missing_codes = "unknown", + missing_proportions = 0.05 + ) +) + +validate_mock_spec(spec) +``` + +Generate baseline values first. These are values within the intended valid +space. + +```{r} +baseline <- generate_mock_data_native(spec, n = 100, seed = 101) +head(baseline) +``` + +Then apply missing-code and garbage-value rules. This step adds diagnostics as +an attribute on the returned data frame. + +```{r} +mock_data <- postprocess_mock_data(baseline, spec, seed = 102) +head(mock_data) +``` + +```{r} +diagnostics <- attr(mock_data, "mockdata_diagnostics") +diagnostics$variables$smoking +``` + +The diagnostics distinguish values assigned by post-processing from values that +were drawn naturally during baseline generation. + +## Use recodeflow metadata + +If you already have recodeflow-style `variables` and `variable_details` +metadata, adapt those tables to the same `mock_spec` shape. + +```{r} +variables <- data.frame( + variable = c("age", "smoking"), + variableType = c("Continuous", "Categorical"), + rType = c("integer", "character"), + role = c("enabled", "enabled"), + stringsAsFactors = FALSE +) + +variable_details <- data.frame( + variable = c("age", "smoking", "smoking", "smoking"), + recStart = c("[18, 85]", "1", "2", "97"), + recEnd = c("copy", "copy", "copy", "NA::b"), + proportion = c(1, 0.6, 0.3, 0.1), + stringsAsFactors = FALSE +) + +spec_from_metadata <- mock_spec_from_recodeflow(variables, variable_details) +names(spec_from_metadata$variables) +``` + +The adapter preserves recodeflow semantics: valid ranges, categorical +proportions, `recEnd` missing-code rows, and garbage settings. 
+ +```{r} +metadata_baseline <- generate_mock_data_native( + spec_from_metadata, + n = 100, + seed = 201 +) + +metadata_mock <- postprocess_mock_data( + metadata_baseline, + spec_from_metadata, + seed = 202 +) + +head(metadata_mock) +``` + +```{r} +metadata_diag <- attr(metadata_mock, "mockdata_diagnostics") +metadata_diag$variables$smoking$assigned_missing_indices[1:5] +``` + +## Use the compatibility wrapper + +`create_mock_data()` remains available for v0.3-style workflows. In strict mode, +it attempts the v0.4 pipeline for supported metadata and falls back to the legacy +`create_*` dispatch path for features that are not yet supported by the v0.4 +native backend. + +```{r} +wrapped <- create_mock_data( + databaseStart = "example", + variables = variables, + variable_details = variable_details, + n = 100, + seed = 301 +) + +head(wrapped) +``` + +```{r} +!is.null(attr(wrapped, "mockdata_diagnostics")) +``` + +When the v0.4 path is used, `create_mock_data()` returns diagnostics. Legacy +fallback paths return plain data frames without that attribute. + +## Choosing the next function + +- Use `mock_*()` or `mock_spec_*()` for small examples and tests. +- Use `mock_spec_from_recodeflow()` when metadata already exists. +- Use `generate_mock_data_native()` for the default MIT-licensed backend. +- Use `generate_mock_data_simstudy()` only when the optional `simstudy` package + is installed and you want to test that backend. +- Use `postprocess_mock_data()` when you need missing codes, garbage values, or + diagnostics. 
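The compatibility-wrapper section distinguishes v0.4 output, which carries diagnostics, from legacy fallback output, which is a plain data frame. A test can guard on that attribute before reading diagnostics. The sketch below uses only base R and stand-in data frames; `has_mockdata_diagnostics()` is a hypothetical helper name, not a MockData export.

```r
# Hypothetical helper (not exported by MockData): TRUE when a data frame
# carries the `mockdata_diagnostics` attribute.
has_mockdata_diagnostics <- function(x) {
  !is.null(attr(x, "mockdata_diagnostics", exact = TRUE))
}

# Stand-ins for the two kinds of output. Real data would come from
# create_mock_data(); these are built by hand for illustration only.
legacy_like <- data.frame(age = c(34L, 51L, 67L))
v04_like <- legacy_like
attr(v04_like, "mockdata_diagnostics") <- list(variables = list(age = list()))

has_mockdata_diagnostics(legacy_like)
has_mockdata_diagnostics(v04_like)
```

A guard like this lets a test fail with a clear message when the wrapper unexpectedly took the legacy path, rather than failing later on a `NULL` diagnostics object.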
From 028692f73506b8ebf724b86f2327792035ea6726 Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Wed, 20 May 2026 12:41:06 -0400 Subject: [PATCH 2/8] Add v04 maintainer communication note --- development/v04-documentation-sprint.md | 5 + development/v04-phase-c-comms-note.md | 151 ++++++++++++++++++++++++ 2 files changed, 156 insertions(+) create mode 100644 development/v04-phase-c-comms-note.md diff --git a/development/v04-documentation-sprint.md b/development/v04-documentation-sprint.md index ae78e00..4c138d7 100644 --- a/development/v04-documentation-sprint.md +++ b/development/v04-documentation-sprint.md @@ -21,6 +21,8 @@ vignette and pkgdown builds. - README: update the top-level status and quick example so users see v0.4 immediately. - `_pkgdown.yml`: expose the new v0.4 tutorial in site navigation. +- `v04-phase-c-comms-note.md`: maintainer-facing note for cchsflow, + chmsflow, and recodeflow testing while v0.4 sits on `dev`. ## Follow-up vignettes @@ -42,6 +44,9 @@ Explanation: - `design-philosophy-v04.qmd`: distill the architecture review, hybrid backend decision, and mock-data versus synthetic-data boundary. +- `development/v04-phase-c-comms-note.md`: Phase C maintainer communication + source material; fold relevant parts into migration and recodeflow how-to + docs after maintainer feedback. Reference: diff --git a/development/v04-phase-c-comms-note.md b/development/v04-phase-c-comms-note.md new file mode 100644 index 0000000..0cf79e2 --- /dev/null +++ b/development/v04-phase-c-comms-note.md @@ -0,0 +1,151 @@ +# MockData v0.4 Phase C Maintainer Communication Note + +**Audience**: cchsflow, chmsflow, recodeflow, and MockData maintainers +**Status**: draft for maintainer review +**Branch for testing**: `dev` + +## Short Version + +MockData v0.4 is now available on the `dev` branch for maintainer testing. 
The +main change is architectural: MockData now normalizes inputs into a `mock_spec`, +then generates data through a native backend, an optional `simstudy` backend, and +a MockData-owned post-processing layer for missing codes, garbage values, and +diagnostics. + +The goal is to make MockData more reliable for recodeflow-style metadata while +preserving the existing public API. No sibling package needs to migrate +immediately for v0.4.0. + +## What Changed + +- `mock_spec` is the normalized internal representation for mock-data + specifications. +- New direct helper APIs exist for simple use cases: + `mock_continuous()`, `mock_categorical()`, and `mock_date()`. +- `mock_spec_from_recodeflow()` converts recodeflow-style `variables.csv` and + `variable_details.csv` metadata into a `mock_spec`. +- `generate_mock_data_native()` generates data from `mock_spec` without optional + dependencies. +- `postprocess_mock_data()` applies MockData-owned missing-code and garbage-data + semantics and attaches a `mockdata_diagnostics` attribute. +- `generate_mock_data_simstudy()` is available for supported advanced cases when + `simstudy >= 0.8.1` is installed. +- `create_mock_data()` now routes supported metadata through the v0.4 pipeline + and falls back to the legacy path for unsupported or explicitly lenient cases. + +## What Did Not Change + +- Existing v0.3 public functions remain available in v0.4.0. +- `create_mock_data()` keeps its existing signature. +- MockData remains focused on mock data for package development, QA, + documentation, examples, and training. +- MockData is not positioning v0.4 as synthetic data for privacy release, + inference, or population-valid analysis. +- MockData remains MIT licensed. `simstudy` is GPL-3 and optional through + `Suggests`, not a required dependency. +- No public function removals are planned before v0.5.0. + +## Compatibility Notes + +- `validate = TRUE` is the default strict path. 
It uses the v0.4 pipeline when + the requested metadata is supported. +- `validate = FALSE` deliberately uses the legacy, more permissive path. +- `variable_details = NULL` also uses the legacy fallback path. +- The v0.4 path returns a regular data frame with an optional + `mockdata_diagnostics` attribute. Legacy fallback output does not include this + attribute. +- Seeded output may differ from v0.3 even when the same seed is supplied. + v0.4 uses the requested seed for baseline generation and `seed + 1L` for + post-processing so missing-code and garbage injection are reproducible without + sharing the same RNG stream as baseline generation. +- Formula-derived variables, multi-group correlations, and advanced survival + models are intentionally deferred. Unsupported cases should either fail loudly + or route through the legacy path, depending on the public entry point. + +## What We Need Maintainers To Test + +Please test against representative metadata from cchsflow, chmsflow, and +recodeflow projects, especially files that include: + +- categorical variables with `recEnd` missing-code semantics; +- continuous variables with ranges or distribution parameters; +- date variables; +- garbage or invalid-value rules; +- role and `databaseStart` filtering; +- any variables that sibling packages expect MockData to generate today. 
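To decide which metadata files are representative, it may help to tabulate which of the features listed above a given pair of tables exercises. The sketch below is base R only. `summarise_metadata_features()` is a hypothetical helper, not part of MockData, and it assumes recodeflow-style column names (`variableType`, `recEnd`, and garbage columns prefixed with `garbage_`).

```r
# Hypothetical helper (not part of MockData): count the features in a pair of
# recodeflow-style tables that the checklist above asks maintainers to cover.
summarise_metadata_features <- function(variables, variable_details) {
  type <- tolower(variables$variableType)
  garbage_cols <- grep("^garbage_", names(variables), value = TRUE)
  # A variable "has garbage" if any garbage_* column is non-NA for its row.
  # Note: blank strings read from CSV are not NA unless na.strings is set.
  has_garbage <- if (length(garbage_cols) > 0) {
    rowSums(!is.na(variables[garbage_cols])) > 0
  } else {
    rep(FALSE, nrow(variables))
  }
  c(
    categorical = sum(type == "categorical"),
    continuous = sum(type == "continuous"),
    date = sum(type == "date"),
    missing_code_rows = sum(grepl("^NA::", variable_details$recEnd)),
    variables_with_garbage = sum(has_garbage)
  )
}

# Illustrative metadata; the values are made up for this sketch.
variables <- data.frame(
  variable = c("age", "smoking", "interview_date"),
  variableType = c("Continuous", "Categorical", "Date"),
  garbage_low_prop = c(0.02, NA, NA),
  stringsAsFactors = FALSE
)
variable_details <- data.frame(
  variable = c("age", "age", "smoking", "smoking", "interview_date"),
  recStart = c("[18, 85]", "999", "1", "7", "[2020-01-01, 2020-12-31]"),
  recEnd = c("copy", "NA::b", "copy", "NA::b", "copy"),
  stringsAsFactors = FALSE
)

feature_counts <- summarise_metadata_features(variables, variable_details)
feature_counts
```

A file whose counts are all zero for a feature does not test that feature, so maintainers may want to pick at least one file per non-zero row.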
+ +Suggested smoke test: + +```r +devtools::load_all() + +vars <- read.csv("path/to/variables.csv") +details <- read.csv("path/to/variable_details.csv") + +mock <- create_mock_data( + variables = vars, + variable_details = details, + databaseStart = "cycle1", + n = 100, + seed = 123, + validate = TRUE, + verbose = TRUE +) + +str(mock) +attr(mock, "mockdata_diagnostics") +``` + +Also useful: + +```r +spec <- mock_spec_from_recodeflow( + variables = vars, + variable_details = details, + databaseStart = "cycle1" +) + +validate_mock_spec(spec, strict = TRUE) +``` + +## What To Report + +Please report: + +- metadata files or patterns that unexpectedly fall back to the legacy path; +- variables that generated correctly in v0.3 but fail in v0.4; +- variables that generate but have surprising values, types, or missing-code + behavior; +- diagnostics that are hard to interpret; +- any cchsflow/chmsflow/recodeflow assumptions about MockData output that v0.4 + appears to change; +- API ergonomics issues that make the new path hard to explain in documentation. + +## Proposed Timeline + +- v0.4 sits on `dev` while sibling maintainers test representative metadata. +- Documentation sprint work continues on a separate branch and PR. +- After checks, documentation, and maintainer smoke tests are complete, v0.4.0 + can be tagged and merged forward to `main`. +- Any lifecycle deprecation warnings for older APIs should wait until v0.4.x and + only after sibling package maintainers have a clear migration path. + +## Message Template + +Subject: MockData v0.4 available on `dev` for sibling-package testing + +MockData v0.4 is now on the `dev` branch for maintainer testing. It keeps the +existing `create_mock_data()` API, but internally routes supported metadata +through a new `mock_spec` pipeline with native generation and MockData-owned +post-processing diagnostics. 
+ +No immediate migration is required for cchsflow/chmsflow/recodeflow, and no +public function removals are planned before v0.5.0. The main ask is to try +representative `variables.csv` and `variable_details.csv` files against +`create_mock_data(validate = TRUE, verbose = TRUE)` and report any unexpected +fallbacks, failures, or output changes. + +The key user-visible differences are that v0.4 output may include a +`mockdata_diagnostics` attribute, seeded output can differ from v0.3 because +post-processing uses `seed + 1L`, and optional `simstudy` support remains in +`Suggests` rather than becoming a required dependency. From 200630c592d435d2d3b2229ac2c7b667f672e7bf Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Wed, 20 May 2026 12:56:36 -0400 Subject: [PATCH 3/8] Add recodeflow metadata how-to --- _pkgdown.yml | 1 + development/v04-documentation-sprint.md | 2 + vignettes/recodeflow-metadata-v04.qmd | 268 ++++++++++++++++++++++++ 3 files changed, 271 insertions(+) create mode 100644 vignettes/recodeflow-metadata-v04.qmd diff --git a/_pkgdown.yml b/_pkgdown.yml index 8c02821..8329e1d 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -112,6 +112,7 @@ articles: desc: Task-oriented practical examples navbar: How-to guides contents: + - recodeflow-metadata-v04 - for-recodeflow-users - title: Explanation diff --git a/development/v04-documentation-sprint.md b/development/v04-documentation-sprint.md index 4c138d7..97b8862 100644 --- a/development/v04-documentation-sprint.md +++ b/development/v04-documentation-sprint.md @@ -18,6 +18,8 @@ vignette and pkgdown builds. ## First pass - `getting-started-v04.qmd`: tutorial for the v0.4 `mock_spec` workflow. +- `recodeflow-metadata-v04.qmd`: how-to for generating mock data from + recodeflow-style CSV metadata. - README: update the top-level status and quick example so users see v0.4 immediately. - `_pkgdown.yml`: expose the new v0.4 tutorial in site navigation. 
diff --git a/vignettes/recodeflow-metadata-v04.qmd b/vignettes/recodeflow-metadata-v04.qmd new file mode 100644 index 0000000..a36b8be --- /dev/null +++ b/vignettes/recodeflow-metadata-v04.qmd @@ -0,0 +1,268 @@ +--- +title: "Use recodeflow metadata with MockData v0.4" +format: html +vignette: > + %\VignetteIndexEntry{Use recodeflow metadata with MockData v0.4} + %\VignetteEngine{quarto::html} + %\VignetteEncoding{UTF-8} +--- + +```{r} +#| label: setup +#| include: false +input_file <- tryCatch(knitr::current_input(dir = TRUE), error = function(e) NULL) +candidate_roots <- unique(c( + ".", + "..", + if (!is.null(input_file)) file.path(dirname(input_file), "..") +)) +package_root <- NULL +for (candidate in candidate_roots) { + description <- file.path(candidate, "DESCRIPTION") + if (file.exists(description) && + any(grepl("^Package:\\s+MockData\\s*$", readLines(description, warn = FALSE)))) { + package_root <- candidate + break + } +} + +if (!is.null(package_root)) { + devtools::load_all(package_root, quiet = TRUE) +} else { + library(MockData) +} +``` + +::: {.vignette-about} +**About this vignette:** This how-to shows how to generate mock data from +recodeflow-style `variables.csv` and `variable_details.csv` files. The code +writes temporary CSV files and reads them back, so the vignette exercises the +same path as a file-based user workflow. +::: + +## Starting point + +Use this path when you already have recodeflow metadata. MockData reads the +metadata, converts it to a v0.4 `mock_spec`, generates baseline values, and +applies missing-code and garbage-value post-processing. 
+ +For a compact example, define three variables: + +- `age`: continuous integer with a normal distribution and one missing code +- `smoking`: categorical code with one `recEnd = "NA::b"` missing-code row +- `interview_date`: date variable with a valid calendar range + +```{r} +variables <- data.frame( + variable = c("age", "smoking", "interview_date"), + label = c("Age in years", "Smoking status", "Interview date"), + variableType = c("Continuous", "Categorical", "Date"), + rType = c("integer", "factor", "date"), + role = c("enabled,table1", "enabled,table1", "enabled"), + position = c(10, 20, 30), + databaseStart = c("cycle1", "cycle1", "cycle1"), + distribution = c("normal", NA, "uniform"), + mean = c(50, NA, NA), + sd = c(12, NA, NA), + garbage_low_prop = c(0.02, NA, NA), + garbage_low_range = c("[0, 17]", NA, NA), + stringsAsFactors = FALSE +) + +variable_details <- data.frame( + variable = c( + "age", + "age", + "smoking", + "smoking", + "smoking", + "smoking", + "interview_date" + ), + recStart = c( + "[18, 85]", + "999", + "1", + "2", + "3", + "7", + "[2020-01-01, 2020-12-31]" + ), + recEnd = c("copy", "NA::b", "copy", "copy", "copy", "NA::b", "copy"), + catLabel = c( + "Valid age range", + "Not stated", + "Never smoker", + "Former smoker", + "Current smoker", + "Don't know", + "Interview date range" + ), + proportion = c(0.95, 0.05, 0.50, 0.30, 0.17, 0.03, 1), + databaseStart = "cycle1", + stringsAsFactors = FALSE +) +``` + +## Write metadata as CSV files + +In a real project, these files already exist. Here we write them to a temporary +directory so this vignette remains self-contained. 
+ +```{r} +metadata_dir <- tempfile("mockdata-recodeflow-") +dir.create(metadata_dir) + +variables_file <- file.path(metadata_dir, "variables.csv") +details_file <- file.path(metadata_dir, "variable_details.csv") + +write.csv(variables, variables_file, row.names = FALSE, na = "") +write.csv(variable_details, details_file, row.names = FALSE, na = "") +``` + +## Inspect the normalized specification + +`mock_spec_from_recodeflow()` reads either data frames or CSV file paths. It +returns a validated `mock_spec` without generating data. + +```{r} +spec <- mock_spec_from_recodeflow( + variables = variables_file, + variable_details = details_file, + databaseStart = "cycle1" +) + +names(spec$variables) +``` + +The spec preserves the recodeflow pieces MockData needs: variable types, +categorical levels, proportions, valid ranges, missing-code rows, and garbage +rules. + +```{r} +spec$variables$smoking$levels +spec$variables$smoking$missing_codes +spec$variables$age$range +spec$variables$age$garbage_rules +``` + +## Generate mock data with the compatibility wrapper + +Most recodeflow users should start with `create_mock_data()`. In strict mode +(`validate = TRUE`, the default), supported metadata routes through the v0.4 +pipeline. + +```{r} +mock_data <- create_mock_data( + databaseStart = "cycle1", + variables = variables_file, + variable_details = details_file, + n = 200, + seed = 123, + verbose = TRUE +) + +head(mock_data) +``` + +The output is a regular data frame. + +```{r} +str(mock_data) +``` + +## Check diagnostics + +When `create_mock_data()` uses the v0.4 path, the returned data frame has a +`mockdata_diagnostics` attribute. The attribute records which rows were changed +during missing-code and garbage-value post-processing. + +```{r} +diagnostics <- attr(mock_data, "mockdata_diagnostics") +names(diagnostics$variables) +``` + +For example, `smoking` has a missing-code rule for code `7`, and `age` has a +low garbage rule. 
+ +```{r} +length(diagnostics$variables$smoking$assigned_missing_indices) +diagnostics$variables$smoking$assigned_missing_indices[1:6] + +length(diagnostics$variables$age$assigned_garbage_indices$low) +diagnostics$variables$age$assigned_garbage_indices$low +``` + +Use the diagnostics as an audit trail, not as columns in the mock dataset. Some +base R operations and downstream tools can drop attributes, so inspect or save +diagnostics before heavy reshaping. + +## Generate explicitly from the spec + +The wrapper is convenient, but the v0.4 pipeline can also be called step by +step. This is useful when you want to inspect baseline values before +post-processing. + +```{r} +baseline <- generate_mock_data_native(spec, n = 200, seed = 123) +head(baseline) +``` + +```{r} +postprocessed <- postprocess_mock_data(baseline, spec, seed = 124) +head(postprocessed) +``` + +The wrapper uses the same idea: the public seed controls baseline generation, +and `seed + 1L` controls post-processing. + +## Database filtering + +`databaseStart` filtering is exact token matching. A variable tagged for +`cycle10` will not accidentally match `cycle1`. + +```{r} +variables_cycle10 <- variables +variables_cycle10$variable[1] <- "age_cycle10" +variables_cycle10$databaseStart[1] <- "cycle10" + +combined_variables <- rbind(variables, variables_cycle10[1, ]) +combined_details <- rbind( + variable_details, + transform(variable_details[variable_details$variable == "age", ], + variable = "age_cycle10") +) + +filtered_spec <- mock_spec_from_recodeflow( + variables = combined_variables, + variable_details = combined_details, + databaseStart = "cycle1" +) + +names(filtered_spec$variables) +``` + +## Troubleshooting + +If `create_mock_data()` cannot use the v0.4 path, set `verbose = TRUE` to see +which path was chosen. 
+ +```{r} +#| eval: false +mock_data <- create_mock_data( + databaseStart = "cycle1", + variables = variables_file, + variable_details = details_file, + n = 200, + seed = 123, + verbose = TRUE +) +``` + +Common reasons for legacy fallback include `validate = FALSE`, +`variable_details = NULL`, detail-level `databaseStart` filtering without a +variable-level `databaseStart` column, and features that are intentionally +deferred from the v0.4 native backend. + +For deeper diagnostics examples, see the diagnostics and garbage how-to when it +lands in the v0.4 documentation sprint. From 750bf13052a52118d2ef2571bee8dc7b2e46d671 Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Wed, 20 May 2026 16:26:29 -0400 Subject: [PATCH 4/8] Add diagnostics and garbage how-to --- _pkgdown.yml | 1 + development/v04-documentation-sprint.md | 2 + vignettes/diagnostics-and-garbage-v04.qmd | 247 ++++++++++++++++++++++ vignettes/recodeflow-metadata-v04.qmd | 4 + 4 files changed, 254 insertions(+) create mode 100644 vignettes/diagnostics-and-garbage-v04.qmd diff --git a/_pkgdown.yml b/_pkgdown.yml index 8329e1d..7b4851c 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -113,6 +113,7 @@ articles: navbar: How-to guides contents: - recodeflow-metadata-v04 + - diagnostics-and-garbage-v04 - for-recodeflow-users - title: Explanation diff --git a/development/v04-documentation-sprint.md b/development/v04-documentation-sprint.md index 97b8862..584a0fa 100644 --- a/development/v04-documentation-sprint.md +++ b/development/v04-documentation-sprint.md @@ -20,6 +20,8 @@ vignette and pkgdown builds. - `getting-started-v04.qmd`: tutorial for the v0.4 `mock_spec` workflow. - `recodeflow-metadata-v04.qmd`: how-to for generating mock data from recodeflow-style CSV metadata. +- `diagnostics-and-garbage-v04.qmd`: how-to for reading diagnostics and + auditing garbage/missing-code post-processing. - README: update the top-level status and quick example so users see v0.4 immediately. 
- `_pkgdown.yml`: expose the new v0.4 tutorial in site navigation. diff --git a/vignettes/diagnostics-and-garbage-v04.qmd b/vignettes/diagnostics-and-garbage-v04.qmd new file mode 100644 index 0000000..259f0de --- /dev/null +++ b/vignettes/diagnostics-and-garbage-v04.qmd @@ -0,0 +1,247 @@ +--- +title: "Inspect diagnostics and garbage rules in MockData v0.4" +format: html +vignette: > + %\VignetteIndexEntry{Inspect diagnostics and garbage rules in MockData v0.4} + %\VignetteEngine{quarto::html} + %\VignetteEncoding{UTF-8} +--- + +```{r} +#| label: setup +#| include: false +input_file <- tryCatch(knitr::current_input(dir = TRUE), error = function(e) NULL) +candidate_roots <- unique(c( + ".", + "..", + if (!is.null(input_file)) file.path(dirname(input_file), "..") +)) +package_root <- NULL +for (candidate in candidate_roots) { + description <- file.path(candidate, "DESCRIPTION") + if (file.exists(description) && + any(grepl("^Package:\\s+MockData\\s*$", readLines(description, warn = FALSE)))) { + package_root <- candidate + break + } +} + +if (!is.null(package_root)) { + devtools::load_all(package_root, quiet = TRUE) +} else { + library(MockData) +} +``` + +::: {.vignette-about} +**About this vignette:** This how-to shows how to inspect the +`mockdata_diagnostics` attribute added by the v0.4 post-processing layer. The +examples focus on audit trails for missing-code collisions and garbage-value +rules. +::: + +## Why diagnostics matter + +Mock data often needs two kinds of unusual values: + +- missing codes, such as `97` or `999` +- garbage values, such as impossible ages used to test validation code + +Sometimes a value can be both meaningful and suspicious. For example, code `97` +could be a valid category in one source file and also a declared missing code in +another. MockData records diagnostics so you can tell whether a value was drawn +naturally by the baseline generator or assigned later by post-processing. 
+ +## Create a collision case + +Start with a categorical variable where `97` is both a valid level and a +declared missing code. This is deliberately awkward; it is the case diagnostics +are designed to make auditable. + +```{r} +response_spec <- mock_categorical( + "response", + levels = c("1", "97"), + proportions = c(0.65, 0.35), + rtype = "character", + missing_codes = "97", + missing_proportions = 0.20 +) + +baseline <- generate_mock_data_native(response_spec, n = 200, seed = 11) +table(baseline$response) +``` + +The baseline already contains some `97` values because `97` is a valid level. +Now apply post-processing. + +```{r} +processed <- postprocess_mock_data(baseline, response_spec, seed = 12) +table(processed$response) +``` + +## Read the diagnostics + +Diagnostics live in a data-frame attribute. + +```{r} +diagnostics <- attr(processed, "mockdata_diagnostics") +names(diagnostics$variables) +``` + +For a variable, two fields are especially important: + +- `preexisting_missing_code_indices`: rows whose baseline value already matched + a declared missing code +- `assigned_missing_indices`: rows changed by post-processing to a missing code + +```{r} +response_diag <- diagnostics$variables$response + +length(response_diag$preexisting_missing_code_indices) +length(response_diag$assigned_missing_indices) +``` + +These two sets should be distinct. + +```{r} +intersect( + response_diag$preexisting_missing_code_indices, + response_diag$assigned_missing_indices +) +``` + +Both groups contain `97` in the final data, but they mean different things. + +```{r} +head(processed$response[response_diag$preexisting_missing_code_indices]) +head(processed$response[response_diag$assigned_missing_indices]) +``` + +Use the diagnostics when your tests need to distinguish a naturally drawn +collision from a missing code assigned by MockData. + +## Add garbage rules + +Garbage rules deliberately inject invalid or out-of-range values. 
Here `age` +has one missing code and two garbage rules: + +- `low`: values below the valid age range +- `high`: values above the valid age range + +```{r} +age_spec <- mock_continuous( + "age", + range = c(18, 85), + distribution = "normal", + mean = 50, + sd = 12, + rtype = "integer", + missing_codes = 999, + missing_proportions = 0.05, + garbage_rules = list( + high = list(proportion = 0.03, range = "[120, 150]"), + low = list(proportion = 0.04, range = "[0, 17]") + ) +) + +age_baseline <- generate_mock_data_native(age_spec, n = 200, seed = 21) +age_processed <- postprocess_mock_data(age_baseline, age_spec, seed = 22) +``` + +MockData applies garbage rules in canonical order: `low`, then `high`, then any +other named rules in caller order. The diagnostics use the same order. + +```{r} +age_diag <- attr(age_processed, "mockdata_diagnostics")$variables$age +names(age_diag$assigned_garbage_indices) +``` + +Inspect the assigned rows. + +```{r} +low_idx <- age_diag$assigned_garbage_indices$low +high_idx <- age_diag$assigned_garbage_indices$high + +length(low_idx) +range(age_processed$age[low_idx]) + +length(high_idx) +range(age_processed$age[high_idx]) +``` + +Missing-code rows are protected from garbage assignment. + +```{r} +intersect(age_diag$assigned_missing_indices, low_idx) +intersect(age_diag$assigned_missing_indices, high_idx) +``` + +## Combine variables in one pipeline + +Most workflows generate several variables together. The same diagnostics shape +is used for every variable in the spec. + +```{r} +spec <- mock_spec( + response_spec$variables$response, + age_spec$variables$age +) + +combined_baseline <- generate_mock_data_native(spec, n = 200, seed = 31) +combined_processed <- postprocess_mock_data(combined_baseline, spec, seed = 32) + +combined_diag <- attr(combined_processed, "mockdata_diagnostics") +names(combined_diag$variables) +``` + +A compact audit summary can be built from the diagnostics. 
+ +```{r} +data.frame( + variable = names(combined_diag$variables), + preexisting_missing = vapply( + combined_diag$variables, + function(x) length(x$preexisting_missing_code_indices), + integer(1) + ), + assigned_missing = vapply( + combined_diag$variables, + function(x) length(x$assigned_missing_indices), + integer(1) + ), + assigned_garbage = vapply( + combined_diag$variables, + function(x) sum(lengths(x$assigned_garbage_indices)), + integer(1) + ) +) +``` + +## Preserve diagnostics before reshaping + +Diagnostics are stored as an attribute on the returned data frame. Some +downstream operations keep attributes and others drop them. If diagnostics are +part of your QA workflow, save them before heavy reshaping or joins. + +```{r} +saved_diagnostics <- attr(combined_processed, "mockdata_diagnostics") + +subset_data <- combined_processed[1:5, ] +is.null(attr(subset_data, "mockdata_diagnostics")) + +names(saved_diagnostics$variables) +``` + +## Re-running post-processing + +`postprocess_mock_data()` is intentionally not idempotent. Running it again on a +data frame that already has `mockdata_diagnostics` would double-contaminate the +data, so MockData stops loudly. + +```{r} +#| error: true +postprocess_mock_data(combined_processed, spec, seed = 33) +``` + +Start again from baseline data when you want a fresh post-processing draw. diff --git a/vignettes/recodeflow-metadata-v04.qmd b/vignettes/recodeflow-metadata-v04.qmd index a36b8be..51774fb 100644 --- a/vignettes/recodeflow-metadata-v04.qmd +++ b/vignettes/recodeflow-metadata-v04.qmd @@ -104,6 +104,10 @@ variable_details <- data.frame( ) ``` +When `databaseStart` filtering is requested, include `databaseStart` in both +metadata tables. If the detail metadata has the filter column but the variables +metadata does not, `create_mock_data()` uses the legacy path for compatibility. + ## Write metadata as CSV files In a real project, these files already exist. 
Here we write them to a temporary From 3d7873f73fe65577f66e34855e8facb80f79c9f4 Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Wed, 20 May 2026 16:35:35 -0400 Subject: [PATCH 5/8] Add v03 to v04 migration how-to --- _pkgdown.yml | 1 + development/v04-documentation-sprint.md | 2 + vignettes/migrating-from-v03-v04.qmd | 276 ++++++++++++++++++++++++ 3 files changed, 279 insertions(+) create mode 100644 vignettes/migrating-from-v03-v04.qmd diff --git a/_pkgdown.yml b/_pkgdown.yml index 7b4851c..215f895 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -114,6 +114,7 @@ articles: contents: - recodeflow-metadata-v04 - diagnostics-and-garbage-v04 + - migrating-from-v03-v04 - for-recodeflow-users - title: Explanation diff --git a/development/v04-documentation-sprint.md b/development/v04-documentation-sprint.md index 584a0fa..8e05cfa 100644 --- a/development/v04-documentation-sprint.md +++ b/development/v04-documentation-sprint.md @@ -22,6 +22,8 @@ vignette and pkgdown builds. recodeflow-style CSV metadata. - `diagnostics-and-garbage-v04.qmd`: how-to for reading diagnostics and auditing garbage/missing-code post-processing. +- `migrating-from-v03-v04.qmd`: how-to for compatibility behavior, fallback + routing, diagnostics, and seed differences. - README: update the top-level status and quick example so users see v0.4 immediately. - `_pkgdown.yml`: expose the new v0.4 tutorial in site navigation. 
diff --git a/vignettes/migrating-from-v03-v04.qmd b/vignettes/migrating-from-v03-v04.qmd new file mode 100644 index 0000000..48744f9 --- /dev/null +++ b/vignettes/migrating-from-v03-v04.qmd @@ -0,0 +1,276 @@ +--- +title: "Migrate from MockData v0.3 to v0.4" +format: html +vignette: > + %\VignetteIndexEntry{Migrate from MockData v0.3 to v0.4} + %\VignetteEngine{quarto::html} + %\VignetteEncoding{UTF-8} +--- + +```{r} +#| label: setup +#| include: false +input_file <- tryCatch(knitr::current_input(dir = TRUE), error = function(e) NULL) +candidate_roots <- unique(c( + ".", + "..", + if (!is.null(input_file)) file.path(dirname(input_file), "..") +)) +package_root <- NULL +for (candidate in candidate_roots) { + description <- file.path(candidate, "DESCRIPTION") + if (file.exists(description) && + any(grepl("^Package:\\s+MockData\\s*$", readLines(description, warn = FALSE)))) { + package_root <- candidate + break + } +} + +if (!is.null(package_root)) { + devtools::load_all(package_root, quiet = TRUE) +} else { + library(MockData) +} +``` + +::: {.vignette-about} +**About this vignette:** This how-to is for users moving existing +`create_mock_data()` workflows from v0.3 to v0.4. It focuses on the compatibility +wrapper, routing messages, diagnostics, and reproducibility differences. +::: + +## What stayed the same + +The main entry point is still `create_mock_data()`, and the existing arguments +are still available. 
+ +```{r} +variables <- data.frame( + variable = c("age", "smoking"), + variableType = c("Continuous", "Categorical"), + rType = c("integer", "character"), + role = c("enabled", "enabled"), + position = c(10, 20), + distribution = c("normal", NA), + mean = c(50, NA), + sd = c(12, NA), + stringsAsFactors = FALSE +) + +variable_details <- data.frame( + variable = c("age", "age", "smoking", "smoking", "smoking"), + recStart = c("[18, 85]", "999", "1", "2", "7"), + recEnd = c("copy", "NA::b", "copy", "copy", "NA::b"), + proportion = c(0.95, 0.05, 0.60, 0.35, 0.05), + stringsAsFactors = FALSE +) +``` + +```{r} +mock_data <- create_mock_data( + databaseStart = "study", + variables = variables, + variable_details = variable_details, + n = 100, + seed = 123 +) + +head(mock_data) +``` + +For supported metadata, v0.4 routes this call through the new `mock_spec` +pipeline. + +## See which path ran + +Use `verbose = TRUE` when migrating. The message tells you whether the v0.4 path +or the legacy path was used. + +```{r} +strict_data <- create_mock_data( + databaseStart = "study", + variables = variables, + variable_details = variable_details, + n = 50, + seed = 456, + verbose = TRUE +) +``` + +The v0.4 path returns a data frame with a diagnostics attribute. + +```{r} +!is.null(attr(strict_data, "mockdata_diagnostics")) +``` + +## Opt into legacy behavior + +Set `validate = FALSE` when you need the legacy v0.3 dispatch path during +migration. This is the explicit compatibility opt-out. + +```{r} +legacy_data <- create_mock_data( + databaseStart = "study", + variables = variables, + variable_details = variable_details, + n = 50, + seed = 456, + validate = FALSE, + verbose = TRUE +) +``` + +Legacy output is a plain data frame without the v0.4 diagnostics attribute. + +```{r} +is.null(attr(legacy_data, "mockdata_diagnostics")) +``` + +The strict and legacy paths should agree on the broad shape of supported data, +but exact values can differ. 
+ +```{r} +names(strict_data) +names(legacy_data) + +table(strict_data$smoking) +table(legacy_data$smoking) +``` + +## Understand seed differences + +In v0.3, the public seed controlled the legacy generators. In v0.4, the wrapper +uses the public seed for baseline generation and `seed + 1L` for missing-code +and garbage-value post-processing. + +That makes both stages reproducible, but it means exact values may differ from +v0.3 even when you pass the same seed. + +```{r} +strict_again <- create_mock_data( + databaseStart = "study", + variables = variables, + variable_details = variable_details, + n = 50, + seed = 456 +) + +identical(strict_data, strict_again) +``` + +When testing migrations, compare structure, types, ranges, and proportions +rather than expecting row-for-row equality with v0.3 output. + +```{r} +str(strict_data) +prop.table(table(strict_data$smoking)) +``` + +## Know the fallback conditions + +`create_mock_data()` deliberately uses the legacy path when: + +- `validate = FALSE` +- `variable_details = NULL` +- detail-level `databaseStart` filtering is needed but `variables` has no + `databaseStart` column +- the requested metadata uses a feature not yet supported by the v0.4 native + backend + +For example, `variable_details = NULL` keeps the simple legacy fallback. + +```{r} +fallback_data <- create_mock_data( + databaseStart = "study", + variables = variables[1, ], + variable_details = NULL, + n = 20, + seed = 789, + verbose = TRUE +) + +head(fallback_data) +``` + +```{r} +is.null(attr(fallback_data, "mockdata_diagnostics")) +``` + +Unsupported v0.4 backend features also route to legacy dispatch. This example +uses an exponential continuous distribution, which remains available through the +legacy generator. 
+ +```{r} +exp_variables <- data.frame( + variable = "time_to_visit", + variableType = "Continuous", + rType = "double", + role = "enabled", + distribution = "exponential", + rate = 0.5, + stringsAsFactors = FALSE +) + +exp_details <- data.frame( + variable = "time_to_visit", + recStart = "[0, 10]", + recEnd = "copy", + proportion = 1, + stringsAsFactors = FALSE +) + +exp_data <- create_mock_data( + databaseStart = "study", + variables = exp_variables, + variable_details = exp_details, + n = 20, + seed = 321, + verbose = TRUE +) + +head(exp_data) +``` + +## Inspect the v0.4 path directly + +When debugging a migration, split the wrapper into its three v0.4 steps: + +```{r} +spec <- mock_spec_from_recodeflow(variables, variable_details) +validate_mock_spec(spec, strict = TRUE) +``` + +```{r} +baseline <- generate_mock_data_native(spec, n = 50, seed = 456) +postprocessed <- postprocess_mock_data(baseline, spec, seed = 457) + +identical(strict_data, postprocessed) +``` + +This makes it easier to tell whether an issue is coming from metadata parsing, +baseline generation, or post-processing. + +## What to check in sibling packages + +For cchsflow, chmsflow, and recodeflow workflows, test representative +`variables.csv` and `variable_details.csv` files with: + +```{r} +#| eval: false +mock <- create_mock_data( + databaseStart = "your-cycle", + variables = "variables.csv", + variable_details = "variable_details.csv", + n = 100, + seed = 123, + validate = TRUE, + verbose = TRUE +) + +str(mock) +attr(mock, "mockdata_diagnostics") +``` + +Report cases where metadata unexpectedly falls back to legacy dispatch, where a +variable generated in v0.3 but errors in v0.4, or where the generated values, +types, or diagnostics are surprising. 
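When triaging reports like these, it can help to compare the shape of two outputs rather than their exact values. The helper below is a hypothetical base-R sketch: `compare_shape()` is not part of MockData, and the two small data frames are stand-ins for legacy and v0.4 output.

```r
# Hypothetical helper, not part of MockData: compare column names and
# classes of two data frames without comparing individual values.
compare_shape <- function(old, new) {
  shared <- intersect(names(old), names(new))
  old_class <- vapply(old[shared], function(x) class(x)[1], character(1))
  new_class <- vapply(new[shared], function(x) class(x)[1], character(1))
  list(
    only_in_old = setdiff(names(old), names(new)),
    only_in_new = setdiff(names(new), names(old)),
    class_changed = shared[old_class != new_class]
  )
}

# Stand-in data frames; real use would pass create_mock_data() output.
legacy_like <- data.frame(
  age = c(34L, 51L),
  smoking = c("1", "2"),
  stringsAsFactors = FALSE
)
strict_like <- data.frame(
  age = c(40.5, 62.0),
  smoking = c("2", "1"),
  stringsAsFactors = FALSE
)

shape_report <- compare_shape(legacy_like, strict_like)
shape_report$class_changed
```

A report with empty `only_in_old` and `only_in_new` and no class changes suggests the two paths agree on structure, which is the comparison this how-to recommends over row-for-row equality.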
From 9b00ff38e41a2142ff8806b423ef97a01d397832 Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Wed, 20 May 2026 16:47:00 -0400 Subject: [PATCH 6/8] Add backend choice how-to --- _pkgdown.yml | 1 + development/v04-documentation-sprint.md | 2 + vignettes/choosing-a-backend-v04.qmd | 250 ++++++++++++++++++++++++ 3 files changed, 253 insertions(+) create mode 100644 vignettes/choosing-a-backend-v04.qmd diff --git a/_pkgdown.yml b/_pkgdown.yml index 215f895..c6811fe 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -115,6 +115,7 @@ articles: - recodeflow-metadata-v04 - diagnostics-and-garbage-v04 - migrating-from-v03-v04 + - choosing-a-backend-v04 - for-recodeflow-users - title: Explanation diff --git a/development/v04-documentation-sprint.md b/development/v04-documentation-sprint.md index 8e05cfa..59f3381 100644 --- a/development/v04-documentation-sprint.md +++ b/development/v04-documentation-sprint.md @@ -24,6 +24,8 @@ vignette and pkgdown builds. auditing garbage/missing-code post-processing. - `migrating-from-v03-v04.qmd`: how-to for compatibility behavior, fallback routing, diagnostics, and seed differences. +- `choosing-a-backend-v04.qmd`: how-to for native versus optional `simstudy` + backend selection. - README: update the top-level status and quick example so users see v0.4 immediately. - `_pkgdown.yml`: expose the new v0.4 tutorial in site navigation. 
diff --git a/vignettes/choosing-a-backend-v04.qmd b/vignettes/choosing-a-backend-v04.qmd new file mode 100644 index 0000000..30685bc --- /dev/null +++ b/vignettes/choosing-a-backend-v04.qmd @@ -0,0 +1,250 @@ +--- +title: "Choose a MockData v0.4 backend" +format: html +vignette: > + %\VignetteIndexEntry{Choose a MockData v0.4 backend} + %\VignetteEngine{quarto::html} + %\VignetteEncoding{UTF-8} +--- + +```{r} +#| label: setup +#| include: false +input_file <- tryCatch(knitr::current_input(dir = TRUE), error = function(e) NULL) +candidate_roots <- unique(c( + ".", + "..", + if (!is.null(input_file)) file.path(dirname(input_file), "..") +)) +package_root <- NULL +for (candidate in candidate_roots) { + description <- file.path(candidate, "DESCRIPTION") + if (file.exists(description) && + any(grepl("^Package:\\s+MockData\\s*$", readLines(description, warn = FALSE)))) { + package_root <- candidate + break + } +} + +if (!is.null(package_root)) { + devtools::load_all(package_root, quiet = TRUE) +} else { + library(MockData) +} +``` + +::: {.vignette-about} +**About this vignette:** This how-to explains when to use the default native +backend and when to try the optional `simstudy` backend. The `simstudy` examples +run when `simstudy >= 0.8.1` is installed and otherwise render a clear message. +::: + +## The short version + +Use the native backend by default. + +```{r} +spec <- mock_spec( + mock_spec_continuous("age", range = c(18, 85), rtype = "integer"), + mock_spec_categorical( + "smoking", + levels = c("never", "former", "current"), + proportions = c(0.5, 0.3, 0.2), + rtype = "character" + ) +) + +native_data <- generate_mock_data_native(spec, n = 100, seed = 101) +head(native_data) +``` + +The native backend is always available, stays within MockData's MIT-licensed +code, and is the backend used by `create_mock_data()` for supported v0.4 +metadata. 
+ +Use the optional `simstudy` backend when you want to exercise that engine path +or when future MockData features need simulation mechanics that `simstudy` +already provides. + +## Check whether simstudy is available + +MockData keeps `simstudy` optional. It is listed in `Suggests`, not `Imports`, +so installing MockData does not require installing `simstudy`. + +```{r} +simstudy_available <- requireNamespace("simstudy", quietly = TRUE) && + utils::packageVersion("simstudy") >= "0.8.1" + +simstudy_available +``` + +If `simstudy` is unavailable, use `generate_mock_data_native()`. + +```{r} +if (!simstudy_available) { + message( + "The optional simstudy backend is not available in this R environment; ", + "using generate_mock_data_native() is the recommended path." + ) +} +``` + +## Run the same spec through both backends + +For categorical variables and uniform continuous variables, both backends can +generate the baseline data. + +```{r} +native_large <- generate_mock_data_native(spec, n = 2000, seed = 202) + +if (simstudy_available) { + simstudy_large <- generate_mock_data_simstudy(spec, n = 2000, seed = 202) + head(simstudy_large) +} else { + simstudy_large <- NULL +} +``` + +When `simstudy` is installed, compare broad properties rather than expecting +row-for-row equality. The engines use different internals. + +```{r} +if (simstudy_available) { + c( + native_mean_age = mean(native_large$age), + simstudy_mean_age = mean(simstudy_large$age) + ) +} +``` + +```{r} +if (simstudy_available) { + rbind( + native = prop.table(table(factor( + native_large$smoking, + levels = c("never", "former", "current") + ))), + simstudy = prop.table(table(factor( + simstudy_large$smoking, + levels = c("never", "former", "current") + ))) + ) +} +``` + +## Mixed specs are allowed + +The optional backend uses `simstudy` only for pieces it can currently generate +safely. Other variables route through MockData's native backend inside the same +call. 
+ +```{r} +mixed_spec <- mock_spec( + mock_spec_categorical( + "smoking", + levels = c("never", "former", "current"), + proportions = c(0.5, 0.3, 0.2), + rtype = "character" + ), + mock_spec_continuous( + "bmi", + range = c(15, 50), + distribution = "normal", + mean = 27, + sd = 5, + rtype = "double" + ), + mock_spec_date( + "interview_date", + range = as.Date(c("2020-01-01", "2020-12-31")) + ) +) + +mixed_native <- generate_mock_data_native(mixed_spec, n = 100, seed = 303) +head(mixed_native) +``` + +```{r} +if (simstudy_available) { + mixed_simstudy <- generate_mock_data_simstudy(mixed_spec, n = 100, seed = 303) + head(mixed_simstudy) +} +``` + +In this example, `smoking` can be generated through `simstudy`; `bmi` and +`interview_date` stay native because MockData owns the truncated normal and +calendar-date contracts in v0.4. + +## Post-processing stays MockData-owned + +Missing codes, garbage values, and diagnostics are applied after baseline +generation. That is true for both backends. + +```{r} +post_spec <- mock_categorical( + "response", + levels = c("1", "97"), + proportions = c(0.6, 0.4), + rtype = "character", + missing_codes = "97", + missing_proportions = 0.2, + garbage_rules = list(low = list(proportion = 0.1, range = "[-2, 0]")) +) + +native_baseline <- generate_mock_data_native(post_spec, n = 100, seed = 404) +native_processed <- postprocess_mock_data(native_baseline, post_spec, seed = 405) + +names(attr(native_processed, "mockdata_diagnostics")$variables$response) +``` + +```{r} +if (simstudy_available) { + simstudy_baseline <- generate_mock_data_simstudy(post_spec, n = 100, seed = 404) + simstudy_processed <- postprocess_mock_data(simstudy_baseline, post_spec, seed = 405) + + names(attr(simstudy_processed, "mockdata_diagnostics")$variables$response) +} +``` + +The diagnostics shape is the same because post-processing is not delegated to +`simstudy`. + +## License and dependency posture + +MockData is MIT licensed. `simstudy` is GPL-3 licensed. 
Keeping `simstudy` +optional lets MockData keep the core package MIT while still allowing users to +try the advanced backend when that dependency is acceptable in their project. + +If your workflow needs no optional dependency, use: + +```{r} +generate_mock_data_native(spec, n = 10, seed = 1) +``` + +If your workflow explicitly wants to test the optional backend and `simstudy` is +installed, use: + +```{r} +if (simstudy_available) { + generate_mock_data_simstudy(spec, n = 10, seed = 1) +} +``` + +## Decision guide + +Choose the native backend when: + +- you want the default v0.4 behavior; +- you need MockData to work without optional dependencies; +- you are generating categorical, continuous, date, missing-code, or garbage + examples covered by the native pipeline; +- you want the simplest path for package tests and vignettes. + +Try the optional `simstudy` backend when: + +- `simstudy >= 0.8.1` is already acceptable in your project; +- you want to exercise the optional engine path; +- you are preparing for future features where `simstudy` provides mature + simulation mechanics; +- you still want MockData to own missing-code, garbage-value, and diagnostics + semantics after generation. 
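The decision guide can be reduced to a small guard. The sketch below is hypothetical: `pick_backend()` is not a MockData function, and it only returns a label. It assumes the same availability check used earlier in this how-to.

```r
# Hypothetical helper, not part of MockData: return a backend label based on
# the caller's preference and whether the optional dependency is usable.
pick_backend <- function(prefer_simstudy = FALSE, min_version = "0.8.1") {
  simstudy_ok <- requireNamespace("simstudy", quietly = TRUE) &&
    utils::packageVersion("simstudy") >= numeric_version(min_version)
  if (prefer_simstudy && simstudy_ok) "simstudy" else "native"
}

# Default: native, regardless of what is installed.
pick_backend()

# Opt in: "simstudy" only when the package is installed and recent enough.
pick_backend(prefer_simstudy = TRUE)
```

Defaulting to the native label mirrors the guide above: the optional engine is used only when it is both requested and available.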
From d3f47c7c74c02a8e6e3512749e826b1976a30408 Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Wed, 20 May 2026 16:55:43 -0400 Subject: [PATCH 7/8] Add v04 design philosophy vignette --- _pkgdown.yml | 1 + development/v04-documentation-sprint.md | 2 + vignettes/choosing-a-backend-v04.qmd | 2 +- vignettes/design-philosophy-v04.qmd | 293 ++++++++++++++++++++++++ 4 files changed, 297 insertions(+), 1 deletion(-) create mode 100644 vignettes/design-philosophy-v04.qmd diff --git a/_pkgdown.yml b/_pkgdown.yml index c6811fe..b3169e3 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -122,6 +122,7 @@ articles: desc: Understanding concepts and design decisions navbar: Explanation contents: + - design-philosophy-v04 - advanced-topics - title: Reference diff --git a/development/v04-documentation-sprint.md b/development/v04-documentation-sprint.md index 59f3381..4d7d074 100644 --- a/development/v04-documentation-sprint.md +++ b/development/v04-documentation-sprint.md @@ -26,6 +26,8 @@ vignette and pkgdown builds. routing, diagnostics, and seed differences. - `choosing-a-backend-v04.qmd`: how-to for native versus optional `simstudy` backend selection. +- `design-philosophy-v04.qmd`: explanation of v0.4 design choices and scope + boundaries. - README: update the top-level status and quick example so users see v0.4 immediately. - `_pkgdown.yml`: expose the new v0.4 tutorial in site navigation. diff --git a/vignettes/choosing-a-backend-v04.qmd b/vignettes/choosing-a-backend-v04.qmd index 30685bc..d0bff62 100644 --- a/vignettes/choosing-a-backend-v04.qmd +++ b/vignettes/choosing-a-backend-v04.qmd @@ -73,7 +73,7 @@ so installing MockData does not require installing `simstudy`. 
```{r} simstudy_available <- requireNamespace("simstudy", quietly = TRUE) && - utils::packageVersion("simstudy") >= "0.8.1" + utils::packageVersion("simstudy") >= numeric_version("0.8.1") simstudy_available ``` diff --git a/vignettes/design-philosophy-v04.qmd b/vignettes/design-philosophy-v04.qmd new file mode 100644 index 0000000..fc67ded --- /dev/null +++ b/vignettes/design-philosophy-v04.qmd @@ -0,0 +1,293 @@ +--- +title: "MockData v0.4 design philosophy" +format: html +vignette: > + %\VignetteIndexEntry{MockData v0.4 design philosophy} + %\VignetteEngine{quarto::html} + %\VignetteEncoding{UTF-8} +--- + +```{r} +#| label: setup +#| include: false +input_file <- tryCatch(knitr::current_input(dir = TRUE), error = function(e) NULL) +candidate_roots <- unique(c( + ".", + "..", + if (!is.null(input_file)) file.path(dirname(input_file), "..") +)) +package_root <- NULL +for (candidate in candidate_roots) { + description <- file.path(candidate, "DESCRIPTION") + if (file.exists(description) && + any(grepl("^Package:\\s+MockData\\s*$", readLines(description, warn = FALSE)))) { + package_root <- candidate + break + } +} + +if (!is.null(package_root)) { + devtools::load_all(package_root, quiet = TRUE) +} else { + library(MockData) +} +``` + +::: {.vignette-about} +**About this vignette:** This explanation describes why MockData v0.4 is shaped +around `mock_spec`, native generation, optional `simstudy`, and MockData-owned +post-processing. It is not a tutorial; start with the v0.4 getting-started +vignette if you want a first workflow. +::: + +## Mock data, not synthetic data + +MockData generates mock data for package development, QA, documentation, +examples, and training. Its output is meant to exercise code paths. It is not +intended for privacy release, inference, or population-valid statistical +analysis. + +That boundary is deliberate. 
In health-data and survey-data settings, +"synthetic data" can imply privacy review, data-sharing obligations, or +statistical validity claims. MockData avoids that claim. It helps you test a +pipeline before you have access to real data; it does not replace real data for +analysis. + +The working sentence is: + +> Give a recodeflow-style specification a body, so you can test the recoding +> before you have the data. + +## The people v0.4 is trying to serve + +Three user groups shaped the v0.4 design. + +First, recodeflow ecosystem maintainers need data frames that can run through +cchsflow, chmsflow, and recodeflow examples and tests. They already have +`variables.csv` and `variable_details.csv`, so MockData should read those +metadata files rather than invent a competing file format. + +Second, methodologists and package authors need examples and vignettes that run +without restricted data. They may want a small, readable direct API rather than +a full metadata table. + +Third, QA developers need deliberately bad data. Out-of-range ages, declared +missing codes, impossible dates, and invalid category values are not incidental +features; they are the point when testing validation code. + +v0.4 tries to serve all three without making any one workflow the only workflow. + +## Why `mock_spec` exists + +Before v0.4, MockData's generators read metadata, parsed ranges, generated +values, applied missing codes, injected garbage, coerced types, and assembled +columns in one path. That was useful while the package was young, but it made +validation, backend choice, and diagnostics hard to reason about. + +v0.4 introduces `mock_spec` as the normalized internal representation. 
Different +front doors can produce the same spec: + +- direct helpers, such as `mock_continuous()` and `mock_categorical()` +- composable constructors, such as `mock_spec_continuous()` +- recodeflow metadata through `mock_spec_from_recodeflow()` + +The spec is then consumed by generation and post-processing layers. + +```{r} +spec <- mock_spec( + mock_spec_continuous( + "age", + range = c(18, 85), + distribution = "normal", + mean = 50, + sd = 12, + rtype = "integer" + ), + mock_spec_categorical( + "smoking", + levels = c("never", "former", "current"), + proportions = c(0.5, 0.3, 0.2), + rtype = "character" + ) +) + +names(spec$variables) +``` + +This is the main architectural move: parse once, validate once, then generate +from the normalized shape. + +## Two tiers, one model + +The direct helpers are there for the first ten minutes. + +```{r} +one_variable <- mock_continuous( + "age", + range = c(18, 85), + distribution = "normal", + mean = 50, + sd = 12, + rtype = "integer" +) + +names(one_variable$variables) +``` + +The lower-level constructors are there when you want to compose multiple +variables or build adapters. + +```{r} +same_variable <- mock_spec( + mock_spec_continuous( + "age", + range = c(18, 85), + distribution = "normal", + mean = 50, + sd = 12, + rtype = "integer" + ) +) + +names(same_variable$variables) +``` + +Those are two surface syntaxes for the same internal model. That is why the +package can support small hand-written examples and recodeflow metadata without +duplicating generation logic. + +## Why the backend is hybrid + +MockData v0.4 has a native backend and an optional `simstudy` backend. + +The native backend is the default. It is always available, keeps MockData usable +without optional dependencies, and owns the simple cases that are central to the +package: categorical values, continuous values, dates, missing-code semantics, +garbage values, and diagnostics. + +`simstudy` is optional. 
It is a mature GPL-3 simulation package with useful +machinery for future advanced features, but MockData remains MIT licensed by +keeping `simstudy` in `Suggests` and soft-gating the backend. + +```{r} +native_data <- generate_mock_data_native(spec, n = 5, seed = 1) +native_data +``` + +```{r} +simstudy_available <- requireNamespace("simstudy", quietly = TRUE) && + utils::packageVersion("simstudy") >= numeric_version("0.8.1") + +if (simstudy_available) { + generate_mock_data_simstudy(spec, n = 5, seed = 1) +} else { + message("simstudy is not installed; the native backend remains available.") +} +``` + +This split is intentionally conservative. MockData should not reimplement a +large simulation library when a good one exists, but it also should not make a +GPL-3 package mandatory for users who only need the core mock-data path. + +## Why post-processing is separate + +Missing codes and garbage values are not just another distribution. They are QA +semantics layered on top of otherwise valid generated data. + +v0.4 therefore generates baseline values first, then applies missing-code and +garbage rules in a separate post-processing pass. + +```{r} +qa_spec <- mock_categorical( + "response", + levels = c("1", "97"), + proportions = c(0.7, 0.3), + rtype = "character", + missing_codes = "97", + missing_proportions = 0.2 +) + +baseline <- generate_mock_data_native(qa_spec, n = 100, seed = 11) +processed <- postprocess_mock_data(baseline, qa_spec, seed = 12) + +diagnostics <- attr(processed, "mockdata_diagnostics") +names(diagnostics$variables$response) +``` + +The diagnostics matter because a value can naturally collide with a declared +missing code. In the example above, `97` is both a valid level and a missing +code. MockData records which rows naturally drew `97` and which rows were +assigned `97` during post-processing. 
+ +```{r} +response_diag <- diagnostics$variables$response + +c( + preexisting = length(response_diag$preexisting_missing_code_indices), + assigned = length(response_diag$assigned_missing_indices) +) +``` + +That distinction is what makes the output auditable for QA workflows. + +## Why strictness increased + +Earlier MockData versions were often permissive: warn, skip a variable, and +return whatever could be generated. That behavior was convenient in exploratory +work, but risky in package tests and documentation. A silently missing column +can make a vignette or downstream test look successful while testing the wrong +thing. + +v0.4 moves toward strict generation for the new pipeline. Unsupported features +should either fail loudly or route through an explicit compatibility path. + +`create_mock_data()` keeps compatibility by retaining legacy fallback routes, +especially for `validate = FALSE`, `variable_details = NULL`, detail-level +`databaseStart` filtering, and unsupported native-backend features. Use +`verbose = TRUE` while migrating so the chosen path is visible. + +## What is deliberately deferred + +Several features are intentionally not solved in v0.4. + +Formula-derived variables are detected and kept loud rather than silently +ignored. They need a dependency-aware evaluator, sandboxing rules, and clear +syntax. + +Multi-variable correlation and richer joint distributions are future work. +`simstudy` is one possible engine for those features, but v0.4 does not claim to +generate statistically realistic joint distributions. + +Table 1 bootstrap is also future work. It is a natural third adapter: take +published descriptive statistics and produce a `mock_spec`. That is useful, but +it should not be squeezed into the recodeflow adapter. + +LinkML or another schema-first model remains a possible north star for the +larger recodeflow ecosystem. v0.4 keeps the internal spec abstract enough that a +future schema adapter could produce it. 
+ +These are roadmap items, not hidden guarantees. + +## How the v0.4 refactor was reviewed + +The v0.4 architecture was developed through a spike, milestone PRs, and repeated +review of code, tests, silent-failure paths, and documentation. That process +changed the design in concrete ways: + +- strict-by-default behavior became more important than permissive fallback; +- diagnostics became a first-class auditability contract; +- `simstudy` stayed optional to preserve MockData's dependency and license + posture; +- executable vignettes became part of validation, not just prose. + +The development notes in `development/` and `.tmp/` preserve more of that review +trail for maintainers. This vignette distills the user-facing design choices. + +## The design in one paragraph + +MockData v0.4 normalizes inputs into `mock_spec`, validates that shape, generates +baseline values through a native backend or optional `simstudy` backend, and +then applies MockData-owned post-processing for missing codes, garbage values, +and diagnostics. It keeps recodeflow metadata central, adds simpler direct APIs, +and preserves the public `create_mock_data()` wrapper for compatibility. It is +mock data for development and QA, not synthetic data for inference. 
From 5907d55107e09b34575b7d9176515f7065ad723a Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Thu, 21 May 2026 13:47:58 -0400 Subject: [PATCH 8/8] Prepare v04 docs for tag --- DESCRIPTION | 2 +- README.md | 4 ++-- development/adr/v04-hybrid-backend.md | 27 ++++++++++++++++--------- development/simstudy-v04.md | 22 +++++++++++++++++++- development/v04-documentation-sprint.md | 4 ++++ vignettes/design-philosophy-v04.qmd | 20 ++++++++++-------- 6 files changed, 57 insertions(+), 22 deletions(-) diff --git a/DESCRIPTION b/DESCRIPTION index b10b0fc..99e61e5 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,6 +1,6 @@ Package: MockData Title: Generate Mock Data from Metadata Specifications -Version: 0.4.0.9000 +Version: 0.4.0 Authors@R: c( person("Juan", "Li", role = "aut", email = "juli@ohri.ca"), person("Douglas", "Manuel", role = c("aut", "cre"), email = "dmanuel@ohri.ca"), diff --git a/README.md b/README.md index 715f74a..3b19f31 100644 --- a/README.md +++ b/README.md @@ -3,13 +3,13 @@ [![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental) -[![Version: 0.4.0-dev](https://img.shields.io/badge/version-0.4.0--dev-blue.svg)](https://github.com/Big-Life-Lab/MockData) +[![Version: 0.4.0](https://img.shields.io/badge/version-0.4.0-blue.svg)](https://github.com/Big-Life-Lab/MockData) [![pkgdown](https://github.com/Big-Life-Lab/MockData/actions/workflows/pkgdown.yaml/badge.svg)](https://github.com/Big-Life-Lab/MockData/actions/workflows/pkgdown.yaml) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) -**Status: Experimental, pre-release software** +**Status: Experimental v0.4.0 release candidate** MockData is a work-in-progress R package for generating mock testing data from small metadata specifications. 
The `dev` branch now contains the v0.4 diff --git a/development/adr/v04-hybrid-backend.md b/development/adr/v04-hybrid-backend.md index 63fc032..9d9b71f 100644 --- a/development/adr/v04-hybrid-backend.md +++ b/development/adr/v04-hybrid-backend.md @@ -1,7 +1,7 @@ # ADR: v0.4 Hybrid Backend Architecture -**Status**: draft -**Date**: 2026-05-18 +**Status**: accepted and implemented in PR #28 +**Date**: 2026-05-18 **Decision owner**: MockData maintainers ## Context @@ -23,6 +23,9 @@ MockData-specific semantics as post-processing. Three review rounds converged on the same conclusion: the hybrid architecture is ready for production refactor planning. +The production refactor was implemented in PR #28 and merged to `dev` for +sibling-package testing before a v0.4.0 tag. + ## Decision MockData v0.4 will move toward a hybrid backend architecture: @@ -92,18 +95,22 @@ Tradeoffs: formula syntax, custom distribution registry, and correlation merging. - Maintaining wrappers will add short-term complexity. -## Implementation Direction +## Implementation Status -Production refactor should proceed in layers: +The production refactor proceeded in layers: 1. `mock_spec` constructors and validators. 2. Direct and recodeflow input adapters. -3. Formula/dependency evaluator. -4. Native backend. -5. Post-processing layer. -6. Promotion of spike assertions to `testthat`. -7. Optional `simstudy` backend. -8. Current API wrappers. +3. Native backend. +4. Post-processing layer and diagnostics. +5. Promotion of spike assertions to `testthat`. +6. Optional `simstudy` backend. +7. Current API wrappers. +8. Divio documentation sprint and Phase C maintainer communication. + +Formula/dependency evaluation, multi-group correlations, Table 1 adapters, and +schema-first integration remain deferred roadmap items rather than v0.4.0 +commitments. 
## Open Follow-Up Decisions diff --git a/development/simstudy-v04.md b/development/simstudy-v04.md index 5772839..6d2a85d 100644 --- a/development/simstudy-v04.md +++ b/development/simstudy-v04.md @@ -1,9 +1,15 @@ # MockData v0.4 Production Refactor Plan +**Status**: implemented in PR #28 and superseded by the v0.4 documentation +sprint. This document is retained as the production-refactor plan and should be +read as historical implementation context rather than an active task list. + ## 1. Write The ADR First Write a short architecture decision record before production code changes. +**Status**: complete. See `development/adr/v04-hybrid-backend.md`. + The ADR should lock these decisions: - **Decision**: MockData adopts a hybrid backend architecture. @@ -31,6 +37,9 @@ The ADR should lock these decisions: Each layer should have focused tests before the next layer starts. +**Status**: complete for the v0.4.0 scope. Formula/dependency evaluation, +multi-group correlation, and Table 1 input remain deferred roadmap items. + 1. **`mock_spec` core** - Constructors and validators. - Stable fields for names, types, ranges, levels, proportions, missing codes, @@ -81,6 +90,10 @@ Each layer should have focused tests before the next layer starts. ## 3. Keep The Current API Alive +**Status**: complete. The v0.3 public functions remain available, and +`create_mock_data()` now routes supported metadata through the v0.4 pipeline +while preserving legacy fallback paths. + Existing public functions should remain available in v0.4.0: - `create_mock_data()` @@ -95,6 +108,11 @@ synchronized release. ## 4. Carry-Forward Design Issues +**Status**: partly resolved. The diagnostics shape, seed discipline, native vs +`simstudy` parity tests, and optional `simstudy` posture were settled for v0.4.0. +The remaining items below should be treated as v0.5+ roadmap candidates or issue +backlog material. + Settle in the ADR or the first design note: - Multi-group correlation merge strategy. 
@@ -116,6 +134,9 @@ Track as implementation issues: ## 5. Communication +**Status**: complete as a draft communication artifact. See +`development/v04-phase-c-comms-note.md`. + Before v0.4.0 lands, write a short communication note for cchsflow, chmsflow, and recodeflow maintainers: @@ -125,4 +146,3 @@ and recodeflow maintainers: - What migration is optional in v0.4.0. - When deprecation warnings may begin. - How the mock-data framing remains distinct from synthetic-data release. - diff --git a/development/v04-documentation-sprint.md b/development/v04-documentation-sprint.md index 4d7d074..7293021 100644 --- a/development/v04-documentation-sprint.md +++ b/development/v04-documentation-sprint.md @@ -1,5 +1,9 @@ # MockData v0.4 Documentation Sprint +**Status**: complete for the v0.4.0 documentation sprint. Remaining work before +tagging is package checks, maintainer smoke testing, and any follow-up edits from +review. + This sprint treats documentation as implementation validation. The goal is not only to explain the v0.4 API, but to run realistic user workflows during vignette and pkgdown builds. diff --git a/vignettes/design-philosophy-v04.qmd b/vignettes/design-philosophy-v04.qmd index fc67ded..8af0afe 100644 --- a/vignettes/design-philosophy-v04.qmd +++ b/vignettes/design-philosophy-v04.qmd @@ -278,16 +278,20 @@ changed the design in concrete ways: - diagnostics became a first-class auditability contract; - `simstudy` stayed optional to preserve MockData's dependency and license posture; +- a Phase C communication note made sibling-package testing part of the release + process; - executable vignettes became part of validation, not just prose. -The development notes in `development/` and `.tmp/` preserve more of that review -trail for maintainers. This vignette distills the user-facing design choices. +The development notes in `development/` and maintainer-only review notes +preserve more of that review trail. This vignette distills the user-facing +design choices. 
## The design in one paragraph -MockData v0.4 normalizes inputs into `mock_spec`, validates that shape, generates -baseline values through a native backend or optional `simstudy` backend, and -then applies MockData-owned post-processing for missing codes, garbage values, -and diagnostics. It keeps recodeflow metadata central, adds simpler direct APIs, -and preserves the public `create_mock_data()` wrapper for compatibility. It is -mock data for development and QA, not synthetic data for inference. +MockData v0.4 normalizes inputs into `mock_spec`, validates that shape strictly +by default, generates baseline values through a native backend or optional +`simstudy` backend, and then applies MockData-owned post-processing for missing +codes, garbage values, and diagnostics. It keeps recodeflow metadata central, +adds simpler direct APIs, and preserves the public `create_mock_data()` wrapper +for compatibility. It is mock data for development and QA, not synthetic data +for inference.
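
The stages named in that paragraph can be sketched in plain R. Every function and field name below (`validate_spec_sketch()`, `generate_baseline_sketch()`, `postprocess_sketch()`, `missing_proportion`) is hypothetical and exists only for this illustration. None of it is MockData's API, and the real `mock_spec` object may be shaped quite differently.

```r
# Hypothetical helpers for illustration only; none of these are MockData functions.
validate_spec_sketch <- function(spec) {
  # Strict by default: stop on a malformed spec instead of warning and skipping.
  if (length(spec$levels) != length(spec$proportions)) {
    stop("levels and proportions must have the same length", call. = FALSE)
  }
  if (abs(sum(spec$proportions) - 1) > 1e-8) {
    stop("proportions must sum to 1", call. = FALSE)
  }
  invisible(spec)
}

generate_baseline_sketch <- function(spec, n, seed) {
  # Baseline values only: no missing codes are injected at this stage.
  set.seed(seed)
  sample(spec$levels, size = n, replace = TRUE, prob = spec$proportions)
}

postprocess_sketch <- function(values, spec, seed) {
  set.seed(seed)
  k <- round(spec$missing_proportion * length(values))
  idx <- sort(sample.int(length(values), size = k))
  values[idx] <- spec$missing_code
  # Attach diagnostics so the post-processing step stays auditable.
  attr(values, "diagnostics") <- list(assigned_missing_indices = idx)
  values
}

spec <- list(
  levels = c("1", "2"),
  proportions = c(0.6, 0.4),
  missing_code = "97",
  missing_proportion = 0.1
)

validate_spec_sketch(spec)
baseline <- generate_baseline_sketch(spec, n = 50, seed = 1)
processed <- postprocess_sketch(baseline, spec, seed = 2)
length(attr(processed, "diagnostics")$assigned_missing_indices)
```

The point of the sketch is the ordering: validation fails loudly before any data exists, baseline generation knows nothing about missing codes, and the post-processing step is the only place that records what it changed.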