From a7c224eb5d79de345c81edf22cafa9f15c96459f Mon Sep 17 00:00:00 2001
From: Doug Manuel <DougManuel@users.noreply.github.com>
Date: Wed, 20 May 2026 09:10:04 -0400
Subject: [PATCH 1/8] Start v04 documentation sprint

---
 DESCRIPTION                             |   2 +-
 README.md                               |  50 ++++++-
 _pkgdown.yml                            |   1 +
 development/v04-documentation-sprint.md |  57 +++++++
 vignettes/getting-started-v04.qmd       | 189 ++++++++++++++++++++++++
 5 files changed, 292 insertions(+), 7 deletions(-)
 create mode 100644 development/v04-documentation-sprint.md
 create mode 100644 vignettes/getting-started-v04.qmd
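
Note for reviewers: the new vignette describes baseline generation as producing
"values within the intended valid space". As a rough illustration of what that
can mean for the `age` example (normal, mean 50, sd 12, range 18 to 85, integer
type), here is a minimal base-R sketch that uses rejection sampling. The helper
name `draw_bounded_normal_int()` is hypothetical, and this is an assumption
about one possible approach, not a description of how
`generate_mock_data_native()` works internally.

```r
# Hypothetical helper, not part of MockData: draw integers from a normal
# distribution and keep only the draws that fall inside the valid range.
draw_bounded_normal_int <- function(n, mean, sd, range, seed) {
  set.seed(seed)
  kept <- numeric(0)
  while (length(kept) < n) {
    draws <- round(rnorm(n, mean = mean, sd = sd))
    # Discard out-of-range draws instead of clamping them to the bounds.
    kept <- c(kept, draws[draws >= range[1] & draws <= range[2]])
  }
  as.integer(kept[seq_len(n)])
}

age <- draw_bounded_normal_int(100, mean = 50, sd = 12,
                               range = c(18, 85), seed = 101)
range(age)
```

Rejection avoids the pile-up at the bounds that clamping would create, at the
cost of a few extra draws.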

diff --git a/DESCRIPTION b/DESCRIPTION
index dbb0f20..b10b0fc 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: MockData
 Title: Generate Mock Data from Metadata Specifications
-Version: 0.3.0
+Version: 0.4.0.9000
 Authors@R: c(
     person("Juan", "Li", role = "aut", email = "juli@ohri.ca"),
     person("Douglas", "Manuel", role = c("aut", "cre"), email = "dmanuel@ohri.ca"),
diff --git a/README.md b/README.md
index fef0d79..715f74a 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@
 <!-- badges: start -->
 
 [![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
-[![Version: 0.3.0](https://img.shields.io/badge/version-0.3.0-blue.svg)](https://github.com/Big-Life-Lab/MockData)
+[![Version: 0.4.0-dev](https://img.shields.io/badge/version-0.4.0--dev-blue.svg)](https://github.com/Big-Life-Lab/MockData)
 [![pkgdown](https://github.com/Big-Life-Lab/MockData/actions/workflows/pkgdown.yaml/badge.svg)](https://github.com/Big-Life-Lab/MockData/actions/workflows/pkgdown.yaml)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
@@ -12,10 +12,12 @@
 **Status: Experimental, pre-release software**
 
 MockData is a work-in-progress R package for generating mock testing data from
-small metadata specifications. It is useful
-today for development and documentation workflows, especially when paired
-with recodeflow-style metadata (see below), but it should be treated as experimental
-infrastructure rather than a stable released package.
+small metadata specifications. The `dev` branch now contains the v0.4
+`mock_spec` architecture: direct specification helpers, a recodeflow metadata
+adapter, native generation, optional `simstudy` generation, and post-processing
+diagnostics. It is useful today for development and documentation workflows,
+especially when paired with recodeflow-style metadata (see below), but it should
+be treated as experimental infrastructure rather than a stable released package.
 
 People are using MockData and reporting that it is helpful. We take that as an
 encouraging signal, not as evidence that the package is mature. Please review
@@ -33,10 +35,45 @@ the generated data before using it in any workflow that matters.
 **Current development limitations:**
 
 - APIs may change before a formal release
-- Error handling is too permissive and can fail with warnings instead of stopping
+- Some legacy v0.3-compatible paths still fall back with warnings; the v0.4
+  `mock_spec` path is stricter and records diagnostics
 - The test suite does not yet cover every important edge case
 - Generated data should be manually checked against your intended metadata rules
 
+**v0.4 direct API example**
+
+The v0.4 API separates specification, baseline generation, and post-processing.
+That makes the generated values easier to inspect and audit.
+
+```r
+library(MockData)
+
+spec <- mock_spec(
+  mock_spec_continuous(
+    "age",
+    range = c(18, 85),
+    distribution = "normal",
+    mean = 50,
+    sd = 12,
+    rtype = "integer"
+  ),
+  mock_spec_categorical(
+    "smoking",
+    levels = c("never", "former", "current"),
+    proportions = c(0.5, 0.3, 0.2),
+    rtype = "character",
+    missing_codes = "unknown",
+    missing_proportions = 0.05
+  )
+)
+
+baseline <- generate_mock_data_native(spec, n = 100, seed = 1)
+mock_data <- postprocess_mock_data(baseline, spec, seed = 2)
+
+head(mock_data)
+attr(mock_data, "mockdata_diagnostics")$variables$smoking
+```
+
 **30-second standalone example**
 
 For a quick numeric variable, `create_con_var()` can use two small
@@ -221,6 +258,7 @@ devtools::install_local("~/github/mock-data")
 
 **Tutorials:**
 
+- [v0.4 getting started](vignettes/getting-started-v04.qmd) - Direct `mock_spec`, recodeflow adapter, and diagnostics workflow
 - [Getting started](vignettes/getting-started.qmd) - Complete tutorial from single variables to full datasets
 - [For recodeflow users](vignettes/for-recodeflow-users.qmd) - Using MockData with existing metadata
 - [Survival data](vignettes/tutorial-survival-data.qmd) - Time-to-event data and temporal patterns
diff --git a/_pkgdown.yml b/_pkgdown.yml
index d17d53a..8c02821 100644
--- a/_pkgdown.yml
+++ b/_pkgdown.yml
@@ -100,6 +100,7 @@ articles:
   desc: Learning-oriented step-by-step guides
   navbar: Tutorials
   contents:
+  - getting-started-v04
   - getting-started
   - tutorial-categorical-continuous
   - tutorial-dates
diff --git a/development/v04-documentation-sprint.md b/development/v04-documentation-sprint.md
new file mode 100644
index 0000000..ae78e00
--- /dev/null
+++ b/development/v04-documentation-sprint.md
@@ -0,0 +1,57 @@
+# MockData v0.4 Documentation Sprint
+
+This sprint treats documentation as implementation validation. The goal is not
+only to explain the v0.4 API, but to run realistic user workflows during
+vignette and pkgdown builds.
+
+## Principles
+
+- Use Divio's four documentation needs: tutorials, how-to guides, reference, and
+  explanation.
+- Keep vignette code executable unless the code genuinely depends on an
+  external package or private data.
+- Prefer small, focused vignettes over one large tour.
+- Use seeds in every stochastic example so rendered output is stable.
+- Include at least one diagnostics example because the v0.4 pipeline's
+  auditability contract is a central design change.
+
+## First pass
+
+- `getting-started-v04.qmd`: tutorial for the v0.4 `mock_spec` workflow.
+- README: update the top-level status and quick example so users see v0.4
+  immediately.
+- `_pkgdown.yml`: expose the new v0.4 tutorial in site navigation.
+
+## Follow-up vignettes
+
+Tutorial:
+
+- `getting-started-v04.qmd`: linear first-use path.
+
+How-to:
+
+- `recodeflow-metadata-v04.qmd`: use existing `variables.csv` and
+  `variable_details.csv`.
+- `diagnostics-and-garbage-v04.qmd`: inspect missing-code and garbage
+  diagnostics.
+- `choosing-a-backend-v04.qmd`: native vs optional `simstudy`.
+- `migrating-from-v03-v04.qmd`: seed behavior, diagnostics attribute,
+  fallback conditions, and compatibility wrappers.
+
+Explanation:
+
+- `design-philosophy-v04.qmd`: distill the architecture review, hybrid backend
+  decision, and mock-data versus synthetic-data boundary.
+
+Reference:
+
+- Keep roxygen pages and `_pkgdown.yml` synchronized with exported functions.
+- Keep `NEWS.md` as the release-note source of truth.
+
+## Review checklist
+
+- Does every vignette render locally?
+- Does every code chunk either run or clearly justify `eval: false`?
+- Does each vignette commit to one Divio purpose?
+- Do examples use the public API exactly as users should use it?
+- Are error messages and diagnostics understandable in rendered output?
diff --git a/vignettes/getting-started-v04.qmd b/vignettes/getting-started-v04.qmd
new file mode 100644
index 0000000..ac95417
--- /dev/null
+++ b/vignettes/getting-started-v04.qmd
@@ -0,0 +1,189 @@
+---
+title: "Getting started with MockData v0.4"
+format: html
+vignette: >
+  %\VignetteIndexEntry{Getting started with MockData v0.4}
+  %\VignetteEngine{quarto::html}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r}
+#| label: setup
+#| include: false
+input_file <- tryCatch(knitr::current_input(dir = TRUE), error = function(e) NULL)
+candidate_roots <- unique(c(
+  ".",
+  "..",
+  if (!is.null(input_file)) file.path(dirname(input_file), "..")
+))
+package_root <- NULL
+for (candidate in candidate_roots) {
+  description <- file.path(candidate, "DESCRIPTION")
+  if (file.exists(description) &&
+      any(grepl("^Package:\\s+MockData\\s*$", readLines(description, warn = FALSE)))) {
+    package_root <- candidate
+    break
+  }
+}
+
+if (!is.null(package_root)) {
+  devtools::load_all(package_root, quiet = TRUE)
+} else {
+  library(MockData)
+}
+```
+
+::: {.vignette-about}
+**About this vignette:** This tutorial introduces the v0.4 `mock_spec`
+workflow. All code is executed when the vignette builds, so this page also
+serves as a user-flow test for the public API.
+:::
+
+## The v0.4 workflow
+
+MockData v0.4 separates data generation into three steps:
+
+1. Create a `mock_spec`
+2. Generate baseline valid values
+3. Apply missing-code and garbage-value post-processing
+
+That separation makes it easier to inspect what was requested and what was
+changed after generation.
+
+## Specify variables directly
+
+Use the direct helpers when you want a small mock dataset without creating CSV
+metadata files.
+
+```{r}
+spec <- mock_spec(
+  mock_spec_continuous(
+    "age",
+    range = c(18, 85),
+    distribution = "normal",
+    mean = 50,
+    sd = 12,
+    rtype = "integer"
+  ),
+  mock_spec_categorical(
+    "smoking",
+    levels = c("never", "former", "current"),
+    proportions = c(0.5, 0.3, 0.2),
+    rtype = "character",
+    missing_codes = "unknown",
+    missing_proportions = 0.05
+  )
+)
+
+validate_mock_spec(spec)
+```
+
+Generate baseline values first. These are values within the intended valid
+space.
+
+```{r}
+baseline <- generate_mock_data_native(spec, n = 100, seed = 101)
+head(baseline)
+```
+
+Then apply missing-code and garbage-value rules. This step adds diagnostics as
+an attribute on the returned data frame.
+
+```{r}
+mock_data <- postprocess_mock_data(baseline, spec, seed = 102)
+head(mock_data)
+```
+
+```{r}
+diagnostics <- attr(mock_data, "mockdata_diagnostics")
+diagnostics$variables$smoking
+```
+
+The diagnostics distinguish values assigned by post-processing from values that
+were drawn naturally during baseline generation.
+
+## Use recodeflow metadata
+
+If you already have recodeflow-style `variables` and `variable_details`
+metadata, adapt those tables to the same `mock_spec` shape.
+
+```{r}
+variables <- data.frame(
+  variable = c("age", "smoking"),
+  variableType = c("Continuous", "Categorical"),
+  rType = c("integer", "character"),
+  role = c("enabled", "enabled"),
+  stringsAsFactors = FALSE
+)
+
+variable_details <- data.frame(
+  variable = c("age", "smoking", "smoking", "smoking"),
+  recStart = c("[18, 85]", "1", "2", "97"),
+  recEnd = c("copy", "copy", "copy", "NA::b"),
+  proportion = c(1, 0.6, 0.3, 0.1),
+  stringsAsFactors = FALSE
+)
+
+spec_from_metadata <- mock_spec_from_recodeflow(variables, variable_details)
+names(spec_from_metadata$variables)
+```
+
+The adapter preserves recodeflow semantics: valid ranges, categorical
+proportions, `recEnd` missing-code rows, and garbage settings.
+
+```{r}
+metadata_baseline <- generate_mock_data_native(
+  spec_from_metadata,
+  n = 100,
+  seed = 201
+)
+
+metadata_mock <- postprocess_mock_data(
+  metadata_baseline,
+  spec_from_metadata,
+  seed = 202
+)
+
+head(metadata_mock)
+```
+
+```{r}
+metadata_diag <- attr(metadata_mock, "mockdata_diagnostics")
+metadata_diag$variables$smoking$assigned_missing_indices[1:5]
+```
+
+## Use the compatibility wrapper
+
+`create_mock_data()` remains available for v0.3-style workflows. In strict mode
+(`validate = TRUE`, the default), it attempts the v0.4 pipeline for supported
+metadata and falls back to the legacy `create_*` dispatch path for features
+that are not yet supported by the v0.4 native backend.
+
+```{r}
+wrapped <- create_mock_data(
+  databaseStart = "example",
+  variables = variables,
+  variable_details = variable_details,
+  n = 100,
+  seed = 301
+)
+
+head(wrapped)
+```
+
+```{r}
+!is.null(attr(wrapped, "mockdata_diagnostics"))
+```
+
+When the v0.4 path is used, the result carries a `mockdata_diagnostics`
+attribute. Legacy fallback paths return plain data frames without it.
+
+## Choosing the next function
+
+- Use `mock_*()` or `mock_spec_*()` for small examples and tests.
+- Use `mock_spec_from_recodeflow()` when metadata already exists.
+- Use `generate_mock_data_native()` for the default MIT-licensed backend.
+- Use `generate_mock_data_simstudy()` only when the optional `simstudy` package
+  is installed and you want to test that backend.
+- Use `postprocess_mock_data()` when you need missing codes, garbage values, or
+  diagnostics.

From 028692f73506b8ebf724b86f2327792035ea6726 Mon Sep 17 00:00:00 2001
From: Doug Manuel <DougManuel@users.noreply.github.com>
Date: Wed, 20 May 2026 12:41:06 -0400
Subject: [PATCH 2/8] Add v04 maintainer communication note

---
 development/v04-documentation-sprint.md |   5 +
 development/v04-phase-c-comms-note.md   | 151 ++++++++++++++++++++++++
 2 files changed, 156 insertions(+)
 create mode 100644 development/v04-phase-c-comms-note.md
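
Note for reviewers: the compatibility notes in this patch say v0.4 uses the
requested seed for baseline generation and `seed + 1L` for post-processing.
The base-R sketch below illustrates why that split is useful. The function
`mock_pipeline()` is a hypothetical stand-in, not MockData code, and the
sampling details are assumptions chosen only to make the point visible.

```r
# Hypothetical stand-in for "generate baseline, then post-process".
mock_pipeline <- function(n, seed, missing_prop) {
  # Baseline draws depend only on `seed`.
  set.seed(seed)
  baseline <- sample(c("never", "former", "current"), n,
                     replace = TRUE, prob = c(0.5, 0.3, 0.2))

  # Post-processing draws depend only on `seed + 1L`.
  set.seed(seed + 1L)
  assigned <- sort(sample.int(n, size = round(n * missing_prop)))

  data <- baseline
  data[assigned] <- "unknown"
  list(baseline = baseline, data = data, assigned_missing_indices = assigned)
}

run_a <- mock_pipeline(100, seed = 123L, missing_prop = 0.05)
run_b <- mock_pipeline(100, seed = 123L, missing_prop = 0.20)

# Changing the missing-code proportion leaves the baseline untouched.
identical(run_a$baseline, run_b$baseline)

# Repeating a call with the same arguments reproduces everything.
identical(run_a, mock_pipeline(100, seed = 123L, missing_prop = 0.05))
```

Both comparisons return `TRUE` in this sketch, because each stage reseeds
before it draws.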

diff --git a/development/v04-documentation-sprint.md b/development/v04-documentation-sprint.md
index ae78e00..4c138d7 100644
--- a/development/v04-documentation-sprint.md
+++ b/development/v04-documentation-sprint.md
@@ -21,6 +21,8 @@ vignette and pkgdown builds.
 - README: update the top-level status and quick example so users see v0.4
   immediately.
 - `_pkgdown.yml`: expose the new v0.4 tutorial in site navigation.
+- `v04-phase-c-comms-note.md`: maintainer-facing note for cchsflow,
+  chmsflow, and recodeflow testing while v0.4 sits on `dev`.
 
 ## Follow-up vignettes
 
@@ -42,6 +44,9 @@ Explanation:
 
 - `design-philosophy-v04.qmd`: distill the architecture review, hybrid backend
   decision, and mock-data versus synthetic-data boundary.
+- `development/v04-phase-c-comms-note.md`: Phase C maintainer communication
+  source material; fold relevant parts into migration and recodeflow how-to
+  docs after maintainer feedback.
 
 Reference:
 
diff --git a/development/v04-phase-c-comms-note.md b/development/v04-phase-c-comms-note.md
new file mode 100644
index 0000000..0cf79e2
--- /dev/null
+++ b/development/v04-phase-c-comms-note.md
@@ -0,0 +1,151 @@
+# MockData v0.4 Phase C Maintainer Communication Note
+
+**Audience**: cchsflow, chmsflow, recodeflow, and MockData maintainers
+**Status**: draft for maintainer review
+**Branch for testing**: `dev`
+
+## Short Version
+
+MockData v0.4 is now available on the `dev` branch for maintainer testing. The
+main change is architectural: MockData now normalizes inputs into a `mock_spec`,
+then generates data with a native backend or an optional `simstudy` backend,
+followed by a MockData-owned post-processing layer for missing codes, garbage
+values, and diagnostics.
+
+The goal is to make MockData more reliable for recodeflow-style metadata while
+preserving the existing public API. No sibling package needs to migrate
+immediately for v0.4.0.
+
+## What Changed
+
+- `mock_spec` is the normalized internal representation for mock-data
+  specifications.
+- New direct helper APIs exist for simple use cases:
+  `mock_continuous()`, `mock_categorical()`, and `mock_date()`.
+- `mock_spec_from_recodeflow()` converts recodeflow-style `variables.csv` and
+  `variable_details.csv` metadata into a `mock_spec`.
+- `generate_mock_data_native()` generates data from `mock_spec` without optional
+  dependencies.
+- `postprocess_mock_data()` applies MockData-owned missing-code and garbage-data
+  semantics and attaches a `mockdata_diagnostics` attribute.
+- `generate_mock_data_simstudy()` is available for supported advanced cases when
+  `simstudy >= 0.8.1` is installed.
+- `create_mock_data()` now routes supported metadata through the v0.4 pipeline
+  and falls back to the legacy path for unsupported or explicitly lenient cases.
+
+## What Did Not Change
+
+- Existing v0.3 public functions remain available in v0.4.0.
+- `create_mock_data()` keeps its existing signature.
+- MockData remains focused on mock data for package development, QA,
+  documentation, examples, and training.
+- MockData is not positioning v0.4 as synthetic data for privacy release,
+  inference, or population-valid analysis.
+- MockData remains MIT licensed. `simstudy` is GPL-3 and optional through
+  `Suggests`, not a required dependency.
+- No public function removals are planned before v0.5.0.
+
+## Compatibility Notes
+
+- `validate = TRUE` is the default strict path. It uses the v0.4 pipeline when
+  the requested metadata is supported.
+- `validate = FALSE` deliberately uses the legacy, more permissive path.
+- `variable_details = NULL` also uses the legacy fallback path.
+- The v0.4 path returns a regular data frame with an optional
+  `mockdata_diagnostics` attribute. Legacy fallback output does not include this
+  attribute.
+- Seeded output may differ from v0.3 even when the same seed is supplied.
+  v0.4 uses the requested seed for baseline generation and `seed + 1L` for
+  post-processing so missing-code and garbage injection are reproducible without
+  sharing the same RNG stream as baseline generation.
+- Formula-derived variables, multi-group correlations, and advanced survival
+  models are intentionally deferred. Unsupported cases should either fail loudly
+  or route through the legacy path, depending on the public entry point.
+
+## What We Need Maintainers To Test
+
+Please test against representative metadata from cchsflow, chmsflow, and
+recodeflow projects, especially files that include:
+
+- categorical variables with `recEnd` missing-code semantics;
+- continuous variables with ranges or distribution parameters;
+- date variables;
+- garbage or invalid-value rules;
+- role and `databaseStart` filtering;
+- any variables that sibling packages expect MockData to generate today.
+
+Suggested smoke test:
+
+```r
+devtools::load_all()
+
+vars <- read.csv("path/to/variables.csv")
+details <- read.csv("path/to/variable_details.csv")
+
+mock <- create_mock_data(
+  variables = vars,
+  variable_details = details,
+  databaseStart = "cycle1",
+  n = 100,
+  seed = 123,
+  validate = TRUE,
+  verbose = TRUE
+)
+
+str(mock)
+attr(mock, "mockdata_diagnostics")
+```
+
+Also useful:
+
+```r
+spec <- mock_spec_from_recodeflow(
+  variables = vars,
+  variable_details = details,
+  databaseStart = "cycle1"
+)
+
+validate_mock_spec(spec, strict = TRUE)
+```
+
+## What To Report
+
+Please report:
+
+- metadata files or patterns that unexpectedly fall back to the legacy path;
+- variables that generated correctly in v0.3 but fail in v0.4;
+- variables that generate but have surprising values, types, or missing-code
+  behavior;
+- diagnostics that are hard to interpret;
+- any cchsflow/chmsflow/recodeflow assumptions about MockData output that v0.4
+  appears to change;
+- API ergonomics issues that make the new path hard to explain in documentation.
+
+## Proposed Timeline
+
+- v0.4 sits on `dev` while sibling maintainers test representative metadata.
+- Documentation sprint work continues on a separate branch and PR.
+- After checks, documentation, and maintainer smoke tests are complete, v0.4.0
+  can be tagged and merged forward to `main`.
+- Any lifecycle deprecation warnings for older APIs should wait until v0.4.x and
+  only after sibling package maintainers have a clear migration path.
+
+## Message Template
+
+Subject: MockData v0.4 available on `dev` for sibling-package testing
+
+MockData v0.4 is now on the `dev` branch for maintainer testing. It keeps the
+existing `create_mock_data()` API, but internally routes supported metadata
+through a new `mock_spec` pipeline with native generation and MockData-owned
+post-processing diagnostics.
+
+No immediate migration is required for cchsflow/chmsflow/recodeflow, and no
+public function removals are planned before v0.5.0. The main ask is to try
+representative `variables.csv` and `variable_details.csv` files against
+`create_mock_data(validate = TRUE, verbose = TRUE)` and report any unexpected
+fallbacks, failures, or output changes.
+
+The key user-visible differences are that v0.4 output may include a
+`mockdata_diagnostics` attribute, seeded output can differ from v0.3 because
+post-processing uses `seed + 1L`, and optional `simstudy` support remains in
+`Suggests` rather than becoming a required dependency.

From 200630c592d435d2d3b2229ac2c7b667f672e7bf Mon Sep 17 00:00:00 2001
From: Doug Manuel <DougManuel@users.noreply.github.com>
Date: Wed, 20 May 2026 12:56:36 -0400
Subject: [PATCH 3/8] Add recodeflow metadata how-to

---
 _pkgdown.yml                            |   1 +
 development/v04-documentation-sprint.md |   2 +
 vignettes/recodeflow-metadata-v04.qmd   | 268 ++++++++++++++++++++++++
 3 files changed, 271 insertions(+)
 create mode 100644 vignettes/recodeflow-metadata-v04.qmd
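
Note for reviewers: the how-to in this patch states that `databaseStart`
filtering is exact token matching, so `cycle10` does not match `cycle1`. The
base-R sketch below shows the difference between token matching and substring
matching. The helper `matches_database()` is hypothetical, and the
comma-separated tag format is an assumption modelled on the `role` column in
the vignette; MockData's own parsing may differ.

```r
# Hypothetical helper, not part of MockData: TRUE where `target` appears as a
# whole comma-separated token in `database_start`.
matches_database <- function(database_start, target) {
  tokens <- strsplit(database_start, ",", fixed = TRUE)
  vapply(tokens, function(x) target %in% trimws(x), logical(1))
}

tags <- c("cycle1", "cycle10", "cycle1, cycle2", "cycle10,cycle11")

# Token matching: only the first and third tags contain the token "cycle1".
matches_database(tags, "cycle1")

# Substring matching also matches "cycle10" and "cycle11", which is the
# behaviour the vignette says v0.4 avoids.
grepl("cycle1", tags, fixed = TRUE)
```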

diff --git a/_pkgdown.yml b/_pkgdown.yml
index 8c02821..8329e1d 100644
--- a/_pkgdown.yml
+++ b/_pkgdown.yml
@@ -112,6 +112,7 @@ articles:
   desc: Task-oriented practical examples
   navbar: How-to guides
   contents:
+  - recodeflow-metadata-v04
   - for-recodeflow-users
 
 - title: Explanation
diff --git a/development/v04-documentation-sprint.md b/development/v04-documentation-sprint.md
index 4c138d7..97b8862 100644
--- a/development/v04-documentation-sprint.md
+++ b/development/v04-documentation-sprint.md
@@ -18,6 +18,8 @@ vignette and pkgdown builds.
 ## First pass
 
 - `getting-started-v04.qmd`: tutorial for the v0.4 `mock_spec` workflow.
+- `recodeflow-metadata-v04.qmd`: how-to for generating mock data from
+  recodeflow-style CSV metadata.
 - README: update the top-level status and quick example so users see v0.4
   immediately.
 - `_pkgdown.yml`: expose the new v0.4 tutorial in site navigation.
diff --git a/vignettes/recodeflow-metadata-v04.qmd b/vignettes/recodeflow-metadata-v04.qmd
new file mode 100644
index 0000000..a36b8be
--- /dev/null
+++ b/vignettes/recodeflow-metadata-v04.qmd
@@ -0,0 +1,268 @@
+---
+title: "Use recodeflow metadata with MockData v0.4"
+format: html
+vignette: >
+  %\VignetteIndexEntry{Use recodeflow metadata with MockData v0.4}
+  %\VignetteEngine{quarto::html}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r}
+#| label: setup
+#| include: false
+input_file <- tryCatch(knitr::current_input(dir = TRUE), error = function(e) NULL)
+candidate_roots <- unique(c(
+  ".",
+  "..",
+  if (!is.null(input_file)) file.path(dirname(input_file), "..")
+))
+package_root <- NULL
+for (candidate in candidate_roots) {
+  description <- file.path(candidate, "DESCRIPTION")
+  if (file.exists(description) &&
+      any(grepl("^Package:\\s+MockData\\s*$", readLines(description, warn = FALSE)))) {
+    package_root <- candidate
+    break
+  }
+}
+
+if (!is.null(package_root)) {
+  devtools::load_all(package_root, quiet = TRUE)
+} else {
+  library(MockData)
+}
+```
+
+::: {.vignette-about}
+**About this vignette:** This how-to shows how to generate mock data from
+recodeflow-style `variables.csv` and `variable_details.csv` files. The code
+writes temporary CSV files and reads them back, so the vignette exercises the
+same path as a file-based user workflow.
+:::
+
+## Starting point
+
+Use this path when you already have recodeflow metadata. MockData reads the
+metadata, converts it to a v0.4 `mock_spec`, generates baseline values, and
+applies missing-code and garbage-value post-processing.
+
+For a compact example, define three variables:
+
+- `age`: continuous integer with a normal distribution and one missing code
+- `smoking`: categorical code with one `recEnd = "NA::b"` missing-code row
+- `interview_date`: date variable with a valid calendar range
+
+```{r}
+variables <- data.frame(
+  variable = c("age", "smoking", "interview_date"),
+  label = c("Age in years", "Smoking status", "Interview date"),
+  variableType = c("Continuous", "Categorical", "Date"),
+  rType = c("integer", "factor", "date"),
+  role = c("enabled,table1", "enabled,table1", "enabled"),
+  position = c(10, 20, 30),
+  databaseStart = c("cycle1", "cycle1", "cycle1"),
+  distribution = c("normal", NA, "uniform"),
+  mean = c(50, NA, NA),
+  sd = c(12, NA, NA),
+  garbage_low_prop = c(0.02, NA, NA),
+  garbage_low_range = c("[0, 17]", NA, NA),
+  stringsAsFactors = FALSE
+)
+
+variable_details <- data.frame(
+  variable = c(
+    "age",
+    "age",
+    "smoking",
+    "smoking",
+    "smoking",
+    "smoking",
+    "interview_date"
+  ),
+  recStart = c(
+    "[18, 85]",
+    "999",
+    "1",
+    "2",
+    "3",
+    "7",
+    "[2020-01-01, 2020-12-31]"
+  ),
+  recEnd = c("copy", "NA::b", "copy", "copy", "copy", "NA::b", "copy"),
+  catLabel = c(
+    "Valid age range",
+    "Not stated",
+    "Never smoker",
+    "Former smoker",
+    "Current smoker",
+    "Don't know",
+    "Interview date range"
+  ),
+  proportion = c(0.95, 0.05, 0.50, 0.30, 0.17, 0.03, 1),
+  databaseStart = "cycle1",
+  stringsAsFactors = FALSE
+)
+```
+
+## Write metadata as CSV files
+
+In a real project, these files already exist. Here we write them to a temporary
+directory so this vignette remains self-contained.
+
+```{r}
+metadata_dir <- tempfile("mockdata-recodeflow-")
+dir.create(metadata_dir)
+
+variables_file <- file.path(metadata_dir, "variables.csv")
+details_file <- file.path(metadata_dir, "variable_details.csv")
+
+write.csv(variables, variables_file, row.names = FALSE, na = "")
+write.csv(variable_details, details_file, row.names = FALSE, na = "")
+```
+
+## Inspect the normalized specification
+
+`mock_spec_from_recodeflow()` reads either data frames or CSV file paths. It
+returns a validated `mock_spec` without generating data.
+
+```{r}
+spec <- mock_spec_from_recodeflow(
+  variables = variables_file,
+  variable_details = details_file,
+  databaseStart = "cycle1"
+)
+
+names(spec$variables)
+```
+
+The spec preserves the recodeflow pieces MockData needs: variable types,
+categorical levels, proportions, valid ranges, missing-code rows, and garbage
+rules.
+
+```{r}
+spec$variables$smoking$levels
+spec$variables$smoking$missing_codes
+spec$variables$age$range
+spec$variables$age$garbage_rules
+```
+
+## Generate mock data with the compatibility wrapper
+
+Most recodeflow users should start with `create_mock_data()`. In strict mode
+(`validate = TRUE`, the default), supported metadata routes through the v0.4
+pipeline.
+
+```{r}
+mock_data <- create_mock_data(
+  databaseStart = "cycle1",
+  variables = variables_file,
+  variable_details = details_file,
+  n = 200,
+  seed = 123,
+  verbose = TRUE
+)
+
+head(mock_data)
+```
+
+The output is a regular data frame.
+
+```{r}
+str(mock_data)
+```
+
+## Check diagnostics
+
+When `create_mock_data()` uses the v0.4 path, the returned data frame has a
+`mockdata_diagnostics` attribute. The attribute records which rows were changed
+during missing-code and garbage-value post-processing.
+
+```{r}
+diagnostics <- attr(mock_data, "mockdata_diagnostics")
+names(diagnostics$variables)
+```
+
+For example, `smoking` has a missing-code rule for code `7`, and `age` has a
+low garbage rule.
+
+```{r}
+length(diagnostics$variables$smoking$assigned_missing_indices)
+diagnostics$variables$smoking$assigned_missing_indices[1:6]
+
+length(diagnostics$variables$age$assigned_garbage_indices$low)
+diagnostics$variables$age$assigned_garbage_indices$low
+```
+
+Use the diagnostics as an audit trail, not as columns in the mock dataset. Some
+base R operations and downstream tools can drop attributes, so inspect or save
+diagnostics before heavy reshaping.
+
+## Generate explicitly from the spec
+
+The wrapper is convenient, but the v0.4 pipeline can also be called step by
+step. This is useful when you want to inspect baseline values before
+post-processing.
+
+```{r}
+baseline <- generate_mock_data_native(spec, n = 200, seed = 123)
+head(baseline)
+```
+
+```{r}
+postprocessed <- postprocess_mock_data(baseline, spec, seed = 124)
+head(postprocessed)
+```
+
+The wrapper uses the same idea: the public seed controls baseline generation,
+and `seed + 1L` controls post-processing.
+
+## Database filtering
+
+`databaseStart` filtering uses exact token matching. A variable tagged for
+`cycle10` will not accidentally match `cycle1`.
+
+```{r}
+variables_cycle10 <- variables
+variables_cycle10$variable[1] <- "age_cycle10"
+variables_cycle10$databaseStart[1] <- "cycle10"
+
+combined_variables <- rbind(variables, variables_cycle10[1, ])
+combined_details <- rbind(
+  variable_details,
+  transform(variable_details[variable_details$variable == "age", ],
+            variable = "age_cycle10")
+)
+
+filtered_spec <- mock_spec_from_recodeflow(
+  variables = combined_variables,
+  variable_details = combined_details,
+  databaseStart = "cycle1"
+)
+
+names(filtered_spec$variables)
+```
+
+## Troubleshooting
+
+If `create_mock_data()` cannot use the v0.4 path, set `verbose = TRUE` to see
+which path was chosen.
+
+```{r}
+#| eval: false
+mock_data <- create_mock_data(
+  databaseStart = "cycle1",
+  variables = variables_file,
+  variable_details = details_file,
+  n = 200,
+  seed = 123,
+  verbose = TRUE
+)
+```
+
+Common reasons for legacy fallback include `validate = FALSE`,
+`variable_details = NULL`, detail-level `databaseStart` filtering without a
+variable-level `databaseStart` column, and features that are intentionally
+deferred from the v0.4 native backend.
+
+For deeper diagnostics examples, see the diagnostics and garbage how-to when it
+lands in the v0.4 documentation sprint.

From 750bf13052a52118d2ef2571bee8dc7b2e46d671 Mon Sep 17 00:00:00 2001
From: Doug Manuel <DougManuel@users.noreply.github.com>
Date: Wed, 20 May 2026 16:26:29 -0400
Subject: [PATCH 4/8] Add diagnostics and garbage how-to

---
 _pkgdown.yml                              |   1 +
 development/v04-documentation-sprint.md   |   2 +
 vignettes/diagnostics-and-garbage-v04.qmd | 247 ++++++++++++++++++++++
 vignettes/recodeflow-metadata-v04.qmd     |   4 +
 4 files changed, 254 insertions(+)
 create mode 100644 vignettes/diagnostics-and-garbage-v04.qmd
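
Note for reviewers: the collision case in this patch (`97` as both a valid level
and a declared missing code) is easier to reason about with a small base-R
model. The sketch below is an assumption-laden illustration, not MockData's
implementation: it records which rows post-processing overwrote, which is the
role the vignette assigns to `assigned_missing_indices`.

```r
n <- 200

# Baseline: "97" is a valid level, so some rows hold it naturally.
set.seed(11)
baseline <- sample(c("1", "97"), n, replace = TRUE, prob = c(0.65, 0.35))

# Post-processing: overwrite a random 20% of rows with the missing code and
# remember which rows were touched.
set.seed(12)
assigned <- sort(sample.int(n, size = round(0.20 * n)))
processed <- baseline
processed[assigned] <- "97"

# The values alone cannot separate the two origins; the recorded indices can.
all_97 <- which(processed == "97")
natural_97 <- setdiff(all_97, assigned)

c(total = length(all_97),
  assigned = length(assigned),
  natural = length(natural_97),
  assigned_over_existing_97 = sum(baseline[assigned] == "97"))
```

The last count shows how many assigned rows already held `97` in the baseline,
a case that a plain `table()` of the processed column would hide.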

diff --git a/_pkgdown.yml b/_pkgdown.yml
index 8329e1d..7b4851c 100644
--- a/_pkgdown.yml
+++ b/_pkgdown.yml
@@ -113,6 +113,7 @@ articles:
   navbar: How-to guides
   contents:
   - recodeflow-metadata-v04
+  - diagnostics-and-garbage-v04
   - for-recodeflow-users
 
 - title: Explanation
diff --git a/development/v04-documentation-sprint.md b/development/v04-documentation-sprint.md
index 97b8862..584a0fa 100644
--- a/development/v04-documentation-sprint.md
+++ b/development/v04-documentation-sprint.md
@@ -20,6 +20,8 @@ vignette and pkgdown builds.
 - `getting-started-v04.qmd`: tutorial for the v0.4 `mock_spec` workflow.
 - `recodeflow-metadata-v04.qmd`: how-to for generating mock data from
   recodeflow-style CSV metadata.
+- `diagnostics-and-garbage-v04.qmd`: how-to for reading diagnostics and
+  auditing garbage/missing-code post-processing.
 - README: update the top-level status and quick example so users see v0.4
   immediately.
 - `_pkgdown.yml`: expose the new v0.4 tutorial in site navigation.
diff --git a/vignettes/diagnostics-and-garbage-v04.qmd b/vignettes/diagnostics-and-garbage-v04.qmd
new file mode 100644
index 0000000..259f0de
--- /dev/null
+++ b/vignettes/diagnostics-and-garbage-v04.qmd
@@ -0,0 +1,247 @@
+---
+title: "Inspect diagnostics and garbage rules in MockData v0.4"
+format: html
+vignette: >
+  %\VignetteIndexEntry{Inspect diagnostics and garbage rules in MockData v0.4}
+  %\VignetteEngine{quarto::html}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r}
+#| label: setup
+#| include: false
+input_file <- tryCatch(knitr::current_input(dir = TRUE), error = function(e) NULL)
+candidate_roots <- unique(c(
+  ".",
+  "..",
+  if (!is.null(input_file)) file.path(dirname(input_file), "..")
+))
+package_root <- NULL
+for (candidate in candidate_roots) {
+  description <- file.path(candidate, "DESCRIPTION")
+  if (file.exists(description) &&
+      any(grepl("^Package:\\s+MockData\\s*$", readLines(description, warn = FALSE)))) {
+    package_root <- candidate
+    break
+  }
+}
+
+if (!is.null(package_root)) {
+  devtools::load_all(package_root, quiet = TRUE)
+} else {
+  library(MockData)
+}
+```
+
+::: {.vignette-about}
+**About this vignette:** This how-to shows how to inspect the
+`mockdata_diagnostics` attribute added by the v0.4 post-processing layer. The
+examples focus on audit trails for missing-code collisions and garbage-value
+rules.
+:::
+
+## Why diagnostics matter
+
+Mock data often needs two kinds of unusual values:
+
+- missing codes, such as `97` or `999`
+- garbage values, such as impossible ages used to test validation code
+
+Sometimes a value can be both meaningful and suspicious. For example, code `97`
+could be a valid category in one source file and also a declared missing code in
+another. MockData records diagnostics so you can tell whether a value was drawn
+naturally by the baseline generator or assigned later by post-processing.
+
+## Create a collision case
+
+Start with a categorical variable where `97` is both a valid level and a
+declared missing code. This is deliberately awkward; it is the case diagnostics
+are designed to make auditable.
+
+```{r}
+response_spec <- mock_categorical(
+  "response",
+  levels = c("1", "97"),
+  proportions = c(0.65, 0.35),
+  rtype = "character",
+  missing_codes = "97",
+  missing_proportions = 0.20
+)
+
+baseline <- generate_mock_data_native(response_spec, n = 200, seed = 11)
+table(baseline$response)
+```
+
+The baseline already contains some `97` values because `97` is a valid level.
+Now apply post-processing.
+
+```{r}
+processed <- postprocess_mock_data(baseline, response_spec, seed = 12)
+table(processed$response)
+```
+
+## Read the diagnostics
+
+Diagnostics live in the `mockdata_diagnostics` attribute of the data frame.
+
+```{r}
+diagnostics <- attr(processed, "mockdata_diagnostics")
+names(diagnostics$variables)
+```
+
+For a variable, two fields are especially important:
+
+- `preexisting_missing_code_indices`: rows whose baseline value already matched
+  a declared missing code
+- `assigned_missing_indices`: rows changed by post-processing to a missing code
+
+```{r}
+response_diag <- diagnostics$variables$response
+
+length(response_diag$preexisting_missing_code_indices)
+length(response_diag$assigned_missing_indices)
+```
+
+These two sets should be disjoint, so their intersection should be empty.
+
+```{r}
+intersect(
+  response_diag$preexisting_missing_code_indices,
+  response_diag$assigned_missing_indices
+)
+```
+
+Both groups contain `97` in the final data, but they mean different things.
+
+```{r}
+head(processed$response[response_diag$preexisting_missing_code_indices])
+head(processed$response[response_diag$assigned_missing_indices])
+```
+
+Use the diagnostics when your tests need to distinguish a naturally drawn
+collision from a missing code assigned by MockData.
+
+## Add garbage rules
+
+Garbage rules deliberately inject invalid or out-of-range values. Here `age`
+has one missing code and two garbage rules:
+
+- `low`: values below the valid age range
+- `high`: values above the valid age range
+
+```{r}
+age_spec <- mock_continuous(
+  "age",
+  range = c(18, 85),
+  distribution = "normal",
+  mean = 50,
+  sd = 12,
+  rtype = "integer",
+  missing_codes = 999,
+  missing_proportions = 0.05,
+  garbage_rules = list(
+    high = list(proportion = 0.03, range = "[120, 150]"),
+    low = list(proportion = 0.04, range = "[0, 17]")
+  )
+)
+
+age_baseline <- generate_mock_data_native(age_spec, n = 200, seed = 21)
+age_processed <- postprocess_mock_data(age_baseline, age_spec, seed = 22)
+```
+
+MockData applies garbage rules in canonical order: `low`, then `high`, then any
+other named rules in caller order. The diagnostics use the same order.
+
+```{r}
+age_diag <- attr(age_processed, "mockdata_diagnostics")$variables$age
+names(age_diag$assigned_garbage_indices)
+```
+
+Inspect the assigned rows.
+
+```{r}
+low_idx <- age_diag$assigned_garbage_indices$low
+high_idx <- age_diag$assigned_garbage_indices$high
+
+length(low_idx)
+range(age_processed$age[low_idx])
+
+length(high_idx)
+range(age_processed$age[high_idx])
+```
+
+Missing-code rows are protected from garbage assignment.
+
+```{r}
+intersect(age_diag$assigned_missing_indices, low_idx)
+intersect(age_diag$assigned_missing_indices, high_idx)
+```
+
+## Combine variables in one pipeline
+
+Most workflows generate several variables together. The same diagnostics shape
+is used for every variable in the spec.
+
+```{r}
+spec <- mock_spec(
+  response_spec$variables$response,
+  age_spec$variables$age
+)
+
+combined_baseline <- generate_mock_data_native(spec, n = 200, seed = 31)
+combined_processed <- postprocess_mock_data(combined_baseline, spec, seed = 32)
+
+combined_diag <- attr(combined_processed, "mockdata_diagnostics")
+names(combined_diag$variables)
+```
+
+A compact audit summary can be built from the diagnostics.
+
+```{r}
+data.frame(
+  variable = names(combined_diag$variables),
+  preexisting_missing = vapply(
+    combined_diag$variables,
+    function(x) length(x$preexisting_missing_code_indices),
+    integer(1)
+  ),
+  assigned_missing = vapply(
+    combined_diag$variables,
+    function(x) length(x$assigned_missing_indices),
+    integer(1)
+  ),
+  assigned_garbage = vapply(
+    combined_diag$variables,
+    function(x) sum(lengths(x$assigned_garbage_indices)),
+    integer(1)
+  )
+)
+```
+
+## Preserve diagnostics before reshaping
+
+Diagnostics are stored as an attribute on the returned data frame. Some
+downstream operations keep attributes and others drop them. If diagnostics are
+part of your QA workflow, save them before heavy reshaping or joins.
+
+```{r}
+saved_diagnostics <- attr(combined_processed, "mockdata_diagnostics")
+
+subset_data <- combined_processed[1:5, ]
+is.null(attr(subset_data, "mockdata_diagnostics"))
+
+names(saved_diagnostics$variables)
+```
+
+## Re-running post-processing
+
+`postprocess_mock_data()` is intentionally not idempotent. Running it again on a
+data frame that already has `mockdata_diagnostics` would double-contaminate the
+data, so MockData stops loudly.
+
+```{r}
+#| error: true
+postprocess_mock_data(combined_processed, spec, seed = 33)
+```
+
+Start again from baseline data when you want a fresh post-processing draw.
diff --git a/vignettes/recodeflow-metadata-v04.qmd b/vignettes/recodeflow-metadata-v04.qmd
index a36b8be..51774fb 100644
--- a/vignettes/recodeflow-metadata-v04.qmd
+++ b/vignettes/recodeflow-metadata-v04.qmd
@@ -104,6 +104,10 @@ variable_details <- data.frame(
 )
 ```
 
+When `databaseStart` filtering is requested, include `databaseStart` in both
+metadata tables. If the detail metadata has the filter column but the variables
+metadata does not, `create_mock_data()` uses the legacy path for compatibility.
+
 ## Write metadata as CSV files
 
 In a real project, these files already exist. Here we write them to a temporary

From 3d7873f73fe65577f66e34855e8facb80f79c9f4 Mon Sep 17 00:00:00 2001
From: Doug Manuel <DougManuel@users.noreply.github.com>
Date: Wed, 20 May 2026 16:35:35 -0400
Subject: [PATCH 5/8] Add v03 to v04 migration how-to

---
 _pkgdown.yml                            |   1 +
 development/v04-documentation-sprint.md |   2 +
 vignettes/migrating-from-v03-v04.qmd    | 276 ++++++++++++++++++++++++
 3 files changed, 279 insertions(+)
 create mode 100644 vignettes/migrating-from-v03-v04.qmd

diff --git a/_pkgdown.yml b/_pkgdown.yml
index 7b4851c..215f895 100644
--- a/_pkgdown.yml
+++ b/_pkgdown.yml
@@ -114,6 +114,7 @@ articles:
   contents:
   - recodeflow-metadata-v04
   - diagnostics-and-garbage-v04
+  - migrating-from-v03-v04
   - for-recodeflow-users
 
 - title: Explanation
diff --git a/development/v04-documentation-sprint.md b/development/v04-documentation-sprint.md
index 584a0fa..8e05cfa 100644
--- a/development/v04-documentation-sprint.md
+++ b/development/v04-documentation-sprint.md
@@ -22,6 +22,8 @@ vignette and pkgdown builds.
   recodeflow-style CSV metadata.
 - `diagnostics-and-garbage-v04.qmd`: how-to for reading diagnostics and
   auditing garbage/missing-code post-processing.
+- `migrating-from-v03-v04.qmd`: how-to for compatibility behavior, fallback
+  routing, diagnostics, and seed differences.
 - README: update the top-level status and quick example so users see v0.4
   immediately.
 - `_pkgdown.yml`: expose the new v0.4 tutorial in site navigation.
diff --git a/vignettes/migrating-from-v03-v04.qmd b/vignettes/migrating-from-v03-v04.qmd
new file mode 100644
index 0000000..48744f9
--- /dev/null
+++ b/vignettes/migrating-from-v03-v04.qmd
@@ -0,0 +1,276 @@
+---
+title: "Migrate from MockData v0.3 to v0.4"
+format: html
+vignette: >
+  %\VignetteIndexEntry{Migrate from MockData v0.3 to v0.4}
+  %\VignetteEngine{quarto::html}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r}
+#| label: setup
+#| include: false
+input_file <- tryCatch(knitr::current_input(dir = TRUE), error = function(e) NULL)
+candidate_roots <- unique(c(
+  ".",
+  "..",
+  if (!is.null(input_file)) file.path(dirname(input_file), "..")
+))
+package_root <- NULL
+for (candidate in candidate_roots) {
+  description <- file.path(candidate, "DESCRIPTION")
+  if (file.exists(description) &&
+      any(grepl("^Package:\\s+MockData\\s*$", readLines(description, warn = FALSE)))) {
+    package_root <- candidate
+    break
+  }
+}
+
+if (!is.null(package_root)) {
+  devtools::load_all(package_root, quiet = TRUE)
+} else {
+  library(MockData)
+}
+```
+
+::: {.vignette-about}
+**About this vignette:** This how-to is for users moving existing
+`create_mock_data()` workflows from v0.3 to v0.4. It focuses on the compatibility
+wrapper, routing messages, diagnostics, and reproducibility differences.
+:::
+
+## What stayed the same
+
+The main entry point is still `create_mock_data()`, and the existing arguments
+are still available.
+
+```{r}
+variables <- data.frame(
+  variable = c("age", "smoking"),
+  variableType = c("Continuous", "Categorical"),
+  rType = c("integer", "character"),
+  role = c("enabled", "enabled"),
+  position = c(10, 20),
+  distribution = c("normal", NA),
+  mean = c(50, NA),
+  sd = c(12, NA),
+  stringsAsFactors = FALSE
+)
+
+variable_details <- data.frame(
+  variable = c("age", "age", "smoking", "smoking", "smoking"),
+  recStart = c("[18, 85]", "999", "1", "2", "7"),
+  recEnd = c("copy", "NA::b", "copy", "copy", "NA::b"),
+  proportion = c(0.95, 0.05, 0.60, 0.35, 0.05),
+  stringsAsFactors = FALSE
+)
+```
+
+```{r}
+mock_data <- create_mock_data(
+  databaseStart = "study",
+  variables = variables,
+  variable_details = variable_details,
+  n = 100,
+  seed = 123
+)
+
+head(mock_data)
+```
+
+For supported metadata, v0.4 routes this call through the new `mock_spec`
+pipeline.
+
+## See which path ran
+
+Use `verbose = TRUE` when migrating. The message tells you whether the v0.4 path
+or the legacy path was used.
+
+```{r}
+strict_data <- create_mock_data(
+  databaseStart = "study",
+  variables = variables,
+  variable_details = variable_details,
+  n = 50,
+  seed = 456,
+  verbose = TRUE
+)
+```
+
+The v0.4 path returns a data frame with a diagnostics attribute.
+
+```{r}
+!is.null(attr(strict_data, "mockdata_diagnostics"))
+```
+
+## Opt into legacy behavior
+
+Set `validate = FALSE` when you need the legacy v0.3 dispatch path during
+migration. This is the explicit compatibility opt-out.
+
+```{r}
+legacy_data <- create_mock_data(
+  databaseStart = "study",
+  variables = variables,
+  variable_details = variable_details,
+  n = 50,
+  seed = 456,
+  validate = FALSE,
+  verbose = TRUE
+)
+```
+
+Legacy output is a plain data frame without the v0.4 diagnostics attribute.
+
+```{r}
+is.null(attr(legacy_data, "mockdata_diagnostics"))
+```
+
+The v0.4 path (`strict_data`) and the legacy path (`legacy_data`) should agree
+on the broad shape of supported data, but exact values can differ.
+
+```{r}
+names(strict_data)
+names(legacy_data)
+
+table(strict_data$smoking)
+table(legacy_data$smoking)
+```
+
+## Understand seed differences
+
+In v0.3, the public seed controlled the legacy generators. In v0.4, the wrapper
+uses the public seed for baseline generation and `seed + 1L` for missing-code
+and garbage-value post-processing.
+
+That makes both stages reproducible, but it means exact values may differ from
+v0.3 even when you pass the same seed.
+
+```{r}
+strict_again <- create_mock_data(
+  databaseStart = "study",
+  variables = variables,
+  variable_details = variable_details,
+  n = 50,
+  seed = 456
+)
+
+identical(strict_data, strict_again)
+```
+
+When testing migrations, compare structure, types, ranges, and proportions
+rather than expecting row-for-row equality with v0.3 output.
+
+```{r}
+str(strict_data)
+prop.table(table(strict_data$smoking))
+```
+
+## Know the fallback conditions
+
+`create_mock_data()` deliberately uses the legacy path when:
+
+- `validate = FALSE`
+- `variable_details = NULL`
+- detail-level `databaseStart` filtering is needed but `variables` has no
+  `databaseStart` column
+- the requested metadata uses a feature not yet supported by the v0.4 native
+  backend
+
+For example, `variable_details = NULL` keeps the simple legacy fallback.
+
+```{r}
+fallback_data <- create_mock_data(
+  databaseStart = "study",
+  variables = variables[1, ],
+  variable_details = NULL,
+  n = 20,
+  seed = 789,
+  verbose = TRUE
+)
+
+head(fallback_data)
+```
+
+```{r}
+is.null(attr(fallback_data, "mockdata_diagnostics"))
+```
+
+Unsupported v0.4 backend features also route to legacy dispatch. This example
+uses an exponential continuous distribution, which remains available through the
+legacy generator.
+
+```{r}
+exp_variables <- data.frame(
+  variable = "time_to_visit",
+  variableType = "Continuous",
+  rType = "double",
+  role = "enabled",
+  distribution = "exponential",
+  rate = 0.5,
+  stringsAsFactors = FALSE
+)
+
+exp_details <- data.frame(
+  variable = "time_to_visit",
+  recStart = "[0, 10]",
+  recEnd = "copy",
+  proportion = 1,
+  stringsAsFactors = FALSE
+)
+
+exp_data <- create_mock_data(
+  databaseStart = "study",
+  variables = exp_variables,
+  variable_details = exp_details,
+  n = 20,
+  seed = 321,
+  verbose = TRUE
+)
+
+head(exp_data)
+```
+
+## Inspect the v0.4 path directly
+
+When debugging a migration, split the wrapper into its three v0.4 steps:
+
+```{r}
+spec <- mock_spec_from_recodeflow(variables, variable_details)
+validate_mock_spec(spec, strict = TRUE)
+```
+
+```{r}
+baseline <- generate_mock_data_native(spec, n = 50, seed = 456)
+postprocessed <- postprocess_mock_data(baseline, spec, seed = 457)
+
+identical(strict_data, postprocessed)
+```
+
+This makes it easier to tell whether an issue is coming from metadata parsing,
+baseline generation, or post-processing.
+
+## What to check in sibling packages
+
+For cchsflow, chmsflow, and recodeflow workflows, test representative
+`variables.csv` and `variable_details.csv` files with:
+
+```{r}
+#| eval: false
+mock <- create_mock_data(
+  databaseStart = "your-cycle",
+  variables = "variables.csv",
+  variable_details = "variable_details.csv",
+  n = 100,
+  seed = 123,
+  validate = TRUE,
+  verbose = TRUE
+)
+
+str(mock)
+attr(mock, "mockdata_diagnostics")
+```
+
+Report cases where metadata unexpectedly falls back to legacy dispatch, where a
+variable generated in v0.3 but errors in v0.4, or where the generated values,
+types, or diagnostics are surprising.

From 9b00ff38e41a2142ff8806b423ef97a01d397832 Mon Sep 17 00:00:00 2001
From: Doug Manuel <DougManuel@users.noreply.github.com>
Date: Wed, 20 May 2026 16:47:00 -0400
Subject: [PATCH 6/8] Add backend choice how-to

---
 _pkgdown.yml                            |   1 +
 development/v04-documentation-sprint.md |   2 +
 vignettes/choosing-a-backend-v04.qmd    | 250 ++++++++++++++++++++++++
 3 files changed, 253 insertions(+)
 create mode 100644 vignettes/choosing-a-backend-v04.qmd

diff --git a/_pkgdown.yml b/_pkgdown.yml
index 215f895..c6811fe 100644
--- a/_pkgdown.yml
+++ b/_pkgdown.yml
@@ -115,6 +115,7 @@ articles:
   - recodeflow-metadata-v04
   - diagnostics-and-garbage-v04
   - migrating-from-v03-v04
+  - choosing-a-backend-v04
   - for-recodeflow-users
 
 - title: Explanation
diff --git a/development/v04-documentation-sprint.md b/development/v04-documentation-sprint.md
index 8e05cfa..59f3381 100644
--- a/development/v04-documentation-sprint.md
+++ b/development/v04-documentation-sprint.md
@@ -24,6 +24,8 @@ vignette and pkgdown builds.
   auditing garbage/missing-code post-processing.
 - `migrating-from-v03-v04.qmd`: how-to for compatibility behavior, fallback
   routing, diagnostics, and seed differences.
+- `choosing-a-backend-v04.qmd`: how-to for native versus optional `simstudy`
+  backend selection.
 - README: update the top-level status and quick example so users see v0.4
   immediately.
 - `_pkgdown.yml`: expose the new v0.4 tutorial in site navigation.
diff --git a/vignettes/choosing-a-backend-v04.qmd b/vignettes/choosing-a-backend-v04.qmd
new file mode 100644
index 0000000..30685bc
--- /dev/null
+++ b/vignettes/choosing-a-backend-v04.qmd
@@ -0,0 +1,250 @@
+---
+title: "Choose a MockData v0.4 backend"
+format: html
+vignette: >
+  %\VignetteIndexEntry{Choose a MockData v0.4 backend}
+  %\VignetteEngine{quarto::html}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r}
+#| label: setup
+#| include: false
+input_file <- tryCatch(knitr::current_input(dir = TRUE), error = function(e) NULL)
+candidate_roots <- unique(c(
+  ".",
+  "..",
+  if (!is.null(input_file)) file.path(dirname(input_file), "..")
+))
+package_root <- NULL
+for (candidate in candidate_roots) {
+  description <- file.path(candidate, "DESCRIPTION")
+  if (file.exists(description) &&
+      any(grepl("^Package:\\s+MockData\\s*$", readLines(description, warn = FALSE)))) {
+    package_root <- candidate
+    break
+  }
+}
+
+if (!is.null(package_root)) {
+  devtools::load_all(package_root, quiet = TRUE)
+} else {
+  library(MockData)
+}
+```
+
+::: {.vignette-about}
+**About this vignette:** This how-to explains when to use the default native
+backend and when to try the optional `simstudy` backend. The `simstudy` examples
+run when `simstudy >= 0.8.1` is installed and otherwise render a clear message.
+:::
+
+## The short version
+
+Use the native backend by default.
+
+```{r}
+spec <- mock_spec(
+  mock_spec_continuous("age", range = c(18, 85), rtype = "integer"),
+  mock_spec_categorical(
+    "smoking",
+    levels = c("never", "former", "current"),
+    proportions = c(0.5, 0.3, 0.2),
+    rtype = "character"
+  )
+)
+
+native_data <- generate_mock_data_native(spec, n = 100, seed = 101)
+head(native_data)
+```
+
+The native backend is always available, stays within MockData's MIT-licensed
+code, and is the backend used by `create_mock_data()` for supported v0.4
+metadata.
+
+Use the optional `simstudy` backend when you want to exercise that engine path
+or when future MockData features need simulation mechanics that `simstudy`
+already provides.
+
+## Check whether simstudy is available
+
+MockData keeps `simstudy` optional. It is listed in `Suggests`, not `Imports`,
+so installing MockData does not require installing `simstudy`.
+
+```{r}
+simstudy_available <- requireNamespace("simstudy", quietly = TRUE) &&
+  utils::packageVersion("simstudy") >= "0.8.1"
+
+simstudy_available
+```
+
+If `simstudy` is unavailable, use `generate_mock_data_native()`.
+
+```{r}
+if (!simstudy_available) {
+  message(
+    "The optional simstudy backend is not available in this R environment; ",
+    "using generate_mock_data_native() is the recommended path."
+  )
+}
+```
+
+## Run the same spec through both backends
+
+For categorical variables and uniform continuous variables, both backends can
+generate the baseline data.
+
+```{r}
+native_large <- generate_mock_data_native(spec, n = 2000, seed = 202)
+
+if (simstudy_available) {
+  simstudy_large <- generate_mock_data_simstudy(spec, n = 2000, seed = 202)
+  head(simstudy_large)
+} else {
+  simstudy_large <- NULL
+}
+```
+
+When `simstudy` is available, compare broad properties rather than expecting
+row-for-row equality. The engines use different internals.
+
+```{r}
+if (simstudy_available) {
+  c(
+    native_mean_age = mean(native_large$age),
+    simstudy_mean_age = mean(simstudy_large$age)
+  )
+}
+```
+
+```{r}
+if (simstudy_available) {
+  rbind(
+    native = prop.table(table(factor(
+      native_large$smoking,
+      levels = c("never", "former", "current")
+    ))),
+    simstudy = prop.table(table(factor(
+      simstudy_large$smoking,
+      levels = c("never", "former", "current")
+    )))
+  )
+}
+```
+
+## Mixed specs are allowed
+
+The optional backend uses `simstudy` only for pieces it can currently generate
+safely. Other variables route through MockData's native backend inside the same
+call.
+
+```{r}
+mixed_spec <- mock_spec(
+  mock_spec_categorical(
+    "smoking",
+    levels = c("never", "former", "current"),
+    proportions = c(0.5, 0.3, 0.2),
+    rtype = "character"
+  ),
+  mock_spec_continuous(
+    "bmi",
+    range = c(15, 50),
+    distribution = "normal",
+    mean = 27,
+    sd = 5,
+    rtype = "double"
+  ),
+  mock_spec_date(
+    "interview_date",
+    range = as.Date(c("2020-01-01", "2020-12-31"))
+  )
+)
+
+mixed_native <- generate_mock_data_native(mixed_spec, n = 100, seed = 303)
+head(mixed_native)
+```
+
+```{r}
+if (simstudy_available) {
+  mixed_simstudy <- generate_mock_data_simstudy(mixed_spec, n = 100, seed = 303)
+  head(mixed_simstudy)
+}
+```
+
+In this example, `smoking` can be generated through `simstudy`; `bmi` and
+`interview_date` stay native because MockData owns the truncated normal and
+calendar-date contracts in v0.4.
+
+## Post-processing stays MockData-owned
+
+Missing codes, garbage values, and diagnostics are applied after baseline
+generation. That is true for both backends.
+
+```{r}
+post_spec <- mock_categorical(
+  "response",
+  levels = c("1", "97"),
+  proportions = c(0.6, 0.4),
+  rtype = "character",
+  missing_codes = "97",
+  missing_proportions = 0.2,
+  garbage_rules = list(low = list(proportion = 0.1, range = "[-2, 0]"))
+)
+
+native_baseline <- generate_mock_data_native(post_spec, n = 100, seed = 404)
+native_processed <- postprocess_mock_data(native_baseline, post_spec, seed = 405)
+
+names(attr(native_processed, "mockdata_diagnostics")$variables$response)
+```
+
+```{r}
+if (simstudy_available) {
+  simstudy_baseline <- generate_mock_data_simstudy(post_spec, n = 100, seed = 404)
+  simstudy_processed <- postprocess_mock_data(simstudy_baseline, post_spec, seed = 405)
+
+  names(attr(simstudy_processed, "mockdata_diagnostics")$variables$response)
+}
+```
+
+The diagnostics shape is the same because post-processing is not delegated to
+`simstudy`.
+
+## License and dependency posture
+
+MockData is MIT licensed. `simstudy` is GPL-3 licensed. Keeping `simstudy`
+optional lets MockData keep the core package MIT while still allowing users to
+try the advanced backend when that dependency is acceptable in their project.
+
+If your workflow needs no optional dependency, use:
+
+```{r}
+generate_mock_data_native(spec, n = 10, seed = 1)
+```
+
+If your workflow explicitly wants to test the optional backend and `simstudy` is
+installed, use:
+
+```{r}
+if (simstudy_available) {
+  generate_mock_data_simstudy(spec, n = 10, seed = 1)
+}
+```
+
+## Decision guide
+
+Choose the native backend when:
+
+- you want the default v0.4 behavior;
+- you need MockData to work without optional dependencies;
+- you are generating categorical, continuous, date, missing-code, or garbage
+  examples covered by the native pipeline;
+- you want the simplest path for package tests and vignettes.
+
+Try the optional `simstudy` backend when:
+
+- `simstudy >= 0.8.1` is already acceptable in your project;
+- you want to exercise the optional engine path;
+- you are preparing for future features where `simstudy` provides mature
+  simulation mechanics;
+- you still want MockData to own missing-code, garbage-value, and diagnostics
+  semantics after generation.

From d3f47c7c74c02a8e6e3512749e826b1976a30408 Mon Sep 17 00:00:00 2001
From: Doug Manuel <DougManuel@users.noreply.github.com>
Date: Wed, 20 May 2026 16:55:43 -0400
Subject: [PATCH 7/8] Add v04 design philosophy vignette

---
 _pkgdown.yml                            |   1 +
 development/v04-documentation-sprint.md |   2 +
 vignettes/choosing-a-backend-v04.qmd    |   2 +-
 vignettes/design-philosophy-v04.qmd     | 293 ++++++++++++++++++++++++
 4 files changed, 297 insertions(+), 1 deletion(-)
 create mode 100644 vignettes/design-philosophy-v04.qmd

diff --git a/_pkgdown.yml b/_pkgdown.yml
index c6811fe..b3169e3 100644
--- a/_pkgdown.yml
+++ b/_pkgdown.yml
@@ -122,6 +122,7 @@ articles:
   desc: Understanding concepts and design decisions
   navbar: Explanation
   contents:
+  - design-philosophy-v04
   - advanced-topics
 
 - title: Reference
diff --git a/development/v04-documentation-sprint.md b/development/v04-documentation-sprint.md
index 59f3381..4d7d074 100644
--- a/development/v04-documentation-sprint.md
+++ b/development/v04-documentation-sprint.md
@@ -26,6 +26,8 @@ vignette and pkgdown builds.
   routing, diagnostics, and seed differences.
 - `choosing-a-backend-v04.qmd`: how-to for native versus optional `simstudy`
   backend selection.
+- `design-philosophy-v04.qmd`: explanation of v0.4 design choices and scope
+  boundaries.
 - README: update the top-level status and quick example so users see v0.4
   immediately.
 - `_pkgdown.yml`: expose the new v0.4 tutorial in site navigation.
diff --git a/vignettes/choosing-a-backend-v04.qmd b/vignettes/choosing-a-backend-v04.qmd
index 30685bc..d0bff62 100644
--- a/vignettes/choosing-a-backend-v04.qmd
+++ b/vignettes/choosing-a-backend-v04.qmd
@@ -73,7 +73,7 @@ so installing MockData does not require installing `simstudy`.
 
 ```{r}
 simstudy_available <- requireNamespace("simstudy", quietly = TRUE) &&
-  utils::packageVersion("simstudy") >= "0.8.1"
+  utils::packageVersion("simstudy") >= numeric_version("0.8.1")
 
 simstudy_available
 ```
diff --git a/vignettes/design-philosophy-v04.qmd b/vignettes/design-philosophy-v04.qmd
new file mode 100644
index 0000000..fc67ded
--- /dev/null
+++ b/vignettes/design-philosophy-v04.qmd
@@ -0,0 +1,293 @@
+---
+title: "MockData v0.4 design philosophy"
+format: html
+vignette: >
+  %\VignetteIndexEntry{MockData v0.4 design philosophy}
+  %\VignetteEngine{quarto::html}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r}
+#| label: setup
+#| include: false
+input_file <- tryCatch(knitr::current_input(dir = TRUE), error = function(e) NULL)
+candidate_roots <- unique(c(
+  ".",
+  "..",
+  if (!is.null(input_file)) file.path(dirname(input_file), "..")
+))
+package_root <- NULL
+for (candidate in candidate_roots) {
+  description <- file.path(candidate, "DESCRIPTION")
+  if (file.exists(description) &&
+      any(grepl("^Package:\\s+MockData\\s*$", readLines(description, warn = FALSE)))) {
+    package_root <- candidate
+    break
+  }
+}
+
+if (!is.null(package_root)) {
+  devtools::load_all(package_root, quiet = TRUE)
+} else {
+  library(MockData)
+}
+```
+
+::: {.vignette-about}
+**About this vignette:** This explanation describes why MockData v0.4 is shaped
+around `mock_spec`, native generation, optional `simstudy`, and MockData-owned
+post-processing. It is not a tutorial; start with the v0.4 getting-started
+vignette if you want a first workflow.
+:::
+
+## Mock data, not synthetic data
+
+MockData generates mock data for package development, QA, documentation,
+examples, and training. Its output is meant to exercise code paths. It is not
+intended for privacy release, inference, or population-valid statistical
+analysis.
+
+That boundary is deliberate. In health-data and survey-data settings,
+"synthetic data" can imply privacy review, data-sharing obligations, or
+statistical validity claims. MockData avoids that claim. It helps you test a
+pipeline before you have access to real data; it does not replace real data for
+analysis.
+
+The working sentence is:
+
+> Give a recodeflow-style specification a body, so you can test the recoding
+> before you have the data.
+
+## The people v0.4 is trying to serve
+
+Three user groups shaped the v0.4 design.
+
+First, recodeflow ecosystem maintainers need data frames that can run through
+cchsflow, chmsflow, and recodeflow examples and tests. They already have
+`variables.csv` and `variable_details.csv`, so MockData should read those
+metadata files rather than invent a competing file format.
+
+Second, methodologists and package authors need examples and vignettes that run
+without restricted data. They may want a small, readable direct API rather than
+a full metadata table.
+
+Third, QA developers need deliberately bad data. Out-of-range ages, declared
+missing codes, impossible dates, and invalid category values are not incidental
+features; they are the point when testing validation code.
+
+v0.4 tries to serve all three without making any one workflow the only workflow.
+
+## Why `mock_spec` exists
+
+Before v0.4, MockData's generators read metadata, parsed ranges, generated
+values, applied missing codes, injected garbage, coerced types, and assembled
+columns in one path. That was useful while the package was young, but it made
+validation, backend choice, and diagnostics hard to reason about.
+
+v0.4 introduces `mock_spec` as the normalized internal representation. Different
+front doors can produce the same spec:
+
+- direct helpers, such as `mock_continuous()` and `mock_categorical()`
+- composable constructors, such as `mock_spec_continuous()`
+- recodeflow metadata through `mock_spec_from_recodeflow()`
+
+The spec is then consumed by generation and post-processing layers.
+
+```{r}
+spec <- mock_spec(
+  mock_spec_continuous(
+    "age",
+    range = c(18, 85),
+    distribution = "normal",
+    mean = 50,
+    sd = 12,
+    rtype = "integer"
+  ),
+  mock_spec_categorical(
+    "smoking",
+    levels = c("never", "former", "current"),
+    proportions = c(0.5, 0.3, 0.2),
+    rtype = "character"
+  )
+)
+
+names(spec$variables)
+```
+
+This is the main architectural move: parse once, validate once, then generate
+from the normalized shape.
+
+## Two tiers, one model
+
+The direct helpers are there for the first ten minutes.
+
+```{r}
+one_variable <- mock_continuous(
+  "age",
+  range = c(18, 85),
+  distribution = "normal",
+  mean = 50,
+  sd = 12,
+  rtype = "integer"
+)
+
+names(one_variable$variables)
+```
+
+The lower-level constructors are there when you want to compose multiple
+variables or build adapters.
+
+```{r}
+same_variable <- mock_spec(
+  mock_spec_continuous(
+    "age",
+    range = c(18, 85),
+    distribution = "normal",
+    mean = 50,
+    sd = 12,
+    rtype = "integer"
+  )
+)
+
+names(same_variable$variables)
+```
+
+Those are two surface syntaxes for the same internal model. That is why the
+package can support small hand-written examples and recodeflow metadata without
+duplicating generation logic.
+
+## Why the backend is hybrid
+
+MockData v0.4 has a native backend and an optional `simstudy` backend.
+
+The native backend is the default. It is always available, keeps MockData usable
+without optional dependencies, and owns the simple cases that are central to the
+package: categorical values, continuous values, dates, missing-code semantics,
+garbage values, and diagnostics.
+
+`simstudy` is optional. It is a mature GPL-3 simulation package with useful
+machinery for future advanced features, but MockData remains MIT licensed by
+keeping `simstudy` in `Suggests` and soft-gating the backend.
+
+```{r}
+native_data <- generate_mock_data_native(spec, n = 5, seed = 1)
+native_data
+```
+
+```{r}
+simstudy_available <- requireNamespace("simstudy", quietly = TRUE) &&
+  utils::packageVersion("simstudy") >= numeric_version("0.8.1")
+
+if (simstudy_available) {
+  generate_mock_data_simstudy(spec, n = 5, seed = 1)
+} else {
+  message("simstudy (>= 0.8.1) is not available; the native backend remains available.")
+}
+```
+
+This split is intentionally conservative. MockData should not reimplement a
+large simulation library when a good one exists, but it also should not make a
+GPL-3 package mandatory for users who only need the core mock-data path.
+
+## Why post-processing is separate
+
+Missing codes and garbage values are not just another distribution. They are QA
+semantics layered on top of otherwise valid generated data.
+
+v0.4 therefore generates baseline values first, then applies missing-code and
+garbage rules in a separate post-processing pass.
+
+```{r}
+qa_spec <- mock_categorical(
+  "response",
+  levels = c("1", "97"),
+  proportions = c(0.7, 0.3),
+  rtype = "character",
+  missing_codes = "97",
+  missing_proportions = 0.2
+)
+
+baseline <- generate_mock_data_native(qa_spec, n = 100, seed = 11)
+processed <- postprocess_mock_data(baseline, qa_spec, seed = 12)
+
+diagnostics <- attr(processed, "mockdata_diagnostics")
+names(diagnostics$variables$response)
+```
+
+The diagnostics matter because a value can naturally collide with a declared
+missing code. In the example above, `97` is both a valid level and a missing
+code. MockData records which rows naturally drew `97` and which rows were
+assigned `97` during post-processing.
+
+```{r}
+response_diag <- diagnostics$variables$response
+
+c(
+  preexisting = length(response_diag$preexisting_missing_code_indices),
+  assigned = length(response_diag$assigned_missing_indices)
+)
+```
+
+That distinction is what makes the output auditable for QA workflows.
+
+## Why strictness increased
+
+Earlier MockData versions were often permissive: warn, skip a variable, and
+return whatever could be generated. That behavior was convenient in exploratory
+work, but risky in package tests and documentation. A silently missing column
+can make a vignette or downstream test look successful while testing the wrong
+thing.
+
+v0.4 moves toward strict generation for the new pipeline. Unsupported features
+should either fail loudly or route through an explicit compatibility path.
+
+`create_mock_data()` keeps compatibility by retaining legacy fallback routes,
+especially for `validate = FALSE`, `variable_details = NULL`, detail-level
+`databaseStart` filtering, and unsupported native-backend features. Use
+`verbose = TRUE` while migrating so the chosen path is visible.
+
+## What is deliberately deferred
+
+Several features are intentionally not solved in v0.4.
+
+Formula-derived variables are detected and kept loud rather than silently
+ignored. They need a dependency-aware evaluator, sandboxing rules, and clear
+syntax.
+
+Multi-variable correlation and richer joint distributions are future work.
+`simstudy` is one possible engine for those features, but v0.4 does not claim to
+generate statistically realistic joint distributions.
+
+Table 1 bootstrap is also future work. It is a natural third adapter: take
+published descriptive statistics and produce a `mock_spec`. That is useful, but
+it should not be squeezed into the recodeflow adapter.
+
+LinkML or another schema-first model remains a possible north star for the
+larger recodeflow ecosystem. v0.4 keeps the internal spec abstract enough that a
+future schema adapter could produce it.
+
+These are roadmap items, not hidden guarantees.
+
+## How the v0.4 refactor was reviewed
+
+The v0.4 architecture was developed through a spike, milestone PRs, and repeated
+review of code, tests, silent-failure paths, and documentation. That process
+changed the design in concrete ways:
+
+- strict-by-default behavior became more important than permissive fallback;
+- diagnostics became a first-class auditability contract;
+- `simstudy` stayed optional to preserve MockData's dependency and license
+  posture;
+- executable vignettes became part of validation, not just prose.
+
+The development notes in `development/` and `.tmp/` preserve more of that review
+trail for maintainers. This vignette distills the user-facing design choices.
+
+## The design in one paragraph
+
+MockData v0.4 normalizes inputs into `mock_spec`, validates that shape, generates
+baseline values through a native backend or optional `simstudy` backend, and
+then applies MockData-owned post-processing for missing codes, garbage values,
+and diagnostics. It keeps recodeflow metadata central, adds simpler direct APIs,
+and preserves the public `create_mock_data()` wrapper for compatibility. It is
+mock data for development and QA, not synthetic data for inference.

From 5907d55107e09b34575b7d9176515f7065ad723a Mon Sep 17 00:00:00 2001
From: Doug Manuel <DougManuel@users.noreply.github.com>
Date: Thu, 21 May 2026 13:47:58 -0400
Subject: [PATCH 8/8] Prepare v04 docs for tag

---
 DESCRIPTION                             |  2 +-
 README.md                               |  4 ++--
 development/adr/v04-hybrid-backend.md   | 27 ++++++++++++++++---------
 development/simstudy-v04.md             | 22 +++++++++++++++++++-
 development/v04-documentation-sprint.md |  4 ++++
 vignettes/design-philosophy-v04.qmd     | 20 ++++++++++--------
 6 files changed, 57 insertions(+), 22 deletions(-)

diff --git a/DESCRIPTION b/DESCRIPTION
index b10b0fc..99e61e5 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: MockData
 Title: Generate Mock Data from Metadata Specifications
-Version: 0.4.0.9000
+Version: 0.4.0
 Authors@R: c(
     person("Juan", "Li", role = "aut", email = "juli@ohri.ca"),
     person("Douglas", "Manuel", role = c("aut", "cre"), email = "dmanuel@ohri.ca"),
diff --git a/README.md b/README.md
index 715f74a..3b19f31 100644
--- a/README.md
+++ b/README.md
@@ -3,13 +3,13 @@
 <!-- badges: start -->
 
 [![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
-[![Version: 0.4.0-dev](https://img.shields.io/badge/version-0.4.0--dev-blue.svg)](https://github.com/Big-Life-Lab/MockData)
+[![Version: 0.4.0](https://img.shields.io/badge/version-0.4.0-blue.svg)](https://github.com/Big-Life-Lab/MockData)
 [![pkgdown](https://github.com/Big-Life-Lab/MockData/actions/workflows/pkgdown.yaml/badge.svg)](https://github.com/Big-Life-Lab/MockData/actions/workflows/pkgdown.yaml)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
 <!-- badges: end -->
 
-**Status: Experimental, pre-release software**
+**Status: Experimental v0.4.0 release candidate**
 
 MockData is a work-in-progress R package for generating mock testing data from
 small metadata specifications. The `dev` branch now contains the v0.4
diff --git a/development/adr/v04-hybrid-backend.md b/development/adr/v04-hybrid-backend.md
index 63fc032..9d9b71f 100644
--- a/development/adr/v04-hybrid-backend.md
+++ b/development/adr/v04-hybrid-backend.md
@@ -1,7 +1,7 @@
 # ADR: v0.4 Hybrid Backend Architecture
 
-**Status**: draft  
-**Date**: 2026-05-18  
+**Status**: accepted and implemented in PR #28  
+**Date**: 2026-05-18  
 **Decision owner**: MockData maintainers
 
 ## Context
@@ -23,6 +23,9 @@ MockData-specific semantics as post-processing. Three review rounds converged on
 the same conclusion: the hybrid architecture is ready for production refactor
 planning.
 
+The production refactor was implemented in PR #28 and merged to `dev` for
+sibling-package testing before a v0.4.0 tag.
+
 ## Decision
 
 MockData v0.4 will move toward a hybrid backend architecture:
@@ -92,18 +95,22 @@ Tradeoffs:
   formula syntax, custom distribution registry, and correlation merging.
 - Maintaining wrappers will add short-term complexity.
 
-## Implementation Direction
+## Implementation Status
 
-Production refactor should proceed in layers:
+The production refactor proceeded in layers:
 
 1. `mock_spec` constructors and validators.
 2. Direct and recodeflow input adapters.
-3. Formula/dependency evaluator.
-4. Native backend.
-5. Post-processing layer.
-6. Promotion of spike assertions to `testthat`.
-7. Optional `simstudy` backend.
-8. Current API wrappers.
+3. Native backend.
+4. Post-processing layer and diagnostics.
+5. Promotion of spike assertions to `testthat`.
+6. Optional `simstudy` backend.
+7. Current API wrappers.
+8. Divio documentation sprint and Phase C maintainer communication.
+
+Formula/dependency evaluation, multi-group correlations, Table 1 adapters, and
+schema-first integration remain deferred roadmap items rather than v0.4.0
+commitments.
 
 ## Open Follow-Up Decisions
 
diff --git a/development/simstudy-v04.md b/development/simstudy-v04.md
index 5772839..6d2a85d 100644
--- a/development/simstudy-v04.md
+++ b/development/simstudy-v04.md
@@ -1,9 +1,15 @@
 # MockData v0.4 Production Refactor Plan
 
+**Status**: implemented in PR #28 and superseded by the v0.4 documentation
+sprint. This document is retained as the production-refactor plan and should be
+read as historical implementation context rather than an active task list.
+
 ## 1. Write The ADR First
 
 Write a short architecture decision record before production code changes.
 
+**Status**: complete. See `development/adr/v04-hybrid-backend.md`.
+
 The ADR should lock these decisions:
 
 - **Decision**: MockData adopts a hybrid backend architecture.
@@ -31,6 +37,9 @@ The ADR should lock these decisions:
 
 Each layer should have focused tests before the next layer starts.
 
+**Status**: complete for the v0.4.0 scope. Formula/dependency evaluation,
+multi-group correlation, and Table 1 input remain deferred roadmap items.
+
 1. **`mock_spec` core**
    - Constructors and validators.
    - Stable fields for names, types, ranges, levels, proportions, missing codes,
@@ -81,6 +90,10 @@ Each layer should have focused tests before the next layer starts.
 
 ## 3. Keep The Current API Alive
 
+**Status**: complete. The v0.3 public functions remain available, and
+`create_mock_data()` now routes supported metadata through the v0.4 pipeline
+while preserving legacy fallback paths.
+
 Existing public functions should remain available in v0.4.0:
 
 - `create_mock_data()`
@@ -95,6 +108,11 @@ synchronized release.
 
 ## 4. Carry-Forward Design Issues
 
+**Status**: partly resolved. The diagnostics shape, seed discipline, native vs
+`simstudy` parity tests, and optional `simstudy` posture were settled for v0.4.0.
+The remaining items below should be treated as v0.5+ roadmap candidates or issue
+backlog material.
+
 Settle in the ADR or the first design note:
 
 - Multi-group correlation merge strategy.
@@ -116,6 +134,9 @@ Track as implementation issues:
 
 ## 5. Communication
 
+**Status**: complete as a draft communication artifact. See
+`development/v04-phase-c-comms-note.md`.
+
 Before v0.4.0 lands, write a short communication note for cchsflow, chmsflow,
 and recodeflow maintainers:
 
@@ -125,4 +146,3 @@ and recodeflow maintainers:
 - What migration is optional in v0.4.0.
 - When deprecation warnings may begin.
 - How the mock-data framing remains distinct from synthetic-data release.
-
diff --git a/development/v04-documentation-sprint.md b/development/v04-documentation-sprint.md
index 4d7d074..7293021 100644
--- a/development/v04-documentation-sprint.md
+++ b/development/v04-documentation-sprint.md
@@ -1,5 +1,9 @@
 # MockData v0.4 Documentation Sprint
 
+**Status**: complete for the v0.4.0 documentation sprint. Remaining work before
+tagging is package checks, maintainer smoke testing, and any follow-up edits from
+review.
+
 This sprint treats documentation as implementation validation. The goal is not
 only to explain the v0.4 API, but to run realistic user workflows during
 vignette and pkgdown builds.
diff --git a/vignettes/design-philosophy-v04.qmd b/vignettes/design-philosophy-v04.qmd
index fc67ded..8af0afe 100644
--- a/vignettes/design-philosophy-v04.qmd
+++ b/vignettes/design-philosophy-v04.qmd
@@ -278,16 +278,20 @@ changed the design in concrete ways:
 - diagnostics became a first-class auditability contract;
 - `simstudy` stayed optional to preserve MockData's dependency and license
   posture;
+- a Phase C communication note made sibling-package testing part of the release
+  process;
 - executable vignettes became part of validation, not just prose.
 
-The development notes in `development/` and `.tmp/` preserve more of that review
-trail for maintainers. This vignette distills the user-facing design choices.
+The development notes in `development/` and maintainer-only review notes
+preserve more of that review trail. This vignette distills the user-facing
+design choices.
 
 ## The design in one paragraph
 
-MockData v0.4 normalizes inputs into `mock_spec`, validates that shape, generates
-baseline values through a native backend or optional `simstudy` backend, and
-then applies MockData-owned post-processing for missing codes, garbage values,
-and diagnostics. It keeps recodeflow metadata central, adds simpler direct APIs,
-and preserves the public `create_mock_data()` wrapper for compatibility. It is
-mock data for development and QA, not synthetic data for inference.
+MockData v0.4 normalizes inputs into `mock_spec`, validates that shape strictly
+by default, generates baseline values through a native backend or optional
+`simstudy` backend, and then applies MockData-owned post-processing for missing
+codes, garbage values, and diagnostics. It keeps recodeflow metadata central,
+adds simpler direct APIs, and preserves the public `create_mock_data()` wrapper
+for compatibility. It is mock data for development and QA, not synthetic data
+for inference.