
Commit 3b55616

Merge pull request #20 from patterninc/feature/snowflake-s3-stage-v1
Feature/snowflake s3 stage
2 parents ebd0fbf + 313b423 commit 3b55616

32 files changed

Lines changed: 3323 additions & 1543 deletions

.github/workflows/ci-cd-ds-platform-utils.yaml

Lines changed: 7 additions & 1 deletion
```diff
@@ -5,8 +5,14 @@ on:
   push:
     branches:
       - main
+    paths-ignore:
+      - "README.md"
+      - "docs/**"
   pull_request:
     types: [opened, synchronize]
+    paths-ignore:
+      - "README.md"
+      - "docs/**"
 
 jobs:
   check-version:
@@ -110,7 +116,7 @@ jobs:
           uv pip install --group dev
           COVERAGE_DIR="$(python -c 'import ds_platform_utils; print(ds_platform_utils.__path__[0])')"
           poe clean
-          poe test --cov="$COVERAGE_DIR" --no-cov
+          poe test --cov="$COVERAGE_DIR" --no-cov -n auto
 
   tag-version:
     needs: [check-version, code-quality-checks, build-wheel, execute-tests]
```

README.md

Lines changed: 10 additions & 2 deletions
```diff
@@ -1,3 +1,11 @@
-## ds-platform-utils
+# ds-platform-utils
+
+## Metaflow API Docs
+
+- [BatchInferencePipeline](docs/metaflow/batch_inference_pipeline.md)
+- [make_pydantic_parser_fn](docs/metaflow/make_pydantic_parser_fn.md)
+- [publish](docs/metaflow/publish.md)
+- [publish_pandas](docs/metaflow/publish_pandas.md)
+- [query_pandas_from_snowflake](docs/metaflow/query_pandas_from_snowflake.md)
+- [restore_step_state](docs/metaflow/restore_step_state.md)
 
-Utility library to support Pattern's [data-science-projects](https://github.com/patterninc/data-science-projects/).
```

docs/metaflow/batch_inference_pipeline.md

Lines changed: 205 additions & 0 deletions

# `BatchInferencePipeline`

Source: `ds_platform_utils.metaflow.batch_inference_pipeline.BatchInferencePipeline`

Utility class to orchestrate batch inference with Snowflake + S3 in Metaflow steps.

## Main methods

- `query_and_batch(...)`: export source data to S3 and create worker batches.
- `process_batch(...)`: run download → inference → upload for one worker.
- `publish_results(...)`: copy prediction outputs from S3 to Snowflake.
- `run(...)`: convenience method to execute the full flow sequentially.

## Detailed example (Metaflow foreach)

This example shows the intended 3-step pattern in a Metaflow `FlowSpec`:

1. `query_and_batch()` in `start`
2. `process_batch()` in `foreach`
3. `publish_results()` in `join`

```python
from metaflow import FlowSpec, step
import pandas as pd

from ds_platform_utils.metaflow import BatchInferencePipeline


def predict_fn(df: pd.DataFrame) -> pd.DataFrame:
    # Example model logic
    out = pd.DataFrame()
    out["id"] = df["id"]
    out["score"] = (df["feature_1"].fillna(0) * 0.7 + df["feature_2"].fillna(0) * 0.3).round(6)
    out["label"] = (out["score"] >= 0.5).astype(int)
    return out


class BatchPredictFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.query_and_batch)

    @step
    def query_and_batch(self):
        self.pipeline = BatchInferencePipeline()

        # Query can be inline SQL or a file path.
        # {schema} is provided by ds_platform_utils (DEV/PROD selection).
        self.worker_ids = self.pipeline.query_and_batch(
            input_query="""
                SELECT
                    id,
                    feature_1,
                    feature_2
                FROM {{schema}}.model_features
                WHERE ds = '2026-02-26'
            """,
            parallel_workers=8,
            warehouse="MED",
            use_utc=True,
        )

        self.next(self.process_batch, foreach="worker_ids")

    @step
    def process_batch(self):
        # In a foreach step, self.input contains one worker_id.
        self.pipeline.process_batch(
            worker_id=self.input,
            predict_fn=predict_fn,
            batch_size_in_mb=256,
            timeout_per_batch=300,
        )
        self.next(self.publish_results)

    @step
    def publish_results(self, inputs):
        # Reuse one pipeline object from the foreach branches.
        self.pipeline = inputs[0].pipeline

        self.pipeline.publish_results(
            output_table_name="MODEL_PREDICTIONS_DAILY",
            output_table_definition=[
                ("id", "NUMBER"),
                ("score", "FLOAT"),
                ("label", "NUMBER"),
            ],
            auto_create_table=True,
            overwrite=True,
            warehouse="MED",
            use_utc=True,
        )
        self.next(self.end)

    @step
    def end(self):
        print("Batch inference complete")
```

## Detailed example (single-step convenience)

Use `run()` when you do not need Metaflow foreach parallelization:

```python
# This snippet is a single step inside a Metaflow FlowSpec.
from metaflow import step
import pandas as pd

from ds_platform_utils.metaflow import BatchInferencePipeline


@step
def batch_inference_step(self):
    def predict_fn(df: pd.DataFrame) -> pd.DataFrame:
        return pd.DataFrame(
            {
                "id": df["id"],
                "score": (df["feature_1"] * 0.9).fillna(0),
            }
        )

    pipeline = BatchInferencePipeline()
    pipeline.run(
        input_query="""
            SELECT id, feature_1
            FROM {{schema}}.model_features
            WHERE ds = '2026-02-26'
        """,
        output_table_name="MODEL_PREDICTIONS_DAILY",
        predict_fn=predict_fn,
        output_table_definition=[("id", "NUMBER"), ("score", "FLOAT")],
        warehouse="XL",
    )

    self.next(self.end)
```

## Parameters

### `query_and_batch(...)`

| Parameter | Type | Required | Description |
| ------------------ | ------------- | -------: | ------------------------------------------------------------------------------------------------------------------------- |
| `input_query` | `str \| Path` | Yes | SQL query string or SQL file path used to fetch source rows. The `{schema}` placeholder is resolved by `ds_platform_utils`. |
| `ctx` | `dict` | No | Optional substitution map for templated SQL; merged with the internal `{"schema": ...}` mapping before query execution. |
| `warehouse` | `str` | No | Snowflake warehouse used to execute the source query/export. |
| `use_utc` | `bool` | No | If `True`, uses UTC timestamps/paths for partitioning and run metadata. |
| `parallel_workers` | `int` | No | Number of worker partitions to create for downstream processing. |

**Returns:** `list[int]` of `worker_id` values for Metaflow `foreach`.

---

### `process_batch(...)`

| Parameter | Type | Required | Description |
| ------------------- | ---------------------------------------- | -------: | ------------------------------------------------------------------------------------------------------------ |
| `worker_id` | `int` | Yes | Worker partition identifier generated by `query_and_batch()`. |
| `predict_fn` | `Callable[[pd.DataFrame], pd.DataFrame]` | Yes | Inference function applied to each input chunk. Must return a DataFrame matching the expected output schema. |
| `batch_size_in_mb` | `int` | No | Target chunk size for reading/processing batch files. |
| `timeout_per_batch` | `int` | No | Maximum processing time per batch, in seconds (used for queuing operations). |

**Returns:** `None`

**Recommended**: Tune `batch_size_in_mb` for Outerbounds Small tasks (3 CPU, 15 GB memory), which are about 6x more cost-effective than Medium tasks.

## Limitations

- The pipeline uses Snowflake ↔ S3 stage copy operations, so some column data types may be inferred differently than expected.
- For predictable output types, provide an explicit `output_table_definition` in `publish_results(...)` / `run(...)` and cast columns in `predict_fn` as needed, as in the sketch below.
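
A minimal sketch of a `predict_fn` that casts its output to match an explicit table definition; the column names and types here are illustrative, not part of the library API:

```python
import pandas as pd


def predict_fn(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame()
    # Cast explicitly so staged parquet types line up with a (hypothetical)
    # output_table_definition of [("id", "NUMBER"), ("score", "FLOAT")].
    out["id"] = df["id"].astype("int64")
    out["score"] = (df["feature_1"].fillna(0) * 0.7).astype("float64")
    return out
```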

### `publish_results(...)`

| Parameter | Type | Required | Description |
| ------------------------- | ------------------------------- | -------: | ------------------------------------------------------------------ |
| `output_table_name` | `str` | Yes | Destination Snowflake table for predictions. |
| `output_table_definition` | `list[tuple[str, str]] \| None` | No | Optional output schema as `(column_name, snowflake_type)` tuples. |
| `auto_create_table` | `bool` | No | If `True`, creates the destination table when missing. |
| `overwrite` | `bool` | No | If `True`, replaces existing table data before loading results. |
| `warehouse` | `str` | No | Snowflake warehouse used for load/publish operations. |
| `use_utc` | `bool` | No | If `True`, uses UTC for load metadata/time handling. |

**Returns:** `None`

---

### `run(...)` (convenience method)

Runs `query_and_batch()` → `process_batch()` → `publish_results()` in a single sequential call.

| Parameter | Type | Required | Description |
| ------------------------- | ---------------------------------------- | -------: | -------------------------------------------------------------------------------------------------------------------------- |
| `input_query` | `str \| Path` | Yes | SQL query string or SQL file path used to fetch source rows. The `{schema}` placeholder is resolved by `ds_platform_utils`. |
| `output_table_name` | `str` | Yes | Destination Snowflake table for predictions. |
| `predict_fn` | `Callable[[pd.DataFrame], pd.DataFrame]` | Yes | Inference function applied to each input chunk. Must return a DataFrame matching the expected output schema. |
| `ctx` | `dict` | No | Optional substitution map for templated SQL; merged with the internal `{"schema": ...}` mapping before query execution. |
| `output_table_definition` | `list[tuple[str, str]] \| None` | No | Optional output schema as `(column_name, snowflake_type)` tuples. |
| `batch_size_in_mb` | `int` | No | Target chunk size for reading/processing batch files. |
| `timeout_per_batch` | `int` | No | Maximum processing time per batch, in seconds (used for queuing operations). |
| `auto_create_table` | `bool` | No | If `True`, creates the destination table when missing. |
| `overwrite` | `bool` | No | If `True`, replaces existing table data before loading results. |
| `warehouse` | `str` | No | Snowflake warehouse used for load/publish operations. |
| `use_utc` | `bool` | No | If `True`, uses UTC for load metadata/time handling. |

**Returns:** `None`

**Recommended**: Tune `batch_size_in_mb` for Outerbounds Small tasks (3 CPU, 15 GB memory), which are about 6x more cost-effective than Medium tasks.

docs/metaflow/make_pydantic_parser_fn.md

Lines changed: 37 additions & 0 deletions

# `make_pydantic_parser_fn`

Source: `ds_platform_utils.metaflow.validate_config.make_pydantic_parser_fn`

Creates a Metaflow `Config(..., parser=...)` parser backed by a Pydantic model.

## Signature

```python
make_pydantic_parser_fn(
    pydantic_model: type[BaseModel],
) -> Callable[[str], dict]
```

## What it does

- Parses config content as JSON, TOML, or YAML.
- Validates and normalizes with Pydantic.
- Returns a dict with defaults applied from the model.

## Parameters

| Parameter | Type | Required | Description |
| ---------------- | ----------------- | -------: | -------------------------------------------------------------------- |
| `pydantic_model` | `type[BaseModel]` | Yes | Pydantic model class used to validate and normalize config content. |

**Returns:** `Callable[[str], dict]` parser function for Metaflow `Config(..., parser=...)`.

## Typical usage

```python
config: MyConfig = Config(
    name="config",
    default="./configs/default.yaml",
    parser=make_pydantic_parser_fn(MyConfig),
)
```
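
A fuller sketch of the same pattern inside a flow; `TrainConfig`, its fields, and the YAML path below are hypothetical, and attribute access on the parsed config is assumed:

```python
from metaflow import Config, FlowSpec, step
from pydantic import BaseModel

from ds_platform_utils.metaflow import make_pydantic_parser_fn


class TrainConfig(BaseModel):  # hypothetical config model
    table_name: str
    learning_rate: float = 0.01  # default applied when missing from the file


class TrainFlow(FlowSpec):
    # The YAML/JSON/TOML file is parsed, validated against TrainConfig,
    # and exposed on the flow with model defaults applied.
    config = Config(
        name="config",
        default="./configs/default.yaml",
        parser=make_pydantic_parser_fn(TrainConfig),
    )

    @step
    def start(self):
        print(self.config.table_name, self.config.learning_rate)
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    TrainFlow()
```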

docs/metaflow/publish.md

Lines changed: 49 additions & 0 deletions

# `publish`

Source: `ds_platform_utils.metaflow.write_audit_publish.publish`

Publishes data to a Snowflake table using the write-audit-publish (WAP) pattern.

## Signature

```python
publish(
    table_name: str,
    query: str | Path,
    audits: list[str | Path] | None = None,
    ctx: dict[str, Any] | None = None,
    warehouse: Literal["XS", "MED", "XL"] = None,
    use_utc: bool = True,
) -> None
```

## What it does

- Reads SQL from a string or `.sql` path.
- Runs write/audit/publish operations through Snowflake.
- Adds operation details and table links to the Metaflow card when available.

## Parameters

| Parameter | Type | Required | Description |
| ------------ | ------------------------------------ | -------: | --------------------------------------------------------------------------------------------------------------- |
| `table_name` | `str` | Yes | Destination Snowflake table name for the publish operation. |
| `query` | `str \| Path` | Yes | SQL query text or path to SQL file that produces the table data. |
| `audits` | `list[str \| Path] \| None` | No | Optional SQL audits (strings or file paths) executed as validation checks. |
| `ctx` | `dict[str, Any] \| None` | No | Optional template substitution context for SQL operations. |
| `warehouse` | `Literal["XS", "MED", "XL"] \| None` | No | Snowflake warehouse override for this operation. Supports `XS`/`MED`/`XL` shortcuts or a full warehouse name. |
| `use_utc` | `bool` | No | If `True`, uses UTC timezone for Snowflake session. |

**Returns:** `None`

## Typical usage

```python
from ds_platform_utils.metaflow import publish

publish(
    table_name="MY_TABLE",
    query="SELECT * FROM PATTERN_DB.{{schema}}.SOURCE",
    audits=["SELECT COUNT(*) > 0 FROM PATTERN_DB.{{schema}}.{{table_name}}"],
)
```
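
A sketch of the same call using SQL files and a template context instead of inline strings; the file paths and the `run_date` key are hypothetical values passed through `ctx`:

```python
from pathlib import Path

from ds_platform_utils.metaflow import publish

publish(
    table_name="MY_TABLE",
    # .sql files are read from disk; {{run_date}} is a hypothetical template
    # key resolved from ctx alongside the built-in {{schema}} mapping.
    query=Path("sql/build_my_table.sql"),
    audits=[Path("sql/audits/row_count.sql")],
    ctx={"run_date": "2026-02-26"},
    warehouse="MED",
)
```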

docs/metaflow/publish_pandas.md

Lines changed: 58 additions & 0 deletions

# `publish_pandas`

Source: `ds_platform_utils.metaflow.pandas.publish_pandas`

Writes a pandas DataFrame to Snowflake.

## Signature

```python
publish_pandas(
    table_name: str,
    df: pd.DataFrame,
    add_created_date: bool = False,
    chunk_size: int | None = None,
    compression: Literal["snappy", "gzip"] = "snappy",
    warehouse: Literal["XS", "MED", "XL"] = None,
    parallel: int = 4,
    quote_identifiers: bool = False,
    auto_create_table: bool = False,
    overwrite: bool = False,
    use_logical_type: bool = True,
    use_utc: bool = True,
    use_s3_stage: bool = False,
    table_definition: list[tuple[str, str]] | None = None,
) -> None
```

## What it does

- Validates the DataFrame input.
- Writes directly via `write_pandas`, or via the S3 stage flow for large data.
- Adds a Snowflake table URL to the Metaflow card output.

## Parameters

| Parameter | Type | Required | Description |
| ------------------- | ------------------------------- | -------: | ----------------------------------------------------------------------------------------------------------------- |
| `table_name` | `str` | Yes | Destination Snowflake table name. |
| `df` | `pd.DataFrame` | Yes | DataFrame to publish. |
| `add_created_date` | `bool` | No | If `True`, adds a `created_date` UTC timestamp column before publishing. |
| `chunk_size` | `int \| None` | No | Number of rows per uploaded chunk. If not provided, it is calculated from the DataFrame size. |
| `compression` | `Literal["snappy", "gzip"]` | No | Compression codec used for staged parquet files. |
| `warehouse` | `str \| None` | No | Snowflake warehouse override for this operation. Supports `XS`/`MED`/`XL` shortcuts or a full warehouse name. |
| `parallel` | `int` | No | Number of upload threads used by the `write_pandas` path. |
| `quote_identifiers` | `bool` | No | If `False`, passes identifiers unquoted so Snowflake applies uppercase coercion. |
| `auto_create_table` | `bool` | No | If `True`, creates the destination table when missing. |
| `overwrite` | `bool` | No | If `True`, replaces existing table contents. |
| `use_logical_type` | `bool` | No | Controls parquet logical type handling when loading data. |
| `use_utc` | `bool` | No | If `True`, uses UTC timezone for the Snowflake session. |
| `use_s3_stage` | `bool` | No | If `True`, publishes via the S3 stage flow; otherwise uses direct `write_pandas`. |
| `table_definition` | `list[tuple[str, str]] \| None` | No | Optional Snowflake table schema; used by the S3 stage flow when table creation is needed. |

**Returns:** `None`

## Limitations

- When `use_s3_stage=True`, some column data types may not map exactly as expected between pandas/parquet and Snowflake.
- If needed, provide an explicit `table_definition` and/or cast columns before publishing to avoid data type mismatches (see the sketch below).
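
A minimal usage sketch; the table and column names are illustrative, and the second call shows the S3 stage path with an explicit `table_definition` to pin column types:

```python
import pandas as pd

from ds_platform_utils.metaflow import publish_pandas

df = pd.DataFrame({"id": [1, 2, 3], "score": [0.12, 0.87, 0.55]})

# Direct write_pandas path for small/medium DataFrames.
publish_pandas(
    table_name="MY_PREDICTIONS",
    df=df,
    add_created_date=True,
    auto_create_table=True,
    overwrite=True,
    warehouse="XS",
)

# S3 stage path with explicit casts and table definition.
publish_pandas(
    table_name="MY_PREDICTIONS",
    df=df.astype({"id": "int64", "score": "float64"}),
    use_s3_stage=True,
    table_definition=[("id", "NUMBER"), ("score", "FLOAT")],
    auto_create_table=True,
    overwrite=True,
    warehouse="MED",
)
```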
