
feat: support explicit split inputs and nn epoch overrides #189

Merged
marcellodebernardi merged 4 commits into main from codex/port-explicit-splits-and-nn-epoch-overrides
Mar 5, 2026

Conversation

@marcellodebernardi
Contributor

This PR adds explicit validation/test split support in plexe data preparation and updates entrypoint epoch overrides to nn_default_epochs/nn_max_epochs, while preserving backward compatibility for existing dataset input usage. The aim is to let callers provide train/val/test inputs directly (or partially) with predictable split generation behavior and cleaner neural-network epoch controls from the CLI/function entrypoint.

Testing

  • poetry run pytest tests/unit/agents/test_dataset_splitter_prompt.py tests/unit/test_main_dataset_inputs.py tests/unit/workflow/test_prepare_data_explicit_splits.py tests/unit/workflow/test_column_exclusion.py
  • poetry run ruff check plexe/main.py plexe/workflow.py plexe/agents/dataset_splitter.py tests/unit/test_main_dataset_inputs.py tests/unit/workflow/test_prepare_data_explicit_splits.py tests/unit/agents/test_dataset_splitter_prompt.py tests/integration/conftest.py

Copilot AI review requested due to automatic review settings March 5, 2026 20:59
@greptile-apps
Contributor

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR introduces explicit train/val/test split injection into prepare_data and build_model, renames the max_epochs entrypoint parameter to the cleaner nn_default_epochs/nn_max_epochs pair (with the previously-flagged clamp guard restored), and provides backward compatibility for the deprecated data_refs dataset input.

Key changes and observations:

  • Explicit split routing logic (workflow.py): The four combination cases (both provided, val-only, test-only, neither) are cleanly handled. The val-only + generate-test branch correctly writes splitter output to a generated/ subdirectory to avoid overwriting the user's val.parquet, addressing the previously-flagged collision.
  • Schema validation removed (workflow.py): The old code validated target-column presence and feature-set compatibility when an explicit test dataset was provided. The replacement is a TODO comment, meaning users who supply an incompatible test dataset will receive a confusing late-stage error instead of an early ValueError.
  • max_epochs Python API removed without a deprecation shim (main.py): The data_refs parameter received a deprecation warning path, but max_epochs was dropped entirely. External Python callers using the keyword argument will get a TypeError rather than a deprecation notice.
  • normalized_output_base derived from already-normalized URI (main.py): A minor semantic imprecision — the base path for val/test normalization is computed using train_dataset_uri (the already-normalized output) rather than resolved_train_input_uri. Harmless for StandaloneIntegration but potentially confusing for custom integrations.
  • Splitter prompt conditioning (dataset_splitter.py): The 2-way vs 3-way split distinction is now correctly propagated into both the system instructions and the task prompt, preventing the splitter from producing an accidental third holdout file in 2-way mode.
  • Test coverage: New unit tests cover all four explicit-split combinations in prepare_data, the main() dataset-input resolution paths, epoch-clamp behavior, and splitter prompt conditioning.
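The four-way routing described in the first bullet can be sketched as follows. This is an illustrative outline only, not plexe's actual prepare_data code; the names (resolve_splits, splitter, need_val/need_test) are invented for the example.

```python
from pathlib import Path

def resolve_splits(train_uri, val_uri, test_uri, splitter, out_dir="work"):
    """Return (train, val, test) URIs, invoking `splitter` only for missing splits.

    `splitter(train_uri, out_dir, need_val, need_test)` stands in for the
    DatasetSplitterAgent and returns a (train, val, test) tuple, with None in
    positions that were not requested.
    """
    if val_uri and test_uri:
        # Both provided: the splitter is skipped entirely.
        return train_uri, val_uri, test_uri
    if val_uri:
        # Val provided: generate only a test split, writing into a generated/
        # subdirectory so the user's val file cannot be overwritten.
        gen_dir = str(Path(out_dir) / "generated")
        _, _, gen_test = splitter(train_uri, gen_dir, need_val=False, need_test=True)
        return train_uri, val_uri, gen_test
    if test_uri:
        # Test provided: generate only a val split.
        new_train, gen_val, _ = splitter(train_uri, out_dir, need_val=True, need_test=False)
        return new_train, gen_val, test_uri
    # Neither provided: full 3-way split, as before this PR.
    return splitter(train_uri, out_dir, need_val=True, need_test=True)
```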

Confidence Score: 4/5

  • Safe to merge with minor regressions in schema validation and Python API backward compatibility worth addressing before a customer-facing release.
  • The core logic is sound, well-tested, and the previously-flagged val-overwrite and epoch-clamp bugs have been correctly resolved. The two remaining concerns — removal of schema compatibility validation and the undeprecated max_epochs parameter — are both regressions relative to the base branch rather than net-new bugs, which keeps the score from dropping lower.
  • plexe/workflow.py (schema validation removal) and plexe/main.py (max_epochs API break + normalized_output_base URI)
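One way to address the max_epochs API break flagged above is a keyword shim. The following is a hypothetical sketch; the signature and return value are illustrative, not plexe's actual build_model API.

```python
import warnings

def build_model(*, nn_default_epochs=None, nn_max_epochs=None, max_epochs=None, **kwargs):
    """Sketch of a backward-compat shim for the removed `max_epochs` keyword.

    Accepts the deprecated keyword, emits a DeprecationWarning, and maps it
    onto nn_max_epochs when the new parameter is not also set.
    """
    if max_epochs is not None:
        warnings.warn(
            "max_epochs is deprecated; use nn_max_epochs instead",
            DeprecationWarning,
            stacklevel=2,
        )
        if nn_max_epochs is None:
            nn_max_epochs = max_epochs
    # Illustrative return so the mapping can be observed; the real function
    # would proceed to build the model with these values.
    return nn_default_epochs, nn_max_epochs
```

With this in place, legacy callers passing `max_epochs=` get a deprecation notice instead of a TypeError, and the new keyword wins when both are supplied.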

Important Files Changed

Filename Overview
plexe/main.py Major refactor: adds train/val/test split inputs, nn epoch overrides (replacing max_epochs), and val/test dataset normalization. Normalizes the val/test base path off train_dataset_uri (already-normalized) rather than resolved_train_input_uri, which is harmless for StandaloneIntegration but semantically imprecise. The max_epochs Python API parameter is removed without a backward-compat shim, which is a breaking change for existing callers.
plexe/workflow.py Comprehensive refactor of prepare_data to support explicit val/test split injection with graceful generation of missing splits. The explicit test-dataset schema validation (target column presence, feature overlap check) has been removed and replaced with a TODO comment, which is a regression in early-error coverage for users providing incompatible split datasets. The refactored _sanitize_aux_dataset_column_names helper correctly handles column mapping and collision detection.
plexe/agents/dataset_splitter.py Clean refactor to conditionally shape splitter instructions and task prompt based on whether a test split is expected (2-way vs 3-way). canonicalize_split_ratios is now called upfront in both _build_agent and run, and the return type annotation is correctly updated to `tuple[str, str, str | None]` to reflect that the test URI may be absent.
tests/unit/workflow/test_prepare_data_explicit_splits.py Good coverage of the three primary explicit-split scenarios: all splits provided (splitter skipped), val-only with test generation, and test-only with val generation. The updated regression assertion for the generated subdirectory path correctly guards against the previous val-overwrite bug.
tests/unit/test_main_dataset_inputs.py Comprehensive unit tests for main() dataset resolution: train_dataset_uri priority over data_refs, auto-enabling final evaluation when test is provided, max-epoch clamping, data_refs fallback, and hard failure when neither is supplied. Well-structured with a reusable patch helper.
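The max-epoch clamping exercised by those tests can be sketched as a small helper, assuming (hypothetically) that the default epoch count is clamped to the configured maximum when both are set; the function name is invented for the example.

```python
def resolve_epochs(nn_default_epochs=None, nn_max_epochs=None):
    """Clamp the default epoch count to the configured maximum, if both are set.

    Returns the (default, max) pair after clamping. Illustrative only; the
    real resolution logic lives in plexe/main.py.
    """
    if nn_default_epochs is not None and nn_max_epochs is not None:
        return min(nn_default_epochs, nn_max_epochs), nn_max_epochs
    return nn_default_epochs, nn_max_epochs
```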

Comments Outside Diff (1)

  1. plexe/workflow.py, lines 1036-1037

    Schema validation silently removed for explicit splits

    The old code, when an explicit test dataset was provided, validated:

    1. That the target column existed in the test dataset.
    2. That all training features were present in the test dataset (raising a clear ValueError with the list of missing features).

    The replacement is a TODO with no equivalent check for val_dataset_uri or the new test_dataset_uri path. Users who accidentally swap columns, provide the wrong file, or pass a bare-features test set (without the target) will now only discover the issue much later — during final evaluation — with a much harder-to-diagnose error.

    Consider at minimum re-adding the target-column check for test_uri when enable_final_evaluation is True, even if the full feature-overlap validation is deferred:

```python
if test_uri and context.output_targets:
    test_df = spark.read.parquet(test_uri)
    if context.output_targets[0] not in test_df.columns:
        raise ValueError(
            f"Test dataset missing target column '{context.output_targets[0]}'. "
            f"Test columns: {test_df.columns}"
        )
```
Last reviewed commit: 3f49497

Contributor

Copilot AI left a comment


Pull request overview

This PR expands plexe’s dataset handling to support explicit train/val/test inputs (including partial provision with generated missing splits), and updates the main entrypoint to use explicit neural-network epoch override knobs.

Changes:

  • Add explicit train_dataset_uri/val_dataset_uri/test_dataset_uri handling in main() (with deprecated data_refs fallback) and normalize optional split inputs.
  • Extend Phase 2 (prepare_data) to accept injected val/test splits and only generate missing splits from the training dataset.
  • Update DatasetSplitterAgent prompts/return typing to support 2-way vs 3-way split modes; add new unit tests and update integration test helpers/Makefile target.
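The 2-way vs 3-way prompt conditioning in the last bullet can be illustrated with a minimal sketch. The wording below is invented for the example and is not the actual DatasetSplitterAgent instruction text.

```python
def build_splitter_prompt(want_test_split):
    """Return task instructions conditioned on whether a test split is expected.

    In 2-way mode the prompt explicitly forbids a third holdout file, which is
    the behavior this PR's prompt-level tests check for.
    """
    if want_test_split:
        return (
            "Split the dataset into train/val/test files using the "
            "canonical split ratios."
        )
    return (
        "Split the dataset into train/val files only; do NOT produce a "
        "test holdout file."
    )
```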

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
plexe/workflow.py Adds optional val input plumbed through build_model/prepare_data, explicit-split materialization, and aux dataset sanitization.
plexe/main.py Introduces explicit split inputs, dataset input resolution (train preferred; data_refs deprecated), and new epoch override args/CLI flags.
plexe/agents/dataset_splitter.py Makes split prompts conditional (2-way vs 3-way) and updates return type to allow test_uri=None.
tests/unit/workflow/test_prepare_data_explicit_splits.py Adds coverage for explicit split resolution and missing-split generation behaviors.
tests/unit/test_main_dataset_inputs.py Adds coverage for main() dataset input precedence and final-eval auto-enable on explicit test input.
tests/unit/agents/test_dataset_splitter_prompt.py Adds prompt-level checks for 2-way vs 3-way splitter instructions.
tests/integration/conftest.py Updates build_model(...) calls to pass val_dataset_uri=None explicitly.
Makefile Adds a runnable example target for explicit train+test input usage.
plexe/CODE_INDEX.md / tests/CODE_INDEX.md Regenerates code index entries for new/updated APIs and tests.


@marcellodebernardi
Contributor Author

@greptileai please review again with latest changes


@marcellodebernardi marcellodebernardi merged commit a1e05f6 into main Mar 5, 2026
13 checks passed
@marcellodebernardi marcellodebernardi deleted the codex/port-explicit-splits-and-nn-epoch-overrides branch March 5, 2026 21:56