feat: support explicit split inputs and nn epoch overrides #189
Conversation
Greptile Summary

This PR introduces explicit train/val/test split injection into `prepare_data`. Key changes and observations:
Confidence Score: 4/5
| Filename | Overview |
|---|---|
| `plexe/main.py` | Major refactor: adds train/val/test split inputs, nn epoch overrides (replacing `max_epochs`), and val/test dataset normalization. Normalizes the val/test base path off `train_dataset_uri` (already normalized) rather than `resolved_train_input_uri`, which is harmless for `StandaloneIntegration` but semantically imprecise. The `max_epochs` Python API parameter is removed without a backward-compat shim, which is a breaking change for existing callers. |
| `plexe/workflow.py` | Comprehensive refactor of `prepare_data` to support explicit val/test split injection with graceful generation of missing splits. The explicit test-dataset schema validation (target column presence, feature overlap check) has been removed and replaced with a TODO comment, which is a regression in early-error coverage for users providing incompatible split datasets. The refactored `_sanitize_aux_dataset_column_names` helper correctly handles column mapping and collision detection. |
| `plexe/agents/dataset_splitter.py` | Clean refactor to conditionally shape splitter instructions and task prompt based on whether a test split is expected (2-way vs 3-way). `canonicalize_split_ratios` is now called upfront in both `_build_agent` and `run`, and the return type annotation is correctly updated to `tuple[str, str, str \| None]`. |
| `tests/unit/workflow/test_prepare_data_explicit_splits.py` | Good coverage of the three primary explicit-split scenarios: all splits provided (splitter skipped), val-only with test generation, and test-only with val generation. The updated regression assertion for the generated subdirectory path correctly guards against the previous val-overwrite bug. |
| `tests/unit/test_main_dataset_inputs.py` | Comprehensive unit tests for `main()` dataset resolution: `train_dataset_uri` priority over `data_refs`, auto-enabling final evaluation when test is provided, max-epoch clamping, `data_refs` fallback, and hard failure when neither is supplied. Well-structured with a reusable patch helper. |
Comments Outside Diff (1)
- `plexe/workflow.py`, lines 1036-1037: **Schema validation silently removed for explicit splits**

  The old code, when an explicit test dataset was provided, validated:
  1. That the target column existed in the test dataset.
  2. That all training features were present in the test dataset (raising a clear `ValueError` with the list of missing features).

  The replacement is a TODO with no equivalent check for `val_dataset_uri` or the new `test_dataset_uri` path. Users who accidentally swap columns, provide the wrong file, or pass a bare-features test set (without the target) will now only discover the issue much later, during final evaluation, with a much harder-to-diagnose error.

  Consider at minimum re-adding the target-column check for `test_uri` when `enable_final_evaluation` is `True`, even if the full feature-overlap validation is deferred:

  ```python
  if test_uri and context.output_targets:
      test_df = spark.read.parquet(test_uri)
      if context.output_targets[0] not in test_df.columns:
          raise ValueError(
              f"Test dataset missing target column '{context.output_targets[0]}'. "
              f"Test columns: {test_df.columns}"
          )
  ```
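If the full check is restored later, the feature-overlap half can be expressed as a Spark-independent helper over plain column lists, which also makes it easy to unit-test. `validate_test_schema` is a hypothetical name for illustration, not an existing plexe function:

```python
def validate_test_schema(
    test_columns: list[str], train_features: list[str], target: str
) -> None:
    """Fail fast when an explicit test dataset is schema-incompatible with training.

    Hypothetical sketch combining both removed checks: target-column presence
    and training-feature overlap, each raising a descriptive ValueError.
    """
    if target not in test_columns:
        raise ValueError(
            f"Test dataset missing target column '{target}'. "
            f"Test columns: {test_columns}"
        )
    missing = [f for f in train_features if f not in test_columns]
    if missing:
        raise ValueError(f"Test dataset missing training features: {missing}")
```

The same helper could then be applied to `val_dataset_uri` as well, closing the gap the TODO leaves for both injected splits.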
Last reviewed commit: 3f49497
Pull request overview
This PR expands plexe’s dataset handling to support explicit train/val/test inputs (including partial provision with generated missing splits), and updates the main entrypoint to use explicit neural-network epoch override knobs.
Changes:
- Add explicit `train_dataset_uri`/`val_dataset_uri`/`test_dataset_uri` handling in `main()` (with deprecated `data_refs` fallback) and normalize optional split inputs.
- Extend Phase 2 (`prepare_data`) to accept injected val/test splits and only generate missing splits from the training dataset.
- Update `DatasetSplitterAgent` prompts/return typing to support 2-way vs 3-way split modes; add new unit tests and update integration test helpers/Makefile target.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `plexe/workflow.py` | Adds optional val input plumbed through `build_model`/`prepare_data`, explicit-split materialization, and aux dataset sanitization. |
| `plexe/main.py` | Introduces explicit split inputs, dataset input resolution (train preferred; `data_refs` deprecated), and new epoch override args/CLI flags. |
| `plexe/agents/dataset_splitter.py` | Makes split prompts conditional (2-way vs 3-way) and updates return type to allow `test_uri=None`. |
| `tests/unit/workflow/test_prepare_data_explicit_splits.py` | Adds coverage for explicit split resolution and missing-split generation behaviors. |
| `tests/unit/test_main_dataset_inputs.py` | Adds coverage for `main()` dataset input precedence and final-eval auto-enable on explicit test input. |
| `tests/unit/agents/test_dataset_splitter_prompt.py` | Adds prompt-level checks for 2-way vs 3-way splitter instructions. |
| `tests/integration/conftest.py` | Updates `build_model(...)` calls to pass `val_dataset_uri=None` explicitly. |
| `Makefile` | Adds a runnable example target for explicit train+test input usage. |
| `plexe/CODE_INDEX.md` / `tests/CODE_INDEX.md` | Regenerates code index entries for new/updated APIs and tests. |
@greptileai please review again with latest changes
This PR adds explicit validation/test split support in `plexe` data preparation and updates entrypoint epoch overrides to `nn_default_epochs`/`nn_max_epochs`, while preserving backward compatibility for existing dataset input usage. The aim is to let callers provide train/val/test inputs directly (or partially) with predictable split generation behavior and cleaner neural-network epoch controls from the CLI/function entrypoint.

Testing
```shell
poetry run pytest tests/unit/agents/test_dataset_splitter_prompt.py tests/unit/test_main_dataset_inputs.py tests/unit/workflow/test_prepare_data_explicit_splits.py tests/unit/workflow/test_column_exclusion.py
poetry run ruff check plexe/main.py plexe/workflow.py plexe/agents/dataset_splitter.py tests/unit/test_main_dataset_inputs.py tests/unit/workflow/test_prepare_data_explicit_splits.py tests/unit/agents/test_dataset_splitter_prompt.py tests/integration/conftest.py
```