Commit a1e05f6
feat: support explicit split inputs and nn epoch overrides (#189)

* feat: support explicit split inputs and nn epoch overrides
* chore: bump version to 1.4.4
* fix: avoid explicit val overwrite when generating missing test
* fix: clamp default epochs on max-only override

1 parent b2ebc97 · commit a1e05f6

11 files changed: 954 additions & 267 deletions
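The last fix in the message ("clamp default epochs on max-only override") implies that when only `nn_max_epochs` is overridden, the built-in default epoch count is lowered to stay within it. A minimal sketch of that resolution logic, assuming a hypothetical built-in default of 10 (the real constant is not shown in this diff):

```python
def resolve_nn_epochs(nn_default_epochs: int | None, nn_max_epochs: int | None) -> int:
    """Hypothetical reconstruction of the epoch-override resolution described in the message."""
    default = nn_default_epochs if nn_default_epochs is not None else 10  # assumed built-in default
    if nn_max_epochs is not None:
        # Clamp so a max-only override cannot be exceeded by the untouched default.
        default = min(default, nn_max_epochs)
    return default
```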

Makefile

Lines changed: 28 additions & 0 deletions
@@ -45,6 +45,7 @@ help:
 	@echo ""
 	@echo "📊 Example Datasets:"
 	@echo "  make run-titanic                       Run on Titanic dataset (medium)"
+	@echo "  make run-titanic-explicit-test-split   Run Titanic with explicit train+test inputs"
 	@echo "  make run-titanic-proba                 Run Titanic with probability-focused intent"
 	@echo "  make run-house-prices                  Run on House Prices dataset (regression)"
 	@echo ""
@@ -332,6 +333,33 @@ run-titanic: build
 		--spark-mode local \
 		--enable-final-evaluation
 
+# Spaceship Titanic dataset with explicit test split input
+.PHONY: run-titanic-explicit-test-split
+run-titanic-explicit-test-split: build
+	@echo "📊 Running on Spaceship Titanic dataset (explicit train + test splits)..."
+	$(eval TIMESTAMP := $(shell date +%Y%m%d_%H%M%S))
+	docker run --rm \
+		--add-host=host.docker.internal:host-gateway \
+		$(CONFIG_MOUNT) \
+		$(CONFIG_ENV) \
+		-v $(PWD)/examples/datasets:/data:ro \
+		-v $(PWD)/workdir:/workdir \
+		-e OPENAI_API_KEY=$(OPENAI_API_KEY) \
+		-e ANTHROPIC_API_KEY=$(ANTHROPIC_API_KEY) \
+		-e SPARK_LOCAL_CORES=4 \
+		-e SPARK_DRIVER_MEMORY=4g \
+		plexe:py$(PYTHON_VERSION) \
+		python -m plexe.main \
+		--train-dataset-uri /data/spaceship-titanic/train.parquet \
+		--test-dataset-uri /data/spaceship-titanic/test.csv \
+		--user-id dev_user \
+		--intent "predict whether a passenger was transported" \
+		--experiment-id titanic_explicit_test \
+		--max-iterations 10 \
+		--work-dir /workdir/titanic_explicit_test/$(TIMESTAMP) \
+		--spark-mode local \
+		--enable-final-evaluation
+
 # Spaceship Titanic dataset with probability-focused objective
 .PHONY: run-titanic-proba
 run-titanic-proba: build
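Since the new target feeds two separately supplied files into one run, it can be worth sanity-checking that they share a feature schema before invoking it. A sketch using PySpark (already a plexe dependency per the signatures below), with host-side paths matching the `/data` mount above; adjust if your layout differs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

# Host-side equivalents of the URIs passed in the make target.
train = spark.read.parquet("examples/datasets/spaceship-titanic/train.parquet")
test = spark.read.csv("examples/datasets/spaceship-titanic/test.csv", header=True, inferSchema=True)

# An explicit test split may legitimately lack the label column,
# so compare feature columns rather than demanding exact equality.
missing = set(train.columns) - set(test.columns)
print(f"columns in train but not in test: {missing}")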

plexe/CODE_INDEX.md

Lines changed: 5 additions & 5 deletions
@@ -1,6 +1,6 @@
 # Code Index: plexe
 
-> Generated on 2026-03-03 05:08:33
+> Generated on 2026-03-05 21:32:55
 
 Code structure and public interface documentation for the **plexe** package.
 
@@ -17,7 +17,7 @@ Dataset Splitter Agent.
 
 **`DatasetSplitterAgent`** - Agent that generates PySpark code for intelligent dataset splitting.
 - `__init__(self, spark: SparkSession, dataset_uri: str, context: BuildContext, config: Config)`
-- `run(self, split_ratios: dict[str, float], output_dir: str | Path) -> tuple[str, str, str]` - Generate and execute intelligent dataset splitting.
+- `run(self, split_ratios: dict[str, float], output_dir: str | Path) -> tuple[str, str, str | None]` - Generate and execute intelligent dataset splitting.
 
 ---
 ## `agents/feature_processor.py`
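The widened return type means the third element of `run`'s result can now be absent, presumably when the caller supplied an explicit test set and no test split was generated. A sketch of defensive handling at a hypothetical call site, assuming the tuple is ordered (train, val, test); `splitter`, `output_dir`, and `explicit_test_dataset_uri` are illustrative names, not plexe API:

```python
# Hypothetical call site; splitter and output_dir come from the surrounding workflow.
train_uri, val_uri, test_uri = splitter.run(
    split_ratios={"train": 0.8, "val": 0.1, "test": 0.1},
    output_dir=output_dir,
)
if test_uri is None:
    # No test split was generated because an explicit test set was provided
    # upstream; fall back to the user-supplied URI rather than overwriting it.
    test_uri = explicit_test_dataset_uri
```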
@@ -306,7 +306,7 @@ Amazon S3 storage helper.
 Universal entry point for plexe.
 
 **Functions:**
-- `main(intent: str, data_refs: list[str], integration: WorkflowIntegration | None, spark_mode: str, user_id: str, experiment_id: str, max_iterations: int, global_seed: int | None, work_dir: Path, test_dataset_uri: str | None, enable_final_evaluation: bool, max_epochs: int | None, allowed_model_types: list[str] | None, is_retrain: bool, original_model_uri: str | None, original_experiment_id: str | None, auto_mode: bool, user_feedback: dict | None, enable_otel: bool, otel_endpoint: str | None, otel_headers: dict[str, str] | None, external_storage_uri: str | None, csv_delimiter: str, csv_header: bool)` - Main model building function.
+- `main(intent: str, data_refs: list[str] | None, integration: WorkflowIntegration | None, spark_mode: str, user_id: str, experiment_id: str, max_iterations: int, global_seed: int | None, work_dir: Path, train_dataset_uri: str | None, val_dataset_uri: str | None, test_dataset_uri: str | None, enable_final_evaluation: bool, nn_default_epochs: int | None, nn_max_epochs: int | None, allowed_model_types: list[str] | None, is_retrain: bool, original_model_uri: str | None, original_experiment_id: str | None, auto_mode: bool, user_feedback: dict | None, enable_otel: bool, otel_endpoint: str | None, otel_headers: dict[str, str] | None, external_storage_uri: str | None, csv_delimiter: str, csv_header: bool)` - Main model building function.
 
 ---
 ## `models.py`
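For reference, a direct call mirroring the new make target against the updated signature above. This is a sketch only: it assumes `main` is importable from `plexe.main` (the module the Makefile invokes) and that keyword parameters omitted here have defaults, neither of which this diff confirms:

```python
from pathlib import Path

from plexe.main import main  # entry point per the signature above (assumed importable)

main(
    intent="predict whether a passenger was transported",
    spark_mode="local",
    user_id="dev_user",
    experiment_id="titanic_explicit_test",
    max_iterations=10,
    work_dir=Path("workdir/titanic_explicit_test"),
    train_dataset_uri="examples/datasets/spaceship-titanic/train.parquet",
    val_dataset_uri=None,  # let plexe carve a validation split from the training data
    test_dataset_uri="examples/datasets/spaceship-titanic/test.csv",
    enable_final_evaluation=True,
)
```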
@@ -728,10 +728,10 @@ Streamlit dashboard for plexe.
 Main workflow orchestrator.
 
 **Functions:**
-- `build_model(spark: SparkSession, train_dataset_uri: str, test_dataset_uri: str | None, user_id: str, intent: str, experiment_id: str, work_dir: Path, runner: TrainingRunner, search_policy: SearchPolicy, config: Config, integration: WorkflowIntegration, enable_final_evaluation: bool, on_checkpoint_saved: Callable[[str, Path, Path], None] | None, pause_points: list[str] | None, on_pause: Callable[[str], None] | None, user_feedback: dict | None) -> tuple[Solution, dict, EvaluationReport | None] | None` - Main workflow orchestrator.
+- `build_model(spark: SparkSession, train_dataset_uri: str, val_dataset_uri: str | None, test_dataset_uri: str | None, user_id: str, intent: str, experiment_id: str, work_dir: Path, runner: TrainingRunner, search_policy: SearchPolicy, config: Config, integration: WorkflowIntegration, enable_final_evaluation: bool, on_checkpoint_saved: Callable[[str, Path, Path], None] | None, pause_points: list[str] | None, on_pause: Callable[[str], None] | None, user_feedback: dict | None) -> tuple[Solution, dict, EvaluationReport | None] | None` - Main workflow orchestrator.
 - `sanitize_dataset_column_names(spark: SparkSession, dataset_uri: str, context: BuildContext) -> str` - Sanitize column names by replacing special characters with underscores.
 - `analyze_data(spark: SparkSession, dataset_uri: str, context: BuildContext, config: Config, on_checkpoint_saved: Callable[[str, Path, Path], None] | None)` - Phase 1: Layout detection + Statistical + ML task analysis + metric selection.
-- `prepare_data(spark: SparkSession, training_dataset_uri: str, test_dataset_uri: str | None, context: BuildContext, config: Config, integration: WorkflowIntegration, generate_test_set: bool, on_checkpoint_saved: Callable[[str, Path, Path], None] | None)` - Phase 2: Split dataset and extract sample.
+- `prepare_data(spark: SparkSession, training_dataset_uri: str, val_dataset_uri: str | None, test_dataset_uri: str | None, context: BuildContext, config: Config, integration: WorkflowIntegration, generate_test_set: bool, on_checkpoint_saved: Callable[[str, Path, Path], None] | None)` - Phase 2: Split dataset and extract sample.
 - `build_baselines(spark: SparkSession, context: BuildContext, config: Config, on_checkpoint_saved: Callable[[str, Path, Path], None] | None)` - Phase 3: Build baseline models.
 - `search_models(spark: SparkSession, context: BuildContext, runner: TrainingRunner, search_policy: SearchPolicy, config: Config, integration: WorkflowIntegration, on_checkpoint_saved: Callable[[str, Path, Path], None] | None, restored_journal: SearchJournal | None, restored_insight_store: InsightStore | None) -> Solution | None` - Phase 4: Iterative tree-search for best model.
 - `retrain_on_full_dataset(spark: SparkSession, best_solution: Solution, context: BuildContext, runner: TrainingRunner, config: Config) -> Solution` - Retrain best solution on FULL dataset.