Conversation
📝 Walkthrough

This pull request introduces a comprehensive refactoring of the model training and attack toolkit to support multiple model types and generalize configuration handling. The changes rename legacy configuration classes to use ClavaDDPM prefixes (e.g., DiffusionConfig → ClavaDDPMDiffusionConfig), introduce a new ModelType enum to support both TABDDPM and CTGAN model workflows, and add type-specific training result dataclasses (TabDDPMTrainingResult, CTGANTrainingResult). Configuration keys are standardized from the hardcoded "tabddpm_training_config_path" to a generic "training_config_path". New ensemble attack scripts are added for CTGAN workflows, data handling is generalized to drop all ID columns, and type hints are updated across the codebase to reflect the new structures. Multiple example scripts and test files are updated to use the refactored naming conventions and new branching logic.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/midst_toolkit/attacks/ensemble/rmia/shadow_model_training.py (1)
196-224: ⚠️ Potential issue | 🔴 Critical

Persist full TrainingResult objects instead of only synthetic_data.

Line 223 and Line 350 currently store just a pd.DataFrame, which discards model/config/save-dir metadata and breaks the new result-object contract used by integration tests.

🐛 Proposed fix

- attack_data["fine_tuned_results"].append(train_result.synthetic_data)
+ attack_data["fine_tuned_results"].append(train_result)
  ...
- attack_data["trained_results"].append(train_result.synthetic_data)
+ attack_data["trained_results"].append(train_result)

Also applies to: 325-350
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/midst_toolkit/attacks/ensemble/rmia/shadow_model_training.py` around lines 196 - 224: the code currently appends only train_result.synthetic_data to attack_data["fine_tuned_results"], losing the TrainingResult metadata. Update the two places where you append (the block using train_result from fine_tune_tabddpm_and_synthesize and the block using train_result from train_or_fine_tune_ctgan) to append the entire TrainingResult object (train_result) instead of train_result.synthetic_data, and adjust any downstream consumers/tests to expect TrainingResult entries (preserving model/config/save_dir fields from initial_model_training_results.models and the called functions fine_tune_tabddpm_and_synthesize / train_or_fine_tune_ctgan).

src/midst_toolkit/attacks/ensemble/shadow_model_utils.py (1)
191-200: ⚠️ Potential issue | 🟡 Minor

Return type annotation mismatch.

The function fine_tune_tabddpm_and_synthesize has return type TrainingResult (line 200), but it actually returns a TabDDPMTrainingResult (line 249). This should be consistent with train_tabddpm_and_synthesize, which correctly declares TabDDPMTrainingResult as its return type.

🛠️ Proposed fix

 def fine_tune_tabddpm_and_synthesize(
     trained_models: dict[Relation, ClavaDDPMModelArtifacts],
     fine_tune_set: pd.DataFrame,
     configs: ClavaDDPMTrainingConfig,
     save_dir: Path,
     fine_tuning_diffusion_iterations: int = 100,
     fine_tuning_classifier_iterations: int = 10,
     synthesize: bool = True,
     number_of_points_to_synthesize: int = 20000,
-) -> TrainingResult:
+) -> TabDDPMTrainingResult:
Verify each finding against the current code and only fix it if needed. In `@src/midst_toolkit/attacks/ensemble/shadow_model_utils.py` around lines 191 - 200, The return type annotation for fine_tune_tabddpm_and_synthesize is incorrect: it declares TrainingResult but actually returns a TabDDPMTrainingResult; update the function signature of fine_tune_tabddpm_and_synthesize to return TabDDPMTrainingResult (matching train_tabddpm_and_synthesize) and ensure any imports or type references for TabDDPMTrainingResult are present so the annotation resolves.
🧹 Nitpick comments (7)
examples/gan/synthesize.py (1)
37-42: Consider more robust config access for data_path.

The current check config.training.data_path is not None doesn't handle:
- A missing data_path key in the config (would raise ConfigAttributeError before the comparison).
- An empty string "" (would pass the check, resulting in a _synthetic.csv filename).

Using a truthy check with safe attribute access would be more defensive:

♻️ Proposed fix for robustness

- if config.training.data_path is not None:
-     dataset_name = Path(config.training.data_path).stem
- else:
-     dataset_name = get_table_name(config.base_data_dir)
+ data_path = getattr(config.training, "data_path", None)
+ if data_path:
+     dataset_name = Path(data_path).stem
+ else:
+     dataset_name = get_table_name(config.base_data_dir)
Verify each finding against the current code and only fix it if needed. In `@examples/gan/synthesize.py` around lines 37 - 42: the code that sets dataset_name uses a fragile check (config.training.data_path is not None) which can raise if data_path is missing, or produce wrong names for empty strings. Update the logic to safely access and truthily validate data_path (e.g., use getattr(config.training, "data_path", None) or a vars-accessor and check if isinstance(data_path, str) and data_path.strip() != "") and only then set dataset_name = Path(data_path).stem; otherwise call get_table_name(config.base_data_dir). Ensure the new condition covers a missing attribute and empty/whitespace strings before constructing synthetic_data_file.

examples/gan/ensemble_attack/compute_attack_success.py (1)
21-24: Prefer config-driven target_ids over a hardcoded [0].

Line 23 hardcodes a sentinel ID. Even if ignored today, this can make run provenance unclear and brittle for future multi-target support.
Suggested refactor
- compute_attack_success_for_given_targets(
+ target_ids_cfg = config.ensemble_attack.target_model.get("target_ids")
+ target_ids = list(target_ids_cfg) if target_ids_cfg else [0]
+
+ compute_attack_success_for_given_targets(
      target_model_config=config.ensemble_attack.target_model,
-     # TODO: refactor this to work better outside of the challenge context (i.e. no target ID)
-     # No target ID needed for CTGAN, but it needs at least one element in this array. The value does not matter.
-     target_ids=[0],
+     target_ids=target_ids,
      experiment_directory=Path(config.results_dir),
      metaclassifier_model_name=config.ensemble_attack.metaclassifier.meta_classifier_model_name,
  )
Verify each finding against the current code and only fix it if needed. In `@examples/gan/ensemble_attack/compute_attack_success.py` around lines 21 - 24: replace the hardcoded sentinel target_ids=[0] with a config-driven value. Read target_ids from the existing config object (e.g., config.target_ids or config.get("target_ids")) and pass that into the call, validating that it is a non-empty list and falling back to a single-element list (e.g., [0]) only if the config provides nothing valid; update the code around the target_ids parameter in compute_attack_success.py where target_ids is passed and keep experiment_directory=Path(config.results_dir) unchanged.

examples/gan/ensemble_attack/config.yaml (1)
29-33: TODO noted: pipeline flags need proper testing.

The TODO comment on line 30 indicates that the pipeline control flags (run_data_processing, run_shadow_model_training, run_metaclassifier_training) haven't been fully tested. Ensure these are validated before merging to main, or track this in the follow-up work mentioned in the PR description.

Would you like me to open an issue to track proper testing of these pipeline flags?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/gan/ensemble_attack/config.yaml` around lines 29 - 33: the TODO notes that the pipeline control flags under the pipeline config (run_data_processing, run_shadow_model_training, run_metaclassifier_training) are untested. Add validation and tests: implement runtime validation in the pipeline bootstrap (check the pipeline config object) to assert supported flag combinations and emit clear warnings/errors, add unit/integration tests that exercise each flag and common combinations (e.g., data only, shadow only, metaclassifier only, and all false/true), update config.yaml defaults if needed, and add a CI job or test matrix to run these scenarios so the flags are covered before merging.

examples/gan/ensemble_attack/train_attack_model.py (1)
101-109: Segmentation fault workaround warrants investigation.

The dynamic import to avoid a segmentation fault is a red flag. This could indicate memory corruption, circular imports, or incompatible library interactions. The TODO should be prioritized to understand the root cause, as segfaults can be symptoms of deeper issues.
Would you like me to open an issue to track investigating this segmentation fault?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/gan/ensemble_attack/train_attack_model.py` around lines 101 - 109: the dynamic import of "examples.ensemble_attack.run_metaclassifier_training" (meta_pipeline) to avoid a segmentation fault is a workaround, not a fix. Investigate the root cause by reproducing the segfault with a minimal script that imports the module at top-level and running it under a native debugger (gdb) or memory checker (valgrind) to capture the crash stack; inspect for circular imports between train_attack_model.py and examples.ensemble_attack.run_metaclassifier_training, and check for problematic native extensions or incompatible package versions. After identifying the culprit, either refactor to remove the circular dependency (or fix the native extension/version), add a failing test that imports the module normally, and replace the dynamic import/meta_pipeline.run_metaclassifier_training usage with the normal top-level import once fixed; open a tracked issue documenting reproduction steps, stack trace, and environment (Python version, OS, dependent library versions).

examples/gan/ensemble_attack/test_attack_model.py (1)
10-12: Avoid pytest-style naming for an executable Hydra entrypoint.

Using a test_* filename and test_* function for this script triggers pytest collection/execution. The repository has no testpaths or norecursedirs configuration to exclude the examples/ directory, so pytest will discover and attempt to run this file by default.

♻️ Suggested change

 `@hydra.main`(config_path="./", config_name="config", version_base=None)
-def test_attack_model(config: DictConfig) -> None:
+def main(config: DictConfig) -> None:
     """Main function to test the attack model."""
     log(
         INFO,
         f"Testing attack model against synthetic data at {config.ensemble_attack.target_model.target_synthetic_data_path}...",
     )
     run_metaclassifier_testing(config.ensemble_attack)

 if __name__ == "__main__":
-    test_attack_model()
+    main()

Also applies to: 20-21
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/gan/ensemble_attack/test_attack_model.py` around lines 10 - 12: the script defines a Hydra entrypoint named test_attack_model and lives in a file whose name starts with test_, causing pytest to discover and run it. Rename the entrypoint function (e.g., to run_attack_model or main_attack_model) and rename the file to not start with test_ (or otherwise avoid a test_ prefix) so pytest won't collect it, and update the `@hydra.main`-decorated function name references (test_attack_model) and any other similar functions on lines ~20-21 to the new non-test_* names to preserve Hydra behavior while preventing pytest collection.

tests/unit/attacks/ensemble/test_shadow_model_utils.py (1)
19-19: Rename the test to reflect the generalized API.

Line 19 still uses test_save_additional_tabddpm_config, but the test now validates save_additional_training_config. A generic test name will be clearer for multi-model support.
Verify each finding against the current code and only fix it if needed. In `@tests/unit/attacks/ensemble/test_shadow_model_utils.py` at line 19: rename the test function test_save_additional_tabddpm_config to a generic name like test_save_additional_training_config to match the updated API (save_additional_training_config); update the function definition and any references to it in the test module (e.g., fixtures or calls) and ensure the test still invokes save_additional_training_config instead of the old TabDDPM-specific helper so the name and behavior are consistent.

tests/integration/attacks/ensemble/test_shadow_model_training.py (1)
71-82: Prefer isinstance over exact type(...) is ... assertions.

Line 72 and Line 118 (and similarly the DataFrame checks) are stricter than needed and can break with harmless subclassing/wrappers.

💡 Suggested change

- assert type(result) is TabDDPMTrainingResult
+ assert isinstance(result, TabDDPMTrainingResult)
  ...
- assert type(result.synthetic_data) is pd.DataFrame
+ assert isinstance(result.synthetic_data, pd.DataFrame)

Also applies to: 117-127
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/attacks/ensemble/test_shadow_model_training.py` around lines 71 - 82, The tests use strict type equality checks (e.g., "type(result) is TabDDPMTrainingResult" and "type(result.synthetic_data) is pd.DataFrame") which can fail for valid subclasses or wrappers; change these to use isinstance checks instead: replace exact type comparisons with isinstance(result, TabDDPMTrainingResult) and isinstance(result.synthetic_data, pd.DataFrame) and keep the existing attribute and length assertions (e.g., for synthetic_data length == 5) to preserve test intent; update occurrences around the shadow model training assertions referencing TabDDPMTrainingResult and synthetic_data.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/ensemble_attack/compute_attack_success.py`:
- Around line 79-84: Looping over multiple target IDs currently reuses the same
paths when target_model_config lacks "target_model_id", causing duplicated
scores; fix by setting target_model_config.target_model_id = target_id whenever
the key is missing (not only when present) and then regenerate any dependent
paths (attack_probabilities_result_path and challenge_label_path) from
target_model_config (or include target_id in their filenames) so each target
gets distinct prediction/label files; update the code around the existing
conditional that touches target_model_config and ensure these regenerated paths
are used downstream.
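A minimal sketch of the per-target path handling described above. The helper name and the filename pattern embedding the target ID are illustrative assumptions, not the repository's actual naming:

```python
from pathlib import Path
from typing import Any


def resolve_target_paths(
    target_model_config: dict[str, Any], target_id: int, experiment_directory: Path
) -> tuple[Path, Path]:
    """Record the target currently being scored and derive distinct result files for it."""
    # Always track the current loop iteration's target ID on the config.
    target_model_config["target_model_id"] = target_id
    # Embedding the ID in each filename keeps per-target outputs from colliding.
    probabilities_path = experiment_directory / f"attack_probabilities_{target_id}.csv"
    labels_path = experiment_directory / f"challenge_labels_{target_id}.csv"
    return probabilities_path, labels_path
```

With this shape, looping over target_ids regenerates both paths on every iteration instead of silently reusing the first target's files.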
- Around line 46-47: The CSV label-loading assumes a single column but uses
to_numpy().squeeze(), which can silently produce wrong shapes if multiple
columns (e.g., index + label) exist; replace the squeeze approach in the block
that sets test_target from challenge_label_path by reading into a DataFrame,
assert df.shape[1] == 1 (or raise a clear ValueError mentioning
challenge_label_path), and set test_target = df.iloc[:, 0].to_numpy(); apply the
identical change in the same pattern in
examples/ensemble_attack/test_attack_model.py where labels are loaded.
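A sketch of the validation both call sites could share; the helper name is illustrative and not part of the codebase:

```python
from pathlib import Path

import numpy as np
import pandas as pd


def load_label_vector(challenge_label_path: Path) -> np.ndarray:
    """Load a label CSV, insisting on exactly one column and returning a 1D array."""
    df = pd.read_csv(challenge_label_path)
    if df.shape[1] != 1:
        raise ValueError(
            f"Expected a single label column in {challenge_label_path}, found {df.shape[1]} columns."
        )
    return df.iloc[:, 0].to_numpy()
```

Unlike to_numpy().squeeze(), this fails loudly when a stray index column sneaks into the file rather than returning a 2D array downstream code can't use.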
In `@examples/ensemble_attack/README.md`:
- Line 8: The README references the wrong config key name for the population
output path; update the documentation in the sentence that describes running
run_attack.py (and mentions pipeline.run_data_processing and
configs/experiment_config.yaml) to use data_paths.population_path instead of
data_paths.population_data so it matches the actual config keys (leave
data_paths.midst_data_path and data_paths.processed_attack_data_path unchanged).
In `@examples/ensemble_attack/real_data_collection.py`:
- Around line 225-226: The function mixes a dynamic id_columns list (and
df_population_no_id) with later hardcoded "trans_id" and "account_id"
references; update the code to consistently derive ID column names from
id_columns (e.g., select the transaction and account IDs by matching suffixes
like endswith("_id") or via a small helper that returns the correct column name)
and replace any direct uses of the string literals "trans_id" and "account_id"
with those derived names so all ID handling (creation, selection, and dropping)
uses the same computed variables (id_columns, df_population_no_id or a new
get_id_column helper).
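One way to derive the ID column names instead of hardcoding "trans_id" and "account_id"; the helper names and the `_id` suffix convention are assumptions for illustration:

```python
import pandas as pd


def get_id_column(id_columns: list[str], suffix: str) -> str:
    """Return the single ID column whose name ends with the given suffix."""
    matches = [column for column in id_columns if column.endswith(suffix)]
    if len(matches) != 1:
        raise ValueError(f"Expected exactly one ID column ending with {suffix!r}, found {matches}.")
    return matches[0]


def split_id_columns(df_population: pd.DataFrame) -> tuple[list[str], pd.DataFrame]:
    """Collect all *_id columns and return the frame with them dropped."""
    id_columns = [column for column in df_population.columns if column.endswith("_id")]
    return id_columns, df_population.drop(columns=id_columns)
```

All later selection and dropping can then go through the returned id_columns list, so the creation, selection, and removal of IDs stay in sync.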
In `@examples/ensemble_attack/test_attack_model.py`:
- Around line 346-349: The CSV branch reading challenge labels should validate
the CSV has exactly one column and produce a 1D numpy array; update the elif
handling of challenge_label_path to read into a DataFrame, check df.shape[1] ==
1 and raise ValueError if not, then set test_target = df.iloc[:, 0].to_numpy()
(ensuring a 1D array) so downstream code expecting a 1D label vector won’t
break; reference the variables challenge_label_path and test_target to locate
the change.
- Around line 268-277: The code only checks (processed_attack_data_path /
challenge_data_file_name).exists() before skipping collection but then
unconditionally calls load_dataframe for both challenge_data_file_name and
"master_challenge_train.csv", which can fail if the master CSV is missing;
change the condition to verify both files exist (e.g., check existence of
(processed_attack_data_path / challenge_data_file_name) AND
(processed_attack_data_path / "master_challenge_train.csv") ) before setting
df_challenge_experiment and df_master_train via load_dataframe, otherwise fall
through to perform the data collection path or raise a clear error.
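A compact sketch of the stricter precondition; the file names mirror those in the comment, while the helper itself is hypothetical:

```python
from pathlib import Path


def cached_challenge_files_exist(processed_attack_data_path: Path, challenge_data_file_name: str) -> bool:
    """The cached-load branch is only safe when every required CSV is present."""
    required_files = [
        processed_attack_data_path / challenge_data_file_name,
        processed_attack_data_path / "master_challenge_train.csv",
    ]
    return all(path.exists() for path in required_files)
```

Guarding on this predicate means a missing master CSV falls through to the collection path instead of failing inside load_dataframe.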
In `@examples/gan/ensemble_attack/make_challenge_dataset.py`:
- Line 27: The current sampling line may raise ValueError when the untrained
pool is smaller than the training set; modify the logic around untrained_data =
real_data[~real_data[id_column].isin(training_data[id_column])].sample(len(training_data))
to first compute pool_size =
len(real_data[~real_data[id_column].isin(training_data[id_column])]) and then:
if pool_size >= len(training_data) keep sampling without replacement, else
either call sample(len(training_data), replace=True) or handle it explicitly
(raise a clearer error or sample only pool_size and log a warning) so sampling
never throws; update references to untrained_data, real_data, id_column, and
training_data accordingly.
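The guarded sampling could look like this; the function name and the fallback policy (sampling with replacement plus a warning) are illustrative choices, not the repository's:

```python
import warnings

import pandas as pd


def sample_untrained(real_data: pd.DataFrame, training_data: pd.DataFrame, id_column: str) -> pd.DataFrame:
    """Sample as many untrained rows as there are training rows, without raising on small pools."""
    pool = real_data[~real_data[id_column].isin(training_data[id_column])]
    n = len(training_data)
    if len(pool) >= n:
        # Enough disjoint rows: sample without replacement as before.
        return pool.sample(n)
    # Pool is too small: pandas would raise ValueError, so fall back explicitly.
    warnings.warn(f"Untrained pool has only {len(pool)} rows; sampling {n} with replacement.")
    return pool.sample(n, replace=True)
```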
In `@examples/gan/ensemble_attack/README.md`:
- Around line 64-65: Fix the typo in the README note: change the config key
reference from ensemble_attack.shadow_trainig.model_name to
ensemble_attack.shadow_training.model_name so the documentation matches the
actual config key used (update the text in the README where the mistaken key
appears).
- Line 81: Update the broken markdown anchor in
examples/gan/ensemble_attack/README.md: replace the link target
`#2-training-the-target-model` with the actual section anchor that matches the
heading "Generating target synthetic data to be tested" (e.g., use the correct
slugified anchor for that heading), so the [step 2] jump link navigates to the
"Generating target synthetic data to be tested" section.
- Around line 3-4: The relative link [Ensemble Attack](examples/ensemble_attack)
in the README line starting "On this example, we demonstrate how to run the
[Ensemble Attack]" resolves incorrectly; update that link target to the correct
relative path from examples/gan/ensemble_attack/README.md (for example change to
../../ensemble_attack or to an absolute /examples/ensemble_attack) so the
[Ensemble Attack] reference points to the actual docs.
In `@examples/gan/train.py`:
- Around line 27-35: The code sets dataset_name and real_data differently
depending on config.training.data_path but later always reads domain/metadata
from config.base_data_dir, causing domain-file coupling; change the logic so the
metadata path is chosen consistently: when config.training.data_path is
provided, set dataset_name = Path(config.training.data_path).stem and derive
metadata_dir = Path(config.training.data_path).parent, otherwise use table_name
and metadata_dir = Path(config.base_data_dir); then load domain metadata from
metadata_dir / f"{dataset_name}_domain.json" (or the existing metadata filename)
so that real_data and domain metadata come from the same directory.
- Around line 37-42: Before calling real_data.sample, validate
config.training.sample_size: ensure it's an int greater than 0 and <=
len(real_data); if not, raise a ValueError with a clear message referencing
sample_size and the available row count. Update the block that currently uses
config.training.sample_size and real_data.sample to perform this check (use
config.training.sample_size, real_data, and dataset_name to build the error
text) so invalid values fail fast with an informative message rather than
letting pandas.sample raise an opaque exception.
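A fail-fast check along the lines suggested; the helper name is illustrative:

```python
import pandas as pd


def validate_sample_size(real_data: pd.DataFrame, sample_size: object, dataset_name: str) -> int:
    """Reject non-integer, non-positive, or oversized sample sizes with a clear message."""
    if not isinstance(sample_size, int) or isinstance(sample_size, bool):
        raise ValueError(f"sample_size must be an integer for dataset '{dataset_name}', got {sample_size!r}.")
    if sample_size <= 0 or sample_size > len(real_data):
        raise ValueError(
            f"sample_size={sample_size} for dataset '{dataset_name}' must be in [1, {len(real_data)}]."
        )
    return sample_size
```

Calling this before real_data.sample turns an opaque pandas error into a message that names the bad value and the available row count.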
In `@src/midst_toolkit/attacks/ensemble/shadow_model_utils.py`:
- Line 13: The file shadow_model_utils.py imports helper functions from examples
(from examples.gan.utils import get_single_table_svd_metadata, get_table_name)
which breaks module boundaries; move or reimplement those helpers inside the src
package (e.g., create a new util module under src/midst_toolkit/, e.g.,
midst_toolkit.utils or midst_toolkit.helpers) and update shadow_model_utils.py
to import get_single_table_svd_metadata and get_table_name from that new
internal module; ensure the new functions keep identical signatures and behavior
so usages in functions/classes inside shadow_model_utils.py continue to work
without changing callers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 66ffa678-a318-4cab-a033-9f92592645e9
📒 Files selected for processing (39)
examples/ensemble_attack/README.md
examples/ensemble_attack/compute_attack_success.py
examples/ensemble_attack/configs/experiment_config.yaml
examples/ensemble_attack/configs/original_attack_config.yaml
examples/ensemble_attack/real_data_collection.py
examples/ensemble_attack/run_shadow_model_training.py
examples/ensemble_attack/test_attack_model.py
examples/gan/README.md
examples/gan/ensemble_attack/README.md
examples/gan/ensemble_attack/compute_attack_success.py
examples/gan/ensemble_attack/config.yaml
examples/gan/ensemble_attack/make_challenge_dataset.py
examples/gan/ensemble_attack/test_attack_model.py
examples/gan/ensemble_attack/train_attack_model.py
examples/gan/synthesize.py
examples/gan/train.py
examples/synthesizing/multi_table/README.md
examples/synthesizing/multi_table/run_synthesizing.py
examples/synthesizing/single_table/README.md
examples/synthesizing/single_table/run_synthesizing.py
examples/training/multi_table/README.md
examples/training/multi_table/run_training.py
examples/training/single_table/README.md
examples/training/single_table/run_training.py
src/midst_toolkit/attacks/ensemble/blending.py
src/midst_toolkit/attacks/ensemble/clavaddpm_fine_tuning.py
src/midst_toolkit/attacks/ensemble/rmia/shadow_model_training.py
src/midst_toolkit/attacks/ensemble/shadow_model_utils.py
src/midst_toolkit/common/config.py
src/midst_toolkit/models/clavaddpm/clustering.py
src/midst_toolkit/models/clavaddpm/enumerations.py
src/midst_toolkit/models/clavaddpm/synthesizer.py
src/midst_toolkit/models/clavaddpm/train.py
tests/integration/attacks/ensemble/configs/shadow_training_config.yaml
tests/integration/attacks/ensemble/test_shadow_model_training.py
tests/integration/models/clavaddpm/test_model.py
tests/integration/models/clavaddpm/test_synthesizer.py
tests/unit/attacks/ensemble/configs/shadow_training_config.yaml
tests/unit/attacks/ensemble/test_shadow_model_utils.py
sarakodeiri
left a comment
Took a quick look and things look neat in general. Added a few minor comments.
…celo/ensamble-ctgan
fatemetkl
left a comment
Awesome changes! Thank you!
I just had a few comments and questions, but it is almost ready to be merged, in my opinion.
…n optional parameter to the config
fatemetkl
left a comment
Changes look great to me!
PR Type
Feature
Short Description
Clickup Ticket(s): https://app.clickup.com/t/868h6nkzy
Making the Ensemble Attack run with CTGAN and adding the code required to make it run in an example.
The code is very rough, mostly just some if conditions and minor modifications to make it work with both TabDDPM/ClavaDDPM and CTGAN at the same time.
Also, there were lots of minor changes made in order for the code to work outside of the context of a challenge and make the code a little more flexible for other dataset types.
On follow-up PRs, I will be working on moving parts of this code from the examples to the main lib folder and also making it more extensible to other model types.

Tests Added
No tests have been added.