description: "Filter by search term in the name or description."
2074
2070
in: "query"
2075
2071
style: "form"
2076
2072
explode: false
@@ -2097,7 +2093,7 @@ components:
2097
2093
required: true
2098
2094
filterByVisibility:
2099
2095
name: "visibility"
2100
-
description: "Filter by visibility"
2096
+
description: "Filter by visibility."
2101
2097
in: "query"
2102
2098
style: "form"
2103
2099
explode: false
@@ -2107,7 +2103,7 @@ components:
2107
2103
$ref: "#/components/schemas/Visibility"
2108
2104
filterByCreatedFrom:
2109
2105
name: "createdFrom"
2110
-
description: "Filter connectors created from this date (inclusive). Format: YYYY-MM-DD."
2106
+
description: "Filter by creation date: only items created on or after this date. Format: YYYY-MM-DD."
2111
2107
in: "query"
2112
2108
style: "form"
2113
2109
explode: false
@@ -2116,7 +2112,7 @@ components:
2116
2112
format: "date"
2117
2113
filterByCreatedTo:
2118
2114
name: "createdTo"
2119
-
description: "Filter connectors created until this date (inclusive). Format: YYYY-MM-DD."
2115
+
description: "Filter by creation date: only items created on or before this date. Format: YYYY-MM-DD."
2120
2116
in: "query"
2121
2117
style: "form"
2122
2118
explode: false
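As an illustration of how the filter parameters in this hunk might be serialized, here is a minimal sketch. It assumes that `visibility` may carry several values (comma-joined, as `style: "form"` with `explode: false` implies for arrays), and that `PUBLIC` and `PRIVATE` are valid visibility values. Both are assumptions not confirmed by the diff, and `build_filter_query` is a hypothetical helper, not part of any client library.

```python
from datetime import date
from urllib.parse import urlencode


def build_filter_query(visibility=None, created_from=None, created_to=None):
    """Build a query string for the visibility/createdFrom/createdTo filters.

    Hypothetical helper: only the parameter names come from the spec.
    """
    params = {}
    if visibility:
        # style=form, explode=false: multiple values share one comma-joined parameter
        params["visibility"] = ",".join(visibility)
    if created_from is not None:
        # date.isoformat() yields YYYY-MM-DD, the format the spec asks for
        params["createdFrom"] = created_from.isoformat()
    if created_to is not None:
        params["createdTo"] = created_to.isoformat()
    # keep commas literal so the server sees a comma-separated list
    return urlencode(params, safe=",")


query = build_filter_query(
    visibility=["PUBLIC", "PRIVATE"],  # assumed enum values
    created_from=date(2024, 1, 1),
    created_to=date(2024, 6, 30),
)
print(query)
```

The resulting string would be appended to whichever list endpoint references these shared parameter components.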
@@ -4679,33 +4675,47 @@ components:
4679
4675
description: |
4680
4676
Specifies the maximum allowable epsilon value. If the training process exceeds this threshold, it will be terminated early. Only model checkpoints with epsilon values below this limit will be retained.
4681
4677
If not provided, the training will proceed without early termination based on epsilon constraints.
4678
+
default: 10.0
4682
4679
minimum: 0.0
4680
+
exclusiveMinimum: true
4683
4681
maximum: 10000.0
4682
+
delta:
4683
+
type: "number"
4684
+
format: "double"
4685
+
description: |
4686
+
The delta value for differential privacy. It is the probability of the privacy guarantee not holding.
4687
+
The smaller the delta, the more confident you can be that the privacy guarantee holds.
4688
+
This delta will be equally distributed between the analysis and the training phase.
4689
+
default: 1e-5
4690
+
minimum: 0.0
4691
+
exclusiveMinimum: true
4692
+
maximum: 1.0
4684
4693
noiseMultiplier:
4685
4694
type: "number"
4686
4695
format: "double"
4687
4696
description: |
4688
-
The ratio of the standard deviation of the Gaussian noise to the L2-sensitivity of the function to which the noise is added (How much noise to add).
4697
+
Determines how much noise is added while training the model with differential privacy. This is the ratio of the standard deviation of the Gaussian noise to the L2-sensitivity of the function to which the noise is added.
4689
4698
default: 1.5
4690
4699
minimum: 0.0
4691
4700
maximum: 10000.0
4692
4701
maxGradNorm:
4693
4702
type: "number"
4694
4703
format: "double"
4695
4704
description: |
4696
-
The maximum norm of the per-sample gradients for training the model with differential privacy.
4705
+
Determines the maximum impact of a single sample on updating the model weights during training with differential privacy. This is the maximum norm of the per-sample gradients.
4697
4706
default: 1.0
4698
4707
minimum: 0.0
4699
4708
maximum: 10000.0
4700
-
delta:
4709
+
valueProtectionEpsilon:
4701
4710
type: "number"
4702
4711
format: "double"
4703
4712
description: |
4704
-
The delta value for differential privacy. It is the probability of the privacy guarantee not holding.
4705
-
The smaller the delta, the more confident you can be that the privacy guarantee holds.
4706
-
default: 1e-5
4713
+
The DP epsilon of the privacy budget for determining the value ranges, which are gathered prior to the model training during the analysis step. Only applicable if value protection is True.
4714
+
The privacy budget is distributed equally across the columns. For categorical columns, we calculate noisy histograms and use a noisy threshold. For numeric and datetime columns, we calculate bounds based on noisy histograms.
4715
+
default: 1.0
4707
4716
minimum: 0.0
4708
-
maximum: 1.0
4717
+
exclusiveMinimum: true
4718
+
maximum: 10000.0
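To make the roles of `maxGradNorm`, `noiseMultiplier`, `delta`, and `valueProtectionEpsilon` more concrete, the following is a rough sketch of the textbook DP-SGD gradient step and of an even budget split. It is not the service's actual implementation. The function names are hypothetical, and gradients are plain Python lists for simplicity.

```python
import math
import random


def clip_gradient(grad, max_grad_norm):
    """Scale a per-sample gradient so its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_grad_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_gradient_sum(per_sample_grads, max_grad_norm=1.0, noise_multiplier=1.5, rng=None):
    """Sum clipped per-sample gradients and add Gaussian noise.

    After clipping, one sample can change the sum by at most max_grad_norm
    (the L2-sensitivity), so the noise standard deviation is
    noise_multiplier * max_grad_norm.
    """
    rng = rng or random.Random()
    clipped = [clip_gradient(g, max_grad_norm) for g in per_sample_grads]
    total = [sum(component) for component in zip(*clipped)]
    sigma = noise_multiplier * max_grad_norm
    return [t + rng.gauss(0.0, sigma) for t in total]


def split_evenly(budget, parts):
    """Divide a privacy budget into equal shares (assumed simple division)."""
    return [budget / parts] * parts


# delta shared between the analysis and the training phase
analysis_delta, training_delta = split_evenly(1e-5, 2)
# valueProtectionEpsilon shared across, say, four columns
per_column_epsilon = split_evenly(1.0, 4)
```

The defaults used here (`1.0`, `1.5`, `1e-5`, `1.0`) mirror the defaults in the schema; the column count of four is only an example.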
4709
4719
4710
4720
#################
4711
4721
## mostlyai-qa ##
@@ -4721,7 +4731,7 @@ components:
4721
4731
2. **Similarity**: Metrics regarding the similarity of the full joint distributions of samples within an embedding
4722
4732
space.
4723
4733
3. **Distances**: Metrics regarding the nearest neighbor distances between training, holdout, and synthetic samples
4724
-
in an embedding space. Useful for assessing the novelty / privacy of synthetic data.
4734
+
in a numeric encoding space. Useful for assessing the novelty / privacy of synthetic data.
4725
4735
4726
4736
The quality of synthetic data is assessed by comparing these metrics to the same metrics of a holdout dataset.
4727
4737
The holdout dataset is a subset of the original training data, that was not used for training the synthetic data
@@ -4738,20 +4748,21 @@ components:
4738
4748
description: |
4739
4749
Metrics regarding the accuracy of synthetic data, measured as the closeness of discretized lower dimensional
4740
4750
marginal distributions.
4741
-
4751
+
4742
4752
1. **Univariate Accuracy**: The accuracy of the univariate distributions for all target columns.
4743
4753
2. **Bivariate Accuracy**: The accuracy of all pair-wise distributions for target columns, as well as for target
4744
4754
columns with respect to the context columns.
4745
-
3. **Coherence Accuracy**: The accuracy of the auto-correlation for all target columns.
4746
-
4755
+
3. **Trivariate Accuracy**: The accuracy of all three-way distributions for target columns.
4756
+
4. **Coherence Accuracy**: The accuracy of the auto-correlation for all target columns.
4757
+
4747
4758
Accuracy is defined as 100% - [Total Variation Distance](https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures) (TVD),
4748
4759
where TVD is half the sum of the absolute differences of the relative frequencies of the corresponding
4749
4760
distributions.
4750
-
4761
+
4751
4762
These accuracies are calculated for all discretized univariate, and bivariate distributions. In case of sequential
4752
4763
data, also for all coherence distributions. Overall metrics are then calculated as the average across these
4753
4764
accuracies.
4754
-
4765
+
4755
4766
All metrics can be compared against a theoretical maximum accuracy, which is calculated for a same-sized holdout.
4756
4767
The accuracy metrics shall be as close as possible to the theoretical maximum, but not significantly higher, as
4757
4768
this would indicate overfitting.
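The accuracy definition above (100% minus TVD) can be sketched for a single discretized distribution as follows. This is a minimal illustration, not the `mostlyai-qa` implementation: the helper names are hypothetical, and the inputs are assumed to be already binned into discrete categories.

```python
from collections import Counter


def relative_frequencies(values):
    """Map each distinct value to its share of the sample."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    return {key: count / len(values) for key, count in counts.items()}


def tvd_accuracy(original, synthetic):
    """Return 1 - TVD between two samples of discretized values."""
    p = relative_frequencies(original)
    q = relative_frequencies(synthetic)
    # TVD: half the sum of absolute differences of the relative frequencies
    tvd = 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))
    return 1.0 - tvd


original = ["a", "a", "b", "c"]
synthetic = ["a", "b", "b", "c"]
print(tvd_accuracy(original, synthetic))  # 0.75
```

For bivariate or trivariate distributions the same function can be applied to tuples of binned values, for example `list(zip(col_x, col_y))`. An overall metric would then be the average across all such accuracies.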
@@ -4777,6 +4788,13 @@ components:
4777
4788
format: "double"
4778
4789
minimum: 0.0
4779
4790
maximum: 1.0
4791
+
trivariate:
4792
+
description: |
4793
+
Average accuracy of discretized trivariate distributions.
4794
+
type: "number"
4795
+
format: "double"
4796
+
minimum: 0.0
4797
+
maximum: 1.0
4780
4798
coherence:
4781
4799
description: |
4782
4800
Average accuracy of discretized coherence distributions. Only applicable for sequential data.
@@ -4805,6 +4823,13 @@ components:
4805
4823
format: "double"
4806
4824
minimum: 0.0
4807
4825
maximum: 1.0
4826
+
trivariateMax:
4827
+
description: |
4828
+
Expected trivariate accuracy of a same-sized holdout. Serves as a reference for `trivariate`.
4829
+
type: "number"
4830
+
format: "double"
4831
+
minimum: 0.0
4832
+
maximum: 1.0
4808
4833
coherenceMax:
4809
4834
description: |
4810
4835
Expected coherence accuracy of a same-sized holdout. Serves as a reference for `coherence`.
@@ -4864,20 +4889,20 @@ components:
4864
4889
Distances:
4865
4890
type: "object"
4866
4891
description: |
4867
-
Metrics regarding the nearest neighbor distances between training, holdout, and synthetic samples in an embedding
4868
-
space. Useful for assessing the novelty / privacy of synthetic data.
4869
-
4892
+
Metrics regarding the nearest neighbor distances between training, holdout, and synthetic samples in a numerically
4893
+
encoded space. Useful for assessing the novelty / privacy of synthetic data.
4894
+
4870
4895
The provided data is first down-sampled, so that the number of samples match across all datasets. Note, that for
4871
4896
an optimal sensitivity of this privacy assessment it is recommended to use a 50/50 split between training and
4872
4897
holdout data, and then generate synthetic data of the same size.
4873
-
4874
-
The embeddings of these samples are then computed, and the nearest neighbor distances are calculated for each
4898
+
4899
+
The numerical encodings of these samples are then computed, and the nearest neighbor distances are calculated for each
4875
4900
synthetic sample to the training and holdout samples. Based on these nearest neighbor distances the following
4876
4901
metrics are calculated:
4877
-
- Identical Match Share (IMS): The share of synthetic samples that are identical to a training or holdout sample.
4878
-
- Distance to Closest Record (DCR): The average distance of synthetic to training or holdout samples.
4879
-
- Nearest Neighbor Distance Ratio (NNDR): The 10-th smallest ratio of the distance to nearest and second nearest neighbor.
4880
-
4902
+
- Identical Match Share (IMS): The share of synthetic samples that are identical to a training or holdout sample.
4903
+
- Distance to Closest Record (DCR): The average distance of synthetic to training or holdout samples.
4904
+
- Nearest Neighbor Distance Ratio (NNDR): The 10-th smallest ratio of the distance to nearest and second nearest neighbor.
4905
+
4881
4906
For privacy-safe synthetic data we expect to see about as many identical matches, and about the same distances
4882
4907
for synthetic samples to training, as we see for synthetic samples to holdout.
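The three distance metrics described above can be sketched with a brute-force nearest neighbor search. This sketch assumes that samples are already numerically encoded as equal-length tuples and that Euclidean distance is used; neither is confirmed by the spec. When the two nearest reference points are both at distance zero, the ratio is set to `0.0`, which is also an assumption. `distance_metrics` is a hypothetical name.

```python
import math


def distance_metrics(synthetic, reference, k=10):
    """Compute IMS, DCR, and NNDR of synthetic samples against a reference set."""
    if not synthetic or len(reference) < 2:
        raise ValueError("need at least one synthetic and two reference samples")
    nearest, ratios = [], []
    for sample in synthetic:
        # distances to the nearest and second nearest reference samples
        d1, d2 = sorted(math.dist(sample, ref) for ref in reference)[:2]
        nearest.append(d1)
        ratios.append(d1 / d2 if d2 > 0 else 0.0)  # assumed convention for d2 == 0
    ratios.sort()
    return {
        # share of synthetic samples identical to some reference sample
        "ims": sum(d == 0.0 for d in nearest) / len(nearest),
        # average distance to the closest reference record
        "dcr": sum(nearest) / len(nearest),
        # k-th smallest ratio, or the largest one if fewer than k samples exist
        "nndr": ratios[min(k, len(ratios)) - 1],
    }


training = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
synthetic = [(0.0, 0.0), (1.0, 0.0)]
print(distance_metrics(synthetic, training))
```

Calling the function once with the training samples and once with the holdout samples as `reference` gives the two sets of values that the description compares.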