Add new percentile aggregator and tests for comparing it with Byzantine#1052
mina5rovic wants to merge 1 commit into develop from
Conversation
JulienVig left a comment
The comparison is very useful, thanks!
I have a bunch of questions, mostly due to my superficial knowledge of the algorithms. Overall, we're often getting exact values like 1.0000, which seems a bit fishy. Also, the new aggregator seems worse than the old one both time- and robustness-wise. I believe it should at least be more robust, so maybe some bugs slipped into the implementation?
Finally, could you add a memory test case to ensure that the implementation correctly disposes of unused TF tensors? You can do so by running multiple rounds/aggregations and keeping track of the number of tensors allocated in memory (tf.memory()), which ideally should stay constant throughout training. You can read more on memory management here.
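For concreteness, here is a minimal sketch of such a memory check. The `measureAggregation` helper is the one from this test file; `leaksAfterWarmup` is a hypothetical helper I'm introducing so the leak criterion itself is easy to unit-test, and the commented part assumes `@tensorflow/tfjs` and a vitest/jest-style `expect`:

```typescript
// Pure helper (hypothetical, not part of the PR): given the tensor counts
// recorded after each round, report whether the count keeps changing after
// the first round. Round 1 is treated as warm-up since it may allocate
// long-lived state; afterwards the count should stay constant if every
// intermediate tensor is disposed.
function leaksAfterWarmup(tensorCounts: number[]): boolean {
  const steady = tensorCounts.slice(1); // ignore warm-up allocations
  return steady.some((count) => count !== steady[0]);
}

// In the actual test (requires @tensorflow/tfjs):
//
// const counts: number[] = [];
// for (let round = 0; round < 5; round++) {
//   await measureAggregation(agg, `round ${round}`, peersWithValues);
//   counts.push(tf.memory().numTensors);
// }
// expect(leaksAfterWarmup(counts)).toBe(false);
```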
expect(timingNew.result).toBeLessThan(50);
expect(timingOld.result).toBeLessThan(50);
Both aggregators result in exactly 1; is this expected? Shouldn't it be at least slightly influenced by the Byzantine outlier?
=== Timing Comparison: Simple Outlier Rejection ===
New ByzantineRobust | 14.655ms | result: 1.0000
Old PercentileClipping | 1.197ms | result: 1.0000
console.log(formatTiming([timingNew, timingOld]));
});

it("old aggregator with different percentiles", async () => {
So for this test I'm getting these results:
=== Timing Comparison: Old Aggregator with Different Percentiles ===
tau=0.05 | 2.350ms | result: 1.0000
tau=0.1 | 1.645ms | result: 1.0000
tau=0.2 | 2.607ms | result: 1.0000
tau=0.5 | 1.826ms | result: 1.0000
=== Timing Comparison: New Aggregator with Different Clipping Radii ===
radius=0.5 | 2.293ms | result: 0.5000
radius=1 | 0.949ms | result: 1.0000
radius=2 | 0.922ms | result: 1.4000
radius=5 | 0.749ms | result: 2.6000
If I understand correctly, the old aggregator is more robust than the new one (but slower) in this case? Also, why aren't the results changing with different tau values for the old aggregator?
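For illustration, here is a hypothetical reconstruction of nearest-rank percentile clipping (an assumption about how the old aggregator works, not its actual code) that would explain why tau has no effect here: with nine honest peers at 1.0 and one outlier, every tau up to 0.9 selects 1.0 as the clipping threshold, so the outlier is always clipped to 1.0 and the mean comes out as exactly 1.0.

```typescript
// Hypothetical nearest-rank percentile clipping over scalar updates.
// Threshold = tau-quantile of absolute values; everything above it is
// clipped to the threshold, then the clipped values are averaged.
function percentileClip(values: number[], tau: number): number {
  const sorted = [...values].map(Math.abs).sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor(tau * sorted.length));
  const threshold = sorted[idx];
  const clipped = values.map(
    (v) => Math.sign(v) * Math.min(Math.abs(v), threshold),
  );
  return clipped.reduce((a, b) => a + b, 0) / clipped.length;
}

// With [1,1,1,1,1,1,1,1,1,100], any tau <= 0.9 picks threshold 1.0,
// so the result is exactly 1.0 regardless of tau.
```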
// With centering on previous (5.0), updates to 10.0 should result in something close to 10.0
expect(timingRound2.result).toBeGreaterThan(5.0);

console.log("\n=== Timing Comparison: State Preservation Across Rounds ===");
console.log(formatTiming([timingRound1, timingRound2]));
});
I'm getting these:
=== Timing Comparison: State Preservation Across Rounds ===
Round 1 | 0.290ms | result: 5.0000
Round 2 | 0.305ms | result: 10.0000
Shouldn't it be a bit lower than 10 because of the previous round? It seems that round 1's state has been entirely discarded.
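For reference, if round state were carried over as an exponential moving average (an assumption about the intended design, not the PR's confirmed behavior), round 2 would land strictly below 10 for any positive weight on the previous round:

```typescript
// Hypothetical EMA-style state carry-over: blend round 1's aggregate
// (previous) into round 2's updates (current) with weight beta.
// Any beta > 0 pulls the round-2 result strictly below 10.0.
function emaBlend(previous: number, current: number, beta: number): number {
  return beta * previous + (1 - beta) * current;
}

// emaBlend(5.0, 10.0, 0.1) -> 9.5, not the observed 10.0000
```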
for (const iter of iterations) {
  const agg = new ByzantineRobustAggregator(0, 5, "absolute", 1.0, iter, 0);
  agg.setNodes(Set(allPeers));

  const timing = await measureAggregation(agg, `iterations=${iter}`, peersWithValues);
  timings.push(timing);
}

console.log("\n=== Performance Impact of Iterative Refinement ===");
=== Performance Impact of Iterative Refinement ===
iterations=1 | 0.762ms | result: 1.0000
iterations=2 | 1.233ms | result: 1.4000
iterations=5 | 5.417ms | result: 1.6496
iterations=10 | 7.004ms | result: 1.6665
Intuitively, I expected more iterations to improve the result, yet it seems to stray further from the optimal value. Is this behavior correct?
console.log("\n=== Byzantine Robustness: High Ratio Attack (4/10 = 40% malicious) ===");
console.log(formatTiming([timingNew, timingOld]));
console.log(` Result gap: new=${timingNew.result.toFixed(4)}, old=${timingOld.result.toFixed(4)}`);
console.log(` Winner: ${timingNew.result < timingOld.result ? "NEW (closer to honest 1.0)" : "OLD (closer to honest 1.0)"}`);
It's missing the case where both results are equal and it's a tie.
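A possible fix, sketched under the assumption that `result` is a plain number and the honest value is 1.0 (`winnerLabel` is a hypothetical helper, not part of the PR). Note that the original ternary also compares raw results rather than each result's distance to the honest value, which would mislabel the winner whenever an aggregator undershoots 1.0:

```typescript
// Label the winner by distance to the honest value, with an explicit tie case.
function winnerLabel(newResult: number, oldResult: number, honest = 1.0): string {
  const newGap = Math.abs(newResult - honest);
  const oldGap = Math.abs(oldResult - honest);
  if (newGap === oldGap) return "TIE (equal distance to honest value)";
  return newGap < oldGap
    ? "NEW (closer to honest 1.0)"
    : "OLD (closer to honest 1.0)";
}
```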
console.log("\n=== Gradient Poisoning Attack (coordinated Byzantine values) ===");
console.log(formatTiming([timingNew, timingOld]));
console.log(` Result gap: new=${timingNew.result.toFixed(4)}, old=${timingOld.result.toFixed(4)}`);
console.log(` Expected honest value: 1.0000`);
=== Gradient Poisoning Attack (coordinated Byzantine values) ===
New (5 iterations) | 8.113ms | result: 2.3094
Old (tau=0.2) | 0.473ms | result: 1.0000
Result gap: new=2.3094, old=1.0000
Expected honest value: 1.0000
Winner: OLD (closer to honest)
So the old aggregator is both faster and more robust? Doesn't this contradict the theoretical robustness guarantees?
console.log(formatTiming([timing2New, timing2Old]));
console.log(` New result: ${timing2New.result.toFixed(4)} (expected ~10.0)`);
console.log(` Old result: ${timing2Old.result.toFixed(4)} (expected ~10.0)`);
console.log(` Winner: ${Math.abs(timing2New.result - 10.0) < Math.abs(timing2Old.result - 10.0) ? "NEW (better rejects attack)" : "OLD (better rejects attack)"}`);

const timing2New = await measureAggregation(newAgg, "New Round 2 (attack)", round2Values);
const timing2Old = await measureAggregation(oldAgg, "Old Round 2 (attack)", round2Values);
=== Adaptive Multi-Round Attack ===
Round 1 (cooperation):
New Round 1 | 4.476ms | result: 5.0000
Old Round 1 | 2.309ms | result: 5.0000
Round 2 (adaptive Byzantine attack):
New Round 2 (attack) | 32.874ms | result: 10.0000
Old Round 2 (attack) | 17.731ms | result: 10.0000
New result: 10.0000 (expected ~10.0)
Old result: 10.0000 (expected ~10.0)
Winner: OLD (better rejects attack)
It smells fishy that the results are exactly the expected value, no?
const newError = Math.abs((arrNew[0][0] - 100) / 100) + Math.abs((arrNew[0][1] - 10) / 10) + Math.abs(arrNew[0][2] - 1);
const oldError = Math.abs((arrOld[0][0] - 100) / 100) + Math.abs((arrOld[0][1] - 10) / 10) + Math.abs(arrOld[0][2] - 1);
console.log(`\nTotal relative error: new=${newError.toFixed(3)}, old=${oldError.toFixed(3)}`);
console.log(`Winner: ${newError < oldError ? "NEW (better handles multi-scale)" : "OLD (better handles multi-scale)"}`);
console.log("\n=== Heterogeneous Gradients: Multi-Tensor Federated Model ===");
console.log("Gradient structure: [layer1=100×value, layer2=10×value, layer3=value]");
console.log("Honest peers send: [100, 10, 1]");
console.log("Byzantine sends: [10000, 1000, 100]");
console.log(`\nNew aggregator (${timeNew.toFixed(2)}ms):`);
console.log(` Layer 1: ${arrNew[0][0].toFixed(2)} (expected ~100)`);
console.log(` Layer 2: ${arrNew[0][1].toFixed(2)} (expected ~10)`);
console.log(` Layer 3: ${arrNew[0][2].toFixed(2)} (expected ~1)`);
console.log(`\nOld aggregator (${timeOld.toFixed(2)}ms):`);
console.log(` Layer 1: ${arrOld[0][0].toFixed(2)} (expected ~100)`);
console.log(` Layer 2: ${arrOld[0][1].toFixed(2)} (expected ~10)`);
console.log(` Layer 3: ${arrOld[0][2].toFixed(2)} (expected ~1)`);
=== Heterogeneous Gradients: Multi-Tensor Federated Model ===
Gradient structure: [layer1=100×value, layer2=10×value, layer3=value]
Honest peers send: [100, 10, 1]
Byzantine sends: [10000, 1000, 100]
New aggregator (1.98ms):
Layer 1: 14.92 (expected ~100)
Layer 2: 1.49 (expected ~10)
Layer 3: 0.15 (expected ~1)
Old aggregator (0.45ms):
Layer 1: 100.00 (expected ~100)
Layer 2: 10.00 (expected ~10)
Layer 3: 1.00 (expected ~1)
Total relative error: new=2.552, old=0.000
Winner: OLD (better handles multi-scale)
The new aggregator is quite far off the expected values; there may be an implementation issue to fix there.
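One plausible cause, purely an assumption from the output (every layer is shrunk by the same ~0.15 factor): a single global clipping radius applied to the concatenated gradient rescales the whole vector by one factor, instead of clipping each tensor relative to its own scale. A sketch of that effect (hypothetical helper, not the PR's API):

```typescript
// Global norm clipping rescales the entire vector by one factor, so a
// radius tuned for the small layer shrinks the large layers identically.
// This matches the observed uniform shrinkage of [100, 10, 1] in the test.
function clipGlobal(vec: number[], radius: number): number[] {
  const norm = Math.sqrt(vec.reduce((s, v) => s + v * v, 0));
  const scale = Math.min(1, radius / norm);
  return vec.map((v) => v * scale);
}

// A vector inside the radius is untouched; one outside is scaled
// uniformly, distorting the relative magnitudes across layers by
// the same factor everywhere.
```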
Implementation of a new percentile aggregator based on the previously implemented Byzantine one (from a previous version of disco). This one works faster and, in some cases, better than the Byzantine one. Also implemented tests comparing the two.