Commit 5655960

Refactor failure policy handling: Introduce FailurePolicySet for managing multiple policies
1 parent cc85ccf commit 5655960

19 files changed

Lines changed: 512 additions & 152 deletions

docs/reference/api-full.md

Lines changed: 35 additions & 5 deletions
@@ -10,7 +10,7 @@ For a curated, example-driven API guide, see **[api.md](api.md)**.
 > - **[CLI Reference](cli.md)** - Command-line interface
 > - **[DSL Reference](dsl.md)** - YAML syntax guide

-**Generated from source code on:** June 13, 2025 at 03:15 UTC
+**Generated from source code on:** June 13, 2025 at 10:43 UTC

 **Modules auto-discovered:** 37

@@ -313,14 +313,15 @@ repeats multiple times for Monte Carlo experiments.
 Attributes:
 network (Network): The underlying network to mutate (enable/disable nodes/links).
 traffic_matrix_set (TrafficMatrixSet): Traffic matrices to place after failures.
+failure_policy_set (FailurePolicySet): Set of named failure policies.
 matrix_name (Optional[str]): Name of specific matrix to use, or None for default.
-failure_policy (Optional[FailurePolicy]): The policy describing what fails.
+policy_name (Optional[str]): Name of specific failure policy to use, or None for default.
 default_flow_policy_config: The default flow policy for any demands lacking one.

 **Methods:**

 - `apply_failures(self) -> 'None'`
-  - Apply the current failure_policy to self.network (in-place).
+  - Apply the current failure policy to self.network (in-place).
 - `run_monte_carlo_failures(self, iterations: 'int', parallelism: 'int' = 1) -> 'Dict[str, Any]'`
   - Repeatedly applies (randomized) failures to the network and accumulates
 - `run_single_failure_scenario(self) -> 'List[TrafficResult]'`
@@ -401,6 +402,8 @@ Attributes:

 - `apply_failures(self, network_nodes: 'Dict[str, Any]', network_links: 'Dict[str, Any]', network_risk_groups: 'Dict[str, Any] | None' = None) -> 'List[str]'`
   - Identify which entities fail given the defined rules, then optionally
+- `to_dict(self) -> 'Dict[str, Any]'`
+  - Convert to dictionary for JSON serialization.

 ### FailureRule

@@ -633,6 +636,33 @@ Attributes:
 - `to_dict(self) -> 'dict[str, Any]'`
   - Convert to dictionary for JSON serialization.

+### FailurePolicySet
+
+Named collection of FailurePolicy objects.
+
+This mutable container maps failure policy names to FailurePolicy objects,
+allowing management of multiple failure policies for analysis.
+
+Attributes:
+policies: Dictionary mapping failure policy names to FailurePolicy objects.
+
+**Attributes:**
+
+- `policies` (dict[str, 'FailurePolicy']) = {}
+
+**Methods:**
+
+- `add(self, name: 'str', policy: "'FailurePolicy'") -> 'None'`
+  - Add a failure policy to the collection.
+- `get_all_policies(self) -> "list['FailurePolicy']"`
+  - Get all failure policies from the collection.
+- `get_default_policy(self) -> "'FailurePolicy | None'"`
+  - Get the default failure policy.
+- `get_policy(self, name: 'str') -> "'FailurePolicy'"`
+  - Get a specific failure policy by name.
+- `to_dict(self) -> 'dict[str, Any]'`
+  - Convert to dictionary for JSON serialization.
+
 ### PlacementResultSet

 Aggregated traffic placement results from one or many runs.
@@ -693,7 +723,7 @@ Represents a complete scenario for building and executing network workflows.

 This scenario includes:
 - A network (nodes/links), constructed via blueprint expansion.
-- A failure policy (one or more rules).
+- A failure policy set (one or more named failure policies).
 - A traffic matrix set containing one or more named traffic matrices.
 - A list of workflow steps to execute.
 - A results container for storing outputs.
@@ -708,8 +738,8 @@ Typical usage example:
 **Attributes:**

 - `network` (Network)
-- `failure_policy` (Optional[FailurePolicy])
 - `workflow` (List[WorkflowStep])
+- `failure_policy_set` (FailurePolicySet) = FailurePolicySet(policies={})
 - `traffic_matrix_set` (TrafficMatrixSet) = TrafficMatrixSet(matrices={})
 - `results` (Results) = Results(_store={})
 - `components_library` (ComponentsLibrary) = ComponentsLibrary(components={})

docs/reference/api.md

Lines changed: 26 additions & 7 deletions
@@ -125,16 +125,35 @@ demand = TrafficDemand(

 ## Failure Modeling

-### FailurePolicy
-Configure failure simulation parameters.
+### FailurePolicy and FailurePolicySet
+Configure failure simulation parameters using named policies.

 ```python
-from ngraph.failure_policy import FailurePolicy
+from ngraph.failure_policy import FailurePolicy, FailureRule
+from ngraph.results_artifacts import FailurePolicySet
+
+# Create individual failure rules
+rule = FailureRule(
+    entity_scope="link",
+    rule_type="choice",
+    count=2
+)

-policy = FailurePolicy(
-    enable_failures=True,
-    max_concurrent_failures=2,
-    failure_probability=0.01
+# Create failure policy
+policy = FailurePolicy(rules=[rule])
+
+# Create policy set to manage multiple policies
+policy_set = FailurePolicySet()
+policy_set.add("light_failures", policy)
+policy_set.add("default", policy)
+
+# Use with FailureManager
+from ngraph.failure_manager import FailureManager
+manager = FailureManager(
+    network=network,
+    traffic_matrix_set=traffic_matrix_set,
+    failure_policy_set=policy_set,
+    policy_name="light_failures" # Optional: specify which policy to use
 )
 ```

docs/reference/dsl.md

Lines changed: 35 additions & 20 deletions
@@ -15,7 +15,7 @@ The main sections of a scenario YAML file work together to define a complete net
 - `components`: **[Optional]** A library of hardware and optics definitions with attributes like power consumption.
 - `risk_groups`: **[Optional]** Defines groups of components that might fail together (e.g., all components in a rack or multiple parallel links sharing the same DWDM transmission).
 - `traffic_matrix_set`: **[Optional]** Defines traffic demand matrices between network nodes with various placement policies.
-- `failure_policy`: **[Optional]** Specifies availability parameters and rules for simulating network failures.
+- `failure_policy_set`: **[Optional]** Specifies named failure policies and rules for simulating network failures.
 - `workflow`: **[Optional]** A list of steps to be executed, such as building graphs, running simulations, or performing analyses.

 ## `network` - Core Foundation
@@ -398,30 +398,45 @@ traffic_matrix_set:

 - **`full_mesh`**: Creates individual demands for each (source_node, sink_node) pair, excluding self-pairs (where source equals sink). The total demand volume is split evenly among all valid pairs. This is useful for modeling distributed traffic patterns where every source communicates with every sink.

-## `failure_policy` - Failure Simulation
+## `failure_policy_set` - Failure Simulation

-Defines how network failures are simulated to test resilience and analyze failure scenarios.
+Defines named failure policies for simulating network failures to test resilience and analyze failure scenarios. Each policy contains rules and configuration for how failures are applied.

 ```yaml
-failure_policy:
-  name: "PolicyName" # Optional
-  fail_shared_risk_groups: true | false
-  fail_risk_group_children: true | false
-  use_cache: true | false
-  attrs: # Optional custom attributes for the policy
-    custom_key: value
-  rules:
-    - entity_scope: "node" | "link" | "risk_group"
-      conditions: # Optional: list of conditions to select entities
-        - attr: "attribute_name"
-          operator: "==" | "!=" | ">" | "<" | ">=" | "<=" | "contains" | "not_contains" | "any_value" | "no_value"
-          value: "some_value"
-      logic: "and" | "or" | "any" # How to combine conditions
-      rule_type: "all" | "choice" | "random" # How to select entities matching conditions
-      count: N # For 'choice' rule_type
-      probability: P # For 'random' rule_type (0.0 to 1.0)
+failure_policy_set:
+  policy_name_1:
+    name: "PolicyName" # Optional
+    fail_shared_risk_groups: true | false
+    fail_risk_group_children: true | false
+    use_cache: true | false
+    attrs: # Optional custom attributes for the policy
+      custom_key: value
+    rules:
+      - entity_scope: "node" | "link" | "risk_group"
+        conditions: # Optional: list of conditions to select entities
+          - attr: "attribute_name"
+            operator: "==" | "!=" | ">" | "<" | ">=" | "<=" | "contains" | "not_contains" | "any_value" | "no_value"
+            value: "some_value"
+        logic: "and" | "or" | "any" # How to combine conditions
+        rule_type: "all" | "choice" | "random" # How to select entities matching conditions
+        count: N # For 'choice' rule_type
+        probability: P # For 'random' rule_type (0.0 to 1.0)
+  policy_name_2:
+    # Another failure policy...
+  default:
+    # Default failure policy (used when no specific policy is selected)
+    rules:
+      - entity_scope: "link"
+        rule_type: "choice"
+        count: 1
 ```

+**Policy Selection:**
+
+- If a `default` policy exists, it will be used when no specific policy is selected
+- If only one policy exists and no `default` is specified, that policy becomes the default
+- Multiple policies allow testing different failure scenarios in the same network
+
 ## `workflow` - Execution Steps

 A list of operations to perform on the network. Each step has a `step_type` and specific arguments. This section defines the analysis workflow to be executed.
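
Review note on the "Policy Selection" rules documented in `dsl.md`: the sketch below illustrates the two stated rules on an already-parsed scenario mapping. `PolicySetSketch` and `load_policy_set` are hypothetical stand-ins, not ngraph classes, and policies are kept as plain dicts. The diff does not show what happens when several policies exist without a `default`; returning `None` in that case is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class PolicySetSketch:
    """Hypothetical stand-in for FailurePolicySet (policies are plain dicts)."""

    policies: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def add(self, name: str, policy: Dict[str, Any]) -> None:
        self.policies[name] = policy

    def get_default_policy(self) -> Optional[Dict[str, Any]]:
        # Rule 1: an explicit "default" entry wins.
        if "default" in self.policies:
            return self.policies["default"]
        # Rule 2: a lone policy becomes the default.
        if len(self.policies) == 1:
            return next(iter(self.policies.values()))
        # Assumption: an empty set or an ambiguous choice yields None.
        return None


def load_policy_set(scenario: Dict[str, Any]) -> PolicySetSketch:
    """Build a policy set from an already-parsed scenario mapping."""
    policy_set = PolicySetSketch()
    for name, body in scenario.get("failure_policy_set", {}).items():
        policy_set.add(name, body)
    return policy_set


link_rule = {"entity_scope": "link", "rule_type": "choice", "count": 1}
node_rule = {"entity_scope": "node", "rule_type": "all"}

single = load_policy_set({"failure_policy_set": {"light": {"rules": [link_rule]}}})
named = load_policy_set(
    {"failure_policy_set": {"light": {"rules": []}, "default": {"rules": [node_rule]}}}
)
ambiguous = load_policy_set({"failure_policy_set": {"a": {"rules": []}, "b": {"rules": []}}})
```

Here `single` resolves to its only policy, `named` resolves to the `default` entry, and `ambiguous` resolves to nothing under the stated assumption.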

ngraph/failure_manager.py

Lines changed: 28 additions & 11 deletions
@@ -5,10 +5,9 @@
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from typing import Any, Dict, List, Optional, Tuple

-from ngraph.failure_policy import FailurePolicy
 from ngraph.lib.flow_policy import FlowPolicyConfig
 from ngraph.network import Network
-from ngraph.results_artifacts import TrafficMatrixSet
+from ngraph.results_artifacts import FailurePolicySet, TrafficMatrixSet
 from ngraph.traffic_manager import TrafficManager, TrafficResult


@@ -19,47 +18,65 @@ class FailureManager:
     Attributes:
         network (Network): The underlying network to mutate (enable/disable nodes/links).
         traffic_matrix_set (TrafficMatrixSet): Traffic matrices to place after failures.
+        failure_policy_set (FailurePolicySet): Set of named failure policies.
         matrix_name (Optional[str]): Name of specific matrix to use, or None for default.
-        failure_policy (Optional[FailurePolicy]): The policy describing what fails.
+        policy_name (Optional[str]): Name of specific failure policy to use, or None for default.
         default_flow_policy_config: The default flow policy for any demands lacking one.
     """

     def __init__(
         self,
         network: Network,
         traffic_matrix_set: TrafficMatrixSet,
+        failure_policy_set: FailurePolicySet,
         matrix_name: Optional[str] = None,
-        failure_policy: Optional[FailurePolicy] = None,
+        policy_name: Optional[str] = None,
         default_flow_policy_config: Optional[FlowPolicyConfig] = None,
     ) -> None:
         """Initialize a FailureManager.

         Args:
             network: The Network to be modified by failures.
             traffic_matrix_set: Traffic matrices containing demands to place after failures.
+            failure_policy_set: Set of named failure policies.
             matrix_name: Name of specific matrix to use. If None, uses default matrix.
-            failure_policy: A FailurePolicy specifying the rules of what fails.
+            policy_name: Name of specific failure policy to use. If None, uses default policy.
             default_flow_policy_config: Default FlowPolicyConfig if demands do not specify one.
         """
         self.network = network
         self.traffic_matrix_set = traffic_matrix_set
+        self.failure_policy_set = failure_policy_set
         self.matrix_name = matrix_name
-        self.failure_policy = failure_policy
+        self.policy_name = policy_name
         self.default_flow_policy_config = default_flow_policy_config

     def apply_failures(self) -> None:
-        """Apply the current failure_policy to self.network (in-place).
+        """Apply the current failure policy to self.network (in-place).

-        If failure_policy is None, this method does nothing.
+        If failure_policy_set is empty or no valid policy is found, this method does nothing.
         """
-        if not self.failure_policy:
-            return
+        # Check if we have any policies
+        if len(self.failure_policy_set.policies) == 0:
+            return  # No policies, do nothing
+
+        # Get the failure policy to use
+        if self.policy_name:
+            # Use specific named policy
+            try:
+                failure_policy = self.failure_policy_set.get_policy(self.policy_name)
+            except KeyError:
+                return  # Policy not found, do nothing
+        else:
+            # Use default policy
+            failure_policy = self.failure_policy_set.get_default_policy()
+            if failure_policy is None:
+                return  # No default policy, do nothing

         # Collect node/links as dicts {id: attrs}, matching FailurePolicy expectations
         node_map = {n_name: n.attrs for n_name, n in self.network.nodes.items()}
         link_map = {link_id: link.attrs for link_id, link in self.network.links.items()}

-        failed_ids = self.failure_policy.apply_failures(node_map, link_map)
+        failed_ids = failure_policy.apply_failures(node_map, link_map)

         # Disable the failed entities
         for f_id in failed_ids:
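
Review note on the new lookup in `apply_failures`: the branch structure above has three silent no-op paths (empty set, unknown `policy_name`, no default). The sketch below isolates that lookup so the paths can be exercised directly. `StubPolicySet` and `resolve_policy` are hypothetical names introduced for illustration; the stub models only the explicit `default` entry, which is an assumption, and policies are represented by placeholder strings.

```python
from typing import Any, Dict, Optional


class StubPolicySet:
    """Hypothetical stand-in exposing only what apply_failures touches."""

    def __init__(self, policies: Dict[str, Any]) -> None:
        self.policies = policies

    def get_policy(self, name: str) -> Any:
        return self.policies[name]  # raises KeyError on unknown names

    def get_default_policy(self) -> Optional[Any]:
        # Assumption: only the explicit "default" entry is modeled here.
        return self.policies.get("default")


def resolve_policy(policy_set: StubPolicySet, policy_name: Optional[str] = None) -> Optional[Any]:
    """Mirror the lookup order of the diff; None means 'apply no failures'."""
    if len(policy_set.policies) == 0:
        return None  # empty set
    if policy_name:
        try:
            return policy_set.get_policy(policy_name)
        except KeyError:
            return None  # unknown name is swallowed, as in the diff
    return policy_set.get_default_policy()  # may be None


policies = StubPolicySet({"light_failures": "LIGHT", "default": "DEFAULT"})
selected = resolve_policy(policies, "light_failures")
fallback = resolve_policy(policies)
misspelled = resolve_policy(policies, "ligth_failures")
```

One consequence worth noting: a misspelled `policy_name` resolves to `None`, so the run proceeds with no failures applied and no error raised.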

ngraph/failure_policy.py

Lines changed: 31 additions & 0 deletions
@@ -365,6 +365,37 @@ def _expand_failed_risk_group_children(
                     failed_rgs.add(child_name)
                     queue.append(child_name)

+    def to_dict(self) -> Dict[str, Any]:
+        """Convert to dictionary for JSON serialization.
+
+        Returns:
+            Dictionary representation with all fields as JSON-serializable primitives.
+        """
+        return {
+            "rules": [
+                {
+                    "entity_scope": rule.entity_scope,
+                    "conditions": [
+                        {
+                            "attr": cond.attr,
+                            "operator": cond.operator,
+                            "value": cond.value,
+                        }
+                        for cond in rule.conditions
+                    ],
+                    "logic": rule.logic,
+                    "rule_type": rule.rule_type,
+                    "probability": rule.probability,
+                    "count": rule.count,
+                }
+                for rule in self.rules
+            ],
+            "attrs": self.attrs,
+            "fail_shared_risk_groups": self.fail_shared_risk_groups,
+            "fail_risk_group_children": self.fail_risk_group_children,
+            "use_cache": self.use_cache,
+        }
+

 def _evaluate_condition(entity_attrs: Dict[str, Any], cond: FailureCondition) -> bool:
     """Evaluate a single FailureCondition against entity attributes.
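
Review note on `to_dict`: the sketch below reproduces the dictionary shape from the diff on small stand-in dataclasses and round-trips it through `json`. `ConditionSketch`, `RuleSketch`, and `PolicySketch` are hypothetical names, and the field defaults marked in the comments are assumptions, not values taken from ngraph.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ConditionSketch:
    attr: str
    operator: str
    value: Any


@dataclass
class RuleSketch:
    entity_scope: str
    conditions: List[ConditionSketch] = field(default_factory=list)
    logic: str = "or"         # assumed default
    rule_type: str = "all"    # assumed default
    probability: float = 1.0  # assumed default
    count: int = 1            # assumed default


@dataclass
class PolicySketch:
    rules: List[RuleSketch] = field(default_factory=list)
    attrs: Dict[str, Any] = field(default_factory=dict)
    fail_shared_risk_groups: bool = False   # assumed default
    fail_risk_group_children: bool = False  # assumed default
    use_cache: bool = False                 # assumed default

    def to_dict(self) -> Dict[str, Any]:
        # Same key layout as the to_dict added in this commit.
        return {
            "rules": [
                {
                    "entity_scope": rule.entity_scope,
                    "conditions": [
                        {"attr": c.attr, "operator": c.operator, "value": c.value}
                        for c in rule.conditions
                    ],
                    "logic": rule.logic,
                    "rule_type": rule.rule_type,
                    "probability": rule.probability,
                    "count": rule.count,
                }
                for rule in self.rules
            ],
            "attrs": self.attrs,
            "fail_shared_risk_groups": self.fail_shared_risk_groups,
            "fail_risk_group_children": self.fail_risk_group_children,
            "use_cache": self.use_cache,
        }


policy = PolicySketch(
    rules=[RuleSketch("link", [ConditionSketch("type", "==", "fiber")], rule_type="choice", count=2)]
)
decoded = json.loads(json.dumps(policy.to_dict()))
```

Because `cond.value` and `attrs` are passed through unchanged, the result is only JSON-serializable if those user-supplied values are themselves JSON-serializable.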
