# Constrained Policy Gradient algorithms for power grid management using Grid2Op
This repository is based on MagicRL. It has been adapted and extended to focus on constrained RL for power grid operation via Grid2Op.
This project implements policy gradient methods -- including the constrained variant CPGPE -- applied to the Learning to Run a Power Network (L2RPN) environments. The agent learns to operate a power grid by selecting discrete topology actions (e.g., connecting/disconnecting lines), while optionally satisfying safety constraints on line loading, voltage deviations, and other operational metrics.
- Installation
- Quick Start
- Running Experiments
- Algorithms
- Policies
- Grid2Op Environments
- Cost Functions for Constrained RL
- Project Structure
- Diagnostic Baselines
- Documentation
- License
## Installation

Python 3.10+ is required.

```bash
conda create --name constrained_grid2op python=3.10
conda activate constrained_grid2op
pip install -r requirements.txt
pip install numba
```

The first time you use a Grid2Op scenario, it will be downloaded automatically. You can also pre-download it:

```python
import grid2op

grid2op.make("l2rpn_case14_sandbox", test=True)
```

## Quick Start

Run a short training session with CPGPE on the default Grid2Op environment:
```bash
python run.py --alg cpgpe --ite 10 --batch 20 --horizon 100 --var 0.1 --costs 1 --cost_type mv
```

Results are saved automatically in the `experiments/` directory.
## Running Experiments

All experiments are launched through `run.py`. Results (JSON logs and best policy parameters) are saved under the `--dir` directory, organized by experiment configuration and trial number.

### General Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--dir` | str | `experiments/` | Directory where results are saved |
| `--ite` | int | `100` | Number of training iterations |
| `--batch` | int | `100` | Number of trajectories per iteration |
| `--horizon` | int | `100` | Episode length (max timesteps) |
| `--gamma` | float | `1` | Discount factor |
| `--n_trials` | int | `1` | Number of independent runs |
| `--n_workers` | int | `1` | Number of parallel workers for trajectory sampling |
| `--clip` | int | `1` | Whether to clip actions (`0` or `1`) |
### Algorithm Parameters

| Parameter | Type | Default | Choices | Description |
|---|---|---|---|---|
| `--alg` | str | `pgpe` | `pgpe`, `pg`, `dpg`, `cpgpe` | Algorithm to use |
| `--var` | float | `1` | -- | Exploration variance (σ²) for parameter perturbation |
| `--lr` | float | `0.001` | -- | Learning rate |
| `--lr_strategy` | str | `adam` | `adam`, `constant` | Learning rate schedule |
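As a point of reference for what `--var` and `--lr` control, below is a minimal, illustrative sketch of a PGPE-style update: policy parameters are perturbed with Gaussian noise of variance σ², each perturbation is rolled out, and the mean of the parameter distribution is moved along a Monte Carlo gradient estimate. This is a simplification, not the implementation in this repository; `evaluate_return` is a hypothetical callable that runs one episode with the given parameters and returns its episodic return.

```python
import numpy as np

def pgpe_step(mu, sigma2, evaluate_return, batch_size=50, lr=1e-3):
    """One illustrative PGPE update of the parameter-distribution mean."""
    sigma = np.sqrt(sigma2)
    # Sample perturbed policy parameters theta_i ~ N(mu, sigma^2 I)
    thetas = mu + sigma * np.random.randn(batch_size, mu.size)
    # Roll out each perturbation and record its episodic return
    returns = np.array([evaluate_return(theta) for theta in thetas])
    baseline = returns.mean()  # simple baseline for variance reduction
    # Monte Carlo estimate of grad_mu E[J]: E[(R - b) * (theta - mu) / sigma^2]
    grad_mu = ((returns - baseline)[:, None] * (thetas - mu)).mean(axis=0) / sigma2
    return mu + lr * grad_mu  # constant-step variant of the --lr_strategy choices
```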
### Policy Parameters

| Parameter | Type | Default | Choices | Description |
|---|---|---|---|---|
| `--pol` | str | `nn_softmax` | `linear`, `nn`, `big_nn`, `nn_softmax` | Policy architecture |

- `nn_softmax` (recommended for Grid2Op): Neural network with softmax output for discrete actions. Initialized with a bias towards the "do nothing" action (action 0) for stable learning.
- `nn`: Neural network with continuous output.
- `big_nn`: Larger neural network (4 hidden layers).
- `linear`: Linear policy.
### Environment Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--grid2op_env` | str | `l2rpn_case14_sandbox` | Grid2Op scenario name (see available environments below) |
### Cost Parameters

| Parameter | Type | Default | Choices | Description |
|---|---|---|---|---|
| `--costs` | int | `0` | `0`, `1` | Enable cost-aware environment (`1` = enabled) |
| `--cost_type` | str | `tc` | `tc`, `cvar`, `mv`, `chance` | Cost aggregation type |

Cost types:

- `tc` -- Trajectory cost: penalizes the cumulative cost along the trajectory.
- `mv` -- Mean-variance: penalizes both expected cost and its variance (risk-sensitive).
- `cvar` -- Conditional Value at Risk: penalizes the tail of the cost distribution.
- `chance` -- Chance constraint: penalizes the probability of exceeding a threshold.
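The four cost types above are different ways of turning per-trajectory costs into a single constraint statistic. The sketch below shows one plausible formulation in numpy; the threshold and CVaR level are illustrative, and the exact definitions used by CPGPE in this repository may differ.

```python
import numpy as np

def aggregate_costs(traj_costs, cost_type, threshold=1.0, alpha=0.95):
    """Illustrative aggregation of cumulative costs (one value per trajectory)."""
    c = np.asarray(traj_costs, dtype=float)
    if cost_type == "tc":      # trajectory cost: expected cumulative cost
        return c.mean()
    if cost_type == "mv":      # mean-variance: expected cost plus a variance penalty
        return c.mean() + c.var()
    if cost_type == "cvar":    # CVaR: mean of the worst (1 - alpha) fraction of costs
        tail_start = np.quantile(c, alpha)
        return c[c >= tail_start].mean()
    if cost_type == "chance":  # chance constraint: probability of exceeding a threshold
        return (c > threshold).mean()
    raise ValueError(f"unknown cost_type: {cost_type}")
```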
### Example Commands

PGPE (unconstrained) on the sandbox environment:

```bash
python run.py --alg pgpe --pol nn_softmax --ite 100 --batch 50 --horizon 100 --var 0.1 --lr 0.001
```

CPGPE (constrained) with mean-variance cost:

```bash
python run.py --alg cpgpe --pol nn_softmax --ite 100 --batch 100 --horizon 100 \
    --var 0.1 --lr 0.001 --costs 1 --cost_type mv
```

CPGPE on the WCCI 2020 environment (larger grid):

```bash
python run.py --alg cpgpe --pol nn_softmax --ite 100 --batch 100 --horizon 200 \
    --var 0.1 --lr 0.001 --costs 1 --cost_type tc --grid2op_env l2rpn_wcci_2020
```

Multiple trials with parallelism:

```bash
python run.py --alg cpgpe --pol nn_softmax --ite 100 --batch 50 --horizon 100 \
    --var 0.1 --costs 1 --cost_type mv --n_trials 5 --n_workers 4
```

## Policies

| Policy | Class | Suitable For |
|---|---|---|
| `nn_softmax` | `policies.NNSoftmax` | Discrete action spaces (Grid2Op). Outputs action probabilities via softmax. |
| `nn` | `policies.NeuralNetworkPolicy` | Continuous action spaces. |
| `big_nn` | `policies.NeuralNetworkPolicy` | Continuous action spaces (larger network). |
| `linear` | `policies.OldLinearPolicy` | Linear parametrization. |
When using `nn_softmax`, the output layer bias is initialized to strongly prefer the "do nothing" action (action index 0). This is critical for Grid2Op, where unnecessary interventions often cause cascading failures. The bias is part of the learnable parameters, so the agent can learn to take other actions when beneficial.
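The sketch below illustrates the idea with a deliberately tiny, self-contained policy (not the repository's `policies.NNSoftmax`): the output bias for action 0 starts large, so the initial softmax concentrates probability on "do nothing" while remaining a learnable parameter.

```python
import numpy as np

class TinySoftmaxPolicy:
    """Toy softmax policy with a "do nothing" bias; sizes are illustrative."""

    def __init__(self, obs_dim, n_actions, do_nothing_bias=5.0):
        self.W = 0.01 * np.random.randn(obs_dim, n_actions)
        self.b = np.zeros(n_actions)
        self.b[0] = do_nothing_bias  # action 0 starts with most of the probability mass

    def action_probs(self, obs):
        logits = obs @ self.W + self.b
        logits -= logits.max()  # for numerical stability
        exp = np.exp(logits)
        return exp / exp.sum()

policy = TinySoftmaxPolicy(obs_dim=100, n_actions=20)
print(policy.action_probs(np.zeros(100)))  # probability concentrated on action 0 at init
```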
## Grid2Op Environments

The following L2RPN environments are supported (sorted by grid size):

| Environment | Substations | Lines | Features |
|---|---|---|---|
| `l2rpn_case14_sandbox` | 14 | 20 | Smallest grid, no maintenance or opponent. Good for development. |
| `l2rpn_wcci_2020` | 36 | 59 | Maintenance events, redispatching. Medium difficulty. |
| `l2rpn_wcci_2022` | 36 | -- | Newer version of the WCCI scenario. |
| `l2rpn_neurips_2020_track1_small` | 36 | -- | Maintenance, opponent attacks, redispatching. Small dataset. |
| `l2rpn_neurips_2020_track1_large` | 36 | -- | Same grid as above, larger dataset. |
| `l2rpn_neurips_2020_track2_small` | 118 | -- | Largest grid. Small dataset. |
| `l2rpn_neurips_2020_track2_large` | 118 | -- | Largest grid. Large dataset. |

Note: "small"/"large" in the NeurIPS environments refers to the chronics dataset size, not the grid size.
## Cost Functions for Constrained RL

When `--costs 1` is passed, the environment computes per-step cost signals that the CPGPE algorithm uses to enforce constraints. The available cost functions are configured in the `cost_config` dictionary inside `run.py`:
| Cost Function | Description |
|---|---|
| `rho_max` | Maximum line loading excess above threshold |
| `rho_violations_count` | Number of overloaded lines |
| `rho_violations_sum` | Sum of line loading above threshold (default) |
| `rho_violations_quadratic` | Quadratic penalty for loading violations |
| `disconnections` | Number of disconnected lines |
| `overflow_duration` | Total overflow timesteps across all lines |
| `voltage_deviation` | Maximum voltage deviation from acceptable bounds |
| `curtailment` | Total renewable energy curtailment (MW) |
| `redispatch` | Total absolute redispatch (MW) |
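Several of these metrics can be derived directly from the per-step Grid2Op observation. The sketch below shows one plausible formulation for a few of them, using the standard observation attributes `obs.rho`, `obs.line_status`, and `obs.timestep_overflow`; the exact formulas used in this repository are documented in `docs/grid2op_attributes.md`.

```python
import numpy as np

def step_costs(obs, rho_threshold=0.8):
    """Illustrative per-step cost metrics computed from a Grid2Op observation."""
    excess = np.maximum(obs.rho - rho_threshold, 0.0)  # per-line loading above threshold
    return {
        "rho_max": float(excess.max()),
        "rho_violations_count": int((obs.rho > rho_threshold).sum()),
        "rho_violations_sum": float(excess.sum()),
        "rho_violations_quadratic": float((excess ** 2).sum()),
        "disconnections": int((~obs.line_status).sum()),
        "overflow_duration": int(obs.timestep_overflow.sum()),
    }
```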
Default configuration uses `rho_violations_sum` with a threshold of 0.8 (80% line loading). To customize, edit the `cost_config` in `run.py`:

```python
"cost_config": {
    "costs": ["rho_violations_sum", "disconnections"],  # Multiple costs
    "rho_threshold": 0.8,
    "voltage_bounds": (0.95, 1.05),
    "weights": [1.0, 0.5]
}
```

For a detailed reference of Grid2Op observation attributes and cost metric definitions, see `docs/grid2op_attributes.md`.
## Diagnostic Baselines

The `test_baselines.py` script evaluates simple baseline policies to establish performance bounds:

```bash
python test_baselines.py
```

This runs three policies and reports statistics:
- Do Nothing: Always takes action 0 (no intervention). Typically the strongest baseline on simple scenarios.
- Random: Uniformly random actions. Usually causes rapid grid failure.
- Biased Random: Randomly selects actions with a strong preference for "do nothing".
The output includes mean/std of rewards, costs, episode lengths, and the percentage of episodes that survive the full horizon.
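For comparison, a "do nothing" baseline takes only a few lines with the standard Grid2Op API. The sketch below is a minimal version of such an evaluation loop, not the exact logic of `test_baselines.py`:

```python
import grid2op

env = grid2op.make("l2rpn_case14_sandbox", test=True)
do_nothing = env.action_space({})  # empty action dictionary = no intervention

returns, lengths = [], []
for _ in range(10):
    obs = env.reset()
    total, steps, done = 0.0, 0, False
    while not done:
        obs, reward, done, info = env.step(do_nothing)
        total += reward
        steps += 1
    returns.append(total)
    lengths.append(steps)

print("mean return:", sum(returns) / len(returns))
print("mean episode length:", sum(lengths) / len(lengths))
```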
## Documentation

- `docs/grid2op_attributes.md` -- Comprehensive reference for Grid2Op observation attributes, cost metric formulas, and recommended configurations for CPGPE.
- Grid2Op Documentation
- L2RPN Competition