Draft
105 commits
9a5c9fd
Implement Pod BackendPool
georgeedward2000 Jan 9, 2025
e830609
complete implementation CLB ReconcileBackendPools
georgeedward2000 Jan 10, 2025
5e0729d
update getBackendPoolNameForCLBService
georgeedward2000 Jan 10, 2025
e3510f3
complete implementation CLB GetBackendPrivateIPs
georgeedward2000 Jan 10, 2025
66de73a
Update NSG - CLB Load Balancer
georgeedward2000 Jan 17, 2025
3f0f834
Add unit tests for a part of CLB NSG flow + update CLB NSG implementa…
georgeedward2000 Jan 20, 2025
2ff6f2d
update nsg logic for clb + add/refactor tests
georgeedward2000 Jan 27, 2025
807b4d6
Complete testing for Container Load Balancer BackendPool
georgeedward2000 Jan 28, 2025
182d11d
update CLB NSG to support dual-stack rules
georgeedward2000 Jan 30, 2025
e9693c6
add formation flow for clb lb rules
georgeedward2000 Feb 4, 2025
2a45c63
comment typo
georgeedward2000 Feb 4, 2025
e1eb35d
named targetPort is not supported within CLB context
georgeedward2000 Feb 7, 2025
29bb3c8
add unit tests
georgeedward2000 Feb 10, 2025
eae85c9
Implement Pod BackendPool
georgeedward2000 Jan 9, 2025
7737e3b
complete implementation CLB ReconcileBackendPools
georgeedward2000 Jan 10, 2025
b74ed14
update getBackendPoolNameForCLBService
georgeedward2000 Jan 10, 2025
6de9646
complete implementation CLB GetBackendPrivateIPs
georgeedward2000 Jan 10, 2025
0d329fb
Complete testing for Container Load Balancer BackendPool
georgeedward2000 Jan 28, 2025
f52242e
add formation flow for clb lb rules
georgeedward2000 Feb 4, 2025
f51f58c
comment typo
georgeedward2000 Feb 4, 2025
1027221
named targetPort is not supported within CLB context
georgeedward2000 Feb 7, 2025
8f489e7
add unit tests
georgeedward2000 Feb 10, 2025
056ae46
Update NSG - CLB Load Balancer
georgeedward2000 Jan 17, 2025
8e0ad77
Add unit tests for a part of CLB NSG flow + update CLB NSG implementa…
georgeedward2000 Jan 20, 2025
329464c
update nsg logic for clb + add/refactor tests
georgeedward2000 Jan 27, 2025
9e334c0
update CLB NSG to support dual-stack rules
georgeedward2000 Jan 30, 2025
9141356
Update internal/testutil/fixture/azure_managedcluster.go
georgeedward2000 Feb 4, 2025
a0170f1
Merge branch 'clb-backendpool' of https://github.com/david-kow/cloud-…
georgeedward2000 Feb 19, 2025
b2624b4
update error message + return type
georgeedward2000 Feb 19, 2025
aa360af
update error message + return type
georgeedward2000 Feb 19, 2025
55bfee0
Merge branch 'clb-backendpool' of https://github.com/david-kow/cloud-…
georgeedward2000 Feb 19, 2025
f841903
Merge branch 'clb-nsg' of https://github.com/david-kow/cloud-provider…
georgeedward2000 Feb 19, 2025
873212a
Typo
georgeedward2000 Feb 19, 2025
f33f04e
Merge branch 'clb-lb-rules' of https://github.com/david-kow/cloud-pro…
georgeedward2000 Feb 21, 2025
81d5568
extract function + check standardV2 sku
georgeedward2000 Feb 21, 2025
34c9c32
apply DRY by extracting common code in a util function
georgeedward2000 Feb 21, 2025
90d3039
update code + tests for new context: pod cidrs setup through provisio…
georgeedward2000 Feb 24, 2025
9727188
difftracker initial integration (left TODOs)
georgeedward2000 Feb 27, 2025
ac1d5af
1. Updated namings
georgeedward2000 Mar 27, 2025
47c6502
Changes List:
georgeedward2000 Apr 1, 2025
31fbbc7
Changes:
georgeedward2000 Apr 14, 2025
248d01c
Updating namings, add ServiceType to ServicesDataDTO, update tests, r…
georgeedward2000 Apr 17, 2025
ec0944d
Changes:
georgeedward2000 Apr 29, 2025
bdc02d2
updated trigger flow + refactor
georgeedward2000 May 2, 2025
52d99b1
update endpointslice informer flow
georgeedward2000 May 19, 2025
fe09787
Merge remote-tracking branch 'fork/clb-backendpool' into clb-combined…
georgeedward2000 May 19, 2025
5980efe
Merge remote-tracking branch 'fork/clb-lb-rules' into clb-combined-prs
georgeedward2000 May 19, 2025
9f00a1c
Merge remote-tracking branch 'fork/clb-nsg' into clb-combined-prs
georgeedward2000 May 19, 2025
3dfc063
Merge remote-tracking branch 'fork/clb-difftracker' into clb-combined…
georgeedward2000 May 19, 2025
98e59c2
Merge remote-tracking branch 'fork/clb-loadbalancer' into clb-combine…
georgeedward2000 May 19, 2025
f1c703e
update logic to use IsLBBackendPoolTypePodIPAndUseStandardV2LoadBalan…
georgeedward2000 May 19, 2025
823bd59
initial integration + update testing
georgeedward2000 May 20, 2025
eac6384
test EnsureLoadBalancer and EnsureLoadBalancerDeleted for CLB context
georgeedward2000 May 21, 2025
eade66b
WIP - update tests + added debug logs + misc
georgeedward2000 May 26, 2025
8894fd3
implement NRPAPIClientUpdateNRPLocations and NRPAPIClientUpdateNRPSer…
georgeedward2000 Jun 19, 2025
b0ac331
WIP - Added main CLB outbound processing flow and configuration
georgeedward2000 Jul 24, 2025
bcf779b
WIP - Added main CLB outbound processing flow and configuration
georgeedward2000 Jul 24, 2025
cbf74c3
Merge branch 'clb-combined-prs-outbound' of https://github.com/george…
georgeedward2000 Jul 24, 2025
3116fb6
remove file
georgeedward2000 Jul 24, 2025
95e09f4
start background processing worker goroutine
georgeedward2000 Jul 24, 2025
1bc4115
generated azclient.natgatewayclient files
georgeedward2000 Jul 28, 2025
1a468ef
0. Addressed comments
georgeedward2000 Aug 1, 2025
cec3ee5
nil pointer exception issue solution
georgeedward2000 Aug 11, 2025
e3a0392
update instantiation order flow for CLB
georgeedward2000 Aug 12, 2025
9ef2747
Merge branch 'master' into eddie/dev/clb-combined-prs-outbound-synced
Aug 25, 2025
049ecab
feat: implement NAT Gateway and Public IP management functions
Sep 1, 2025
529f8d6
Changes:
georgeedward2000 Sep 15, 2025
f548ae4
checkpoint
georgeedward2000 Sep 16, 2025
c5626e3
wip - checkpoint
georgeedward2000 Sep 18, 2025
e1505e8
Added SGW API locally and generated azclient + created repo within cl…
georgeedward2000 Sep 18, 2025
491e8d9
Added SGW API locally and generated azclient + created repo within cl…
georgeedward2000 Sep 18, 2025
858f06a
Merge remote-tracking branch 'refs/remotes/origin/eddie/dev/clb-combi…
georgeedward2000 Sep 18, 2025
b6077e0
wip - added sku in sgw api
georgeedward2000 Sep 22, 2025
a432779
feat: Add Service Gateway support and refactor NRP service updates
georgeedward2000 Sep 25, 2025
4c07ab3
feat: Implement disassociation of NAT Gateway from Service Gateway an…
georgeedward2000 Sep 25, 2025
fbe2c4f
checkpoint
georgeedward2000 Oct 1, 2025
96425d9
Refactor Service Gateway references to use dynamic resource name
georgeedward2000 Oct 21, 2025
909ebe1
feat: Implement Service Gateway attachment to subnet and ensure defau…
georgeedward2000 Oct 28, 2025
a7dc6ab
Refactor comments and formatting in YAML and inf packages
georgeedward2000 Oct 30, 2025
a6e4f0f
Enhance service gateway existence check and configuration validation
georgeedward2000 Nov 3, 2025
7a28e59
checkpoint: Enhance service gateway existence check and configuration…
georgeedward2000 Nov 19, 2025
83d08cc
wip
georgeedward2000 Nov 25, 2025
ef00f63
feat: Implement orphaned Public IP cleanup in DiffTracker initialization
georgeedward2000 Nov 26, 2025
b9d8ada
Add metrics tracking for service gateway operations and adjust resync…
georgeedward2000 Dec 2, 2025
41aff2d
Enhance Service Gateway handling and backend pool management
georgeedward2000 Dec 12, 2025
82bc36c
clb-wip
georgeedward2000 Dec 15, 2025
3f58d50
wip
georgeedward2000 Dec 16, 2025
7d8851d
wip
georgeedward2000 Dec 16, 2025
9dd49a2
Refactor CLB DiffTracker Engine: Replace IsInbound flag with ServiceC…
georgeedward2000 Dec 16, 2025
e213a8c
wip
georgeedward2000 Dec 16, 2025
6f7a8fe
wip
georgeedward2000 Dec 16, 2025
ebd7499
wip
georgeedward2000 Dec 16, 2025
c469f8b
Refactor Azure Service Gateway Pod Management and Enhance Tests
georgeedward2000 Dec 17, 2025
c82f53c
Enhance DiffTracker: Improve address conversion, handle service delet…
georgeedward2000 Dec 17, 2025
4fbee07
wip
georgeedward2000 Dec 19, 2025
3545390
wip
georgeedward2000 Dec 19, 2025
eb12ff6
wip
georgeedward2000 Dec 19, 2025
eed6011
Enhance DiffTracker with initialization tracking and service synchron…
georgeedward2000 Dec 19, 2025
e7f7d69
Enhance EnsureLoadBalancerDeleted to support asynchronous service del…
georgeedward2000 Jan 12, 2026
d7c17cd
Enhance CLB performance tests and add CCM utilities
georgeedward2000 Jan 15, 2026
562fffd
Add Container Load Balancer deletion crash recovery tests and improve…
georgeedward2000 Jan 23, 2026
97a655f
Refactor Azure Service Gateway code and improve finalizer handling
georgeedward2000 Jan 28, 2026
9aa5a95
Add stress and scale tests for Azure Load Balancer
georgeedward2000 Jan 28, 2026
01323fd
Add validation for ServiceGateway and multiple SLB configurations; im…
georgeedward2000 Feb 23, 2026
25e0d02
Implement Azure error handling and logging enhancements
georgeedward2000 Mar 9, 2026
6 changes: 6 additions & 0 deletions .gitignore
@@ -40,3 +40,9 @@ _rundir
site/.hugo_build.lock

.kubeconfig/

my-vendor/

# pkg/

# vendor/
246 changes: 246 additions & 0 deletions DEFERRED_ISSUES.md
@@ -0,0 +1,246 @@
# Deferred Issues for CLB DiffTracker Engine

This document catalogs design decisions and improvements that should be addressed in future phases.

## Critical Issues (Resolved)
- ✅ **Engine goroutines startup** - ServiceUpdater and LocationsUpdater now started in `InitializeCloudFromConfig`
- ✅ **sync.Map return bug** - Fixed to return `*sync.Map` pointer in `azure_servicegateway_difftracker_init.go`
- ✅ **Test compilation errors** - Fixed old batch updater references in test files

## Design Issues (Deferred)

### 1. EnsureLoadBalancer Return Status Strategy
**Location**: `pkg/provider/azure_loadbalancer.go:368`

**Current State**: Returns empty `LoadBalancerStatus{}` when ServiceGatewayEnabled
**Problem**: Kubernetes expects LoadBalancer IP/hostname to be populated in status
**Impact**: Services may not get external IPs until subsequent reconciliation loops

**Proposed Solutions**:
1. **Status Watcher Pattern**: Implement a goroutine that watches `pendingServiceOps` and updates K8s Service status when `StateCreated` is reached
2. **Blocking Wait**: Make EnsureLoadBalancer wait for Engine completion (contradicts async design)
3. **Status Reconciliation Loop**: Periodically sync Engine state to K8s Service objects

**Recommendation**: Implement Status Watcher (option 1) to maintain async benefits while providing timely status updates

**Code Location**:
```go
// pkg/provider/azure_loadbalancer.go:368
if az.ServiceGatewayEnabled && az.diffTracker != nil {
	serviceUID := getServiceUID(service)
	az.diffTracker.AddService(serviceUID, true)
	// TODO: Implement status watcher to update LoadBalancerStatus after async creation completes
	return &v1.LoadBalancerStatus{}, nil
}
```

---

### 2. Cleanup Code Placement in EnsureLoadBalancerDeleted
**Location**: `pkg/provider/azure_loadbalancer.go:605-615`

**Current State**: Cleanup code (`reconcilePublicIPs`, `localServiceNameToServiceInfoMap.Delete`) exists after synchronous deletion path
**Problem**: This cleanup should also run for ServiceGateway mode deletions, but currently only runs for non-ServiceGateway mode

**Impact**: May leave stale data in `localServiceNameToServiceInfoMap` after async deletions

**Proposed Solution**: Move cleanup code before the ServiceGateway early-return check, or add it to Engine's deletion completion callback

**Code Location**:
```go
// pkg/provider/azure_loadbalancer.go:605-615
// Synchronous cleanup for non-ServiceGateway mode
if _, err = az.reconcilePublicIPs(ctx, clusterName, service, "", false); err != nil {
	return err
}
if az.UseMultipleStandardLoadBalancers() && isLocalService(service) {
	key := strings.ToLower(svcName)
	az.localServiceNameToServiceInfoMap.Delete(key)
}
```
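
The second proposed solution (running the cleanup from the Engine's deletion completion callback) can be pictured with a small self-contained sketch. Everything here is a hypothetical stand-in: `cloud`, `cleanupLocalService`, `ensureDeleted`, and the `enqueue` hook are illustration names, not the provider's real types, and `reconcilePublicIPs` is left out entirely.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// cloud is a hypothetical stand-in for the provider's Cloud struct.
type cloud struct {
	serviceGatewayEnabled bool
	localServices         sync.Map // lower-cased service name -> info
}

// cleanupLocalService holds the cleanup that must run in both modes.
func (c *cloud) cleanupLocalService(svcName string) {
	c.localServices.Delete(strings.ToLower(svcName))
}

// ensureDeleted mimics the early return for ServiceGateway mode. The async
// path hands the cleanup to a completion callback instead of skipping it.
func (c *cloud) ensureDeleted(svcName string, enqueue func(onDone func())) {
	if c.serviceGatewayEnabled {
		enqueue(func() { c.cleanupLocalService(svcName) })
		return
	}
	c.cleanupLocalService(svcName)
}

func main() {
	c := &cloud{serviceGatewayEnabled: true}
	c.localServices.Store("default/web", struct{}{})

	var pending []func()
	c.ensureDeleted("default/Web", func(onDone func()) { pending = append(pending, onDone) })

	// Simulate the Engine finishing the asynchronous deletion.
	for _, done := range pending {
		done()
	}
	_, stillThere := c.localServices.Load("default/web")
	fmt.Println("stale entry remains:", stillThere)
}
```

The point of the shape is that the early return no longer decides whether cleanup happens, only when it happens.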

---

## Minor Issues (Deferred)

### 3. GetClusterName() Returns Empty String
**Location**: `pkg/provider/azure_servicegateway_cloud_provider_interface.go:22`

**Current State**: Returns `""` because clusterName is passed as a parameter, not stored in the Cloud struct
**Impact**: May cause issues if Engine or difftracker code expects a non-empty cluster name

**Proposed Solution**: Store clusterName in Cloud struct during initialization for easier access

---

### 4. Duplicate Config Import
**Location**: `pkg/provider/azure.go:47-48`

**Current State**: `"sigs.k8s.io/cloud-provider-azure/pkg/provider/config"` imported twice
**Impact**: Go linter warning (ST1019)

**Solution**: Remove duplicate import line

---

### 5. Phase 6 Work Deferred
**Status**: Files renamed to `.phase6` extension

**Scope**: Outbound flow migration to Engine pattern
**Files**:
- `azure_servicegateway_pod_egress_resource_updater.go.phase6`
- `azure_servicegateway_pods.go.phase6`

**Work Required**:
1. Restore files (remove `.phase6` extension)
2. Wire pod informer to call `Engine.AddPod()` / `Engine.UpdatePod()` / `Engine.DeletePod()`
3. Update Engine to handle pod egress assignments
4. Remove `podEgressResourceUpdater` references (old batch updater)

**Commented Code**:
```go
// pkg/provider/azure.go:789
// TODO Phase 6: Restore pod informer setup for egress after migrating to Engine pattern
// if az.ServiceGatewayEnabled {
// 	az.setUpPodInformerForEgress()
// }
```

---

## Testing Work (Deferred)

### 6. Update Tests to Use Engine Pattern
**Location**: Multiple test files

**Current State**: Tests have old batch updater code commented out with TODOs
**Files**:
- `pkg/provider/azure_loadbalancer_test.go:857, 958`
- `pkg/provider/azure_local_services_test.go:660`

**Work Required**:
1. Start Engine goroutines in test setup
2. Mock ServiceUpdater/LocationsUpdater if needed
3. Add Engine state assertions to verify async operations
4. Test buffered endpoint scenarios

---

### 7. Comprehensive Engine Testing
**Status**: No dedicated Engine tests exist

**Work Required**:
1. Unit tests for `engine.go` methods (AddService, UpdateEndpoints, DeleteService, OnServiceCreationComplete, CheckPendingDeletions)
2. Integration tests for ServiceUpdater parallel execution (test semaphore, retries, error handling)
3. Integration tests for LocationsUpdater sync logic
4. Race condition testing (concurrent AddService + UpdateEndpoints + DeleteService)
5. Buffering behavior tests (endpoints arrive before service creation complete)
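
As a rough illustration of what items 4 and 5 would exercise, the sketch below models the buffering rule in isolation: endpoint updates that arrive before a service is created are held, then flushed on completion. The `engine` type and the method signatures are assumptions made for this example; only the method names `UpdateEndpoints` and `OnServiceCreationComplete` come from the list above.

```go
package main

import (
	"fmt"
	"sync"
)

// engine is a hypothetical, stripped-down model of the buffering rule:
// endpoints for a service are held until that service's creation completes.
type engine struct {
	mu       sync.Mutex
	created  map[string]bool
	buffered map[string][]string
	applied  map[string][]string
}

func newEngine() *engine {
	return &engine{
		created:  map[string]bool{},
		buffered: map[string][]string{},
		applied:  map[string][]string{},
	}
}

// UpdateEndpoints applies addresses immediately if the service exists,
// otherwise it buffers them.
func (e *engine) UpdateEndpoints(uid string, addrs ...string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.created[uid] {
		e.applied[uid] = append(e.applied[uid], addrs...)
		return
	}
	e.buffered[uid] = append(e.buffered[uid], addrs...)
}

// OnServiceCreationComplete marks the service created and flushes its buffer.
func (e *engine) OnServiceCreationComplete(uid string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.created[uid] = true
	e.applied[uid] = append(e.applied[uid], e.buffered[uid]...)
	delete(e.buffered, uid)
}

func main() {
	e := newEngine()
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			e.UpdateEndpoints("svc-1", fmt.Sprintf("10.0.0.%d", i))
		}(i)
	}
	e.OnServiceCreationComplete("svc-1") // deliberately races with the updates
	wg.Wait()
	fmt.Println(len(e.applied["svc-1"]), len(e.buffered["svc-1"]))
}
```

Whatever the interleaving, every address should end up applied and nothing should remain buffered; running a test of this shape under `go test -race` is one way to cover item 4.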

---

## Metrics Work (Deferred)

### 8. Add Engine Metrics
**Status**: Placeholder metrics in locations_updater.go, no comprehensive instrumentation

**Metrics Needed**:
1. **Service Operation Metrics**:
   - `engine_service_operations_total{operation="create|delete", status="success|failure"}`
   - `engine_service_operation_duration_seconds{operation="create|delete"}`
   - `engine_pending_service_operations{state="not_started|creation_in_progress|created|deletion_pending|deletion_in_progress"}`

2. **Buffering Metrics**:
   - `engine_buffered_endpoints_total`
   - `engine_buffered_pods_total`
   - `engine_buffer_duration_seconds` (time endpoints/pods spend buffered)

3. **LocationsUpdater Metrics**:
   - `engine_locations_sync_total{status="success|failure"}`
   - `engine_locations_sync_duration_seconds`
   - `engine_location_addresses_synced{action="add|remove|update"}`

4. **Retry Metrics**:
   - `engine_operation_retries_total{operation="create|delete"}`

---

## Documentation Work (Deferred)

### 9. Engine Architecture Documentation
**Status**: No formal documentation exists

**Documentation Needed**:
1. Architecture diagram showing Engine components (ServiceUpdater, LocationsUpdater, channels, state machine)
2. State transition diagram (StateNotStarted → StateCreationInProgress → StateCreated → StateDeletionPending → StateDeletionInProgress)
3. Sequence diagrams for:
   - Service creation flow (EnsureLoadBalancer → AddService → ServiceUpdater → OnServiceCreationComplete)
   - Service deletion flow (EnsureLoadBalancerDeleted → DeleteService → LocationsUpdater → CheckPendingDeletions)
   - Endpoint update flow (EndpointSlice handler → UpdateEndpoints → LocationsUpdater)
4. Buffering behavior explanation (when/why endpoints are buffered)
5. Migration guide from old batch updater pattern to Engine pattern
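
Until the diagrams exist, the state chain from item 2 can be written down as a small transition table. This is a sketch under the assumption that only the linear chain listed above is legal; the real Engine may allow additional edges (for example, deletion requested while creation is still in progress). The state names come from the list; `serviceState`, `next`, and `transition` are hypothetical.

```go
package main

import "fmt"

type serviceState int

const (
	StateNotStarted serviceState = iota
	StateCreationInProgress
	StateCreated
	StateDeletionPending
	StateDeletionInProgress
)

// next records the single forward transition allowed out of each state,
// following the chain listed in item 2. The last state has no successor.
var next = map[serviceState]serviceState{
	StateNotStarted:         StateCreationInProgress,
	StateCreationInProgress: StateCreated,
	StateCreated:            StateDeletionPending,
	StateDeletionPending:    StateDeletionInProgress,
}

// transition reports an error for any move that is not in the table.
func transition(from, to serviceState) error {
	if allowed, ok := next[from]; !ok || allowed != to {
		return fmt.Errorf("invalid transition %d -> %d", from, to)
	}
	return nil
}

func main() {
	fmt.Println(transition(StateNotStarted, StateCreationInProgress))
	fmt.Println(transition(StateCreated, StateNotStarted))
}
```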

---

## Future Enhancements

### 10. Status Watcher Implementation
**Status**: Not implemented

**Purpose**: Update K8s Service status after async LoadBalancer creation completes
**Design**: Goroutine watching `pendingServiceOps` map, triggers K8s client updates when state transitions to `StateCreated`

---

### 11. Graceful Shutdown
**Status**: Context cancellation exists, but no drain logic

**Work Required**:
1. Drain `serviceUpdaterTrigger` and `locationsUpdaterTrigger` channels before shutdown
2. Wait for in-flight operations to complete (use sync.WaitGroup in ServiceUpdater)
3. Persist pending operations to disk or allow them to retry on restart

---

### 12. Observability Improvements
**Status**: Basic klog statements exist

**Work Required**:
1. Structured logging with consistent fields (serviceUID, operation, state, duration)
2. Trace context propagation through Engine operations
3. Debug endpoints for inspecting Engine state (`/debug/difftracker/services`, `/debug/difftracker/buffered`)

---

## Phase Status Summary

| Phase | Status | Description |
|-------|--------|-------------|
| Phase 0-4 | ✅ Complete | Engine infrastructure (types, ServiceUpdater, LocationsUpdater) |
| Phase 5 | ✅ Complete | Inbound flow integration (AddService, DeleteService, UpdateEndpoints) |
| Phase 6 | ⏸️ Deferred | Outbound flow migration (pod egress to Engine pattern) |
| Phase 7 | ✅ Complete | Old code cleanup (batch updaters removed) |
| Phase 8 | ✅ Complete | Engine goroutine startup |
| Phase 9 | ⏸️ Deferred | Metrics instrumentation |
| Phase 10 | ⏸️ Deferred | Comprehensive testing |

---

## Priority Recommendations

**High Priority** (Address before production):
1. EnsureLoadBalancer return status strategy (#1)
2. Cleanup code placement (#2)
3. Engine metrics (#8)
4. Basic Engine testing (#7)

**Medium Priority** (Address in next iteration):
5. Phase 6 - Outbound flow migration (#5)
6. Status watcher implementation (#10)
7. Test updates (#6)
8. Documentation (#9)

**Low Priority** (Nice to have):
9. GetClusterName fix (#3)
10. Duplicate import cleanup (#4)
11. Graceful shutdown (#11)
12. Observability improvements (#12)