daemon: fix data race in getClientConfig iterating rules during load#1604
Draft
tredondo wants to merge 1 commit into
Draft
daemon: fix data race in getClientConfig iterating rules during load#1604tredondo wants to merge 1 commit into
tredondo wants to merge 1 commit into
Conversation
Loader.GetAll() returned the loader internal map by reference under an RLock that was released as the call returned. getClientConfig() then called GetAll() twice -- once for len() to size a slice, once to range over the entries -- without holding any lock. A concurrent goroutine (e.g. the loader still synchronously processing rules during reloadConfiguration) writing to the map between the two calls makes the iteration yield more entries than the pre-sized slice, panicking with "index out of range [N] with length N". On bad luck the Go runtime trips "concurrent map iteration and map write" instead. The race is reachable on real systems whenever the UI poller, started by reloadConfiguration via c.Connect(), reconnects before c.rules.Reload(path) finishes loading rules from disk. Consistently reproducible with ~2100 rules on disk. Make GetAll() return a shallow snapshot of the rule map under the read lock so length and iteration are guaranteed consistent for the caller. Update the one caller (getClientConfig) to take the snapshot once. Add a regression test running concurrent writers and readers against the loader; -race catches the original bug, and the new GetAll contract makes it pass.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bug
On a host with a large rule corpus (~2100
*.jsonfiles underrules.Path), the daemon panics during startup, reliably, every time the UI is already running:systemd
Restart=on-failurethen puts the daemon into a permanent crash loop (counter climbed to 8+ in our testing) — the firewall never comes up.Indices in the panic line vary across restarts (1002, 1222, 1353, 1415, ...) — they're snapshots of partial rule loads.
Root cause
Loader.GetAll()returns the loader's internal map by reference under an RLock that's released the instant the call returns:getClientConfig()then calls it twice — once forlen()to size a slice, once to range over — without holding any lock:If the loader is still inserting entries between the two
GetAll()calls — which is the common case at startup — the second call's map has grown, the iteration produces more entries than the pre-sized slice, andruleList[idx]walks off the end.The race is reachable because
reloadConfigurationcallsc.Connect()(which starts the connection poller) beforec.rules.Reload(newConfig.Rules.Path). With a small rule set,Reloadfinishes before the poller's first tick and there's no observable race. With a few thousand rules, the poller wins, the UI subscribes back, andgetClientConfigruns in a separate goroutine while the loader is still synchronously parsing files.go test -racecatches the underlying data race too:mapIterNext(reader, post-GetAll) vs.mapassign_faststrinreplaceUserRule(writer, holdingl.Lock()).Fix
daemon/rule/loader.go—GetAll()now returns a shallow snapshot of the map under the read lock. Same return type, same*Rulepointers, just a freshly allocated map container solen()and iteration are guaranteed consistent for the caller, with no lock-discipline requirement.daemon/ui/notifications.go— callGetAll()once into a local variable; size and iterate that one map. Minimal change to keep the diff focused.Loader.GetAllhas exactly one caller in the daemon today (getClientConfig), so the contract change is contained. Other callers (none) would have been broken by the previous behavior anyway, since iterating the returned map without holding any lock is unsafe.The
*Rulevalues in the snapshot are still shared with the loader, but rules are replaced wholesale (replaceUserRulewrites a new pointer into the map) rather than mutated in place, so the snapshot remains internally consistent for the read-only use cases this method serves.Test
TestRuleLoaderGetAllSnapshotindaemon/rule/loader_test.goexercises the exact pattern that used to crash: writers churn the map viareplaceUserRule/Delete, readers do thelen()-then-iterate dance, and the test asserts the two counts agree. Verified that the test:-race: race detector flagsmapIterNextvs.mapassign_faststr, plus the inconsistency check trips.-race.What this PR doesn't fix
firewall.Serialize()from the samegetClientConfigsite reported in [Bug Report] daemon v1.7.0.0-1 panic runtime error INVALIDARGUMENT #1380 — different root cause (firewall not initialized yet), worth its own change.reloadConfigurationwherec.Connect()runs beforec.rules.Reload(). Fixing the ordering would make the race unreachable too, but is a broader surgery; fixing the locking contract atGetAlladdresses the bug at the right layer.