Skip to content

daemon: fix data race in getClientConfig iterating rules during load#1604

Draft
tredondo wants to merge 1 commit into
evilsocket:masterfrom
tredondo:fix/getclientconfig-rules-race
Draft

daemon: fix data race in getClientConfig iterating rules during load#1604
tredondo wants to merge 1 commit into
evilsocket:masterfrom
tredondo:fix/getclientconfig-rules-race

Conversation

@tredondo
Copy link
Copy Markdown
Contributor

Bug

On a host with a large rule corpus (~2100 *.json files under rules.Path), the daemon panics during startup, reliably, every time the UI is already running:

panic: runtime error: index out of range [1002] with length 1002
goroutine 98 [running]:
github.com/evilsocket/opensnitch/daemon/ui.(*Client).getClientConfig(...)
        /root/build/opensnitch/daemon/ui/notifications.go:44 +0x33a
github.com/evilsocket/opensnitch/daemon/ui.(*Client).Subscribe(...)
        /root/build/opensnitch/daemon/ui/notifications.go:349 +0x85
created by github.com/evilsocket/opensnitch/daemon/ui.(*Client).onStatusChange in goroutine 30
        /root/build/opensnitch/daemon/ui/client.go:238 +0xa9

systemd Restart=on-failure then puts the daemon into a permanent crash loop (counter climbed to 8+ in our testing) — the firewall never comes up.

Indices in the panic line vary across restarts (1002, 1222, 1353, 1415, ...) — they're snapshots of partial rule loads.

Root cause

Loader.GetAll() returns the loader's internal map by reference under an RLock that's released the instant the call returns:

func (l *Loader) GetAll() map[string]*Rule {
    l.RLock()
    defer l.RUnlock()
    return l.rules
}

getClientConfig() then calls it twice — once for len() to size a slice, once to range over — without holding any lock:

rulesTotal := len(c.rules.GetAll())
ruleList := make([]*protocol.Rule, rulesTotal)
idx := 0
for _, r := range c.rules.GetAll() {
    ruleList[idx] = r.Serialize()       // ← panic when idx == rulesTotal
    idx++
}

If the loader is still inserting entries between the two GetAll() calls — which is the common case at startup — the second call's map has grown, the iteration produces more entries than the pre-sized slice, and ruleList[idx] walks off the end.

The race is reachable because reloadConfiguration calls c.Connect() (which starts the connection poller) before c.rules.Reload(newConfig.Rules.Path). With a small rule set, Reload finishes before the poller's first tick and there's no observable race. With a few thousand rules, the poller wins, the UI subscribes back, and getClientConfig runs in a separate goroutine while the loader is still synchronously parsing files.

go test -race catches the underlying data race too: mapIterNext (reader, post-GetAll) vs. mapassign_faststr in replaceUserRule (writer, holding l.Lock()).

Fix

daemon/rule/loader.goGetAll() now returns a shallow snapshot of the map under the read lock. Same return type, same *Rule pointers, just a freshly allocated map container so len() and iteration are guaranteed consistent for the caller, with no lock-discipline requirement.

daemon/ui/notifications.go — call GetAll() once into a local variable; size and iterate that one map. Minimal change to keep the diff focused.

Loader.GetAll has exactly one caller in the daemon today (getClientConfig), so the contract change is contained. Other callers (none) would have been broken by the previous behavior anyway, since iterating the returned map without holding any lock is unsafe.

The *Rule values in the snapshot are still shared with the loader, but rules are replaced wholesale (replaceUserRule writes a new pointer into the map) rather than mutated in place, so the snapshot remains internally consistent for the read-only use cases this method serves.

Test

TestRuleLoaderGetAllSnapshot in daemon/rule/loader_test.go exercises the exact pattern that used to crash: writers churn the map via replaceUserRule/Delete, readers do the len()-then-iterate dance, and the test asserts the two counts agree. Verified that the test:

  • Fails on master (without the fix) under -race: race detector flags mapIterNext vs. mapassign_faststr, plus the inconsistency check trips.
  • Passes with the fix, both with and without -race.

What this PR doesn't fix

  • The nil-deref in firewall.Serialize() from the same getClientConfig site reported in [Bug Report] daemon v1.7.0.0-1 panic runtime error INVALIDARGUMENT #1380 — different root cause (firewall not initialized yet), worth its own change.
  • The ordering bug in reloadConfiguration where c.Connect() runs before c.rules.Reload(). Fixing the ordering would make the race unreachable too, but is a broader surgery; fixing the locking contract at GetAll addresses the bug at the right layer.

Loader.GetAll() returned the loader internal map by reference under an

RLock that was released as the call returned. getClientConfig() then

called GetAll() twice -- once for len() to size a slice, once to range

over the entries -- without holding any lock. A concurrent goroutine

(e.g. the loader still synchronously processing rules during

reloadConfiguration) writing to the map between the two calls makes the

iteration yield more entries than the pre-sized slice, panicking with

"index out of range [N] with length N". On bad luck the Go runtime

trips "concurrent map iteration and map write" instead.

The race is reachable on real systems whenever the UI poller, started

by reloadConfiguration via c.Connect(), reconnects before

c.rules.Reload(path) finishes loading rules from disk. Consistently

reproducible with ~2100 rules on disk.

Make GetAll() return a shallow snapshot of the rule map under the read

lock so length and iteration are guaranteed consistent for the caller.

Update the one caller (getClientConfig) to take the snapshot once.

Add a regression test running concurrent writers and readers against

the loader; -race catches the original bug, and the new GetAll

contract makes it pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant