Skip to content

Enforce mode analysis: current limits too generous, tuning recommendations #42

Description

@mrtwebdesign

Summary

Analysis of enforce mode effectiveness based on 18 days of query_guard observe-mode data (May 30 – Jun 16, 2026) from binoidcbd.com production. With current limits, enforce mode would have killed 2 out of 3,499 slow queries on the worst day (Jun 8) and saved 23 seconds out of 32,218 seconds of slow query DB time (0.07%). The limits are too generous for the actual query duration distribution. Tuning recommendations below would increase effectiveness to an estimated 2,000-3,000s/day saved.

Prepared by Claude Code.


What Enforce Mode Does

When switched from observe to enforce, query_guard sets SET SESSION MAX_EXECUTION_TIME on every MySQL connection, with per-context ceilings:

Context Limit What runs here
wp_cli Unlimited CLI operations
action_scheduler Unlimited FB sync, subscriptions, background jobs
wp_cron 10s Cron tasks
admin_ajax 10s CS agent searches (AJAX), order note loading
checkout 60s Customer checkout flow
rest_api 30s Goaffpro, WooCommerce REST API
wp_admin 45s Admin page loads (non-AJAX)
frontend 30s Customer-facing page loads

When a query exceeds its context's limit, MySQL kills it (error 3024). For admin searches specifically, the plugin shows a "Search timed out" notice instead of "no results found," breaking the retry loop where CS agents re-search and re-trigger the same slow query.

Why Current Limits Don't Help

The damage is not from individual extreme queries. It is from volume — thousands of 5-10 second queries running concurrently and saturating the MySQL thread pool. The enforce-mode limits are set high enough that almost nothing gets killed.

Jun 8 (Peak Crisis — 3,499 Slow Queries, 32,218s Total DB Time)

Context Slow queries DB time Would be killed DB time saved
rest_api 3,055 26,361s 1 (at 36s) 6s
wp_admin 205 4,497s 1 (at 62s) 17s
admin_ajax 223 1,262s 0 0s
action_scheduler 16 98s 0 (unlimited) 0s
Total 3,499 32,218s 2 (0.06%) 23s (0.07%)

Jun 14 (Post-Crisis Baseline — 367 Slow Queries)

Zero queries exceeded any enforce-mode limit.

All 18 Days (May 30 – Jun 16)

Date Slow queries Would be killed Breakdown
May 30 299 8 ajax=2, cron=1, admin=2, rest=2, front=1
May 31 243 0
Jun 1 296 1 ajax=1
Jun 2 347 0
Jun 3 417 0
Jun 4 441 0
Jun 5 358 0
Jun 6 217 0
Jun 7 492 0
Jun 8 3,499 2 admin=1, rest=1
Jun 9 2,828 25 admin=20, rest=5
Jun 10 2,885 19 admin=16, rest=3
Jun 11 3,216 20 ajax=2, admin=17, rest=1
Jun 12 2,840 23 admin=21, rest=2
Jun 13 2,794 7 ajax=3, admin=1, rest=3
Jun 14 367 0
Jun 15 626 2 ajax=1, admin=1
Jun 16 196 12 admin=1, rest=10, front=1

Across all 18 days, ~97 queries would have been killed out of ~21,000 total slow queries (~0.5%).

Why the Limits Miss

The actual query durations sit comfortably under their context ceilings:

  • admin_ajax (10s limit): Queries average 5.6s, max ~9.5s on a normal day. Nothing gets killed.
  • REST API (30s limit): Goaffpro queries average 8.6s, max ~10s normally. Nothing gets killed.
  • wp_admin (45s limit): CS admin searches average 17-22s, max 40s normally. Nearly all pass through — and these consume the most DB time (3,334s/day at baseline).
  • action_scheduler (unlimited): FB sync queries (up to 50s) and subscription queries (up to 117s) are never constrained.

Risk Assessment

Safe to turn on. Specific risks evaluated:

Concern Status
Customer checkout Protected — 60s ceiling, detected before admin_ajax
WP-CLI operations Unlimited — no impact
Action Scheduler jobs Unlimited — no impact
WPE connection rotation Handled — plugin re-applies timeout on connection change
Legitimate long admin operations Low risk — bulk exports/edits rarely exceed 45s
Per-request overhead Negligible — one SET SESSION query per request

What it does provide:

  • Safety net against truly runaway queries (60s+ admin, 30s+ REST)
  • Admin search "timed out" notice for the handful of 45s+ searches (breaks CS retry loop on those specific cases)
  • query_killed event logging for visibility

Root Cause Comparison

Fix Estimated daily savings Enforce mode equivalent
FB sync cleanup (396K orphaned rows) ~715s/day Enforce saves 0s (AS is unlimited)
Admin postmeta search fix ~3,300s/day Enforce saves ~0-17s (most under 45s)
Goaffpro REST volume (already resolved) ~26,000s/day at peak Enforce saved 6s on worst day
Enforce mode (all contexts) ~0-23s/day

Tuning Recommendations

The current limits were chosen conservatively. Based on 18 days of observed query durations, the following changes would make enforce mode substantially more effective without breaking legitimate operations.

1. Lower wp_admin from 45s to 20s

The wp_admin context is dominated by CS admin postmeta LIKE '%term%' searches. On a baseline day, these average 17s with a median of 18s. A 20s ceiling would:

  • Kill the top ~45% of admin searches that currently run 20-40s
  • Save an estimated ~1,500-2,000s of DB time per day at baseline
  • Still allow the majority of admin page loads (non-search pages complete in <5s)
  • The "Search timed out" notice already handles the UX — CS agents see a clear message instead of blank results

Risk: legitimate admin operations (WooCommerce exports, bulk edits) that exceed 20s would be killed. These are rare in the logs but should be monitored after the change. Specific operations can be exempted via the hypercart_query_guard_limit_ms filter.

2. Lower rest_api from 30s to 15s

REST API slow queries average 7-9s. A 15s ceiling would:

  • Catch any future Goaffpro-style REST abuse much earlier (the Jun 8 crisis queries averaged 8.6s but would pile up to 36s under load — a 15s cap would have killed them once contention inflated durations)
  • Still allow normal WooCommerce REST operations (product syncs, order lookups) which complete well under 15s
  • The checkout endpoint is detected separately (60s ceiling), so customer transactions are unaffected

3. Cap action_scheduler at 30s instead of unlimited

The unlimited AS ceiling is the largest gap. FB sync queries run up to 50s, and subscription renewal queries have been observed at 117s. A 30s ceiling would:

  • Kill the extreme tail of background jobs that hold threads during load spikes
  • Still allow normal AS operations (most complete in 5-8s)
  • The AS throttle system (batch_size, time_limit, deferrals) provides separate concurrency control, so a per-query ceiling adds defense-in-depth without conflicting

Risk: some legitimate long-running background jobs (large batch imports, subscription processing) could be killed. The AS runner retries failed actions, so killed jobs would be re-attempted on the next batch cycle. Monitor query_killed events with context: action_scheduler after enabling.

4. Consider a dedicated admin_search context

The current context detection routes all admin AJAX through admin_ajax (10s) and all admin page loads through wp_admin (45s). But admin searches on the orders/products list tables are the specific high-cost operation. A dedicated admin_search context (detected by the presence of $_GET['s'] on admin list table pages) could apply a tighter ceiling (e.g., 15s) to searches specifically, while leaving other wp_admin operations at a more generous limit.

Summary of Proposed Limits

Context Current Proposed Rationale
wp_cli Unlimited Unlimited No change
action_scheduler Unlimited 30s Cap extreme tail, AS retries handle failures
wp_cron 10s 10s No change
admin_ajax 10s 10s No change — already tight
checkout 60s 60s No change — protects revenue
rest_api 30s 15s Catches abuse earlier, normal ops well under 15s
wp_admin 45s 20s Kills top ~45% of admin search scans
frontend 30s 30s No change

With these tighter limits, enforce mode would have killed an estimated ~100-150 queries/day at baseline (vs. ~0 currently) and ~500-800/day during the Jun 8-13 crisis (vs. ~16/day), saving an estimated 2,000-3,000s of DB time daily instead of 0-23s.

Conclusion

Enforce mode with the current limits is a per-query duration guard that is too generous to catch the actual problem. Turning it on adds a reasonable safety net at negligible cost, but should not be expected to improve performance without tightening the limits.

With the proposed tuning above, enforce mode becomes a meaningful contributor — not a replacement for root-cause fixes (FB sync cleanup, admin search optimization), but a useful complement that limits the blast radius of the queries those fixes target.

AS Throttle Mode: Detailed Analysis

The Action Scheduler throttle is a separate system from the main enforce mode, with its own mode setting (as_throttle_mode) and its own enforce behavior. It is currently running at test_observe (capability probe only — logs what it would do but takes no action).

What AS Throttle Enforce Does

The AS throttle monitors two signals every time Action Scheduler runs a batch:

Signal Elevated threshold Critical threshold Exit threshold
threads_running (MySQL) >= 5 >= 15 <= 3 (elevated), <= 10 (critical)
queue_depth (overdue AS actions) >= 100 >= 500 <= 50 (elevated), <= 250 (critical)

The higher of the two signals wins. When elevated or critical, enforce mode applies:

Batch constraints (reduce concurrency):

Level batch_size time_limit concurrent_batches
Normal 25 (WC default) 30s (WC default) Unlimited
Elevated 5 15s 1
Critical 1 10s 1

Action deferrals (delay low-priority work):

Priority tier Elevated delay Critical delay Example hooks
Critical 0 (run immediately) 0 (run immediately) nofraud_*, woocommerce_payment_*
High 0 (run immediately) 300s (5 min) woocommerce_scheduled_subscription_*, webhooks
Normal 300s (5 min) 900s (15 min) woocommerce_run_*, action_scheduler_*
Deferrable 900s (15 min) 3600s (1 hr) facebook_for_woocommerce_*, shipstation_*, klaviyo_*

Recurring actions are never deferred (to avoid breaking recurrence chains). Each action can be deferred at most 5 times before forced execution.

The Queue Pileup Problem

This is the primary concern. The data shows that AS throttle enforce would create a feedback loop that prevents recovery.

On Jun 8 (crisis day), the critical load level was triggered 2,602 times. Of those:

  • 2,585 were queue-driven (queue_depth >= 500)
  • Only 1 was thread-driven (threads_running >= 15)

This means the queue depth — not actual MySQL thread pressure — was almost entirely responsible for triggering critical mode. Once the queue backs up past 500, the following cycle occurs:

Queue hits 500 overdue actions
  → Critical level triggered
    → Batch size reduced to 1 action per batch
      → AS drains queue ~25x slower
        → Queue grows faster than it drains
          → Queue stays above 500
            → Critical level persists indefinitely

Observed queue depths confirm this:

Day Max queue depth Max threads_running % of AS runs at critical
Jun 1 (baseline) 1,129 25 1.6%
Jun 7 (healthy) 47 10 0%
Jun 8 (crisis) 6,524 37 54%
Jun 9 3,277 12 27%
Jun 13 3,932 100 34%
Jun 14 (recovered) 64 8 0%
Jun 15 1,436 17 5%

The queue is already piling up without throttling — 6,524 overdue actions on Jun 8. Throttling the drain rate to 1 action/batch would make this dramatically worse. At that rate, the queue would take over 100 hours to drain even if no new actions were added.

The deferral system compounds this: deferred actions get rescheduled 5-60 minutes into the future. When load drops back to normal, all those deferred actions come due simultaneously, creating a secondary surge that can re-trigger elevated/critical and restart the cycle.

Why Threads vs. Queue Matters

The two signals measure fundamentally different things:

  • threads_running >= 15 means the MySQL server is genuinely overloaded right now. Throttling AS is correct — stop adding work to a saturated system.
  • queue_depth >= 500 means AS has a backlog. Throttling makes the backlog worse. The queue is a symptom (actions take too long because the DB is slow), not the cause.

On baseline days (Jun 1), elevated was triggered by threads_running 391 times vs. queue_depth 63 times — a healthier ratio. On crisis days, queue_depth completely dominates the signal.

Recommendation: Do Not Enable AS Throttle Enforce Yet

Short answer: With the current thresholds, AS throttle enforce would make crisis events worse by preventing queue drainage while the queue_depth signal keeps the system locked in critical mode.

If AS throttle enforce is desired in the future, the following changes would mitigate the feedback loop:

  1. Remove or raise the queue_depth thresholds. Queue depth is a lagging indicator — by the time the queue hits 500, the damage is done and throttling can't help. Either disable queue_depth as a signal entirely, or raise the critical threshold to 2,000+ so it only fires during genuine runaway scenarios.

  2. Rely on threads_running as the primary signal. This directly measures MySQL pressure and is the correct trigger for throttling. The existing thresholds (5 elevated, 15 critical) are reasonable.

  3. Soften the critical batch constraints. batch_size=1 is too aggressive — it reduces throughput 25x. A critical batch_size of 5 (same as current elevated) with time_limit=10s would reduce throughput ~5x while still providing meaningful relief.

  4. Start with observe mode (not enforce). Move from test_observe to observe first. Observe mode logs what it would do without actually throttling. This provides the data to validate whether the throttle decisions are correct before enabling enforce.

Priority: Fix Root Causes First

The AS throttle is a compensating control for when the DB is overloaded. It cannot fix the underlying overload. The root-cause fixes (FB sync cleanup, postmeta search optimization) would reduce the average AS action execution time, which would:

  • Prevent queue backup in the first place (actions complete faster → queue stays small)
  • Reduce threads_running naturally (fewer long queries → fewer concurrent threads)
  • Make the throttle unnecessary for most scenarios

Once the root causes are addressed, the throttle's value shifts from "crisis management" to "safety net" — and at that point, the threads_running signal alone (with queue_depth removed) would be sufficient.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions