
Improve CO save performance for bulk updates #8057

Open
acwhite211 wants to merge 7 commits into v7_12_0_6 from issue-8055

Conversation

@acwhite211 acwhite211 commented May 4, 2026

Fixes #8055

The new CollectionObject/Component catalognumber uniqueness rule was running during every CollectionObject save, even when the catalognumber and collection were unchanged. These catalognumber checks seemed unnecessary in that case and were likely the main cause of the slowdown during batch edits, data set commits, and bulk copy operations.

The most straightforward fix seemed to be skipping the component catalognumber duplicate check when a CO save does not change the catalognumber or collection. I also added caching of the CollectionPreferences uniqueness setting during bulk create, bulk copy, and data set upload operations. One other change that might add some speedup is scoping component duplicate lookups to just the current collection; I added that for now, but might remove it if it doesn't provide much speedup or if catalog numbers need to remain unique between collections.

Still need to do some more testing and performance checking to confirm this solution completely solves the issue.

Checklist

  • Self-review the PR after opening it to make sure the changes look good and
    self-explanatory (or properly documented)
  • Add relevant issue to release milestone
  • Add PR to documentation list

Testing instructions

#8055 (comment)

  • In Batch Edit, update a non-catalognumber field across many Collection Objects and confirm the operation completes noticeably faster than before.
  • In Data Set commit, create or update Collection Objects and confirm progress advances steadily without the previous long delay.
  • Create a Collection Object with a catalognumber that matches a Component in the same collection and confirm the duplicate warning still appears.
  • Create or update a Collection Object with a catalog number that matches a Component in a different collection and confirm it is allowed.
  • Confirm regular Collection Object create/edit workflows still save successfully.

Test the cat num uniqueness feature from #6708

Uniqueness: Preference to have catalog number unique across CO and Component table

  • In your collection preference add:
    "uniqueCatalogNumberAccrossComponentAndCO": {"behavior": {"uniqueness": true}}
  • Save
  • Clear cache
  • Refresh
  • Create a new CO
  • Save
  • Open a new tab
  • Create a CO
  • Save
  • Add a component
  • Enter the cat num used in your first CO in your component cat number field
  • Try to save
  • Verify there is a save blocker saying that the cat num value is already in use
  • Do the same but this time use a cat num value from another component in your component cat num field
  • Verify there is also a save blocker when using another component cat num value

coderabbitai Bot commented May 4, 2026

📝 Walkthrough

Walkthrough

A context-scoped caching mechanism for catalog-number preference lookups is introduced across bulk operations and workbench uploads to reduce redundant computations. The cache is initialized via a context manager and consulted by preference-checking logic before full computation. All bulk creation, bulk copy, and upload entry points are wired to activate this cache.

Changes

Catalog Number Preference Caching

  • Cache Infrastructure (specifyweb/backend/businessrules/utils.py): ContextVar-backed cache with (collection.id, user.id) keys stores computed unique catalog-number preferences. New cache_unique_catnum_preferences() context manager initializes/resets the cache per context. Function get_unique_catnum_across_comp_co_coll_pref() consults the cache before computing and caches the result.
  • Business Logic Refinement (specifyweb/backend/businessrules/rules/collectionobject_rules.py): Added _collection_object_catalog_check_needed(co) helper to conditionally gate catalog-number duplication checks based on whether the object has a catalog number and is new or pre-existing. Updated collectionobject_pre_save to run the BusinessRuleException path only when the agent has a specifyuser and the helper returns true. Simplified the duplicate-check query to remove exclude(pk=co.pk).
  • Bulk Operation Wiring (specifyweb/backend/bulk_copy/bulk_copy.py, specifyweb/specify/api/dispatch.py): collection_dispatch_bulk_copy and collection_dispatch_bulk wrap their per-object post_resource() loops inside a cache_unique_catnum_preferences() context. Both functions remain otherwise unchanged.
  • Workbench Upload Integration (specifyweb/backend/workbench/upload/upload.py): The main upload savepoint in do_upload() is wrapped with the cache_unique_catnum_preferences() context manager to cache preferences during the upload loop.
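The cache infrastructure summarized above follows a common ContextVar pattern; here is an illustrative sketch (the context-manager name mirrors the summary, but the body is an assumption, not the actual source):

```python
from contextlib import contextmanager
from contextvars import ContextVar
from typing import Optional

# Per-context cache keyed by (collection_id, user_id); None means inactive.
_pref_cache: ContextVar[Optional[dict]] = ContextVar("_pref_cache", default=None)

@contextmanager
def cache_unique_catnum_preferences():
    """Activate the preference cache for the duration of a bulk operation."""
    token = _pref_cache.set({})
    try:
        yield
    finally:
        _pref_cache.reset(token)  # restore the previous (usually inactive) state

def compute_pref(collection_id, user_id):
    """Stand-in for the expensive preference read (counts calls for the demo)."""
    compute_pref.calls = getattr(compute_pref, "calls", 0) + 1
    return True

def get_unique_catnum_pref(collection_id, user_id):
    """Consult the cache before computing, and store the computed result."""
    cache = _pref_cache.get()
    key = (collection_id, user_id)
    if cache is not None and key in cache:
        return cache[key]
    result = compute_pref(collection_id, user_id)
    if cache is not None:
        cache[key] = result
    return result
```

Outside the context manager the cache is None, so ordinary single-object saves would be unaffected.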
🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (2 warnings)

  • Automatic Tests (⚠️ Warning): PR includes significant functional changes but lacks new automatic tests for the caching mechanism and catalog check logic, despite existing comprehensive test infrastructure. Resolution: add tests verifying cache initialization/reset, catalog check detection, caching mechanism usage in bulk operations, and skipped checks when catalog fields are unchanged.
  • Testing Instructions (⚠️ Warning): PR template requires a 'Testing instructions' section but it is marked TODO and not filled in. Resolution: add a 'Testing instructions' section specifying test suites, manual verification steps, performance benchmarking, and regression testing to validate the changes.
✅ Passed checks (4 passed)
  • Title check (✅ Passed): The title accurately summarizes the main objective: improving CollectionObject save performance during bulk operations through caching and optimized checks.
  • Linked Issues check (✅ Passed): The PR addresses the core requirement from issue #8055 by implementing optimizations (caching of uniqueness preferences and skipping unnecessary checks) to restore acceptable bulk operation performance.
  • Out of Scope Changes check (✅ Passed): All changes are directly focused on performance optimization for bulk operations and catalog number uniqueness checks; no unrelated modifications detected.
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.


@acwhite211 acwhite211 changed the base branch from main to v7_12_0_6 May 4, 2026 22:20
@acwhite211 (Member Author):

@coderabbitai review

coderabbitai Bot commented May 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
specifyweb/backend/businessrules/utils.py (1)

33-57: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid caching the fallback False after a preference read/parsing failure.

Right now any exception path leaves unique_catnum_enabled = False and still stores that value in the context cache. A single transient get_app_resource(...)/JSON failure would then disable the uniqueness rule for every later save in the same bulk operation.

Suggested fix
 def get_unique_catnum_across_comp_co_coll_pref(collection, user) -> bool:
     import specifyweb.backend.context.app_resource as app_resource

     cache = _unique_catnum_pref_cache.get()
@@
-    unique_catnum_enabled: bool = False
+    unique_catnum_enabled: bool = False
+    cacheable = True
@@
     except json.JSONDecodeError:
+        cacheable = False
         logger.warning(f"Error: Could not decode JSON for collection preferences")
     except TypeError as e:
+        cacheable = False
         logger.warning(f"Error: Unexpected data structure in collection preferences: {e}")
     except Exception as e:
+        cacheable = False
         logger.warning(f"An unexpected error occurred: {e}")

-    if cache is not None:
+    if cache is not None and cacheable:
         cache[cache_key] = unique_catnum_enabled
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specifyweb/backend/businessrules/utils.py` around lines 33 - 57, The current
try/except always leaves unique_catnum_enabled=False on errors and then writes
that fallback into cache; change this so the cache is only updated when the
preference read/parse actually succeeded. For example, after calling
app_resource.get_app_resource and json.loads and validating the nested keys (the
logic that computes unique_catnum_enabled from collection_prefs_json,
unique_catalog_number_pref and behavior), set a local success flag (or move the
cache write) and only assign cache[cache_key] = unique_catnum_enabled when
success is True and cache is not None; do not write to cache inside any except
path that handles JSONDecodeError/TypeError/Exception. Ensure references:
unique_catnum_enabled, cache, cache_key, app_resource.get_app_resource,
collection_prefs_json, json.loads, unique_catalog_number_pref, behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 7b811e07-b495-4def-9f5b-ac142c11a48f

📥 Commits

Reviewing files that changed from the base of the PR and between 52eaa9d and b3b0b2c.

📒 Files selected for processing (5)
  • specifyweb/backend/bulk_copy/bulk_copy.py
  • specifyweb/backend/businessrules/rules/collectionobject_rules.py
  • specifyweb/backend/businessrules/utils.py
  • specifyweb/backend/workbench/upload/upload.py
  • specifyweb/specify/api/dispatch.py

@acwhite211 (Member Author):

Thank you, Monsieur Lapin 🐰

@CarolineDenis CarolineDenis added this to the 7.12.0.6 milestone May 5, 2026
@acwhite211 (Member Author):

If this fix isn't enough of a speedup, I have another solution idea that involves a schema migration adding an index on the Component.catalognumber field:

    operations = [
        migrations.AddIndex(
            model_name="component",
            index=models.Index(
                fields=["catalognumber"],
                name="ComponentCatNumIDX",
            ),
        ),
    ]

@acwhite211 acwhite211 marked this pull request as ready for review May 6, 2026 15:19
@acwhite211 acwhite211 requested review from a team May 6, 2026 15:19
@bhumikaguptaa (Collaborator) left a comment:


  • In Batch Edit, update a non-catalognumber field across many Collection Objects and confirm the operation completes noticeably faster than before.
  • In Data Set commit, create or update Collection Objects and confirm progress advances steadily without the previous long delay.
  • Create a Collection Object with a catalognumber that matches a Component in the same collection and confirm the duplicate warning still appears.
  • Create or update a Collection Object with a catalog number that matches a Component in a different collection and confirm it is allowed.
  • Confirm regular Collection Object create/edit workflows still save successfully.

Everything works as expected, except that when I tried to upload a dataset with approximately 12,000 records via the Workbench, it took around 20 minutes (and is still going) to validate and then upload the dataset. The time it took is similar to what Demo fish/main took. I am unsure if it is supposed to take this much time.

(screenshot attached)

@kwhuber (Contributor) left a comment:


  • In Batch Edit, update a non-catalognumber field across many Collection Objects and confirm the operation completes noticeably faster than before.

  • In Data Set commit, create or update Collection Objects and confirm progress advances steadily without the previous long delay.

  • Create a Collection Object with a catalognumber that matches a Component in the same collection and confirm the duplicate warning still appears.

  • Create or update a Collection Object with a catalog number that matches a Component in a different collection and confirm it is allowed.

  • Confirm regular Collection Object create/edit workflows still save successfully.

  • For Data set commit, I noticed marginal improvement, but it wasn't too long a wait to begin with. I altered a comments field on ~4800 CO records. Testing data set commits on uw_geo, main committed about 17 records every 3 seconds, while the issue-8055 branch committed about 20-25 records every 3 seconds, so slightly quicker.

@CarolineDenis CarolineDenis requested a review from melton-jason May 6, 2026 19:56

acwhite211 commented May 6, 2026

I worked some more on speeding up the data set commit by adding some caching functionality. I updated the fix so the bulk/dataset context now caches Component catalog numbers per collection.

Good to test the speed of the data set commit again.

If the caching solution looks good, then maybe we can discuss whether adding an index to the component.catalognumber field would be better.
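A minimal sketch of that per-collection caching idea (names are hypothetical; the real implementation lives in the bulk/dataset context):

```python
# Illustrative: fetch each collection's Component catalog numbers once,
# then answer duplicate checks from an in-memory set instead of a DB query.
_catnum_cache = {}  # collection_id -> set of catalog numbers

def component_catnums(collection_id, fetch):
    """Fetch the collection's Component catalog numbers once, then reuse."""
    if collection_id not in _catnum_cache:
        _catnum_cache[collection_id] = set(fetch(collection_id))
    return _catnum_cache[collection_id]

def is_duplicate_catnum(catnum, collection_id, fetch):
    """O(1) membership test against the cached set."""
    return catnum in component_catnums(collection_id, fetch)
```

The trade-off versus an index on component.catalognumber is memory for the cached sets against one bulk query per collection instead of one lookup per saved record.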

@CarolineDenis CarolineDenis requested review from a team, bhumikaguptaa and kwhuber May 7, 2026 08:44
@CarolineDenis (Contributor):

Updated 1153 COs in less than 5 min


Labels

None yet

Projects

Status: 📋Back Log

Development

Successfully merging this pull request may close these issues.

v7.12 slow down

5 participants