Improve CO save performance for bulk updates #8057

acwhite211 wants to merge 7 commits into v7_12_0_6 from issue-8055
Conversation
📝 Walkthrough

A context-scoped caching mechanism for catalog-number preference lookups is introduced across bulk operations and workbench uploads to reduce redundant computations. The cache is initialized via a context manager and consulted by preference-checking logic before full computation. All bulk creation, bulk copy, and upload entry points are wired to activate this cache.

Changes: Catalog Number Preference Caching
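As a rough illustration of that pattern, the sketch below shows a ContextVar-backed cache activated by a context manager. The variable name `_unique_catnum_pref_cache` comes from the review excerpt further down; the context-manager name `cache_catnum_prefs` is hypothetical, and the PR's actual implementation may differ.

```python
# Minimal sketch of a context-scoped cache, assuming a ContextVar that holds
# a dict only while a bulk operation is active. `cache_catnum_prefs` is a
# hypothetical name for the context manager the entry points would wrap.
from contextlib import contextmanager
from contextvars import ContextVar

_unique_catnum_pref_cache: ContextVar[dict | None] = ContextVar(
    "unique_catnum_pref_cache", default=None
)

@contextmanager
def cache_catnum_prefs():
    # Activate an empty cache for the duration of the bulk operation.
    token = _unique_catnum_pref_cache.set({})
    try:
        yield
    finally:
        # Deactivate so ordinary single-record saves stay uncached.
        _unique_catnum_pref_cache.reset(token)
```

A bulk entry point would then wrap its loop in `with cache_catnum_prefs(): ...`, so every save within that operation shares one preference lookup per collection.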
@coderabbitai review

✅ Actions performed: Review triggered.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
specifyweb/backend/businessrules/utils.py (1)
33-57: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid caching the fallback `False` after a preference read/parsing failure.

Right now any exception path leaves `unique_catnum_enabled = False` and still stores that value in the context cache. A single transient `get_app_resource(...)`/JSON failure would then disable the uniqueness rule for every later save in the same bulk operation.

Suggested fix:
```diff
 def get_unique_catnum_across_comp_co_coll_pref(collection, user) -> bool:
     import specifyweb.backend.context.app_resource as app_resource

     cache = _unique_catnum_pref_cache.get()
@@
-    unique_catnum_enabled: bool = False
+    unique_catnum_enabled: bool = False
+    cacheable = True
@@
     except json.JSONDecodeError:
+        cacheable = False
         logger.warning(f"Error: Could not decode JSON for collection preferences")
     except TypeError as e:
+        cacheable = False
         logger.warning(f"Error: Unexpected data structure in collection preferences: {e}")
     except Exception as e:
+        cacheable = False
         logger.warning(f"An unexpected error occurred: {e}")

-    if cache is not None:
+    if cache is not None and cacheable:
         cache[cache_key] = unique_catnum_enabled
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `specifyweb/backend/businessrules/utils.py` around lines 33-57: the current try/except always leaves unique_catnum_enabled=False on errors and then writes that fallback into cache; change this so the cache is only updated when the preference read/parse actually succeeded. For example, after calling app_resource.get_app_resource and json.loads and validating the nested keys (the logic that computes unique_catnum_enabled from collection_prefs_json, unique_catalog_number_pref and behavior), set a local success flag (or move the cache write) and only assign cache[cache_key] = unique_catnum_enabled when success is True and cache is not None; do not write to cache inside any except path that handles JSONDecodeError/TypeError/Exception. Ensure references: unique_catnum_enabled, cache, cache_key, app_resource.get_app_resource, collection_prefs_json, json.loads, unique_catalog_number_pref, behavior.
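A full-function version of the reviewer's suggestion might look roughly like the sketch below. It leans on the `_unique_catnum_pref_cache` ContextVar from the diff; the return-value unpacking of `get_app_resource` and the shape of `cache_key` are assumptions, not confirmed by the excerpt.

```python
# Hedged sketch of "only cache on success". The real function is
# get_unique_catnum_across_comp_co_coll_pref in
# specifyweb/backend/businessrules/utils.py; how get_app_resource's return
# value is unpacked and how cache_key is built are assumptions here.
import json
import logging

logger = logging.getLogger(__name__)

def get_unique_catnum_across_comp_co_coll_pref(collection, user) -> bool:
    import specifyweb.backend.context.app_resource as app_resource

    cache = _unique_catnum_pref_cache.get()
    cache_key = getattr(collection, "id", None)  # assumed key shape
    if cache is not None and cache_key in cache:
        return cache[cache_key]

    unique_catnum_enabled = False
    cacheable = True  # flipped off on any failure so the fallback is never cached
    try:
        # Assumed: the raw preference payload is the first element returned.
        raw = app_resource.get_app_resource(collection, user, "CollectionPreferences")[0]
        prefs = json.loads(raw)
        behavior = prefs["uniqueCatalogNumberAccrossComponentAndCO"]["behavior"]
        unique_catnum_enabled = bool(behavior.get("uniqueness", False))
    except Exception as e:  # covers JSONDecodeError, TypeError, KeyError, ...
        cacheable = False
        logger.warning(f"Could not read collection preferences: {e}")

    if cache is not None and cacheable:
        cache[cache_key] = unique_catnum_enabled
    return unique_catnum_enabled
```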
📒 Files selected for processing (5)
- specifyweb/backend/bulk_copy/bulk_copy.py
- specifyweb/backend/businessrules/rules/collectionobject_rules.py
- specifyweb/backend/businessrules/utils.py
- specifyweb/backend/workbench/upload/upload.py
- specifyweb/specify/api/dispatch.py
Thank you, Monsieur Lapin 🐰
If this fix isn't enough of a speed-up, I have another solution idea that involves a schema migration for adding an index on the Component.catalognumber field:

```python
operations = [
    migrations.AddIndex(
        model_name="component",
        index=models.Index(
            fields=["catalognumber"],
            name="ComponentCatNumIDX",
        ),
    ),
]
```
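For context, that snippet would sit inside a standard Django migration module, roughly as below; the app label and dependency name are placeholders, since the actual migration history isn't shown in this thread.

```python
# Hypothetical wrapper migration for the AddIndex operation above.
# The app label ("specify") and the dependency are illustrative
# placeholders, not taken from the real migration history.
from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [
        ("specify", "0001_initial"),  # placeholder: the real dependency will differ
    ]

    operations = [
        migrations.AddIndex(
            model_name="component",
            index=models.Index(
                fields=["catalognumber"],
                name="ComponentCatNumIDX",
            ),
        ),
    ]
```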
bhumikaguptaa left a comment
- In Batch Edit, update a non-catalognumber field across many Collection Objects and confirm the operation completes noticeably faster than before.
- In Data Set commit, create or update Collection Objects and confirm progress advances steadily without the previous long delay.
- Create a Collection Object with a catalognumber that matches a Component in the same collection and confirm the duplicate warning still appears.
- Create or update a Collection Object with a catalog number that matches a Component in a different collection and confirm it is allowed.
- Confirm regular Collection Object create/edit workflows still save successfully.
Everything works as expected, except that when I tried to upload a dataset with approximately 12,000 records via the Workbench, it took around 20 minutes (and was still going) to validate and then upload the dataset. The time it took is similar to what Demo fish/main took. I am unsure whether it is supposed to take this long.
kwhuber left a comment
- In Batch Edit, update a non-catalognumber field across many Collection Objects and confirm the operation completes noticeably faster than before.
- In Data Set commit, create or update Collection Objects and confirm progress advances steadily without the previous long delay.
- Create a Collection Object with a catalognumber that matches a Component in the same collection and confirm the duplicate warning still appears.
- Create or update a Collection Object with a catalog number that matches a Component in a different collection and confirm it is allowed.
- Confirm regular Collection Object create/edit workflows still save successfully.
- For Data Set commit, I noticed a marginal improvement, but the wait wasn't too long to begin with. I altered a comments field on ~4,800 CO records. Testing Data Set manipulation on uw_geo, main committed about 17 records every 3 seconds, while the issue-8055 branch committed about 20-25 records every 3 seconds, so slightly quicker.
I worked some more on trying to speed up the data set commit by adding some caching functionality. I updated the fix so the bulk/dataset context now caches Component catalog numbers per collection. Good to test out the speed of the data set commit again. If the caching solution looks good, then maybe we can discuss whether adding an index to the component.catalognumber field would be better.
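A rough sketch of what "caches Component catalog numbers per collection" could look like is below; the `Component` import path, the query's relation path, and the `_component_catnum_cache` name are assumptions for illustration, not the PR's actual code.

```python
# Illustrative per-collection cache of Component catalog numbers. The
# ContextVar would be populated by the same bulk/dataset context manager
# sketched earlier.
from contextvars import ContextVar

from specifyweb.specify.models import Component  # assumed import path

_component_catnum_cache: ContextVar[dict | None] = ContextVar(
    "component_catnum_cache", default=None
)

def component_catnums(collection) -> set:
    cache = _component_catnum_cache.get()
    if cache is not None and collection.id in cache:
        return cache[collection.id]
    # One query per collection instead of one per saved CollectionObject.
    catnums = set(
        Component.objects
        .filter(collectionobject__collection=collection)  # assumed relation path
        .values_list("catalognumber", flat=True)
    )
    if cache is not None:
        cache[collection.id] = catnums
    return catnums
```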
Updated 1153 COs in less than 5 min
Fixes #8055
The new CollectionObject/Component catalognumber uniqueness rule was running during every CollectionObject save, even when the catalognumber and collection were unchanged. These catalognumber checks seemed unnecessary and were likely the main cause of the slowdown during batch edits, data set commits, and bulk copy operations.
The most straightforward fix seemed to be skipping the component catalognumber duplicate check when a CO save does not change `catalognumber` or `collection`. I also added caching for the `CollectionPreferences` uniqueness setting during bulk create, bulk copy, and data set upload operations. One other thing that might add some speedup is scoping component duplicate lookups to just the current collection. I added that for now, but might remove it if it doesn't provide much speedup or if the number needs to remain unique between collections.

Still need to do some more testing and performance checking to confirm this solution completely solves the issue.
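A minimal sketch of that skip, assuming the rule can compare the incoming CO against its stored row (the helper name and field access are illustrative):

```python
# Hypothetical guard: only run the Component catalognumber duplicate
# check when the save actually changes catalognumber or collection.
def needs_catnum_check(co) -> bool:
    if co.id is None:
        return True  # brand-new CollectionObject: always check
    old = type(co).objects.only("catalognumber", "collection_id").get(id=co.id)
    return (
        old.catalognumber != co.catalognumber
        or old.collection_id != co.collection_id
    )
```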
Checklist
self-explanatory (or properly documented)
Testing instructions
#8055 (comment)
Test cat num uniqueness feat from #6708
Uniqueness: Preference to have catalog number unique across CO and Component table
"uniqueCatalogNumberAccrossComponentAndCO": {"behavior": {"uniqueness": true}}