Skip to content

feat(indexers): make persisting VectorStore to disk optional#174

Open
lukeroantreeONS wants to merge 5 commits into
mainfrom
173-make-persisting-a-vectorstore-optional
Open

feat(indexers): make persisting VectorStore to disk optional#174
lukeroantreeONS wants to merge 5 commits into
mainfrom
173-make-persisting-a-vectorstore-optional

Conversation

@lukeroantreeONS
Copy link
Copy Markdown
Contributor

@lukeroantreeONS lukeroantreeONS commented May 19, 2026

📌 Make persisting VectorStore to disk optional

✨ Summary

To cover use-cases where a created VectorStore will not be reloaded (e.g. experimenting, testing, running in-memory with no ability to write to disk, etc.), this PR adds an optional flag to the creation of a VectorStore, persist_to_disk (default: True).

📜 Changes Introduced

  • feat(indexers): optional extra parameter persist_to_disk=True. If False, skips writing the parquet & metadata files.

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • Code passes linting with Ruff
  • Security checks pass using Bandit
  • API and Unit tests are written and pass using pytest
  • Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • DocStrings follow Google-style and are added as per Pylint recommendations
  • Documentation has been updated if needed

🔍 How to Test

  1. Run the existing general_workflow_demo.ipynb notebook start to finish.
  2. Delete the testdata/ VDB directory that was created in step 1.
  3. Change Cell 5 to the following:
from classifai.indexers import VectorStore
my_vector_store = VectorStore(
    file_name="data/testdata.csv",
    data_type="csv",
    vectoriser=vectoriser,
    persist_to_disk=False
)

and re-run cells 1-5.
4) Confirm the testdata/ VDB is not created.
5) Repeat with any variations of output_dir & overwrite parameters (these should be followed if persist_to_disk=True, and ignored if persist_to_disk=False).


Note: upon approval, I will update the documentation and example notebooks to describe this functionality & re-request approval.
Note 2: upon the fsspec PR being merged, I will update the testing instructions to ensure that functionality is covered too.

@lukeroantreeONS lukeroantreeONS requested a review from a team as a code owner May 19, 2026 15:00
@lukeroantreeONS lukeroantreeONS linked an issue May 19, 2026 that may be closed by this pull request
@github-actions github-actions Bot added the enhancement New feature or request label May 19, 2026
Copy link
Copy Markdown
Contributor

@frayle-ons frayle-ons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice idea for an addition, this is something users would likely want to do but currently there's no way to do it.

Current logic implementation looks fine, but i think the new parameter/attribute persist_to_disk overlaps semantically with the existing output_dir param. And then the interaction between them might be confusing - i.e. what takes priority, specifying an output_dir or setting persist=False?

Also we're already suppressing the linting warning PLR0913 for too many arguments on this constructor.

Alternatively, could we instead add this functionality into the existing output_dir arg? We could have a special string value '_DO_NOT_PERSIST_' and if thats passed to the output_dir arg then the new changes in this PR are triggered. We could provide documentation on the default None behaviour and this new special string for its usecase.

@lukeroantreeONS
Copy link
Copy Markdown
Contributor Author

i think the new parameter/attribute persist_to_disk overlaps semantically with the existing output_dir param

I'm happy to rename it to skip_save and reverse the boolean logic, to make it semantically distinct 👍

Also we're already suppressing the linting warning PLR0913 for too many arguments on this constructor.

I think we should deviate from pylint's advice on this one, there are many examples of popular, easily understood classes which have many optional arguments in the same way (e.g. most of scikit-learn).

And then the interaction between them might be confusing - i.e. what takes priority, specifying an output_dir or setting persist=False?

This is clarified within the PR's docstring updates, and will be clarified further in the documentation & example notebook updates.

Alternatively, could we instead add this functionality into the existing output_dir arg? We could have a special string value 'DO_NOT_PERSIST' and if thats passed to the output_dir arg then the new changes in this PR are triggered. We could provide documentation on the default None behaviour and this new special string for its usecase.

I don't love the idea of introducing magic values for this. I think having a distinct parameter for it would be more explicit and make it less likely to be overlooked as a feature.

@frayle-ons
Copy link
Copy Markdown
Contributor

i think the new parameter/attribute persist_to_disk overlaps semantically with the existing output_dir param

I'm happy to rename it to skip_save and reverse the boolean logic, to make it semantically distinct 👍

Also we're already suppressing the linting warning PLR0913 for too many arguments on this constructor.

I think we should deviate from pylint's advice on this one, there are many examples of popular, easily understood classes which have many optional arguments in the same way (e.g. most of scikit-learn).

And then the interaction between them might be confusing - i.e. what takes priority, specifying an output_dir or setting persist=False?

This is clarified within the PR's docstring updates, and will be clarified further in the documentation & example notebook updates.

Alternatively, could we instead add this functionality into the existing output_dir arg? We could have a special string value 'DO_NOT_PERSIST' and if thats passed to the output_dir arg then the new changes in this PR are triggered. We could provide documentation on the default None behaviour and this new special string for its usecase.

I don't love the idea of introducing magic values for this. I think having a distinct parameter for it would be more explicit and make it less likely to be overlooked as a feature.

Okay, final alternate suggestion would be can we just keep output_dir with no extra parameters, and change the None behaviour so that if an output_dir isn't provided, then it just isn't persisted to disk? If you disagree, then happy to go ahead with the above skip_save functionality.

@lukeroantreeONS
Copy link
Copy Markdown
Contributor Author

lukeroantreeONS commented May 20, 2026

Okay, final alternate suggestion would be can we just keep output_dir with no extra parameters, and change the None behaviour so that if an output_dir isn't provided, then it just isn't persisted to disk? If you disagree, then happy to go ahead with the above skip_save functionality.

I would much prefer that approach, but it would be a breaking change from the current behaviour, so if we went for it we'd need to wait for the 2.0.0 release.

My preference would be to go with the skip_save approach for the moment to get this functionality out, and make a plan to refactor output_dir as you described for the 2.0.0 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make persisting a VectorStore optional

2 participants