feat(indexers): make persisting VectorStore to disk optional#174
feat(indexers): make persisting VectorStore to disk optional#174lukeroantreeONS wants to merge 5 commits into
Conversation
frayle-ons
left a comment
There was a problem hiding this comment.
Nice idea for an addition, this is something users would likely want to do but currently there's no way to do it.
Current logic implementation looks fine, but i think the new parameter/attribute persist_to_disk overlaps semantically with the existing output_dir param. And then the interaction between them might be confusing - i.e. what takes priority, specifying an output_dir or setting persist=False?
Also we're already suppressing the linting warning PLR0913 for too many arguments on this constructor.
Alternatively, could we instead add this functionality into the existing output_dir arg? We could have a special string value '_DO_NOT_PERSIST_' and if thats passed to the output_dir arg then the new changes in this PR are triggered. We could provide documentation on the default None behaviour and this new special string for its usecase.
I'm happy to rename it to
I think we should deviate from pylint's advice on this one, there are many examples of popular, easily understood classes which have many optional arguments in the same way (e.g. most of scikit-learn).
This is clarified within the PR's docstring updates, and will be clarified further in the documentation & example notebook updates.
I don't love the idea of introducing magic values for this. I think having a distinct parameter for it would be more explicit and make it less likely to be overlooked as a feature. |
Okay, final alternate suggestion would be can we just keep |
I would much prefer that approach, but it would be a breaking change from the current behaviour, so if we went for it we'd need to wait for the 2.0.0 release. My preference would be to go with the |
📌 Make persisting VectorStore to disk optional
✨ Summary
To cover use-cases where a created VectorStore will not be reloaded (e.g. experimenting, testing, running in-memory with no ability to write to disk, etc.), this PR adds an optional flag to the creation of a VectorStore,
persist_to_disk(default: True).📜 Changes Introduced
persist_to_disk=True. If False, skips writing the parquet & metadata files.✅ Checklist
terraform fmt&terraform validate)🔍 How to Test
general_workflow_demo.ipynbnotebook start to finish.testdata/VDB directory that was created in step 1.and re-run cells 1-5.
4) Confirm the
testdata/VDB is not created.5) Repeat with any variations of
output_dir&overwriteparameters (these should be followed ifpersist_to_disk=True, and ignored ifpersist_to_disk=False).Note: upon approval, I will update the documentation and example notebooks to describe this functionality & re-request approval.
Note 2: upon the
fsspecPR being merged, I will update the testing instructions to ensure that functionality is covered too.