feat: Cloud storage compatibility with build, save and from_filespace vectorstore operations #170
frayle-ons wants to merge 1 commit into
|
This looks great, and works well on initial testing! Couple of thoughts:
|
Thanks for reviewing. fsspec supports many different backends, such as gcsfs, but also S3 storage on AWS, Git file systems, Google Drive and seemingly many more, which means there might be many different backend installations that users need (see https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations ). With the current implementation, users who want to work with s3:// buckets can do so. If we want to keep the wider range of compatible filesystems provided by fsspec, then I think the informative error would be the way to go, as otherwise we might need to add a lot of (optional) project dependencies. We could consider adding a new subclass of ClassifaiError to specifically capture this missing fsspec dependency and gracefully explain that the user needs to install an extra library. Otherwise, we could limit the use of fsspec to just the local file system or gs:// and then go the optional install route with classifai[gcp] |
You're right - definitely keep the informative error to cover other cloud systems 👍. When we know more about requirements for other cloud solutions, we could update those additional notes and/or add extra optional dependency categories. E.g. something like:
try:
    in_fs, in_path = fsspec.core.url_to_fs(folder_path)
except Exception as e:
    extra_info = ''
    if folder_path.startswith('gs://'):
        extra_info = ' Note: install classifai[gcp] to include requirements for interacting with Google Cloud Platform'
    raise ConfigurationError(
        f"Failed to read input directory with file loader.{extra_info}",
        context={
            "folder_path": folder_path,
            "cause": str(e),
            "cause_type": type(e).__name__,
        },
    ) from e
|
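The hint in the snippet above could be generalised so that it depends on the URL protocol rather than on a single gs:// check. This is a minimal sketch only: ClassifaiError here is a local stand-in for the project's base error, and MissingBackendError, INSTALL_HINTS, protocol_of and resolve_path are hypothetical names, as is the s3 hint (only classifai[gcp] is mentioned in this thread). The resolver is passed in so the sketch runs without fsspec; in practice it would be fsspec.core.url_to_fs, which, as far as I'm aware, raises ImportError when a backend package such as gcsfs is missing.

```python
class ClassifaiError(Exception):
    """Stand-in for the project's base error class (assumed, not the real one)."""


class MissingBackendError(ClassifaiError):
    """Hypothetical subclass for a missing fsspec backend package."""


# Hypothetical hints; only the classifai[gcp] extra is mentioned in this thread.
INSTALL_HINTS = {
    "gs": "install classifai[gcp] to include requirements for interacting with Google Cloud Platform",
    "s3": "install s3fs to interact with AWS S3 buckets",
}


def protocol_of(path: str) -> str:
    """Return the URL protocol, or 'file' for plain local paths."""
    return path.split("://", 1)[0] if "://" in path else "file"


def resolve_path(path: str, resolver):
    """Resolve path with resolver, adding an install hint if a backend is missing."""
    try:
        return resolver(path)
    except ImportError as e:
        hint = INSTALL_HINTS.get(protocol_of(path))
        extra_info = f" Note: {hint}" if hint else ""
        raise MissingBackendError(
            f"Failed to set up a filesystem for '{path}'.{extra_info}"
        ) from e
```

Catching ImportError specifically, rather than a bare Exception, keeps the install hint from being attached to unrelated failures such as authentication errors.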
✨ Summary
These changes introduce the use of the fsspec library to enable cloud storage compatibility for the vectorstore build, save (output_dir) and from_filespace operations.
This has been validated to work with Google Cloud Storage URIs (gs://bucket-name/foldername/filename.csv) but should also extend to other cloud spaces supported by the fsspec library, such as AWS storage, which uses the s3 protocol.
Finally, the existing functionality of operating on local filespace is unchanged. Users can also mix and match protocol types, i.e. reading a csv from cloud and saving to a local directory, or reading from local and saving to cloud, as well as working fully cloud native or fully local. When no output directory is specified, the system attempts to save to a local folder with the same name as the input file.
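As a rough illustration of the mix-and-match behaviour described above, the sketch below resolves a source and a destination independently with fsspec's url_to_fs and copies a small CSV between them. It is not the implementation in this PR: copy_text and demo are hypothetical helpers, and fsspec's in-memory memory:// filesystem stands in for a cloud bucket so that the example runs without credentials or gcsfs.

```python
import os
import tempfile

from fsspec.core import url_to_fs


def copy_text(src_url: str, dst_url: str) -> None:
    """Copy a text file between any two locations fsspec can resolve."""
    # Each URL is resolved on its own, so the two filesystems may differ.
    src_fs, src_path = url_to_fs(src_url)
    dst_fs, dst_path = url_to_fs(dst_url)
    with src_fs.open(src_path, "r") as src:
        text = src.read()
    with dst_fs.open(dst_path, "w") as dst:
        dst.write(text)


def demo() -> str:
    """Write a CSV locally, copy it to memory://, and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        local_csv = os.path.join(tmp, "input.csv")
        with open(local_csv, "w") as f:
            f.write("id,text\n1,hello\n")
        # memory:// stands in for gs:// or s3:// here (assumption for the demo).
        remote_csv = "memory://demo-bucket/output.csv"
        copy_text(local_csv, remote_csv)
        fs, path = url_to_fs(remote_csv)
        with fs.open(path, "r") as f:
            return f.read()
```

Swapping the memory:// URL for a gs:// or s3:// one should exercise the same code path, provided the matching backend package is installed and authenticated.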
The primary use of fsspec in these changes is the function fsspec.core.url_to_fs(file/folder/path). This function analyses the specified path, detects which protocol to use, and sets up a filesystem object to interact with it [ local, gs, s3, etc ]. It also returns the path within that filesystem that should be operated on.
Example extract from the from_filespace() class method:
📜 Changes Introduced
vectorstore init method to work with fsspec to handle filespace protocol detections and operations
vectorstore from_filespace method to work with loading from cloud
✅ Checklist
All precommit checks passed.
🔍 How to Test
Using the following generic code that sets up a VectorStore, the tester can trial the different ways of loading and saving vector stores. The user will need to install an extra dependency, gcsfs; if it is not installed before attempting to use Google buckets, this is currently caught and reported through ClassifaiErrors.
In the above code I've included the local path for test data that's available in this repo, as well as an example path that could be used to load the data from a bucket. To test this you would need to upload the data to a test bucket in a Gcloud environment and authenticate with the correct project.
Additionally, I've shown how the output dir can be either local or remote under the same principles as the input file. I would recommend trying several combinations: local in -> cloud out, cloud in -> cloud out, cloud in -> local out, etc.
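To make sure no combination is missed, the input/output pairs could be enumerated rather than picked by hand. A small sketch, where both local paths and the output bucket folder are placeholder values (only the gs:// input URI pattern comes from this description) and io_combinations is a hypothetical helper:

```python
from itertools import product

# Placeholder locations; substitute real test data and bucket paths.
INPUTS = {
    "local": "data/testdata.csv",
    "cloud": "gs://bucket-name/foldername/filename.csv",
}
OUTPUTS = {
    "local": "my_vectorstore",
    "cloud": "gs://bucket-name/my_vectorstore",
}


def io_combinations():
    """Return every (label, input_path, output_dir) pairing to try."""
    return [
        (f"{i_kind} in -> {o_kind} out", INPUTS[i_kind], OUTPUTS[o_kind])
        for i_kind, o_kind in product(INPUTS, OUTPUTS)
    ]


for label, file_name, output_dir in io_combinations():
    print(f"{label}: {file_name} -> {output_dir}")
```

Each pairing would then be fed to the VectorStore setup code used for testing, one run per combination.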
And also try to test some error/edge cases:
zl23q for example
Also, it would be good to test the from_filespace method to load in a vectorstore that was saved to cloud, code snippet: