feat(arangodb): add ArangoDocumentStore and ArangoEmbeddingRetriever#3340
feat(arangodb): add ArangoDocumentStore and ArangoEmbeddingRetriever#3340SyedShahmeerAli12 wants to merge 10 commits into
Conversation
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
|
@bogdankostic All 26 unit tests pass and the integration test was verified locally against ArangoDB via Docker. |
bogdankostic
left a comment
There was a problem hiding this comment.
Thank you for the PR @SyedShahmeerAli12!
I left a couple of inline comments. Apart from those, this integration could be further improved by adding async counterpart to write_documents etc.m python-arango ships an async client. Moreover, we could add an AranfoBM25Retriever to support text search.
| *, | ||
| host: str = "http://localhost:8529", | ||
| database: str = "haystack", | ||
| username: str = "root", |
There was a problem hiding this comment.
Let's make username also a Secret that can be set via an environment variable.
| written += 1 | ||
| continue | ||
| else: | ||
| col.insert(arango_doc) |
There was a problem hiding this comment.
Could we make use here of col.import_bulk or col.insert_many which should be optimized for writing many document at once?
| col = cast(StandardCollection, self._col) | ||
| for doc_id in document_ids: | ||
| if col.has(doc_id): | ||
| col.delete(doc_id) |
| FOR doc IN {self.collection_name} | ||
| FILTER doc.embedding != null | ||
| {filter_clause} | ||
| LET score = COSINE_SIMILARITY(doc.embedding, @query_vec) |
There was a problem hiding this comment.
Should we use the approximate functions mentioned here? They are probably optimized for vector sector. Also, it would be nice to support dot product distance as well.
| reason="Export ARANGO_HOST and ARANGO_PASSWORD to run integration tests.", | ||
| ) | ||
| @pytest.mark.integration | ||
| class TestArangoDocumentStoreIntegration: |
There was a problem hiding this comment.
Let's use our DocumentStoreTestMixins for the integration tests
| readme = "README.md" | ||
| requires-python = ">=3.10" | ||
| license = "Apache-2.0" | ||
| keywords = [] |
There was a problem hiding this comment.
Let's add some relevant keywords.
| all = 'pytest {args:tests}' | ||
| unit-cov-retry = 'pytest --cov=haystack_integrations --reruns 3 --reruns-delay 30 -x -m "not integration" {args:tests}' | ||
| integration-cov-append-retry = 'pytest --cov=haystack_integrations --cov-append --reruns 3 --reruns-delay 30 -x -m "integration" {args:tests}' | ||
| types = "mypy -p haystack_integrations.document_stores.arangodb {args}" |
There was a problem hiding this comment.
Let's add the retriever here.
bogdankostic
left a comment
There was a problem hiding this comment.
Thank you for addressing the review comments @SyedShahmeerAli12! I just added two more minor comments.
Apart from that, the integration tests are currently skipped in the CI. It would be nice to have a docker compose file and run it in the CI, similar to how we do in the Opensearch integration.
| class TestArangoDocumentStoreIntegration( | ||
| CountDocumentsTest, | ||
| DeleteDocumentsTest, | ||
| FilterDocumentsTest, | ||
| WriteDocumentsTest, |
There was a problem hiding this comment.
You can use DocumentStoreBaseTests here which includes all of these mixins.
|
Switched to APPROX_NEAR_COSINE, APPROX_NEAR_INNER_PRODUCT, APPROX_NEAR_L2 as suggested - Replaced individual mixins with DocumentStoreBaseTests CI / docker:
|
bogdankostic
left a comment
There was a problem hiding this comment.
Thanks @SyedShahmeerAli12, this PR is almost good to go, I just have some last minor cosmetic comments to further improve this PR.
|
Addressed all comments updated the retriever docstring to reflect all three similarity functions, removed |
Related Issues
Proposed Changes:
Adds a new
arangodb-haystackintegration with:ArangoDocumentStorefull Haystack DocumentStore backed by ArangoDB, supportingwrite_documents,filter_documents,delete_documents,count_documents, and vector similarity retrieval via AQLCOSINE_SIMILARITY(requires ArangoDB 3.12+)ArangoEmbeddingRetriever@componentwrapper that calls_embedding_retrievalon the document storefilters.pytranslates Haystack metadata filter dicts to AQL FILTER expressionspython-arangoclient; password stored as a HaystackSecretto_dict/from_dictHow did you test it?
docker run -p 8529:8529 -e ARANGO_ROOT_PASSWORD=testpw arangodb:latest): writes 2 documents with embeddings, retrieves by cosine similarity, deletes one, asserts countNotes for the reviewer
COSINE_SIMILARITY(doc.embedding, @query_vec)requires ArangoDB 3.12+; documented in the docstringpython-arango>=8.0.0added as the only new dependencyARANGO_HOSTandARANGO_PASSWORDenv varsChecklist
feat: