feat: Add S3 support for artifacts #29
Conversation
Summary of Changes

Hello @wmsnp, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the artifact storage capabilities by integrating S3-compatible object storage. It provides a robust, self-hosted option for managing application artifacts, ensuring data persistence and version control. The new service is designed for high performance through asynchronous operations and includes comprehensive testing to guarantee reliability.
Code Review
This pull request introduces an S3ArtifactService to support artifact storage on S3-compatible services. The implementation is robust, featuring asynchronous operations and atomic versioning with a retry mechanism. The accompanying unit tests are comprehensive and effectively mock the S3 interactions. I've identified a few areas for improvement, including a performance optimization for listing artifact versions, adding a test case for the version conflict retry logic, and some minor code cleanup in the project dependencies and tests. Overall, this is a solid contribution.
…equests and add version conflicts test for save_artifact
Can we get this PR or the other one referenced merged? Is there anything I can do to help speed this along?
Hey @wmsnp — I've opened #115 which supersedes both this PR and my earlier #36. Your implementation here was a big influence — I adopted both the
Thanks for pioneering the async approach — it clearly belongs in the final version. 🙏
Thanks — really appreciate that. One thing I found later, though, is that not all S3-compatible implementations support
@gemini-cli /review
🤖 Hi @DeanChensj, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
Thanks for the contribution! The S3ArtifactService implementation is solid and follows the established patterns. I've left a few comments regarding:
- Performance: The `list_artifact_versions` method performs a `head_object` call for every version, which could be a bottleneck.
- Metadata Limits: S3 has a 2 KB limit on metadata that we should be aware of when flattening custom metadata.
- Robustness: A finite default for retries in `save_artifact` might be safer than infinite.
Overall, great work!
```python
async def _client(self):
    session = await self._session()
    async with session.client(service_name="s3", **self.aws_configs) as s3:
        yield s3
```
S3 metadata has a total size limit of 2 KB (including keys and values). Since custom_metadata is flattened into JSON strings here, large metadata dictionaries might cause put_object to fail. It might be worth adding a check or a note about this limit.
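One way to surface this failure mode early, sketched here with hypothetical names (the PR's actual flattening helper isn't shown in this thread):

```python
import json

# S3 caps user-defined object metadata at 2 KB total (keys plus values,
# measured as UTF-8 bytes); exceeding it causes put_object to fail.
S3_METADATA_LIMIT_BYTES = 2 * 1024

def flatten_custom_metadata(custom_metadata: dict) -> dict:
    """Flatten values to JSON strings and validate the total size up front."""
    flattened = {key: json.dumps(value) for key, value in custom_metadata.items()}
    total = sum(
        len(k.encode("utf-8")) + len(v.encode("utf-8"))
        for k, v in flattened.items()
    )
    if total > S3_METADATA_LIMIT_BYTES:
        raise ValueError(
            f"Flattened metadata is {total} bytes; S3 allows at most "
            f"{S3_METADATA_LIMIT_BYTES} bytes of user-defined metadata."
        )
    return flattened
```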
```python
elif artifact.file_data:
    raise NotImplementedError(
        "Saving artifact with file_data is not supported yet in"
        " S3ArtifactService."
    )
```
With save_artifact_max_retries set to -1 (infinite), this loop could theoretically run forever if there's a consistent race condition or a logic error in version calculation. A high but finite default might be safer.
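A bounded loop keeps the same optimistic-concurrency behavior while guaranteeing termination. A sketch under assumed names (`VersionConflictError` and the single-attempt callable are placeholders, not the PR's actual API):

```python
class VersionConflictError(Exception):
    """Raised when another writer claims the version we computed."""

async def save_with_bounded_retries(attempt_save, max_retries: int = 5):
    """Run one atomic save attempt up to `max_retries` times.

    `attempt_save` is an async callable that computes the next version,
    performs a conditional write, and raises VersionConflictError on a race.
    """
    last_error = None
    for _ in range(max_retries):
        try:
            return await attempt_save()
        except VersionConflictError as err:
            last_error = err  # another writer won; recompute and try again
    raise RuntimeError(
        f"save_artifact gave up after {max_retries} version conflicts"
    ) from last_error
```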
```python
metadata = head.get("Metadata", {})
# ...
canonical_uri = f"s3://{self.bucket_name}/{obj['Key']}"
```
Calling head_object for every version in a loop will be very slow if an artifact has many versions (O(N) network calls). S3 doesn't return custom metadata in list_objects_v2, so this might be necessary if metadata is required, but we should consider if there's a way to cache or avoid this for large version sets.
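Even if the per-version `head_object` can't be avoided, fanning the calls out concurrently would hide most of the latency. A sketch with illustrative names, assuming an already-open async S3 client:

```python
import asyncio

async def head_all_versions(s3, bucket, keys, max_concurrency=16):
    """Issue head_object calls concurrently with a bounded fan-out.

    Still O(N) requests overall, but wall-clock time drops from N round
    trips to roughly N / max_concurrency.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def head_one(key):
        async with semaphore:
            return await s3.head_object(Bucket=bucket, Key=key)

    return await asyncio.gather(*(head_one(key) for key in keys))
```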
Link to Issue or Description of Change
Description
Introduce `S3ArtifactService` to provide a self-hosted artifact storage solution.

Solution

- `aioboto3` for asynchronous access to S3-compatible object storage

Testing Plan
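The testing plan isn't spelled out in this excerpt. As a hedged illustration of what a local end-to-end check could look like, here is a smoke test against an S3-compatible endpoint; the MinIO address, credentials, and bucket name are all assumptions, not part of the PR:

```python
import asyncio

import aioboto3

async def smoke_test():
    session = aioboto3.Session(
        aws_access_key_id="minioadmin",      # assumption: default MinIO creds
        aws_secret_access_key="minioadmin",
    )
    # endpoint_url points at a local S3-compatible store instead of AWS.
    async with session.client("s3", endpoint_url="http://localhost:9000") as s3:
        await s3.create_bucket(Bucket="artifacts")
        await s3.put_object(Bucket="artifacts", Key="demo/v0", Body=b"hello")
        head = await s3.head_object(Bucket="artifacts", Key="demo/v0")
        assert head["ContentLength"] == 5

asyncio.run(smoke_test())
```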