add tool to check for out-of-sync data between DB and production-definitions blob store#84
Open
Nice, this looks great! Thanks a ton for cleaning this script up.
qtomlinson
approved these changes
Jul 5, 2024
Collaborator
qtomlinson
left a comment
Thanks for putting this together!
# * MONGO_CONNECTION_STRING: The connection string to the MongoDB database
# * START_MONTH: The first month to include in the query
# * END_MONTH: The last month to include in the query
# * OUTPUT_FILE: The file to write the output to
Collaborator
BASE_AZURE_BLOB_URL seems to be mandatory as well
    print(f"Collection '{COLLECTION_NAME}' not found.")
else:
    print(f"Using collection: '{COLLECTION_NAME}'.")
Collaborator
nit: logging blob container name as well?
This performs a check one month at a time, hardcoded for all months in 2024. The output file is hardcoded to "2024-invalid_data.json".
* allows for a check of a single week
* continues to support processing a month at a time
* expands support for controlling function through .env file
* provides example .env file
482e8a2 to
4165126
Batch processing:
* updates just the declared license in the DB documents using `collection.bulk_write()`
* updates definitions using service API `POST /definitions?force=true`

_NOTE: Updating the DB makes the fix of the declared license immediately available. When the `POST /definitions` request completes, the full DB document will be updated to be in sync with the blob definition._

Additional changes:
* moves global variable definitions based on .env to the initialize() function
* adds DRYRUN flag to check what would run and how many records would be evaluated
* adds estimated time to complete
* adds script and function level documentation
* includes timestamps to make it easier to estimate how long it will take to complete a run
* generates filename based on date range and offset to avoid overwriting output files

_NOTE: Azure only supports fetching one blob at a time. Not able to optimize that part of the code._

_NOTE: Batch size of 500 was selected because that is the max number of coordinates supported in calls to service API `POST /definitions`._
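For readers following the batching notes in this commit message, here is a minimal sketch of how coordinates might be split into groups of at most 500 and how a rough time estimate could be derived. It is not the script's actual code: the names `batches`, `estimate_seconds`, and `seconds_per_batch` are hypothetical, and only the batch size of 500 comes from the note above.

```python
from typing import Iterator, Sequence

# 500 is the limit quoted above for coordinates per POST /definitions call.
BATCH_SIZE = 500


def batches(coordinates: Sequence[str], size: int = BATCH_SIZE) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` coordinates (hypothetical helper)."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(coordinates), size):
        yield coordinates[start:start + size]


def estimate_seconds(total: int, seconds_per_batch: float, size: int = BATCH_SIZE) -> float:
    """Rough time estimate: number of batches (rounded up) times an assumed per-batch cost."""
    batch_count = -(-total // size)  # ceiling division without importing math
    return batch_count * seconds_per_batch
```

In a sketch like this, each yielded slice would feed one `collection.bulk_write()` call and one `POST /definitions?force=true` request; a dry run could report `estimate_seconds(...)` without sending anything.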
8259871 to
392f8f6
Co-authored-by: @ajhenry
Description
The new tool, analyze_data_synchronization, checks for out-of-sync data between the database and the production-definitions blob store. The tool can be run for multiple months, one month at a time, or for a custom date range. It outputs a JSON file with summary stats and the invalid data. The tool is controlled through a .env file, which can be customized to specify the start and end dates, the maximum number of documents to process, and the output file name.
See README for examples.
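As an illustration of the .env-driven configuration, and of the review comment that `BASE_AZURE_BLOB_URL` appears to be mandatory, a startup check along these lines could report missing settings before any work begins. This is a sketch only: the helper `missing_settings` is hypothetical, and the choice of which variables count as required is an assumption, not taken from the script.

```python
import os
from typing import List, Mapping, Sequence

# Assumption: these two are treated as mandatory for illustration only.
# The variable names come from the script header and the review comment.
REQUIRED: Sequence[str] = ("MONGO_CONNECTION_STRING", "BASE_AZURE_BLOB_URL")


def missing_settings(env: Mapping[str, str], required: Sequence[str] = REQUIRED) -> List[str]:
    """Return the names of required settings that are absent or empty (hypothetical helper)."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # os.environ reflects the .env values once they have been loaded into the process.
    missing = missing_settings(os.environ)
    if missing:
        print("Missing required settings: " + ", ".join(missing))
```

Taking the environment as a plain mapping keeps the check easy to exercise with a dictionary in tests.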
Minor fix
The README includes a fix to rename `production-snapshots` to `changes-notifications`. The switch to `changes-notifications` has been in production use since January 2024.