-
Notifications
You must be signed in to change notification settings - Fork 62
feat: Client side metrics support for mutateRows #1638
Changes from 14 commits
fe5b621
f74cee8
7b45ceb
bf9a52f
20167af
21f96e9
689efad
9d40bf7
2bb1a59
079a607
168e132
cdbdc20
ee12cea
d56ddf3
9b6ed02
bc70978
76e50e7
5022a7a
9aa6ce1
04e4e77
524eb47
285651f
42366c9
1514d0c
c1f273f
3c05e26
e4d2c28
33f449a
926c480
a82647b
4d74ab4
a0de09b
bbea829
9b13fba
f83727b
1594239
eb3fefe
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -342,6 +342,12 @@ Please use the format 'prezzy' or '${instance.name}/tables/prezzy'.`); | |
| [], | ||
| ); | ||
|
|
||
| const metricsCollector = | ||
| this.bigtable._metricsConfigManager.createOperation( | ||
| MethodName.MUTATE_ROWS, | ||
| StreamingState.STREAMING, | ||
| this, | ||
| ); | ||
| /* | ||
| The following line of code sets the timeout if it was provided while | ||
| creating the client. This will be used to determine if the client should | ||
|
|
@@ -388,6 +394,7 @@ Please use the format 'prezzy' or '${instance.name}/tables/prezzy'.`); | |
| // Return if the error happened before a request was made | ||
| if (numRequestsMade === 0) { | ||
| callback(err); | ||
| metricsCollector.onOperationComplete(err ? err.code : 0); | ||
| return; | ||
| } | ||
|
|
||
|
|
@@ -399,9 +406,11 @@ Please use the format 'prezzy' or '${instance.name}/tables/prezzy'.`); | |
| options.gaxOptions?.retry?.backoffSettings || | ||
| DEFAULT_BACKOFF_SETTINGS; | ||
| const nextDelay = getNextDelay(numRequestsMade, backOffSettings); | ||
| metricsCollector.onAttemptComplete(err ? err.code : 0); | ||
| setTimeout(makeNextBatchRequest, nextDelay); | ||
| return; | ||
| } | ||
| metricsCollector.onOperationComplete(err ? err.code : 0); | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why is the operation complete here?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Because in this case the error is not retryable so we expect the operation to end. ie. the setTimeout call does the request again. |
||
|
|
||
| // If there's no more pending mutations, set the error | ||
| // to null | ||
|
|
@@ -431,7 +440,9 @@ Please use the format 'prezzy' or '${instance.name}/tables/prezzy'.`); | |
| callback(err); | ||
| }; | ||
|
|
||
| metricsCollector.onOperationStart(); | ||
| const makeNextBatchRequest = () => { | ||
| metricsCollector.onAttemptStart(); | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. these scattered onAttemptStart and onAttemptComplete lines make me a bit nervous. Especially since we have to repeat them across rpcs. Is there a way to encapsulate these better? Maybe wrap the stream object itself? And wrap the callbacks? Let me know if this makes sense to you. I can try to come up with some ideas if needed
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Some ideas:
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks, those changes look a lot better. I think there are still improvements to be had, but this is a great start |
||
| const entryBatch = entries.filter((entry: Entry, index: number) => { | ||
| return pendingEntryIndices.has(index); | ||
| }); | ||
|
|
@@ -469,14 +480,16 @@ Please use the format 'prezzy' or '${instance.name}/tables/prezzy'.`); | |
| options.gaxOptions, | ||
| ); | ||
|
|
||
| this.bigtable | ||
| .request<google.bigtable.v2.MutateRowsResponse>({ | ||
| const requestStream = | ||
| this.bigtable.request<google.bigtable.v2.MutateRowsResponse>({ | ||
| client: 'BigtableClient', | ||
| method: 'mutateRows', | ||
| reqOpts, | ||
| gaxOpts: options.gaxOptions, | ||
| retryOpts, | ||
| }) | ||
| }); | ||
| metricsCollector.handleStatusAndMetadata(requestStream); | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It might be nice if this returned the stream, so we could use it as more of a wrapper that could be applied at stream creation time: Then we could maybe rename
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure. I don't see much of a downside for doing that.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Backlogged here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I really do want to make sure we are using wrappers as much as we can here, because a bunch of one-off lines for metric collection can get ugly fast I'm not going to push too hard on this specific instance, since the one extra line doesn't add too much. But I really don't want brittle tests to stop us from merging code improvements, so keep that in mind so we can avoid this kind of problem in the future There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Before we backlog: can you share the diff that was giving you trouble? That error message seems surprising, so I just want to make sure we're on the same page
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Try 359913994-mutate-rows-with-diff branch. f4c7162 is the commit that contains the change for mutateRows. You will be able to see the errors if you run this branch There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe we should rename
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes. This is wrapRequest now. |
||
| requestStream | ||
| .on('error', (err: ServiceError) => { | ||
| onBatchResponse(err); | ||
| }) | ||
|
|
@@ -499,6 +512,7 @@ Please use the format 'prezzy' or '${instance.name}/tables/prezzy'.`); | |
| (errorDetails as any).entry = originalEntry; | ||
| mutationErrorsByEntryIndex.set(originalEntriesIndex, errorDetails); | ||
| }); | ||
| metricsCollector.onResponse(); | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. couldn't this be handled by
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes. Note that this change has to be applied to readRows as well so you'll see the change there too. |
||
| }) | ||
| .on('end', onBatchResponse); | ||
| numRequestsMade++; | ||
|
|
||

There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should know that err exists at this point, right? It seems like error code 0 should never happen
If we need a null check, I'd say we should put it beside
isRetryable(err). But I assume that's already part of isRetryableThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. In fact if the error doesn't exist then most of the time this if block will be entered because
isRetryablewill be true. See the comment:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I know this is out of scope of this PR, but some comments here would be very useful. A docstring for isRetryable, and/or a comment at the start of this block describing what makes it retryable would help a lot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more important question though is: Do we record the attempt as successful if some mutations failed? or do we pass through the error code of a failed mutation?
In Python, it looks like I was attaching the status code of the exception at the top of the error list. I can't remember if that came from the product team or not though. Have you discussed this with Mattie? Or looked at Java? We should verify this before moving forward
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gemini says we record OK for PartialFailureError results in java. I'll verify that in the code next and ask Bigtable, but I am not too familiar with that codebase so may take time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we send back the OK code here so for the time being I'll change this to pass in code 0.
@mutianf In case of a PartialFailureError, what error code do we want to record for client side metrics on a completed attempt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And just to provide more background, I verified that stream ending (no error) corresponds to the case where we would expect a PartialFailureError this test. So in this case, if we are inside this if block and the err is null then we can conclude that we are retrying because some of the indicies are pending.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussing with Mattie, she said Java records a status of success when a mutation fails, but the rpc succeeds. So we can do the same here