Conversation
Throw CheckpointError with error_category as INVOCATION for 4xx InvalidParameterValueException errors related to payload size limit exceeded, as these errors are not retryable.
yaythomas
left a comment
There was a problem hiding this comment.
thank you so much for your contribution @andreas-baakind-adsk!
I've got a suggestion for streamlining this. It will result in behaviour change as follows:
4xx non-429 (other than Invalid Token) previously would have been EXECUTION errs that retried. They will now be EXECUTION errs that do NOT retry.
So any 4xx non-429 checkpoint error that previously caused Lambda to retry will now return FAILED immediately.
That includes:
- AccessDeniedException (403)
- ResourceNotFoundException (404)
- Any other InvalidParameterValueException with a different message
- Payload-too-large (the bug being fixed)
For 403 and 404 retrying was never going to help anyway, so permanent fail is more correct.
The only case where the old retry behaviour was possibly useful was payload-too-large with a non-deterministic callable, but in this case the expectation in the reference TS implementation is that the step is responsible for dealing with the result size before it gets to checkpointing.
CheckpointErrorCategory is not in the public API (init.py), so no public API break.
However, tests asserting on error_category values and is_retriable() will need updating:
- exceptions_test.py
- execution_test.py.
- Fix inverted is_retriable(): INVOCATION errors are retriable, EXECUTION errors are permanent failures (logic was backwards) - Classify payload size exceeded as EXECUTION (permanent), not INVOCATION — exceeding a size limit is not a transient failure - Simplify CheckpointError.from_exception() conditional logic - Update error-handling docs to reflect correct retry behavior - Update tests to match corrected classification semantics
|
thanks for the update @andreas-baakind-adsk, looking good! just some minor formatting/linting... tip: see https://github.com/aws/aws-durable-execution-sdk-python/blob/main/CONTRIBUTING.md#developer-workflow you can run the CI lint checks locally so you know it'll work by the time it gets to the CI :-) |
|
Thank you for the great feeedback, @yaythomas. I will for sure run the ci-checks locally next time to make sure there are no surprises after pushing my changes :) |
Issue #, if available:
#342
Description of changes:
Throw CheckpointError with error_category as INVOCATION for 4xx InvalidParameterValueException errors related to payload size limit exceeded, as these errors are not retryable.
The payload size exceeded constraint is deterministic: the exact same Lambda invocation, producing the exact same output, will fail with the exact same error on every retry. No amount of retrying resolves it — the data is too large by definition. AWS's own documentation classifies the equivalent Step Functions error (States.DataLimitExceeded) as terminal and permanent.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.