
fix: exceeds payload size limit as invocation error #341

Merged
yaythomas merged 3 commits into aws:main from andreas-baakind-adsk:fix/checkpoint-payload-size-handling
Apr 11, 2026

@andreas-baakind-adsk andreas-baakind-adsk commented Apr 9, 2026

Issue #, if available:
#342

Description of changes:
Throw CheckpointError with error_category as INVOCATION for 4xx InvalidParameterValueException errors related to payload size limit exceeded, as these errors are not retryable.

The payload size exceeded constraint is deterministic: the exact same Lambda invocation, producing the exact same output, will fail with the exact same error on every retry. No amount of retrying resolves it — the data is too large by definition. AWS's own documentation classifies the equivalent Step Functions error (States.DataLimitExceeded) as terminal and permanent.
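A minimal sketch of how such an error might be recognised, for illustration only. It assumes the failure arrives in the `response` dictionary shape that botocore attaches to a `ClientError` (`Error.Code`, `Error.Message`, `ResponseMetadata.HTTPStatusCode`). The helper name `is_payload_size_error` and the marker text `PAYLOAD_SIZE_MARKER` are hypothetical; the actual service message is not quoted in this PR.

```python
from typing import Any, Mapping

# Assumed marker text: the real service message is not quoted in this PR.
PAYLOAD_SIZE_MARKER = "payload size"


def is_payload_size_error(response: Mapping[str, Any]) -> bool:
    """Return True for a 4xx InvalidParameterValueException about payload size.

    `response` is expected to look like botocore's ClientError.response.
    """
    error = response.get("Error") or {}
    metadata = response.get("ResponseMetadata") or {}
    status = metadata.get("HTTPStatusCode")
    return (
        isinstance(status, int)
        and 400 <= status < 500
        and error.get("Code") == "InvalidParameterValueException"
        and PAYLOAD_SIZE_MARKER in str(error.get("Message", "")).lower()
    )
```

Because the check depends only on the response contents, the same oversized output produces the same result on every attempt, which is why retrying cannot help.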

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


@yaythomas yaythomas left a comment

thank you so much for your contribution @andreas-baakind-adsk!

I've got a suggestion for streamlining this. It will result in behaviour change as follows:

4xx non-429 (other than Invalid Token) previously would have been EXECUTION errs that retried. They will now be EXECUTION errs that do NOT retry.

So any 4xx non-429 checkpoint error that previously caused Lambda to retry will now return FAILED immediately.

That includes:

  • AccessDeniedException (403)
  • ResourceNotFoundException (404)
  • Any other InvalidParameterValueException with a different message
  • Payload-too-large (the bug being fixed)

For 403 and 404 retrying was never going to help anyway, so permanent fail is more correct.

The only case where the old retry behaviour was possibly useful was payload-too-large with a non-deterministic callable, but in this case the expectation in the reference TS implementation is that the step is responsible for dealing with the result size before it gets to checkpointing.
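A rough sketch of the behaviour described above, not the SDK's actual code. `Category`, `SketchCheckpointError` and `classify` are hypothetical stand-ins for `CheckpointErrorCategory`, `CheckpointError` and `CheckpointError.from_exception()`. Two things are assumptions here: that everything outside the "4xx non-429" bucket (429, 5xx, unknown status) stays INVOCATION and is retried, and that the invalid-token case is identified by the message prefix `Invalid Checkpoint Token`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    INVOCATION = "INVOCATION"  # assumed: Lambda retries the invocation
    EXECUTION = "EXECUTION"    # permanent: the execution returns FAILED


@dataclass(frozen=True)
class SketchCheckpointError:
    status_code: Optional[int]
    code: str
    message: str
    category: Category

    def is_retriable(self) -> bool:
        # Only INVOCATION errors are retried in this sketch.
        return self.category is Category.INVOCATION


def classify(status_code: Optional[int], code: str, message: str) -> SketchCheckpointError:
    """Map a checkpoint failure to a category as described in the review."""
    # Assumed way of spotting the invalid-token exception.
    invalid_token = (
        code == "InvalidParameterValueException"
        and message.startswith("Invalid Checkpoint Token")
    )
    is_4xx = status_code is not None and 400 <= status_code < 500
    if is_4xx and status_code != 429 and not invalid_token:
        category = Category.EXECUTION
    else:
        category = Category.INVOCATION
    return SketchCheckpointError(status_code, code, message, category)
```

Under these assumptions, `classify(403, "AccessDeniedException", "denied").is_retriable()` is `False`, while a 429 response remains retriable.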

CheckpointErrorCategory is not in the public API (__init__.py), so no public API break.

However, tests asserting on error_category values and is_retriable() will need updating:

  1. exceptions_test.py
  2. execution_test.py.

@yaythomas yaythomas moved this from Backlog to In review in aws-durable-execution Apr 10, 2026
@yaythomas yaythomas self-assigned this Apr 10, 2026
@yaythomas yaythomas added bug Something isn't working and removed bug Something isn't working labels Apr 10, 2026
- Fix inverted is_retriable(): INVOCATION errors are retriable,
  EXECUTION errors are permanent failures (logic was backwards)
- Classify payload size exceeded as EXECUTION (permanent), not
  INVOCATION — exceeding a size limit is not a transient failure
- Simplify CheckpointError.from_exception() conditional logic
- Update error-handling docs to reflect correct retry behavior
- Update tests to match corrected classification semantics
@yaythomas

thanks for the update @andreas-baakind-adsk, looking good!

just some minor formatting/linting...

tip: see https://github.com/aws/aws-durable-execution-sdk-python/blob/main/CONTRIBUTING.md#developer-workflow

you can run the CI lint checks locally so you know it'll work by the time it gets to the CI :-)

ops/ci-checks.sh

@andreas-baakind-adsk

Thank you for the great feedback, @yaythomas. I will for sure run the ci-checks locally next time to make sure there are no surprises after pushing my changes :)

@yaythomas yaythomas merged commit 4ecb123 into aws:main Apr 11, 2026
6 of 10 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in aws-durable-execution Apr 11, 2026
@andreas-baakind-adsk andreas-baakind-adsk deleted the fix/checkpoint-payload-size-handling branch April 13, 2026 04:48