Aws sdk migration #1845
616a49d to 2a642fa
Yavor16 left a comment
The Azure SDK files are also shown as deleted in this PR. Rebasing may remove them from the PR.
import io.swagger.annotations.Api;
import io.swagger.annotations.ApiOperation;
import io.swagger.annotations.ApiParam;
import io.swagger.annotations.ApiResponse;
import io.swagger.annotations.ApiResponses;
import io.swagger.annotations.Authorization;
Should this file be part of the PR? I don't see any changes in it.
It was reformatted when I formatted the code.
void testConnection() {
    Bucket mockBucket = mock(Bucket.class);
    when(mockedStorage.get(CONTAINER)).thenReturn(mockBucket);
    assertDoesNotThrow(mockedGcpFileStorage::testConnection);
}

@Test
void testTestConnectionWhenBucketExists() {
    Bucket mockBucket = mock(Bucket.class);
    when(mockedStorage.get(CONTAINER)).thenReturn(mockBucket);

    assertDoesNotThrow(mockedGcpFileStorage::testConnection);
    verify(mockedStorage).get(CONTAINER);
}
@Override
public void testConnection() {
    storage.get(bucketName, "test");
    var bucket = storage.get(bucketName);
// DEBUG log messages:
public static final String STORED_FILE_0 = "Stored file: \"{0}\"";
public static final String STORED_FILE_0_WITH_SIZE_1 = "Stored file \"{0}\" with size {1}";
public static final String STORED_FILE_0_WITH_SIZE_1_LOG = "Stored file \"{}\" with size {}";
Add argument indices to the placeholders in the string, e.g. {0}.
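The difference between the two placeholder styles can be shown with a minimal sketch. It assumes the message constants are rendered with `java.text.MessageFormat`, which the excerpt does not confirm; the class name `PlaceholderStyles`, the constant `STORED_FILE_BARE`, and the sample arguments are hypothetical. Note that SLF4J-style loggers go the other way: they substitute bare `{}` and print `{0}` literally, so which style is right depends on how each constant is consumed.

```java
import java.text.MessageFormat;

class PlaceholderStyles {
    // Indexed placeholders, as in the existing message constants.
    static final String STORED_FILE_0_WITH_SIZE_1 = "Stored file \"{0}\" with size {1}";
    // Bare placeholders, as in the variant under review (hypothetical constant name).
    static final String STORED_FILE_BARE = "Stored file \"{}\" with size {}";

    // Renders the indexed template with MessageFormat.
    static String formatIndexed(String name, String size) {
        return MessageFormat.format(STORED_FILE_0_WITH_SIZE_1, name, size);
    }

    // Returns true if MessageFormat refuses to parse the pattern.
    static boolean isRejectedByMessageFormat(String pattern) {
        try {
            new MessageFormat(pattern);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(formatIndexed("app.mtar", "42 bytes"));
        System.out.println(isRejectedByMessageFormat(STORED_FILE_BARE));
    }
}
```

String arguments are used for the size on purpose: `MessageFormat` applies locale-specific number formatting to numeric arguments, which would add grouping separators to large byte counts.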
public static final String RETRIEVED_SECRET_TOKEN_WITH_ID_0_FOR_PROCESS_WITH_ID_1 = "Retrieved secret token with id \"{0}\" for process with id \"{1}\"";
public static final String DELETED_0_SECRET_TOKENS_FOR_PROCESS_WITH_ID_1 = "Deleted \"{0}\" secret tokens for process with id \"{1}\"";
public static final String DELETED_0_SECRET_TOKENS_WITH_EXPIRATION_DATE_1 = "Deleted secret tokens \"{0}\" with an expiration date \"{1}\"";
public static final String FAILED_TO_DELETE_FILE_0_IN_OBJECT_STORE_REASON_1 = "Failed to delete file \"{}\" in object store. Reason: {}";
Add argument indices to the placeholders in the string, e.g. {0}.
requires software.amazon.awssdk.services.s3;
requires software.amazon.awssdk.core;
requires software.amazon.awssdk.awscore;
requires software.amazon.awssdk.regions;
requires software.amazon.awssdk.utils;
requires software.amazon.awssdk.auth;
requires software.amazon.awssdk.http;
requires software.amazon.awssdk.http.urlconnection;
requires software.amazon.awssdk.retries;
requires software.amazon.awssdk.retries.api;
Hm, I think some of these are not required. Can you check?
LOGGER.warn(Messages.JOB_WITH_ID_WAS_NOT_UPDATED_WITHIN_SECONDS_ON_START, existingJob.getId(), UPDATE_JOB_TIMEOUT);
LOGGER.warn(Messages.STALE_JOB_DETAILS, existingJob.getId(), existingJob.getState(), existingJob.getUpdatedAt(),
            existingJob.getAddedAt(), existingJob.getStartedAt(), existingJob.getBytesRead(), existingJob.getUrl(),
            existingJob.getSpaceGuid(), existingJob.getNamespace(), existingJob.getUser(),
            existingJob.getInstanceIndex());
Why are these logs part of the PR?
import java.text.MessageFormat;
import java.time.Duration;

public class FileUploadResilientOperationExecutor {
public static final String CANNOT_PARSE_CONTAINER_URI_OF_OBJECT_STORE = "Cannot parse container_uri of object store";
public static final String REQUEST_0_1_FAILED_WITH_2 = "Request \"{0} {1}\" failed with \"{2}\"";
public static final String ERROR_OCCURRED_WHILE_DELETING_JOB_ENTRY = "Error occurred while deleting job entry";
public static final String ERROR_OCCURRED_WHILE_DELETING_JOB_ENTRY = "Error occurred while deleting job entry with id: {}";
Add an index to the placeholder for consistency, and do the same for the other messages.
<immutables.version>2.12.1</immutables.version>
<micrometer.version>1.16.4</micrometer.version>
<aliyun-sdk-oss.version>3.18.5</aliyun-sdk-oss.version>
<aws.sdk.version>2.44.12</aws.sdk.version>
Before merging, you can bump the version, since new versions are released very frequently.
LMCROSSITXSADEPLOY-3315
Previously testConnection probed for a non-existent test object on the underlying store, which produced misleading results when the bucket or container itself was missing. The check now explicitly verifies that the target bucket/container exists and throws IllegalStateException with a clear OBJECT_STORE_BUCKET_NOT_FOUND message when it does not, making configuration errors fail fast with an actionable diagnostic. LMCROSSITXSADEPLOY-3315
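The fail-fast behaviour described in this commit message can be sketched as follows. This is a minimal illustration, not the PR's code: the `Predicate<String>` stands in for the real storage client, the class name `BucketExistenceCheck` is hypothetical, and the text of `OBJECT_STORE_BUCKET_NOT_FOUND` is invented because the actual message is not shown in this excerpt.

```java
import java.util.Set;
import java.util.function.Predicate;

class BucketExistenceCheck {
    // Hypothetical message text; the real constant's wording is not part of the excerpt.
    static final String OBJECT_STORE_BUCKET_NOT_FOUND = "Object store bucket \"%s\" was not found";

    private final Predicate<String> bucketExists; // stand-in for the storage client lookup
    private final String bucketName;

    BucketExistenceCheck(Predicate<String> bucketExists, String bucketName) {
        this.bucketExists = bucketExists;
        this.bucketName = bucketName;
    }

    // Verifies the bucket itself instead of probing for a dummy object inside it.
    void testConnection() {
        if (!bucketExists.test(bucketName)) {
            throw new IllegalStateException(String.format(OBJECT_STORE_BUCKET_NOT_FOUND, bucketName));
        }
    }

    public static void main(String[] args) {
        Set<String> existing = Set.of("deploy-bucket");
        new BucketExistenceCheck(existing::contains, "deploy-bucket").testConnection();
        try {
            new BucketExistenceCheck(existing::contains, "typo-bucket").testConnection();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Checking the bucket directly means a misconfigured name surfaces as a configuration error at the first connection test rather than as an ambiguous "object not found" result.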
2a642fa to c105981
4052ef2 to a5703b3
a5703b3 to 4ea43dc
| Stage | Result | Notes |
|---|---|---|
| Deploy (deploy-service-pusher-oq) | PASS | Deploy completed cleanly; app reached 3/3 healthy and OQ began |
| Tests (qa-tester) | FAIL | 113 of 118 initially-failed jobs failed both initial run and one-shot retry; 5 flaky jobs cleared on retry |
| Log analysis (log-analyzer) | FAIL | OQ failed due to a PostgreSQL/RDS database outage at 15:05–15:09 UTC (RDS port 8772 refused connections), not caused by either PR. All 113 job failures trace to the deploy-service crash-looping after HikariCP exhausted retries. Zero AWS SDK / S3 errors in the pre-outage window. |
Verdict rationale
OQ posted a hard failure (113/118 jobs failing both initial and retry runs), but the log-analyzer evidence is unambiguous: the deploy-service was healthy for the first 37 minutes of the run (14:27–15:05 UTC) with zero AWS SDK / S3 errors, then RDS at postgres-….eu-central-1.rds.amazonaws.com:8772 began refusing connections at 15:05:07 UTC, the application health endpoint flipped to 503 at 15:08:35 UTC, and CF killed all 3 instances at 15:09:44 UTC. Every restart attempt afterwards failed at Spring context init because HikariCP could not acquire a DB connection. Any OQ test still in flight or scheduled after 15:09 UTC therefore failed for infrastructure reasons unrelated to the code under review.
The triage classifier produced 0 regression suspects — LIKELY=0, UNLIKELY=0, INCONCLUSIVE=0 — and no version skew was observed across multiapps / multiapps-controller / xsa-multiapps-controller. The breadth of failures (blue-green, service lifecycle, GACD, CTS, application hooks, rollback) is exactly what a post-15:09 crash-loop produces: every scenario hits the same dead controller. The verdict label remains FAIL because tests did not pass, but the recommendation is to re-run OQ on a healthy environment rather than block the PRs on this signal.
Failed jobs / scenarios
- 113 jobs failed both initial and retry — broad failure across blue-green, service lifecycle, GACD, CTS, application hooks, rollback, and related scenarios. All map to the 15:05–15:09 UTC RDS outage and subsequent crash-loop, not to scenario-specific regressions.
- 5 jobs cleared on retry: fail-on-service-update, incremental-blue-green-deploy, gacd-with-certificates-deployment, health-check-invocation, sequential-resource-processing.
PR change surface (union across all PRs in this run)
- Files changed: 30 (+1798/-461)
- Modules touched: multiapps-controller-api, multiapps-controller-persistence, multiapps-controller-process (test only), multiapps-controller-web, root pom.xml, xsa-multiapps-controller web manifests (manifest.yml, manifest.stress.yml)
- Suspect overlap: none. Log-analyzer produced 0 regression suspects, so there is nothing in either PR's diff that the log signal points to.
Non-blocking note (PR #1845 only)
Log-analyzer flagged a latent risk in the new AwsS3ObjectStoreFileStorage.testConnection() path: it calls s3Client.listObjectsV2() with a 10-minute socket timeout (via UrlConnectionHttpClient), which is longer than the SINGLE_TASK_TIMEOUT_IN_SECONDS = 70s used by ApplicationHealthCalculator. If S3 becomes transiently slow (>70s response), the health check could flip to false and return 503 independently of the database. This did not occur in this OQ run — the 503 was DB-driven — but the timeout asymmetry is a code-quality concern worth addressing before merge (the previous JClouds blobStore.blobExists() call did not have an explicit 10-minute socket timeout). Not applicable to PR #525, which only adjusts log-level entries.
Log analysis summary
- Expected (test-driven): 192
- Infrastructure / transient: 40670
- Potentially regression-related: 0
- Likely caused by PR(s): 0
- Unlikely caused by PR(s): 0
- Inconclusive: 0
- Version skew: none
Full log-analyzer findings
Log Analyzer — oq verdict: FAIL
Test outcome (from orchestrator): FAILED (113 of 118 OQ jobs failed both initial run and retry)
CF target: deploy-service / sap_btp_cf_mta_deploy+technical1 (app: deploy-service, deployed sha: 4ea43dc)
Window: 2026-06-05T14:27:36Z → 2026-06-05T16:11:49Z
Index queried: logs-*
Total WARN/ERROR in window: 40,862 | Expected (catalog): 192 | Indeterminate: 40,670 | Truncated: no
Window-anchor note: window_start (14:27:36Z) vs deploy_start (14:25:26Z) differ by ~2 min; the anchor is consistent with DEPLOY_END, not DEPLOY_START. Cross-validated and accepted.
OQ catalog: up-to-date (exit 0); no rebuild needed.
Verdict rationale
The systematic OQ failure of 113 jobs is caused by a PostgreSQL/RDS database outage, not by the AWS SDK migration PR. The database at postgres-a44daf96-2b75-4c00-9b07-fae7c6a6c1c2.cxxzc36no8yr.eu-central-1.rds.amazonaws.com:8772 began refusing connections at 15:05:07 UTC, roughly 37 minutes into the OQ run. By 15:08:35 UTC the application health check was returning 503 (both object-store health and DB health caches showed false); CF's liveness probe fired at 15:09:44 UTC and killed all 3 running instances. Every restart attempt since then has failed at Spring context initialization because HikariCP cannot acquire a connection during startup — this continues through the end of the OQ window and beyond.
The AWS SDK migration (multiapps-controller PR #1845, xsa-multiapps-controller PR #525) was running cleanly in the 37 minutes before the DB outage: zero S3 or AWS SDK errors appear in the pre-outage window (14:27–15:05 UTC), and the expected OQ scenario errors (auth failures, staging failures, content validation errors, circular-dependencies, etc.) all look normal in volume and kind. No regression suspect entries were surfaced by the triage classifier.
Bucket C is empty. The FAIL verdict is driven entirely by the infrastructure DB outage. The PR changes are not the root cause.
Local git state at analysis time
| Sub-project | Branch | Uncommitted (files) | On feature branch |
|---|---|---|---|
| multiapps | master | 0 | No |
| multiapps-controller | aws-sdk-migration | 1 (pom.xml) | Yes |
| xsa-multiapps-controller | aws-migration | 1 (pom.xml) | Yes |
| XSOQTests | fix-hey | 0 | Yes |
| cf-mta-examples | feature/LMCROSSITXSADEPLOY-3316-v2 | 1 (untracked build dir) | Yes |
| multiapps-cli-plugin | master | 0 | No |
| product-cf-hcp | staging-alerting | 5 | Yes |
| ds-load-tests-dashboard | update-readme | 0 | Yes (minor) |
Uncommitted changes of note:
- multiapps-controller/pom.xml (modified, unstaged): bumps <multiapps.version> from 2.48.0 → 2.49.0-SNAPSHOT
- xsa-multiapps-controller/pom.xml (modified, unstaged): bumps <multiapps.version> from 2.48.0 → 2.49.0-SNAPSHOT
- These two changes are coordinated and are part of the version-alignment update for the PR deployment.
- xsa-multiapps-controller also has an untracked nginx/scripts/ directory (not load-bearing for this analysis).
Deploy chain version pinning
| Source of truth | Truth value | Declared in downstream | Declared value | Status |
|---|---|---|---|---|
| multiapps/pom.xml <version> | 2.49.0-SNAPSHOT | multiapps-controller <multiapps.version> | 2.49.0-SNAPSHOT | OK |
| multiapps/pom.xml <version> | 2.49.0-SNAPSHOT | xsa-multiapps-controller <multiapps.version> | 2.49.0-SNAPSHOT | OK |
| multiapps-controller/pom.xml <version> | 2.48.0-SNAPSHOT | xsa-multiapps-controller <multiapps-controller.version> | 2.48.0-SNAPSHOT | OK |
No version skew detected. The uncommitted pom.xml changes on both feature branches track multiapps 2.49.0-SNAPSHOT correctly.
Categorization
| Category | Count |
|---|---|
| Expected (test-driven — catalog matches) | 192 |
| Infrastructure / transient (no regression marker) | 40,670 |
| Potentially regression-related (Bucket C suspects) | 0 |
Triage classifier result: unexpected=0, indeterminate_with_regression_marker=0, suspects_written=0.
Per-suspect attribution
No Bucket C suspects. The triage produced zero entries requiring regression attribution.
Strong attributions (LIKELY_CAUSED_BY_PR)
None. No suspect entries were produced by the triage step.
Root cause analysis — database outage (infrastructure event)
Primary signal: HikariPool-1 - Pool is empty, failed to create/setup connection with PSQLException: Connection to postgres-a44daf96-2b75-4c00-9b07-fae7c6a6c1c2.cxxzc36no8yr.eu-central-1.rds.amazonaws.com:8772 refused.
Timeline of the outage:
| UTC Time | Event |
|---|---|
| 14:27:36 | App deployed (DEPLOY_END), health checks passed, OQ started |
| 14:27–15:05 | App running normally; 1,257 pre-outage WARN/ERROR entries — all expected OQ scenario errors, zero AWS SDK errors |
| 15:05:07 | First DB signal: HikariPool - Failed to validate connection ... (This connection has been closed.) — stale connections dropping |
| 15:06:08 | DB outage confirmed: PSQLException: Connection to postgres-....rds.amazonaws.com:8772 refused across all 3 instances |
| 15:06:18 | Retrying operation that failed with message: Timeout while checking database health (health check retries begin) |
| 15:08:35 | Health check emits: Object store file storage health: "false", Database health: "false" → next /public/application-health call returns 503 |
| 15:09:44 | CF liveness check detects 503 → all 3 instances killed (Instance became unhealthy) |
| 15:10:08 | New instances fail startup: UnsatisfiedDependencyException: Error creating bean 'defaultEntityManagerFactory' ... Failed to initialize pool: The connection attempt failed. — 24 events total (8 restart waves × 3 instances) |
| 15:10–16:11 | All restart waves fail the same way; app remains 0/3 (crashed) through the end of the window |
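The intervals quoted elsewhere in this report (the 37-minute healthy window, and the gap between the health flag going false and CF killing the instances) follow directly from the timestamps in the table above. A small sketch using `java.time` reproduces them; the class and method names are hypothetical and only the timestamps come from the timeline.

```java
import java.time.Duration;
import java.time.LocalTime;

class OutageTimeline {
    // Elapsed time between two same-day UTC timestamps given as HH:mm:ss.
    static Duration between(String fromUtc, String toUtc) {
        return Duration.between(LocalTime.parse(fromUtc), LocalTime.parse(toUtc));
    }

    public static void main(String[] args) {
        // Deploy end to first DB signal: the healthy window.
        System.out.println(between("14:27:36", "15:05:07").toMinutes() + " min healthy");
        // First DB signal to CF killing all instances.
        System.out.println(between("15:05:07", "15:09:44").getSeconds() + " s from first DB signal to kill");
        // Health flags turning false to CF killing all instances.
        System.out.println(between("15:08:35", "15:09:44").getSeconds() + " s from health=false to kill");
    }
}
```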
Why ALL 113 OQ jobs failed: Once the DB outage knocked all 3 instances into crash-loop at 15:09 UTC, any OQ test that was still in-flight received connection errors or no-response, and any test that had not started yet could not communicate with the deploy-service. The retry wave of tests encountered the same crashed state. The 5 tests that passed are those that completed before 15:09 UTC.
Why the AWS SDK migration is NOT the root cause:
- Zero AWS SDK / S3 errors appear in the 37-minute healthy window (14:27–15:05 UTC).
- The objectStore health = false entry at 15:08:35 appeared after the DB connection failures began at 15:06 and is likely secondary: a transient S3 slowness in the same AWS eu-central-1 AZ as the failing RDS instance, or the health check thread pool being starved by blocked DB check threads.
- The crash chain is unambiguously driven by PSQLException (RDS port refused), not by any S3 API call.
- The AWS SDK migration touches only object store code (AwsS3ObjectStoreFileStorage, ObjectStoreFileStorageFactoryBean, ObjectStoreServiceInfoCreator) and the startup log confirms AwsS3ObjectStoreFileStorage was set to DEBUG level, so the new class loaded and initialized correctly.
Secondary observation: The new AwsS3ObjectStoreFileStorage.testConnection() calls s3Client.listObjectsV2() with a 10-minute socket timeout (via UrlConnectionHttpClient), which is longer than the SINGLE_TASK_TIMEOUT_IN_SECONDS = 70s used in ApplicationHealthCalculator. If S3 becomes transiently slow (>70s response), the health check would flip to false and return 503 independently of the DB. This is a latent risk introduced by the PR (the old JClouds blobStore.blobExists() did not have an explicit 10-minute socket timeout). However, this scenario did NOT occur during this OQ run — the 503 was DB-driven. This is flagged here as a code-quality concern for the PR author to assess, not as a confirmed regression.
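The timeout asymmetry can be illustrated with a minimal sketch of a health probe that runs under a fixed time budget. This is not the `ApplicationHealthCalculator` implementation, which is not shown here; the class `BoundedHealthProbe` and its `probe` method are hypothetical, and short durations stand in for the 70 s budget and the 10-minute socket timeout. The point is that any check whose own client timeout exceeds the budget is reported unhealthy as soon as the budget elapses, even if the call would eventually succeed.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedHealthProbe {
    // Runs the check on a worker thread and treats a timeout or failure as unhealthy.
    static boolean probe(Callable<Boolean> check, Duration budget) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Boolean> result = executor.submit(check);
        try {
            return result.get(budget.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // the check is still running; interrupt it
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false; // the check itself threw
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A fast check fits in the budget.
        System.out.println(probe(() -> true, Duration.ofMillis(500)));
        // A slow but ultimately successful check exceeds the budget and is reported unhealthy.
        System.out.println(probe(() -> { Thread.sleep(2_000); return true; }, Duration.ofMillis(100)));
    }
}
```

One way to remove the asymmetry, if the PR author agrees it matters, is to give the client used by the connection test a socket timeout below the health-check budget, so that a slow S3 call fails on its own terms rather than through the probe's timeout.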
Failed scenarios provided by orchestrator
All 113 failed jobs are attributable to the database outage from 15:05 UTC onward, not to any specific scenario-level regression in the PR code.
The 5 jobs that likely succeeded (not in the failed list) completed before the instances were killed at 15:09 UTC; the DB outage itself began at 15:05 UTC.
Expected errors (sample — top signatures from catalog triage)
| Signature | Count | Scenario |
|---|---|---|
| Exception caught (ContentException / SLException from deliberate scenario errors) | 166 | generic-content-deploy |
| are not supported in the specified scope | 11 | unsupported-parameter-message |
| Execution of step .* has timed out | 6 | timeout-scenario |
| broker returned (broker failure) | 4 | various |
| Content deployment for module | 2 | generic-content-deploy |
| Failed to delete service | 2 | service-deletion-failed-scenario |
| App staging failed | 1 | app-staging-failure |
Infra/transient errors (top no-marker indeterminate signatures)
| Signature | Count | Classification |
|---|---|---|
| Failed to write message to the audit log | 34,849 | Expected — audit log service not bound in OQ |
| Ignoring parameter "namespace" | 2,910 | Expected — namespace not used in most OQ scenarios |
| Notification for Unknown NOT sent to ANS | 1,169 | Expected — ANS not configured in OQ |
| Skipping deletion of services | 225 | Expected — scenario-driven |
| Error while closing command context | 176 | Expected — Flowable error propagation from deliberate scenario failures |
| Skipping notification for operation | 101 | Expected — missing mtaId/orgId in some scenarios |
| Error occurred while executing REST API call | 24 | Expected — OQ scenarios deliberately send bad payloads (wrong Content-Type, malformed JSON, not-found operation IDs) |
| Cannot send audit log / Missing audit log credentials | 48 | Expected — audit log service not bound |
| Exception sending context initialized event to ContextLoader | 24 | DB outage — startup failure; NOT a PR regression |
| HikariPool connection failures (PSQLException) | 46 | DB outage — RDS infrastructure event |
| Timeout while checking database health | 6 | DB outage — health check timeout |
| Object store/database health = false | 3+ | DB outage — health check 503 trigger |
Generated by log-analyzer. Mode: oq. Generated 2026-06-05T19:00:00Z. Consumed by pr-result-publisher.
Posted by pr-result-publisher. Mode: oq. Posted on 2 PR(s) in this run. Generated 2026-06-05T19:30:00Z.



No description provided.