Aws sdk migration by IvanBorislavovDimitrov · Pull Request #1845 · cloudfoundry/multiapps-controller

IvanBorislavovDimitrov · 2026-05-27T09:44:46Z

No description provided.

Yavor16

The Azure SDK files are also shown as deleted in this PR. Rebasing may remove them from the PR.

Yavor16 · 2026-06-02T13:50:14Z

+import io.swagger.annotations.Api;
+import io.swagger.annotations.ApiOperation;
+import io.swagger.annotations.ApiParam;
+import io.swagger.annotations.ApiResponse;
+import io.swagger.annotations.ApiResponses;
+import io.swagger.annotations.Authorization;


Should this file be part of the PR? I don't see any changes.

It was reformatted when I ran the code formatter.

Yavor16 · 2026-06-03T05:14:37Z

+    void testConnection() {
+        Bucket mockBucket = mock(Bucket.class);
+        when(mockedStorage.get(CONTAINER)).thenReturn(mockBucket);
+        assertDoesNotThrow(mockedGcpFileStorage::testConnection);
+    }
+
+    @Test
+    void testTestConnectionWhenBucketExists() {
+        Bucket mockBucket = mock(Bucket.class);
+        when(mockedStorage.get(CONTAINER)).thenReturn(mockBucket);
+
+        assertDoesNotThrow(mockedGcpFileStorage::testConnection);
+        verify(mockedStorage).get(CONTAINER);
+    }


Aren't these two the same?

Yavor16 · 2026-06-03T10:06:08Z

    @Override
    public void testConnection() {
-        storage.get(bucketName, "test");
+        var bucket = storage.get(bucketName);


Why do you use `var`?

Yavor16 · 2026-06-03T10:08:19Z

    // DEBUG log messages:
    public static final String STORED_FILE_0 = "Stored file: \"{0}\"";
    public static final String STORED_FILE_0_WITH_SIZE_1 = "Stored file \"{0}\" with size {1}";
+    public static final String STORED_FILE_0_WITH_SIZE_1_LOG = "Stored file \"{}\" with size {}";


Add numbers to the placeholders in the string, e.g. {0}.
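
For reference, a minimal sketch of the two placeholder styles seen in these constants. It assumes the `{0}` constants are rendered with `java.text.MessageFormat`, which the diff does not show; the class and method names are hypothetical. `MessageFormat` requires numbered placeholders and rejects an empty `{}`:

```java
import java.text.MessageFormat;

public class PlaceholderStyles {

    // Numbered placeholders, as used by the existing constants.
    static final String STORED_FILE_0_WITH_SIZE_1 = "Stored file \"{0}\" with size {1}";
    // Anonymous placeholders, as used by the new *_LOG constant.
    static final String STORED_FILE_0_WITH_SIZE_1_LOG = "Stored file \"{}\" with size {}";

    // Renders the numbered pattern with MessageFormat.
    static String formatNumbered(String fileName, long size) {
        return MessageFormat.format(STORED_FILE_0_WITH_SIZE_1, fileName, size);
    }

    // Returns true if MessageFormat refuses to parse the pattern.
    static boolean rejectedByMessageFormat(String pattern) {
        try {
            MessageFormat.format(pattern, "a.txt", 42L);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(formatNumbered("a.txt", 42L));
        System.out.println(rejectedByMessageFormat(STORED_FILE_0_WITH_SIZE_1_LOG));
    }
}
```

If the `_LOG` constants are instead passed directly to an SLF4J logger, note that SLF4J substitutes only `{}`; a numbered `{0}` would be printed literally, so the two styles are not interchangeable at the call site.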

Yavor16 · 2026-06-03T10:08:25Z

    public static final String RETRIEVED_SECRET_TOKEN_WITH_ID_0_FOR_PROCESS_WITH_ID_1 = "Retrieved secret token with id \"{0}\" for process with id \"{1}\"";
    public static final String DELETED_0_SECRET_TOKENS_FOR_PROCESS_WITH_ID_1 = "Deleted \"{0}\" secret tokens for process with id \"{1}\"";
    public static final String DELETED_0_SECRET_TOKENS_WITH_EXPIRATION_DATE_1 = "Deleted secret tokens \"{0}\" with an expiration date \"{1}\"";
+    public static final String FAILED_TO_DELETE_FILE_0_IN_OBJECT_STORE_REASON_1 = "Failed to delete file \"{}\" in object store. Reason: {}";


Add numbers to the placeholders in the string, e.g. {0}.

Yavor16 · 2026-06-03T10:10:44Z

+    requires software.amazon.awssdk.services.s3;
+    requires software.amazon.awssdk.core;
+    requires software.amazon.awssdk.awscore;
+    requires software.amazon.awssdk.regions;
+    requires software.amazon.awssdk.utils;
+    requires software.amazon.awssdk.auth;
+    requires software.amazon.awssdk.http;
+    requires software.amazon.awssdk.http.urlconnection;
+    requires software.amazon.awssdk.retries;
+    requires software.amazon.awssdk.retries.api;


Hm, I think some of these are not required. Can you check?

Yavor16 · 2026-06-03T10:17:59Z

+            LOGGER.warn(Messages.JOB_WITH_ID_WAS_NOT_UPDATED_WITHIN_SECONDS_ON_START, existingJob.getId(), UPDATE_JOB_TIMEOUT);
+            LOGGER.warn(Messages.STALE_JOB_DETAILS, existingJob.getId(), existingJob.getState(), existingJob.getUpdatedAt(),
+                        existingJob.getAddedAt(), existingJob.getStartedAt(), existingJob.getBytesRead(), existingJob.getUrl(),
+                        existingJob.getSpaceGuid(), existingJob.getNamespace(), existingJob.getUser(),
+                        existingJob.getInstanceIndex());


Why are these logs part of the PR?

Yavor16 · 2026-06-03T10:26:18Z

+import java.text.MessageFormat;
+import java.time.Duration;
+
+public class FileUploadResilientOperationExecutor {


Why is this part of the PR?

Yavor16 · 2026-06-03T10:27:05Z

    public static final String CANNOT_PARSE_CONTAINER_URI_OF_OBJECT_STORE = "Cannot parse container_uri of object store";
    public static final String REQUEST_0_1_FAILED_WITH_2 = "Request \"{0} {1}\" failed with \"{2}\"";
-    public static final String ERROR_OCCURRED_WHILE_DELETING_JOB_ENTRY = "Error occurred while deleting job entry";
+    public static final String ERROR_OCCURRED_WHILE_DELETING_JOB_ENTRY = "Error occurred while deleting job entry with id: {}";


Add a number to the placeholder to be consistent, and do the same for the other messages.

Yavor16 · 2026-06-03T10:31:32Z

        <immutables.version>2.12.1</immutables.version>
        <micrometer.version>1.16.4</micrometer.version>
        <aliyun-sdk-oss.version>3.18.5</aliyun-sdk-oss.version>
+        <aws.sdk.version>2.44.12</aws.sdk.version>


Before merging, you can bump the version, because they release new versions very frequently.

LMCROSSITXSADEPLOY-3315

Previously testConnection probed for a non-existent test object on the underlying store, which produced misleading results when the bucket or container itself was missing. The check now explicitly verifies that the target bucket/container exists and throws IllegalStateException with a clear OBJECT_STORE_BUCKET_NOT_FOUND message when it does not, making configuration errors fail fast with an actionable diagnostic. LMCROSSITXSADEPLOY-3315
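
The check described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `BucketLookup` is a hypothetical stand-in for the provider client, and the wording behind `OBJECT_STORE_BUCKET_NOT_FOUND` is assumed.

```java
import java.text.MessageFormat;

public class BucketExistenceCheck {

    // Assumed wording; the real constant's text is not shown in this PR excerpt.
    static final String OBJECT_STORE_BUCKET_NOT_FOUND = "Object store bucket \"{0}\" not found";

    // Hypothetical abstraction over the provider client (S3, GCS, ...).
    interface BucketLookup {
        boolean exists(String bucketName);
    }

    private final BucketLookup lookup;
    private final String bucketName;

    BucketExistenceCheck(BucketLookup lookup, String bucketName) {
        this.lookup = lookup;
        this.bucketName = bucketName;
    }

    // Fails fast when the configured bucket is missing, instead of probing for a dummy object.
    void testConnection() {
        if (!lookup.exists(bucketName)) {
            throw new IllegalStateException(MessageFormat.format(OBJECT_STORE_BUCKET_NOT_FOUND, bucketName));
        }
    }

    public static void main(String[] args) {
        new BucketExistenceCheck(name -> true, "present").testConnection();
        try {
            new BucketExistenceCheck(name -> false, "missing").testConnection();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```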

sonarqubecloud · 2026-06-05T12:45:39Z

Quality Gate passed

Issues
10 New issues
0 Accepted issues

Measures
0 Security Hotspots
88.3% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

IvanBorislavovDimitrov · 2026-06-05T16:30:53Z

`oq` test verdict: FAIL

Recommendation: do not merge — OQ failed due to an external RDS outage mid-run, not PR regression. Re-run OQ once RDS is healthy before merging.

PRs in this run:

Aws sdk migration #1845 @ 4ea43dce45bf348209f4eb15163e310792db52a4 (multiapps-controller)
https://github.tools.sap/mta-deploy-service/xsa-multiapps-controller/pull/525 @ 77285aa20297a1d776c848f988cf5c8635c7ef63 (xsa-multiapps-controller)

CF target: deploy-service / sap_btp_cf_mta_deploy+technical1 (app: deploy-service)
Window: 2026-06-05T14:27:36Z → 2026-06-05T16:11:49Z
Pipeline: http://gcpclm950064:8080/teams/main/pipelines/qa-tester

Pipeline outcomes

| Stage | Result | Notes |
| --- | --- | --- |
| Deploy (`deploy-service-pusher-oq`) | PASS | Deploy completed cleanly; app reached 3/3 healthy and OQ began |
| Tests (`qa-tester`) | FAIL | 113 of 118 initially-failed jobs failed both initial run and one-shot retry; 5 flaky jobs cleared on retry |
| Log analysis (`log-analyzer`) | FAIL | OQ failed due to a PostgreSQL/RDS database outage at 15:05–15:09 UTC (RDS port 8772 refused connections), not caused by either PR. All 113 job failures trace to the deploy-service crash-looping after HikariCP exhausted retries. Zero AWS SDK / S3 errors in the pre-outage window. |

Verdict rationale

OQ posted a hard failure (113/118 jobs failing both initial and retry runs), but the log-analyzer evidence is unambiguous: the deploy-service was healthy for the first 37 minutes of the run (14:27–15:05 UTC) with zero AWS SDK / S3 errors, then RDS at postgres-….eu-central-1.rds.amazonaws.com:8772 began refusing connections at 15:05:07 UTC, the application health endpoint flipped to 503 at 15:08:35 UTC, and CF killed all 3 instances at 15:09:44 UTC. Every restart attempt afterwards failed at Spring context init because HikariCP could not acquire a DB connection. Any OQ test still in flight or scheduled after 15:09 UTC therefore failed for infrastructure reasons unrelated to the code under review.

The triage classifier produced 0 regression suspects — LIKELY=0, UNLIKELY=0, INCONCLUSIVE=0 — and no version skew was observed across multiapps / multiapps-controller / xsa-multiapps-controller. The breadth of failures (blue-green, service lifecycle, GACD, CTS, application hooks, rollback) is exactly what a post-15:09 crash-loop produces: every scenario hits the same dead controller. The verdict label remains FAIL because tests did not pass, but the recommendation is to re-run OQ on a healthy environment rather than block the PRs on this signal.

Failed jobs / scenarios

113 jobs failed both initial and retry — broad failure across blue-green, service lifecycle, GACD, CTS, application hooks, rollback, and related scenarios. All map to the 15:05–15:09 UTC RDS outage and subsequent crash-loop, not to scenario-specific regressions.
5 jobs cleared on retry: fail-on-service-update, incremental-blue-green-deploy, gacd-with-certificates-deployment, health-check-invocation, sequential-resource-processing.

PR change surface (union across all PRs in this run)

Files changed: 30 (+1798 / -461)
Modules touched: multiapps-controller-api, multiapps-controller-persistence, multiapps-controller-process (test only), multiapps-controller-web, root pom.xml, xsa-multiapps-controller web manifests (manifest.yml, manifest.stress.yml)
Suspect overlap: none — log-analyzer produced 0 regression suspects, so there is nothing in either PR's diff that the log signal points to.

Non-blocking note (PR #1845 only)

Log-analyzer flagged a latent risk in the new AwsS3ObjectStoreFileStorage.testConnection() path: it calls s3Client.listObjectsV2() with a 10-minute socket timeout (via UrlConnectionHttpClient), which is longer than the SINGLE_TASK_TIMEOUT_IN_SECONDS = 70s used by ApplicationHealthCalculator. If S3 becomes transiently slow (>70s response), the health check could flip to false and return 503 independently of the database. This did not occur in this OQ run — the 503 was DB-driven — but the timeout asymmetry is a code-quality concern worth addressing before merge (the previous JClouds blobStore.blobExists() call did not have an explicit 10-minute socket timeout). Not applicable to PR #525, which only adjusts log-level entries.
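
The timeout asymmetry described above can be illustrated with a small stdlib-only sketch. It assumes the health calculator bounds each probe with a deadline (70s in the report) while the probe itself may block for much longer; `BoundedProbe` and `isHealthy` are hypothetical names, and the durations are scaled down to milliseconds so the example runs quickly.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedProbe {

    // Runs the probe on a worker thread; reports unhealthy if it fails or exceeds the deadline.
    static boolean isHealthy(Callable<Boolean> probe, Duration deadline) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Boolean> result = executor.submit(probe);
        try {
            return result.get(deadline.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Abandon the probe if it is still running and release the worker thread.
            result.cancel(true);
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A fast probe finishes within the deadline.
        System.out.println(isHealthy(() -> true, Duration.ofMillis(500)));
        // A slow probe (standing in for a stalled S3 call) outlives the deadline.
        System.out.println(isHealthy(() -> {
            Thread.sleep(2_000);
            return true;
        }, Duration.ofMillis(100)));
    }
}
```

In the real client, keeping the HTTP socket timeout below the health-check deadline would let the probe fail on its own rather than be abandoned by the caller; whether that fits this PR is for the author to judge.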

Log analysis summary

Expected (test-driven): 192
Infrastructure / transient: 40,670
Potentially regression-related: 0
Likely caused by PR(s): 0
Unlikely caused by PR(s): 0
Inconclusive: 0
Version skew: none

Full log-analyzer findings

Log Analyzer — oq verdict: FAIL

Test outcome (from orchestrator): FAILED (113 of 118 OQ jobs failed both initial run and retry)
CF target: deploy-service / sap_btp_cf_mta_deploy+technical1 (app: deploy-service, deployed sha: 4ea43dc)
Window: 2026-06-05T14:27:36Z → 2026-06-05T16:11:49Z
Index queried: logs-*
Total WARN/ERROR in window: 40,862 | Expected (catalog): 192 | Indeterminate: 40,670 | Truncated: no

Window-anchor note: window_start (14:27:36Z) vs deploy_start (14:25:26Z) differ by ~2 min — anchor is consistent with DEPLOY_END, not DEPLOY_START. Cross-validated and accepted.
OQ catalog: up-to-date (exit 0); no rebuild needed.

Verdict rationale

The systematic OQ failure of 113 jobs is caused by a PostgreSQL/RDS database outage, not by the AWS SDK migration PR. The database at postgres-a44daf96-2b75-4c00-9b07-fae7c6a6c1c2.cxxzc36no8yr.eu-central-1.rds.amazonaws.com:8772 began refusing connections at 15:05:07 UTC, roughly 37 minutes into the OQ run. By 15:08:35 UTC the application health check was returning 503 (both object-store health and DB health caches showed false); CF's liveness probe fired at 15:09:44 UTC and killed all 3 running instances. Every restart attempt since then has failed at Spring context initialization because HikariCP cannot acquire a connection during startup — this continues through the end of the OQ window and beyond.

The AWS SDK migration (multiapps-controller PR #1845, xsa-multiapps-controller PR #525) was running cleanly in the 37 minutes before the DB outage: zero S3 or AWS SDK errors appear in the pre-outage window (14:27–15:05 UTC), and the expected OQ scenario errors (auth failures, staging failures, content validation errors, circular-dependencies, etc.) all look normal in volume and kind. No regression suspect entries were surfaced by the triage classifier.

Bucket C is empty. The FAIL verdict is driven entirely by the infrastructure DB outage. The PR changes are not the root cause.

Local git state at analysis time

| Sub-project | Branch | Uncommitted (files) | On feature branch |
| --- | --- | --- | --- |
| `multiapps` | `master` | 0 | No |
| `multiapps-controller` | `aws-sdk-migration` | 1 (`pom.xml`) | Yes |
| `xsa-multiapps-controller` | `aws-migration` | 1 (`pom.xml`) | Yes |
| `XSOQTests` | `fix-hey` | 0 | Yes |
| `cf-mta-examples` | `feature/LMCROSSITXSADEPLOY-3316-v2` | 1 (untracked build dir) | Yes |
| `multiapps-cli-plugin` | `master` | 0 | No |
| `product-cf-hcp` | `staging-alerting` | 5 | Yes |
| `ds-load-tests-dashboard` | `update-readme` | 0 | Yes (minor) |

Uncommitted changes of note:

multiapps-controller/pom.xml (modified, unstaged): bumps <multiapps.version> from 2.48.0 → 2.49.0-SNAPSHOT
xsa-multiapps-controller/pom.xml (modified, unstaged): bumps <multiapps.version> from 2.48.0 → 2.49.0-SNAPSHOT
These two changes are coordinated and are part of the version-alignment update for the PR deployment.
xsa-multiapps-controller also has an untracked nginx/scripts/ directory (not load-bearing for this analysis).

Deploy chain version pinning

| Source of truth | Truth value | Declared in downstream | Declared value | Status |
| --- | --- | --- | --- | --- |
| `multiapps/pom.xml` `<version>` | `2.49.0-SNAPSHOT` | `multiapps-controller` `<multiapps.version>` | `2.49.0-SNAPSHOT` | OK |
| `multiapps/pom.xml` `<version>` | `2.49.0-SNAPSHOT` | `xsa-multiapps-controller` `<multiapps.version>` | `2.49.0-SNAPSHOT` | OK |
| `multiapps-controller/pom.xml` `<version>` | `2.48.0-SNAPSHOT` | `xsa-multiapps-controller` `<multiapps-controller.version>` | `2.48.0-SNAPSHOT` | OK |

No version skew detected. The uncommitted pom.xml changes on both feature branches track multiapps 2.49.0-SNAPSHOT correctly.

Categorization

| Category | Count |
| --- | --- |
| Expected (test-driven, catalog matches) | 192 |
| Infrastructure / transient (no regression marker) | 40,670 |
| Potentially regression-related (Bucket C suspects) | 0 |

Triage classifier result: unexpected=0, indeterminate_with_regression_marker=0, suspects_written=0.

Per-suspect attribution

No Bucket C suspects. The triage produced zero entries requiring regression attribution.

Strong attributions (LIKELY_CAUSED_BY_PR)

None. No suspect entries were produced by the triage step.

Root cause analysis — database outage (infrastructure event)

Primary signal: HikariPool-1 - Pool is empty, failed to create/setup connection with PSQLException: Connection to postgres-a44daf96-2b75-4c00-9b07-fae7c6a6c1c2.cxxzc36no8yr.eu-central-1.rds.amazonaws.com:8772 refused.

Timeline of the outage:

| UTC Time | Event |
| --- | --- |
| 14:27:36 | App deployed (DEPLOY_END), health checks passed, OQ started |
| 14:27–15:05 | App running normally; 1,257 pre-outage WARN/ERROR entries, all expected OQ scenario errors, zero AWS SDK errors |
| 15:05:07 | First DB signal: `HikariPool - Failed to validate connection ... (This connection has been closed.)` (stale connections dropping) |
| 15:06:08 | DB outage confirmed: `PSQLException: Connection to postgres-....rds.amazonaws.com:8772 refused` across all 3 instances |
| 15:06:18 | `Retrying operation that failed with message: Timeout while checking database health` (health check retries begin) |
| 15:08:35 | Health check emits: `Object store file storage health: "false", Database health: "false"` → next `/public/application-health` call returns 503 |
| 15:09:44 | CF liveness check detects 503 → all 3 instances killed (`Instance became unhealthy`) |
| 15:10:08 | New instances fail startup: `UnsatisfiedDependencyException: Error creating bean 'defaultEntityManagerFactory' ... Failed to initialize pool: The connection attempt failed.` 24 events total (8 restart waves × 3 instances) |
| 15:10–16:11 | All restart waves fail the same way; app remains 0/3 (crashed) through the end of the window |

Why ALL 113 OQ jobs failed: Once the DB outage knocked all 3 instances into crash-loop at 15:09 UTC, any OQ test that was still in-flight received connection errors or no-response, and any test that had not started yet could not communicate with the deploy-service. The retry wave of tests encountered the same crashed state. The 5 tests that passed are those that completed before 15:09 UTC.

Why the AWS SDK migration is NOT the root cause:

Zero AWS SDK / S3 errors appear in the 37-minute healthy window (14:27–15:05 UTC).
The objectStore health = false entry at 15:08:35 appeared after the DB connection failures began at 15:06 and is likely secondary: either transient S3 slowness in the same AWS region (eu-central-1) as the failing RDS instance, or the health check thread pool being starved by blocked DB check threads.
The crash chain is unambiguously driven by PSQLException (RDS port refused), not by any S3 API call.
The AWS SDK migration touches only object store code (AwsS3ObjectStoreFileStorage, ObjectStoreFileStorageFactoryBean, ObjectStoreServiceInfoCreator) and the startup log confirms AwsS3ObjectStoreFileStorage was set to DEBUG level — the new class loaded and initialized correctly.

Secondary observation: The new AwsS3ObjectStoreFileStorage.testConnection() calls s3Client.listObjectsV2() with a 10-minute socket timeout (via UrlConnectionHttpClient), which is longer than the SINGLE_TASK_TIMEOUT_IN_SECONDS = 70s used in ApplicationHealthCalculator. If S3 becomes transiently slow (>70s response), the health check would flip to false and return 503 independently of the DB. This is a latent risk introduced by the PR (the old JClouds blobStore.blobExists() did not have an explicit 10-minute socket timeout). However, this scenario did NOT occur during this OQ run — the 503 was DB-driven. This is flagged here as a code-quality concern for the PR author to assess, not as a confirmed regression.

Failed scenarios provided by orchestrator

All 113 failed jobs are attributable to the database outage from 15:05 UTC onward, not to any specific scenario-level regression in the PR code.

The 5 jobs that likely succeeded (not in the failed list) completed before the crash-loop began at 15:09 UTC.

Expected errors (sample — top signatures from catalog triage)

| Signature | Count | Scenario |
| --- | --- | --- |
| `Exception caught` (ContentException / SLException from deliberate scenario errors) | 166 | generic-content-deploy |
| `are not supported in the specified scope` | 11 | unsupported-parameter-message |
| `Execution of step .* has timed out` | 6 | timeout-scenario |
| `broker returned` (broker failure) | 4 | various |
| `Content deployment for module` | 2 | generic-content-deploy |
| `Failed to delete service` | 2 | service-deletion-failed-scenario |
| `App staging failed` | 1 | app-staging-failure |

Infra/transient errors (top no-marker indeterminate signatures)

| Signature | Count | Classification |
| --- | --- | --- |
| Failed to write message to the audit log | 34,849 | Expected: audit log service not bound in OQ |
| Ignoring parameter "namespace" | 2,910 | Expected: namespace not used in most OQ scenarios |
| Notification for Unknown NOT sent to ANS | 1,169 | Expected: ANS not configured in OQ |
| Skipping deletion of services | 225 | Expected: scenario-driven |
| Error while closing command context | 176 | Expected: Flowable error propagation from deliberate scenario failures |
| Skipping notification for operation | 101 | Expected: missing mtaId/orgId in some scenarios |
| Error occurred while executing REST API call | 24 | Expected: OQ scenarios deliberately send bad payloads (wrong Content-Type, malformed JSON, not-found operation IDs) |
| Cannot send audit log / Missing audit log credentials | 48 | Expected: audit log service not bound |
| Exception sending context initialized event to ContextLoader | 24 | DB outage: startup failure; NOT a PR regression |
| HikariPool connection failures (PSQLException) | 46 | DB outage: RDS infrastructure event |
| Timeout while checking database health | 6 | DB outage: health check timeout |
| Object store/database health = false | 3+ | DB outage: health check 503 trigger |

Generated by log-analyzer. Mode: oq. Generated 2026-06-05T19:00:00Z. Consumed by pr-result-publisher.

Posted by pr-result-publisher. Mode: oq. Posted on 2 PR(s) in this run. Generated 2026-06-05T19:30:00Z.

IvanBorislavovDimitrov force-pushed the aws-sdk-migration branch 11 times, most recently from 616a49d to 2a642fa, on May 27, 2026 14:15

cloudfoundry deleted a comment from sonarqubecloud (bot) on Jun 1, 2026

Yavor16 requested changes Jun 3, 2026


IvanBorislavovDimitrov added 2 commits June 4, 2026 17:05

Migrate from Jclouds to AWS SDK

f5e67f3

LMCROSSITXSADEPLOY-3315

IvanBorislavovDimitrov force-pushed the aws-sdk-migration branch from 2a642fa to c105981 on June 4, 2026 14:08

Fix comments

85580db

IvanBorislavovDimitrov force-pushed the aws-sdk-migration branch 3 times, most recently from 4052ef2 to a5703b3, on June 5, 2026 12:23

Fix sonar

4ea43dc

IvanBorislavovDimitrov force-pushed the aws-sdk-migration branch from a5703b3 to 4ea43dc on June 5, 2026 12:39

2 participants
