
Aws sdk migration #1845

Open
IvanBorislavovDimitrov wants to merge 4 commits into master from aws-sdk-migration

Conversation

@IvanBorislavovDimitrov
Contributor

No description provided.

@IvanBorislavovDimitrov force-pushed the aws-sdk-migration branch 11 times, most recently from 616a49d to 2a642fa (May 27, 2026 14:15)
@cloudfoundry deleted a comment from sonarqubecloud (bot) on Jun 1, 2026
Contributor

@Yavor16 left a comment


The Azure SDK files are also shown as deleted in this PR. If you rebase, they should drop out of the PR.

Comment on lines +4 to +9
import io.swagger.annotations.Api;
import io.swagger.annotations.ApiOperation;
import io.swagger.annotations.ApiParam;
import io.swagger.annotations.ApiResponse;
import io.swagger.annotations.ApiResponses;
import io.swagger.annotations.Authorization;
Contributor

Should this file be part of the PR? I don't see any changes in it.

Contributor Author

It was refactored when I formatted the code.

Comment on lines +159 to +172
void testConnection() {
    Bucket mockBucket = mock(Bucket.class);
    when(mockedStorage.get(CONTAINER)).thenReturn(mockBucket);
    assertDoesNotThrow(mockedGcpFileStorage::testConnection);
}

@Test
void testTestConnectionWhenBucketExists() {
    Bucket mockBucket = mock(Bucket.class);
    when(mockedStorage.get(CONTAINER)).thenReturn(mockBucket);

    assertDoesNotThrow(mockedGcpFileStorage::testConnection);
    verify(mockedStorage).get(CONTAINER);
}
Contributor

Aren't these two tests the same?

@Override
public void testConnection() {
storage.get(bucketName, "test");
var bucket = storage.get(bucketName);
Contributor

Why do you use var here?

// DEBUG log messages:
public static final String STORED_FILE_0 = "Stored file: \"{0}\"";
public static final String STORED_FILE_0_WITH_SIZE_1 = "Stored file \"{0}\" with size {1}";
public static final String STORED_FILE_0_WITH_SIZE_1_LOG = "Stored file \"{}\" with size {}";
Contributor

Add index numbers to the placeholders in the string, e.g. {0}.
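
A note on the two placeholder styles that appear in these message constants. The following is a minimal sketch, assuming the constants without the _LOG suffix are rendered with java.text.MessageFormat and the _LOG variants are handed to an SLF4J-style logger (the diff suggests this, but the call sites are not shown here). MessageFormat requires indexed placeholders such as {0} and rejects a bare {}, whereas SLF4J substitutes only bare {} pairs and would print {0} literally. The class and method names are hypothetical.

```java
import java.text.MessageFormat;

final class PlaceholderStyles {

    // Indexed placeholders, the form java.text.MessageFormat expects.
    static final String INDEXED = "Stored file \"{0}\" with size {1}";
    // Bare placeholders, the form SLF4J-style loggers expect.
    static final String BARE = "Stored file \"{}\" with size {}";

    // Renders the indexed pattern. The size is converted to a String first because
    // MessageFormat applies locale-specific number formatting to numeric arguments.
    static String formatIndexed(String fileName, long size) {
        return MessageFormat.format(INDEXED, fileName, Long.toString(size));
    }

    // Returns true if MessageFormat refuses the bare-placeholder pattern.
    static boolean bareRejectedByMessageFormat() {
        try {
            MessageFormat.format(BARE, "a.mtar", "42");
            return false;
        } catch (IllegalArgumentException e) {
            return true; // "{}" carries no argument index to parse
        }
    }

    public static void main(String[] args) {
        System.out.println(formatIndexed("a.mtar", 42));
        System.out.println(bareRejectedByMessageFormat());
    }
}
```

Since the two styles are not interchangeable, whether an index should be added depends on which API consumes each constant.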

public static final String RETRIEVED_SECRET_TOKEN_WITH_ID_0_FOR_PROCESS_WITH_ID_1 = "Retrieved secret token with id \"{0}\" for process with id \"{1}\"";
public static final String DELETED_0_SECRET_TOKENS_FOR_PROCESS_WITH_ID_1 = "Deleted \"{0}\" secret tokens for process with id \"{1}\"";
public static final String DELETED_0_SECRET_TOKENS_WITH_EXPIRATION_DATE_1 = "Deleted secret tokens \"{0}\" with an expiration date \"{1}\"";
public static final String FAILED_TO_DELETE_FILE_0_IN_OBJECT_STORE_REASON_1 = "Failed to delete file \"{}\" in object store. Reason: {}";
Contributor

Add index numbers to the placeholders in the string, e.g. {0}.

Comment on lines +63 to +72
requires software.amazon.awssdk.services.s3;
requires software.amazon.awssdk.core;
requires software.amazon.awssdk.awscore;
requires software.amazon.awssdk.regions;
requires software.amazon.awssdk.utils;
requires software.amazon.awssdk.auth;
requires software.amazon.awssdk.http;
requires software.amazon.awssdk.http.urlconnection;
requires software.amazon.awssdk.retries;
requires software.amazon.awssdk.retries.api;
Contributor

Hm, I think some of these are not required. Can you check?

Comment on lines +136 to +140
LOGGER.warn(Messages.JOB_WITH_ID_WAS_NOT_UPDATED_WITHIN_SECONDS_ON_START, existingJob.getId(), UPDATE_JOB_TIMEOUT);
LOGGER.warn(Messages.STALE_JOB_DETAILS, existingJob.getId(), existingJob.getState(), existingJob.getUpdatedAt(),
existingJob.getAddedAt(), existingJob.getStartedAt(), existingJob.getBytesRead(), existingJob.getUrl(),
existingJob.getSpaceGuid(), existingJob.getNamespace(), existingJob.getUser(),
existingJob.getInstanceIndex());
Contributor

Why are these logs part of the PR?

import java.text.MessageFormat;
import java.time.Duration;

public class FileUploadResilientOperationExecutor {
Contributor

Why is this part of the PR?

public static final String CANNOT_PARSE_CONTAINER_URI_OF_OBJECT_STORE = "Cannot parse container_uri of object store";
public static final String REQUEST_0_1_FAILED_WITH_2 = "Request \"{0} {1}\" failed with \"{2}\"";
public static final String ERROR_OCCURRED_WHILE_DELETING_JOB_ENTRY = "Error occurred while deleting job entry";
public static final String ERROR_OCCURRED_WHILE_DELETING_JOB_ENTRY = "Error occurred while deleting job entry with id: {}";
Contributor

Add an index number to the placeholder for consistency, and do the same for the other messages.

Comment thread pom.xml Outdated
<immutables.version>2.12.1</immutables.version>
<micrometer.version>1.16.4</micrometer.version>
<aliyun-sdk-oss.version>3.18.5</aliyun-sdk-oss.version>
<aws.sdk.version>2.44.12</aws.sdk.version>
Contributor

Before merging, you could bump the version, since they release new versions very frequently.

LMCROSSITXSADEPLOY-3315
Previously testConnection probed for a non-existent test object on the
underlying store, which produced misleading results when the bucket or
container itself was missing. The check now explicitly verifies that the
target bucket/container exists and throws IllegalStateException with a
clear OBJECT_STORE_BUCKET_NOT_FOUND message when it does not, making
configuration errors fail fast with an actionable diagnostic.
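
The behaviour described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: ConnectionCheckSketch, the lookup function (which returns null for a missing bucket) and the message wording are all hypothetical; only the IllegalStateException and the OBJECT_STORE_BUCKET_NOT_FOUND constant name come from the text above.

```java
import java.text.MessageFormat;
import java.util.Set;
import java.util.function.Function;

final class ConnectionCheckSketch {

    // Hypothetical wording; the PR's actual OBJECT_STORE_BUCKET_NOT_FOUND text is not shown.
    static final String OBJECT_STORE_BUCKET_NOT_FOUND = "Bucket \"{0}\" was not found in the object store";

    // Stand-in for the store client: returns the bucket, or null when it does not exist.
    private final Function<String, Object> bucketLookup;
    private final String bucketName;

    ConnectionCheckSketch(Function<String, Object> bucketLookup, String bucketName) {
        this.bucketLookup = bucketLookup;
        this.bucketName = bucketName;
    }

    // Verifies the bucket itself instead of probing for a non-existent test object.
    void testConnection() {
        Object bucket = bucketLookup.apply(bucketName);
        if (bucket == null) {
            throw new IllegalStateException(MessageFormat.format(OBJECT_STORE_BUCKET_NOT_FOUND, bucketName));
        }
    }

    public static void main(String[] args) {
        Set<String> existing = Set.of("deploy-service-files");
        Function<String, Object> lookup = name -> existing.contains(name) ? name : null;

        new ConnectionCheckSketch(lookup, "deploy-service-files").testConnection(); // passes silently
        try {
            new ConnectionCheckSketch(lookup, "missing-bucket").testConnection();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```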

LMCROSSITXSADEPLOY-3315
@IvanBorislavovDimitrov force-pushed the aws-sdk-migration branch 3 times, most recently from 4052ef2 to a5703b3 (June 5, 2026 12:23)
@sonarqubecloud

sonarqubecloud Bot commented Jun 5, 2026

@IvanBorislavovDimitrov
Contributor Author

oq test verdict: FAIL

Recommendation: do not merge — OQ failed due to an external RDS outage mid-run, not PR regression. Re-run OQ once RDS is healthy before merging.

PRs in this run:

CF target: deploy-service / sap_btp_cf_mta_deploy+technical1 (app: deploy-service)
Window: 2026-06-05T14:27:36Z → 2026-06-05T16:11:49Z
Pipeline: http://gcpclm950064:8080/teams/main/pipelines/qa-tester

Pipeline outcomes

| Stage | Result | Notes |
| --- | --- | --- |
| Deploy (deploy-service-pusher-oq) | PASS | Deploy completed cleanly; app reached 3/3 healthy and OQ began |
| Tests (qa-tester) | FAIL | 113 of 118 initially-failed jobs failed both initial run and one-shot retry; 5 flaky jobs cleared on retry |
| Log analysis (log-analyzer) | FAIL | OQ failed due to a PostgreSQL/RDS database outage at 15:05–15:09 UTC (RDS port 8772 refused connections), not caused by either PR. All 113 job failures trace to the deploy-service crash-looping after HikariCP exhausted retries. Zero AWS SDK / S3 errors in the pre-outage window. |

Verdict rationale

OQ posted a hard failure (113/118 jobs failing both initial and retry runs), but the log-analyzer evidence is unambiguous: the deploy-service was healthy for the first 37 minutes of the run (14:27–15:05 UTC) with zero AWS SDK / S3 errors, then RDS at postgres-….eu-central-1.rds.amazonaws.com:8772 began refusing connections at 15:05:07 UTC, the application health endpoint flipped to 503 at 15:08:35 UTC, and CF killed all 3 instances at 15:09:44 UTC. Every restart attempt afterwards failed at Spring context init because HikariCP could not acquire a DB connection. Any OQ test still in flight or scheduled after 15:09 UTC therefore failed for infrastructure reasons unrelated to the code under review.

The triage classifier produced 0 regression suspects (LIKELY=0, UNLIKELY=0, INCONCLUSIVE=0), and no version skew was observed across multiapps / multiapps-controller / xsa-multiapps-controller. The breadth of failures (blue-green, service lifecycle, GACD, CTS, application hooks, rollback) is exactly what a post-15:09 crash-loop produces: every scenario hits the same dead controller. The verdict label remains FAIL because tests did not pass, but the recommendation is to re-run OQ on a healthy environment rather than block the PRs on this signal.

Failed jobs / scenarios

  • 113 jobs failed both initial and retry — broad failure across blue-green, service lifecycle, GACD, CTS, application hooks, rollback, and related scenarios. All map to the 15:05–15:09 UTC RDS outage and subsequent crash-loop, not to scenario-specific regressions.
  • 5 jobs cleared on retry: fail-on-service-update, incremental-blue-green-deploy, gacd-with-certificates-deployment, health-check-invocation, sequential-resource-processing.

PR change surface (union across all PRs in this run)

  • Files changed: 30 (+1798 / -461)
  • Modules touched: multiapps-controller-api, multiapps-controller-persistence, multiapps-controller-process (test only), multiapps-controller-web, root pom.xml, xsa-multiapps-controller web manifests (manifest.yml, manifest.stress.yml)
  • Suspect overlap: none — log-analyzer produced 0 regression suspects, so there is nothing in either PR's diff that the log signal points to.

Non-blocking note (PR #1845 only)

Log-analyzer flagged a latent risk in the new AwsS3ObjectStoreFileStorage.testConnection() path: it calls s3Client.listObjectsV2() with a 10-minute socket timeout (via UrlConnectionHttpClient), which is longer than the SINGLE_TASK_TIMEOUT_IN_SECONDS = 70s used by ApplicationHealthCalculator. If S3 becomes transiently slow (>70s response), the health check could flip to false and return 503 independently of the database. This did not occur in this OQ run — the 503 was DB-driven — but the timeout asymmetry is a code-quality concern worth addressing before merge (the previous JClouds blobStore.blobExists() call did not have an explicit 10-minute socket timeout). Not applicable to PR #525, which only adjusts log-level entries.
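
The timeout asymmetry noted above can be illustrated with a small standard-library sketch. It is not the ApplicationHealthCalculator implementation; BoundedProbeSketch and isHealthy are hypothetical names, and the probe is a plain Callable standing in for the object-store call. The point it shows is that a probe is reported unhealthy as soon as the caller's budget elapses, regardless of how long the probe's own socket timeout would let it wait.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedProbeSketch {

    // Runs the probe and reports healthy only if it finishes within the budget.
    // A fresh executor per call keeps the sketch short; real code would reuse one.
    static boolean isHealthy(Callable<Void> probe, Duration budget) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Void> future = executor.submit(probe);
        try {
            future.get(budget.toMillis(), TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // the probe outlived the budget: treated as unhealthy
            return false;
        } catch (ExecutionException e) {
            return false; // the probe itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A probe that answers immediately stays within the budget.
        System.out.println(isHealthy(() -> null, Duration.ofSeconds(1)));
        // A probe that blocks longer than the budget is reported unhealthy.
        System.out.println(isHealthy(() -> { Thread.sleep(2000); return null; }, Duration.ofMillis(100)));
    }
}
```

One possible remedy, to be weighed by the PR author, is to keep the client-side timeouts below the health-check budget. In the AWS SDK for Java 2.x, the UrlConnectionHttpClient builder accepts socketTimeout and connectionTimeout durations; whether a shorter value is acceptable for large uploads sharing the same client is an open question here.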

Log analysis summary

  • Expected (test-driven): 192
  • Infrastructure / transient: 40670
  • Potentially regression-related: 0
  •   Likely caused by PR(s): 0
  •   Unlikely caused by PR(s): 0
  •   Inconclusive: 0
  • Version skew: none
Full log-analyzer findings

Log Analyzer — oq verdict: FAIL

Test outcome (from orchestrator): FAILED (113 of 118 OQ jobs failed both initial run and retry)
CF target: deploy-service / sap_btp_cf_mta_deploy+technical1 (app: deploy-service, deployed sha: 4ea43dc)
Window: 2026-06-05T14:27:36Z → 2026-06-05T16:11:49Z
Index queried: logs-*
Total WARN/ERROR in window: 40,862 | Expected (catalog): 192 | Indeterminate: 40,670 | Truncated: no

Window-anchor note: window_start (14:27:36Z) vs deploy_start (14:25:26Z) differ by ~2 min — anchor is consistent with DEPLOY_END, not DEPLOY_START. Cross-validated and accepted.
OQ catalog: up-to-date (exit 0); no rebuild needed.


Verdict rationale

The systematic OQ failure of 113 jobs is caused by a PostgreSQL/RDS database outage, not by the AWS SDK migration PR. The database at postgres-a44daf96-2b75-4c00-9b07-fae7c6a6c1c2.cxxzc36no8yr.eu-central-1.rds.amazonaws.com:8772 began refusing connections at 15:05:07 UTC, roughly 37 minutes into the OQ run. By 15:08:35 UTC the application health check was returning 503 (both object-store health and DB health caches showed false); CF's liveness probe fired at 15:09:44 UTC and killed all 3 running instances. Every restart attempt since then has failed at Spring context initialization because HikariCP cannot acquire a connection during startup — this continues through the end of the OQ window and beyond.

The AWS SDK migration (multiapps-controller PR #1845, xsa-multiapps-controller PR #525) was running cleanly in the 37 minutes before the DB outage: zero S3 or AWS SDK errors appear in the pre-outage window (14:27–15:05 UTC), and the expected OQ scenario errors (auth failures, staging failures, content validation errors, circular-dependencies, etc.) all look normal in volume and kind. No regression suspect entries were surfaced by the triage classifier.

Bucket C is empty. The FAIL verdict is driven entirely by the infrastructure DB outage. The PR changes are not the root cause.


Local git state at analysis time

| Sub-project | Branch | Uncommitted (files) | On feature branch |
| --- | --- | --- | --- |
| multiapps | master | 0 | No |
| multiapps-controller | aws-sdk-migration | 1 (pom.xml) | Yes |
| xsa-multiapps-controller | aws-migration | 1 (pom.xml) | Yes |
| XSOQTests | fix-hey | 0 | Yes |
| cf-mta-examples | feature/LMCROSSITXSADEPLOY-3316-v2 | 1 (untracked build dir) | Yes |
| multiapps-cli-plugin | master | 0 | No |
| product-cf-hcp | staging-alerting | 5 | Yes |
| ds-load-tests-dashboard | update-readme | 0 | Yes (minor) |

Uncommitted changes of note:

  • multiapps-controller/pom.xml (modified, unstaged): bumps <multiapps.version> from 2.48.0 to 2.49.0-SNAPSHOT
  • xsa-multiapps-controller/pom.xml (modified, unstaged): bumps <multiapps.version> from 2.48.0 to 2.49.0-SNAPSHOT
  • These two changes are coordinated and are part of the version-alignment update for the PR deployment.
  • xsa-multiapps-controller also has an untracked nginx/scripts/ directory (not load-bearing for this analysis).

Deploy chain version pinning

| Source of truth | Truth value | Declared in downstream | Declared value | Status |
| --- | --- | --- | --- | --- |
| multiapps/pom.xml <version> | 2.49.0-SNAPSHOT | multiapps-controller <multiapps.version> | 2.49.0-SNAPSHOT | OK |
| multiapps/pom.xml <version> | 2.49.0-SNAPSHOT | xsa-multiapps-controller <multiapps.version> | 2.49.0-SNAPSHOT | OK |
| multiapps-controller/pom.xml <version> | 2.48.0-SNAPSHOT | xsa-multiapps-controller <multiapps-controller.version> | 2.48.0-SNAPSHOT | OK |

No version skew detected. The uncommitted pom.xml changes on both feature branches track multiapps 2.49.0-SNAPSHOT correctly.


Categorization

| Category | Count |
| --- | --- |
| Expected (test-driven, catalog matches) | 192 |
| Infrastructure / transient (no regression marker) | 40,670 |
| Potentially regression-related (Bucket C suspects) | 0 |

Triage classifier result: unexpected=0, indeterminate_with_regression_marker=0, suspects_written=0.


Per-suspect attribution

No Bucket C suspects. The triage produced zero entries requiring regression attribution.


Strong attributions (LIKELY_CAUSED_BY_PR)

None. No suspect entries were produced by the triage step.


Root cause analysis — database outage (infrastructure event)

Primary signal: HikariPool-1 - Pool is empty, failed to create/setup connection with PSQLException: Connection to postgres-a44daf96-2b75-4c00-9b07-fae7c6a6c1c2.cxxzc36no8yr.eu-central-1.rds.amazonaws.com:8772 refused.

Timeline of the outage:

| UTC Time | Event |
| --- | --- |
| 14:27:36 | App deployed (DEPLOY_END), health checks passed, OQ started |
| 14:27–15:05 | App running normally; 1,257 pre-outage WARN/ERROR entries, all expected OQ scenario errors; zero AWS SDK errors |
| 15:05:07 | First DB signal: HikariPool - Failed to validate connection ... (This connection has been closed.); stale connections dropping |
| 15:06:08 | DB outage confirmed: PSQLException: Connection to postgres-....rds.amazonaws.com:8772 refused across all 3 instances |
| 15:06:18 | Retrying operation that failed with message: Timeout while checking database health (health check retries begin) |
| 15:08:35 | Health check emits: Object store file storage health: "false", Database health: "false" → next /public/application-health call returns 503 |
| 15:09:44 | CF liveness check detects 503 → all 3 instances killed (Instance became unhealthy) |
| 15:10:08 | New instances fail startup: UnsatisfiedDependencyException: Error creating bean 'defaultEntityManagerFactory' ... Failed to initialize pool: The connection attempt failed. (24 events total: 8 restart waves × 3 instances) |
| 15:10–16:11 | All restart waves fail the same way; app remains 0/3 (crashed) through the end of the window |

Why ALL 113 OQ jobs failed: Once the DB outage knocked all 3 instances into crash-loop at 15:09 UTC, any OQ test that was still in-flight received connection errors or no-response, and any test that had not started yet could not communicate with the deploy-service. The retry wave of tests encountered the same crashed state. The 5 tests that passed are those that completed before 15:09 UTC.

Why the AWS SDK migration is NOT the root cause:

  1. Zero AWS SDK / S3 errors appear in the 37-minute healthy window (14:27–15:05 UTC).
  2. The objectStore health = false entry at 15:08:35 appeared after the DB connection failures began at 15:06 and is likely secondary — a transient S3 slowness in the same AWS eu-central-1 AZ as the failing RDS instance, or the health check thread pool being starved by blocked DB check threads.
  3. The crash chain is unambiguously driven by PSQLException (RDS port refused), not by any S3 API call.
  4. The AWS SDK migration touches only object store code (AwsS3ObjectStoreFileStorage, ObjectStoreFileStorageFactoryBean, ObjectStoreServiceInfoCreator) and the startup log confirms AwsS3ObjectStoreFileStorage was set to DEBUG level — the new class loaded and initialized correctly.

Secondary observation: The new AwsS3ObjectStoreFileStorage.testConnection() calls s3Client.listObjectsV2() with a 10-minute socket timeout (via UrlConnectionHttpClient), which is longer than the SINGLE_TASK_TIMEOUT_IN_SECONDS = 70s used in ApplicationHealthCalculator. If S3 becomes transiently slow (>70s response), the health check would flip to false and return 503 independently of the DB. This is a latent risk introduced by the PR (the old JClouds blobStore.blobExists() did not have an explicit 10-minute socket timeout). However, this scenario did NOT occur during this OQ run — the 503 was DB-driven. This is flagged here as a code-quality concern for the PR author to assess, not as a confirmed regression.


Failed scenarios provided by orchestrator

All 113 failed jobs are attributable to the database outage from 15:05 UTC onward, not to any specific scenario-level regression in the PR code.

The 5 jobs that likely succeeded (not in the failed list) completed before the DB outage at 15:09 UTC.

Expected errors (sample — top signatures from catalog triage)

| Signature | Count | Scenario |
| --- | --- | --- |
| Exception caught (ContentException / SLException from deliberate scenario errors) | 166 | generic-content-deploy |
| are not supported in the specified scope | 11 | unsupported-parameter-message |
| Execution of step .* has timed out | 6 | timeout-scenario |
| broker returned (broker failure) | 4 | various |
| Content deployment for module | 2 | generic-content-deploy |
| Failed to delete service | 2 | service-deletion-failed-scenario |
| App staging failed | 1 | app-staging-failure |

Infra/transient errors (top no-marker indeterminate signatures)

| Signature | Count | Classification |
| --- | --- | --- |
| Failed to write message to the audit log | 34,849 | Expected: audit log service not bound in OQ |
| Ignoring parameter "namespace" | 2,910 | Expected: namespace not used in most OQ scenarios |
| Notification for Unknown NOT sent to ANS | 1,169 | Expected: ANS not configured in OQ |
| Skipping deletion of services | 225 | Expected: scenario-driven |
| Error while closing command context | 176 | Expected: Flowable error propagation from deliberate scenario failures |
| Skipping notification for operation | 101 | Expected: missing mtaId/orgId in some scenarios |
| Error occurred while executing REST API call | 24 | Expected: OQ scenarios deliberately send bad payloads (wrong Content-Type, malformed JSON, not-found operation IDs) |
| Cannot send audit log / Missing audit log credentials | 48 | Expected: audit log service not bound |
| Exception sending context initialized event to ContextLoader | 24 | DB outage: startup failure; NOT a PR regression |
| HikariPool connection failures (PSQLException) | 46 | DB outage: RDS infrastructure event |
| Timeout while checking database health | 6 | DB outage: health check timeout |
| Object store/database health = false | 3+ | DB outage: health check 503 trigger |

Generated by log-analyzer. Mode: oq. Generated 2026-06-05T19:00:00Z. Consumed by pr-result-publisher.


Posted by pr-result-publisher. Mode: oq. Posted on 2 PR(s) in this run. Generated 2026-06-05T19:30:00Z.
