Support for docker compose > v2 #5739

@sateeshmannar

Description

PySDK Version

  • PySDK V2 (2.x)
  • PySDK V3 (3.x)

Describe the bug
Unable to run SageMaker training in "local" mode with Docker Compose v5.

Component: sagemaker.train.local.local_container._LocalContainer._get_compose_cmd_prefix

_get_compose_cmd_prefix() recognizes Docker Compose only by checking
"v2" in output. Docker Compose v5.x (and v3, v4, or any future major
version) is therefore rejected with a misleading error:

ImportError: Docker Compose is not installed.
Local Mode features will not work without docker compose.

even though docker compose is fully installed and functional.

The substring check "v2" in output is too narrow. Every Compose
version with a major number other than 2 is treated as "not installed".
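
A minimal sketch of that failure mode, assuming the condition shown in the stack trace (`output and "v2" in output.strip()`). The helper name `current_check` and the sample list are hypothetical; the two version strings are the ones quoted elsewhere in this report.

```python
# Sample outputs of `docker compose version`, as quoted in this report.
SAMPLE_OUTPUTS = [
    "Docker Compose version v2.27.0\n",
    "Docker Compose version v5.1.1\n",
]


def current_check(output):
    """Hypothetical mirror of the reported condition in _get_compose_cmd_prefix."""
    return bool(output) and "v2" in output.strip()


# Map each version string to whether the substring check accepts it.
results = {out.strip(): current_check(out) for out in SAMPLE_OUTPUTS}
for version_line, accepted in results.items():
    print(f"{version_line!r}: accepted={accepted}")
```

Only the v2 string passes, so a working v5 installation falls through to the docker-compose lookup and then to the ImportError.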

To reproduce

Prerequisites for the test:
docker pull docker:latest
# or tag an existing image as docker:latest

# Confirm the Compose major version shipped in the image is greater than 2
docker run --rm docker:latest docker compose version
# Docker Compose version v5.1.1


import re
import subprocess

from unittest import mock



def test_local_train_triggers_compose_v5_bug(tmp_path):
    """Invoke sagemaker local-mode training to prove Docker Compose v5 triggers the bug.

    This test exercises the real call chain:
        ModelTrainer.train()
          → _LocalContainer.train()
            → _LocalContainer._generate_compose_command()
              → _LocalContainer._get_compose_cmd_prefix()   ← bug lives here

    The bug: ``_get_compose_cmd_prefix`` only accepts output containing "v2".
    Any Compose v3/v4/v5 output causes an ``ImportError`` even though Docker
    Compose is fully installed and functional.

    Assumes ``docker:latest`` is already pulled on the local workstation.
    """
    import os
    import boto3
    from moto import mock_aws
    from sagemaker.core.local import LocalSession
    from sagemaker.core.training.configs import Compute, InputData, OutputDataConfig, SourceCode
    from sagemaker.train.model_trainer import ModelTrainer, Mode

    # Assumes a local docker:latest image is already pulled; its
    # `docker compose version` output supplies the real version string.
    # If this test starts failing due to a Compose version update, switch to
    # an image whose Compose version string triggers the bug (v3+).
    PUBLIC_DOCKER_IMAGE = "docker:latest"

    # Step 1: get the real compose version string from the local docker:latest image
    result = subprocess.run(  # nosec B603 B607
        ["docker", "run", "--rm", PUBLIC_DOCKER_IMAGE, "docker", "compose", "version"],
        capture_output=True,
        text=True,
        timeout=30,  # image assumed already pulled — no pull delay
    )
    assert result.returncode == 0, (
        f"Could not get compose version from {PUBLIC_DOCKER_IMAGE}:\n{result.stderr}"
    )
    real_version_output = result.stdout  # e.g. "Docker Compose version v5.1.1\n"

    match = re.search(r"v(\d+)", real_version_output.strip())
    assert match is not None, f"Could not parse version from: {real_version_output!r}"
    major_version = int(match.group(1))

    if major_version == 2:
        print(
            f"docker:latest ships Compose v{major_version} — bug only triggers on v3+. "
            "Update PUBLIC_DOCKER_IMAGE to an image that ships Compose v3+."
        )

    # Step 2: build a minimal LocalSession backed by moto
    with mock_aws():
        boto_session = boto3.Session(region_name="us-east-1")
        s3 = boto_session.client("s3")
        s3.create_bucket(Bucket="sagemaker-bug-repro")

        sm_session = LocalSession(
            boto_session=boto_session,
            default_bucket="sagemaker-bug-repro",
            sagemaker_config={
                "SchemaVersion": "1.0",
                "SageMaker": {"PythonSDK": {"Modules": {"TelemetryOptOut": True}}},
            },
        )
        sm_session.config = {
            "local": {"local_code": True, "container_root": str(tmp_path)}
        }
        sm_session._default_bucket = "sagemaker-bug-repro"

        # Step 3: build a ModelTrainer pointing at the public docker:latest image.
        # We mock subprocess.check_output so the SDK receives the *real* version
        # string from the public image instead of whatever is installed on this host.
        trainer = ModelTrainer(
            training_image=PUBLIC_DOCKER_IMAGE,
            role="arn:aws:iam::123456789012:role/fake-role",
            compute=Compute(instance_type="local_cpu", instance_count=1),
            sagemaker_session=sm_session,
            output_data_config=OutputDataConfig(s3_output_path="s3://sagemaker-bug-repro/output"),
            base_job_name="compose-v5-bug",
            training_mode=Mode.LOCAL_CONTAINER,
            local_container_root=str(tmp_path),
            source_code=SourceCode(command="true"),  # no-op command
        )

        # Step 4: patch subprocess so _get_compose_cmd_prefix sees the real
        # version string from docker:latest (v5.x); train() then raises ImportError.
        with (
            mock.patch("subprocess.check_output", return_value=real_version_output),
            mock.patch("shutil.which", return_value=None),
        ):
            trainer.train(wait=False, logs=False)

Expected behavior
SageMaker training in local mode should accept any installed Docker Compose version 2 or later and run without raising ImportError.

Screenshots or logs (stack trace)

.tox/py313/lib/python3.13/site-packages/sagemaker/core/telemetry/telemetry_logging.py:187: in wrapper  
    raise caught_ex  
.tox/py313/lib/python3.13/site-packages/sagemaker/core/telemetry/telemetry_logging.py:153: in wrapper  
    response = func(*args, **kwargs)  
               ^^^^^^^^^^^^^^^^^^^^^  
.tox/py313/lib/python3.13/site-packages/sagemaker/core/workflow/pipeline_context.py:346: in wrapper  
    return run_func(*args, **kwargs)  
           ^^^^^^^^^^^^^^^^^^^^^^^^^  
.tox/py313/lib/python3.13/site-packages/pydantic/_internal/_validate_call.py:39: in wrapper_function  
    return wrapper(*args, **kwargs)  
           ^^^^^^^^^^^^^^^^^^^^^^^^  
.tox/py313/lib/python3.13/site-packages/pydantic/_internal/_validate_call.py:136: in __call__  
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))  
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  
.tox/py313/lib/python3.13/site-packages/sagemaker/train/model_trainer.py:813: in train  
    local_container.train(wait)  
.tox/py313/lib/python3.13/site-packages/sagemaker/train/local/local_container.py:237: in train  
    compose_command = self._generate_compose_command(wait)  
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  
.tox/py313/lib/python3.13/site-packages/sagemaker/train/local/local_container.py:479: in _generate_compose_command  
    _compose_cmd_prefix = self._get_compose_cmd_prefix()  
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _   
self = _LocalContainer(training_job_name='a-test-20260408132342', instance_type='local_cpu', instance_count=1, image='glm-sta...er_arguments=['-c', 'chmod +x /opt/ml/input/data/sm_drivers/sm_train.sh && /opt/ml/input/data/sm_drivers/sm_train.sh'])  
    def _get_compose_cmd_prefix(self) -> List[str]:  
        """Gets the Docker Compose command.  
      
        The method initially looks for 'docker compose' v2  
        executable, if not found looks for 'docker-compose' executable.  
      
        Returns:  
            List[str]: Docker Compose executable split into list.  
      
        Raises:  
            ImportError: If Docker Compose executable was not found.  
        """  
        compose_cmd_prefix = []  
      
        output = None  
        try:  
            output = subprocess.check_output(  
                ["docker", "compose", "version"],  
                stderr=subprocess.DEVNULL,  
                encoding="UTF-8",  
            )  
        except subprocess.CalledProcessError:  
            logger.info(  
                "'Docker Compose' is not installed. "  
                "Proceeding to check for 'docker-compose' CLI."  
            )  
      
        if output and "v2" in output.strip():  
            logger.info("'Docker Compose' found using Docker CLI.")  
            compose_cmd_prefix.extend(["docker", "compose"])  
            return compose_cmd_prefix  
      
        if shutil.which("docker-compose") is not None:  
            logger.info("'Docker Compose' found using Docker Compose CLI.")  
            compose_cmd_prefix.extend(["docker-compose"])  
            return compose_cmd_prefix  
      
>       raise ImportError(  
            "Docker Compose is not installed. "  
            "Local Mode features will not work without docker compose. "  
            "For more information on how to install 'docker compose', please, see "  
            "https://docs.docker.com/compose/install/"  
        )  
E       ImportError: Docker Compose is not installed. Local Mode features will not work without docker compose. For more information on how to install 'docker compose', please, see https://docs.docker.com/compose/install/  
.tox/py313/lib/python3.13/site-packages/sagemaker/train/local/local_container.py:638: ImportError
# ═══════════════════════════════════════════════════════════════════════════
# PROPOSED FIX – standalone validation of the corrected logic
# ═══════════════════════════════════════════════════════════════════════════


def _fixed_compose_check(output: str | None) -> bool:
    """Proposed replacement for the ``"v2" in output`` check.

    Accepts any Docker Compose plugin version >= 2.0.0.

    Examples
    --------
    >>> _fixed_compose_check("Docker Compose version v5.1.1")
    True
    >>> _fixed_compose_check("Docker Compose version v2.27.0")
    True
    >>> _fixed_compose_check("Docker Compose version v1.29.2")
    False
    >>> _fixed_compose_check("")
    False
    >>> _fixed_compose_check(None)
    False
    """
    if not output:
        return False
    match = re.search(r"v(\d+)", output.strip())
    return match is not None and int(match.group(1)) >= 2
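
As a rough sketch of how this check might slot into the lookup shown in the stack trace, the standalone function below repeats the same steps with the major-version test in place of the substring check. The function name `compose_cmd_prefix_sketch` and its injectable `check_output` / `which` parameters are hypothetical and exist only so the logic can be exercised without Docker. Catching `FileNotFoundError` is an added assumption, since the method in the traceback catches only `CalledProcessError`.

```python
import re
import shutil
import subprocess
from typing import Callable, List, Optional


def compose_cmd_prefix_sketch(
    check_output: Callable[..., str] = subprocess.check_output,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Hypothetical standalone variant of _get_compose_cmd_prefix."""
    output = None
    try:
        output = check_output(
            ["docker", "compose", "version"],
            stderr=subprocess.DEVNULL,
            encoding="UTF-8",
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Assumption: also tolerate a missing `docker` binary and fall
        # through to the standalone docker-compose lookup.
        pass

    # Accept any plugin whose reported major version is 2 or higher.
    match = re.search(r"v(\d+)", output.strip()) if output else None
    if match is not None and int(match.group(1)) >= 2:
        return ["docker", "compose"]

    # Fall back to the standalone docker-compose executable.
    if which("docker-compose") is not None:
        return ["docker-compose"]

    raise ImportError("Docker Compose is not installed.")
```

Passing stub callables, for example `compose_cmd_prefix_sketch(lambda *a, **k: "Docker Compose version v5.1.1\n", lambda name: None)`, exercises the version branch without a Docker installation.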

System information

  • SageMaker Python SDK version: sagemaker-train (>=1.6.0,<2.0.0) / sagemaker-core (>=2.7.1,<3.0.0)
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): NA
  • Framework version: sagemaker-train (>=1.6.0,<2.0.0) / sagemaker-core (>=2.7.1,<3.0.0)
  • Python version: 3.13
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N

