
switch envs, resources are shared on multiple envs and are exclusive #1303

Draft
dnsi0 wants to merge 24 commits into main from feat/adapt-node-envs

Conversation

@dnsi0
Contributor

@dnsi0 dnsi0 commented Mar 27, 2026

Fixes #1280.

Changes proposed in this PR:

  • Multiple environments per engine: Each Docker engine now supports multiple compute environments, each with its own full resource definitions (cpu, ram, disk, GPUs), fees, access controls, and free-tier config.
  • Per-environment exclusive resources with global validation: CPU, RAM, and disk are exclusive per environment, inUse is tracked only within the environment where the job runs. A dual validation ensures both the target environment has capacity and the global aggregate across all environments doesn't exceed physical limits. GPUs are shared-exclusive (if gpu0 is used in envA, it shows in-use on envB too). CPU cores are hard-partitioned per environment via explicit cpuCores arrays with overlap validation at config time.
  • Removed engine-level resource config: All resource definitions moved from C2DDockerConfig to per-environment C2DEnvironmentConfig. No more engine-wide resources, fees, access, storageExpiry, maxJobDuration, minJobDuration, maxJobs, or free fields on the docker config — these are now per-environment.
  • New C2DEnvironmentConfig interface: Added with resources: ComputeResource[], cpuCores: number[], and all environment-specific settings.
  • Zod schema updates: Per-environment validation (fees/free required, storageExpiry >= maxJobDuration), CPU core overlap validation across environments within a cluster.
  • Simplified engine initialization: Removed cpuOffset cascading between engines, removed buildEnvResources()/resolveEnvResources() merging logic — environments directly own their resources. CPU and RAM defaults are still auto-detected from Docker sysinfo when not configured.
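The config-time CPU core overlap check mentioned above can be sketched as follows. This is a simplified illustration, not the PR's actual Zod implementation; the EnvConfigSketch shape and findCpuCoreOverlaps name are assumptions that only mirror the described cpuCores partitioning.

```typescript
// Hypothetical sketch: each environment hard-partitions CPU cores via an
// explicit cpuCores array, and no core may appear in two environments.
interface EnvConfigSketch {
  id: string;
  cpuCores: number[]; // cores owned exclusively by this environment
}

// Returns the sorted list of core ids claimed by more than one environment.
function findCpuCoreOverlaps(envs: EnvConfigSketch[]): number[] {
  const seen = new Map<number, string>();
  const overlaps = new Set<number>();
  for (const env of envs) {
    for (const core of env.cpuCores) {
      if (seen.has(core)) overlaps.add(core);
      else seen.set(core, env.id);
    }
  }
  return [...overlaps].sort((a, b) => a - b);
}

const envs: EnvConfigSketch[] = [
  { id: 'envA', cpuCores: [0, 1, 2, 3] },
  { id: 'envB', cpuCores: [3, 4, 5] } // core 3 collides with envA
];
console.log(findCpuCoreOverlaps(envs)); // [3]
```

A validator along these lines would reject the configuration above at load time instead of letting two environments pin jobs to the same core.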

@dnsi0 dnsi0 self-assigned this Mar 27, 2026
@dnsi0 dnsi0 force-pushed the feat/adapt-node-envs branch from f67b833 to f6422f2 on March 30, 2026 08:35
@dnsi0 dnsi0 marked this pull request as ready for review March 30, 2026 10:38
@dnsi0 dnsi0 marked this pull request as draft March 30, 2026 10:39
@dnsi0
Contributor Author

dnsi0 commented Mar 30, 2026

/run-security-scan

Member

@alexcos20 alexcos20 left a comment


AI automated code review (Gemini 3).

Overall risk: medium

Summary:
This Pull Request introduces a significant architectural change to the Compute-to-Data (C2D) environment configuration. It refactors the system to support multiple, distinct compute environments per Docker engine, providing more granular control over resource allocation, fees, and access policies. The DOCKER_COMPUTE_ENVIRONMENTS configuration now expects an array of environments within each Docker cluster definition. Resource management has been enhanced with per-environment resource tracking (CPU, RAM, Disk), global resource availability checks to prevent over-allocation, and explicit handling for shared-exclusive resources like GPUs. Type definitions (C2D.ts) and configuration schemas (schemas.ts) have been updated to reflect this new hierarchical structure. All example and test configurations are adjusted accordingly. This is a major improvement for C2D flexibility and scalability.

Comments:
• [ERROR][bug] The ComputeEnvironmentFreeOptions interface in C2D.ts (which C2DEnvironmentConfig.free uses) does not define minJobDuration or maxJobDuration. However, the ci.yml, config.json, and various integration tests define minJobDuration and maxJobDuration within the free configuration. Furthermore, compute_engine_docker.ts in the start() method explicitly copies these properties if found in envDef.free. This creates a type and schema inconsistency. Please add minJobDuration?: number and maxJobDuration?: number to ComputeEnvironmentFreeOptions in src/@types/C2D/C2D.ts and ComputeEnvironmentFreeOptionsSchema in src/utils/config/schemas.ts.
• [ERROR][bug] The C2DDockerConfigSchema no longer enforces the presence of a 'disk' resource at the top level, as resources are now per-environment. However, C2DEnvironmentConfigSchema also does not enforce it, and C2DEngineDocker.start() does not automatically add a 'disk' resource to env.resources if not explicitly provided (unlike CPU and RAM). A 'disk' resource is fundamental for almost all compute jobs. This could lead to environments being configured without essential storage. Please consider adding a .refine rule to C2DEnvironmentConfigSchema to ensure a 'disk' resource is always present, or adding default 'disk' resources in C2DEngineDocker.start().
• [WARNING][other] This PR introduces a breaking change to the DOCKER_COMPUTE_ENVIRONMENTS configuration structure. The documentation (e.g., README.md) and release notes should clearly communicate this change, provide migration steps for existing deployments, and explain the benefits of the new multi-environment setup.
• [INFO][performance] The envResourceMap creation for (env.resources || []).map((r) => [r.id, r]) within the _getComputeEnvironmentInfo method is done inside the loop for each env. This map is then used inside the for (const job of jobs) loop. While likely negligible for typical scenarios, creating this map repeatedly might be slightly inefficient if there are many environments or jobs. Could envResourceMap be created once per env when the environments are initialized in C2DEngineDocker.start() and stored on the env object or a separate map for quicker lookup? This is a minor optimization suggestion.
• [INFO][style] Consider adding JSDoc comments to the new processFeesForEnvironment method and updating existing JSDocs for methods like checkIfResourcesAreAvailable and allocateCpus to reflect their new parameters and responsibilities. This will improve code clarity and maintainability.
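The two ERROR items above both call for stricter per-environment validation. A plain-function sketch of those checks (written without Zod for brevity; the EnvSketch field names only mirror the PR description, and the exact schema types are assumptions):

```typescript
// Hedged sketch of the per-environment validation the review asks for:
// a 'disk' resource must exist, storageExpiry must cover maxJobDuration,
// and each environment needs either fees or a free-tier config.
interface ResourceSketch { id: string; total: number; }
interface EnvSketch {
  resources: ResourceSketch[];
  storageExpiry: number;
  maxJobDuration: number;
  fees?: unknown;
  free?: unknown;
}

function validateEnvironment(env: EnvSketch): string[] {
  const errors: string[] = [];
  if (!env.resources.some((r) => r.id === 'disk')) {
    errors.push("a 'disk' resource must be defined");
  }
  if (env.storageExpiry < env.maxJobDuration) {
    errors.push('storageExpiry must be >= maxJobDuration');
  }
  if (env.fees === undefined && env.free === undefined) {
    errors.push("either 'fees' or 'free' must be configured");
  }
  return errors;
}

const bad: EnvSketch = {
  resources: [{ id: 'cpu', total: 8 }], // no disk
  storageExpiry: 100,
  maxJobDuration: 3600
};
console.log(validateEnvironment(bad)); // all three checks fail here
```

In the actual schema these checks would live in .refine rules on C2DEnvironmentConfigSchema rather than a standalone function.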

@dnsi0
Contributor Author

dnsi0 commented Mar 30, 2026

/run-security-scan

Member

@alexcos20 alexcos20 left a comment


AI automated code review (Gemini 3).

Overall risk: medium

Summary:
This pull request introduces a significant architectural change to the Compute-to-Data (C2D) component, allowing for multiple, distinct compute environments within a single Docker cluster. The configuration for compute environments is now hierarchical, enabling granular control over resources, fees, access, and job durations for each environment. It also implements global resource availability checks to prevent over-allocation across environments and refactors CPU core allocation to be environment-specific. This is a substantial improvement in flexibility and resource management, but it introduces a breaking change in the configuration structure.

Comments:
• [ERROR][other] This PR introduces a breaking change to the dockerComputeEnvironments configuration structure. Previously, parameters like storageExpiry, maxJobDuration, fees, and resources were directly under the C2DDockerConfig object. Now, they must be nested within an environments array. While the .env.example and config.json files are updated, existing deployments will require manual configuration migration. Please ensure this breaking change is clearly documented in the release notes with migration instructions.
• [INFO][style] Good catch correcting the typo from `1 hours` to `1 hour` in the paymentClaimInterval comment.
• [INFO][performance] The addition of envResourceMap for efficient resource lookup is a good optimization, especially with the introduction of multiple environments. This avoids repeated array iteration.
• [WARNING][bug] The physicalLimits map correctly initializes for 'cpu' and 'ram' based on sysinfo. However, there's no corresponding initialization for 'disk'. While the checkGlobalResourceAvailability method for disk will sum total values from configured environments, it won't necessarily reflect the actual physical disk capacity of the host unless manually configured to match. Is there a plan to automatically detect physical disk capacity or enforce a configuration check to ensure the sum of disk.total across environments does not exceed the physical limit?
• [INFO][style] The ComputeEnvironmentFreeOptionsSchema now includes minJobDuration. This is consistent with the paid environment options. Good to ensure consistency here.
• [INFO][other] The validation rule environments: z.array(C2DEnvironmentConfigSchema).min(1) ensures that at least one compute environment is configured within a Docker cluster. This is a sensible default and prevents misconfigurations.
• [INFO][other] The environment ID fallback to String(envIdx) if not explicitly provided is functional for ensuring uniqueness. For user-friendliness and traceability in logs/monitoring, it might be beneficial to encourage explicit id definitions in the configuration, or generate a more descriptive ID based on other environment properties if id is missing.
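The global over-allocation concern raised above (physicalLimits for cpu/ram, but not disk) comes down to one invariant: the in-use total across all environments plus the new request must stay within the physical capacity. A minimal sketch of that check, with all names illustrative rather than taken from the PR:

```typescript
// Minimal sketch in the spirit of checkGlobalResourceAvailability:
// aggregate a resource's in-use amounts across every environment and
// compare against the host's physical limit (e.g. from Docker sysinfo).
interface TrackedResource { id: string; inUse: number; }
interface EnvState { resources: TrackedResource[]; }

function globallyAvailable(
  envs: EnvState[],
  resourceId: string,
  requested: number,
  physicalLimit: number
): boolean {
  const inUseTotal = envs.reduce((sum, env) => {
    const r = env.resources.find((res) => res.id === resourceId);
    return sum + (r ? r.inUse : 0);
  }, 0);
  return inUseTotal + requested <= physicalLimit;
}

const state: EnvState[] = [
  { resources: [{ id: 'cpu', inUse: 4 }] }, // envA
  { resources: [{ id: 'cpu', inUse: 2 }] } // envB
];
console.log(globallyAvailable(state, 'cpu', 2, 8)); // true  (6 + 2 <= 8)
console.log(globallyAvailable(state, 'cpu', 3, 8)); // false (6 + 3 > 8)
```

For disk, the open question from the review is where physicalLimit comes from: without host detection, the sum of configured disk.total values is the only bound this check can enforce.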

@dnsi0
Contributor Author

dnsi0 commented Mar 30, 2026

/run-security-scan

Member

@alexcos20 alexcos20 left a comment


AI automated code review (Gemini 3).

Overall risk: medium

Summary:
This pull request introduces a major architectural refactoring of the compute environment configuration, allowing a single compute node to host multiple distinct compute environments with individual resource definitions, fee structures, and access controls. It moves core environment settings into a nested environments array within the Docker compute configuration. Key improvements include granular per-environment resource management (CPU, RAM, disk), a new global resource availability check to prevent over-allocation, and robust Zod schema validation for the new configuration structure. This is a significant enhancement for compute providers offering varied services.

Comments:
• [INFO][other] The updated example for DOCKER_COMPUTE_ENVIRONMENTS clearly demonstrates the new nested environments array structure. This is a breaking change for existing deployments, so clear migration notes will be essential.
• [INFO][other] The introduction of C2DEnvironmentConfig and the modification of C2DDockerConfig correctly reflect the new multi-environment architecture. This provides much-needed flexibility.
• [INFO][other] The fix from '1 hours' to '1 hour' for paymentClaimInterval comment is a minor but welcome detail correction.
• [INFO][performance] The envResourceMap creation is a good optimization for resource lookups within an environment, improving readability and potentially performance in resource-intensive loops.
• [INFO][other] The new logic distinguishing between shared-exclusive (GPU) and per-env exclusive (CPU, RAM, disk) resource tracking is crucial for accurate multi-environment resource management. This addresses a common challenge in container orchestration.
• [INFO][security] The checkGlobalResourceAvailability method is a critical addition for preventing over-allocation of physical resources across multiple logical compute environments. This enhances the stability and security of the compute node.
• [INFO][other] The processFeesForEnvironment method extracts and centralizes fee processing logic, which improves code organization and maintainability.
• [INFO][other] The start() method's refactoring to iterate over envConfig.environments and dynamically create ComputeEnvironment objects is the core of the multi-environment support. This is a robust implementation of the new design.
• [WARNING][performance] The cpuOffset was removed from the constructor, and CPU allocation logic is now per-environment. While the new per-environment CPU allocation (using envCpuCoresMap) is correct, it's essential to confirm that the cpuOffset functionality for multiple physical Docker clusters (if that was ever a use case) hasn't been implicitly removed without replacement. Currently, C2DEngines iterates over c2dClusters, but each C2DEngineDocker instance is created for a single cluster. If there's a use case for multiple DOCKER clusters on one node that need different CPU affinities, that may need re-evaluation.
• [INFO][performance] Passing envId to allocateCpus and using envCpuCoresMap for environment-specific CPU pinning is a good approach to ensure fair and isolated CPU resource allocation for jobs running in different configured environments.
• [INFO][other] The C2DEnvironmentConfigSchema with its refine clauses provides excellent validation for the new environment configurations. The mandatory disk resource and storageExpiry vs maxJobDuration checks are important for operational stability. The requirement for either fees or free configuration per environment is also sensible.
• [INFO][other] Adding minJobDuration to ComputeEnvironmentFreeOptionsSchema with a default of 60 seconds is a good consistency improvement. This ensures free jobs also have a minimum duration constraint.
• [WARNING][test] In some integration tests, paymentClaimInterval is now set on the DOCKER_COMPUTE_ENVIRONMENTS string at the top level of the Docker config, not within the nested environments array. This is consistent with the C2DDockerConfig definition, but it's important to ensure that this global paymentClaimInterval is respected and applies correctly to all nested environments, or if it should be an environment-specific setting.
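The shared-exclusive GPU semantics discussed in this review (a GPU used in envA shows as in-use on envB too) can be sketched with a single engine-wide allocation set. This is an illustrative model only; the function and variable names are assumptions, not the PR's actual API.

```typescript
// Sketch of shared-exclusive GPU tracking: allocation state is engine-wide,
// so every environment that lists a GPU sees the same in-use status.
const gpuInUse = new Set<string>();

// Exclusive across ALL environments: a second allocation of the same GPU
// fails even if requested from a different environment.
function allocateGpu(gpuId: string): boolean {
  if (gpuInUse.has(gpuId)) return false;
  gpuInUse.add(gpuId);
  return true;
}

// Each environment's status view is derived from the shared state.
function gpuStatusForEnv(envGpuIds: string[]): Record<string, boolean> {
  return Object.fromEntries(envGpuIds.map((id) => [id, gpuInUse.has(id)]));
}

allocateGpu('gpu0'); // a job in envA takes gpu0
console.log(gpuStatusForEnv(['gpu0', 'gpu1'])); // envB also sees gpu0 busy
```

This contrasts with CPU, RAM, and disk in the PR, whose inUse counters live inside the environment that runs the job.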



Development

Successfully merging this pull request may close these issues.

Do not share compute resources between environments
