Skip to content

fix(env): live process.env reads + late-env detection for env-key providers#27719

Open
jerome-benoit wants to merge 91 commits into
anomalyco:devfrom
jerome-benoit:fix-env-caching-12698
Open

fix(env): live process.env reads + late-env detection for env-key providers#27719
jerome-benoit wants to merge 91 commits into
anomalyco:devfrom
jerome-benoit:fix-env-caching-12698

Conversation

@jerome-benoit
Copy link
Copy Markdown
Contributor

@jerome-benoit jerome-benoit commented May 15, 2026

Issue for this PR

Closes #12698

Type of change

  • Bug fix
  • New feature
  • Refactor / code improvement
  • Documentation

What does this PR do?

Env.all() previously snapshotted process.env at init time via InstanceState, so env vars set after init were never detected by Env.get/Env.all, and Env.set only mutated the internal copy (never process.env).

Core fix (env/index.ts)

Replaces the snapshot with Layer.succeed + Effect.sync that proxies directly to process.env:

  • get(key)Effect.sync(() => process.env[key]) — live read
  • all()Effect.sync(() => ({ ...process.env })) — fresh shallow copy each call
  • set(key, value) → writes to process.env directly
  • remove(key) → deletes from process.env directly

Env.Interface is unchanged; no consumer modifications required.

Provider-side changes (provider.ts)

The env/index.ts fix alone is not sufficient because provider.list() was returning a frozen snapshot built at init time. Late env vars now flow through:

  • cleanedDatabase: env-eligible providers with models pre-filtered, computed once at init.
  • currentProviders(state, envs) / resolveEnvOverlay: pure functions that produce the live providers map by overlaying the live env on top of the init-time cachedProviders. All public read paths (list, getProvider, getModel, getLanguage, closest, getSmallModel, defaultModel) go through this overlay.
  • getLanguage cache key is hashIdentity({providerID, modelID, key, options}) so a rotated env key invalidates the cached LanguageModel. hashIdentity maps function-valued options to a name-tagged sentinel so two distinct closures (e.g. AWS credentialProvider instances) don't silently collide.

dynamicEnv: false opt-out (multi-env-var providers)

Providers whose options derive from multiple env vars (amazon-bedrock, sap-ai-core, azure, azure-cognitive-services, google-vertex, cloudflare-ai-gateway, cloudflare-workers-ai) cannot be reconstituted from env alone — their loaders close over auth metadata, dynamic SDK imports, region/deployment IDs, etc. These return dynamicEnv: false and stay excluded from cleanedDatabase. When their env later appears, a one-time warn is emitted and the provider stays hidden until OpenCode restart.

This means the original reproducer of #12698 (setting AICORE_SERVICE_KEY after init) still requires a restart. The user now gets a clear warning rather than silence. Single-env-key providers (anthropic, openai, openrouter, etc.) work fully without restart, which is the primary class of plugins triggering #12698 in practice.

Test isolation

Because Env.set now mutates the global process.env, test/preload.ts adds an afterEach snapshot/restore so per-test isolation is automatic regardless of contributor discipline. The ENV_BASELINE is captured after Log.init() and initProjectors() so any import-time side effects are part of the baseline. The deletion list is extended for the env vars referenced by this PR's loaders (AICORE_, AWS_CONTAINER_, AWS_WEB_IDENTITY_, AWS_ROLE_ARN, GOOGLE_APPLICATION_CREDENTIALS, GOOGLE_VERTEX_, CLOUDFLARE_*, AZURE_COGNITIVE_SERVICES_RESOURCE_NAME, GITLAB_INSTANCE_URL).

Known limitations

  • Multi-instance credential clobbering: Env.set writes process-wide. In multi-instance hosts (e.g. desktop sidecar serving several workspaces) two workspaces authenticating to the same provider with different credentials clobber each other — last write wins. The TODO(multi-instance) comments in src/provider/provider.ts and src/config/config.ts mark the call sites that persist auth-derived values through Env.set. Fixing this requires routing per-workspace credentials directly into SDK constructors instead of via process.env; that is a separate, larger refactor and out of scope for this PR.
  • Multi-env-var providers still need restart: see dynamicEnv: false section above. A future enhancement could allow loaders to declare rerunOn: string[] so the loader is re-evaluated when one of its env inputs changes; this would cover sap-ai-core, amazon-bedrock, etc. Out of scope for this PR.

Internal API surface

resolveEnvOverlay, currentProviders, and the bare layer export from provider.ts are tagged @internal — they exist only so the test suite can build deterministic overlay scenarios and stub RuntimeFlags. Production code MUST use Provider.defaultLayer and the Provider.Service methods. They are subject to change without notice.

How did you verify your code works?

  • bun typecheck passes
  • bun test packages/opencode/test/provider/provider.test.ts — 103 pass
  • bun test packages/opencode/test/provider/overlay.test.ts — 14 pass (covers every resolveEnvOverlay case, including the multi-env-candidate cached-key preserve case)
  • bun test packages/opencode/test/provider/transform.test.ts packages/opencode/test/provider/model-status.test.ts packages/opencode/test/provider/digitalocean.test.ts — 230 pass
  • bun test packages/opencode/test/effect/runtime-flags.test.ts packages/opencode/test/config/config.test.ts — 118 pass
  • Reviewed every env.all() / env.get() consumer in provider.ts, config.ts, tool/registry.ts — all are read-only and unaffected by snapshot → live-proxy
  • Reproduced the original issue: started without ANTHROPIC_API_KEY, set it after init, called provider.list()anthropic now appears in connected (failed before fix)

Screenshots / recordings

N/A — no UI change.

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

Copilot AI review requested due to automatic review settings May 15, 2026 10:45
@github-actions github-actions Bot added needs:compliance This means the issue will auto-close after 2 hours. contributor labels May 15, 2026
@github-actions github-actions Bot removed the needs:compliance This means the issue will auto-close after 2 hours. label May 15, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Thanks for updating your PR! It now meets our contributing guidelines. 👍

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the environment service to read and write process.env directly so environment variables set after startup can be observed by runtime consumers.

Changes:

  • Replaces the cached Env state with direct process.env access.
  • Removes now-obsolete direct env writes/comments in config/provider code.
  • Adds more test preload env cleanup entries for provider-related variables.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
packages/opencode/src/env/index.ts Changes Env service methods to proxy through process.env.
packages/opencode/src/config/config.ts Removes redundant direct console token env assignment.
packages/opencode/src/provider/provider.ts Removes obsolete TODO comments around direct env usage.
packages/opencode/test/preload.ts Expands initial test env cleanup list.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/opencode/src/env/index.ts Outdated
return Service.of({ get, all, set, remove })
Service.of({
get: Effect.fn("Env.get")((key: string) => Effect.sync(() => process.env[key])),
all: Effect.fn("Env.all")(() => Effect.sync(() => ({ ...process.env }) as State)),
Comment on lines +80 to +89
delete process.env["AICORE_SERVICE_KEY"]
delete process.env["AICORE_DEPLOYMENT_ID"]
delete process.env["AICORE_RESOURCE_GROUP"]
delete process.env["GOOGLE_APPLICATION_CREDENTIALS"]
delete process.env["CLOUDFLARE_ACCOUNT_ID"]
delete process.env["CLOUDFLARE_GATEWAY_ID"]
delete process.env["CLOUDFLARE_API_TOKEN"]
delete process.env["SINGLE_ENV_KEY"]
delete process.env["MULTI_ENV_KEY_1"]
delete process.env["FALLBACK_KEY"]
Cover all env vars declared by providers in models-api.json plus the
test-fixture pairs referenced in provider.test.ts, so host env (or
in-process leaks via Env.set) cannot perturb provider detection.
…call

Provider state cached via InstanceState read env.all() once at init,
so providers whose env keys were set after the first list() call
remained invisible (issue anomalyco#12698 surface beyond the Env layer).

Overlay live env detection on top of the cached providers in list()
and getProvider(), respecting disabled_providers, enabled_providers,
blacklist, whitelist and model status filters via a precomputed
envEligible set. Cached state retains expensive init work
(catalog/plugins/SDK loaders); only env-driven detection re-runs.
Closes the half-fix where overlayEnv() was wired into list/getProvider only.
getModel, getLanguage, closest, getSmallModel, defaultModel and resolveSDK all
read s.providers directly, so a provider detected from a late env mutation
appeared in list() but crashed when actually used.

State refactor:
- rename providers -> cachedProviders (invariant: never holds env-only entries)
- replace database+envEligible with cleanedDatabase (env-eligible providers with
  models pre-filtered for status/blacklist/whitelist/chat-aliases AND non-autoload
  custom-loader options pre-folded)
- drop the init-time env-stamp loop, subsumed by the call-time overlay

Pure helpers (unit-testable):
- resolveEnvOverlay(cached, candidate, apiKey): 7-branch env precedence
- currentProviders(s, envs): single seam every reader funnels through

Layer body:
- view(s) helper composes env.all() + currentProviders
- all 7 readers + resolveSDK go through view; resolveSDK accepts Info explicitly
- custom loop always registers loaders so a late-env provider can find them;
  folds non-autoload options into database; preserves source=env for
  env-conditional autoloaders (gitlab, cloudflare-ai-gateway)

Tests:
- test/provider/overlay.test.ts: 13 pure-helper unit tests
- test/provider/provider.test.ts: 9 integration tests covering getModel late,
  getLanguage late, closest late, getSmallModel late, defaultModel late,
  alpha model leak, blacklist late, multi-env Bedrock, auth precedence
- test/preload.ts: clear GITHUB_TOKEN, GITLAB_TOKEN, OPENCODE_API_KEY,
  GEMINI_API_KEY, HF_TOKEN, DIGITALOCEAN_ACCESS_TOKEN to prevent CI leakage
…ers and cleanedDatabase, close variant parity gap

Round-2 cross-validation by 3 oracles surfaced 4 actionable findings.
This addresses all of them:

A1 — Variant parity gap (Oracle A): the cleanedDatabase build only applied
the 5 keep filters (status/blacklist/whitelist/chat-aliases) but skipped
the prepareModel mutations (api.id default, ProviderTransform.variants
generation, configProvider.models[*].variants merge) that the cachedProviders
cleanup loop applies. A late-env-detected provider therefore had empty
variants while the same provider with env-at-boot had full variants.
Fixed by applying prepareModel in both loops.

A3 — Filter duplication (Oracle A + C): the 5 keep predicates were duplicated
between the cachedProviders cleanup loop and the cleanedDatabase build.
Extracted keepModel(model, modelID, providerID, configProvider) and
prepareModel(model, modelID, configVariants) as local helpers in the
InstanceState.make closure. Both loops now share the pipeline.

A2 — Test tightening (Oracle A + C): the late-detected multi-env Bedrock
test had an 'if (bedrock)' guard that silently passed if regression made
bedrock undefined. Replaced with explicit expect(bedrock).toBeDefined().
The defaultModel late-env test only asserted toBeDefined() — strengthened
to expect(d.providerID).toBe(anthropic) and added enabled_providers
config to make the assertion robust against bundled provider iteration
order.

A4 — AWS env hygiene (Oracle B): the multi-env Bedrock test only removed
4 AWS keys; added AWS_PROFILE, AWS_ROLE_ARN, AWS_CONTAINER_CREDENTIALS_*,
AWS_WEB_IDENTITY_TOKEN_FILE for full provider-detection-path isolation.

New regression test: 'late-detected env provider has variants populated
(parity with env-at-boot)' guards A1 closure.

bun typecheck: clean
bun test test/provider/: 101 + 13 = 114/114 pass
…out + test isolation

Round-3 cross-validation by 4 oracles + 4 design oracles surfaced 8 findings
plus 3 audit gaps. This commit addresses all of them.

B1 — Multi-credential late-env gap (Oracle 2 + Oracle 3): cleanedDatabase
surfaces providers whose env appears post-init, but multi-cred loaders
(bedrock, sap-ai-core, vertex, vertex-anthropic, azure, cloudflare-workers-
ai, cloudflare-ai-gateway, azure-cognitive-services) close over credential
providers / deployment IDs / account IDs / derived URLs at loader time.
Late-promoting them yields an unauthenticated SDK that fails silently at
first request.

Solution: explicit `dynamicEnv?: boolean` capability flag on CustomLoader.
Loaders mark `dynamicEnv: false` on broken-init-state early-returns; State
holds `loaderRefusedDynamicEnv: Set<ProviderID>` populated at init;
cleanedDatabase build excludes these. New `warnLateEnvRefused` helper
emits a one-time per-process warn when refused env appears, surfacing
the limitation without false-positive listing. The PR test that asserted
the broken bedrock state is inverted to assert exclusion.

Audit (post-oracles): cloudflare-workers-ai, cloudflare-ai-gateway, and
azure-cognitive-services had the same pattern as the original 5 loaders
but were not covered. All three now carry the flag; azure-cognitive-
services refactored to early-return (was producing options.baseURL=
undefined). The dynamicEnv JSDoc states the rule generically so future
loaders know what to do.

B2 — Test isolation regression (Oracle 3): bun runs all .test.ts in one
shared process. With Env.set now writing through to process.env, 30+
test sites in test/provider/provider.test.ts call set() without remove()
and leak across files. Documented victim: test/session/structured-output-
integration.test.ts:12 reads process.env.ANTHROPIC_API_KEY at module-load
to choose skip vs live.

Solution: capture ENV_BASELINE in test/preload.ts after preload settles,
afterEach() restores the snapshot. ~15 lines, covers every test file
automatically, future-proof.

M1 — Inaccurate State invariants (Oracle 2 + Oracle 4): cachedProviders
comment claimed "never holds env-only entries", violated by the autoload-
with-env-key path. cleanedDatabase claimed "Immutable post-init" without
compile-time hint. Rephrased both; added Readonly<> on cleanedDatabase.

M2.C — Inconsistent process.env writes (Oracle 3): bedrock and sap-ai-core
loaders wrote process.env.X = auth.key directly, breaking the trace-and-
audit guarantee of the Env service. Added dep.set on CustomDep; lifted
the effectful write out of iife() into proper yield* dep.set(...).

M3 — Missing Env.Service contract docs (Oracle 2 + Oracle 4): added
top-of-file JSDoc covering live reads, global writes, child-process
inheritance, and the test-isolation contract.

M5 — Empty-string apiKey edge case (Oracle 2): info.env.find(Boolean)
intentionally treats "" as absent. Added inline comment so a future
"correctness fix" doesn't regress to find(v => v !== undefined).

M6 — Redundant `as State` cast (Oracle 1): {...process.env} infers
correctly as Record<string, string | undefined>. Cast removed.

bun typecheck: clean
bun test test/provider: 365/366 (1 environmental 5000ms flake, different
  test each run, present on baseline before changes)
bun test test/plugin: 91/91
Round-4 cross-validation by 5 review oracles + 2 design oracles surfaced 5
in-scope findings on top of the late-env contract. This commit addresses
all of them.

B1 — getLanguage TypeError on env removal (Oracle correctness): when the
env key is removed between getModel() and getLanguage(), currentProviders
yields no entry for the providerID and resolveSDK / spread on
provider.options crashes inside refineRejection (which only refines
NoSuchModelError). Surfaces as an unrelated defect.

Solution: early-return ModelNotFoundError when the overlay has no provider,
mirroring getModel's existing pattern. fuzzysort suggestions are populated
identically so the error contract is consistent across the two readers.

B2 — s.models cache stale on env-key rotation (Oracle correctness): s.sdk
correctly rebuilds via Hash.fast({providerID, npm, options}) when apiKey
changes, but the cached LanguageModelV3 wrapper at s.models was keyed only
by ${providerID}/${modelID}, so it kept referencing the SDK from before
rotation. Multi-env providers (Databricks, Privatemode, Google generative
AI) where provider.key is undefined had an even worse collision: two
distinct env states produced identical cache keys.

Solution: cache key now mirrors s.sdk's Hash.fast keying:
Hash.fast(JSON.stringify({providerID, modelID, key, options})). Env key
rotation, multi-env credential changes, and env-templated baseURL
mutations now invalidate the LanguageModel together with the SDK by
construction.

M1 — Direct process.env write in config.ts well-known auth (Explore §4.2):
`process.env[value.key] = value.token` bypassed the typed Env seam.
Replaced with `yield* env.set(...)` matching the existing pattern at the
OPENCODE_CONSOLE_TOKEN site. No behavioral change post-PR (Env.set now
writes through), pure consistency.

M2 — Mixed env reads in bedrock loader (Explore §4.3):
AWS_CONTAINER_CREDENTIALS_RELATIVE_URI / _FULL_URI were read directly
from process.env while surrounding code uses the dep.env() snapshot.
Switched to env["..."] for greppability and to keep all bedrock-loader
env reads going through one source.

M3 — Multi-instance hazard documentation (Oracle architecture, Oracle
scope/quality): Env.set writes are process-global. In multi-workspace
desktop hosts (one sidecar serving several workspace contexts) two
workspaces authenticating to the same provider with different credentials
clobber each other. Added explicit warning to the Env module docstring,
plus TODO(multi-instance) markers at each existing credential-write site
(config.ts well-known + console-token, provider.ts bedrock-bearer +
sap-ai-core-service-key) so the future Auth-routing migration has a clean
search target.

Tests:

- getLanguage rejects when env removed between getModel and getLanguage
  (covers B1)
- getLanguage rebuilds language model after env key rotation (covers B2)

Verification:

- bun typecheck: clean
- lsp_diagnostics: clean on all 4 changed files
- bun test test/provider/: 368 pass / 0 fail (incl. 2 new + all PR
  late-env tests + overlay.test.ts pure unit tests)
- bun test test/config/: 163 pass / 0 fail

Deferred to separate tickets (out of scope for this PR per the design
proposal cross-validated by both oracles):

- Multi-env partial-env leak (Databricks/Privatemode/Google generative AI):
  pre-existing in the deleted load-env loop with identical semantics; no
  safe automatic rule without catalog metadata distinguishing alias vs
  complementary multi-env. Fix candidate: default dynamicEnv:false for
  multi-env providers without custom loaders.
- dynamicEnv:false extension to autoload-success providers (warn on
  secondary env mutation post-init).
- mcp-websearch.ts EXA_URL module-level snapshot.
- Multi-instance credential isolation rework (Auth-routing migration).
- resolveEnvOverlay: preserve cached.key when candidate is multi-env so a
  previously-resolved env key is not dropped to undefined when the env
  layer is later re-evaluated.
- hashIdentity: replace bare `Hash.fast(JSON.stringify(...))` with a
  helper that maps function values to a named-tag sentinel. JSON.stringify
  silently drops functions, so two distinct closures stored in options
  (e.g. AWS credentialProvider instances) collided on the same hash. Apply
  at the resolveSDK and getLanguage call sites.
- Mark resolveEnvOverlay, currentProviders, and the bare `layer` export
  as @internal in their JSDoc so consumers don't depend on them; production
  code MUST go through Provider.defaultLayer / Provider.Service methods.
- Document the deep-clone invariant of toPublicInfo+mergeDeep so the
  prepareModel call inside both providers and cleanedDatabase loops is
  provably safe (each loop mutates its own clone).
- test/preload.ts: add AZURE_COGNITIVE_SERVICES_RESOURCE_NAME and
  GITLAB_INSTANCE_URL to the deletion list (referenced by the azure
  cognitive services and gitlab loaders).
- test/preload.ts: capture ENV_BASELINE after Log.init() and
  initProjectors() so any import-time side effects are part of the
  baseline. Avoids afterEach mistakenly deleting keys set by src/ module
  evaluation.
- Extend overlay.test.ts with the missing multi-env preserve-key case.
@jerome-benoit jerome-benoit changed the title fix(env): proxy directly to process.env instead of snapshotting fix(env): live process.env reads + late-env detection for env-key providers May 15, 2026
…acing, rerunOn TODO

Address cross-validation findings from 4-oracle review:

- env: drop Effect.fn spans on get/all (hot path; every Provider read
  goes through env.all()). Keep tracing on set/remove where the side
  effect is observable.
- provider: tighten hashIdentity comment to admit that anonymous arrows
  still collide on __fn:anon, and explain why that permissiveness is
  intentional (per-call fetch wrapper in resolveSDK must hit the SDK
  cache instead of busting it).
- provider: simplify `cached?.key ?? undefined` to `cached?.key` in
  resolveEnvOverlay (redundant with optional chaining).
- provider: add TODO(rerunOn) marker in dynamicEnv JSDoc as the future
  declarative env-dependency list that would close the sap-ai-core /
  amazon-bedrock late-env case without restart.

No behavioral change. typecheck + 465 tests pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Env.all() caches process.env snapshot, preventing detection of env vars set after initialization

2 participants