security(github-copilot): per-pod KDF for stored Copilot token#300
security(github-copilot): per-pod KDF for stored Copilot token#300pjdoland wants to merge 1 commit into
Conversation
Second-pass review surfaced two items: 1. The default-password WARNING mentioned NBI_POD_IDENTITY as a knob the operator could set, but that env var ships in PR plmbr#300 (per-pod KDF) and does not exist on the branch's merge base. Drop the reference until plmbr#300 lands; the remaining knobs (NBI_GH_ACCESS_TOKEN_PASSWORD per-user, NBI_ALLOW_DEFAULT_TOKEN_PASSWORD opt-out) are sufficient guidance for the deployments this PR targets. 2. The read-path warn test only asserted the WARNING fired; the contract is "warn AND return the plaintext" because a regression that aborts the read on warn would silently log the user out. Rewritten to encrypt a real token, run the read, and assert both halves. Also pin that the non-shared branch of the warning omits the directory path so log analysis can tell the two postures apart by message content alone. The silent-fail UX (when NBI_REFUSE_DEFAULT_TOKEN_PASSWORD_ON_SHARED_FS refuses the write, the user sees no in-app signal) is documented as a known limitation in the PR description but not addressed here: plumbing storage-failure status back to the chat sidebar is a separable change that touches the capability handshake, and the audit's recommended remediation (operators set the env vars before rollout) means in-app feedback is a UX enhancement rather than a regression blocker.
|
@pjdoland this looks great but can we allow to turn off this new logic and use the existing method. In some environments filesystem is not shared and reboots change the machine id which would require a re-login. |
|
Definitely. That is almost certainly the correct approach...as it might be for some of these other security PRS. |
Second-pass review surfaced two items: 1. The default-password WARNING mentioned NBI_POD_IDENTITY as a knob the operator could set, but that env var ships in PR plmbr#300 (per-pod KDF) and does not exist on the branch's merge base. Drop the reference until plmbr#300 lands; the remaining knobs (NBI_GH_ACCESS_TOKEN_PASSWORD per-user, NBI_ALLOW_DEFAULT_TOKEN_PASSWORD opt-out) are sufficient guidance for the deployments this PR targets. 2. The read-path warn test only asserted the WARNING fired; the contract is "warn AND return the plaintext" because a regression that aborts the read on warn would silently log the user out. Rewritten to encrypt a real token, run the read, and assert both halves. Also pin that the non-shared branch of the warning omits the directory path so log analysis can tell the two postures apart by message content alone. The silent-fail UX (when NBI_REFUSE_DEFAULT_TOKEN_PASSWORD_ON_SHARED_FS refuses the write, the user sees no in-app signal) is documented as a known limitation in the PR description but not addressed here: plumbing storage-failure status back to the chat sidebar is a separable change that touches the capability handshake, and the audit's recommended remediation (operators set the env vars before rollout) means in-app feedback is a UX enhancement rather than a regression blocker.
The encrypted token at ~/.jupyter/nbi/user-data.json was keyed via
PBKDF2 from a public default password (NBI_GH_ACCESS_TOKEN_PASSWORD,
default literal "nbi-access-token-password"). Every install with the
default shipped the same key, so an attacker who read user-data.json
off a shared filesystem could decrypt the OAuth access token directly,
no brute force.
Mix a per-pod + per-user secret into the KDF password input before
PBKDF2. Resolution order (first non-empty wins, the rest still
contribute to the concatenation):
1. NBI_POD_IDENTITY env var (highest priority). Admins on K8s where
/etc/machine-id and uid are shared across co-tenants pin this to
a per-pod or per-user value (JupyterHub username, spawn token,
Pod metadata.uid via the downward API).
2. /etc/machine-id or /var/lib/dbus/machine-id when mounted.
3. socket.gethostname() as a last resort (pod name in K8s).
POSIX uid is always included; getattr-guarded for Windows.
The on-disk blob format is unchanged (16 salt bytes + Fernet
ciphertext). New encrypt_user_secret / decrypt_user_secret helpers
wrap the existing crypto with the derived password. Legacy blobs
(those written before this change) decrypt via a fallback to the
bare password and are silently rewritten under the v2 KDF on the
next read, so existing users do not have to re-login.
The rewrite goes through the existing _atomic_write_json helper so a
crash mid-rewrite (K8s OOMKill during the silent upgrade) cannot
leave a torn user-data.json on disk. The legacy-fallback exception
handler catches InvalidToken only, so a corrupted blob or programmer
error surfaces instead of being masked as "needs the legacy path."
Admin-guide and PRIVACY notes updated to call out the K8s
shared-machine-id reality and recommend NBI_POD_IDENTITY for
multi-tenant deployments.
05f10cd to
f78be29
Compare
|
Converting this to a draft for now to address @mbektas's feedback before it goes in. The ask was for a way to turn the new logic off and fall back to the existing method, since environments with a non-shared filesystem, or a machine-id that changes on reboot, would otherwise force a re-login. This branch adds So I'd like to add an explicit opt-out (default on, so the hardening stays the default) that uses the bare-password method when disabled and makes sure the read path does not silently re-bind in that mode. I'll mark this ready again once that is in place. Thanks for the catch. |
Summary
The encrypted Copilot token at
~/.jupyter/nbi/user-data.jsonwas keyed via PBKDF2 fromNBI_GH_ACCESS_TOKEN_PASSWORDwhose default value is the public literalnbi-access-token-password. Every install with the default ships the same key. An attacker who could read another tenant'suser-data.json(shared NFS home, accidental backup leak, errant git commit) could decrypt the OAuth access token directly: the ciphertext is decryption-equivalent to plaintext for anyone holding the known password.Solution
Mix a per-pod + per-user secret into the KDF password input before PBKDF2. The on-disk blob format is unchanged (16 random salt bytes + Fernet ciphertext); the change is purely in what string feeds
PBKDF2HMAC.derive().Resolution order in
_machine_user_secret()(first non-empty wins, the rest still contribute to the concatenation):NBI_POD_IDENTITYenv var (highest priority). On a multi-tenant Kubernetes deployment,/etc/machine-idtypically resolves to the host node's value (shared across co-tenants) and every container runs as uid1000. The automatic sources collapse to a value shared across co-tenants on the same node. Admins pinNBI_POD_IDENTITYto a value that stays stable across pod restarts (JupyterHub username, per-user spawn secret) for real cross-tenant isolation. Avoidmetadata.uidfrom the downward API: that value rotates on every restart and would invalidate stored tokens on each restart./etc/machine-idor/var/lib/dbus/machine-idwhen mounted.socket.gethostname()as a last resort. In Kubernetes the hostname equals the pod name, which differs across pods and gives per-pod isolation even though it is not a confidential value.POSIX uid is always included. The Windows path is
getattr-guarded so a missingos.getuiddoes not crash the read path.New
encrypt_user_secret/decrypt_user_secrethelpers innotebook_intelligence/util.pywrap the existingencrypt_with_password/decrypt_with_passwordwith the derived password. Thegithub_copilot.pyread/write call sites switch to the new helpers; tests on the bare helpers continue to cover the underlying crypto.Legacy migration: blobs encrypted before this change decrypt via a fallback to the bare password and are silently rewritten under the v2 KDF on the next read. The legacy-fallback exception handler catches
InvalidTokenonly, so a corrupted blob or programmer error surfaces loudly instead of being masked. The rewrite goes through the existing_save_user_datahelper (introduced in PR #293) so the 0o600 file mode contract is preserved across the upgrade and a crash mid-rewrite (a K8s OOMKill during the silent upgrade) cannot leave a tornuser-data.jsonon disk. If the rewrite itself fails (read-only filesystem, transient I/O), the read still returns the decrypted plaintext so the user's login completes; the next read falls back to the legacy path again until the underlying I/O problem clears.Testing
tests/test_user_secret_kdf.py: 13 cases covering round-trip, cross-pod isolation by machine-id and uid, hostname fallback when/etc/machine-idis absent, legacy-blob fallback (and refuses withallow_legacy=False),NBI_POD_IDENTITYoverriding the automatic sources, an end-to-end silent-upgrade test that drivesread_stored_github_access_tokenagainst a fakeuser_data_fileontmp_pathand verifies the on-disk bytes change between calls, and a rewrite-failure test that asserts the read still returns plaintext whenwrite_github_access_tokenraises.Risks / follow-ups
/etc/machine-idmounted, hostname changes,NBI_POD_IDENTITYunset or pointed at a rotating value likemetadata.uid) will fail to decrypt v2 blobs and the user is silently logged out. The admin-guide and PRIVACY notes now document this, recommend stableNBI_POD_IDENTITYvalues, and explicitly warn againstmetadata.uidfrom the downward API._save_user_data(per PR security: enforce 0o600 on the encrypted GitHub token file #293) on every write including the silent legacy-blob upgrade. The merge into this branch adopts that helper at both write sites./etc/machine-idis shared across co-tenants and uid pins to 1000. TheNBI_POD_IDENTITYenv knob lets admins close that gap with a one-line spawn config; the admin guide section calls this out.