Skip to content

security: replace predictable production token with cryptographic randomness#3243

Open
Snakinya wants to merge 2 commits into
Netflix:masterfrom
Snakinya:security/fix-predictable-production-token
Open

security: replace predictable production token with cryptographic randomness#3243
Snakinya wants to merge 2 commits into
Netflix:masterfrom
Snakinya:security/fix-predictable-production-token

Conversation

@Snakinya

@Snakinya Snakinya commented Jun 8, 2026

Copy link
Copy Markdown

Summary

The production token mechanism in metaflow/plugins/aws/step_functions/production_token.py uses a fully deterministic algorithm to generate deployment authorization tokens:

random.seed(zlib.adler32(to_bytes(prefix)))
yield prefix + "".join(random.sample(string.ascii_lowercase, 4))

This means anyone who knows the flow name (which is public information) can compute the exact production token and bypass deployment authorization checks in Step Functions, Airflow, and Argo Workflows.

Before (predictable — same output every time):

production:MyFlow-0-chby   ← always this value

After (cryptographically random):

production:MyFlow-heq2nrw5qc73f0ry   ← different every time

Vulnerability Details

  • CWE: CWE-330 (Use of Insufficiently Random Values)
  • Affected components: Step Functions, Airflow, Argo Workflows deployment authorization
  • Impact: Any user who knows a flow name can predict the production token and re-deploy/modify production workflows belonging to other users
  • Root cause: zlib.adler32 (a checksum, not CSPRNG) used as seed for random.sample with only 4 lowercase letters (26⁴ = 456,976 possibilities, all enumerable)

Fix

Replace with secrets.choice() generating 16-character alphanumeric tokens (36¹⁶ ≈ 7.96×10²⁴ possibilities).

Backwards compatible: existing stored tokens continue to work via load_token(). Only newly generated tokens use the secure algorithm.

Test plan

  • new_token() produces different output on each call
  • Token format remains {prefix}-{suffix} (compatible with existing storage/comparison)
  • load_token() / store_token() unchanged
  • Existing production deployments can still generate new tokens on next create

The previous implementation used zlib.adler32 as a PRNG seed with
random.sample to generate 4-letter suffixes. This made tokens fully
deterministic — anyone who knows the flow name can compute the exact
token, bypassing deployment authorization.

Replace with secrets.choice() producing 16-character alphanumeric
tokens (36^16 ≈ 7.96×10²⁴ possibilities), making brute-force
infeasible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@greptile-apps

greptile-apps Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR addresses a real predictability vulnerability in production token generation by replacing the seeded random/zlib.adler32 approach with secrets.choice(), raising token entropy from ~19 bits (26⁴ possibilities) to ~83 bits (36¹⁶).

  • _token_generator and its deterministic sequence logic are removed; new_token() now always returns a fresh cryptographically random {prefix}-{16 alphanumeric chars} token and ignores its prev_token parameter.
  • load_token() and store_token() are unchanged, so existing deployed tokens continue to authorize correctly.
  • The caller-side if token is None branches in step_functions_cli.py, airflow_cli.py, and argo_workflows_cli.py can no longer be reached, and the guard that raised MetaflowException("--generate-new-token is not supported after --new-token") is silently gone.

Confidence Score: 5/5

The core change is correct and safe to merge: the cryptographic fix is sound, existing stored tokens are unaffected, and no new failure paths are introduced in production_token.py itself.

The secrets.choice() implementation is correct, load_token/store_token are untouched so backward compatibility holds, and the only concern — dead if token is None branches and the lost --new-token guard in the CLI callers — was already surfaced in prior review threads and is outside the changed file.

No files in the diff require additional attention; the unchanged CLI callers carry stale dead-code branches that were flagged in a previous review thread.

Important Files Changed

Filename Overview
metaflow/plugins/aws/step_functions/production_token.py Replaces deterministic random/zlib.adler32 token generation with secrets.choice() for 16-char alphanumeric tokens; prev_token is now a no-op parameter and new_token() always returns a non-None string, leaving if token is None branches in all three CLI call sites permanently unreachable.

Reviews (2): Last reviewed commit: "address review: document prev_token rete..." | Re-trigger Greptile

Comment thread metaflow/plugins/aws/step_functions/production_token.py
The prev_token parameter is no longer used for sequence-based generation
but is kept for call-site compatibility. With cryptographic randomness,
new_token() can always succeed — the old None-return path (which guarded
against exhausting a deterministic sequence) is no longer reachable. This
is intentional: the --generate-new-token flow now always produces a valid
token rather than erroring when the old sequence was exhausted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant