fix: P0 security/stability hardening bundle #22

Merged: maltsev-dev merged 1 commit into master from fix/p0-security-stability-bundle on Jun 19, 2026
maltsev-dev (Member) commented:

Closes the P0/P1/P2/P3 issues from the security review (plan §10/§11.4).

Security / PCI-DSS / GDPR

  • P0-1: Mask positional PII in _enforce_sensitive_tool by introspecting the wrapped function's signature and applying SENSITIVE_ARG_KEYS to positional params. Pre-fix, charge("4111-…-1111", 50) forwarded the PAN into /execute and the audit log.
  • P0-6 / P3-3: _safe_repr now redacts BEFORE truncating. The pre-fix order truncated first, so details={…} past position 50 leaked verbatim. _safe_repr is now the single source of truth for the redact-then-truncate flow.
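
The two fixes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the contents of `SENSITIVE_ARG_KEYS`, the `PAN_RE` pattern, the `MASK` string, and the helper names `mask_call_args` / `safe_repr` are all hypothetical, and the limit of 50 is taken only from the "past position 50" remark.

```python
import inspect
import re

# Hypothetical stand-ins; the real SENSITIVE_ARG_KEYS contents are not shown in this PR.
SENSITIVE_ARG_KEYS = {"card_number", "pan", "cvv"}
MASK = "***"
# Illustrative pattern for card-like digit runs (13-19 digits, optional separators).
PAN_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def mask_call_args(func, args, kwargs):
    """Bind positional args to parameter names, then mask the sensitive ones."""
    bound = inspect.signature(func).bind_partial(*args, **kwargs)
    return {
        name: MASK if name.lower() in SENSITIVE_ARG_KEYS else value
        for name, value in bound.arguments.items()
    }


def safe_repr(value, limit=50):
    """Redact the full repr first, then truncate the already-redacted text."""
    text = PAN_RE.sub(MASK, repr(value))
    return text if len(text) <= limit else text[:limit] + "..."


def charge(card_number, amount):
    """Toy sensitive tool used only to demonstrate the binding step."""
    return amount
```

Binding through `inspect.signature` is what lets a positional call such as `charge(<pan>, 50)` be treated the same as `charge(card_number=<pan>, amount=50)` before anything is forwarded or logged.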

Cost-audit / reliability

  • P0-3: Bounded chunked reads on the sync + async httpx transports (MAX_RESPONSE_BYTES, default 16 MiB, NULLRUN_MAX_RESPONSE_BYTES env override). Above the cap, tracking is skipped and _coverage_streaming_skipped is incremented. Replaces the response.read() / await response.aread() unbounded buffer that held entire LLM streaming bodies in memory.
  • P0-4: _do_flush_locked re-queue on CB OPEN now drops the NEWEST non-critical events instead of the oldest. The oldest events (incident start, billing-period start) are exactly what a billing investigator needs; losing them silently broke monthly rollups. Control-plane events (state_change, kill_received, policy_invalidated, key_rotated) are preserved unconditionally so the dashboard KILL switch lands even under sustained backend outage.
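
A simplified sketch of both behaviours, assuming events are plain dicts with a `type` key ordered oldest to newest, and that the chunk source is any iterable of bytes (for example the iterator returned by httpx's `response.iter_bytes()`). The function names are hypothetical, and the real transport also has to pass the body through to the caller, which this sketch leaves out.

```python
MAX_RESPONSE_BYTES = 16 * 1024 * 1024  # default cap named in the PR

# Control-plane event types the PR lists as always preserved.
CRITICAL_TYPES = {"state_change", "kill_received", "policy_invalidated", "key_rotated"}


def read_bounded(chunks, cap=MAX_RESPONSE_BYTES):
    """Accumulate chunks; return None (skip tracking) once the cap would be exceeded."""
    buf = bytearray()
    for chunk in chunks:
        if len(buf) + len(chunk) > cap:
            return None  # a caller would increment its "skipped" counter here
        buf.extend(chunk)
    return bytes(buf)


def requeue_drop_newest(events, capacity):
    """Keep every critical event, then the OLDEST non-critical ones that still fit."""
    critical = sum(1 for e in events if e["type"] in CRITICAL_TYPES)
    room = max(capacity - critical, 0)
    kept = []
    for event in events:  # assumed ordered oldest -> newest
        if event["type"] in CRITICAL_TYPES:
            kept.append(event)
        elif room > 0:
            kept.append(event)
            room -= 1
    return kept
```

Because critical events are counted before any non-critical event is admitted, a burst of ordinary events cannot crowd out a later `kill_received`.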

Identity

  • S-8 / P2-4: agent() now emits str(uuid.uuid4()) (with dashes). Pre-fix the format was f"agent-{uuid.uuid4().hex}" — 32 hex chars, no dashes — and backend UUID-typed columns dropped these to NULL on insert. User-supplied names are still preserved verbatim.
  • §7.2 #16: workflow() context manager now resets span_id (not only workflow_id / trace_id) so nested with span() blocks don't leave the inner span_id visible inside the workflow scope.
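
The difference between the two ID formats can be checked with the standard library alone. In this sketch `new_agent_id` and `is_canonical_uuid` are hypothetical helper names, not the SDK's API.

```python
import uuid


def new_agent_id(name=None):
    """Return a user-supplied name verbatim, else a canonical dashed UUID string."""
    return name if name is not None else str(uuid.uuid4())


def is_canonical_uuid(value):
    """True only for the 36-character dashed form produced by str(uuid.UUID)."""
    try:
        return str(uuid.UUID(value)) == value
    except ValueError:
        return False
```

A value shaped like the pre-fix `"agent-" + uuid.uuid4().hex` fails this check, while the output of `new_agent_id()` passes it.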

Resource leaks

  • S-9: _active_runs on NullRunCallback is now an OrderedDict capped at 4096 with FIFO eviction. Pre-fix the dict grew unbounded when on_chain_end did not fire (some LangChain versions short-circuit the end hook on chain-body errors).
  • S-10: WebSocket reconnect loop is now capped at 10 consecutive failures, then falls back to HTTP-poll. Pre-fix the loop ran forever when the backend was permanently down, leaking the WS thread.
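
A minimal sketch of a capped, FIFO-evicting map in the spirit of S-9. The class name `BoundedRuns` and its methods are hypothetical; only the `OrderedDict` type and the cap of 4096 come from the description above.

```python
from collections import OrderedDict

MAX_ACTIVE_RUNS = 4096  # cap named in the PR


class BoundedRuns:
    """Insertion-ordered map that evicts the oldest entry once full (FIFO)."""

    def __init__(self, cap=MAX_ACTIVE_RUNS):
        self._cap = cap
        self._runs = OrderedDict()

    def start(self, run_id, info):
        self._runs[run_id] = info
        while len(self._runs) > self._cap:
            self._runs.popitem(last=False)  # drop the oldest insertion

    def end(self, run_id):
        # Returns None if the run was already evicted or never started.
        return self._runs.pop(run_id, None)

    def __len__(self):
        return len(self._runs)
```

With eviction in place, a missing end hook costs at most one stale slot that is reclaimed once the cap is reached, rather than a permanent leak.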

Transport

  • §7.2 #6: Separate hmac_verify_expired_total counter so SRE can distinguish clock-skew (NTP drift) from forged packets. Mirrored in both the HTTP and WebSocket verify paths.
  • §7.2 #35: CircuitBreaker.call now dispatches the OPEN→HALF_OPEN jitter through _maybe_apply_open_jitter_sync / _maybe_apply_open_jitter_async. Pre-fix the jitter used time.sleep before dispatching to async, which blocked the caller's event loop on every transition.
  • P2-1: _coverage_seen now bumps in the httpx path (sync + async). Pre-fix the counter was only bumped by the requests transport, so the dashboard's coverage view was empty for the dominant OpenAI / Anthropic / Gemini / Mistral / Cohere traffic.
  • P2-3: is_sensitive_tool match is case-insensitive. Pre-fix "stripe.charge" did not match "Stripe.Charge", bypassing the sensitive gate.
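
The sync/async jitter split from §7.2 #35 might look something like the sketch below. The helper names are hypothetical, the uniform distribution is an assumption, and the 5-second maximum is taken from the test notes in this PR.

```python
import asyncio
import inspect
import random
import time

MAX_JITTER_S = 5.0  # the 5s cap pinned by test_release_polish.py


def apply_jitter_sync(max_s=MAX_JITTER_S):
    """Blocking sleep: only appropriate when no event loop is being blocked."""
    delay = random.uniform(0.0, max_s)
    time.sleep(delay)
    return delay


async def apply_jitter_async(max_s=MAX_JITTER_S):
    """Cooperative sleep: yields to the event loop instead of blocking it."""
    delay = random.uniform(0.0, max_s)
    await asyncio.sleep(delay)
    return delay


def call_with_jitter(func, *args, max_s=MAX_JITTER_S):
    """Decide sync vs async BEFORE sleeping, so async callers never hit time.sleep."""
    if inspect.iscoroutinefunction(func):
        async def runner():
            await apply_jitter_async(max_s)
            return await func(*args)
        return runner()  # the caller awaits this coroutine
    apply_jitter_sync(max_s)
    return func(*args)
```

The key point is the order: the callable's type is inspected first, and only then is the matching sleep primitive chosen.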

Concurrency

  • §7.2 #39: New _tools_lock guards every mutation of _strict_mode_tools / _sensitive_tools. Same lock guards the coverage-counter bump+prune sequence (§7.2 #33) so two threads can't both observe the dict at length 4095 and both grow it to 4097 before either prune lands.
  • §7.2 #47: New _langchain_lock / _langgraph_lock guard the patch sequences end-to-end. Pre-fix two threads racing through auto_instrument could both pass the early _x_patched check and double-wrap BaseCallbackManager / Pregel.
  • §7.2 #33: _COVERAGE_CAP (4096) bounds the per-host coverage dicts.
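
To illustrate why the bump and the prune need to share one lock, here is a small sketch. `CoverageCounter` is a hypothetical class, and evicting the oldest-inserted host is an assumption, since the PR does not say which entry is pruned.

```python
import threading

COVERAGE_CAP = 4096  # mirrors _COVERAGE_CAP from the PR


class CoverageCounter:
    """Per-host counter whose bump and prune run under one lock acquisition."""

    def __init__(self, cap=COVERAGE_CAP):
        self._cap = cap
        self._lock = threading.Lock()
        self._seen = {}

    def bump(self, host):
        with self._lock:
            self._seen[host] = self._seen.get(host, 0) + 1
            while len(self._seen) > self._cap:
                # dicts preserve insertion order, so this removes the oldest host
                del self._seen[next(iter(self._seen))]

    def snapshot(self):
        with self._lock:
            return dict(self._seen)
```

If the increment and the prune took the lock separately, two threads could each add a key before either pruned, which is the overshoot described above.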

Webhook delivery

  • P3-2: Exponential backoff (0.5s, 1s, 2s, 4s, 8s, 16s, 30s cap) replaces the previous linear schedule. Linear didn't back off fast enough under sustained outage — each KILL/PAUSE spawned its own delivery thread, producing 1000+ spinning threads hammering the dead endpoint.
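
The schedule listed above is consistent with doubling from a 0.5 s base and clamping at 30 s. A minimal sketch, where `backoff_delay` is a hypothetical name and the absence of jitter is an assumption:

```python
def backoff_delay(attempt, base=0.5, cap=30.0):
    """Delay in seconds before retry number `attempt` (0-based), capped at `cap`."""
    return min(base * (2 ** attempt), cap)


schedule = [backoff_delay(n) for n in range(7)]
# schedule == [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

The seventh value would be 32 s without the clamp, which is why the list ends at 30 s.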

WAL crash-recovery

  • P1-5b: Atomic WAL writes (tmp + fsync + os.replace), 64 MiB rotation with os.replace(wal, wal.1), replay drains both wal.1 and wal. New NULLRUN_WAL_PATH / NULLRUN_WAL_MAX_BYTES env overrides for containers with readOnlyRootFilesystem: true.
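
The tmp + fsync + `os.replace` pattern can be sketched as below. The function names are hypothetical, and replaying `wal.1` before `wal` (oldest first) is an assumption. The temp file is created next to the target because `os.replace` is only atomic within one filesystem.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write bytes to a temp file in the same directory, fsync, then swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".wal-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # readers see the old or the new file, never a mix
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def rotate_if_needed(path, max_bytes=64 * 1024 * 1024):
    """Move wal -> wal.1 once it reaches max_bytes; return True if rotated."""
    if os.path.exists(path) and os.path.getsize(path) >= max_bytes:
        os.replace(path, path + ".1")
        return True
    return False


def replay(path):
    """Yield the contents of wal.1 (older) first, then wal."""
    for candidate in (path + ".1", path):
        if os.path.exists(candidate):
            with open(candidate, "rb") as fh:
                yield fh.read()
```

In a container with `readOnlyRootFilesystem: true`, `path` would need to point at a writable mount, which is presumably what the `NULLRUN_WAL_PATH` override is for.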

Tests

8 new regression test files (57 tests total):
test_agent_id_uuid.py, test_args_pii_masked.py, test_streaming_oom_cap.py, test_lru_active_runs.py, test_reconnect_cap.py, test_coverage_seen_httpx.py, test_webhook_backoff.py, test_redact.py

test_buffer_invariants.py extended with drop-newest + critical-event preservation cases. test_release_polish.py updated to pin the 5s cap on both the sync and async jitter helpers (post §7.2 #35 split).

Full incident write-ups in CHANGELOG.md under the same P0/S/P tags.

maltsev-dev merged commit 87b1e6a into master on Jun 19, 2026
3 of 6 checks passed
maltsev-dev deleted the fix/p0-security-stability-bundle branch on June 19, 2026 10:14