fix: P0 security/stability hardening bundle#22
Merged
Conversation
Closes the P0/P1/P2/P3 issues from the security review (plan §10/§11.4).
Security / PCI-DSS / GDPR
- P0-1: Mask positional PII in `_enforce_sensitive_tool` by introspecting
the wrapped function's signature and applying `SENSITIVE_ARG_KEYS` to
positional params. Pre-fix, `charge("4111-…-1111", 50)` forwarded the
PAN into `/execute` and the audit log.
- P0-6 / P3-3: `_safe_repr` now redacts BEFORE truncating. The pre-fix
order truncated first, so `details={…}` past position 50 leaked
verbatim. `_safe_repr` is now the single source of truth for the
redact-then-truncate flow.
Cost-audit / reliability
- P0-3: Bounded chunked reads on the sync + async httpx transports
(`MAX_RESPONSE_BYTES`, default 16 MiB, `NULLRUN_MAX_RESPONSE_BYTES`
env override). Above the cap, tracking is skipped and
`_coverage_streaming_skipped` is incremented. Replaces the
`response.read()` / `await response.aread()` unbounded buffer that
held entire LLM streaming bodies in memory.
- P0-4: `_do_flush_locked` re-queue on CB OPEN now drops the NEWEST
non-critical events instead of the oldest. The oldest events
(incident start, billing-period start) are exactly what a billing
investigator needs; losing them silently broke monthly rollups.
Control-plane events (`state_change`, `kill_received`,
`policy_invalidated`, `key_rotated`) are preserved unconditionally
so the dashboard KILL switch lands even under sustained backend
outage.
Identity
- S-8 / P2-4: `agent()` now emits `str(uuid.uuid4())` (with dashes).
Pre-fix the format was `f"agent-{uuid.uuid4().hex}"` — 32 hex chars,
no dashes — and backend UUID-typed columns dropped these to NULL
on insert. User-supplied names are still preserved verbatim.
- §7.2 #16: `workflow()` context manager now resets `span_id` (not
only `workflow_id` / `trace_id`) so nested `with span()` blocks
don't leave the inner span_id visible inside the workflow scope.
Resource leaks
- S-9: `_active_runs` on `NullRunCallback` is now an `OrderedDict`
capped at 4096 with FIFO eviction. Pre-fix the dict grew
unbounded when `on_chain_end` did not fire (some LangChain
versions short-circuit the end hook on chain-body errors).
- S-10: WebSocket reconnect loop is now capped at 10 consecutive
failures, then falls back to HTTP-poll. Pre-fix the loop ran
forever when the backend was permanently down, leaking the
WS thread.
Transport
- §7.2 #6: Separate `hmac_verify_expired_total` counter so SRE can
distinguish clock-skew (NTP drift) from forged packets. Mirrored
in both the HTTP and WebSocket verify paths.
- §7.2 #35: `CircuitBreaker.call` now dispatches the OPEN→HALF_OPEN
jitter through `_maybe_apply_open_jitter_sync` /
`_maybe_apply_open_jitter_async`. Pre-fix the jitter used
`time.sleep` before dispatching to async, which blocked the
caller's event loop on every transition.
- P2-1: `_coverage_seen` now bumps in the httpx path (sync + async).
Pre-fix the counter was only bumped by the `requests` transport,
so the dashboard's coverage view was empty for the dominant
OpenAI / Anthropic / Gemini / Mistral / Cohere traffic.
- P2-3: `is_sensitive_tool` match is case-insensitive. Pre-fix
`"stripe.charge"` did not match `"Stripe.Charge"`, bypassing the
sensitive gate.
Concurrency
- §7.2 #39: New `_tools_lock` guards every mutation of
`_strict_mode_tools` / `_sensitive_tools`. Same lock guards the
coverage-counter bump+prune sequence (§7.2 #33) so two threads
can't both observe the dict at length 4095 and both grow it to
4097 before either prune lands.
- §7.2 #47: New `_langchain_lock` / `_langgraph_lock` guard the
patch sequences end-to-end. Pre-fix two threads racing through
`auto_instrument` could both pass the early `_x_patched` check
and double-wrap `BaseCallbackManager` / `Pregel`.
- §7.2 #33: `_COVERAGE_CAP` (4096) bounds the per-host coverage
dicts.
Webhook delivery
- P3-2: Exponential backoff (0.5s, 1s, 2s, 4s, 8s, 16s, 30s cap)
replaces the previous linear schedule. Linear didn't back off
fast enough under sustained outage — each KILL/PAUSE spawned
its own delivery thread, producing 1000+ spinning threads
hammering the dead endpoint.
WAL crash-recovery
- P1-5b: Atomic WAL writes (tmp + `fsync` + `os.replace`), 64 MiB
rotation with `os.replace(wal, wal.1)`, replay drains both
`wal.1` and `wal`. New `NULLRUN_WAL_PATH` / `NULLRUN_WAL_MAX_BYTES`
env overrides for containers with `readOnlyRootFilesystem: true`.
Tests
8 new regression test files (57 tests total):
test_agent_id_uuid.py, test_args_pii_masked.py,
test_streaming_oom_cap.py, test_lru_active_runs.py,
test_reconnect_cap.py, test_coverage_seen_httpx.py,
test_webhook_backoff.py, test_redact.py
`test_buffer_invariants.py` extended with drop-newest +
critical-event preservation cases. `test_release_polish.py`
updated to pin the 5s cap on both the sync and async jitter
helpers (post §7.2 #35 split).
Full incident write-ups in CHANGELOG.md under the same P0/S/P tags.
Welcome to Codecov 🎉Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment Thanks for integrating Codecov - We've got you covered ☂️ |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes the P0/P1/P2/P3 issues from the security review (plan §10/§11.4).
Security / PCI-DSS / GDPR
_enforce_sensitive_toolby introspecting the wrapped function's signature and applyingSENSITIVE_ARG_KEYSto positional params. Pre-fix,charge("4111-…-1111", 50)forwarded the PAN into/executeand the audit log._safe_reprnow redacts BEFORE truncating. The pre-fix order truncated first, sodetails={…}past position 50 leaked verbatim._safe_repris now the single source of truth for the redact-then-truncate flow.Cost-audit / reliability
MAX_RESPONSE_BYTES, default 16 MiB,NULLRUN_MAX_RESPONSE_BYTESenv override). Above the cap, tracking is skipped and_coverage_streaming_skippedis incremented. Replaces theresponse.read()/await response.aread()unbounded buffer that held entire LLM streaming bodies in memory._do_flush_lockedre-queue on CB OPEN now drops the NEWEST non-critical events instead of the oldest. The oldest events (incident start, billing-period start) are exactly what a billing investigator needs; losing them silently broke monthly rollups. Control-plane events (state_change,kill_received,policy_invalidated,key_rotated) are preserved unconditionally so the dashboard KILL switch lands even under sustained backend outage.Identity
agent()now emitsstr(uuid.uuid4())(with dashes). Pre-fix the format wasf"agent-{uuid.uuid4().hex}"— 32 hex chars, no dashes — and backend UUID-typed columns dropped these to NULL on insert. User-supplied names are still preserved verbatim.workflow()context manager now resetsspan_id(not onlyworkflow_id/trace_id) so nestedwith span()blocks don't leave the inner span_id visible inside the workflow scope.Resource leaks
_active_runsonNullRunCallbackis now anOrderedDictcapped at 4096 with FIFO eviction. Pre-fix the dict grew unbounded whenon_chain_enddid not fire (some LangChain versions short-circuit the end hook on chain-body errors).Transport
hmac_verify_expired_totalcounter so SRE can distinguish clock-skew (NTP drift) from forged packets. Mirrored in both the HTTP and WebSocket verify paths.CircuitBreaker.callnow dispatches the OPEN→HALF_OPEN jitter through_maybe_apply_open_jitter_sync/_maybe_apply_open_jitter_async. Pre-fix the jitter usedtime.sleepbefore dispatching to async, which blocked the caller's event loop on every transition._coverage_seennow bumps in the httpx path (sync + async). Pre-fix the counter was only bumped by therequeststransport, so the dashboard's coverage view was empty for the dominant OpenAI / Anthropic / Gemini / Mistral / Cohere traffic.is_sensitive_toolmatch is case-insensitive. Pre-fix"stripe.charge"did not match"Stripe.Charge", bypassing the sensitive gate.Concurrency
_tools_lockguards every mutation of_strict_mode_tools/_sensitive_tools. Same lock guards the coverage-counter bump+prune sequence (§7.2 #33) so two threads can't both observe the dict at length 4095 and both grow it to 4097 before either prune lands._langchain_lock/_langgraph_lockguard the patch sequences end-to-end. Pre-fix two threads racing throughauto_instrumentcould both pass the early_x_patchedcheck and double-wrapBaseCallbackManager/Pregel._COVERAGE_CAP(4096) bounds the per-host coverage dicts.Webhook delivery
WAL crash-recovery
fsync+os.replace), 64 MiB rotation withos.replace(wal, wal.1), replay drains bothwal.1andwal. NewNULLRUN_WAL_PATH/NULLRUN_WAL_MAX_BYTESenv overrides for containers withreadOnlyRootFilesystem: true.Tests
8 new regression test files (57 tests total):
test_agent_id_uuid.py, test_args_pii_masked.py, test_streaming_oom_cap.py, test_lru_active_runs.py, test_reconnect_cap.py, test_coverage_seen_httpx.py, test_webhook_backoff.py, test_redact.py
test_buffer_invariants.pyextended with drop-newest + critical-event preservation cases.test_release_polish.pyupdated to pin the 5s cap on both the sync and async jitter helpers (post §7.2 #35 split).Full incident write-ups in CHANGELOG.md under the same P0/S/P tags.
What
Why
How
Test plan
cd backend && cargo test,cd frontend && npm test)cd frontend && npm run lint)cd frontend && npm run type-check)Risk
Checklist
CONTRIBUTING.md(if present)