feat(git): instrument snapshot serve backend and server-side TTFB #349

Draft
worstell wants to merge 1 commit into main from worstell/snapshot-serve-ttfb-backend-metrics
@worstell

Contributor

Snapshot-lookup latency (the time a blox client waits before a snapshot download begins) is bimodal: small repos see millisecond-scale TTFB, while giant repos (e.g. an 8 GB snapshot) see hundreds of seconds. The existing snapshot_serve_duration_seconds metric covers the whole request, so it cannot tell us whether the time is spent in cache lookup, waiting on the first chunk from the L2 (S3) backend, or in the download itself, nor which tier served the bytes.

This adds instrumentation to disambiguate, without changing serving behavior:

  • Tiered.Open annotates the serving tier via an internal X-Cachew-Served-By header (disk, s3, ...); serve handlers read it for a low-cardinality backend metric/span label. Not forwarded to clients.
  • New metrics:
    • cachew.git.snapshot_serve_ttfb_seconds — server-side time-to-first-byte (handler entry → first response byte), by source/backend/repo.
    • cachew.git.snapshot_cache_open_duration_seconds — cache Open (lookup/metadata/reader creation) before streaming, by backend/status/repo.
  • serveReaderFast now measures TTFB: recorded immediately for files served via sendfile, and at the first Read for stream readers (e.g. an S3 range reader whose first Read blocks on the initial chunk).
  • Snapshot serves and spans now carry backend, ttfb_seconds, and mirror_head_seconds attributes; existing recordSnapshotServe gains a backend label.

Once deployed, this lets us confirm whether giant-repo lookup latency is dominated by the S3 L2 tier or the disk L1 tier, and by cache-open time or first-chunk time, so we can pick the right follow-up (e.g. keeping hot giant snapshots on L1).

Validation: go build ./..., go test ./internal/cache/... ./internal/strategy/git/..., lint clean except pre-existing gosec warnings on untouched lines.

Add a backend dimension (disk/s3/...) and server-side time-to-first-byte
to snapshot serves so snapshot-lookup latency can be attributed to a cache
tier and split into cache-open vs first-chunk time.

- Tiered.Open annotates the serving tier via an internal X-Cachew-Served-By
  header; serve handlers read it for the backend label.
- New metrics: cachew.git.snapshot_serve_ttfb_seconds and
  cachew.git.snapshot_cache_open_duration_seconds.
- serveReaderFast measures TTFB (sendfile for files, first-Read for stream
  readers) and snapshot serves/spans carry backend + ttfb attributes.

Amp-Thread-ID: https://ampcode.com/threads/T-019ef6a9-a407-7389-bc43-001405e3ae9e
Co-authored-by: Amp <amp@ampcode.com>