perf: pre-size reads and skip redundant copy to cut sync memory#9
Merged
Conversation
Syncing a large source (e.g. the ~75MB MaxMind GeoLite2 mmdb that getlantern/geo pulls through FromTarGz(FromWeb(...)) into a ToChannel + ToFile pair) was buffering the whole payload with io.ReadAll up to three times per sync — once extracting the tarball, once in syncOnce, and once more in the byteChannel sink. io.ReadAll grows its buffer by repeated reallocation, so each of those reads churns through a sequence of ever-larger backing arrays before settling, and the byte-channel read was an entirely redundant second full copy of data the Runner already held. On 1GB proxy/VPS hosts this transient churn dominated RSS and tripped the "host memory >85%" alert. Two changes: - readAll(): reads into a buffer pre-sized from the reader's known length (HTTP Content-Length, archive entry size, or an in-memory reader's Len()), turning a multi-realloc read into a single allocation. webSource and tarGzSource now return size-aware readers so syncOnce benefits too. - bytesConsumer fast-path: byteChannel takes the payload the Runner already buffered directly instead of reading it into a second copy. The slice is read-only and shared across sinks (documented on bytesConsumer). No public API change. Adds read.go + tests.
There was a problem hiding this comment.
Pull request overview
This PR reduces sync-time memory churn by pre-sizing large reads when a reader’s remaining size is known and by avoiding a redundant full-buffer copy when sending the already-buffered payload to channel sinks.
Changes:
- Add
readAll()/knownSize()helpers to pre-size buffers and introduce an internalbytesConsumerfast-path for sinks. - Surface size information from
webSource(HTTPContent-Length) andtarGzSource(archive entry size) so Runner reads can be pre-sized. - Update
byteChannelto consume the already-buffered payload directly and add tests around the new read helper behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| source.go | Wrap HTTP bodies / extracted archive entries in size-aware readers to enable pre-sized reads. |
| sink.go | Route byteChannel through readAll() and add a no-copy bytesConsumer path. |
| read.go | Introduce readAll(), knownSize(), and internal bytesConsumer optimization. |
| read_test.go | Add tests for readAll() pre-sizing and knownSize() behavior. |
| keepcurrent.go | Switch Runner to readAll() and use bytesConsumer when available to skip redundant copies. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…/comment fixes
- readAll: cap pre-allocation at 256 MiB and convert to int, so a bogus
size can't panic make() (falls back to io.ReadAll). The original
n+bytes.MinRead did compile — bytes.MinRead is an untyped constant — but
the guard hardens against hostile/garbage sizes.
- tarGzSource: reuse readAll(sizedReadCloser{...}) instead of a bespoke
bytes.Buffer, inheriting the same pre-size + overflow handling.
- bytesConsumer doc: state plainly that the slice is read-only (Runner may
share it across sinks); consumers must copy to retain/mutate.
- read_test: fix stale countingReader comment on unsizedReader.
…rnal, test cap fallback - webSource.Fetch now closes resp.Body on the 304 and non-200 early returns so connections are released for keep-alive reuse. - bytesConsumer doc: state it's an internal optimization for keepcurrent's own sinks; the unexported method means external sinks can't implement it and always use Sink.UpdateFrom. - Add TestReadAllFallsBackWhenSizeExceedsCap covering a reader that reports a size > maxPreAlloc while delivering a small payload.
- Remove the bytesConsumer fast-path entirely. It forwarded the Runner's shared buffer to an external channel, dropping ToChannel's independent- ownership guarantee (a receiver mutating or concurrently reading the slice could corrupt other sinks). byteChannel.UpdateFrom now copies again — but via readAll, so it's a single pre-sized allocation. The memory spike came from io.ReadAll's doubling-realloc churn, which pre-sizing fixes; the copy-elimination was marginal and not worth the semantic change. - webSource.Fetch: drain bodies to EOF before closing on the 304 / non-200 paths (via drainClose) so net/http can actually reuse the connection. Comment now matches behavior.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
source.go:55
- The 304 / non-200 early-return paths only call resp.Body.Close(), but the comment claims “Drain+close so the connection can be reused for keep-alive”. In Go’s net/http transport, connections are generally only eligible for reuse when the response body is read to EOF and closed. Either drain the body before closing (to match the comment and enable keep-alive reuse) or adjust the comment to avoid implying reuse.
}
if s.getETag() != "" {
req.Header.Add("If-None-Match", s.etag)
}
resp, err := s.client.Do(req)
if err != nil {
This was referenced Jun 16, 2026
reflog
added a commit
to getlantern/http-proxy
that referenced
this pull request
Jun 16, 2026
Pulls in getlantern/keepcurrent#9, which pre-sizes the reads used to sync the MaxMind GeoLite2 mmdb (~75MB) through geo's FromTarGz(FromWeb(...)) pipeline. The old io.ReadAll grew its buffer by repeated reallocation, churning through a chain of ever-larger backing arrays per sync; on 1GB hosts that transient spike dominated RSS and tripped the host-memory>85% alert. http-proxy's keepcurrent was pinned to a 2022 commit, so the bump also carries the transitive archiver/v3 -> mholt/archives migration and the minor golang.org/x/* upgrades it brings in. keepcurrent stays indirect; no first-party code changes. Co-authored-by: Ilya Yakelzon <reflog@getlantern.org>
reflog
added a commit
to getlantern/lantern-box
that referenced
this pull request
Jun 16, 2026
Pulls in getlantern/keepcurrent#9, which pre-sizes the reads used to sync the MaxMind GeoLite2 mmdb (~75MB) through geo's FromTarGz(FromWeb(...)) pipeline. The old io.ReadAll grew its buffer by repeated reallocation, churning through a chain of ever-larger backing arrays per sync; on 1GB VPS that transient spike dominated RSS and tripped the host-memory>85% alert. No code changes — keepcurrent stays an indirect dep. Co-authored-by: Ilya Yakelzon <reflog@getlantern.org>
reflog
added a commit
that referenced
this pull request
Jun 16, 2026
…#10) * perf: pre-size fileSource reads too (close the InitFrom churn) Follow-up to #9. That PR made webSource and tarGzSource size-aware, but fileSource was left returning a bare *os.File, so readAll's knownSize() returned false for file reads and fell back to io.ReadAll's doubling growth. geo loads its cached ~75MB MaxMind mmdb from disk via Runner.InitFrom(FromFile(path)) at startup, so that read churned through the 75->60->48->38->30->24MB realloc staircase on every boot (confirmed via alloc_space pprof on a lantern-box v0.0.91 box: io.ReadAll 373MB under keepcurrent.readAll <- syncOnce <- InitFrom <- geo.FromWeb). Make fileSource.Fetch return a sizedReadCloser carrying the stat size when there's no preprocessor, so the read is a single pre-sized allocation. The preprocessor path keeps the io.ReadAll fallback (a preprocessor can change the byte count). Also close the file on the early-return/error paths, which previously leaked the fd. go build/vet/test green; adds TestFileSourceIsSizeAware. * review: fix inverted fileSource modified-since check, close-couple preprocessor stream, hermetic web test - fileSource.Fetch returned ErrUnmodified exactly when the file WAS newer than the cutoff (and fetched when unchanged). Invert to match the Source.Fetch contract and byteSource.Fetch. - couple a preprocessor's returned stream's Close to the underlying *os.File so io.NopCloser-style preprocessors don't leak the descriptor. - add TestFileSourceUnmodified covering the ErrUnmodified path. - make TestUpdateFromWeb hermetic via httptest; the live httpbin.org dependency intermittently hung and timed out CI. --------- Co-authored-by: Ilya Yakelzon <reflog@getlantern.org>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Several 1 GB proxy/VPS hosts were tripping the SigNoz "Host memory usage high" alert (10 m avg
system.memory.usage> 85%). Heap pprof from bothhttp-proxyandlantern-boxbinaries pointed at the same culprit — the dominant live allocation wasio.ReadAll, and every byte of it came from keepcurrent:The traces showed multiple large retained generations (75 + 60 + 48 + 38 + 30 + 24 MB …).
The source is
getlantern/geo, which syncs the MaxMind GeoLite2 mmdb (~75 MB) viakeepcurrent.FromTarGz(keepcurrent.FromWeb(dbURL), name)into two sinks (ToChannel(chDB)+ToFile(path)). Per sync that payload was buffered withio.ReadAllup to three times:tarGzSource.Fetch),syncOnce(data, _ = ioutil.ReadAll(rc)),byteChannel.UpdateFrom, re-reading thebytes.Readerthe Runner already handed it into a brand-new slice.io.ReadAllgrows its buffer by repeated reallocation, so each read churns through a sequence of ever-larger backing arrays (the 75/60/48/38… generations) before settling — and step 3 is a wholly redundant full copy. On a 1 GB box this transient churn dominates RSS.Fix
No public API change.
readAll()(newread.go) — reads into a buffer pre-sized from the reader's known length, turning a multi-realloc read into a single allocation. Recognizes:sizedReadCloser(HTTPContent-Length, archive entry size — constructed internally),Len()(*bytes.Reader,*bytes.Buffer,*strings.Reader).io.ReadAllwhen size is unknown.webSource.FetchsurfacesContent-Length;tarGzSource.Fetchpre-sizes extraction from the archive entry's uncompressed size and returns a size-aware reader, sosyncOnce's read is pre-sized too.bytesConsumerfast-path —byteChannelreceives the payload the Runner already buffered directly (updateFromBytes) instead of reading it into a second copy. The slice is read-only and shared across sinks (documented on the interface). This removes copy #3 entirely.Net effect for the geo case: three doubling-realloc reads of a 75 MB payload → one extraction read + one pre-sized
syncOnceread, with the channel sink reusing that buffer.Test
go build ./...,go vet ./...clean.read_test.gocovering pre-sizing for sized readers, fallback for unsized readers, andknownSize.Rollout
Downstream bumps to follow once merged:
getlantern/lantern-boxandgetlantern/http-proxy-lantern(both pin keepcurrent indirectly).