Skip to content

perf(kube): shrink binary 15.5MB and drop the k8s.io/kubernetes dependency#989

Closed
dangrondahl wants to merge 2 commits into
mainfrom
reduce-k8s-binary-size
Closed

perf(kube): shrink binary 15.5MB and drop the k8s.io/kubernetes dependency#989
dangrondahl wants to merge 2 commits into
mainfrom
reduce-k8s-binary-size

Conversation

@dangrondahl

Copy link
Copy Markdown
Contributor

What

The k8s environment reporter only lists pods and namespaces, but the CLI was linking the entire Kubernetes API surface and, via tests, the whole Kubernetes monorepo. Two focused, behaviour-preserving changes:

  1. Use the dynamic client instead of the aggregate typed clientset. kubernetes.NewForConfig transitively imports the global client-go scheme, which registers all 53 API groups and linked the entire k8s.io/api package (~15MB) into the binary. The dynamic client has zero dependency on k8s.io/api; unstructured objects are converted to typed corev1.Pod/corev1.Namespace via runtime.DefaultUnstructuredConverter, so all downstream logic and tests are unchanged. Only k8s.io/api/core/v1 remains linked.

  2. Drop the test-only k8s.io/kubernetes monorepo dependency. The kube test used it solely for e2epod.WaitForPodNameRunningInNamespace. Importing that one helper dragged in the whole monorepo, which forced ~33 replace directives (its staging modules are pinned to v0.0.0 and must be replaced) and dozens of indirect deps. Replaced it with a small local waitForPodRunning poll loop using the typed clientset already created for test setup.

Impact

  • Binary: 83.2MB → 67.8MB (-15.5MB, ~19%). Lighter downloads, CI installs, and the k8s-reporter image.
  • Smaller supply-chain / CVE surface: we no longer ship ~50 unused Kubernetes API groups or the Kubernetes monorepo source tree.
  • go.mod hygiene: require entries 246 → 210, replace directives 35 → 2. The two kept are client-go v1.5.2 → v0.36.0 (MVS still needs it) and the pre-existing, unrelated docker → moby replace.
  • No behaviour change and no CI change. The kind library (hermetic cluster setup) is kept, so CI needs no kind-CLI install.

What was measured and deliberately NOT changed

The _ "k8s.io/client-go/plugin/pkg/client/auth" blank import was measured and left as-is: removing all three providers saved 0 bytes (their only notable dep, golang.org/x/oauth2, is already linked by the GCP/Azure integrations), and keeping it avoids breaking legacy auth-provider kubeconfigs.

Verification

  • go build ./..., go vet ./..., golangci-lint run ./internal/kube/ — all clean
  • go mod verify — all modules verified
  • Full KIND integration suite passed against a live cluster for both changes (list pods cluster-wide + per-namespace, namespace filtering, throttling, and the new wait helper)

Follow-ups (not in this PR)

Largest remaining binary contributors are non-k8s: aws-sdk-go-v2 (~11.5MB) and open-policy-agent/opa (~4MB).

The k8s environment reporter only lists pods and namespaces, but using the
aggregate typed clientset (kubernetes.NewForConfig) transitively imported the
global client-go scheme, which registers all 53 API groups and linked the
entire k8s.io/api package (~15MB) into the binary.

Switch production code to the dynamic client, which has no dependency on
k8s.io/api, and convert unstructured objects to typed corev1.Pod/Namespace via
runtime.DefaultUnstructuredConverter so downstream logic and tests are
unchanged. Only k8s.io/api/core/v1 remains linked now.

The test suite keeps a separate typed clientset for cluster setup/teardown;
test-only imports do not affect the release binary.

Binary size: 83.2MB -> 67.8MB (-15.5MB, ~18.6%).
The kube test only used k8s.io/kubernetes for e2epod.WaitForPodNameRunningInNamespace.
Importing that package pulled in the entire Kubernetes monorepo, which in turn
forced ~33 replace directives (its staging modules are pinned to v0.0.0 and must
be replaced) and dozens of indirect dependencies.

Replace the single helper with a small local waitForPodRunning poll loop using
the typed clientset already created for test setup. This lets go mod tidy remove
k8s.io/kubernetes and 36 require entries, and drops all staging-module replaces.
The kind library (hermetic cluster setup) is kept, so CI needs no changes.

go.mod: require entries 246 -> 210, replace directives 35 -> 2 (kept client-go
v1.5.2->v0.36.0, which MVS still needs, and the unrelated docker->moby replace).

Verified: go build ./..., go vet ./..., golangci-lint, and the full KIND
integration suite all pass.
@claude

claude Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Claude finished @dangrondahl's task in 4m 45s —— View job


Review: PR #989 — shrink k8s binary size

  • Gather context & read changed files
  • Static analysis (build, vet, mod tidy)
  • Code review — kube.go dynamic-client conversion
  • Code review — test waitForPodRunning helper
  • Post review

This is a clean, well-scoped, behaviour-preserving refactor with a genuinely impressive result (-15.5MB / -19% and 33 fewer replace directives). The PR description is excellent — measurements, what was deliberately not changed, and follow-ups. Nice work. 👏

Static analysis — all clean ✅

Check Result
go build ./... ✅ passes
go vet ./internal/kube/... ✅ no findings (also type-checks the test file)
go mod tidygit diff go.mod go.sum ✅ no changes — already tidy
Remaining replace directives ✅ just the two expected (client-go → v0.36.0, docker → moby)
External API impact ✅ only NewK8sClientSet / GetPodsData are used outside the package (cmd/kosli/snapshotK8S.go); the removed embedded *Clientset field had no external users

Findings (all non-blocking)

I left three inline comments:

  1. kube.go:140 — performance tradeoff (worth confirming, not fixing). The dynamic client always uses JSON (not protobuf), and DefaultUnstructuredConverter.FromUnstructured goes via map[string]interface{} + reflection. On the large clusters this path is tuned for (the QPS/burst comments), that's more wire bytes and more CPU/allocs per pod than the old typed path. Acceptable given the size win — just worth spot-checking against your largest real cluster (KIND won't surface it) and maybe leaving a one-line note so nobody "optimizes" it back later.

  2. kube.go:344 — symmetry nit. Pods get an unstructuredToPods helper; namespaces use an inline loop. A matching unstructuredToNamespaces (or a small generic converter) would read more consistently. Purely cosmetic.

  3. kube_test.go:337 — honor ctx cancellation. The waitForPodRunning poll loop ignores the passed ctx and would spin until the full timeout if the context were cancelled. Harmless today (caller uses context.Background()), but a cheap case <-ctx.Done(): return ctx.Err() future-proofs it. Suggestion posted inline.

Things done well

  • Correctness of the conversion is integration-coveredgetK8sClient now returns the dynamic-backed client, so the KIND suite exercises the real unstructured→typed path (cluster-wide list, per-namespace, filtering, throttling).
  • The waitForPodRunning replacement is a faithful, minimal stand-in: it checks before the first sleep, treats transient Get errors as "keep polling" (correct for the post-create NotFound window), and fails fast on terminal phases.
  • Good use of an unexported dynamicClient field + package-level GVRs with an explanatory comment on why the dynamic client was chosen.
  • Test/production separation is clear: the typed clientset is confined to setup/teardown with a comment saying so.

Verdict: LGTM. None of the above blocks merge; #1 is the only one I'd consciously decide on (confirm-on-large-cluster), the other two are polish.

Note: I couldn't run the KIND integration suite here (needs a live cluster + Docker), so I relied on your reported run for runtime verification.
· branch reduce-k8s-binary-size

Comment thread internal/kube/kube.go
if len(filter.IncludeNames) == 0 && len(filter.IncludeNamesRegex) == 0 &&
len(filter.ExcludeNames) == 0 && len(filter.ExcludeNamesRegex) == 0 {
list, err := clientset.Clientset.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{})
ul, err := clientset.dynamicClient.Resource(podsGVR).Namespace("").List(context.Background(), metav1.ListOptions{})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance tradeoff worth noting (non-blocking). The dynamic client always negotiates JSON, not protobuf, and runtime.DefaultUnstructuredConverter.FromUnstructured decodes each object into a map[string]interface{} and then reflection-converts it into a typed Pod. Versus the old typed clientset (protobuf + direct decode), this means larger wire payloads and noticeably more CPU/allocations per pod.

The QPS/burst tuning comments above (lines 103-106) say this path targets large clusters, so a cluster-wide list of tens of thousands of pods is exactly where the extra cost shows up. Given the goal here is binary size and the behavior is preserved, this is an acceptable trade — but it'd be good to confirm the memory/latency profile on your largest real cluster (the KIND suite won't surface it), and maybe leave a one-line comment acknowledging the JSON/reflection cost so a future reader doesn't "optimize" it back to the typed client unaware of the size implication.

Comment thread internal/kube/kube.go
}

return namespaces.Items, nil
namespaces := make([]corev1.Namespace, 0, len(ul.Items))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor symmetry nit: pods get a reusable unstructuredToPods helper, but namespaces are converted with an inline loop here. Consider a small unstructuredToNamespaces helper (or a generic converter) so both resources read the same way and the conversion error wrapping stays consistent. Purely cosmetic — not blocking.

Comment on lines +337 to +341
select {
case <-deadline:
return fmt.Errorf("timed out after %s waiting for pod %s/%s to be running", timeout, namespace, podName)
case <-ticker.C:
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, correct helper — it checks the pod before the first sleep and treats transient Get errors (e.g. the brief NotFound right after create) as "keep polling", which is right.

One tiny robustness gap: the ctx passed in is never observed. If it's ever cancelled, the loop ignores the resulting Get error and keeps spinning every 2s until the full timeout elapses. Today the only caller uses context.Background(), so it can't happen — but a future caller with a real context would be surprised. Cheap to make it honor cancellation:

Suggested change
select {
case <-deadline:
return fmt.Errorf("timed out after %s waiting for pod %s/%s to be running", timeout, namespace, podName)
case <-ticker.C:
}
select {
case <-ctx.Done():
return ctx.Err()
case <-deadline:
return fmt.Errorf("timed out after %s waiting for pod %s/%s to be running", timeout, namespace, podName)
case <-ticker.C:
}

@mbevc1

mbevc1 commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Thanks @dangrondahl , but I wonder if this is worth doing. Especially with removing structured types feels like we're taking more on ourselves not using existing libraries, which might already solved this and also have good performance optimisations in place 🤔

@dangrondahl

Copy link
Copy Markdown
Contributor Author

Thanks @dangrondahl , but I wonder if this is worth doing. Especially with removing structured types feels like we're taking more on ourselves not using existing libraries, which might already solved this and also have good performance optimisations in place 🤔

Thanks @mbevc1
I agree and actually also why I hesitated on opening a PR in the fist place. So this was just to show what the changes actually was.
It comes at a price and the full trade-off is not clear to me.

Happy to close it with a "won't fix" or "not now" decision. 😊

@dangrondahl

Copy link
Copy Markdown
Contributor Author

Tradeoff exists! Chose to stay weith what we know for now.

@dangrondahl dangrondahl closed this Jul 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants