chore(release): bump component versions for 26.07 #1357

Merged
michael-balint merged 2 commits into master from dholt/release-26.07-version-bumps
Jul 2, 2026

dholt commented Jul 2, 2026

Contributor

Summary

  • Bump component default versions for the 26.07 release train:
    • Slurm 25.11.6 → 26.05.1 (current upstream stable line from SchedMD)
    • GPU Operator chart v26.3.1 → v26.3.3 (default driver branch unchanged at 580.126.20)
    • Kubernetes GPU device plugin and GPU feature discovery charts 0.19.1 → 0.19.3
    • MIG manager (mig-parted) packages v0.14.1 → v0.14.2
    • Network Operator 26.1.1 → 26.4.0
    • Prometheus v3.11.3 → v3.13.0, Alertmanager v0.32.1 → v0.33.0, Grafana 13.0.1 → 13.1.0, kube-prometheus-stack chart 85.0.3 → 87.5.1
    • Spack v1.1.1 → v1.2.0
  • Kubespray (v2.31.0), node exporter (v1.11.1), ingress-nginx chart (4.15.1), registry (3.1.1), and the MAAS role pin are already current and unchanged.
  • Make the DEEPOPS_VERSION debug output in scripts/common.sh meaningful: default it to the checkout's git describe --tags --always with an unknown fallback, keep the config/env.sh override, and document the variable in config.example/env.sh.
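
For illustration, the version derivation described in the last bullet could look roughly like the sketch below. The function name `deepops_version` and its optional directory argument are hypothetical and not taken from the PR; the actual code in scripts/common.sh may be structured differently.

```shell
#!/usr/bin/env bash
# Sketch only: deepops_version is a hypothetical helper, not the PR's code.
deepops_version() {
    local root="${1:-.}"

    # An explicit DEEPOPS_VERSION (for example, exported from config/env.sh)
    # takes precedence over anything derived from git.
    if [ -n "${DEEPOPS_VERSION:-}" ]; then
        printf '%s\n' "${DEEPOPS_VERSION}"
        return 0
    fi

    # Otherwise describe the checkout. If git is missing, the directory is not
    # a repository, or there are no commits, fall back to "unknown".
    git -C "${root}" describe --tags --always 2>/dev/null || printf 'unknown\n'
}
```

Calling `deepops_version` from inside a checkout prints the nearest tag description (or an abbreviated commit hash when no tag is reachable), while `DEEPOPS_VERSION=custom deepops_version` prints `custom`.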

Validation

  • git diff --check origin/master..HEAD
  • Full role lint: 0 failures, 0 warnings
  • Ansible syntax-check suite across the standard playbook set
  • Every bumped artifact verified against its primary source before the change: SchedMD tarball and mig-parted release packages return HTTP 200; Docker Hub manifests exist for the Prometheus, Alertmanager, and Grafana tags; the GPU Operator, device plugin, GFD, Network Operator, and kube-prometheus-stack chart versions are served by their Helm repositories.
  • bash -n scripts/common.sh plus behavior check of the version derivation and env override.
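
A minimal sketch of the kind of HTTP 200 check mentioned for the tarball and package artifacts is shown below. The helper names (`is_http_ok`, `http_status`, `check_artifact`) are hypothetical, and the checks actually run for this PR may have used different tooling. Some hosts answer HEAD requests differently from GET, so a failure here is a prompt to look closer rather than proof that an artifact is missing.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of the PR.

# Succeed only when the given status code is exactly 200.
is_http_ok() {
    [ "$1" = "200" ]
}

# Print the final HTTP status code for a URL, following redirects.
# Uses a HEAD request so the artifact body is not downloaded.
http_status() {
    curl --silent --head --location --output /dev/null \
         --write-out '%{http_code}' "$1"
}

# Report OK or FAIL for one artifact URL; returns non-zero on failure.
check_artifact() {
    local url="$1" status
    status="$(http_status "${url}")" || status="000"
    if is_http_ok "${status}"; then
        echo "OK   ${url}"
    else
        echo "FAIL ${url} (status ${status})"
        return 1
    fi
}
```

Usage would be along the lines of `check_artifact "https://example.com/path/to/artifact.tar.bz2"`, where the URL is a placeholder for the real download location.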

Notes

  • Draft until release QA gates complete: fresh Slurm deployment with a GPU job, Kubernetes/GPU Operator deployment with a CUDA pod, MAAS provisioning smoke, and upgrade-path validation. A sanitized QA summary will be attached before review.
  • Slurm moves to a new upstream major line (26.05); the Slurm deployment gate below validates build and single-node GPU scheduling before this leaves draft.
  • kube-prometheus-stack is a multi-major chart bump; monitoring deployment values are exercised during QA.

dholt added 2 commits July 2, 2026 06:44
scripts/common.sh printed an always-empty DEEPOPS_VERSION unless a user
happened to set it in config/env.sh. Default it to the checkout's git tag
description (git describe --tags --always) with an unknown fallback, keep
the env.sh override, and document the variable in config.example/env.sh.

dholt commented Jul 2, 2026

Contributor Author

Live validation summary

All release QA gates for this PR completed on a disposable single-node NVIDIA H100 GPU server running Ubuntu 22.04:

  • Slurm deployment (26.05.1): full playbooks/slurm-cluster.yml run finished with ok=304, changed=108, failed=0. Slurm built from the 26.05.1 source tarball; scontrol reports Version=26.05.1, the node joined the batch partition idle with gres/gpu=1, all Slurm services active, and the GPU job gate passed: srun --gpus=1 nvidia-smi saw the H100 on driver 580.159.03.
  • Kubernetes / GPU Operator (v26.3.3, device plugin 0.19.3): playbooks/k8s-cluster.yml deployed cleanly; GPU Operator, device plugin, and node-feature-discovery pods running; the node advertised one allocatable nvidia.com/gpu; a CUDA smoke pod scheduled with a GPU request reached Succeeded and reported the H100 on driver 580.159.03.
  • MAAS provisioning smoke: MAAS controller install passed (failed=0); services active; the MAAS API responded with version 3.5.12 and full capability list.
  • Upgrade path: the direct 23.08-to-current in-place upgrade remains blocked by Kubespray's Calico minimum-version gate, as found and documented during the 26.05 release; the docs carry the explicit staged-upgrade/redeploy statement, which satisfies this gate.
  • Static gates: full role lint (0 failures), Ansible syntax-check suite, OS compatibility and component audits, and public sanitization all passed; every bumped artifact was verified against its primary upstream source (tarball/package HTTP checks, container manifest checks, Helm repo index checks) before the change.
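
The `failed=0` results quoted above come from the Ansible play recap. As a rough sketch, a gate script could check a captured run log for a clean recap as follows. The function name `recap_is_clean` is hypothetical, and the sketch assumes the usual `ok=N changed=N unreachable=N failed=N` recap layout.

```shell
#!/usr/bin/env bash
# Sketch only: recap_is_clean is a hypothetical helper, not part of the PR.
# Reads ansible-playbook output on stdin.
# Returns 0 if recap lines exist and none report failed or unreachable hosts,
# 1 if any host failed or was unreachable, 2 if no recap lines were found.
recap_is_clean() {
    local recap
    # Keep only the per-host recap lines (they contain an ok=<count> field).
    recap="$(grep -E '(^|[[:space:]])ok=[0-9]+' || true)"
    [ -n "${recap}" ] || return 2

    # Any non-zero failed= or unreachable= count fails the gate.
    if grep -Eq '(failed|unreachable)=[1-9]' <<<"${recap}"; then
        return 1
    fi
    return 0
}
```

For example, after saving the output of a playbooks/slurm-cluster.yml run to a log file, `recap_is_clean < run.log` exits non-zero when any host reports a failure.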

Not covered on this hardware: multi-node scheduling, Network Operator 26.4.0 install behavior (needs relevant NICs), and monitoring stack deployment values for the kube-prometheus-stack chart bump beyond syntax/render checks.

dholt marked this pull request as ready for review July 2, 2026 14:47
dholt requested a review from michael-balint July 2, 2026 14:47
michael-balint merged commit fa1e345 into master Jul 2, 2026
31 checks passed