fix(rbac): add pods delete to operator ClusterRole (makes reseedStandby work) #277
Merged
reseedStandby/reconcileStaleReplicas (#205/#220) deletes the pod and PVC of a stuck standby (for example, timeline divergence after a primary promotion) and re-clones it with a fresh pg_basebackup. The chart RBAC omitted delete from the pods verbs (only get/list/watch/patch were granted) → the pod-deletion step of reseed was forbidden → timeline-stuck replicas were left stuck indefinitely (observed live: postgres-prod shard-0-1 stuck for 8h+). PVC delete was already granted, but the missing pod delete made reseed ineffective. Live RCA: kubectl auth can-i delete pods (operator SA) = no. With this fix, reseed works end to end. Verified that helm template renders verbs=[get,list,watch,patch,delete].
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
KeiaiLab-PHIL added a commit that referenced this pull request Jun 23, 2026
…ashloop prevention) (#278)
INC-2026-06-23, the live blocker behind the 7-day postgres-prod outage: the operator status held a stale view in which the dead old primary (shard-0-1) was still the primary → PRIMARY_ENDPOINT pointed at the dead node → every node attempted pg_rewind against the dead node in PrepareRestartedPrimaryAsStandbyWithRewind → DNS/connection failures and an endless crashloop. Even the actual primary (0-0) saw PRIMARY_ENDPOINT (= itself or the dead node), tried to turn itself into a standby, and crashlooped on a connection-refused self-rewind.
Fix: add RejoinOptions.SelfEndpoint. When PrimaryEndpoint == SelfEndpoint, the operator has designated this node as primary, so the standby conversion is skipped and the marker is removed. This automates in code what could previously be recovered live only by manually removing the marker (.keiailab-restart-primary-as-standby) and standby.signal.
Regression test TestPrepareRestartedPrimaryAsStandby_SelfPrimarySkips PASS; go build/vet/test ok.
Related: #205/#220 reseed + postmaster.pid (#276) + RBAC (#277). This is the final fix for the live blockers in the same INC chain.
Co-authored-by: Support <support@masblue.studio>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
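The self-primary check described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: only the field names RejoinOptions.PrimaryEndpoint and RejoinOptions.SelfEndpoint come from the commit message, while the helper shouldSkipStandbyConversion, the empty-endpoint guard, and the endpoint strings are assumptions made for the example.

```go
package main

import "fmt"

// RejoinOptions mirrors only the two fields named in the commit message.
// Any other fields of the real struct are omitted here.
type RejoinOptions struct {
	PrimaryEndpoint string // endpoint the operator currently reports as primary
	SelfEndpoint    string // endpoint of the node running this code
}

// shouldSkipStandbyConversion is a hypothetical helper: it reports whether the
// node should skip turning itself into a standby because the operator already
// names this node as the primary.
// Assumption: an empty SelfEndpoint means "unknown", so it never triggers a skip.
func shouldSkipStandbyConversion(opts RejoinOptions) bool {
	return opts.SelfEndpoint != "" && opts.PrimaryEndpoint == opts.SelfEndpoint
}

func main() {
	cases := []RejoinOptions{
		{PrimaryEndpoint: "shard-0-0", SelfEndpoint: "shard-0-0"}, // node is the designated primary
		{PrimaryEndpoint: "shard-0-1", SelfEndpoint: "shard-0-0"}, // another node is primary
	}
	for _, c := range cases {
		fmt.Printf("primary=%s self=%s skip=%t\n",
			c.PrimaryEndpoint, c.SelfEndpoint, shouldSkipStandbyConversion(c))
	}
}
```

When the helper returns true, the caller would also remove the marker file instead of attempting pg_rewind, as the commit message describes; that step is left out of the sketch because it touches the filesystem.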
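Root cause (live postgres-prod replica HA not restored)
reseedStandby/reconcileStaleReplicas (#205/#220) recovers a stuck standby (for example, timeline divergence after a primary promotion) by deleting its pod and PVC and re-cloning it with a fresh pg_basebackup. However, the pods verbs in the chart's templates/rbac.yaml were [get,list,watch,patch], with delete missing. PVC delete was granted but pod delete was not, so reseed was forbidden at the pod-deletion step and timeline-stuck replicas were left stuck indefinitely.
Live RCA:
kubectl auth can-i delete pods (operator SA) = no. After the primary promotion, postgres-prod shard-0-1 was stuck in "starting up" at the end of timeline 1 for 8h+, with 0 reseed events.
Fix
Add delete to the pods verbs. The operator code (reseed) already existed; only the permission was missing (no image rebuild needed, chart-only).
Verify
helm template renders verbs=[get,list,watch,patch,delete].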
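The permission gap described above can be illustrated with a small, self-contained check. This is only a sketch under stated assumptions: missingVerbs is a hypothetical helper, not part of the operator or the chart, and the list of verbs that reseed needs on pods (get and delete) is an assumption inferred from the description, not taken from the code.

```go
package main

import "fmt"

// missingVerbs returns the verbs in need that are not covered by granted,
// preserving the order of need. A "*" entry in granted covers every verb,
// matching the RBAC wildcard.
func missingVerbs(granted, need []string) []string {
	have := make(map[string]bool, len(granted))
	for _, v := range granted {
		have[v] = true
	}
	var missing []string
	for _, v := range need {
		if !have[v] && !have["*"] {
			missing = append(missing, v)
		}
	}
	return missing
}

func main() {
	// pods verbs before and after this PR, as stated in the description.
	before := []string{"get", "list", "watch", "patch"}
	after := []string{"get", "list", "watch", "patch", "delete"}

	// Assumption: verbs the reseed path needs on pods.
	need := []string{"get", "delete"}

	fmt.Println("missing before fix:", missingVerbs(before, need)) // missing before fix: [delete]
	fmt.Println("missing after fix: ", missingVerbs(after, need))  // missing after fix:  []
}
```

A check of this kind could run against the rendered helm template output in CI, so that a verb required by the reseed path cannot be dropped from the ClusterRole unnoticed.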
Refs #205 #220 (reseed): closes the gap where the missing RBAC verb made reseed ineffective.
Co-Authored-By: Claude Opus 4.8 (1M context) noreply@anthropic.com