
fix(rbac): operator ClusterRole pods delete (makes reseedStandby work) #277

Merged
KeiaiLab-PHIL merged 1 commit into
mainfrom
fix/operator-rbac-pods-delete-reseed
Jun 22, 2026


@KeiaiLab-PHIL


Root cause (HA not restored for the live postgres-prod replica)

reseedStandby/reconcileStaleReplicas (#205/#220) recovers a stuck standby (for example, one with timeline divergence after a primary promotion) by deleting its pod and PVC and then re-cloning it with a fresh pg_basebackup. However, the pods verbs in the chart's templates/rbac.yaml were [get,list,watch,patch], with delete missing. PVC delete was granted but pod delete was not, so reseed hit a forbidden error at the pod deletion step and the timeline-stuck replica was left in place indefinitely.

Live RCA: kubectl auth can-i delete pods (as the operator SA) = no. After the primary promotion, postgres-prod shard-0-1 sat stuck in "starting up" at the end of timeline 1 for 8h+, with 0 reseed events.

Fix

Add delete to the pods verbs. The operator code (reseed) already existed; only the permission was missing (no image rebuild needed, chart-only change).

Verify

helm template → ClusterRole pods verbs=[get,list,watch,patch,delete] ✓, 0 render errors
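The verb check above could also be kept as a small regression guard. The following is a minimal Go sketch, not the project's code: `PolicyRule` is a pared-down, hypothetical stand-in for a rendered ClusterRole rule, and `missingVerbs` is an illustrative helper. It ignores apiGroups and resourceNames, so it only approximates how Kubernetes evaluates RBAC.

```go
package main

import "fmt"

// PolicyRule is a hypothetical, simplified stand-in for one ClusterRole rule.
// It keeps only the two fields this check needs.
type PolicyRule struct {
	Resources []string
	Verbs     []string
}

// contains reports whether list holds want or the RBAC wildcard "*".
func contains(list []string, want string) bool {
	for _, item := range list {
		if item == want || item == "*" {
			return true
		}
	}
	return false
}

// missingVerbs returns the required verbs that no rule covering the
// given resource grants. A nil result means every verb is granted.
func missingVerbs(rules []PolicyRule, resource string, required []string) []string {
	var granted []string
	for _, rule := range rules {
		if contains(rule.Resources, resource) {
			granted = append(granted, rule.Verbs...)
		}
	}
	var missing []string
	for _, verb := range required {
		if !contains(granted, verb) {
			missing = append(missing, verb)
		}
	}
	return missing
}

func main() {
	// Verb lists as quoted in this PR: before and after the fix.
	before := []PolicyRule{{Resources: []string{"pods"}, Verbs: []string{"get", "list", "watch", "patch"}}}
	after := []PolicyRule{{Resources: []string{"pods"}, Verbs: []string{"get", "list", "watch", "patch", "delete"}}}
	required := []string{"get", "list", "watch", "patch", "delete"}

	fmt.Println("before:", missingVerbs(before, "pods", required))
	fmt.Println("after: ", missingVerbs(after, "pods", required))
}
```

Running a check of this kind against the rendered chart in CI would flag a dropped verb before it reaches a cluster, instead of relying on a live kubectl auth can-i probe after the fact.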

Refs #205 #220 (reseed): closes the gap where the missing RBAC verb was neutralizing reseed.

Co-Authored-By: Claude Opus 4.8 (1M context) noreply@anthropic.com

reseedStandby/reconcileStaleReplicas (#205/#220) deletes the pod and PVC of a stuck standby (for example, one with timeline divergence after a primary promotion) and re-clones it with a fresh pg_basebackup. The chart RBAC was missing delete in the pods verbs (only get/list/watch/patch) → the pod deletion step of reseed was forbidden → the timeline-stuck replica was left in place indefinitely (measured live: postgres-prod shard-0-1 stuck for 8h+). PVC delete was already granted, but the missing pod delete neutralized reseed.

Live RCA: kubectl auth can-i delete pods (as the operator SA) = no. With this fix, reseed works end to end. Render verified with helm template: verbs=[get,list,watch,patch,delete].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@KeiaiLab-PHIL KeiaiLab-PHIL merged commit cd7c4f8 into main Jun 22, 2026
2 checks passed
@KeiaiLab-PHIL KeiaiLab-PHIL deleted the fix/operator-rbac-pods-delete-reseed branch June 22, 2026 21:56
KeiaiLab-PHIL added a commit that referenced this pull request Jun 23, 2026
…ashloop prevention) (#278)

INC-2026-06-23, the live blocker behind the 7-day postgres-prod outage: the operator status held a stale view of the dead old primary (shard-0-1) as primary → PRIMARY_ENDPOINT=dead node → every node attempted pg_rewind against the dead node in PrepareRestartedPrimaryAsStandbyWithRewind → DNS/connection failures and an endless crashloop. Even the actual primary (0-0) saw PRIMARY_ENDPOINT (= itself or the dead node), tried to turn itself into a standby, and crashlooped on a connection-refused self-rewind.

Fix: add RejoinOptions.SelfEndpoint. When PrimaryEndpoint==SelfEndpoint, the operator has designated this node as primary, so skip the standby conversion and remove the marker. This automates in code a recovery that, on the live cluster, had only been possible by manually removing the marker (.keiailab-restart-primary-as-standby) and standby.signal. Regression test TestPrepareRestartedPrimaryAsStandby_SelfPrimarySkips PASS; go build/vet/test ok.

Related: the final fix for the live blockers in the same INC chain as #205/#220 reseed, postmaster.pid (#276), and RBAC (#277).

Co-authored-by: Support <support@masblue.studio>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>