fix(instance): skip standby-prep when PRIMARY_ENDPOINT is self (crashloop fix) #278
Merged
…ashloop prevention) INC-2026-06-23 postgres-prod 7-day outage, live blocker: the operator status was stale and still reported the dead former primary (shard-0-1) as primary → PRIMARY_ENDPOINT=dead node → every node attempted pg_rewind against the dead node in PrepareRestartedPrimaryAsStandbyWithRewind → DNS/connection failure, infinite crashloop. Even the actual primary (0-0) saw PRIMARY_ENDPOINT (= itself or the dead node), tried to convert itself into a standby, and crashlooped on a self-rewind that failed with connection refused. Fix: add RejoinOptions.SelfEndpoint; when PrimaryEndpoint == SelfEndpoint, the operator has designated this node as primary, so skip the standby conversion and remove the marker. This automates in code a recovery that previously required manually removing the marker (.keiailab-restart-primary-as-standby) and standby.signal on the live system. Regression test TestPrepareRestartedPrimaryAsStandby_SelfPrimarySkips PASS; go build/vet/test ok. Related: #205/#220 reseed + postmaster.pid (#276) + RBAC (#277); this is the final fix for the live blocker in the same INC chain. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Root cause (postgres-prod 7-day outage live blocker)
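The operator status was stale and still reported the dead former primary (shard-0-1) as primary → PRIMARY_ENDPOINT was injected pointing at the dead node → every node attempted pg_rewind against the dead node in PrepareRestartedPrimaryAsStandbyWithRewind → DNS/connection failure → infinite crashloop. Even the node holding the actual primary data (0-0) still had the marker (.keiailab-restart-primary-as-standby), so it tried to convert itself into a standby and crashlooped on a self-rewind that failed with connection refused.
Live recovery was only possible by manually removing the marker and standby.signal from PGDATA (the data, "in production", was undamaged).
Fix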
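Add RejoinOptions.SelfEndpoint. When PrimaryEndpoint == SelfEndpoint, the operator has designated this node as primary, so standby-prep is skipped and the marker is removed (the node boots as primary). This automates the manual recovery in code.
Verify
Live: once a pg image built with this fix is deployed, a node whose PRIMARY_ENDPOINT became itself through failover boots as primary without attempting standby conversion → the 7-day outage cannot recur.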
Related INC chain: postmaster.pid (#276) + RBAC pods delete (#277) + reseed (#205/#220) + this self-primary guard.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>