feat(email): durable email outbox with push-based retry consumer#327
Open
hhvrc wants to merge 15 commits into
Open
feat(email): durable email outbox with push-based retry consumer#327hhvrc wants to merge 15 commits into
hhvrc wants to merge 15 commits into
Conversation
Replace inline transactional email sends with a database-backed outbox so unsent mail is never lost and can be retried/resent. - Add EmailOutboxMessage (Common): jsonb payload, email_type/email_status pg enums, three states (Sending/Sent/Failed), attempt/lease/backoff fields. - EmailOutboxWorker (API hosted service): push-triggered via Redis pub/sub with a 15s poll fallback; claims rows with FOR UPDATE SKIP LOCKED so it is safe in every replica; exponential backoff + jitter, retries on 5xx/429/ transport errors. EmailOutboxRetryPolicy holds the pure, tested state machine. - IEmailOutboxDispatcher: single send path; mints each token lazily at send time (re-minted on resend) so the queue never stores a usable link. - IEmailService now returns EmailSendResult (Sent/Transient/Permanent) and no longer throws on provider failures; Mailjet/SMTP/None updated. - AccountService: activation, OAuth activation, password reset and email change now enqueue durable outbox rows (token hash seeded at insert, re-minted at send) and notify the consumer instead of sending inline. - Migration AddEmailOutbox (tooling-generated); existing token tables unchanged. - Tests: EmailOutboxRetryPolicyTests (10), EmailOutboxMessageTests (2).
|
Ready to review this PR? Stage has broken it down into 12 individual chapters for you: Chapters generated by Stage for commit 93de6d1 on Jun 30, 2026 10:14pm UTC. |
CI caught two runtime failures (build passed, inserts 500'd): - Postgres enum labels for email_type must be snake_case to match Npgsql's default member translator (e.g. password_reset), not camelCase. - Dictionary<string,string> -> jsonb needs Npgsql dynamic JSON; enable it via ConfigureDataSource(EnableDynamicJson) rather than a hand-rolled converter. Both verified against a real Postgres and the full MailTests suite (16/16). Hardening from a self-review pass: - Redis enqueue notification is now best-effort: a publish failure can no longer fail a request whose outbox row already committed (poll delivers it). - Deduplicate the password-reset / email-change validity checks in the dispatcher into a shared helper. - Fix back-off overflow for large attempt counts (cap in seconds before converting to TimeSpan). - Correct stale doc comments (jsonb via dynamic JSON; token hash seeded then re-minted, not nullable). Migration keeps its original id (only the enum labels changed).
…tom worker Move email sending out of the API host and into the Cron host, and hand the delivery machinery (retry back-off, scheduling, crash recovery, dashboard) to Hangfire rather than the bespoke worker. - Delete EmailOutboxWorker (polling loop, FOR UPDATE SKIP LOCKED claim, lease) and EmailOutboxRetryPolicy (+ test); sending is now a per-email Hangfire job EmailOutboxJob.SendAsync with [AutomaticRetry]. Transient failure throws so Hangfire retries; the job marks the row Failed on its last attempt (detected via PerformContext.RetryCount) so an exhausted row never stays stuck Queued. - Keep the transactional outbox: the API writes the Pending row in the business transaction and publishes the Redis nudge, nothing more. - Move the whole email stack (providers, templates, dispatcher, options, .liquid files) API -> Cron so no ASP.NET/DI logic or MailKit/Fluid lands in Common. - New coalescing EmailOutboxConsumer in Cron: Redis push + 1-minute safety sweep share one wake signal, claim Pending rows, and enqueue Hangfire jobs. - EmailStatus -> Pending, Queued, Sent, Failed; drop next_attempt_at, attempt_started_at, attempt_count (Hangfire owns retry state). Migration updated in place; anti-Hangfire docs rewritten for the hybrid design.
…e tests The API integration MailTests asserted real SMTP delivery, but delivery moved to the Cron host, so the API-only test host no longer sent anything and they failed. Boot the Cron pipeline in-process alongside the API (shared Postgres/Redis/Mailpit) so the outbox is actually drained and delivered. - WebApplicationFactory boots cronhost::Program (aliased to avoid the Program clash); its consumer + Hangfire deliver the rows the API enqueues. - Register FrontendOptions in Cron/Program (the dispatcher builds links with it). - Make Hangfire's QueuePollInterval configurable: production keeps the 15s default (no extra DB load); the test sets 0.5s so jobs land in the wait window. - Copy Cron's Liquid templates next to the test binaries (relative-path load). - Add CronOutboxQueueTests: drive the outbox job's state machine through each case (Sent / Skipped / Permanent / Transient-retry / already-terminal no-op).
…batch delivery Make email_outbox the single source of truth for delivery state (status, attempt_count, next_attempt_at, last_error) instead of handing retry and scheduling to Hangfire. The delivery job claims due rows with FOR UPDATE SKIP LOCKED, leases them (Sending), and records the outcome on the row. - Replace the per-email Hangfire job + [AutomaticRetry] and the custom BackgroundService poll loop with EmailOutboxDeliveryJob ([CronJob] every minute) + EmailOutboxNotificationListener (enqueues it on the Redis pending nudge for instant delivery). No background polling loop. - Add Sending/Skipped states; Skipped is a terminal no-op kept distinct from Failed so operators never requeue it. - Self-owned retry: EmailOutboxRetryPolicy (capped-exponential backoff, 11 attempts = parity with the old AutomaticRetry) plus a delivery lease so a crashed send is reclaimed. - xmin optimistic-concurrency token so a lapsed-lease reclaim can't clobber attempt count / terminal state; per-message reload + isolation so one failure can't abort the batch; drain loop for backlogs. - Pure EmailOutboxStateMachine unit-tested in new Cron.Tests project; DB round-trip / enum-mapping coverage in EmailOutboxPersistenceTests.
…deterministic WebApplicationFactory stops the Cron host at WebApplication.Build(), so the recurring [CronJob] email-outbox sweep registered after Build() in Cron/Program.cs never runs under tests. That left email-flow integration tests depending solely on the one-shot Redis "pending" nudge with no periodic fallback, so mail was never drained inside the Mailpit wait window and 8 flow tests failed. Run the same EmailOutboxDeliveryJob.Execute() the production sweep runs, on a fast 250ms loop in the test host. It is concurrency-safe (FOR UPDATE SKIP LOCKED + xmin), so it composes with the still-live notification listener.
Add an opaque per-row coalesce key so only the newest pending request of a kind is
delivered (account activation, password reset, email-change verification), while the
email-change notice keeps delivering every occurrence.
- EmailOutboxMessage.CoalesceKey (nullable) + coalesce_key column/index; the domain
stamps the key via EmailOutboxCoalesceKeys (activation/pwreset/emailchange:{userId}).
- Delivery job skips a row as Skipped("Superseded by a newer request") when a strictly
newer sibling shares its key; null key never coalesces. Fixes stale redrive double-sends
and out-of-order delivery without requiring delivery ordering.
- Per-type EmailOutboxMessage factories (ForAccountActivation/ForPasswordReset/
ForEmailVerification/ForEmailChangeNotice) own the (type, payload key, coalesce key)
triple so call sites can't drift; AccountService call sites collapsed to one line each.
- Rewrite the two sibling-request mail tests to assert newest-wins delivery (older row
Skipped) and preserve completion-time stamp-rotation invalidation via atomic DB seeding.
- Document the outbox design in docs/email-outbox-design.md.
…n instead of xmin Guard the deferred terminal write with attempt_count (already incremented on every claim) instead of the row's xmin: mark it IsConcurrencyToken so the write only conflicts on an actual lease reclaim, with no false-conflict surface from unrelated updates to the row. The AddEmailOutbox migration is regenerated with the EF tool (MigrationOpenShockContext) rather than hand-edited, so it now reflects the current model (coalesce_key column + index, attempt_count fencing token, no xmin shadow property) as tool-generated output.
…rocess Cron The API's job for transactional email is to atomically write the EmailOutboxMessage row; it never sends. Booting the Cron host + a delivery pump inside the API tests was wrong and caused races (a background drain loop mutating rows that schema/predicate tests seed, and two executors racing the coalescing). - WebApplicationFactory no longer boots the Cron host, SMTP server, or a delivery loop - just the API + Postgres + Redis. - MailTests now assert the correct outbox row is enqueued (type, recipient, coalesce key, payload) or that nothing is enqueued for rejected requests; the emailed-link end-to-end flows and newest-wins coalescing move to Cron.IntegrationTests. - Drop the Cron project reference, SmtpTemplates copy, and the now-unused Mailpit fixtures.
The delivery behaviour removed from the API tests now lives in a Cron-side integration project that boots the real Cron host against throwaway Postgres / Redis / Mailpit containers. Delivery is driven deterministically - tests seed outbox rows then call RunDeliveryAsync() to run one pass of the real EmailOutboxDeliveryJob, so nothing races the assertions. Covers: password-reset delivery with lazy token minting (the emailed token verifies against the stored hash), the self-contained change notice, and newest-wins coalescing (older sibling Skipped, newest Sent, exactly one email). The Cron host doesn't migrate on boot, so the factory applies migrations to the fresh database first.
…ML-doc warning The coalescing test seeded both reset rows in one transaction, so they shared a created_at (Postgres CURRENT_TIMESTAMP is transaction-start time). The tie fell through to a UUIDv7 id comparison, whose Postgres ordering doesn't match .NET creation order, so "newest" came out inverted and the newer row was wrongly skipped. Real requests are separate transactions with distinct timestamps, so production was never affected - seed each sibling in its own transaction to mirror that and make the test deterministic. Also drop the lone Create() <param> tag that tripped CS1573 (fold it into the summary).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this does
Transactional emails (account activation, password reset, email-change verification, email-change notice) were sent inline during the HTTP request. If the mail provider was down, the process restarted, or the network blipped, the email was simply lost - with no record and no way to resend.
This PR makes email delivery durable and retryable via a transactional outbox delivered by Hangfire: the request handler writes an
EmailOutboxMessagerow in the same transaction as the business change and commits; the Cron host picks it up and hands it to Hangfire, which sends it and retries until it succeeds or is exhausted. Rows are never auto-deleted, so failures stay visible and resendable.The split is deliberate: the outbox row owns transactional integrity and the audit fact ("this user is owed this email, here is its delivery state"); Hangfire owns the delivery machinery (retry back-off, persistent scheduling, crash requeue, and an operator dashboard) instead of a hand-rolled worker.
How it works
AccountServicecreates the request row (e.g.UserPasswordReset) and anEmailOutboxMessage(statusPending) in one transaction, then publishes a Redis notification. The API host does nothing else - it never sends.EmailOutboxConsumer(a coalescing background service in the Cron host) is woken by the Redis notification (email-outbox-pending), with a 1-minute sweep as a safety net for any missed notification. The push and the sweep share one capacity-1 wake signal, so they can never fire on top of each other. It claimsPendingrows withFOR UPDATE SKIP LOCKEDand enqueues a Hangfire job per email, flipping each row toQueued.EmailOutboxJob.SendAsync(id)carries[AutomaticRetry(Attempts = 10)]: a transient failure throws so Hangfire reschedules it on its built-in back-off curve; a permanent/skip outcome marks the rowFailed; success marks itSent. On the final attempt (detected viaPerformContext.RetryCount) the row is markedFailedtoo, so an exhausted message never stays stuck lookingQueued. Crash mid-send is recovered by Hangfire's requeue. The precise attempt count + per-retry history live in the Hangfire dashboard.IEmailOutboxDispatcheris the only place email is sent. It maps the message type + dynamicjsonbpayload to a template, mints the secret token lazily at send time (re-minted on every resend), and classifies the result.IEmailServicereturnsEmailSendResult(Sent / Transient / Permanent) instead of throwing.Security note
Because a reset/verification link can't be reconstructed from a stored hash, tokens are minted at send time and only the hash is persisted. The queue therefore never contains a usable link, and a resend produces a fresh token - the desired behaviour for security links anyway. Existing token tables are unchanged: the hash is seeded on insert and rotated on send.
Why the diff is large (and what's actually in it)
52 files, +2996 / −339. The line count is dominated by generated code and a project move, not new logic:dotnet ef migrations addoutput; the bulk is the Designer/snapshot EF rewrites wholesale. Not hand-written, not reviewed line-by-line..liquidfiles relocated so sending lives in the Cron host. Mostly moves (namespace-only changes); a few git-detected renames, the rest paired add/delete.EmailOutboxConsumer(159) +EmailOutboxJob(120) - the real surface area to review.EmailOutboxMessageTests; XML docs explaining the design so it's self-documenting.So the substantive new code is the ~280-line consumer + Hangfire job; the rest is generated schema, a project relocation, the outbox entity/enums, and documentation.
Outbox + Hangfire (the design)
This is a transactional outbox with Hangfire as the relay/executor - the two layers own different things and are combined on purpose:
The API host stays free of email-send work entirely; it only writes the row and rings the bell.
Scope / roadmap
One self-contained PR. Follow-ups (now trivial - they just insert a row): admin "send any email type", admin "list email types", and a Cron prune job for old
Sentrows.Deployment note: the Cron host now sends email, so it needs the
OpenShock:Mailconfig section (and, for a real provider, Mailjet/SMTP credentials). The API host no longer needs that config or the templates.Testing
Common.Testsgreen (265 passing).MailTestsare consistent with async delivery (they already wait for mail, parse the minted token, expect the same link formats) but were not run in this environment.