⚡ Promptolis Original · Coding & Development
🚚 Migration Rollout Plan Designer
Designs your DB migration / framework upgrade / dependency change rollout: the staged plan, the rollback gate at each phase, the verification checks, and the 'kill switch' if something breaks at 3am.
Why this is epic
Most migrations are 'do it Saturday, hope for the best.' This Original designs the staged rollout: shadow mode, percentage-based traffic, verification gates at each phase, and the explicit rollback decision criteria.
Outputs the full plan: pre-migration prep, the staged steps with verification at each, monitoring + alerts to add, the rollback procedure (with specific commands), and the post-migration cleanup.
Calibrated to 2026 migration realities: feature flags + LaunchDarkly for percentage rollout, blue-green deploys, the difference between online + offline migrations, the 'last 5% always breaks' pattern.
Includes the 4 questions every migration must answer: (1) is the rollback fast or slow? (2) does the new system handle the OLD system's edge cases? (3) what's the data-shape change risk? (4) what's the on-call burden? The answers shape the rollout plan.
The prompt
Example: input → output
Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.
<migration-type>Database vendor migration: Postgres 14 (self-hosted on AWS EC2) → Postgres 16 on AWS RDS</migration-type>
<from-state>Postgres 14.7 self-hosted on a single EC2 instance (db.r5.xlarge equivalent). 600GB data, ~12K queries/sec peak. Daily backups via pg_dump. ~3 years of operational history.</from-state>
<to-state>Postgres 16 on AWS RDS (Multi-AZ, db.r6i.xlarge). Goal: get managed backups, automated failover, cleaner upgrade path forward, and the Pg 16 features (improved partition support, faster vacuum).</to-state>
<scale>~12K queries/sec peak. ~50 services + workers connect. ~600GB data. 5K active customers.</scale>
<criticality>Critical. Every customer transaction touches the DB. Down DB = product down.</criticality>
<rollback-acceptable-delay>Minutes only. Multi-hour outage = compliance issue + customer SLA breach.</rollback-acceptable-delay>
<existing-dual-run-capability>No dual-write infrastructure today. We have feature flags via LaunchDarkly. We can route reads via env var per service.</existing-dual-run-capability>
<team-availability>4 backend engineers + 1 DBA contractor. All Pacific timezone. The DBA has done 2 prior Postgres migrations.</team-availability>
<prior-migration-experience>Team did one minor Pg 13→14 in-place upgrade 2 years ago (went well, simple). Have not done a vendor switch.</prior-migration-experience>
<constraints>Want to be on RDS by Q3 2026 for compliance audit. Currently April. So 5 months max. Prefer 3 months. Budget for AWS DMS or pg_logical replication.</constraints>
<integration-points>Backend services (50), Stripe webhook handler, our analytics pipeline (read-replica today), our CDC stream into Snowflake (uses logical replication slots).</integration-points>
Migration Plan: Postgres 14 (EC2 self-hosted) → Postgres 16 (RDS Multi-AZ)
Risk Assessment
This is a high-blast-radius migration. Postgres is your single point of failure today; you're moving it. Worst case: extended outage if cut-over fails AND rollback fails.
Mitigating factors:
- AWS DMS or pg_logical replication enables dual-state operation (write to both during transition)
- RDS is well-trodden infrastructure; failure modes are documented
- Your DBA contractor has prior Postgres migration experience
Aggravating factors:
- 50 service consumers — coordination complexity
- Snowflake CDC uses logical replication slots — the migration affects the upstream
- 600GB data — initial sync takes hours, not minutes
- 12K QPS peak — small windows of inconsistency are visible to customers
Confidence: Medium-high with the staged plan below. Low if attempted as big-bang.
Estimated total timeline: 8 weeks from kickoff to old-DB-decommission. Aggressive but doable.
Pre-Migration Prep (Weeks 1-3, before any traffic moves)
Week 1: Infrastructure + tooling
- Provision RDS Postgres 16 instance (Multi-AZ, same region, db.r6i.xlarge to start)
- Set up VPC peering / Security Groups so app servers can reach both DBs
- Provision AWS DMS instance OR set up pg_logical replication from EC2 → RDS (a minimal replication-setup sketch follows this list)
- Configure CloudWatch alarms: connection count, replication lag, query latency, error rate
- Set up cross-DB query latency monitoring (Datadog or similar)
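To make the replication setup concrete, here is a minimal sketch using native Postgres logical replication (one option alongside AWS DMS). The hostnames, credentials, and publication/subscription names are placeholders, and the copy_data choice depends on how the initial load is coordinated with the dump in Week 2.

```python
# Sketch: native logical replication from the EC2 source into the RDS target.
# Assumes wal_level = logical on the source; DSNs and object names are placeholders.
import psycopg2

SOURCE_DSN = "host=ec2-postgres.internal dbname=app user=replicator password=..."
TARGET_DSN = "host=app-db.xxxxxx.us-west-2.rds.amazonaws.com dbname=app user=admin password=..."

# 1. Publication on the EC2 source covering all tables.
src = psycopg2.connect(SOURCE_DSN)
src.autocommit = True
with src.cursor() as cur:
    cur.execute("CREATE PUBLICATION rds_migration FOR ALL TABLES;")

# 2. Subscription on the RDS target. copy_data = false assumes the initial
#    pg_dump load and the replication start point were coordinated (e.g. via a
#    slot created before the dump); otherwise drop the option and let the
#    subscription perform the initial table copy itself.
tgt = psycopg2.connect(TARGET_DSN)
tgt.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction block
with tgt.cursor() as cur:
    cur.execute(
        "CREATE SUBSCRIPTION rds_migration_sub "
        "CONNECTION 'host=ec2-postgres.internal dbname=app user=replicator password=...' "
        "PUBLICATION rds_migration WITH (copy_data = false);"
    )

# 3. Replication lag check on the source (plan target: <5s steady-state).
with src.cursor() as cur:
    cur.execute(
        "SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) "
        "FROM pg_replication_slots WHERE slot_type = 'logical';"
    )
    for slot_name, lag_bytes in cur.fetchall():
        print(f"{slot_name}: {lag_bytes} bytes behind")
```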
Week 2: Initial sync + validation
- Take a fresh pg_dump of EC2 → import to RDS (initial bulk load, ~6-8 hours)
- Start logical replication from EC2 → RDS (continuous CDC of new writes)
- Verify replication lag stays <5 seconds steady-state
- Run row-count validation on every table: must match EC2 (a validation sketch follows this list)
- Spot-check ~50 random rows per major table for byte-identical match
- Run query-correctness validation: pick 100 representative SELECT queries, run on both, diff results
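A minimal sketch of the row-count and spot-check validation, assuming psycopg2 and placeholder connection strings; the "orders" table and "id" primary key are illustrative, not from your schema.

```python
# Sketch: row-count comparison for every public table, plus a spot-check of
# ~50 random rows on one table. Counts on hot tables can drift slightly while
# replication catches up; run during a quiet window or re-check mismatches.
import psycopg2

EC2_DSN = "host=ec2-postgres.internal dbname=app user=readonly password=..."
RDS_DSN = "host=app-db.xxxxxx.us-west-2.rds.amazonaws.com dbname=app user=readonly password=..."

ec2 = psycopg2.connect(EC2_DSN)
rds = psycopg2.connect(RDS_DSN)

def row_count(conn, table):
    with conn.cursor() as cur:
        cur.execute(f'SELECT count(*) FROM "{table}";')
        return cur.fetchone()[0]

with ec2.cursor() as cur:
    cur.execute("SELECT tablename FROM pg_tables WHERE schemaname = 'public' ORDER BY tablename;")
    tables = [r[0] for r in cur.fetchall()]

for table in tables:
    a, b = row_count(ec2, table), row_count(rds, table)
    print(f"{'OK' if a == b else 'MISMATCH':9} {table}: ec2={a} rds={b}")

# Spot-check ~50 random rows on one major table (repeat per table of interest).
with ec2.cursor() as cur:
    cur.execute('SELECT * FROM "orders" ORDER BY random() LIMIT 50;')
    sample = cur.fetchall()
with rds.cursor() as cur:
    for row in sample:
        cur.execute('SELECT * FROM "orders" WHERE id = %s;', (row[0],))
        if cur.fetchone() != row:
            print(f"orders id={row[0]}: differs between EC2 and RDS")
```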
Week 3: Application-layer prep
- Add a database-router abstraction in your app code (LaunchDarkly flag controls which DB; a router sketch follows this list)
- Default the flag to 'EC2' (current state) — no behavior change yet
- Deploy this code to all 50 services. Smoke-test that 'EC2' route still works for everything
- Verify the Stripe webhook handler is in the routing layer
- Update the analytics read-replica to clone from RDS instead of EC2 (so analytics doesn't depend on the old DB after migration)
- Update Snowflake CDC: this one is tricky — see specific scenario below
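A minimal sketch of the router abstraction. The flag_value() helper is a hypothetical stand-in for the LaunchDarkly SDK lookup, and the DSNs are placeholders.

```python
# Sketch: a single routing layer all 50 services call into. The flag value
# ('ec2' default, 'rds', 'both', 'ec2-only-writes') decides where connections go.
import psycopg2

EC2_DSN = "host=ec2-postgres.internal dbname=app user=app password=..."
RDS_DSN = "host=app-db.xxxxxx.us-west-2.rds.amazonaws.com dbname=app user=app password=..."

def flag_value(flag_key: str, default: str) -> str:
    # Hypothetical: real services would evaluate the LaunchDarkly flag here
    # (percentage rollout for reads is configured in LaunchDarkly, not in code).
    return default

def read_connection():
    target = flag_value("database_target", default="ec2")
    return psycopg2.connect(EC2_DSN if target == "ec2" else RDS_DSN)

def write_connections():
    """Connections a write must hit, in order. Until Phase 4 this is EC2 only."""
    target = flag_value("database_target", default="ec2")
    if target == "both":
        return [psycopg2.connect(EC2_DSN), psycopg2.connect(RDS_DSN)]
    if target == "rds":
        return [psycopg2.connect(RDS_DSN)]
    return [psycopg2.connect(EC2_DSN)]  # 'ec2' and 'ec2-only-writes'
```

With this layer in place, every later phase is a flag change rather than a deploy, which is what makes the kill switch deploy-free.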
The Staged Rollout (Weeks 4-7)
Phase 1: Shadow Reads (Week 4 — 5 days)
- Action: flip 5% of READ traffic to RDS via LaunchDarkly. Reads only, NOT writes.
- Verification:
- Compare query latency (should be similar; RDS slightly faster expected)
- Compare error rates (must be 0)
- Compare result correctness via sampling (1% of routed queries also run on EC2 and the results are diffed; a sampling sketch follows this phase)
- Success criteria: 0 customer-visible issues, latency within 10% of EC2, no correctness mismatches
- Decision gate: if any criteria fail → fix, don't proceed. Rollback: flag back to 100% EC2 (instant).
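A minimal sketch of that 1% correctness sampling; the connection handling, logging, and sample rate are placeholders.

```python
# Sketch: serve the read from RDS, and for ~1% of those reads re-run the same
# query on EC2 and diff the results. Mismatches are logged for review, never
# surfaced to the customer.
import logging
import random

log = logging.getLogger("db-shadow-compare")

def sampled_read(rds_conn, ec2_conn, sql, params, sample_rate=0.01):
    with rds_conn.cursor() as cur:
        cur.execute(sql, params)
        rds_rows = cur.fetchall()

    if random.random() < sample_rate:
        with ec2_conn.cursor() as cur:
            cur.execute(sql, params)
            ec2_rows = cur.fetchall()
        if rds_rows != ec2_rows:
            # Replication lag can cause benign mismatches on recently written
            # rows; alert on a sustained mismatch rate, not single events.
            log.warning("shadow-compare mismatch: %s %s", sql, params)

    return rds_rows
```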
Phase 2: 25% Reads (Week 4 — 2 days after Phase 1)
- Action: route 25% of reads to RDS
- Verification: same as Phase 1 + connection-pool health on RDS
- Success criteria: sustained for 48h with no incidents
- Decision gate: same
Phase 3: 100% Reads on RDS (Week 5)
- Action: route 100% of reads to RDS. Writes still hit EC2 only.
- Verification: same + watch RDS performance under full read load
- Success criteria: stable for 5 days minimum
- Decision gate: before proceeding to Phase 4 — DBA + tech lead + on-call must sign off
Phase 4: Dual-Write (Week 6)
- Action: application writes to BOTH EC2 and RDS for every transaction. Reads still on RDS.
- Implementation: the application wraps both writes; if EITHER fails, both roll back. This is best-effort coordination in the app layer, not database-level two-phase commit (too complex here). A dual-write sketch follows this phase.
- Verification:
- 100% write parity (any divergence = bug)
- Latency: dual-write adds ~5-15ms; acceptable
- Replication handoff: pause logical replication from EC2 → RDS when dual-write begins, since the app now writes both copies (replicated rows would otherwise conflict with rows the app writes directly to RDS)
- Success criteria: 5 days of dual-write with 0 divergence
- Decision gate: automated diff check shows < 0.001% divergence
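A minimal sketch of the dual-write wrapper, assuming psycopg2 connections are passed in from the routing layer; because this is best-effort app-level coordination, the hourly diff checks below remain the real safety net.

```python
# Sketch: Phase 4 dual-write. The same statement runs against EC2 and RDS;
# commit both only if both succeed, otherwise roll back both. A crash between
# the two commits can still diverge, which the hourly diff checks catch.

def dual_write(ec2_conn, rds_conn, sql, params):
    try:
        with ec2_conn.cursor() as cur:
            cur.execute(sql, params)
        with rds_conn.cursor() as cur:
            cur.execute(sql, params)
    except Exception:
        ec2_conn.rollback()
        rds_conn.rollback()
        raise
    ec2_conn.commit()
    rds_conn.commit()

# Usage, with connections from the routing layer:
#   dual_write(ec2, rds,
#              "UPDATE accounts SET balance = balance - %s WHERE id = %s;",
#              (amount, account_id))
```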
Phase 5: Cutover Writes (Week 7 — the actual migration moment)
- Action: writes go ONLY to RDS. EC2 receives no new writes.
- Pre-action: verify all running transactions on EC2 have completed (a pg_stat_activity check sketch follows this phase)
- Action timing: during business hours (Tuesday 10am Pacific is ideal — full team awake, low-traffic period)
- Verification immediately:
- Writes succeeding
- Reads succeeding
- Stripe webhooks processing
- Analytics pipeline still working
- Snowflake CDC still flowing
- Hold dual-state for 7 days: EC2 stays online, receives no new writes, but is recoverable if RDS has unforeseen issues
- Decision gate: 7 days of stable RDS-only operation before declaring success
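For the pre-action check, a minimal sketch that queries pg_stat_activity on the EC2 source for open transactions; the DSN is a placeholder.

```python
# Sketch: before cutting writes over to RDS, confirm the EC2 source has no open
# transactions left. Re-run until the count is zero (or the remaining sessions
# are known-idle housekeeping).
import psycopg2

EC2_DSN = "host=ec2-postgres.internal dbname=app user=admin password=..."

conn = psycopg2.connect(EC2_DSN)
with conn.cursor() as cur:
    cur.execute(
        "SELECT pid, usename, state, xact_start, query "
        "FROM pg_stat_activity "
        "WHERE xact_start IS NOT NULL AND state <> 'idle' "
        "ORDER BY xact_start;"
    )
    open_txns = cur.fetchall()

if open_txns:
    for pid, user, state, started, query in open_txns:
        print(f"pid={pid} user={user} state={state} started={started}: {query[:80]}")
else:
    print("No open transactions on EC2 -- safe to proceed with cutover.")
```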
Phase 6: Decommission (Week 8)
- Action: stop EC2 Postgres process, snapshot disk for archive, terminate instance
- Verification: all systems still working without EC2
- Final cleanup: remove dual-write code path, remove LaunchDarkly flag, update docs
Verification Checks per Phase
Continuous monitoring (all phases):
- RDS CPU utilization (alert >70%; an alarm sketch follows this list)
- RDS connection count (alert >80% of max_connections)
- RDS query latency p95, p99 (alert >2× EC2 baseline)
- Replication lag (alert >10s during dual-state)
- Application error rate (alert >0.1%)
- Customer-impact metrics: checkout success rate, login success rate (any drop = stop)
- DMS replication health (if using DMS)
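One of these alarms as a minimal boto3 sketch (RDS CPU above 70%); the instance identifier and SNS topic ARN are placeholders, and the other alarms follow the same put_metric_alarm pattern with different metrics and thresholds.

```python
# Sketch: CloudWatch alarm for RDS CPU utilization above 70%, paging via SNS.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

cloudwatch.put_metric_alarm(
    AlarmName="rds-migration-cpu-high",
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "app-db"}],  # placeholder
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,           # 5 consecutive minutes above threshold
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",  # missing data during cutover is itself suspicious
    AlarmActions=["arn:aws:sns:us-west-2:123456789012:db-migration-pages"],  # placeholder
)
```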
Automated diff checks:
During dual-write phase, run hourly:
- Row counts of all tables (EC2 vs RDS)
- Sample 0.01% of recent writes, query both DBs, diff
- Foreign key integrity checks
- Sequence values (if using sequences)
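The sequence check matters more than it looks: native logical replication does not carry sequence state, so RDS sequences can sit behind EC2 and must be advanced (setval) before writes cut over. A minimal comparison sketch, with placeholder DSNs:

```python
# Sketch: compare sequence values between EC2 and RDS; any sequence behind on
# RDS must be advanced (e.g. via setval) before the Phase 5 cutover.
import psycopg2

EC2_DSN = "host=ec2-postgres.internal dbname=app user=readonly password=..."
RDS_DSN = "host=app-db.xxxxxx.us-west-2.rds.amazonaws.com dbname=app user=admin password=..."

def sequence_values(conn):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT sequencename, last_value FROM pg_sequences WHERE schemaname = 'public';"
        )
        return dict(cur.fetchall())

ec2_seqs = sequence_values(psycopg2.connect(EC2_DSN))
rds_seqs = sequence_values(psycopg2.connect(RDS_DSN))

for name, ec2_val in ec2_seqs.items():
    rds_val = rds_seqs.get(name)
    if rds_val is None or rds_val < (ec2_val or 0):
        print(f"{name}: rds={rds_val} behind ec2={ec2_val} -- advance with setval before cutover")
```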
Kill Switch
Primary kill switch: LaunchDarkly flag database_target with values 'ec2' (default), 'rds', 'both', and 'ec2-only-writes' (reads on RDS, writes on EC2).
- Flip to 'ec2' → all traffic returns to EC2 immediately (within seconds)
- This works UNTIL phase 5 (cutover). After cutover, EC2 is no longer receiving writes, so flipping to 'ec2' would cause data inconsistency.
Secondary kill switch (post-cutover): during the 7-day hold period, you can:
- Quickly re-enable dual-write (config change, ~10 min to deploy)
- Or accept short data loss + revert to EC2 (extreme — only for catastrophic RDS failure)
Decision after cutover: if RDS fails in the 7-day hold window, the rollback is to:
1. Re-enable dual-write (now to EC2 + RDS) so writes are captured to BOTH
2. Wait for RDS issue to be diagnosed + fixed
3. Decide: continue on RDS or fall back to EC2
4. If falling back to EC2: replay any RDS-only writes from your app's audit log to EC2
This is why the 7-day hold matters — it's the window where RDS has to prove itself before EC2 is destroyed.
Rollback Procedure (specific commands)
During Phases 1-3 (reads partially or fully on RDS, writes still on EC2):
# LaunchDarkly UI: flip database_target flag to 'ec2'
# OR via the LaunchDarkly API / CLI (update the flag to serve 'ec2')
# Effect: instant. App config refreshes within 30s.
During Phase 4 (dual-write):
# Stop dual-write — flip flag to 'ec2-only-writes'
# Effect: writes go to EC2 only; reads on RDS continue
# OR full rollback to EC2: flip database_target to 'ec2'
During Phase 5 cutover (writes on RDS):
# Re-enable dual-write: flip database_target to 'both'
# Verify writes flow to both
# If still problematic, flip to 'ec2' and accept short window of writes that didn't reach EC2
# Those writes are recoverable from RDS: pg_dump the affected tables from the RDS endpoint, diff against EC2, and apply the missing rows to EC2 via psql
During Phase 6 (post-cutover):
# Old EC2 still running but receiving no writes
# To rollback:
# 1. Bring EC2 back to write-receiving state
# 2. Replicate RDS → EC2 to catch up the writes RDS-only received during the hold period
# 3. Flip flag back to ec2
# Estimated time: 1-3 hours depending on data volume diff
Communication Plan
Internal team:
- Migration kickoff: all-hands, week 1 — align on plan + on-call
- Daily standup mention during phases
- Slack #engineering channel: live updates during phase transitions
- Postmortem after Phase 5 success
Stakeholders:
- CTO + CEO: weekly status email + immediate notification on any phase rollback
- Product team: notify them of the migration window so they don't ship risky features that week
- Compliance: notify of the migration completion for the audit
Customers:
- No customer-facing communication unless something goes wrong (the migration should be invisible to them)
- IF an issue occurs: status page update within 5 min, Twitter/X update within 15 min if customer-impacting
Timeline
Week 1 (Apr 28 - May 2): Infrastructure + tooling setup
Week 2 (May 5-9): Initial sync + validation
Week 3 (May 12-16): Application-layer prep, deploy router code
Week 4 (May 19-23): Phase 1 (shadow reads 5%) + Phase 2 (25%)
Week 5 (May 26-30): Phase 3 (100% reads on RDS) — hold for 5 days
Week 6 (Jun 2-6): Phase 4 (dual-write) — hold for 5 days
Week 7 (Jun 9-13): Phase 5 (cutover, Tuesday 10am Pacific) — hold for 7 days
Week 8 (Jun 16-20): Phase 6 (decommission EC2)
Done by mid-June. Compliance deadline Q3 met with comfort.
Migration windows: Phase transitions ALWAYS Tuesday-Thursday, 9am-3pm Pacific. Never Friday (no recovery window before weekend). Never Monday (Monday = highest traffic + highest risk).
On-Call Setup
Phase 4-5 (highest-risk weeks):
- Primary on-call: DBA contractor (paged first for DB issues)
- Secondary: Senior backend engineer (paged for app-layer issues)
- Tech lead on standby (decision authority)
Standard hours coverage: all 4 backend engineers know the runbook + can execute kill switch
Escalation path: primary → secondary → tech lead → CTO
Paging thresholds:
- Customer error rate > 0.5% sustained 5 min → page primary
- Customer error rate > 2% → page primary + secondary + tech lead
- Any data integrity check failure → page everyone
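Those thresholds can also drive an automated guard during Phases 1-4. A minimal sketch, where error_rate(), set_flag(), and page() are hypothetical stand-ins for the metrics query, the LaunchDarkly flag update, and the paging integration:

```python
# Sketch: automated kill-switch guard for Phases 1-4. Disable this guard after
# the Phase 5 cutover: flipping back to 'ec2' at that point would lose writes.
import time

PAGE_THRESHOLD = 0.005   # 0.5% error rate sustained 5 min -> page primary
KILL_THRESHOLD = 0.02    # 2% error rate -> flip traffic back to EC2, page everyone

def error_rate(window_minutes: int) -> float:
    return 0.0  # placeholder: query your metrics store (Datadog, CloudWatch, ...)

def set_flag(flag_key: str, value: str) -> None:
    pass        # placeholder: update the LaunchDarkly flag via its API

def page(who: str, message: str) -> None:
    pass        # placeholder: trigger your paging tool

while True:
    rate = error_rate(window_minutes=5)
    if rate > KILL_THRESHOLD:
        set_flag("database_target", "ec2")
        page("primary+secondary+tech-lead", f"Kill switch fired: error rate {rate:.2%}")
    elif rate > PAGE_THRESHOLD:
        page("primary", f"Elevated error rate during DB migration: {rate:.2%}")
    time.sleep(60)
```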
Post-Migration Cleanup (Week 8+)
1. Remove dual-write code path from app. Single-DB code is simpler + faster.
2. Remove LaunchDarkly flag. Once Phase 6 is stable for 30 days.
3. Update internal docs to reflect RDS architecture.
4. Archive EC2 disk snapshots for 90 days, then delete. Compliance + audit.
5. Update runbooks for DB issues (RDS-specific procedures).
6. Postmortem doc: what went well, what surprises hit, lessons for next time.
What This Plan Doesn't Solve
- Won't fix application-layer bugs. If your app has SQL queries that work on Pg 14 but break on Pg 16 (rare but possible), tests should catch it. If they don't, ship it + fix forward.
- Won't handle 'we changed our mind' scenarios. Once you decommission EC2 in Week 8, going back is rebuild-from-scratch territory.
- Won't compress timeline. 8 weeks is aggressive for a critical-path migration; squeezing to 4 weeks adds risk dramatically.
- Won't reduce on-call burden during transition. Plan for higher-than-normal incident response.
Specific Risk Scenarios
Scenario 1: 'Snowflake CDC stops working after replication slot moves to RDS.'
Likelihood: medium. Snowflake CDC reads from EC2's logical replication slots; switching to RDS breaks the slot.
Response: pre-plan in Week 1. Either (a) reconfigure Snowflake to read from RDS replication slots, OR (b) accept a brief gap in CDC during cutover and backfill with a one-time SELECT query.
Scenario 2: 'RDS connection limits hit during peak traffic.'
Likelihood: low (we provisioned db.r6i.xlarge with ample max_connections). Watch metrics.
Response: scale the RDS instance up (on Multi-AZ, an instance-class change is a brief failover rather than an extended outage) OR enable RDS Proxy.
Scenario 3: 'Dual-write divergence detected in Week 6.'
Likelihood: medium. Race conditions in app layer, transaction isolation issues.
Response: stop progression. Diagnose divergence root cause. Don't proceed to Phase 5 until 0 divergence for 5 consecutive days.
Scenario 4: 'Stripe webhook delivers an event during cutover that double-processes.'
Likelihood: low (idempotency keys should prevent), but worth verifying.
Response: idempotency is your defense. Verify Stripe handler uses event-ID idempotency before starting. Test in staging.
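A minimal sketch of that event-ID idempotency check, assuming a hypothetical processed_stripe_events dedupe table; the table name and handler wiring are placeholders.

```python
# Sketch: event-ID idempotency for the Stripe webhook handler. The event ID is
# inserted into a dedupe table first; a conflict means this event was already
# handled (e.g. a duplicate delivery during cutover), so it is skipped.

def handle_stripe_event(conn, event_id, process_event):
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO processed_stripe_events (event_id) VALUES (%s) "
            "ON CONFLICT (event_id) DO NOTHING RETURNING event_id;",
            (event_id,),
        )
        duplicate = cur.fetchone() is None
    if duplicate:
        conn.rollback()    # nothing new to record; release the transaction
        return
    process_event(conn)    # business logic runs in the same transaction
    conn.commit()          # dedupe row + effects commit together
```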
Scenario 5: 'Tuesday 10am cutover fails; need to roll back at 11am.'
Likelihood: this is what we plan for.
Response: kill switch flag to 'ec2' (instant). Diagnose. Schedule retry for next Tuesday. No weekend deploys.
Key Takeaways
- This is an 8-week staged migration. Big-bang is too risky for a critical-path system at your scale.
- 6 phases: shadow reads at 5% → 25% reads → 100% reads → dual-write → cutover → decommission. Hold periods at each gate.
- Kill switch is a LaunchDarkly flag. Deploy-free. Instant. Tested before starting.
- The 7-day hold after cutover is non-negotiable. Don't decommission EC2 until RDS has proven itself.
- Tuesday-Thursday business-hours only for phase transitions. Never Friday. Never Monday.
- Document specific rollback commands BEFORE starting. The Phase 4-5 rollback procedures must be tested in staging first.
Common use cases
- Engineer planning a Postgres major version upgrade (14 → 16)
- Team migrating from Redis to KeyDB / Memcached / a different cache
- Backend lead upgrading their framework version (Next.js 14 → 15, Rails 7 → 8)
- Architect splitting a monolith into 2 services (the first split is the hardest)
- Engineer changing primary database vendor (Postgres → Aurora, MySQL → Postgres)
- Team replacing a 3rd-party service (Auth0 → Clerk, SendGrid → Postmark)
Best AI model for this
Claude Opus 4. Migration planning needs reasoning about reversibility, blast radius, and edge cases, which plays to Claude's strengths; GPT-5 (in ChatGPT) is the second-best option.
Pro tips
- Shadow mode first. Run new system alongside old, compare outputs, don't act on new system's output. Surfaces edge cases without risk.
- 5% → 25% → 50% → 100% traffic rollout is safer than direct cut-over. Each phase reveals new failure modes.
- Always have a kill switch that doesn't require a deploy. Env var, feature flag, config change — something you can flip in 1 minute.
- Migrations during business hours, not weekends. You want senior engineers awake when the failure happens.
- Verify dual-write consistency before flipping reads. Most data migrations fail at this step, not at the cut-over.
- The 'last 5%' always breaks. That's where the rare data conditions live. Plan for it; don't be surprised.
- Document the rollback procedure BEFORE you start the migration. Not 'we can revert if needed' — actual specific commands.
Customization tips
- Be specific about the from-state and to-state. Migration plans differ dramatically between minor version upgrades vs vendor changes.
- Specify the criticality. A migration on internal-tooling can use simpler patterns than payment-system migrations.
- Be honest about rollback acceptable delay. 'Minutes' vs 'hours' vs 'days' fundamentally shapes the migration design.
- Mention prior migration experience. Teams who've done 5 prior migrations need fewer guardrails than first-timers.
- List ALL integration points. Migrations often fail at the upstream/downstream systems (analytics pipeline, CDC, webhooks) not at the migrated system itself.
- Use the Critical-Path Mode variant for migrations on payment / auth / data systems — adds extra verification gates and longer parallel-run periods.
Variants
Database Migration Mode
For DB schema or version upgrades — emphasizes data integrity, dual-write/dual-read, and zero-downtime patterns.
Framework Upgrade Mode
For framework or runtime version upgrades — emphasizes API surface changes, dependency cascades, and staged deploys.
Service Migration Mode
For replacing a 3rd-party service or splitting a monolith — emphasizes API contract testing and staged cutover.
Critical-Path Mode
For migrations on payment / auth / data-store systems — adds extra verification gates and longer parallel-run periods.
Frequently asked questions
How do I use the Migration Rollout Plan Designer prompt?
Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.
Which AI model works best with Migration Rollout Plan Designer?
Claude Opus 4. Migration planning needs reasoning about reversibility, blast radius, and edge cases, which plays to Claude's strengths; GPT-5 (in ChatGPT) is the second-best option.
Can I customize the Migration Rollout Plan Designer prompt for my use case?
Yes, every Promptolis Original is designed to be customized. Key levers: run the new system in shadow mode first (compare outputs without acting on them, surfacing edge cases risk-free), and stage traffic 5% → 25% → 50% → 100% instead of a direct cut-over, since each phase reveals new failure modes.
Explore more Originals
Hand-crafted 2026-grade prompts that actually change how you work.