⚡ Promptolis Original · Coding & Development

🚚 Migration Rollout Plan Designer

Designs your DB migration / framework upgrade / dependency change rollout: the staged plan, the rollback gate at each phase, the verification checks, and the 'kill switch' if something breaks at 3am.

⏱️ 5 min to set up 🤖 ~120 seconds in Claude 🗓️ Updated 2026-04-28

Why this is epic

Most migrations are 'do it Saturday, hope for the best.' This Original designs the staged rollout: shadow mode, percentage-based traffic, verification gates at each phase, and the explicit rollback decision criteria.

Outputs the full plan: pre-migration prep, the staged steps with verification at each, monitoring + alerts to add, the rollback procedure (with specific commands), and the post-migration cleanup.

Calibrated to 2026 migration realities: feature flags + LaunchDarkly for percentage rollout, blue-green deploys, the difference between online + offline migrations, the 'last 5% always breaks' pattern.

Includes the 4 question every migration must answer: (1) is the rollback fast or slow? (2) does the new system handle the OLD system's edge cases? (3) what's the data-shape change risk? (4) what's the on-call burden? Answers shape the rollout plan.

The prompt

Promptolis Original · Copy-ready

<role> You are a migration architect with 7+ years planning + executing critical migrations: DB upgrades, framework changes, service migrations, monolith splits. You have shipped 30+ migrations to production with zero customer-impacting incidents. You know the failure modes that surprise teams + how to design around them. You are direct. You will tell a builder their 'do it Saturday' plan is wrong, that they need shadow mode, or that their rollback is too slow given the blast radius. You refuse to recommend big-bang migrations for non-trivial changes — staged rollout almost always wins. </role> <principles> 1. Shadow mode first for non-trivial changes. 2. 5% → 25% → 50% → 100% traffic rollout. Verify at each. 3. Kill switch must be deploy-free. 4. Migrations during business hours, never weekends. 5. Verify dual-write consistency before flipping reads. 6. The 'last 5%' always breaks. Plan for it. 7. Document specific rollback commands BEFORE starting. </principles> <input> <migration-type>{DB upgrade / framework upgrade / service replacement / monolith split / data migration / 'recommend'}</migration-type> <from-state>{the current system, version, scale}</from-state> <to-state>{the target system, version, scale}</to-state> <scale>{traffic, data volume, user count}</scale> <criticality>{is this on payment / auth / login / public-facing critical path?}</criticality> <rollback-acceptable-delay>{how long can we tolerate broken state if rollback needed: instant / minutes / hours}</rollback-acceptable-delay> <existing-dual-run-capability>{can we run old + new in parallel? feature-flag system in place?}</existing-dual-run-capability> <team-availability>{who's available, what's their time zone, how many on-call}</team-availability> <prior-migration-experience>{has the team done migrations before? known anti-patterns?}</prior-migration-experience> <constraints>{deadlines, dependent migrations, blocked work}</constraints> <integration-points>{external systems that depend on this}</integration-points> </input> <output-format> # Migration Plan: [from] → [to] ## Risk Assessment What could go wrong, blast radius, confidence level. Honest. ## Pre-Migration Prep Things to do BEFORE starting. Often the highest-leverage work. ## The Staged Rollout Phase-by-phase plan: shadow → 5% → 25% → 50% → 100%. For each: action, verification, success criteria, decision gate. ## Verification Checks per Phase What to monitor + alert on. Specific metrics, not 'check it looks ok.' ## Kill Switch The deploy-free way to stop the migration. Specific config / flag / env var. ## Rollback Procedure For each phase: specific commands to roll back. Tested in advance. ## Communication Plan Who gets told what, when. Stakeholder updates, customer comms if needed. ## Timeline Specific dates + times. Always business hours. ## On-Call Setup Who's on, what they monitor, escalation path. ## Post-Migration Cleanup The deletion of old system, code paths, dual-write logic. When + how. ## What This Plan Doesn't Solve Honest limits. Some migrations have inherent risk that no plan removes. ## Specific Risk Scenarios 3-5 'what if X happens' with the response. ## Key Takeaways 4-6 bullets — for the team's migration playbook. </output-format> <auto-intake> If input incomplete: ask for migration type, from-state, to-state, scale, criticality, rollback delay tolerance, dual-run capability, team availability, prior experience, constraints, integration points. </auto-intake> Now, design the migration plan:

0 copies

🚀 Open in ChatGPT ✨ Open in Claude 💎 Open in Gemini

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input

<migration-type>Database vendor migration: Postgres 14 (self-hosted on AWS EC2) → Postgres 16 on AWS RDS</migration-type>
<from-state>Postgres 14.7 self-hosted on a single EC2 instance (db.r5.xlarge equivalent). 600GB data, ~12K queries/sec peak. Daily backups via pg_dump. ~3 years of operational history.</from-state>
<to-state>Postgres 16 on AWS RDS (Multi-AZ, db.r6i.xlarge). Goal: get managed backups, automated failover, cleaner upgrade path forward, and the Pg 16 features (improved partition support, faster vacuum).</to-state>
<scale>~12K queries/sec peak. ~50 services + workers connect. ~600GB data. 5K active customers.</scale>
<criticality>Critical. Every customer transaction touches the DB. Down DB = product down.</criticality>
<rollback-acceptable-delay>Minutes only. Multi-hour outage = compliance issue + customer SLA breach.</rollback-acceptable-delay>
<existing-dual-run-capability>No dual-write infrastructure today. We have feature flags via LaunchDarkly. We can route reads via env var per service.</existing-dual-run-capability>
<team-availability>4 backend engineers + 1 DBA contractor. All Pacific timezone. The DBA has done 2 prior Postgres migrations.</team-availability>
<prior-migration-experience>Team did one minor Pg 13→14 in-place upgrade 2 years ago (went well, simple). Have not done a vendor switch.</prior-migration-experience>
<constraints>Want to be on RDS by Q3 2026 for compliance audit. Currently April. So 5 months max. Prefer 3 months. Budget for AWS DMS or pg_logical replication.</constraints>
<integration-points>Backend services (50), Stripe webhook handler, our analytics pipeline (read-replica today), our CDC stream into Snowflake (uses logical replication slots).</integration-points>

🤖 Output

Migration Plan: Postgres 14 (EC2 self-hosted) → Postgres 16 (RDS Multi-AZ)

Risk Assessment

This is a high-blast-radius migration. Postgres is your single point of failure today; you're moving it. Worst case: extended outage if cut-over fails AND rollback fails.

Mitigating factors:

AWS DMS or pg_logical replication enables dual-state operation (write to both during transition)
RDS is well-trodden infrastructure; failure modes are documented
Your DBA contractor has prior Postgres migration experience

Aggravating factors:

50 service consumers — coordination complexity
Snowflake CDC uses logical replication slots — the migration affects the upstream
600GB data — initial sync takes hours, not minutes
12K QPS peak — small windows of inconsistency are visible to customers

Confidence: Medium-high with the staged plan below. Low if attempted as big-bang.

Estimated total timeline: 8 weeks from kickoff to old-DB-decommission. Aggressive but doable.

Pre-Migration Prep (Weeks 1-3, before any traffic moves)

Week 1: Infrastructure + tooling

Provision RDS Postgres 16 instance (Multi-AZ, same region, db.r6i.xlarge to start)
Set up VPC peering / Security Groups so app servers can reach both DBs
Provision AWS DMS instance OR set up pg_logical replication from EC2 → RDS
Configure CloudWatch alarms: connection count, replication lag, query latency, error rate
Set up cross-DB query latency monitoring (Datadog or similar)

Week 2: Initial sync + validation

Take a fresh pg_dump of EC2 → import to RDS (initial bulk load, ~6-8 hours)
Start logical replication from EC2 → RDS (continuous CDC of new writes)
Verify replication lag stays <5 seconds steady-state
Run row-count validation on every table: must match EC2
Spot-check ~50 random rows per major table for byte-identical match
Run query-correctness validation: pick 100 representative SELECT queries, run on both, diff results

Week 3: Application-layer prep

Add a database-router abstraction in your app code (LaunchDarkly flag controls which DB)
Default the flag to 'EC2' (current state) — no behavior change yet
Deploy this code to all 50 services. Smoke-test that 'EC2' route still works for everything
Verify the Stripe webhook handler is in the routing layer
Update the analytics read-replica to clone from RDS instead of EC2 (so analytics doesn't depend on the old DB after migration)
Update Snowflake CDC: this one is tricky — see specific scenario below

The Staged Rollout (Weeks 4-7)

Phase 1: Shadow Reads (Week 4 — 5 days)

Action: flip 5% of READ traffic to RDS via LaunchDarkly. Reads only, NOT writes.
Verification:

- Compare query latency (should be similar; RDS slightly faster expected)

- Compare error rates (must be 0)

- Compare result correctness via sampling (1% of routed queries also run on EC2; results compared)

Success criteria: 0 customer-visible issues, latency within 10% of EC2, no correctness mismatches
Decision gate: if any criteria fail → fix, don't proceed. Rollback: flag back to 100% EC2 (instant).

Phase 2: 25% Reads (Week 4 — 2 days after Phase 1)

Action: route 25% of reads to RDS
Verification: same as Phase 1 + connection-pool health on RDS
Success criteria: sustained for 48h with no incidents
Decision gate: same

Phase 3: 100% Reads on RDS (Week 5)

Action: route 100% of reads to RDS. Writes still hit EC2 only.
Verification: same + watch RDS performance under full read load
Success criteria: stable for 5 days minimum
Decision gate: before proceeding to Phase 4 — DBA + tech lead + on-call must sign off

Phase 4: Dual-Write (Week 6)

Action: application writes to BOTH EC2 and RDS for every transaction. Reads still on RDS.
Implementation: transaction wraps both writes; if EITHER fails, both roll back. Use 2-phase commit pattern in app layer (NOT distributed transactions — too complex).
Verification:

- 100% write parity (any divergence = bug)

- Latency: dual-write adds ~5-15ms; acceptable

- Replication lag from EC2 → RDS shrinks to ~0 (since we're writing both)

Success criteria: 5 days of dual-write with 0 divergence
Decision gate: automated diff check shows < 0.001% divergence

Phase 5: Cutover Writes (Week 7 — the actual migration moment)

Action: writes go ONLY to RDS. EC2 receives no new writes.
Pre-action: verify all running transactions on EC2 have completed
Action timing: during business hours (Tuesday 10am Pacific is ideal — full team awake, low-traffic period)
Verification immediately:

- Writes succeeding

- Reads succeeding

- Stripe webhooks processing

- Analytics pipeline still working

- Snowflake CDC still flowing

Hold dual-state for 7 days: EC2 stays online, receives no new writes, but is recoverable if RDS has unforeseen issues
Decision gate: 7 days of stable RDS-only operation before declaring success

Phase 6: Decommission (Week 8)

Action: stop EC2 Postgres process, snapshot disk for archive, terminate instance
Verification: all systems still working without EC2
Final cleanup: remove dual-write code path, remove LaunchDarkly flag, update docs

Verification Checks per Phase

Continuous monitoring (all phases):

- RDS CPU utilization (alert >70%)
- RDS connection count (alert >80% of max_connections)
- RDS query latency p95, p99 (alert >2× EC2 baseline)
- Replication lag (alert >10s during dual-state)
- Application error rate (alert >0.1%)
- Customer-impact metrics: checkout success rate, login success rate (any drop = stop)
- DMS replication health (if using DMS)

Automated diff checks:

During dual-write phase, run hourly:

Row counts of all tables (EC2 vs RDS)
Sample 0.01% of recent writes, query both DBs, diff
Foreign key integrity checks
Sequence values (if using sequences)

Kill Switch

Primary kill switch: LaunchDarkly flag database_target with values 'ec2' (default), 'rds', 'both'.

Flip to 'ec2' → all traffic returns to EC2 immediately (within seconds)
This works UNTIL phase 5 (cutover). After cutover, EC2 is no longer receiving writes, so flipping to 'ec2' would cause data inconsistency.

Secondary kill switch (post-cutover): during the 7-day hold period, you can:

Quickly re-enable dual-write (config change, ~10 min to deploy)
Or accept short data loss + revert to EC2 (extreme — only for catastrophic RDS failure)

Decision after cutover: if RDS fails in the 7-day hold window, the rollback is to:

1. Re-enable dual-write (now to EC2 + RDS) so writes are captured to BOTH

2. Wait for RDS issue to be diagnosed + fixed

3. Decide: continue on RDS or fall back to EC2

4. If falling back to EC2: replay any RDS-only writes from your app's audit log to EC2

This is why the 7-day hold matters — it's the window where RDS has to prove itself before EC2 is destroyed.

Rollback Procedure (specific commands)

During Phase 1-3 (read traffic only on RDS):

# LaunchDarkly UI: flip database_target flag to 'ec2'
# OR via API: ld flags set database_target ec2
# Effect: instant. App config refreshes within 30s.

During Phase 4 (dual-write):

# Stop dual-write — flip flag to 'ec2-only-writes'
# Effect: writes go to EC2 only; reads on RDS continue
# OR full rollback to EC2: flip database_target to 'ec2'

During Phase 5 cutover (writes on RDS):

# Re-enable dual-write: flip database_target to 'both'
# Verify writes flow to both
# If still problematic, flip to 'ec2' and accept short window of writes that didn't reach EC2
# These are recoverable from RDS via: pg_dump rds > diff.sql; psql ec2 < diff.sql (after diff)

During Phase 6 (post-cutover):

# Old EC2 still running but receiving no writes
# To rollback: 
# 1. Bring EC2 back to write-receiving state
# 2. Replicate RDS → EC2 to catch up the writes RDS-only received during the hold period
# 3. Flip flag back to ec2
# Estimated time: 1-3 hours depending on data volume diff

Communication Plan

Internal team:

Migration kickoff: all-hands, week 1 — align on plan + on-call
Daily standup mention during phases
Slack #engineering channel: live updates during phase transitions
Postmortem after Phase 5 success

Stakeholders:

CTO + CEO: weekly status email + immediate notification on any phase rollback
Product team: notify them of the migration window so they don't ship risky features that week
Compliance: notify of the migration completion for the audit

Customers:

No customer-facing communication unless something goes wrong (the migration should be invisible to them)
IF an issue occurs: status page update within 5 min, Twitter/X update within 15 min if customer-impacting

Timeline

Week 1 (Apr 28 - May 2): Infrastructure + tooling setup
Week 2 (May 5-9): Initial sync + validation
Week 3 (May 12-16): Application-layer prep, deploy router code
Week 4 (May 19-23): Phase 1 (shadow reads 5%) + Phase 2 (25%)
Week 5 (May 26-30): Phase 3 (100% reads on RDS) — hold for 5 days
Week 6 (Jun 2-6): Phase 4 (dual-write) — hold for 5 days
Week 7 (Jun 9-13): Phase 5 (cutover, Tuesday 10am Pacific) — hold for 7 days
Week 8 (Jun 16-20): Phase 6 (decommission EC2)

Done by mid-June. Compliance deadline Q3 met with comfort.

Migration windows: Phase transitions ALWAYS Tuesday-Thursday, 9am-3pm Pacific. Never Friday (no recovery window before weekend). Never Monday (Monday = highest traffic + highest risk).

On-Call Setup

Phase 4-5 (highest-risk weeks):

Primary on-call: DBA contractor (paged first for DB issues)
Secondary: Senior backend engineer (paged for app-layer issues)
Tech lead on standby (decision authority)

Standard hours coverage: all 4 backend engineers know the runbook + can execute kill switch

Escalation path: primary → secondary → tech lead → CTO

Pages-firing thresholds:

Customer error rate > 0.5% sustained 5 min → page primary
Customer error rate > 2% → page primary + secondary + tech lead
Any data integrity check failure → page everyone

Post-Migration Cleanup (Week 8+)

1. Remove dual-write code path from app. Single-DB code is simpler + faster.

2. Remove LaunchDarkly flag. Once Phase 6 is stable for 30 days.

3. Update internal docs to reflect RDS architecture.

4. Archive EC2 disk snapshots for 90 days, then delete. Compliance + audit.

5. Update runbooks for DB issues (RDS-specific procedures).

6. Postmortem doc: what went well, what surprises hit, lessons for next time.

What This Plan Doesn't Solve

Won't fix application-layer bugs. If your app has SQL queries that work on Pg 14 but break on Pg 16 (rare but possible), tests should catch it. If they don't, ship it + fix forward.
Won't handle 'we changed our mind' scenarios. Once you decommission EC2 in Week 8, going back is rebuild-from-scratch territory.
Won't compress timeline. 8 weeks is aggressive for a critical-path migration; squeezing to 4 weeks adds risk dramatically.
Won't reduce on-call burden during transition. Plan for higher-than-normal incident response.

Specific Risk Scenarios

Scenario 1: 'Snowflake CDC stops working after replication slot moves to RDS.'

Likelihood: medium. Snowflake CDC reads from EC2's logical replication slots; switching to RDS breaks the slot.

Response: pre-plan in Week 1. Either (a) reconfigure Snowflake to read from RDS replication slots, OR (b) accept a brief gap in CDC during cutover and backfill with a one-time SELECT query.

Scenario 2: 'RDS connection limits hit during peak traffic.'

Likelihood: low (we provisioned db.r6i.xlarge with ample max_connections). Watch metrics.

Response: scale RDS instance up (no downtime needed for compute scaling on Multi-AZ) OR enable RDS Proxy.

Scenario 3: 'Dual-write divergence detected in Week 6.'

Likelihood: medium. Race conditions in app layer, transaction isolation issues.

Response: stop progression. Diagnose divergence root cause. Don't proceed to Phase 5 until 0 divergence for 5 consecutive days.

Scenario 4: 'Stripe webhook delivers an event during cutover that double-processes.'

Likelihood: low (idempotency keys should prevent), but worth verifying.

Response: idempotency is your defense. Verify Stripe handler uses event-ID idempotency before starting. Test in staging.

Scenario 5: 'Tuesday 10am cutover fails; need to roll back at 11am.'

Likelihood: this is what we plan for.

Response: kill switch flag to 'ec2' (instant). Diagnose. Schedule retry for next Tuesday. No weekend deploys.

Key Takeaways

This is an 8-week staged migration. Big-bang is too risky for a critical-path system at your scale.
5 phases: shadow → 5% → 25% → 100% reads → dual-write → cutover. Hold periods at each gate.
Kill switch is a LaunchDarkly flag. Deploy-free. Instant. Tested before starting.
The 7-day hold after cutover is non-negotiable. Don't decommission EC2 until RDS has proven itself.
Tuesday-Thursday business-hours only for phase transitions. Never Friday. Never Monday.
Document specific rollback commands BEFORE starting. The Phase 4-5 rollback procedures must be tested in staging first.

Common use cases

Engineer planning a Postgres major version upgrade (14 → 16)
Team migrating from Redis to KeyDB / Memcached / a different cache
Backend lead upgrading their framework version (Next.js 14 → 15, Rails 7 → 8)
Architect splitting a monolith into 2 services (the first split is the hardest)
Engineer changing primary database vendor (Postgres → Aurora, MySQL → Postgres)
Team replacing a 3rd-party service (Auth0 → Clerk, SendGrid → Postmark)

Best AI model for this

Claude Opus 4. Migration planning needs reasoning about reversibility, blast radius, and edge cases — exactly Claude's strengths. ChatGPT GPT-5 second-best.

Pro tips

Shadow mode first. Run new system alongside old, compare outputs, don't act on new system's output. Surfaces edge cases without risk.
5% → 25% → 50% → 100% traffic rollout is safer than direct cut-over. Each phase reveals new failure modes.
Always have a kill switch that doesn't require a deploy. Env var, feature flag, config change — something you can flip in 1 minute.
Migrations during business hours, not weekends. You want senior engineers awake when the failure happens.
Verify dual-write consistency before flipping reads. Most data migrations fail at this step, not at the cut-over.
The 'last 5%' always breaks. That's where the rare data conditions live. Plan for it; don't be surprised.
Document the rollback procedure BEFORE you start the migration. Not 'we can revert if needed' — actual specific commands.

Customization tips

Be specific about the from-state and to-state. Migration plans differ dramatically between minor version upgrades vs vendor changes.
Specify the criticality. A migration on internal-tooling can use simpler patterns than payment-system migrations.
Be honest about rollback acceptable delay. 'Minutes' vs 'hours' vs 'days' fundamentally shapes the migration design.
Mention prior migration experience. Teams who've done 5 prior migrations need less guardrails than first-timers.
List ALL integration points. Migrations often fail at the upstream/downstream systems (analytics pipeline, CDC, webhooks) not at the migrated system itself.
Use the Critical-Path Mode variant for migrations on payment / auth / data systems — adds extra verification gates and longer parallel-run periods.

Variants

Database Migration Mode

For DB schema or version upgrades — emphasizes data integrity, dual-write/dual-read, and zero-downtime patterns.

Framework Upgrade Mode

For framework or runtime version upgrades — emphasizes API surface changes, dependency cascades, and staged deploys.

Service Migration Mode

For replacing a 3rd-party service or splitting a monolith — emphasizes API contract testing and staged cutover.

Critical-Path Mode

For migrations on payment / auth / data-store systems — adds extra verification gates and longer parallel-run periods.

Frequently asked questions

How do I use the Migration Rollout Plan Designer prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with Migration Rollout Plan Designer?

Claude Opus 4. Migration planning needs reasoning about reversibility, blast radius, and edge cases — exactly Claude's strengths. ChatGPT GPT-5 second-best.

Can I customize the Migration Rollout Plan Designer prompt for my use case?

Yes — every Promptolis Original is designed to be customized. Key levers: Shadow mode first. Run new system alongside old, compare outputs, don't act on new system's output. Surfaces edge cases without risk.; 5% → 25% → 50% → 100% traffic rollout is safer than direct cut-over. Each phase reveals new failure modes.

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.

← All Promptolis Originals

Curated by

Promptolis Editorial

Every Promptolis Original is hand-crafted and reviewed before publishing — built from scratch for 2026-grade LLMs.

Last reviewed on 2026-04-28 · About Promptolis