⚡ Promptolis Original · Data & Analytics

🔄 Data Pipeline Architect — ETL/ELT Design For 2026

The structured data pipeline design — covering ETL vs. ELT tradeoffs, batch vs. streaming architecture, tool selection (dbt / Airflow / Fivetran / custom), idempotency + reliability patterns, and the monitoring framework.

⏱️ 3 weeks design + ongoing 🤖 ~2 min in Claude 🗓️ Updated 2026-04-20

Why this is epic

Data pipelines are critical infrastructure, yet they are often built ad hoc. This Original produces a structured design: ETL/ELT choice, streaming vs. batch, tool selection, reliability patterns, monitoring.

Names the 6 pipeline failure modes (non-idempotent, no monitoring, brittle dependencies, slow recovery, data quality ignored, schema drift) + fixes.

Produces complete architecture framework with specific tool recommendations + scaling considerations.

The prompt

Promptolis Original · Copy-ready
<role>
You are a data engineering architect with 12 years of experience. You've built pipelines at companies from startup to Fortune 500. You draw on dbt, Airflow, Fivetran, Snowflake/BigQuery, Kafka, + modern data stack patterns. You are direct. You will name when architecture is over-engineered, when streaming is unnecessary, when monitoring is inadequate, and when tool choices are expensive.
</role>

<principles>
1. ELT > ETL for modern cloud warehouses.
2. Idempotent pipelines.
3. Schema contracts.
4. Monitor freshness + completeness + quality + latency.
5. Modern stack: dbt + Airflow/Prefect + Fivetran + Snowflake/BigQuery.
6. Batch first, streaming only if needed.
7. Automated data quality checks.
8. Document lineage.
</principles>

<input>
<data-sources>{what produces data}</data-sources>
<destination>{data warehouse + other consumers}</destination>
<volume>{rows/day, size}</volume>
<freshness-needs>{real-time / hourly / daily}</freshness-needs>
<current-stack>{existing tools}</current-stack>
<team-capacity>{who builds + maintains}</team-capacity>
<budget>{tool + infrastructure}</budget>
<constraints>{compliance, latency, data residency}</constraints>
</input>

<output-format>
# Data Pipeline Architecture: [Scope]
## Sources + Destinations Overview
## ETL vs. ELT Decision
## Batch vs. Streaming
## Tool Stack Recommendations
## Idempotency + Reliability Patterns
## Monitoring + Alerting
## Implementation Plan
## Cost Estimation
## Key Takeaways (5 bullets)
</output-format>

<auto-intake>
If input incomplete: ask for sources, destination, volume, freshness, current stack, team, budget, constraints.
</auto-intake>

Now, architect:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<data-sources>PostgreSQL (application data, 10M rows daily update), Stripe (billing), Salesforce (CRM), Mixpanel (product events, 50M events/month), Intercom (support).</data-sources>
<destination>Snowflake data warehouse for analytics + reporting. Metabase for dashboards.</destination>
<volume>Total daily ingest: ~500 GB raw. After processing: ~80 GB of modeled data.</volume>
<freshness-needs>Daily for dashboards + reports. Hourly for marketing + product metrics. Near-real-time for customer-facing features (separate pipeline).</freshness-needs>
<current-stack>Mixed legacy. Some custom Python scripts in cron. AWS Glue for some jobs. Stripe + Salesforce via manual exports. Patchwork, unreliable.</current-stack>
<team-capacity>1 data engineer full-time, 1 analytics engineer full-time. $50K annual tool budget.</team-capacity>
<budget>Tool budget $50K/year. Warehouse costs separate (~$5K/month).</budget>
<constraints>SOC 2 compliance. EU data stays in EU region. PII handling careful.</constraints>
🤖 Output

Data Pipeline Architecture: Modern Data Stack, 5 Sources → Snowflake, Legacy Migration

This architecture replaces a legacy patchwork with a structured modern stack. Your situation — mixed custom scripts + Glue + manual exports — is a classic modernization opportunity. The design: modern stack (Fivetran + dbt + Airflow + Snowflake), batch-first with a clear streaming path if needed, robust monitoring, and a 3-month migration plan.

Sources + Destinations Overview

Sources:

| Source | Type | Volume | Method |
| --- | --- | --- | --- |
| PostgreSQL | Application DB | 10M rows/day | Fivetran |
| Stripe | Billing API | ~10K events/day | Fivetran |
| Salesforce | CRM API | ~5K records/day | Fivetran |
| Mixpanel | Event tracking | 50M events/mo | Fivetran export OR native Mixpanel → Snowflake |
| Intercom | Support | ~1K conversations/day | Fivetran |

Destination:

  • Snowflake (data warehouse)
  • Metabase (dashboards, connects to Snowflake)
  • Reverse ETL for ops tools (optional future)

Data flow:

Sources → Fivetran → Snowflake (raw layer)
                          ↓
                    dbt transformations
                          ↓
                    Snowflake (staging + marts)
                          ↓
                    Metabase + other consumers
                          ↓
                  (Optional) Reverse ETL

ETL vs. ELT Decision

ELT (Extract, Load, Transform) — Modern Choice

Why ELT for your stack:

  • Snowflake has massive compute + storage
  • Transformation cost-effective in-warehouse
  • dbt provides version-controlled, testable transformations
  • Preserves raw data for re-processing
  • Faster time-to-data (load first, transform later)

When ETL (legacy) would fit:

  • Very expensive compute target
  • Compliance requiring transformation before storage
  • Massive volumes where pre-processing saves significant cost

Your situation: ELT fits. Transforming in Snowflake is economical + flexible.

Batch vs. Streaming

Batch (daily + hourly) — Primary approach

Why batch for you:

  • Freshness needs: daily/hourly (not real-time)
  • Simpler to build + maintain
  • Lower cost
  • Easier debugging
  • Matches your team's capacity

Streaming reserved for:

  • Customer-facing features (separate pipeline — not part of this analytics stack)
  • Future use cases if freshness requirements tighten

Architecture:

  • Daily refreshes: most sources (Postgres, Stripe, Salesforce, Intercom)
  • Hourly refreshes: product events + marketing data (Mixpanel)
  • If latency requirements tighten: consider Kafka + Snowflake streaming in 12-18 months

Tool Stack Recommendations

Ingestion: Fivetran

Why Fivetran:

  • Managed connectors for all 5 sources
  • Automated schema evolution
  • Low maintenance
  • Compliance features (SOC 2, GDPR)

Cost: $1,500-3,000/month based on your volume (5-15M monthly active rows)

Alternatives:

  • Airbyte (open-source, self-hosted) — lower cost but more maintenance
  • Segment (for product events specifically) — already have Mixpanel
  • Custom ingestion — more engineering, less reliable

Transformation: dbt (Data Build Tool)

Why dbt:

  • Industry standard for analytics engineering
  • Version-controlled SQL
  • Testing framework
  • Documentation generation
  • Lineage tracking

dbt Cloud vs. Core:

  • dbt Core (free) — self-hosted, requires setup
  • dbt Cloud ($100-300/month) — managed, scheduling, UI, worth it for small team

Recommendation: dbt Cloud for your team size (saves engineering time).

Orchestration: Airflow (or Prefect)

Why orchestration tool:

  • Manage complex dependencies
  • Retry logic
  • Monitoring
  • Alerting

Options:

  • Apache Airflow (managed via AWS MWAA): industry standard, more setup
  • Prefect Cloud: modern, Python-native, easier learning curve
  • Dagster: asset-focused, newer but excellent
  • dbt Cloud scheduler: simple, may suffice if only dbt scheduling needed

Recommendation for your scale: start with dbt Cloud scheduler + Fivetran native scheduling. Add Airflow/Prefect when complexity grows.

Cost: dbt Cloud includes scheduling. Airflow MWAA: ~$350-700/month if needed.

Warehouse: Snowflake (existing)

Sizing:

  • Compute: Small/Medium warehouse for transformations
  • Storage: manageable given data volume
  • Auto-suspend enabled (save costs)
  • Separate warehouses for ingestion, transformation, BI query

Current cost: $5K/month — reasonable for your scale.

Data Quality: dbt tests + custom monitoring

dbt built-in tests:

  • not_null, unique, accepted_values, relationships
  • Custom tests for business logic

Additional monitoring:

  • Monte Carlo or Datafold (if budget) — data observability platforms
  • Custom alerting via Slack/PagerDuty

Idempotency + Reliability Patterns

Idempotent Design

Principle: pipelines should be safely re-runnable without duplicating data.

Patterns:

Merge (upsert) operations:

-- In dbt: use incremental materialization with merge strategy
MERGE INTO warehouse.users target
USING staging.users_new source
ON target.user_id = source.user_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;

Full-refresh option: available when data quality issues require reset.

Watermarks for incremental loads:

  • Track last-loaded timestamp
  • Re-run from watermark if needed
  • Safely re-process without duplicates
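The merge + watermark pattern can be sketched in plain Python (the source/state shapes and `users` table name are hypothetical; in a dbt project the incremental materialization handles this for you):

```python
from datetime import datetime, timezone

def load_incremental(source, warehouse, state, table="users"):
    """Idempotent incremental load: pull rows newer than the stored
    watermark, upsert them by primary key, then advance the watermark."""
    watermark = state.get(table, datetime.min.replace(tzinfo=timezone.utc))
    new_rows = [r for r in source if r["updated_at"] > watermark]
    for row in new_rows:
        warehouse[row["user_id"]] = row  # upsert: re-running never duplicates
    if new_rows:
        state[table] = max(r["updated_at"] for r in new_rows)
    return len(new_rows)
```

Re-running against the same source is a no-op: the watermark has already advanced, so zero rows load and nothing duplicates.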
Reliability Patterns

Retry logic:

  • Exponential backoff
  • Max retry count (3-5)
  • Different retry strategies per failure type
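Airflow and Prefect provide retries with backoff out of the box; as a standalone illustration (not any specific library's API), the pattern is:

```python
import random
import time

def run_with_retries(task, max_attempts=4, base_delay=1.0):
    """Retry a flaky task with exponential backoff + jitter.
    Attempt n waits base_delay * 2**n seconds (plus jitter) before retrying."""
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the failure to the orchestrator
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

The jitter spreads out retries so many failed tasks don't hammer a recovering source at the same instant.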

Circuit breaker:

  • If source API failing, pause subsequent runs
  • Don't compound failures
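The circuit-breaker idea in miniature — consecutive failures trip the breaker so later runs skip the broken source rather than compounding the outage (class name and threshold are illustrative):

```python
class CircuitBreaker:
    """Stop calling a failing source after `threshold` consecutive failures."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.threshold

    def call(self, task):
        if self.open:
            raise RuntimeError("circuit open: skipping run")
        try:
            result = task()
        except Exception:
            self.failures += 1  # another consecutive failure
            raise
        self.failures = 0  # success resets the breaker
        return result
```

Production breakers usually add a cool-down timer that half-opens the circuit to probe for recovery.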

Data contracts:

  • Formal schemas between producers + consumers
  • Breaking changes go through review
  • Schema evolution tooling (Avro or Protobuf schemas; dbt source definitions + tests on the warehouse side)

Dead letter queues:

  • Failed records captured separately
  • Investigated + replayed
  • Not lost
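A minimal sketch of the dead-letter pattern — failed records are captured with their error instead of aborting the batch or silently dropping them (hypothetical helper):

```python
def process_batch(records, transform, dead_letters):
    """Transform a batch; records that fail go to a dead-letter list
    for later inspection + replay instead of being lost."""
    loaded = []
    for record in records:
        try:
            loaded.append(transform(record))
        except Exception as exc:
            dead_letters.append({"record": record, "error": str(exc)})
    return loaded
```

The dead-letter list (a queue or table in practice) gives you the raw record plus the failure reason, so replays are targeted rather than full re-runs.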

Monitoring + Alerting

Freshness Monitoring

Monitor:

  • Last successful load time per source
  • Expected vs. actual run schedule
  • Lag from source to warehouse

Alerts:

  • Source data >2x expected lag: warning
  • Source data >4x: critical
  • Destination dashboard stale >24hr: user-visible warning
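The lag thresholds above reduce to a small check (illustrative helper; observability platforms and orchestrator SLAs express the same rule):

```python
from datetime import datetime, timezone

def freshness_alert(last_loaded, expected_interval, now=None):
    """Map lag against the expected load interval to an alert level:
    lag > 4x expected -> 'critical', lag > 2x -> 'warning', else 'ok'."""
    now = now or datetime.now(timezone.utc)
    lag = now - last_loaded
    if lag > 4 * expected_interval:
        return "critical"
    if lag > 2 * expected_interval:
        return "warning"
    return "ok"
```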
Completeness Monitoring

Monitor:

  • Expected vs. actual row counts
  • Min/max values
  • Null rate changes
  • Unique count changes

dbt tests: run on every transformation

Anomaly detection:

  • Sudden 50%+ drop in expected rows
  • Schema changes (unexpected columns)
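The row-count drop rule as a sketch (the 50% threshold and names are illustrative; the baseline would come from recent load history):

```python
def row_count_anomaly(expected, actual, drop_threshold=0.5):
    """Flag a load whose row count dropped more than `drop_threshold`
    (default 50%) below the expected count."""
    if expected <= 0:
        return False  # no baseline yet; nothing to compare against
    return actual < expected * (1 - drop_threshold)
```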
Quality Monitoring

Automated data quality tests:

  • Foreign key validity
  • Business logic (e.g., revenue always positive)
  • Cross-table consistency
  • Expected distributions

Monitoring platform options:

  • Monte Carlo ($2-5K/month for your scale)
  • Datafold (similar)
  • Custom Slack alerts (free, more DIY)

Latency Monitoring

Track:

  • End-to-end latency (source → warehouse)
  • Per-stage latency
  • Recovery time from failures

SLAs:

  • Daily data: available by 8am ET
  • Hourly data: within 2 hours
  • Recovery from failure: 4 hours

Implementation Plan (3-Month Migration)

Month 1: Foundation

Week 1-2:

  • Fivetran setup + connectors configured
  • Initial data loaded to Snowflake raw layer
  • Connection testing
  • Documentation of current pipelines

Week 3-4:

  • dbt project initialized
  • Staging models for each source
  • First quality tests
  • Team training on dbt

Month 2: Transformation

Week 5-6:

  • Dimensional modeling (facts + dimensions)
  • Core business metrics defined as models
  • Documentation in dbt

Week 7-8:

  • Marts for reporting
  • Metabase connected to new marts
  • Parallel running (old + new) for validation

Month 3: Migration + Stabilization

Week 9-10:

  • Old pipelines decommissioned gradually
  • Monitoring + alerting in place
  • Runbooks documented

Week 11-12:

  • Performance optimization
  • Cost optimization (auto-suspend, warehouse sizing)
  • Team training on runbooks
  • Post-migration retrospective

Cost Estimation

Annual Tool Costs:

| Tool | Monthly | Annual |
| --- | --- | --- |
| Fivetran | $2,500 | $30,000 |
| dbt Cloud | $200 | $2,400 |
| Snowflake | $5,000 | $60,000 (separate budget) |
| Monte Carlo (optional) | $3,000 | $36,000 |
| Total tools (excluding warehouse) | ~$5,700 | ~$68,400 |

Within your $50K tool budget if skipping Monte Carlo initially.

Alternative: reduce Fivetran cost by using Airbyte (self-hosted):

  • Airbyte self-hosted: minimal infrastructure cost (~$200/month)
  • BUT: more engineering time for setup + maintenance
  • Tradeoff: save $25K/year in tool cost, cost ~$15K in engineering time

Key Takeaways

  • Modern ELT stack: Fivetran (ingestion) + dbt (transformation) + Snowflake (warehouse) + scheduler (dbt Cloud sufficient for your scale). $50-68K/year tool cost fits your budget.
  • Batch-first architecture. Daily + hourly refreshes cover your freshness needs. Add streaming only when customer-facing real-time requires — separate pipeline.
  • Idempotent design with dbt incremental models + watermarks. Pipelines safely re-runnable. Schema contracts between sources + warehouse prevent drift pain.
  • Monitoring 4 dimensions: freshness, completeness, quality, latency. dbt tests + Slack alerts sufficient for MVP. Monte Carlo upgrade ($36K/year) when complexity grows.
  • 3-month migration: Month 1 foundation, Month 2 transformation, Month 3 migration + stabilization. Parallel run validates + prevents data loss. Documentation + runbooks critical for ongoing operations.

Common use cases

  • Data teams building first data warehouse
  • Companies modernizing legacy ETL
  • Startups choosing initial data stack
  • Scaling data infrastructure
  • Multi-source data consolidation

Best AI model for this

Claude Opus 4 or Sonnet 4.5. Pipeline design requires data engineering + systems + business understanding. Top-tier reasoning matters.

Pro tips

  • Modern data stack: ELT > ETL. Transform in warehouse (Snowflake/BigQuery), not in-flight.
  • Idempotency critical: pipelines should be safely re-runnable.
  • Schema contracts between producers + consumers prevent drift pain.
  • Monitor: data freshness, completeness, quality, latency.
  • Tool selection: dbt for transformation, Airflow/Prefect for orchestration, Fivetran for managed connectors.
  • Start with batch, add streaming only when needed. Streaming is 10x complexity.
  • Data quality is everyone's problem. Automated checks + alerts.
  • Document lineage + dependencies.

Customization tips

  • Invest in dbt documentation. Auto-generated docs + lineage are highest-ROI documentation you'll produce.
  • Data pipelines are ongoing investment, not build-once. Budget 30%+ of team time for maintenance + improvement.
  • Data contracts between upstream systems (Postgres app) + downstream (analytics) prevent schema surprises. Worth formalizing.
  • For GDPR/compliance: anonymize PII in warehouse where possible. Separate PII-containing tables with restricted access.
  • Monte Carlo + Datafold are worth investment AFTER foundation solid. Early stage: dbt tests + custom alerts sufficient.

Variants

Modern Stack Build

Starting from scratch with modern tools.

Legacy Migration

Modernizing existing ETL.

Streaming Addition

Adding real-time to batch.

Multi-Source Consolidation

Joining many sources.

Frequently asked questions

How do I use the Data Pipeline Architect — ETL/ELT Design For 2026 prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with Data Pipeline Architect — ETL/ELT Design For 2026?

Claude Opus 4 or Sonnet 4.5. Pipeline design requires data engineering + systems + business understanding. Top-tier reasoning matters.

Can I customize the Data Pipeline Architect — ETL/ELT Design For 2026 prompt for my use case?

Yes — every Promptolis Original is designed to be customized. Key levers: Modern data stack: ELT > ETL. Transform in warehouse (Snowflake/BigQuery), not in-flight.; Idempotency critical: pipelines should be safely re-runnable.

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.

← All Promptolis Originals