⚡ Promptolis Original · Data & Analytics

🔄 Data Pipeline Architect — ETL/ELT Design For 2026

The structured data pipeline design — covering ETL vs. ELT tradeoffs, batch vs. streaming architecture, tool selection (dbt / Airflow / Fivetran / custom), idempotency + reliability patterns, and the monitoring framework.

⏱️ 3 weeks design + ongoing 🤖 ~2 min in Claude 🗓️ Updated 2026-04-20

Why this is epic

Data pipelines are critical infrastructure, yet they are often built ad hoc. This Original produces a structured design: ETL/ELT choice, streaming vs. batch, tool selection, reliability patterns, monitoring.

Names the 6 pipeline failure modes (non-idempotent, no monitoring, brittle dependencies, slow recovery, data quality ignored, schema drift) + fixes.

Produces complete architecture framework with specific tool recommendations + scaling considerations.

The prompt

Promptolis Original · Copy-ready
<role>
You are a data engineering architect with 12 years of experience. You've built pipelines at companies from startup to Fortune 500. You draw on dbt, Airflow, Fivetran, Snowflake/BigQuery, Kafka, + modern data stack patterns. You are direct. You will name when architecture is over-engineered, when streaming is unnecessary, when monitoring is inadequate, and when tool choices are expensive.
</role>

<principles>
1. ELT > ETL for modern cloud warehouses.
2. Idempotent pipelines.
3. Schema contracts.
4. Monitor freshness + completeness + quality + latency.
5. Modern stack: dbt + Airflow/Prefect + Fivetran + Snowflake/BigQuery.
6. Batch first, streaming only if needed.
7. Automated data quality checks.
8. Document lineage.
</principles>

<input>
<data-sources>{what produces data}</data-sources>
<destination>{data warehouse + other consumers}</destination>
<volume>{rows/day, size}</volume>
<freshness-needs>{real-time / hourly / daily}</freshness-needs>
<current-stack>{existing tools}</current-stack>
<team-capacity>{who builds + maintains}</team-capacity>
<budget>{tool + infrastructure}</budget>
<constraints>{compliance, latency, data residency}</constraints>
</input>

<output-format>
# Data Pipeline Architecture: [Scope]
## Sources + Destinations Overview
## ETL vs. ELT Decision
## Batch vs. Streaming
## Tool Stack Recommendations
## Idempotency + Reliability Patterns
## Monitoring + Alerting
## Implementation Plan
## Cost Estimation
## Key Takeaways (5 bullets)
</output-format>

<auto-intake>
If input incomplete: ask for sources, destination, volume, freshness, current stack, team, budget, constraints.
</auto-intake>

Now, architect:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<data-sources>PostgreSQL (application data, 10M rows daily update), Stripe (billing), Salesforce (CRM), Mixpanel (product events, 50M events/month), Intercom (support).</data-sources>
<destination>Snowflake data warehouse for analytics + reporting. Metabase for dashboards.</destination>
<volume>Total daily ingest: ~500 GB raw. After processing: ~80 GB of modeled data.</volume>
<freshness-needs>Daily for dashboards + reports. Hourly for marketing + product metrics. Near-real-time for customer-facing features (separate pipeline).</freshness-needs>
<current-stack>Mixed legacy. Some custom Python scripts in cron. AWS Glue for some jobs. Stripe + Salesforce via manual exports. Patchwork, unreliable.</current-stack>
<team-capacity>1 data engineer full-time, 1 analytics engineer full-time. $50K annual tool budget.</team-capacity>
<budget>Tool budget $50K/year. Warehouse costs separate (~$5K/month).</budget>
<constraints>SOC 2 compliance. EU data stays in EU region. PII handling careful.</constraints>
🤖 Output

Data Pipeline Architecture: Modern Data Stack, 5 Sources → Snowflake, Legacy Migration

This architecture replaces a legacy patchwork with a structured modern stack. Your situation — mixed custom scripts + Glue + manual exports — is a classic modernization opportunity. The design: modern stack (Fivetran + dbt + Airflow + Snowflake), batch-first with a clear streaming path if needed, robust monitoring, and a 3-month migration plan.

Sources + Destinations Overview

Sources:

| Source | Type | Volume | Method |
| --- | --- | --- | --- |
| PostgreSQL | Application DB | 10M rows/day | Fivetran |
| Stripe | Billing API | ~10K events/day | Fivetran |
| Salesforce | CRM API | ~5K records/day | Fivetran |
| Mixpanel | Event tracking | 50M events/mo | Fivetran export OR native Mixpanel → Snowflake |
| Intercom | Support | ~1K conversations/day | Fivetran |

Destination:

  • Snowflake (data warehouse)
  • Metabase (dashboards, connects to Snowflake)
  • Reverse ETL for ops tools (optional future)

Data flow:

Sources → Fivetran → Snowflake (raw layer)
                          ↓
                    dbt transformations
                          ↓
                    Snowflake (staging + marts)
                          ↓
                    Metabase + other consumers
                          ↓
                  (Optional) Reverse ETL

ETL vs. ELT Decision

ELT (Extract, Load, Transform) — Modern Choice

Why ELT for your stack:

  • Snowflake has massive compute + storage
  • Transformation cost-effective in-warehouse
  • dbt provides version-controlled, testable transformations
  • Preserves raw data for re-processing
  • Faster time-to-data (load first, transform later)

When ETL (legacy) would fit:

  • Very expensive compute target
  • Compliance requiring transformation before storage
  • Massive volumes where pre-processing saves significant cost

Your situation: ELT fits. Transforming in Snowflake is economical + flexible.

Batch vs. Streaming

Batch (daily + hourly) — Primary approach

Why batch for you:

  • Freshness needs: daily/hourly (not real-time)
  • Simpler to build + maintain
  • Lower cost
  • Easier debugging
  • Matches your team's capacity

Streaming reserved for:

  • Customer-facing features (separate pipeline — not part of this analytics stack)
  • Future use cases if freshness requirements tighten

Architecture:

  • Daily refreshes: most sources (Postgres, Stripe, Salesforce, Intercom)
  • Hourly refreshes: product events + marketing data (Mixpanel)
  • If latency requirements tighten: consider Kafka + Snowflake streaming in 12-18 months

Tool Stack Recommendations

Ingestion: Fivetran

Why Fivetran:

  • Managed connectors for all 5 sources
  • Automated schema evolution
  • Low maintenance
  • Compliance features (SOC 2, GDPR)

Cost: $1,500-3,000/month based on your volume (5-15M monthly active rows)

Alternatives:

  • Airbyte (open-source, self-hosted) — lower cost but more maintenance
  • Segment (for product events specifically) — already have Mixpanel
  • Custom ingestion — more engineering, less reliable

Transformation: dbt (Data Build Tool)

Why dbt:

  • Industry standard for analytics engineering
  • Version-controlled SQL
  • Testing framework
  • Documentation generation
  • Lineage tracking

dbt Cloud vs. Core:

  • dbt Core (free) — self-hosted, requires setup
  • dbt Cloud ($100-300/month) — managed, scheduling, UI, worth it for small team

Recommendation: dbt Cloud for your team size (saves engineering time).

Orchestration: Airflow (or Prefect)

Why orchestration tool:

  • Manage complex dependencies
  • Retry logic
  • Monitoring
  • Alerting

Options:

  • Apache Airflow (managed via AWS MWAA): industry standard, more setup
  • Prefect Cloud: modern, Python-native, easier learning curve
  • Dagster: asset-focused, newer but excellent
  • dbt Cloud scheduler: simple, may suffice if only dbt scheduling needed

Recommendation for your scale: start with dbt Cloud scheduler + Fivetran native scheduling. Add Airflow/Prefect when complexity grows.

Cost: dbt Cloud includes scheduling. Airflow MWAA: ~$350-700/month if needed.

Warehouse: Snowflake (existing)

Sizing:

  • Compute: Small/Medium warehouse for transformations
  • Storage: manageable given data volume
  • Auto-suspend enabled (save costs)
  • Separate warehouses for ingestion, transformation, BI query

Current cost: $5K/month — reasonable for your scale.

Data Quality: dbt tests + custom monitoring

dbt built-in tests:

  • not_null, unique, accepted_values, relationships
  • Custom tests for business logic

Additional monitoring:

  • Monte Carlo or Datafold (if budget) — data observability platforms
  • Custom alerting via Slack/PagerDuty

Idempotency + Reliability Patterns

Idempotent Design

Principle: pipelines should be safely re-runnable without duplicating data.

Patterns:

Merge (upsert) operations:

-- In dbt: use incremental materialization with merge strategy
MERGE INTO warehouse.users target
USING staging.users_new source
ON target.user_id = source.user_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;

Full-refresh option: available when data quality issues require reset.

Watermarks for incremental loads:

  • Track last-loaded timestamp
  • Re-run from watermark if needed
  • Safely re-process without duplicates
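The merge + watermark pattern can be sketched in plain Python (the source/state shapes and `users` table name are hypothetical; in a dbt project the incremental materialization handles this for you):

```python
from datetime import datetime, timezone

def load_incremental(source, warehouse, state, table="users"):
    """Idempotent incremental load: pull rows newer than the stored
    watermark, upsert them by primary key, then advance the watermark."""
    watermark = state.get(table, datetime.min.replace(tzinfo=timezone.utc))
    new_rows = [r for r in source if r["updated_at"] > watermark]
    for row in new_rows:
        warehouse[row["user_id"]] = row  # upsert: re-running never duplicates
    if new_rows:
        state[table] = max(r["updated_at"] for r in new_rows)
    return len(new_rows)
```

Re-running against the same source is a no-op: the watermark has already advanced, so zero rows load and nothing duplicates.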
Reliability Patterns

Retry logic:

  • Exponential backoff
  • Max retry count (3-5)
  • Different retry strategies per failure type
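Airflow and Prefect provide retries with backoff out of the box; as a standalone illustration (not any specific library's API), the pattern is:

```python
import random
import time

def run_with_retries(task, max_attempts=4, base_delay=1.0):
    """Retry a flaky task with exponential backoff + jitter.
    Attempt n waits base_delay * 2**n seconds (plus jitter) before retrying."""
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the failure to the orchestrator
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

The jitter spreads out retries so many failed tasks don't hammer a recovering source at the same instant.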

Circuit breaker:

  • If source API failing, pause subsequent runs
  • Don't compound failures
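The circuit-breaker idea in miniature — consecutive failures trip the breaker so later runs skip the broken source rather than compounding the outage (class name and threshold are illustrative):

```python
class CircuitBreaker:
    """Stop calling a failing source after `threshold` consecutive failures."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.threshold

    def call(self, task):
        if self.open:
            raise RuntimeError("circuit open: skipping run")
        try:
            result = task()
        except Exception:
            self.failures += 1  # another consecutive failure
            raise
        self.failures = 0  # success resets the breaker
        return result
```

Production breakers usually add a cool-down timer that half-opens the circuit to probe for recovery.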

Data contracts:

  • Formal schemas between producers + consumers
  • Breaking changes go through review
  • Schema evolution tooling (Avro or Protobuf schemas; dbt source definitions + tests on the warehouse side)

Dead letter queues:

  • Failed records captured separately
  • Investigated + replayed
  • Not lost
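A minimal sketch of the dead-letter pattern — failed records are captured with their error instead of aborting the batch or silently dropping them (hypothetical helper):

```python
def process_batch(records, transform, dead_letters):
    """Transform a batch; records that fail go to a dead-letter list
    for later inspection + replay instead of being lost."""
    loaded = []
    for record in records:
        try:
            loaded.append(transform(record))
        except Exception as exc:
            dead_letters.append({"record": record, "error": str(exc)})
    return loaded
```

The dead-letter list (a queue or table in practice) gives you the raw record plus the failure reason, so replays are targeted rather than full re-runs.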

Monitoring + Alerting

Freshness Monitoring

Monitor:

  • Last successful load time per source
  • Expected vs. actual run schedule
  • Lag from source to warehouse

Alerts:

  • Source data >2x expected lag: warning
  • Source data >4x: critical
  • Destination dashboard stale >24hr: user-visible warning
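The lag thresholds above reduce to a small check (illustrative helper; observability platforms and orchestrator SLAs express the same rule):

```python
from datetime import datetime, timezone

def freshness_alert(last_loaded, expected_interval, now=None):
    """Map lag against the expected load interval to an alert level:
    lag > 4x expected -> 'critical', lag > 2x -> 'warning', else 'ok'."""
    now = now or datetime.now(timezone.utc)
    lag = now - last_loaded
    if lag > 4 * expected_interval:
        return "critical"
    if lag > 2 * expected_interval:
        return "warning"
    return "ok"
```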
Completeness Monitoring

Monitor:

  • Expected vs. actual row counts
  • Min/max values
  • Null rate changes
  • Unique count changes

dbt tests: run on every transformation

Anomaly detection:

  • Sudden 50%+ drop in expected rows
  • Schema changes (unexpected columns)
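The row-count drop rule as a sketch (the 50% threshold and names are illustrative; the baseline would come from recent load history):

```python
def row_count_anomaly(expected, actual, drop_threshold=0.5):
    """Flag a load whose row count dropped more than `drop_threshold`
    (default 50%) below the expected count."""
    if expected <= 0:
        return False  # no baseline yet; nothing to compare against
    return actual < expected * (1 - drop_threshold)
```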
Quality Monitoring

Automated data quality tests:

  • Foreign key validity
  • Business logic (e.g., revenue always positive)
  • Cross-table consistency
  • Expected distributions

Monitoring platform options:

  • Monte Carlo ($2-5K/month for your scale)
  • Datafold (similar)
  • Custom Slack alerts (free, more DIY)

Latency Monitoring

Track:

  • End-to-end latency (source → warehouse)
  • Per-stage latency
  • Recovery time from failures

SLAs:

  • Daily data: available by 8am ET
  • Hourly data: within 2 hours
  • Recovery from failure: 4 hours

Implementation Plan (3-Month Migration)

Month 1: Foundation

Week 1-2:

  • Fivetran setup + connectors configured
  • Initial data loaded to Snowflake raw layer
  • Connection testing
  • Documentation of current pipelines

Week 3-4:

  • dbt project initialized
  • Staging models for each source
  • First quality tests
  • Team training on dbt

Month 2: Transformation

Week 5-6:

  • Dimensional modeling (facts + dimensions)
  • Core business metrics defined as models
  • Documentation in dbt

Week 7-8:

  • Marts for reporting
  • Metabase connected to new marts
  • Parallel running (old + new) for validation

Month 3: Migration + Stabilization

Week 9-10:

  • Old pipelines decommissioned gradually
  • Monitoring + alerting in place
  • Runbooks documented

Week 11-12:

  • Performance optimization
  • Cost optimization (auto-suspend, warehouse sizing)
  • Team training on runbooks
  • Post-migration retrospective

Cost Estimation

Annual Tool Costs:

| Tool | Monthly | Annual |
| --- | --- | --- |
| Fivetran | $2,500 | $30,000 |
| dbt Cloud | $200 | $2,400 |
| Snowflake | $5,000 | $60,000 (separate budget) |
| Monte Carlo (optional) | $3,000 | $36,000 |
| Total tools (excluding warehouse) | ~$5,700 | ~$68,400 |

Within your $50K tool budget if skipping Monte Carlo initially.

Alternative: reduce Fivetran cost by using Airbyte (self-hosted):

  • Airbyte self-hosted: minimal infrastructure cost (~$200/month)
  • BUT: more engineering time for setup + maintenance
  • Tradeoff: save $25K/year in tool cost, cost ~$15K in engineering time

Key Takeaways

  • Modern ELT stack: Fivetran (ingestion) + dbt (transformation) + Snowflake (warehouse) + scheduler (dbt Cloud sufficient for your scale). $50-68K/year tool cost fits your budget.
  • Batch-first architecture. Daily + hourly refreshes cover your freshness needs. Add streaming only when customer-facing real-time requires — separate pipeline.
  • Idempotent design with dbt incremental models + watermarks. Pipelines safely re-runnable. Schema contracts between sources + warehouse prevent drift pain.
  • Monitoring 4 dimensions: freshness, completeness, quality, latency. dbt tests + Slack alerts sufficient for MVP. Monte Carlo upgrade ($36K/year) when complexity grows.
  • 3-month migration: Month 1 foundation, Month 2 transformation, Month 3 migration + stabilization. Parallel run validates + prevents data loss. Documentation + runbooks critical for ongoing operations.

Common use cases

  • Data teams building first data warehouse
  • Companies modernizing legacy ETL
  • Startups choosing initial data stack
  • Scaling data infrastructure
  • Multi-source data consolidation

Best AI model for this

Claude Opus 4 or Sonnet 4.5. Pipeline design requires data engineering + systems + business understanding. Top-tier reasoning matters.

Pro tips

  • Modern data stack: ELT > ETL. Transform in warehouse (Snowflake/BigQuery), not in-flight.
  • Idempotency critical: pipelines should be safely re-runnable.
  • Schema contracts between producers + consumers prevent drift pain.
  • Monitor: data freshness, completeness, quality, latency.
  • Tool selection: dbt for transformation, Airflow/Prefect for orchestration, Fivetran for managed connectors.
  • Start with batch, add streaming only when needed. Streaming is 10x complexity.
  • Data quality is everyone's problem. Automated checks + alerts.
  • Document lineage + dependencies.

Customization tips

  • Invest in dbt documentation. Auto-generated docs + lineage are highest-ROI documentation you'll produce.
  • Data pipelines are ongoing investment, not build-once. Budget 30%+ of team time for maintenance + improvement.
  • Data contracts between upstream systems (Postgres app) + downstream (analytics) prevent schema surprises. Worth formalizing.
  • For GDPR/compliance: anonymize PII in warehouse where possible. Separate PII-containing tables with restricted access.
  • Monte Carlo + Datafold are worth investment AFTER foundation solid. Early stage: dbt tests + custom alerts sufficient.

Variants

Modern Stack Build

Starting from scratch with modern tools.

Legacy Migration

Modernizing existing ETL.

Streaming Addition

Adding real-time to batch.

Multi-Source Consolidation

Joining many sources.

Frequently asked questions

How do I use the Data Pipeline Architect — ETL/ELT Design For 2026 prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with Data Pipeline Architect — ETL/ELT Design For 2026?

Claude Opus 4 or Sonnet 4.5. Pipeline design requires data engineering + systems + business understanding. Top-tier reasoning matters.

Can I customize the Data Pipeline Architect — ETL/ELT Design For 2026 prompt for my use case?

Yes — every Promptolis Original is designed to be customized. Key levers: Modern data stack: ELT > ETL. Transform in warehouse (Snowflake/BigQuery), not in-flight.; Idempotency critical: pipelines should be safely re-runnable.

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.

← All Promptolis Originals