⚡ Promptolis Original · Data & Analytics

🔍 Data Quality Audit — Find + Fix Data Issues Before They Break Decisions

A structured data quality audit covering the 6 quality dimensions (completeness / accuracy / consistency / timeliness / validity / uniqueness), a systematic testing framework, monitoring automation, and the 'fix upstream, not downstream' discipline.

⏱️ 2-4 weeks initial audit + ongoing 🤖 ~2 min in Claude 🗓️ Updated 2026-04-20

Why this is epic

Many companies make decisions on data where 20-40% of records have quality issues. This Original produces a systematic audit + fix framework + ongoing monitoring.

Names the 6 dimensions of data quality + common issues per dimension.

Produces complete quality program with monitoring automation.

The prompt

Promptolis Original · Copy-ready
<role> You are a data quality + engineering specialist with 12 years of experience. You've audited data quality at 50+ companies + installed ongoing quality programs. You are direct. You will name when quality issues aren't being addressed upstream, when monitoring gaps exist, and when business impact isn't quantified. </role> <principles> 1. 6 quality dimensions framework. 2. Fix upstream, not downstream. 3. Automated monitoring essential. 4. Business impact prioritization. 5. Cross-functional ownership. 6. Document institutional knowledge. 7. Regression testing for pipelines. 8. Quality SLAs per dataset. </principles> <input> <current-state>{known quality issues}</current-state> <data-sources>{what's in scope}</data-sources> <critical-datasets>{which data matters most}</critical-datasets> <current-monitoring>{what exists}</current-monitoring> <team-structure>{who owns data quality}</team-structure> <business-impact>{examples of bad decisions from bad data}</business-impact> <timeline>{urgency}</timeline> <budget>{for tools + improvements}</budget> </input> <output-format> # Data Quality Audit: [Scope] ## 6-Dimension Assessment ## Priority Issues (Business Impact) ## Root Cause Analysis ## Upstream Fixes ## Monitoring Automation ## Testing Framework ## Governance Structure ## Implementation Plan ## Key Takeaways 5 bullets. </output-format> <auto-intake> If input incomplete: ask for current state, sources, critical datasets, monitoring, team, impact, timeline, budget. </auto-intake> Now, audit:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<current-state>Known issues: (1) 15-20% of customer records have missing fields, (2) duplicate accounts in some systems, (3) inconsistent date formats across sources, (4) 48-hour lag on some Salesforce syncs, (5) product events occasionally lost.</current-state>
<data-sources>Salesforce (CRM), Postgres (app), Stripe (billing), Mixpanel (events), HubSpot (marketing).</data-sources>
<critical-datasets>Customer master data, revenue data, pipeline data, product usage.</critical-datasets>
<current-monitoring>Basic dbt tests. Informal ad-hoc checks. No systematic monitoring.</current-monitoring>
<team-structure>2 data engineers, 1 analytics engineer, no dedicated data quality role.</team-structure>
<business-impact>Q1 forecast missed by 8% due to duplicate customers counted. Marketing spent $200K on wrong-segment campaigns. Support had customer complaints about incorrect account info.</business-impact>
<timeline>Critical — forecast issues impacting board credibility.</timeline>
<budget>$50K for tools + initial improvements.</budget>
🤖 Output

Data Quality Audit: Multi-Source SaaS, Known Issues, Critical Business Impact

A data quality audit produces a systematic assessment + fix program. Your situation, with documented business impact (forecast miss, marketing waste), demands immediate action. This audit covers the 6-dimension assessment, priority fixes, upstream fixes, monitoring automation, and governance.

6-Dimension Assessment

1. Completeness

Issue: 15-20% of customer records missing fields.

Impact: segmentation broken, marketing wasted.

Score: 2/5 (major issue)

2. Accuracy

Issue: inconsistencies across sources. A Salesforce account might show different revenue than Stripe.

Impact: trust erosion, wrong decisions.

Score: 3/5 (moderate issue)

3. Consistency

Issue: date formats, enum values, conventions differ across sources.

Impact: broken joins, wrong aggregations.

Score: 2/5 (major issue)

4. Timeliness

Issue: 48-hour Salesforce sync lag, intermittent event loss.

Impact: stale dashboards, missed deals.

Score: 3/5 (moderate)

5. Validity

Issue: some records hold values invalid for their declared types (e.g., 'N/A' in numeric fields).

Impact: analysis errors, system failures.

Score: 3/5 (moderate)

6. Uniqueness

Issue: duplicate customers across systems. Same company appearing multiple times.

Impact: inflated counts, wrong forecasting.

Score: 1/5 (critical — caused the 8% forecast miss)

Priority Issues (Business Impact)

Priority 1: Customer Deduplication (CRITICAL)

Issue: duplicate customers caused 8% forecast miss.

Root cause: Salesforce + Stripe + HubSpot don't match on a consistent ID.

Impact: forecast accuracy, revenue recognition, customer count.

Priority: IMMEDIATE.

Priority 2: Completeness of Customer Master Data

Issue: 15-20% missing fields → bad segmentation.

Root cause: sales doesn't fill required fields; no validation at entry.

Impact: marketing targeting, sales analytics.

Priority: HIGH.

Priority 3: Source-to-Warehouse Timeliness

Issue: Salesforce lag, event loss.

Root cause: ETL process issues, error handling gaps.

Impact: stale dashboards, lost signals.

Priority: HIGH.

Priority 4: Consistency Across Sources

Issue: formats + enums + conventions differ.

Root cause: no central data contract.

Impact: broken joins, wrong aggregations.

Priority: MEDIUM.

Root Cause Analysis

For each priority issue — WHY is it happening?

Priority 1 (Duplicates):
  • No unified customer identity across systems
  • Manual account creation in Salesforce
  • HubSpot + Salesforce not synced on company matching
  • No automated deduplication

Fix approach: master data management (MDM) strategy + automated matching rules.

Priority 2 (Completeness):
  • Sales reps skip required fields ('just want to close')
  • No validation at Salesforce entry
  • No enforcement in app data collection

Fix approach: required-field enforcement + entry validation + automated reminders.

Priority 3 (Timeliness):
  • Salesforce batch sync runs on a schedule (not real-time) and failed runs go unnoticed, producing the observed 48-hour lag
  • Events lost during Mixpanel client-side errors
  • No retry logic for failed events

Fix approach: faster syncs + retry logic + event reliability.

Priority 4 (Consistency):
  • Each source has own conventions
  • No canonical data model
  • dbt layers reference inconsistent types

Fix approach: canonical model in dbt with mapping layers.

Upstream Fixes (Address Source)

Fix 1: Customer Identity Unification

Build unified customer identity:

  • Reverse ETL to sync Salesforce-Stripe-HubSpot
  • Master ID generated from company domain + name matching
  • Propagated to all downstream tables
  • Automated deduplication rules (a minimal matching sketch follows below)

Cost: $10K (1 week of engineering time).
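
To make the matching rule concrete, here is a minimal Python sketch of domain + name matching. It assumes records have already been extracted into dicts; the field names (`domain`, `email`, `company_name`) and the legal-suffix list are illustrative, not a definitive implementation.

```python
import re
from collections import defaultdict

def normalize_domain(value: str) -> str:
    """Reduce an email address or URL to a bare company domain."""
    d = value.lower().strip()
    d = re.sub(r"^https?://(www\.)?", "", d)  # strip URL scheme and www
    return d.split("@")[-1].split("/")[0]     # keep only the host part

def normalize_name(name: str) -> str:
    """Strip punctuation and common legal suffixes before comparing names."""
    n = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return re.sub(r"\b(inc|llc|ltd|gmbh|corp)\b", "", n).strip()

def master_id(record: dict) -> str:
    """Deterministic master key: company domain when present, else normalized name."""
    domain = normalize_domain(record.get("domain") or record.get("email") or "")
    return domain or normalize_name(record["company_name"])

def dedupe(records: list[dict]) -> dict[str, list[dict]]:
    """Group Salesforce/Stripe/HubSpot records under one master ID."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        groups[master_id(record)].append(record)
    return dict(groups)
```

Exact-domain matches can merge automatically; low-confidence name-only matches should route to a review queue rather than auto-merge.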

Fix 2: Required Field Enforcement

Salesforce:

  • Required fields for new records
  • Validation rules
  • Admin notifications for incomplete records
  • Data quality dashboard for sales team

Application data entry:

  • Form validation (sketched below)
  • Tooltip guidance
  • Review queues for edge cases

Cost: $5K (Salesforce admin + engineering).
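
A minimal sketch of app-side entry validation, assuming a dict payload; the required-field set and error wording are hypothetical. The Salesforce side would use declarative validation rules rather than code.

```python
# Hypothetical required-field check for application-side account creation.
REQUIRED_FIELDS = {"company_name", "domain", "segment", "country"}  # illustrative

def validate_account(payload: dict) -> list[str]:
    """Return validation errors; an empty list means the record may be saved."""
    errors = [f"missing required field: {field}"
              for field in sorted(REQUIRED_FIELDS)
              if not payload.get(field)]
    # Cheap shape check so 'N/A'-style junk is rejected at entry.
    if payload.get("domain") and "." not in payload["domain"]:
        errors.append("domain does not look like a hostname")
    return errors

# Reject incomplete records at entry instead of patching them downstream.
assert validate_account({"company_name": "Acme", "segment": "SMB"}) != []
```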

Fix 3: Pipeline Reliability

Salesforce sync:

  • Increase to every 30 minutes
  • Retry logic on failures
  • Alerting on sync errors

Mixpanel events:

  • Client-side retry queue (sketched below)
  • Server-side event capture as backup
  • Event loss monitoring

Cost: $15K (engineering time + tool upgrades).
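
A sketch of the client-side retry idea, assuming `send` is any delivery callable that raises on failure (for example, an HTTP post to a server-side collector); names and limits are assumptions.

```python
import random
import time

def send_with_retry(send, event: dict, max_attempts: int = 5) -> bool:
    """Deliver one event, backing off exponentially with jitter between tries."""
    for attempt in range(max_attempts):
        try:
            send(event)
            return True
        except Exception:
            # Back off 1s, 2s, 4s, ... plus jitter so retries don't stampede.
            time.sleep(2 ** attempt + random.random())
    return False  # hand off to a dead-letter queue / event-loss monitor
```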

Fix 4: Canonical Data Model

dbt model restructure:

  • raw_ layer: as-is from sources
  • staging_ layer: cleaned + standardized
  • marts_ layer: business-ready

Standards:

  • Consistent column naming
  • Consistent types
  • Enum value normalization (sketched below)
  • Time zone standards

Cost: $10K (analytics engineering time).
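
In this stack the canonical model would live in dbt SQL; the normalization logic itself is simple, as in this Python sketch (the plan values and date formats are assumed for illustration):

```python
from datetime import datetime, timezone

# Illustrative enum and timestamp normalization for the staging_ layer.
PLAN_MAP = {  # assumed raw enum values
    "ent": "enterprise", "enterprise": "enterprise",
    "pro": "pro", "professional": "pro",
    "free": "free", "trial": "free",
}

def normalize_plan(raw: str) -> str:
    """Map source-specific plan labels onto one canonical enum."""
    return PLAN_MAP.get(raw.strip().lower(), "unknown")

def normalize_ts(raw: str) -> datetime:
    """Accept the date formats seen across sources; always emit UTC."""
    for fmt in ("%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%d", "%m/%d/%Y"):
        try:
            parsed = datetime.strptime(raw, fmt)
            break
        except ValueError:
            continue
    else:
        raise ValueError(f"unparseable timestamp: {raw!r}")
    if parsed.tzinfo is None:  # naive dates are assumed to be UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)
```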

Total upstream fixes: $40K — within $50K budget.

Monitoring Automation

Automated dbt Tests

Per model, test (equivalent checks sketched below):

  • not_null on critical fields
  • unique on ID fields
  • accepted_values on enum columns
  • relationships between foreign keys
  • recency (data freshness)

Custom tests:

  • Revenue consistency (Stripe vs. Salesforce should match)
  • Customer count freshness
  • Expected volume ranges
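
In practice these generic tests are declared in dbt YAML and the custom tests in SQL; the sketch below shows equivalent assertions run directly against the warehouse through a DB-API cursor. Table, column, and threshold names are assumptions.

```python
# Each query counts violating rows; a passing check returns zero.
CHECKS = [
    ("not_null customer_id",
     "SELECT COUNT(*) FROM marts.customers WHERE customer_id IS NULL"),
    ("unique customer_id",
     "SELECT COUNT(*) FROM (SELECT customer_id FROM marts.customers"
     " GROUP BY customer_id HAVING COUNT(*) > 1) dupes"),
    ("revenue consistency (Stripe vs. Salesforce)",
     "SELECT COUNT(*) FROM marts.revenue_by_account"
     " WHERE ABS(stripe_mrr - salesforce_mrr) > 1"),
]

def run_checks(cursor) -> list[str]:
    """Run every check and collect failures for alerting."""
    failures = []
    for name, sql in CHECKS:
        cursor.execute(sql)
        bad_rows = cursor.fetchone()[0]
        if bad_rows:
            failures.append(f"{name}: {bad_rows} violating rows")
    return failures
```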

Runtime Monitoring

Alerts in Slack (see the sketch after this list):

  • Pipeline failure: immediate
  • Data freshness >2x expected: within 1 hour
  • Volume anomaly (>20% change): within 4 hours
  • Test failure: within 1 hour

Dashboard:

  • Data quality scorecard (per dimension per dataset)
  • Trending over time
  • Incident history
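
A minimal sketch of the volume-anomaly alert, assuming a Slack incoming webhook; the webhook URL, threshold, and metric plumbing are yours to supply.

```python
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # your incoming webhook

def alert(message: str) -> None:
    """Post a data-quality alert to a Slack channel via an incoming webhook."""
    request = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps({"text": message}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

def check_volume(table: str, today: int, trailing_avg: float) -> None:
    """Fire when today's row count deviates more than 20% from the trailing average."""
    if trailing_avg and abs(today - trailing_avg) / trailing_avg > 0.20:
        alert(f":warning: {table}: volume {today} vs. trailing avg {trailing_avg:.0f}")
```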

Specific Critical Checks

Daily:

  • Customer count match (Salesforce = Stripe = warehouse; reconciliation sketched below)
  • Revenue reconciliation (Stripe vs. warehouse)
  • Pipeline totals (sum of opportunities = dashboard total)

Weekly:

  • Segment counts consistency
  • Historical comparison (expected growth patterns)
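
A sketch of the daily count reconciliation; the three counts are stand-ins for the API and warehouse queries you already run, and the 1% tolerance is an assumption.

```python
def reconcile_customer_counts(salesforce: int, stripe: int, warehouse: int,
                              tolerance: float = 0.01) -> list[str]:
    """Flag any source whose customer count trails the max by more than 1%.

    A persistent gap between systems is a cheap duplicate/missing-record signal.
    """
    counts = {"salesforce": salesforce, "stripe": stripe, "warehouse": warehouse}
    base = max(counts.values())
    return [f"{name} count {n} trails max {base}"
            for name, n in counts.items()
            if base and (base - n) / base > tolerance]
```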

Testing Framework

Pre-Production Testing

Before production data pipeline changes:

  • Unit tests on transformations
  • Integration tests on data pipelines
  • Regression tests against known-good data (sketched below)
  • Schema validation
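
A sketch of a regression test against known-good data, assuming the snapshot was built from a fixed input window; the path and row shape are illustrative.

```python
import json
import pathlib

SNAPSHOT = pathlib.Path("tests/snapshots/revenue_by_month.json")  # known-good output

def test_revenue_regression(current_rows: list[dict]) -> None:
    """Fail the deploy if the changed model alters historical output.

    `current_rows` should come from running the modified transformation
    over the same fixed input window the snapshot was built from.
    """
    expected = json.loads(SNAPSHOT.read_text())
    assert current_rows == expected, "transformation changed known-good output"
```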

Production Testing

Continuous validation:

  • Runtime assertions
  • Anomaly detection
  • Expected-value checks

Historical Testing

Periodically:

  • Compare historical snapshots to current
  • Identify drift
  • Investigate divergences

Governance Structure

Data Quality Ownership

Dedicated role (new):

  • Data quality owner (analytics engineering lead)
  • Cross-functional coordination
  • Program management

Distributed ownership:

  • Each pipeline has a primary engineer owner
  • Each dataset has a business owner (PM, CS, etc.)
  • Monthly quality review

SLAs Per Dataset

Critical datasets (customer master, revenue, pipeline):

  • Freshness: within 1 hour of source
  • Completeness: 98%+ for critical fields
  • Accuracy: validated against source
  • Uniqueness: 99%+ (< 1% duplicates)

Standard datasets:

  • Freshness: within 24 hours
  • Completeness: 95%+ for critical fields
  • Accuracy: spot-checked
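
One way to make these SLAs enforceable is to encode them as config that the monitors read, as in this sketch; the dataset keys and metric plumbing are assumptions.

```python
# Thresholds mirror the SLAs above; dataset names are illustrative.
SLAS = {
    "customer_master": {"freshness_hours": 1, "completeness": 0.98, "uniqueness": 0.99},
    "revenue":         {"freshness_hours": 1, "completeness": 0.98, "uniqueness": 0.99},
    "marketing_touch": {"freshness_hours": 24, "completeness": 0.95},
}

def sla_breaches(dataset: str, observed: dict) -> list[str]:
    """Compare observed metrics to the dataset's SLA and return breach messages."""
    sla = SLAS[dataset]
    breaches = []
    if observed["freshness_hours"] > sla["freshness_hours"]:
        breaches.append(f"freshness {observed['freshness_hours']}h exceeds "
                        f"{sla['freshness_hours']}h SLA")
    for dim in ("completeness", "uniqueness"):
        if dim in sla and observed.get(dim, 1.0) < sla[dim]:
            breaches.append(f"{dim} {observed[dim]:.1%} below {sla[dim]:.0%} SLA")
    return breaches
```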

Incident Response

Data quality incident:

1. Alert fires (via Slack/PagerDuty)

2. On-call engineer investigates

3. Root cause determined

4. Fix implemented

5. Post-mortem within 48 hours

6. Prevention + monitoring added

Implementation Plan

Week 1-2: Foundation
  • Tool selection + setup
  • Critical dataset inventory
  • Initial dbt tests deployed
  • Incident protocol defined

Week 3-6: Upstream Fixes
  • Customer identity unification
  • Required field enforcement
  • Pipeline reliability improvements
  • Canonical data model rollout

Week 7-8: Monitoring Automation
  • Full test suite deployed
  • Alerts configured
  • Quality dashboard live
  • Documentation complete

Week 9-10: Stabilize + Refine
  • Tune alerting (reduce false positives)
  • Add additional checks based on incidents
  • Quarterly review cadence established
  • Team training complete

Ongoing
  • Weekly quality review
  • Monthly incident retrospective
  • Quarterly strategic review
  • Annual audit of framework

Key Takeaways

  • 6-dimension assessment reveals 4 priority issues: duplicate customers (CRITICAL), completeness (HIGH), timeliness (HIGH), consistency (MEDIUM). Business impact documented ($200K marketing waste + 8% forecast miss).
  • Fix upstream, not downstream. Customer identity unification + field enforcement + pipeline reliability + canonical model. $40K investment addresses root causes.
  • Monitoring automation via dbt tests + Slack alerts + quality dashboard. Automated, not manual. Covers 6 dimensions continuously.
  • Governance: data quality owner role + per-pipeline engineer owner + per-dataset business owner. SLAs per dataset. Monthly review cadence.
  • 10-week implementation: foundation + upstream fixes + monitoring + stabilize. Within $50K budget. Ongoing quality program thereafter.

Common use cases

  • Data teams establishing quality program
  • Post-incident quality investigation
  • Pre-migration data assessment
  • Analytics quality improvement
  • Compliance data audits

Best AI model for this

Claude Opus 4 or Sonnet 4.5. Data quality requires systematic thinking + engineering + business context. Top-tier reasoning matters.

Pro tips

  • 6 quality dimensions: completeness, accuracy, consistency, timeliness, validity, uniqueness.
  • Fix upstream, not downstream. Address source, not symptoms.
  • Monitor automatically — humans can't watch 100+ tables.
  • Prioritize by business impact, not volume.
  • Data quality is cross-functional. Engineering + analytics + business.
  • Document known issues + fixes. Institutional knowledge.
  • Regression testing for data pipelines.
  • Data quality SLAs per dataset.

Customization tips

  • Data quality is not 'set + forget.' Ongoing program with dedicated ownership.
  • Monitor business-impact metrics, not just technical metrics. 'Duplicate customer count' matters more than '95th percentile pipeline latency.'
  • When incidents happen, post-mortem within 48 hours + prevention added. Learning culture.
  • Sales + product + marketing care about quality more than data teams realize. Involve them in governance.
  • Annual data quality audit. Framework evolves with business changes.

Variants

Initial Quality Audit

First comprehensive assessment.

Ongoing Quality Program

Established monitoring + improvement.

Post-Incident Audit

Investigation + prevention.

Compliance-Driven Audit

For regulatory requirements.

Frequently asked questions

How do I use the Data Quality Audit — Find + Fix Data Issues Before They Break Decisions prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with Data Quality Audit — Find + Fix Data Issues Before They Break Decisions?

Claude Opus 4 or Sonnet 4.5. Data quality requires systematic thinking + engineering + business context. Top-tier reasoning matters.

Can I customize the Data Quality Audit — Find + Fix Data Issues Before They Break Decisions prompt for my use case?

Yes — every Promptolis Original is designed to be customized. Key levers: the 6 quality dimensions (completeness, accuracy, consistency, timeliness, validity, uniqueness), and fix upstream, not downstream (address the source, not symptoms).

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.

← All Promptolis Originals