Data Quality Monitoring with AI: 5 Checks That Catch 90% of Breakage

6/14/2026 · 9 min read

TL;DR

  • SMB data pipelines fail silently — most outages are detected by humans noticing weird dashboard numbers, days or weeks after the fact.
  • Five generic checks (volume, freshness, schema, distribution, uniqueness) catch the vast majority of pipeline breakage at SMB scale.
  • AI is excellent at suggesting first-draft thresholds and explaining anomalies in plain language — it should not own detection.

A COO of a 200-person logistics company once called me about a pricing decision that turned out to rest on a join that had silently double-counted orders for six weeks. I asked her what data quality monitoring they ran. Her answer was honest: "We assumed the warehouse was right." In my experience, that assumption is the most expensive data mistake an SMB can make.

Why do SMB data pipelines fail silently?

Because no one is paid to watch them. A pipeline is built, ships data to the dashboard, and then runs unattended. When upstream changes — a vendor renames a field, a webhook drops a column, an API rate-limits — the pipeline doesn't crash. It just delivers wrong numbers, and the dashboard quietly lies.

Definition: Silent failure — a pipeline outcome where downstream consumers receive plausible-looking but materially incorrect data, without any error surfaced to the operator.

Silent failures are the most expensive kind because they erode trust slowly. By the time someone notices, weeks of decisions have referenced wrong numbers. Recovery is not "rerun the pipeline" — it's "audit every decision since the last known-good day."

What are the five checks every SMB pipeline needs?

These five generic checks, run after every pipeline load, catch roughly 90% of breakage at SMB scale. None requires a specialist; all five can be expressed as SQL.

1. Volume check

Is today's row count within reasonable bounds versus the trailing baseline? Implementation: compare the row count of today's load with the median row count for the same day of the week (DoW) over the previous four weeks, and alert if it falls outside ±40% of that median.

The volume check catches the largest category of issues — a feed half-arrived, a webhook dropped, a backfill ran twice. It's the cheapest check to write and the most useful to run.
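The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `volume_check` and its inputs are hypothetical, and it assumes you have already fetched the row counts for the same weekday over the previous four weeks from your warehouse.

```python
from statistics import median

def volume_check(today_count, same_weekday_counts, tolerance=0.40):
    """Return (ok, baseline) for today's row count vs a DoW-matched median.

    same_weekday_counts: row counts for the same weekday over the previous
    four weeks (hypothetical input; fetching them is warehouse-specific).
    """
    baseline = median(same_weekday_counts)
    if baseline == 0:
        # No usable baseline: only pass if today's load is also empty.
        return today_count == 0, baseline
    deviation = abs(today_count - baseline) / baseline
    return deviation <= tolerance, baseline

# Example: a feed that half-arrived on a day that normally sees ~10,000 rows.
ok, baseline = volume_check(5_200, [10_100, 9_800, 10_400, 9_900])
print(ok, baseline)  # False 10000.0
```

Using the median rather than the mean keeps one unusual week (a holiday, a backfill) from dragging the baseline around.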

2. Freshness check

Is the latest row recent enough? Implementation: max timestamp in the table vs the current time, alert if older than the expected SLA (e.g. 2 hours for hourly feeds, 26 hours for daily feeds).

Definition: Freshness SLA — the maximum acceptable lag between an event occurring in the source system and that event being queryable in the warehouse.

The freshness check catches stuck pipelines that don't error but stop delivering new data. It is shockingly under-implemented; teams assume that "no error" means "data is current," but a job can exit cleanly while loading nothing new.
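A minimal sketch of the freshness comparison, assuming the newest timestamp in the table has already been read (for example with a `MAX(...)` query) and is timezone-aware. The function name `freshness_check` and the `now` parameter are illustrative choices, not part of any particular tool.

```python
from datetime import datetime, timedelta, timezone

def freshness_check(latest_row_ts, sla_hours, now=None):
    """Return (ok, lag), where lag is how far the newest row trails `now`.

    latest_row_ts must be timezone-aware; `now` is injectable for testing.
    """
    now = now or datetime.now(timezone.utc)
    lag = now - latest_row_ts
    return lag <= timedelta(hours=sla_hours), lag

# Example: an hourly feed (2-hour SLA) whose newest row is 3.5 hours old.
now = datetime(2026, 6, 14, 12, 0, tzinfo=timezone.utc)
latest = datetime(2026, 6, 14, 8, 30, tzinfo=timezone.utc)
ok, lag = freshness_check(latest, sla_hours=2, now=now)
print(ok, lag)  # False 3:30:00
```

Note that the maximum timestamp only approximates the lag between source and warehouse; if the source itself is quiet, a stale maximum can be legitimate.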

3. Schema check

Did the columns and types change versus yesterday? Implementation: snapshot the table schema after each load, diff against yesterday's snapshot, alert on any column added, dropped, or retyped.

Schema drift is the silent killer of dashboards. A vendor renames client_name to customer_name and your join silently nulls out. Schema checks catch this on day one, not on the next quarterly review.
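The snapshot diff can be sketched as a comparison of two dictionaries. This assumes each snapshot is stored as a `{column_name: type_name}` mapping; how you extract that mapping (information schema, catalog API) depends on your warehouse and is not shown. The function name `schema_diff` is hypothetical.

```python
def schema_diff(previous, current):
    """Compare two {column_name: type_name} snapshots of the same table."""
    added = sorted(set(current) - set(previous))
    dropped = sorted(set(previous) - set(current))
    retyped = sorted(
        col for col in set(previous) & set(current)
        if previous[col] != current[col]
    )
    return {"added": added, "dropped": dropped, "retyped": retyped}

# Example: a vendor renames client_name and amount arrives as text.
yesterday = {"order_id": "INTEGER", "client_name": "TEXT", "amount": "NUMERIC"}
today = {"order_id": "INTEGER", "customer_name": "TEXT", "amount": "TEXT"}

diff = schema_diff(yesterday, today)
drifted = any(diff.values())  # True if anything was added, dropped, or retyped
print(diff)
```

A rename shows up as one dropped column plus one added column, which is usually enough of a hint for a human to spot what happened.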

4. Distribution check

Has the shape of the data shifted? Implementation: track the distribution of a few key categorical and numeric fields (categories: counts per value; numeric: median and IQR) day over day. Alert on shifts beyond a threshold.

This is the check that catches "vendor changed the field semantics without telling us" — when status used to mean one thing and now means another. The volume check still passes; the freshness check still passes; the data is just wrong in a new way.
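One way to sketch the two kinds of distribution tracking, under stated assumptions: field values are already loaded into non-empty Python lists, numeric fields have at least two values, and the helper names (`share_shift_pp`, `numeric_summary`, `within`) are hypothetical. The 10-percentage-point threshold in the example is an illustrative choice.

```python
from collections import Counter
from statistics import median, quantiles

def share_shift_pp(baseline_values, today_values):
    """Per-category change in share, in percentage points (today minus baseline).

    Both inputs must be non-empty lists of category values.
    """
    base, cur = Counter(baseline_values), Counter(today_values)
    base_n, cur_n = len(baseline_values), len(today_values)
    return {
        cat: 100 * (cur[cat] / cur_n - base[cat] / base_n)
        for cat in set(base) | set(cur)
    }

def numeric_summary(values):
    """Median and interquartile range (IQR) of a numeric field."""
    q1, _, q3 = quantiles(values, n=4)
    return median(values), q3 - q1

def within(value, baseline, tolerance):
    """True if value is within ±tolerance (relative) of the baseline."""
    return abs(value - baseline) <= tolerance * abs(baseline)

# Example: enterprise share falls from 18% to 4% of rows.
baseline_status = ["enterprise"] * 18 + ["smb"] * 82
today_status = ["enterprise"] * 4 + ["smb"] * 96
shifts = share_shift_pp(baseline_status, today_status)
flagged = sorted(cat for cat, pp in shifts.items() if abs(pp) > 10)
print(flagged)  # ['enterprise', 'smb']
```

For numeric fields, `numeric_summary` can be run on the baseline window and on today's load, with `within(today_median, baseline_median, 0.20)` as the comparison.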

5. Uniqueness check

Are the primary-key fields still unique? Implementation: count rows vs count of distinct primary keys, alert if they diverge.

Duplicates are behind most silent double-counting bugs. They sneak in when upstream systems retry, when joins go many-to-many unintentionally, or when a backfill races with a live load. The uniqueness check is the cheapest insurance against the most expensive class of bug.
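The count comparison is a single SQL query. The sketch below runs it against an in-memory SQLite database purely for illustration; the table `orders`, the column `order_id`, and the function name `uniqueness_check` are assumptions, and the same query shape should carry over to most warehouses.

```python
import sqlite3

def uniqueness_check(conn, table, key_column):
    """Return (ok, surplus_rows) by comparing COUNT(*) with COUNT(DISTINCT key).

    table and key_column are interpolated into the SQL, so they must come
    from trusted configuration, never from user input. COUNT(DISTINCT ...)
    ignores NULLs, so NULL keys also show up as a divergence.
    """
    total, distinct = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT {key_column}) FROM {table}"
    ).fetchone()
    return total == distinct, total - distinct

# Example: an upstream retry delivered order 2 twice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, 25.0), (2, 25.0), (3, 7.5)],
)
ok, surplus = uniqueness_check(conn, "orders", "order_id")
print(ok, surplus)  # False 1
```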

Copy/paste check definition

Pipeline: [name]
Load frequency: [hourly / daily]
Owner: [team / individual]

Checks (run after every load):
1. Volume: today's row count within ±40% of 4-week DoW-matched median?
2. Freshness: max(timestamp) within SLA of [N hours]?
3. Schema: columns and types unchanged versus yesterday's snapshot?
4. Distribution:
   - [field A]: top-5 value share within ±10pp of baseline?
   - [field B]: median + IQR within ±20% of baseline?
5. Uniqueness: count(*) = count(distinct primary_key)?

Failure routing:
- Volume / Freshness / Schema fail → page pipeline owner
- Distribution / Uniqueness fail → notify pipeline owner + #data-quality channel

The routing split matters. Volume, freshness, and schema failures are usually upstream issues — page someone. Distribution and uniqueness failures are often semantic — they need a human conversation, not a 3am page.
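The routing split can be encoded as a small lookup so that it is applied consistently after every load. This is a sketch only: the check names mirror the template above, while the function name `route_failures` and the keys of the returned dictionary are hypothetical, and the actual paging or chat integration is left out.

```python
# Checks that usually indicate an upstream issue: page the pipeline owner.
PAGE_CHECKS = {"volume", "freshness", "schema"}
# Checks that are often semantic: notify the owner and the shared channel.
NOTIFY_CHECKS = {"distribution", "uniqueness"}

def route_failures(failed_checks):
    """Split failed check names into those that page and those that notify."""
    routed = {"page_owner": [], "notify_owner_and_channel": []}
    for check in failed_checks:
        if check in PAGE_CHECKS:
            routed["page_owner"].append(check)
        elif check in NOTIFY_CHECKS:
            routed["notify_owner_and_channel"].append(check)
        else:
            # Fail loudly so a misspelled check name is not silently dropped.
            raise ValueError(f"unknown check: {check!r}")
    return routed

print(route_failures(["freshness", "uniqueness"]))
```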

Where does AI fit in?

In three specific places, none of which is owning detection.

  1. Threshold suggestion. Give an AI six weeks of a metric and ask for sensible volume / distribution bands. The first draft is usually within 20% of where a human would land — close enough to ship and tune.
  2. Anomaly explanation. When a check fails, feed the failure plus recent operational context (deploys, vendor announcements, holidays) to an LLM and ask "what's the most likely cause?" — this cuts triage time substantially.
  3. Plain-language alert summaries. Turn "distribution check failed on status field: enterprise share dropped from 18% to 4%" into "Enterprise tier signups appear to have dropped sharply since yesterday, likely an upstream filter change."
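For the anomaly-explanation step, the part that can be shown without committing to any particular LLM provider is how the failure and its context are assembled into a prompt. The sketch below is hypothetical throughout: the function name `build_triage_prompt`, its parameters, and the prompt wording are assumptions, and sending the text to a model is deliberately left out.

```python
def build_triage_prompt(check, table, detail, context_events):
    """Assemble a plain-text prompt asking for the most likely cause of a failure.

    context_events: short strings describing recent deploys, vendor
    announcements, holidays, and similar operational context.
    """
    context = "\n".join(f"- {event}" for event in context_events) or "- (none recorded)"
    return (
        "A data quality check failed.\n"
        f"Check: {check}\n"
        f"Table: {table}\n"
        f"Detail: {detail}\n"
        f"Recent operational context:\n{context}\n"
        "What is the most likely cause? Answer in two plain-language "
        "sentences and say how confident you are."
    )

prompt = build_triage_prompt(
    check="distribution",
    table="signups",
    detail="enterprise share dropped from 18% to 4%",
    context_events=["CRM update deployed yesterday"],
)
print(prompt)
```

Keeping detection in deterministic code and handing the model only the already-detected failure matches the division of labor described above.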

Tool tip (AIAdvisoryBoard.me): Data quality checks are most useful when failures route into the same Plan → Fact → Gap loop the team already uses for operational metrics. A failed freshness check on the revenue feed is a Gap on every metric downstream of it — and naming it as such forces accountability. Our daily-management OS does this routing automatically, so a broken pipeline is visible as a Gap in the founder's morning review, not as an alert nobody reads. The 7-day diagnostic shows which of your pipelines have the highest blast radius. See it at https://aiadvisoryboard.me/?lang=en.

Manager scan (2-minute digest example)

  • Every pipeline has the five generic checks wired before it lands in production
  • Volume / freshness / schema failures page the pipeline owner; distribution / uniqueness notify
  • Thresholds are first-drafted by AI from 6 weeks of history, then tuned monthly
  • A pipeline without an owner is paused — no orphan feeds in production
  • Schema snapshot stored daily — recovery from drift is grep, not archaeology
  • Distribution check covers at most 3-5 fields per table — broader is noise
  • Failed checks attach to the metrics downstream of that pipeline in the dashboard
  • Alert summaries are AI-rewritten into plain language, not raw SQL output
  • Monthly review: which checks fired, which were real, which were tuned away
  • Quarterly: any pipeline with zero failures all quarter is either perfect or unmonitored — verify

Micro-case (what changes after 7-14 days)

A 130-person ecommerce SMB had 14 data pipelines and zero quality monitoring beyond "did the cron exit zero." We wired the five checks to all 14, used AI to first-draft thresholds from 6 weeks of history, and set up the page/notify routing split. In the first week the system caught three issues that had been silently corrupting downstream metrics:

  • A vendor webhook had been delivering ~30% fewer events since a TLS change two weeks earlier (freshness + volume).
  • The marketing-attribution pipeline had been double-counting a campaign source for nine days because of a join change (uniqueness).
  • The customer status field had shifted distribution after a CRM update that nobody had told the data team about (distribution).

All three would have continued silently for weeks without the checks. The founder's reaction at the end of week two was telling: "We weren't running blind, we were running on lies, and we didn't know."

Note on this case: This example is illustrative — based on typical patterns we observe with companies of 30-500 employees, not a single named client. Specific numbers are rounded approximations of common ranges, not guarantees.

Tool tip (AIAdvisoryBoard.me): The reason data quality work usually stalls in SMBs is that nobody connects it to a business consequence. A failed check is filed as an "engineering issue" and the dashboard continues to mislead. Our daily-management OS surfaces data quality failures as Gap items on the affected business metrics — so "our pipeline broke" becomes "we under-counted revenue last week," and the conversation changes. Start the 7-day diagnostic at https://aiadvisoryboard.me/?lang=en.

FAQ

Don't I need a tool like Monte Carlo for this? Eventually, maybe — and only at scale. For a 30-500-person SMB, the five checks above can be implemented in SQL plus a job scheduler in a day. Buying a tool before you have the discipline buys you fancier blind spots.

What about row-level data quality (per-record validation)? That's a different layer — it belongs in the ingestion code, not the monitoring layer. The five checks here are table-level; per-row checks happen earlier. Don't conflate them.

My pipeline reloads the full table every night. Do these checks still apply? Yes — even more so. Full reloads are the most prone to silent volume drops because there's no incremental delta to inspect. Volume + uniqueness checks are your best protection.

How do I baseline a brand-new pipeline? Run it in shadow mode for 2-3 weeks before turning checks on, collect baseline numbers, then have AI first-draft thresholds from that window. Don't try to baseline from day one; you'll just chase noise.

What's the relationship between this and anomaly detection on business metrics? Data quality checks protect the metric definition; anomaly detection protects the metric value. You need both, in that order — if data quality is broken, anomaly detection on the broken metric is meaningless.

Conclusion

Data quality monitoring at SMB scale is not a data-platform purchase. It is five generic checks wired after every pipeline load, with AI helping on thresholds and plain-language summaries, and a routing discipline that separates pages from notifications.

Pick your highest-blast-radius pipeline first. Wire the five checks. Tune for two weeks. Add the next pipeline.

If you want a system that surfaces the Plan → Fact → Gap automatically — including the Gap when a pipeline lies to you — see how the 7-day diagnostic works at https://aiadvisoryboard.me/?lang=en.
