Most teams treat observability as a layer to add at the end. That framing is the problem. Trust (observability, data quality, freshness, schema drift) is a discipline that runs through every layer of the build: designed in at L1, present at L2, L3, L4, and L5. Wait until L4 and it's too late to retrofit.
Four things, always on: freshness, volume, schema, and nulls on key columns. Wired into every layer from day one — not bolted on later.
A web app breaks loudly — a 500 error, a broken page, something obvious. A data pipeline breaks silently: a connector stops updating at 3am and the dashboard just keeps showing yesterday's number. A source renames a column and every downstream metric subtly shifts. Nobody notices for a week.
Observability is the smoke detector for this. You can't watch every row; you watch four signals — is it fresh, is it the right volume, is the schema still what we expect, are the key columns not full of nulls — and those four catch the vast majority of "the numbers look weird today" incidents. Start here. Add more signals when a specific failure teaches you to.
- Freshness: every mart has a max staleness. Breach it, page someone.
- Volume: today vs. the last 30 days. Outliers mean something upstream broke.
- Schema: contract-test staging against raw. Fail the build, don't silently drop columns.
- Nulls: join keys and grain columns should never be null. Ever. (A minimal check follows this list.)
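The nulls signal is the simplest to wire up, so start there. A minimal sketch, assuming a hypothetical run_query(sql) helper that executes SQL against your warehouse and returns a list of row tuples; the table and column names are illustrative:

```python
# Null-key check: join keys and grain columns must never be null.
# run_query is a hypothetical helper that runs SQL against your
# warehouse and returns a list of row tuples.
KEY_COLUMNS = {
    "fct_orders": ["order_id", "customer_id", "order_date"],  # grain + join keys
    "dim_customer": ["customer_id"],
}

def check_null_keys(run_query) -> list[str]:
    failures = []
    for table, columns in KEY_COLUMNS.items():
        for col in columns:
            (null_count,) = run_query(
                f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL"
            )[0]
            if null_count > 0:
                failures.append(f"{table}.{col}: {null_count} null keys")
    return failures
```

The other three signals deserve a closer look.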
Every mart should have a max acceptable staleness. If dim_customer hasn't updated in 4 hours, something is wrong — even if queries still succeed. Alert on it.
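A minimal staleness check, reusing the hypothetical run_query helper and assuming each mart carries an updated_at column; the SLAs are illustrative:

```python
# Freshness check: flag any mart older than its max acceptable staleness.
# Assumes MAX(updated_at) comes back as a timezone-aware datetime.
from datetime import datetime, timedelta, timezone

SLAS = {
    "dim_customer": timedelta(hours=4),
    "fct_orders": timedelta(hours=1),
}

def check_freshness(run_query) -> list[str]:
    breaches = []
    now = datetime.now(timezone.utc)
    for table, max_staleness in SLAS.items():
        (last_update,) = run_query(f"SELECT MAX(updated_at) FROM {table}")[0]
        if last_update is None or now - last_update > max_staleness:
            breaches.append(
                f"{table} stale: last update {last_update}, SLA {max_staleness}"
            )
    return breaches
```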
If yesterday you landed 14,000 orders and today you landed 280, a source broke and nobody told you. A simple "is today within 3 standard deviations of the last 30 days?" check catches most upstream outages within one sync.
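The whole 3-sigma check is a few lines of Python. A sketch, assuming you already have the trailing 30 daily row counts in hand (in practice, pulled from warehouse metadata or your loader's logs):

```python
# Volume check: is today's row count within N standard deviations of the
# trailing 30 days? The history below is fabricated for the example.
import statistics

def volume_ok(daily_counts: list[int], today: int, sigmas: float = 3.0) -> bool:
    mean = statistics.mean(daily_counts)
    stdev = statistics.stdev(daily_counts)
    return abs(today - mean) <= sigmas * stdev

history = [14_000 + 200 * (i % 5) for i in range(30)]  # fake 30-day history
print(volume_ok(history, 280))      # False: the broken-connector morning
print(volume_ok(history, 14_350))   # True: a normal day
```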
The CRM added a new field. Great. The nightly refresh silently drops it because staging hardcoded the column list. Three weeks later someone asks about it. Contract-test your staging models against raw. Break the build on mismatch.
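A minimal contract check diffs expected columns against information_schema in both directions, so new columns surface instead of vanishing and removed columns break the build. The run_query helper and the column list are again illustrative:

```python
# Schema contract check: expected vs. actual columns, both directions.
EXPECTED = {
    ("raw_shopify", "orders"): {"id", "customer_id", "total", "created_at"},
}

def check_contracts(run_query) -> list[str]:
    problems = []
    for (schema, table), expected in EXPECTED.items():
        rows = run_query(
            "SELECT column_name FROM information_schema.columns "
            f"WHERE table_schema = '{schema}' AND table_name = '{table}'"
        )
        actual = {row[0] for row in rows}
        if missing := expected - actual:
            problems.append(f"{schema}.{table} lost columns: {sorted(missing)}")
        if added := actual - expected:
            problems.append(f"{schema}.{table} grew columns: {sorted(added)}")
    return problems
```

Raise on any problems and the nightly refresh fails loudly instead of silently dropping that new CRM field.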
Wired together, the alerts from one incident read like a timeline:

- Volume: "stg_orders landed 280 rows (30d avg 14,204, σ=1,820). shopify connector last succeeded 08:55 with 0 rows."
- Freshness: "dim_customer is 4h12m stale. SLA 4h. Upstream: stg_orders (see 09:14)."
- Recovery: "dim_customer 12m stale, row-count back inside σ."

If alerts land in a Slack channel nobody watches, you don't have observability; you have logging. Route them to the same channel your platform alerts go to. Fewer, louder, actionable.
// slack/alerts/freshness.json — what hits #data-alerts when an asset goes stale
{
"channel": "#data-alerts",
"blocks": [
{ "type": "header",
"text": { "type": "plain_text", "text": "🟠 Stale: raw_shopify.orders" } },
{ "type": "section",
"fields": [
{ "type": "mrkdwn", "text": "*Last refresh:*\n42 min ago" },
{ "type": "mrkdwn", "text": "*SLA:*\n20 min" },
{ "type": "mrkdwn", "text": "*Owner:*\n@ops-data-team" },
{ "type": "mrkdwn", "text": "*Downstream:*\ndaily_kpi_report, ai_assistant" }
]
},
{ "type": "actions",
"elements": [
{ "type": "button", "text": { "type": "plain_text", "text": "View asset" },
"url": "https://dagster.your-co.com/assets/raw_shopify/orders" },
{ "type": "button", "text": { "type": "plain_text", "text": "Mute 1h" },
"value": "mute:raw_shopify.orders:1h" }
]
}
]
}
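Getting that payload into the channel is one HTTP call. A sketch using Slack's incoming webhooks with a placeholder URL; note that app-based webhooks post to the channel they were created for, while the top-level channel field applies when sending via chat.postMessage with a bot token:

```python
# Post the alert payload above to Slack via an incoming webhook.
# WEBHOOK_URL is a placeholder; keep the real one in a secret store.
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def send_alert(payload: dict) -> None:
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()  # Slack answers "ok" on success
```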
Load-bearing: every alert names who owns it and what breaks downstream. An alert without a name and a blast radius is just noise; it gets muted in a week.