D1 · Cross-cutting discipline (1 of 2) · designed in at L1 · ~10 min read

Trust — the lake is correct. Or you have nothing.

Most teams treat observability as a layer to add at the end. That framing is the problem. Trust — observability, data quality, freshness, schema drift — is a discipline that runs through every build layer. Designed in at L1. Present at L2, L3, L4, L5. By L4 it's too late to retrofit.

Our take

Four things, always on: freshness, volume, schema, and nulls on key columns. Wired into every layer from day one — not bolted on later.

Why this matters · Silent data rot

Data pipelines almost never crash. They just quietly start lying.

A web app breaks loudly — a 500 error, a broken page, something obvious. A data pipeline breaks silently: a connector stops updating at 3am and the dashboard just keeps showing yesterday's number. A source renames a column and every downstream metric subtly shifts. Nobody notices for a week.

Observability is the smoke detector for this. You can't watch every row; you watch four signals — is it fresh, is it the right volume, is the schema still what we expect, are the key columns not full of nulls — and those four catch the vast majority of "the numbers look weird today" incidents. Start here. Add more signals when a specific failure teaches you to.

01 · Freshness

Every mart has a max staleness. Breach it, page someone.

dim_customer · < 4h

02 · Row-count variance

Today vs. the last 30 days. Outliers mean upstream broke.

14,000 → 280 rows ⚠︎

03 · Schema drift

Contract-test staging against raw. Fail the build; don't silently drop columns.

new col → PR blocks

04 · Nulls on key cols

Join keys and grain columns should never be null. Ever.

customer_id · 0 nulls
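
The fourth check, nulls on key columns, can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: it assumes rows arrive as plain dicts, and the function names `null_key_counts` and `assert_no_null_keys` are hypothetical.

```python
def null_key_counts(rows, key_columns):
    """Count, per key column, the rows where the value is missing or None."""
    counts = {col: 0 for col in key_columns}
    for row in rows:
        for col in key_columns:
            # A key that is absent from the row counts the same as an explicit None.
            if row.get(col) is None:
                counts[col] += 1
    return counts


def assert_no_null_keys(rows, key_columns):
    """Raise if any join key or grain column contains a null."""
    bad = {col: n for col, n in null_key_counts(rows, key_columns).items() if n}
    if bad:
        raise AssertionError(f"null keys found: {bad}")


if __name__ == "__main__":
    sample = [
        {"customer_id": 1, "email": "a@example.com"},
        {"customer_id": None, "email": "b@example.com"},
    ]
    print(null_key_counts(sample, ["customer_id"]))
```

In a warehouse the same idea is usually a `COUNT(*) WHERE key IS NULL` query; the point is that the expected answer is zero, so any non-zero count fails.
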
Rule 03

Freshness is the most-ignored metric.

Every mart should have a max acceptable staleness. If dim_customer hasn't updated in 4 hours, something is wrong — even if queries still succeed. Alert on it.
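
A freshness check needs very little: a staleness limit per mart and the time of the last successful refresh. The sketch below assumes you can already fetch that timestamp from somewhere; the `MAX_STALENESS` table and the `stale_marts` function are hypothetical names used for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA table: mart name -> maximum acceptable staleness.
MAX_STALENESS = {
    "dim_customer": timedelta(hours=4),
}


def stale_marts(last_updated, now=None, slas=MAX_STALENESS):
    """Return {mart: staleness} for every mart that has breached its limit.

    last_updated maps mart name -> timezone-aware datetime of the last
    successful refresh. A mart with a limit but no recorded refresh is
    reported with a staleness of None, which still counts as a breach.
    """
    now = now or datetime.now(timezone.utc)
    breaches = {}
    for mart, limit in slas.items():
        refreshed_at = last_updated.get(mart)
        if refreshed_at is None:
            breaches[mart] = None
            continue
        staleness = now - refreshed_at
        if staleness > limit:
            breaches[mart] = staleness
    return breaches
```

Anything this returns should page someone, even though queries against the stale mart still succeed.
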

Heuristic

Row-count variance is free and incredibly effective.

If yesterday you landed 14,000 orders and today you landed 280, a source broke and nobody told you. A simple "is today within 3 standard deviations of the last 30 days?" check catches most upstream outages within one sync.
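
The three-standard-deviations question translates almost directly into code. This is a sketch under simple assumptions: `history` is a list of daily row counts for the trailing window, and the function name `row_count_is_outlier` is hypothetical. It uses the sample standard deviation from Python's standard library.

```python
from statistics import mean, stdev


def row_count_is_outlier(history, today, max_sigma=3.0):
    """True when today's count is more than max_sigma standard deviations
    away from the mean of the trailing history."""
    if len(history) < 2:
        raise ValueError("need at least two days of history")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is an outlier.
        return today != mu
    return abs(today - mu) > max_sigma * sigma


if __name__ == "__main__":
    last_30_days = [14000 + 100 * (day % 5) for day in range(30)]
    print(row_count_is_outlier(last_30_days, 280))
```

The check is two-sided on purpose: a sudden tripling of rows (duplicate loads, a fan-out join) is as suspicious as a collapse.
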

Opinion

Schema drift should fail loud.

The CRM added a new field. Great. The nightly refresh silently drops it because staging hardcoded the column list. Three weeks later someone asks about it. Contract-test your staging models against raw. Break the build on mismatch.
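
A contract test for this can be as small as a set comparison between the columns raw exposes and the columns staging selects. The sketch below assumes you can list both column sets; `schema_drift`, `assert_contract`, and the `ignored` parameter are hypothetical names, not part of any particular tool.

```python
def schema_drift(raw_columns, staged_columns, ignored=()):
    """Compare raw columns with the columns staging selects.

    ignored lists raw columns that are deliberately left out of staging,
    so that excluding a column is an explicit decision rather than an accident.
    """
    raw = set(raw_columns) - set(ignored)
    staged = set(staged_columns)
    return {
        "unstaged": sorted(raw - staged),  # present in raw, dropped by staging
        "missing": sorted(staged - raw),   # expected by staging, gone from raw
    }


def assert_contract(raw_columns, staged_columns, ignored=()):
    """Fail loudly on any mismatch; run this in CI to break the build."""
    drift = schema_drift(raw_columns, staged_columns, ignored)
    if drift["unstaged"] or drift["missing"]:
        raise AssertionError(f"schema drift: {drift}")
```

When the CRM adds a field, the build fails until someone either stages the column or adds it to the ignore list. Both outcomes leave a trail in a PR.
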

What the channel looks like

#data-alerts · mute disabled · humans on-call
09:14 · row-count variance · stg_orders landed 280 rows (30d avg 14,204, σ=1,820). shopify connector last succeeded 08:55 with 0 rows.
09:15 · freshness breach · dim_customer is 4h12m stale. SLA 4h. Upstream: stg_orders (see 09:14).
09:22 · ack'd · @jamie · restarting shopify connector, backfilling 08:00–09:00
09:48 · resolved · dim_customer 12m stale, row-count back inside σ.

Heuristic

One alert channel. Humans on call for it.

If alerts land in a Slack channel nobody watches, you don't have observability — you have logging. Route to the same channel your platform alerts go to. Fewer, louder, actionable.

Artifact · slack/alerts/freshness.json · ~25 lines · Slack Block Kit
What hits #data-alerts when an asset goes stale:
{
  "channel": "#data-alerts",
  "blocks": [
    { "type": "header",
      "text": { "type": "plain_text", "text": "🟠 Stale: raw_shopify.orders" } },
    { "type": "section",
      "fields": [
        { "type": "mrkdwn", "text": "*Last refresh:*\n42 min ago" },
        { "type": "mrkdwn", "text": "*SLA:*\n20 min" },
        { "type": "mrkdwn", "text": "*Owner:*\n@ops-data-team" },
        { "type": "mrkdwn", "text": "*Downstream:*\ndaily_kpi_report, ai_assistant" }
      ]
    },
    { "type": "actions",
      "elements": [
        { "type": "button", "text": { "type": "plain_text", "text": "View asset" },
          "url": "https://dagster.your-co.com/assets/raw_shopify/orders" },
        { "type": "button", "text": { "type": "plain_text", "text": "Mute 1h" },
          "value": "mute:raw_shopify.orders:1h" }
      ]
    }
  ]
}

Load-bearing · Every alert names who owns it and what breaks downstream. An alert without an owner and a blast radius is just noise, and it gets muted within a week.
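
One way to enforce that rule is to make the alert builder refuse to produce a payload without an owner and a downstream list. The sketch below mirrors the header and section blocks of the artifact above and leaves the action buttons out for brevity. The function name `build_stale_alert` and its parameters are hypothetical, and sending the payload to Slack is out of scope here.

```python
def build_stale_alert(asset, last_refresh, sla, owner, downstream,
                      channel="#data-alerts"):
    """Build a Block Kit style payload for a stale-asset alert.

    Raises ValueError when the owner or the downstream list is empty,
    so an alert without a name and a blast radius never gets sent.
    """
    if not owner:
        raise ValueError(f"alert for {asset} has no owner")
    if not downstream:
        raise ValueError(f"alert for {asset} lists no downstream assets")

    fields = [
        ("Last refresh", last_refresh),
        ("SLA", sla),
        ("Owner", owner),
        ("Downstream", ", ".join(downstream)),
    ]
    return {
        "channel": channel,
        "blocks": [
            {"type": "header",
             "text": {"type": "plain_text", "text": f"🟠 Stale: {asset}"}},
            {"type": "section",
             "fields": [{"type": "mrkdwn", "text": f"*{label}:*\n{value}"}
                        for label, value in fields]},
        ],
    }
```

Calling it with the values from the artifact (`"raw_shopify.orders"`, `"42 min ago"`, `"20 min"`, `"@ops-data-team"`, `["daily_kpi_report", "ai_assistant"]`) reproduces the same four section fields.
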