Most teams treat observability as a layer to add at the end. That framing is the problem. Trust (observability, data quality, freshness, schema drift) is a discipline that runs through every layer of the build: designed in at L1, present at L2, L3, L4, and L5. Wait until L4 and it's too late to retrofit.
Four things, always on: freshness, volume, schema, and nulls on key columns. Wired into every layer from day one — not bolted on later.
A web app breaks loudly — a 500 error, a broken page, something obvious. A data pipeline breaks silently: a connector stops updating at 3am and the dashboard just keeps showing yesterday's number. A source renames a column and every downstream metric subtly shifts. Nobody notices for a week.
Observability is the smoke detector for this. You can't watch every row; you watch four signals — is it fresh, is it the right volume, is the schema still what we expect, are the key columns not full of nulls — and those four catch the vast majority of "the numbers look weird today" incidents. Start here. Add more signals when a specific failure teaches you to.
- Freshness: every mart has a max staleness. Breach it, page someone.
- Volume: today vs. the last 30 days. Outliers mean something upstream broke.
- Schema: contract-test staging against raw. Fail the build, don't silently drop columns.
- Nulls: join keys and grain columns should never be null. Ever. (A minimal check follows this list.)
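The nulls signal is the simplest to wire up, so start there. A minimal sketch, assuming a hypothetical run_query(sql) helper that executes SQL against your warehouse and returns a list of row tuples; the table and column names are illustrative:

```python
# Null-key check: join keys and grain columns must never be null.
# run_query is a hypothetical helper that runs SQL against your
# warehouse and returns a list of row tuples.
KEY_COLUMNS = {
    "fct_orders": ["order_id", "customer_id", "order_date"],  # grain + join keys
    "dim_customer": ["customer_id"],
}

def check_null_keys(run_query) -> list[str]:
    failures = []
    for table, columns in KEY_COLUMNS.items():
        for col in columns:
            (null_count,) = run_query(
                f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL"
            )[0]
            if null_count > 0:
                failures.append(f"{table}.{col}: {null_count} null keys")
    return failures
```

The other three signals deserve a closer look.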
Every mart should have a max acceptable staleness. If dim_customer hasn't updated in 4 hours, something is wrong — even if queries still succeed. Alert on it.
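A minimal staleness check, reusing the hypothetical run_query helper and assuming each mart carries an updated_at column; the SLAs are illustrative:

```python
# Freshness check: flag any mart older than its max acceptable staleness.
# Assumes MAX(updated_at) comes back as a timezone-aware datetime.
from datetime import datetime, timedelta, timezone

SLAS = {
    "dim_customer": timedelta(hours=4),
    "fct_orders": timedelta(hours=1),
}

def check_freshness(run_query) -> list[str]:
    breaches = []
    now = datetime.now(timezone.utc)
    for table, max_staleness in SLAS.items():
        (last_update,) = run_query(f"SELECT MAX(updated_at) FROM {table}")[0]
        if last_update is None or now - last_update > max_staleness:
            breaches.append(
                f"{table} stale: last update {last_update}, SLA {max_staleness}"
            )
    return breaches
```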
If yesterday you landed 14,000 orders and today you landed 280, a source broke and nobody told you. A simple "is today within 3 standard deviations of the last 30 days?" check catches most upstream outages within one sync.
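The whole 3-sigma check is a few lines of Python. A sketch, assuming you already have the trailing 30 daily row counts in hand (in practice, pulled from warehouse metadata or your loader's logs):

```python
# Volume check: is today's row count within N standard deviations of the
# trailing 30 days? The history below is fabricated for the example.
import statistics

def volume_ok(daily_counts: list[int], today: int, sigmas: float = 3.0) -> bool:
    mean = statistics.mean(daily_counts)
    stdev = statistics.stdev(daily_counts)
    return abs(today - mean) <= sigmas * stdev

history = [14_000 + 200 * (i % 5) for i in range(30)]  # fake 30-day history
print(volume_ok(history, 280))      # False: the broken-connector morning
print(volume_ok(history, 14_350))   # True: a normal day
```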
The CRM added a new field. Great. The nightly refresh silently drops it because staging hardcoded the column list. Three weeks later someone asks about it. Contract-test your staging models against raw. Break the build on mismatch.
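A minimal contract check diffs expected columns against information_schema in both directions, so new columns surface instead of vanishing and removed columns break the build. The run_query helper and the column list are again illustrative:

```python
# Schema contract check: expected vs. actual columns, both directions.
EXPECTED = {
    ("raw_shopify", "orders"): {"id", "customer_id", "total", "created_at"},
}

def check_contracts(run_query) -> list[str]:
    problems = []
    for (schema, table), expected in EXPECTED.items():
        rows = run_query(
            "SELECT column_name FROM information_schema.columns "
            f"WHERE table_schema = '{schema}' AND table_name = '{table}'"
        )
        actual = {row[0] for row in rows}
        if missing := expected - actual:
            problems.append(f"{schema}.{table} lost columns: {sorted(missing)}")
        if added := actual - expected:
            problems.append(f"{schema}.{table} grew columns: {sorted(added)}")
    return problems
```

Raise on any problems and the nightly refresh fails loudly instead of silently dropping that new CRM field.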
Wired together, the alerts from one incident read like a timeline:

- Volume: "stg_orders landed 280 rows (30d avg 14,204, σ=1,820). shopify connector last succeeded 08:55 with 0 rows."
- Freshness: "dim_customer is 4h12m stale. SLA 4h. Upstream: stg_orders (see 09:14)."
- Recovery: "dim_customer 12m stale, row-count back inside σ."

If alerts land in a Slack channel nobody watches, you don't have observability; you have logging. Route them to the same channel your platform alerts go to. Fewer, louder, actionable.
// slack/alerts/freshness.json — what hits #data-alerts when an asset goes stale
{
"channel": "#data-alerts",
"blocks": [
{ "type": "header",
"text": { "type": "plain_text", "text": "🟠 Stale: raw_shopify.orders" } },
{ "type": "section",
"fields": [
{ "type": "mrkdwn", "text": "*Last refresh:*\n42 min ago" },
{ "type": "mrkdwn", "text": "*SLA:*\n20 min" },
{ "type": "mrkdwn", "text": "*Owner:*\n@ops-data-team" },
{ "type": "mrkdwn", "text": "*Downstream:*\ndaily_kpi_report, ai_assistant" }
]
},
{ "type": "actions",
"elements": [
{ "type": "button", "text": { "type": "plain_text", "text": "View asset" },
"url": "https://dagster.your-co.com/assets/raw_shopify/orders" },
{ "type": "button", "text": { "type": "plain_text", "text": "Mute 1h" },
"value": "mute:raw_shopify.orders:1h" }
]
}
]
}
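Getting that payload into the channel is one HTTP call. A sketch using Slack's incoming webhooks with a placeholder URL; note that app-based webhooks post to the channel they were created for, while the top-level channel field applies when sending via chat.postMessage with a bot token:

```python
# Post the alert payload above to Slack via an incoming webhook.
# WEBHOOK_URL is a placeholder; keep the real one in a secret store.
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def send_alert(payload: dict) -> None:
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()  # Slack answers "ok" on success
```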
Load-bearing: every alert names who owns it and what breaks downstream. An alert without a name and a blast radius is just noise; it gets muted in a week.