L3 Model is where the lake gets structured and queryable. It has two halves we cover in order: storage (the lake's tiers and zones, immediately below) and transform (SQL in git with tests, further down). Together they're how the lake becomes a single source of truth.

The decision everyone wants to overthink.
Most companies don't need a warehouse. They need a lake — S3 + Glue + Athena. You can run the whole business from there for a long time. A warehouse is what you reach for when you need hotter, faster answers or the query costs on the lake stop making sense.
A data lake is like a big garage — you throw files in (JSON, CSV, logs, whatever), label the shelves, and dig things out when you need them. Cheap to keep stuff there. Slower to find a specific thing. You only pay when you go looking.
A data warehouse is like a filing cabinet — everything pre-sorted, indexed, and formatted. Fast to pull any single thing. But you pay rent on the cabinet whether you open it today or not, and you have to convert every document before it goes in.
The lake — good at: keeping everything, cheaply. Handling weird shapes (JSON, logs, images). Paying only when queried.
Less good at: sub-second dashboards. High-frequency querying.
The warehouse — good at: fast dashboards. Frequent BI queries. Sub-second responses your CEO wants.
Less good at: keeping cheap raw exhaust (logs, clicks, raw events) at scale.
Parquet in object storage. Schema-on-read via Glue catalog. Query with Athena. Pay per scan, not per uptime. Start here; stay here longer than you think you should.
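Concretely, the lake is just files plus a catalog entry. A minimal sketch in Athena SQL, assuming a raw Shopify orders feed; the bucket, database, and column names are illustrative and should match your own layout:

```sql
-- Register raw Parquet files in the Glue catalog so Athena can query them
-- (database, columns, and location are illustrative)
CREATE EXTERNAL TABLE raw_shopify.orders (
    id           BIGINT,
    customer_id  BIGINT,
    total_price  DECIMAL(12, 2),
    processed_at TIMESTAMP
)
PARTITIONED BY (dt STRING)          -- ingest date, mirrors the bucket layout below
STORED AS PARQUET
LOCATION 's3://your-co-data/raw/shopify/orders/';

-- Pick up new dt= partitions, then query; Athena bills per TB scanned, not per hour
MSCK REPAIR TABLE raw_shopify.orders;
SELECT count(*) FROM raw_shopify.orders WHERE dt = '2026-04-30';
```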
Dashboards timing out? AI tools needing sub-second responses? Athena scan bills beating warehouse subscriptions? Now you add Snowflake / BigQuery / Redshift — for the hot marts. Lake stays as the source of truth.
Partitioning strategy, cost governance, hybrid lake/warehouse patterns, Iceberg table maintenance — this is a dedicated role, not a side-quest.
Raw is the delivery dock — crates of ingredients arriving from suppliers, exactly as they were shipped. Nobody cooks from here. Nobody edits the crates. If something's wrong, you can trace it back to the supplier.
Staging is the prep station — peel, chop, rinse, label. Every item gets cleaned and standardized. Same shapes, same units, same names. The menu doesn't happen here yet; this is just getting ingredients ready to be cooked with.
Marts is the plated dish — the thing the customer (your dashboard, your AI, your finance team) actually eats. One row per customer. One row per order. Pre-joined, pre-computed, business-ready. This is the only thing consumers ever read.
raw → staging → marts. raw mirrors sources exactly. staging cleans and types, 1:1 with source tables. marts models the business — one row per customer, one row per order, pre-joined, pre-aggregated where it matters.
Consumers only ever read marts. Staging is for modelers. Raw is for debugging and disaster recovery.
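One way to make that rule mechanical rather than cultural, sketched as Postgres/Redshift-style grants (the bi_reader role and schema names are illustrative; in a pure Athena setup you'd express the same policy with IAM or Lake Formation permissions instead):

```sql
-- Consumers see marts and nothing else; staging stays internal to the modeling team
GRANT USAGE ON SCHEMA marts TO bi_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA marts TO bi_reader;
REVOKE ALL ON SCHEMA staging FROM bi_reader;
```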
# Bucket layout — your-co-data
s3://your-co-data/
├── raw/                      <- as-is from source. partitioned by ingest date.
│   ├── shopify/orders/dt=2026-04-30/
│   ├── netsuite/transactions/dt=2026-04-30/
│   └── klaviyo/events/dt=2026-04-30/
├── staging/                  <- typed, deduped, 1:1 with source. dbt-managed.
│   ├── stg_shopify__orders/
│   └── stg_netsuite__txns/
└── marts/                    <- business concepts. customers, orders, ad_spend.
    ├── core/
    └── marketing/
Load-bearing: Raw is append-only — we never transform in place. If a model is wrong, fix the model and rebuild from raw. Touching raw means giving up your safety net the first time something breaks.
Storage (above) gives you the lake's structure. Transform (below) gives you its queryable shape. Together they're L3 Model. This is where “data engineering” becomes “software engineering.” Once you've done it this way, you can't go back.
Every model is a SELECT. Every SELECT is reviewed, tested, and versioned.
Without one: SQL lives in ten places. A query in a dashboard, a saved query in someone's editor, a stored procedure in the database, a script in a notebook. When a number changes, nobody can tell you why, because nobody tracks which query produced it.
With one (dbt, SQLMesh, etc.): every query is a file in a git repo. The tool reads all the files, figures out what depends on what, runs them in the right order, runs tests against each result ("this ID column is never null," "this table is unique by customer"), and fails the build if anything is off. Same workflow as any software codebase — pull requests, reviews, CI checks.
It gives you three superpowers: dependency resolution (model B depends on model A — run A first), tests (the computer checks the work), and versioning (git-blame every number).
That's also the shopping list when you pick a tool: dependency resolution (model B depends on model A), tests (this column is never null, this ID is unique), and CI (broken SQL fails the PR before it reaches prod). Any tool that does those three is fine. The category is mature; don't litigate it.
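In dbt, for instance, the dependency graph falls out of ref() calls. A minimal sketch — the mart's file and columns here are illustrative, the staging model is the one defined further down:

```sql
-- models/marts/core/fct_orders.sql
-- The ref() call is the dependency: the tool builds stg_shopify__orders first,
-- then this model, and git-blame works on both files.
SELECT
    order_id,
    customer_id,
    total_amount - discount_amount AS net_amount,
    effective_at
FROM {{ ref('stg_shopify__orders') }}
WHERE cancelled_at IS NULL
```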
Avoid: ltv_amt_usd_ttm. Compact, efficient, impenetrable to anyone who wasn't in the room when it was named. The LLM guesses, and guesses wrong.
Prefer: lifetime_revenue_usd. Plain English. Unit in the name. No guessing. Works whether the reader is an analyst, an agent, or yourself in six months.
Same idea as software unit tests, but about your data. You declare a rule — "every row in dim_customer has a non-null customer_id" or "order_id is unique in fct_orders" — and the test runs against the real output after every build. If it fails, the build fails, and the bad table never gets published to dashboards or AI.
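Under the hood each declaration compiles to a query that returns the offending rows, and a non-empty result fails the build. Roughly:

```sql
-- What a not_null test on dim_customer.customer_id boils down to:
-- any rows returned here mean the test (and the build) fails
SELECT *
FROM dim_customer
WHERE customer_id IS NULL
```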
Five tests catch 90% of "the numbers look weird today" incidents: not_null, unique, accepted_values (is this enum one of the expected options), relationships (every foreign key exists in the parent table), and freshness (this table was updated in the last N hours).
For every mart: uniqueness on the grain, non-null on join keys, referential integrity against staging, accepted values on enums. Adding a new column? Add a test. It takes thirty seconds and saves a production outage.
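The generic tests in the schema file (like the one further down) cover most of this; anything that doesn't fit a one-liner can be a standalone SQL test that returns the rows violating the rule. A sketch of the grain check, assuming a dim_customer mart and an illustrative filename:

```sql
-- tests/assert_dim_customer_one_row_per_customer.sql
-- Fails the build if any customer_id appears more than once in the mart
SELECT
    customer_id,
    count(*) AS row_count
FROM {{ ref('dim_customer') }}
GROUP BY customer_id
HAVING count(*) > 1
```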
One row per customer. Every metric the business cares about, pre-computed. This is the table both your dashboards and your AI read from.
-- one row per customer, joined to every relevant source
with orders as (
    select
        customer_id,
        count(*) as lifetime_orders,
        sum(gross_amount) as lifetime_revenue,
        max(ordered_at) as last_order_at
    from staging.orders
    group by 1
),

tickets as (
    select
        customer_id,
        count(*) as lifetime_tickets,
        count(*) filter (where status = 'open') as open_tickets
    from staging.support_tickets
    group by 1
)

select
    c.customer_id,
    c.email,
    c.signed_up_at,
    coalesce(o.lifetime_revenue, 0) as lifetime_revenue,
    coalesce(o.lifetime_orders, 0) as lifetime_orders,
    o.last_order_at,
    coalesce(t.lifetime_tickets, 0) as lifetime_tickets,
    coalesce(t.open_tickets, 0) as open_tickets,

    -- business-defined status, not a source-system field
    case
        when o.last_order_at > current_date - interval '90 days' then 'active'
        when o.last_order_at > current_date - interval '365 days' then 'lapsing'
        else 'dormant'
    end as lifecycle_stage

from staging.customers c
left join orders o using (customer_id)
left join tickets t using (customer_id)
lifetime_revenue, not ltv_amt_usd. The LLM on the other end is going to read these names and reason about them. Good names reduce hallucinations more than any prompt engineering.
-- models/staging/stg_shopify__orders.sql
{{ config(materialized='incremental', unique_key='order_id') }}
SELECT
    id::TEXT AS order_id,
    customer_id::TEXT AS customer_id,
    total_price::NUMERIC AS total_amount,
    total_discounts::NUMERIC AS discount_amount,
    total_tax::NUMERIC AS tax_amount,
    cancelled_at::TIMESTAMP AS cancelled_at,
    COALESCE(cancelled_at, processed_at)::TIMESTAMP AS effective_at,
    _ingested_at::TIMESTAMP AS ingested_at
FROM {{ source('raw_shopify', 'orders') }}
{% if is_incremental() %}
WHERE _ingested_at > (SELECT MAX(ingested_at) FROM {{ this }})
{% endif %}
# models/staging/_schema.yml
models:
  - name: stg_shopify__orders
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: total_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
Load-bearing: Tests are non-optional. Every column that drives a business decision gets a not_null. Every monetary column gets a range. If a model has no tests, it doesn’t merge — that's the rule.