Build layer · 3 of 5 · ~10 min read
L3 · Model — storage + transform together

Model — start with a lake. The schema work that turns 6 systems into one row.

The decision everyone wants to overthink. L3 Model is where the lake gets structured and queryable. It has two halves we cover in order: storage (the lake's tiers and zones, immediately below) and transform (SQL in git with tests, further down). Together they're how the lake becomes a single source of truth.

Our take

Most companies don't need a warehouse. They need a lake: S3 + Glue + Athena. You can run the whole business from there for a long time. A warehouse is what you reach for when you need hotter, faster answers or the query costs on the lake stop making sense.

First, the words · Lake vs warehouse

Two ways to keep your data. One is a garage. One is a filing cabinet.

A data lake is like a big garage — you throw files in (JSON, CSV, logs, whatever), label the shelves, and dig things out when you need them. Cheap to keep stuff there. Slower to find a specific thing. You only pay when you go looking.

A data warehouse is like a filing cabinet — everything pre-sorted, indexed, and formatted. Fast to pull any single thing. But you pay rent on the cabinet whether you open it today or not, and you have to convert every document before it goes in.

Lake · garage
Files in folders, on cheap shelves.
.json .parquet .csv .log

Good at: keeping everything, cheaply. Handling weird shapes (JSON, logs, images). Paying only when queried.

Less good at: sub-second dashboards. High-frequency querying.

Warehouse · filing cabinet
Rows & columns, pre-sorted, indexed.
CUSTOMERS · ORDERS · PRODUCTS · REVENUE · id, name, email...

Good at: fast dashboards. Frequent BI queries. Sub-second responses your CEO wants.

Less good at: keeping cheap raw exhaust (logs, clicks, raw events) at scale.

Upgrade when…
Dashboards take > 3 seconds, scan bills overtake a warehouse subscription, or the AI tool needs sub-second answers. Then you add a warehouse on top of the lake — not instead of it. The lake is still the source of truth.
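A rough, illustrative break-even (the volumes here are invented for scale): 500 Athena queries a day scanning ~2 GB each is about 1 TB/day, roughly 30 TB and about $150 a month at the ~$5/TB scan rate, comfortably below most warehouse subscriptions. Ten times that query volume, or unpartitioned full-table scans, and the comparison flips.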
Stage A · default
Data lake. S3 + Glue + Athena.

Parquet in object storage. Schema-on-read via Glue catalog. Query with Athena. Pay per scan, not per uptime. Start here; stay here longer than you think you should.

S3 · AWS Glue · Athena · Iceberg / Parquet
Stage B · when you need hot
Add a warehouse for the queries that can't wait.

Dashboards timing out? AI tools needing sub-second responses? Athena scan bills beating warehouse subscriptions? Now you add Snowflake / BigQuery / Redshift — for the hot marts. Lake stays as the source of truth.

Snowflake · BigQuery · Redshift · DuckDB / MotherDuck
Stage C · 10B+ rows
You have a storage problem. Hire for it.

Partitioning strategy, cost governance, hybrid lake/warehouse patterns, Iceberg table maintenance — this is a dedicated role, not a side-quest.

dedicated data eng
Anti-pattern
"We'll just query prod Postgres." You'll take down prod. Or you'll build a read replica and slowly reinvent a warehouse on top of it with worse query ergonomics. Either way, you end up here. Skip ahead.
Inside the storage · Raw → Staging → Marts

Think of your data like a restaurant kitchen. Three stations. One direction.

Raw is the delivery dock — crates of ingredients arriving from suppliers, exactly as they were shipped. Nobody cooks from here. Nobody edits the crates. If something's wrong, you can trace it back to the supplier.

Staging is the prep station — peel, chop, rinse, label. Every item gets cleaned and standardized. Same shapes, same units, same names. The menu doesn't happen here yet; this is just getting ingredients ready to be cooked with.

Marts is the plated dish — the thing the customer (your dashboard, your AI, your finance team) actually eats. One row per customer. One row per order. Pre-joined, pre-computed, business-ready. This is the only thing consumers ever read.

Why this order
If someone "cleans up" raw directly, your kitchen is now non-reproducible — you can't re-cook a dish when a supplier changes. Staging is the one place mess gets resolved, and marts is the one place the business reads from. Three zones. One direction. No shortcuts.
Structure

Three zones, every time: raw → staging → marts.

raw mirrors sources exactly. staging cleans and types, 1:1 with source tables. marts models the business — one row per customer, one row per order, pre-joined, pre-aggregated where it matters.

Consumers only ever read marts. Staging is for modelers. Raw is for debugging and disaster recovery.

fig · Raw → Staging → Marts · what each zone actually contains
RAW · as-is from source: raw.shopify_orders, raw.hubspot_deals, raw.stripe_charges, raw.support_tickets · append-only · never hand-edit · kept for debug / disaster recovery
STAGING · cleaned, typed, 1:1: stg.orders, stg.deals, stg.charges, stg.tickets · tested types + nulls · renamed like humans · modeler-facing only
MARTS · business concepts: dim_customer, fct_pipeline, fct_revenue_daily, dim_account · one row per entity · read by AI + dashboards
Default · Lake-first (S3 + Iceberg + Athena)
Storage cost — pennies per GB-month. Pay for the data you keep, not the queries you might run.
Query cost — pay-per-scan (Athena ≈ $5/TB scanned). With partitioning, most queries are cents.
Pick when: You’re starting cold. You want low fixed cost. You’ll add a warehouse for hot marts when query latency hurts.

The alternative · Warehouse-first (Snowflake / BigQuery)
Storage cost — bundled into compute pricing. Cheap when small; not when you keep five years of clickstream.
Query cost — credits for warehouse uptime. Predictable monthly bill; tougher when usage spikes.
Pick when: You already have one and a team that knows it. Sub-second BI matters. You don’t want to operate two storage layers.
Artifact · s3://your-co-data/ (lake layout) · ~12 lines · tree
# Bucket layout — your-co-data
s3://your-co-data/
├── raw/              <- as-is from source. partitioned by ingest date.
│   ├── shopify/orders/dt=2026-04-30/
│   ├── netsuite/transactions/dt=2026-04-30/
│   └── klaviyo/events/dt=2026-04-30/
├── staging/          <- typed, deduped, 1:1 with source. dbt-managed.
│   ├── stg_shopify__orders/
│   └── stg_netsuite__txns/
└── marts/            <- business concepts. customers, orders, ad_spend.
    ├── core/
    └── marketing/

Load-bearing · Raw is append-only — we never transform in place. If a model is wrong, fix the model and rebuild from raw. Touching raw means giving up your safety net the first time something breaks.
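
To make the layout concrete, here is a hedged sketch of how one raw prefix gets registered in the Glue catalog and queried from Athena. The database, table, and column names are illustrative, not prescribed above.

Athena DDL + query (sketch)
-- register the raw Shopify orders prefix as a Glue/Athena table (schema-on-read).
-- assumes a raw_shopify database already exists in the catalog.
CREATE EXTERNAL TABLE raw_shopify.orders (
  id            bigint,
  customer_id   bigint,
  total_price   decimal(12,2),
  processed_at  timestamp
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://your-co-data/raw/shopify/orders/';

-- pick up newly landed dt= partitions
MSCK REPAIR TABLE raw_shopify.orders;

-- query with a partition filter: Athena scans one day's files, not the whole bucket
SELECT count(*) AS orders, sum(total_price) AS gross
FROM raw_shopify.orders
WHERE dt = '2026-04-30';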

L3 · Model continues — Transform

SQL in git. With tests.

Storage (above) gives you the lake's structure. Transform (below) gives you its queryable shape. Together they're L3 Model. This is where “data engineering” becomes “software engineering.” Once you've done it this way, you can't go back.

Our take

Every model is a SELECT. Every SELECT is reviewed, tested, and versioned.

First, the words · Transform tool

A transform tool turns your collection of SQL queries into software.

Without one: SQL lives in ten places. A query in a dashboard, a saved query in someone's editor, a stored procedure in the database, a script in a notebook. When a number changes, nobody can tell you why, because nobody tracks which query produced it.

With one (dbt, SQLMesh, etc.): every query is a file in a git repo. The tool reads all the files, figures out what depends on what, runs them in the right order, runs tests against each result ("this ID column is never null," "this table is unique by customer"), and fails the build if anything is off. Same workflow as any software codebase — pull requests, reviews, CI checks.

It gives you three superpowers: dependency resolution (model B depends on model A — run A first), tests (the computer checks the work), and versioning (git-blame every number).
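
A minimal sketch of what dependency resolution looks like in practice, assuming a dbt-style ref() and hypothetical model names:

-- marts/fct_orders.sql (hypothetical)
-- selecting from ref('stg_orders') tells the tool to build stg_orders first
-- and to rebuild fct_orders whenever stg_orders changes
select
  order_id,
  customer_id,
  total_amount
from {{ ref('stg_orders') }}
where cancelled_at is null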

Category is mature
Any tool that does those three things is fine. dbt is the default. Don't spend a month picking.
Rule 02

Use a transform tool that compiles SQL and runs tests.

You need three things: dependency resolution (model B depends on model A), tests (this column is never null, this ID is unique), and CI (broken SQL fails PR before prod). Any tool that does those three is fine. The category is mature; don't litigate it.
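
As a sketch of the CI piece, assuming GitHub Actions and a ci target defined in profiles.yml (none of this is prescribed above, and any CI runner works the same way):

# .github/workflows/dbt-ci.yml (illustrative)
name: dbt-ci
on: pull_request
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-athena-community
      - run: dbt deps
      # dbt build compiles every model and runs every test; any failure fails the PR
      - run: dbt build --target ci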

Pattern

Name columns the way an LLM reads them, not the way SQL devs type them.

Avoid: ltv_amt_usd_ttm. Compact, efficient, impenetrable to anyone who wasn't in the room when it was named. The LLM guesses, and guesses wrong.

Prefer: lifetime_revenue_usd. Plain English. Unit in the name. No guessing. Works whether the reader is an analyst, an agent, or yourself in six months.
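
As a sketch, the rename happens once, in staging, so every downstream model (and every LLM reading the schema) sees the readable name. The source column names here are hypothetical:

-- staging model: translate source abbreviations into plain-English, unit-suffixed names
select
  customer_id,
  ltv_amt_usd_ttm as lifetime_revenue_usd   -- plain English, unit in the name
from {{ source('crm', 'customers') }}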

Why tests? · Data tests

Data tests are assertions the computer checks every time the pipeline runs.

Same idea as software unit tests, but about your data. You declare a rule — "every row in dim_customer has a non-null customer_id" or "order_id is unique in fct_orders" — and the test runs against the real output after every build. If it fails, the build fails, and the bad table never gets published to dashboards or AI.

Five tests catch 90% of "the numbers look weird today" incidents: not_null, unique, accepted_values (is this enum one of the expected options), relationships (every foreign key exists in the parent table), and freshness (this table was updated in the last N hours).
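
A hedged sketch of how those five checks are declared in dbt-style YAML. The freshness check lives on the source, the rest on model columns; every name below is hypothetical:

# sources: freshness check, warn if the raw table hasn't loaded recently
sources:
  - name: raw_shopify
    tables:
      - name: orders
        loaded_at_field: _ingested_at
        freshness:
          warn_after: {count: 24, period: hour}

# models: not_null, unique, accepted_values, relationships
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'paid', 'refunded']
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customer')
              field: customer_id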

Opinion

Write tests like you'd write any other tests.

For every mart: uniqueness on the grain, non-null on join keys, referential integrity against staging, accepted values on enums. Adding a new column? Add a test. It takes thirty seconds and saves a production outage.

Example

What a mart actually looks like.

One row per customer. Every metric the business cares about, pre-computed. This is the table both your dashboards and your AI read from.

fig · The dependency graph a transform tool walks · with tests
raw.customers, raw.orders, raw.tickets → stg_customers, stg_orders, stg_tickets (tests: unique, not_null, typed) → dim_customer, fct_revenue_daily (tests: unique, relationships, grain) → Dashboards, AI tools, Sheets sync
Every box has tests before it ships; break a test → break the build.
marts/dim_customer.sql — abbreviated
-- one row per customer, joined to every relevant source
with orders as (
  select
    customer_id,
    count(*)                as lifetime_orders,
    sum(gross_amount)       as lifetime_revenue,
    max(ordered_at)         as last_order_at
  from staging.orders
  group by 1
),

tickets as (
  select customer_id,
    count(*)                as lifetime_tickets,
    count(*) filter (where status = 'open') as open_tickets
  from staging.support_tickets
  group by 1
)

select
  c.customer_id,
  c.email,
  c.signed_up_at,
  coalesce(o.lifetime_revenue, 0) as lifetime_revenue,
  coalesce(o.lifetime_orders,  0) as lifetime_orders,
  o.last_order_at,
  coalesce(t.lifetime_tickets, 0) as lifetime_tickets,
  coalesce(t.open_tickets, 0)     as open_tickets,
  -- business-defined status, not a source-system field
  case
    when o.last_order_at > current_date - interval '90 days' then 'active'
    when o.last_order_at > current_date - interval '365 days' then 'lapsing'
    else 'dormant'
  end as lifecycle_stage

from staging.customers c
left join orders   o using (customer_id)
left join tickets  t using (customer_id)
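
The matching tests for this mart might look like the following. A sketch only: the file path is hypothetical, and the enum values and column names mirror the model above.

# marts/_schema.yml (sketch)
models:
  - name: dim_customer
    columns:
      - name: customer_id
        tests:
          - unique        # uniqueness on the grain
          - not_null      # non-null on the join key
          - relationships:            # referential integrity against staging
              to: ref('stg_customers')
              field: customer_id
      - name: lifecycle_stage
        tests:
          - accepted_values:
              values: ['active', 'lapsing', 'dormant']
      - name: lifetime_revenue
        tests: [not_null]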
Heuristic

Name columns like a human.

lifetime_revenue, not ltv_amt_usd. The LLM on the other end is going to read these names and reason about them. Good names reduce hallucinations more than any prompt engineering.

Artifact · models/staging/stg_shopify__orders.sql + _schema.yml · ~30 lines · dbt
-- models/staging/stg_shopify__orders.sql
{{ config(materialized='incremental', unique_key='order_id') }}

SELECT
  id::TEXT                         AS order_id,
  customer_id::TEXT                AS customer_id,
  total_price::NUMERIC             AS total_amount,
  total_discounts::NUMERIC         AS discount_amount,
  total_tax::NUMERIC               AS tax_amount,
  cancelled_at::TIMESTAMP          AS cancelled_at,
  COALESCE(cancelled_at, processed_at)::TIMESTAMP AS effective_at,
  _ingested_at::TIMESTAMP          AS ingested_at
FROM {{ source('raw_shopify', 'orders') }}
{% if is_incremental() %}
  WHERE _ingested_at > (SELECT MAX(ingested_at) FROM {{ this }})
{% endif %}

# models/staging/_schema.yml
models:
  - name: stg_shopify__orders
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: total_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000

Load-bearing · Tests are non-optional. Every column that drives a business decision gets a not_null. Every monetary column gets a range. If a model has no tests, it doesn’t merge — that's the rule.