The fifth and final build layer: where the AI actually queries the lake. The reflex is to build a chatbot, and it is almost always wrong. The pattern that works in production is structured AI features built on the request envelope state machine. This is also the layer where most DIY attempts go sideways.
Expose the modeled lake through a small set of typed tools. The LLM calls them. Never writes SQL. Wrap every irreversible action in a human-approval gate. Track cost per call from day one.
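Those last two rules fit in a dozen lines. A minimal sketch, not a framework: guarded_call and the COSTS ledger are hypothetical names, and the input() prompt stands in for a real review UI.

import time

COSTS: list[dict] = []   # per-call cost ledger, kept from day one

def guarded_call(tool, args: dict, user: str, irreversible: bool = False):
    """Run one tool call; anything irreversible stops for a human first."""
    if irreversible and input(f"{user} wants {tool.__name__}({args}). Approve? [y/N] ") != "y":
        return {"status": "rejected_by_human"}
    t0 = time.time()
    result = tool(**args)
    COSTS.append({"tool": tool.__name__, "user": user, "seconds": round(time.time() - t0, 3)})
    return result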
Tool-calling: instead of letting the LLM write SQL against raw tables (where it can hallucinate wrong joins or miss a filter and report "total revenue" for a subset), you give it a short list of named functions — get_revenue_by_period(start, end), list_at_risk_customers(threshold), compare_periods(a, b). The model picks one, fills in the arguments, and your code runs the actual query.
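Mechanically, that dispatch is small. A sketch, assuming the model has already produced a structured call like {"name": "get_revenue_by_period", "arguments": {...}}; the three stubs stand in for the real query code.

# Stubs stand in for the real implementations behind each verb.
def get_revenue_by_period(start: str, end: str) -> dict: ...
def list_at_risk_customers(threshold: float) -> dict: ...
def compare_periods(a: str, b: str) -> dict: ...

# The registry is the entire surface the model can touch.
TOOLS = {fn.__name__: fn for fn in (get_revenue_by_period, list_at_risk_customers, compare_periods)}

def dispatch(tool_call: dict) -> dict:
    """The model picks the name and fills the arguments; this code runs the actual query."""
    fn = TOOLS.get(tool_call["name"])
    if fn is None:
        return {"error": f"unknown tool: {tool_call['name']}"}   # never fall back to raw SQL
    return fn(**tool_call["arguments"])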
MCP — Model Context Protocol — is the standardized way to describe those tools to the model. Same format whether you're plugging into Claude, ChatGPT, Cursor, or a custom agent. Define the tool once, it works everywhere.
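Concretely, an MCP server advertises each tool as a name, a description, and a JSON Schema for its input. Roughly the descriptor shape, with illustrative field values:

# What a tools/list response carries for one tool (values are illustrative).
GET_REVENUE_TOOL = {
    "name": "get_revenue_by_period",
    "description": "Total revenue between two dates, resolved through the semantic layer.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "start": {"type": "string", "format": "date"},
            "end": {"type": "string", "format": "date"},
        },
        "required": ["start", "end"],
    },
}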
Behind each tool lives a semantic layer: a dictionary of named metrics with exactly one definition each. When the tool says "revenue," there's exactly one SQL behind it — not six slightly-different versions scattered across dashboards.
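The layer can start as a dictionary: one entry per metric, one SQL definition per entry. A sketch; the mart table and column names are invented:

# One metric, one definition. Tools look SQL up here; they never write it.
METRICS = {
    "revenue": """
        SELECT date_trunc('{grain}', order_date) AS period, SUM(amount) AS value
        FROM mart_orders
        WHERE order_date BETWEEN %(start)s AND %(end)s
        GROUP BY 1
    """,
    # "orders", "cac", "ltv", ... each get exactly one entry here.
}

def metric_sql(name: str, grain: str) -> str:
    return METRICS[name].format(grain=grain)   # anything undefined is a KeyError, by design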
Good tools are verbs the business uses: get_pipeline, list_at_risk_customers, compare_periods. Bad tools are thin wrappers around SQL: query_warehouse(sql). The first shape is constrained and auditable; the second is an LLM with a loaded gun.
find_stuck_deals: one verb that replaced a year of Slack back-and-forth. It takes a stage and a staleness threshold and returns matching deals, a summary, and suggested next steps. It retired a channel where people asked the sales ops lead the same question twice a day.
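A sketch of its surface; the stage values are borrowed from the pipeline example below, and the 14-day default is illustrative:

from typing import Literal

def find_stuck_deals(
    stage: Literal["qualified", "proposal", "negotiation"],
    stale_days: int = 14,   # the staleness threshold
) -> dict:
    """Open deals in `stage` with no activity for `stale_days` days.

    Returns: {"structured": {"deals": [...], "next_steps": [...]}, "summary": str}
    """
    ...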
The structured part is for downstream tools and renderers. The summary is what the LLM quotes back to the user. Forcing both keeps answers grounded — the LLM has to reconcile what it wants to say with what the data actually returned.
One tool. Strongly typed inputs. A query it actually runs (against marts, through the semantic layer). A result shape the model can narrate.
// Tool surface the LLM sees — names and types are the prompt.
export const getPipeline = defineTool({
  name: "get_pipeline",
  description:
    "Return open sales pipeline, optionally filtered by owner, " +
    "stage, or close-date range. Returns total value, deal count, " +
    "weighted forecast, and a short human-readable summary.",
  input: z.object({
    owner: z.string().optional(),
    stage: z.enum(["qualified", "proposal", "negotiation"]).optional(),
    close_by: z.string().datetime().optional(),
  }),
  async run({ owner, stage, close_by }, ctx) {
    // 1. Enforce row-level access for the calling user.
    const scoped = await scopeToUser(ctx.user);

    // 2. Call the semantic layer — NOT raw SQL.
    const rows = await metric("pipeline").by("deal").where({
      owner,
      stage,
      close_date_lte: close_by,
      account_id_in: scoped.accounts,
    });

    // 3. Shape it for both tool-chaining AND narration.
    const totalValue = sum(rows, "amount");
    const weighted = sum(rows, (r) => r.amount * r.stage_prob);
    return {
      structured: {
        total_value: totalValue,
        deal_count: rows.length,
        weighted_forecast: weighted,
        deals: rows,
      },
      summary:
        `${rows.length} open deals worth $${fmt(totalValue)}. ` +
        `Weighted forecast: $${fmt(weighted)}.`,
    };
  },
});
Every tool you add makes the model slower and more confused. A small, well-named set of verbs beats an exhaustive surface. When in doubt, compose existing tools instead of adding new ones.
Input, output, user, timestamp. This is your audit trail, your debugging log, and your eval dataset — all at once. Tool-call logs are the single most valuable artifact a pipeline produces after six months. Keep them.
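A sketch of one way to keep them; a JSONL file is enough to start, a warehouse table later:

import json
from datetime import datetime, timezone

def record_tool_call(user: str, tool: str, args: dict, result: dict,
                     path: str = "tool_calls.jsonl") -> None:
    """Append input, output, user, and timestamp: audit trail, debug log, eval set."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "tool": tool,
        "input": args,
        "output": result,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry, default=str) + "\n")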
# tools/metrics.py — what the agent sees
from typing import Literal
from semantic_layer import resolve_metric
def query_metric(
    metric: Literal["revenue", "orders", "cac", "ltv", "active_customers"],
    grain: Literal["day", "week", "month"],
    start_date: str,   # YYYY-MM-DD
    end_date: str,
    segment: str | None = None,   # e.g. 'channel:paid_search'
) -> dict:
    """Query a defined business metric over a time window.

    Returns: {"metric", "grain", "rows": [{"period", "value", "source_tables": [...]}]}
    Every row cites the source tables that produced it.
    """
    return resolve_metric(metric, grain, start_date, end_date, segment).to_dict()
Load-bearing: one tightly-scoped tool that only queries pre-defined metrics beats letting the agent write raw SQL every time. The Literal enum prevents hallucinated metric names; the response shape is fixed; source-table citations are baked into the contract, not optional and not after-the-fact.
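What a call against that surface looks like, with illustrative values:

result = query_metric("revenue", "month", "2025-01-01", "2025-06-30",
                      segment="channel:paid_search")
# => {"metric": "revenue", "grain": "month",
#     "rows": [{"period": "2025-01", "value": 184200.0, "source_tables": ["mart_orders"]}, ...]}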