The fifth and final build layer: where the AI actually queries the lake. The reflex is to build a chatbot, and it is almost always wrong. The pattern that works in production is structured AI features built on the request envelope state machine. This is also the layer where most DIY attempts go sideways.
Expose the modeled lake through a small set of typed tools. The LLM calls them. Never writes SQL. Wrap every irreversible action in a human-approval gate. Track cost per call from day one.
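Those last two rules fit in a dozen lines. A minimal sketch, not a framework: guarded_call and the COSTS ledger are hypothetical names, and the input() prompt stands in for a real review UI.

import time

COSTS: list[dict] = []   # per-call cost ledger, kept from day one

def guarded_call(tool, args: dict, user: str, irreversible: bool = False):
    """Run one tool call; anything irreversible stops for a human first."""
    if irreversible and input(f"{user} wants {tool.__name__}({args}). Approve? [y/N] ") != "y":
        return {"status": "rejected_by_human"}
    t0 = time.time()
    result = tool(**args)
    COSTS.append({"tool": tool.__name__, "user": user, "seconds": round(time.time() - t0, 3)})
    return result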
Tool-calling: instead of letting the LLM write SQL against raw tables (where it can hallucinate wrong joins or miss a filter and report "total revenue" for a subset), you give it a short list of named functions — get_revenue_by_period(start, end), list_at_risk_customers(threshold), compare_periods(a, b). The model picks one, fills in the arguments, and your code runs the actual query.
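Mechanically, that dispatch is small. A sketch, assuming the model has already produced a structured call like {"name": "get_revenue_by_period", "arguments": {...}}; the three stubs stand in for the real query code.

# Stubs stand in for the real implementations behind each verb.
def get_revenue_by_period(start: str, end: str) -> dict: ...
def list_at_risk_customers(threshold: float) -> dict: ...
def compare_periods(a: str, b: str) -> dict: ...

# The registry is the entire surface the model can touch.
TOOLS = {fn.__name__: fn for fn in (get_revenue_by_period, list_at_risk_customers, compare_periods)}

def dispatch(tool_call: dict) -> dict:
    """The model picks the name and fills the arguments; this code runs the actual query."""
    fn = TOOLS.get(tool_call["name"])
    if fn is None:
        return {"error": f"unknown tool: {tool_call['name']}"}   # never fall back to raw SQL
    return fn(**tool_call["arguments"])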
MCP — Model Context Protocol — is the standardized way to describe those tools to the model. Same format whether you're plugging into Claude, ChatGPT, Cursor, or a custom agent. Define the tool once, it works everywhere.
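Concretely, an MCP server advertises each tool as a name, a description, and a JSON Schema for its input. Roughly the descriptor shape, with illustrative field values:

# What a tools/list response carries for one tool (values are illustrative).
GET_REVENUE_TOOL = {
    "name": "get_revenue_by_period",
    "description": "Total revenue between two dates, resolved through the semantic layer.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "start": {"type": "string", "format": "date"},
            "end": {"type": "string", "format": "date"},
        },
        "required": ["start", "end"],
    },
}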
Behind each tool lives a semantic layer: a dictionary of named metrics with exactly one definition each. When the tool says "revenue," there's exactly one SQL behind it — not six slightly-different versions scattered across dashboards.
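The layer can start as a dictionary: one entry per metric, one SQL definition per entry. A sketch; the mart table and column names are invented:

# One metric, one definition. Tools look SQL up here; they never write it.
METRICS = {
    "revenue": """
        SELECT date_trunc('{grain}', order_date) AS period, SUM(amount) AS value
        FROM mart_orders
        WHERE order_date BETWEEN %(start)s AND %(end)s
        GROUP BY 1
    """,
    # "orders", "cac", "ltv", ... each get exactly one entry here.
}

def metric_sql(name: str, grain: str) -> str:
    return METRICS[name].format(grain=grain)   # anything undefined is a KeyError, by design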
Good tools are verbs the business uses: get_pipeline, list_at_risk_customers, compare_periods. Bad tools are thin wrappers around SQL: query_warehouse(sql). The first shape is constrained and auditable; the second is an LLM with a loaded gun.
find_stuck_deals: one verb that replaced a year of Slack back-and-forth. It takes a stage and a staleness threshold and returns matching deals, a summary, and suggested next steps. It retired a channel where people asked the sales ops lead the same question twice a day.
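A sketch of its surface; the stage values are borrowed from the pipeline example below, and the 14-day default is illustrative:

from typing import Literal

def find_stuck_deals(
    stage: Literal["qualified", "proposal", "negotiation"],
    stale_days: int = 14,   # the staleness threshold
) -> dict:
    """Open deals in `stage` with no activity for `stale_days` days.

    Returns: {"structured": {"deals": [...], "next_steps": [...]}, "summary": str}
    """
    ...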
The structured part is for downstream tools and renderers. The summary is what the LLM quotes back to the user. Forcing both keeps answers grounded — the LLM has to reconcile what it wants to say with what the data actually returned.
One tool. Strongly typed inputs. A query it actually runs (against marts, through the semantic layer). A result shape the model can narrate.
// Tool surface the LLM sees — names and types are the prompt.
export const getPipeline = defineTool({
  name: "get_pipeline",
  description:
    "Return open sales pipeline, optionally filtered by owner, " +
    "stage, or close-date range. Returns total value, deal count, " +
    "weighted forecast, and a short human-readable summary.",
  input: z.object({
    owner: z.string().optional(),
    stage: z.enum(["qualified", "proposal", "negotiation"]).optional(),
    close_by: z.string().datetime().optional(),
  }),
  async run({ owner, stage, close_by }, ctx) {
    // 1. Enforce row-level access for the calling user.
    const scoped = await scopeToUser(ctx.user);

    // 2. Call the semantic layer — NOT raw SQL.
    const rows = await metric("pipeline").by("deal").where({
      owner,
      stage,
      close_date_lte: close_by,
      account_id_in: scoped.accounts,
    });

    // 3. Shape it for both tool-chaining AND narration.
    const totalValue = sum(rows, "amount");
    const weighted = sum(rows, (r) => r.amount * r.stage_prob);
    return {
      structured: {
        total_value: totalValue,
        deal_count: rows.length,
        weighted_forecast: weighted,
        deals: rows,
      },
      summary:
        `${rows.length} open deals worth $${fmt(totalValue)}. ` +
        `Weighted forecast: $${fmt(weighted)}.`,
    };
  },
});
Every tool you add makes the model slower and more confused. A small, well-named set of verbs beats an exhaustive surface. When in doubt, compose existing tools instead of adding new ones.
Input, output, user, timestamp. This is your audit trail, your debugging log, and your eval dataset — all at once. Tool-call logs are the single most valuable artifact a pipeline produces after six months. Keep them.
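A sketch of one way to keep them; a JSONL file is enough to start, a warehouse table later:

import json
from datetime import datetime, timezone

def record_tool_call(user: str, tool: str, args: dict, result: dict,
                     path: str = "tool_calls.jsonl") -> None:
    """Append input, output, user, and timestamp: audit trail, debug log, eval set."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "tool": tool,
        "input": args,
        "output": result,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry, default=str) + "\n")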
# tools/metrics.py — what the agent sees
from typing import Literal
from semantic_layer import resolve_metric
def query_metric(
    metric: Literal["revenue", "orders", "cac", "ltv", "active_customers"],
    grain: Literal["day", "week", "month"],
    start_date: str,   # YYYY-MM-DD
    end_date: str,
    segment: str | None = None,   # e.g. 'channel:paid_search'
) -> dict:
    """Query a defined business metric over a time window.

    Returns: {"metric", "grain", "rows": [{"period", "value", "source_tables": [...]}]}
    Every row cites the source tables that produced it.
    """
    return resolve_metric(metric, grain, start_date, end_date, segment).to_dict()
Load-bearing: one tightly-scoped tool that only queries pre-defined metrics beats letting the agent write raw SQL every time. The Literal enum prevents hallucinated metric names; the response shape is fixed; source-table citations are baked into the contract, not optional and not after-the-fact.
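What a call against that surface looks like, with illustrative values:

result = query_metric("revenue", "month", "2025-01-01", "2025-06-30",
                      segment="channel:paid_search")
# => {"metric": "revenue", "grain": "month",
#     "rows": [{"period": "2025-01", "value": 184200.0, "source_tables": ["mart_orders"]}, ...]}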