Layer 2 of 5 · ~10 min read
Ingest — connectors

Ingest — get data in. Reliably.

Once L1 has identified what sources you need, this layer is how their data reaches the lake. Three patterns — real-time webhooks, scheduled polling, batch — picked per source. Most people try to make it clever. We try to make it predictable.

Our take

Custom connectors — every time. Built for the business they plug into. Idempotent writes. Reconciliation backstops on every webhook.

First, the words
Connector

A connector is a little courier that picks up data from a system and drops it in your storage.

Imagine you have three suppliers — one sends you invoices, one sends receipts, one sends orders. A connector is a courier you hire to visit that supplier on a schedule, collect whatever's new, and leave a copy on your loading dock (the raw zone). One courier per supplier. Same uniform, same clipboard, same handoff procedure.

In software terms: a connector is a small program that authenticates to a source — your CRM, your payments processor, your app's database — pulls new records, and writes them as files (to a lake) or rows (to a warehouse). Nothing fancy. The hard part isn't the code; it's the hundred edge cases each source has.
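In code, that shape is small. A minimal sketch, assuming a REST source that filters on an updated_after parameter — the endpoint, token handling, and file layout here are illustrative, not any real vendor's API:

import pathlib
import httpx

def run_connector(base_url: str, token: str, cursor: str, out_dir: str) -> None:
    """Authenticate, pull everything newer than `cursor`, land it untouched."""
    resp = httpx.get(
        f"{base_url}/records",                       # hypothetical endpoint
        params={"updated_after": cursor},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    # One file per run, keyed by cursor — rerunning the same window
    # overwrites the same file instead of duplicating records.
    out = pathlib.Path(out_dir) / f"records_{cursor}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(resp.text)                        # exactly what the API returned

The hundred edge cases — pagination, rate limits, retries — are what the artifact at the end of this page deals with.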

Why this matters
"Connecting to a new data source" sounds like one thing, but each source has its own auth, its own rate limits, its own way of telling you what changed. Every business ends up with a handful of these — and they're the layer most people underestimate.
Our default

Every source gets its own connector — written for this business, this schema, this SLA.

Generic connectors optimize for the average SaaS. Your business isn't average — you have one internal system of record that actually matters, two vendors with non-standard APIs, and a spreadsheet somebody swears is temporary. We write the connector per source so every ingest path has the same ownership, the same retry semantics, the same observability hooks.

What this covers: your Postgres / Mongo / internal API · Salesforce · HubSpot · Stripe · Google & Meta Ads · Zendesk · Intercom · Segment · Amplitude · CSV drops · partner SFTP · one-off APIs.

For context
Third-party tools — Fivetran · Airbyte · Stitch — are legitimate and plenty of teams run on them happily. If the long tail of vendor APIs is the only thing you're ingesting and you have nothing custom, use them. For any build we take on, the custom path wins on total cost: one code style, one alerting path, one way to explain what went wrong.
Why we commit up front

The one thing managed connectors can't give you is consistency with the rest of your system.

Same deploys, same secrets, same on-call rotation, same logging. One connector that breaks the pattern causes more incidents than five that follow it.

Heuristic

Full refresh is fine until it isn't.

Under a million rows, refreshing the whole table nightly is simpler than anything incremental. Fewer moving parts, fewer lies. Optimize when it actually hurts.
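Concretely, the full-refresh path is a handful of lines. A sketch, assuming a reachable Postgres and illustrative names — no cursors, no merge logic, the landed file simply is the table:

import csv
import psycopg2  # or whatever driver matches your source

def full_refresh(dsn: str, table: str, out_path: str) -> None:
    """Dump the whole table; rerunning overwrites the file."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(f"SELECT * FROM {table}")  # table name from config, never user input
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([col.name for col in cur.description])  # header row
            writer.writerows(cur)  # fine under ~1M rows; revisit when it hurts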

Rule 01

Land it raw. Never transform on the way in.

Raw tables are append-only, untouched, ugly. JSON blobs where JSON blobs came from. You'll thank yourself when you need to reprocess history because the business definition of "customer" changed.
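What that looks like at the write site — a sketch with illustrative paths, where the object key is derived from source and batch so a retried write lands on the same object instead of a near-duplicate:

import hashlib
import pathlib

def land_raw(payload: bytes, source: str, batch_id: str, lake_root: str) -> pathlib.Path:
    """Append-only landing: bytes in, bytes out, no parsing, no cleanup."""
    key = hashlib.sha256(f"{source}:{batch_id}".encode()).hexdigest()[:16]
    path = pathlib.Path(lake_root) / "raw" / source / f"{batch_id}_{key}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)  # exactly what the source sent, JSON blobs and all
    return path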

Pitfall
Don't let anyone "clean up" raw in a one-off script. The moment raw isn't reproducible, every downstream mart becomes suspect.
Two ways to stay in sync
Polling vs CDC

Two ways to know what changed in a source. Check the mailbox, or get a phone call.

Polling is checking the mailbox every 15 minutes. You ask the source: "what changed since I last looked?" The source replies with the new stuff. Simple. Works for almost anything. You miss deletes unless the source tells you about them, and there's always a little lag (15 min, 5 min, whatever you set).

CDC (change data capture) is subscribing to a phone call. The source calls you every time a row is inserted, updated, or deleted — usually by reading the database's own transaction log. You get changes within seconds, including deletes. But you need a persistent connection and some plumbing (Debezium, a replication slot, etc.) that needs babysitting.
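The plumbing for the Postgres case, sketched with psycopg2's logical replication support. This assumes a replication slot already created with a decoding plugin such as wal2json; the slot name, DSN, and the land writer are illustrative:

import psycopg2
import psycopg2.extras

def land(payload: str) -> None:
    print(payload)  # stand-in for your raw writer

conn = psycopg2.connect(
    "dbname=app user=replicator",  # illustrative DSN
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="ingest_slot", decode=True)

def on_change(msg):
    land(msg.payload)                                   # insert/update/delete, seconds old
    msg.cursor.send_feedback(flush_lsn=msg.data_start)  # ack it, or the slot backs up

cur.consume_stream(on_change)  # blocks forever — this is the babysitting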

Which to pick
Polling for SaaS APIs you don't control. CDC for your own database when you care about deletes or want sub-minute freshness. Most real pipelines use both.
Polling · simple, defensible
select * where updated_at > last_sync
[Diagram: rows written at 10:00, 10:05, 10:10, 10:15 reach storage on each 15-minute poll.]
Good for
SaaS APIs, sources you don't control, tables where updated_at is trustworthy.
Watch for
Back-dated writes and hard-deletes — you need a reconcile pass or a tombstone convention (sketched after this card).
Latency
Whatever the poll interval is. 5–15m is usual.
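The reconcile pass mentioned above, sketched at its simplest — source_ids and stored_ids stand in for a full ID pull from the source and a key scan of what you've already landed:

def reconcile(source_ids: set[str], stored_ids: set[str]) -> list[dict]:
    """Turn hard-deletes polling missed into explicit tombstone records."""
    deleted = stored_ids - source_ids
    return [{"id": i, "_deleted": True} for i in sorted(deleted)]

Run it nightly; the tombstones land in raw like any other record.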
Log-based CDC · tight coupling, near real-time
Read the write-ahead log directly
[Diagram: every INSERT, UPDATE, and DELETE in the source DB's WAL streams to storage within seconds, in commit order, deletes included.]
Good for
Your own Postgres / Mongo / MySQL — anywhere you control the log.
Watch for
Replication slot babysitting, schema changes, and keeping the consumer caught up.
Latency
Seconds, if you want it.
Both are fine
Neither is the "right" answer. Pick the one that matches the source. For a SaaS API with a sane updated_at, polling is fewer moving parts. For your own transactional database where you care about deletes and sub-minute freshness, CDC earns the extra moving parts. Most real deployments use both — polling for vendors, CDC for systems of record.
Default — Polling with updated_at
How it works: the connector queries WHERE updated_at > last_cursor on a schedule; new rows land in raw.
Cost: lower operational cost. One cron job, one cursor. Easy to reason about, easy to backfill.
Pick when: SaaS API, modest freshness needs (5–30 min), no hard requirement to track deletes.

When needed — CDC (Debezium / DMS / native)
How it works: log-based — tail the source DB’s WAL/binlog and stream every change, deletes included.
Cost: higher operational cost. Streaming infra, ordering guarantees, replay strategy. Earns its weight when you need it.
Pick when: your own transactional database, sub-minute freshness, deletes matter (refunds, soft deletes, GDPR).
Artifact
connectors/klaviyo/extract.py · ~25 lines · Python 3.11
from time import sleep
from random import uniform
import httpx

BASE = "https://a.klaviyo.com/api"
RATE_LIMIT_RPS = 10  # Klaviyo's published cap; halve for prod headroom.
AUTH = {"Authorization": "Klaviyo-API-Key <private-key>"}  # placeholder — real headers come from secrets

def fetch_paginated(endpoint: str, since: str, *, max_retries: int = 5):
    """Pull every record updated since `since`. Retries with jitter on 429/5xx."""
    url, attempt = f"{BASE}/{endpoint}?filter=greater-than(updated,{since})", 0
    while url:
        try:
            r = httpx.get(url, headers=AUTH, timeout=30)
            r.raise_for_status()
            payload = r.json()
            yield from payload["data"]
            url = payload.get("links", {}).get("next")  # cursor pagination; None ends the loop
            attempt = 0                                 # reset the retry budget per page
            sleep(1 / RATE_LIMIT_RPS)
        except httpx.HTTPStatusError as e:
            if e.response.status_code in (429, 502, 503, 504) and attempt < max_retries:
                attempt += 1
                sleep((2 ** attempt) + uniform(0, 1))   # exponential backoff + jitter
                continue
            raise

Load-bearing
The + uniform(0, 1) jitter is non-negotiable. Without it, every connector retries at the same instant after a 429 storm and the API rate-limits you twice as long.
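For completeness, a hypothetical wiring of extract.py into a run. The cursor location and the attributes.updated record shape are assumptions, and the cursor only advances after the raw write succeeds:

import json
import pathlib

CURSOR_FILE = pathlib.Path("state/klaviyo.cursor")  # illustrative; in practice, wherever state lives

def run(endpoint: str = "events") -> None:
    since = CURSOR_FILE.read_text().strip() if CURSOR_FILE.exists() else "2020-01-01"
    batch = list(fetch_paginated(endpoint, since))  # from extract.py above
    if not batch:
        return
    out = pathlib.Path(f"raw/klaviyo/{endpoint}_{since}.json")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(batch))               # land raw first
    latest = max(r["attributes"]["updated"] for r in batch)  # assumed record shape
    CURSOR_FILE.parent.mkdir(parents=True, exist_ok=True)
    CURSOR_FILE.write_text(latest)                  # advance only after landing succeeds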