Once L1 has identified what sources you need, this layer is how their data reaches the lake. Three patterns — real-time webhooks, scheduled polling, batch — picked per source. Most people try to make it clever. We try to make it predictable.
Custom connectors — every time. Built for the business they plug into. Idempotent writes. Reconciliation backstops on every webhook.
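Two phrases there do the heavy lifting. As a minimal sketch of what "idempotent writes" plus a reconciliation backstop mean in practice (the `raw_events` table, its columns, and SQLite as the landing store are stand-ins, not a prescribed design): every write is keyed on the source's own event id, so a webhook delivered twice, or re-fetched later by a reconciliation poll, never double-counts.

```python
# Sketch only: table and column names are hypothetical, not a prescribed schema.
import json
import sqlite3  # stand-in for whatever the raw zone actually is

conn = sqlite3.connect("raw.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS raw_events (
        event_id    TEXT PRIMARY KEY,   -- source's own id; the idempotency key
        payload     TEXT NOT NULL,      -- untouched JSON blob
        ingested_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def write_event(event: dict) -> None:
    """Insert-or-ignore keyed on the source's event id.

    A webhook delivered twice, or re-pulled by the nightly reconciliation
    poll, lands on the same primary key and is silently skipped.
    """
    conn.execute(
        "INSERT OR IGNORE INTO raw_events (event_id, payload) VALUES (?, ?)",
        (event["id"], json.dumps(event)),
    )
    conn.commit()
```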
Imagine you have six suppliers — one sends you invoices, one sends receipts, one sends orders, and so on. A connector is a courier you hire to visit one supplier on a schedule, collect whatever's new, and leave a copy on your loading dock (the raw zone). One courier per supplier. Same uniform, same clipboard, same handoff procedure.
In software terms: a connector is a small program that authenticates to a source — your CRM, your payments processor, your app's database — pulls new records, and writes them as files (to a lake) or rows (to a warehouse). Nothing fancy. The hard part isn't the code; it's the hundred edge cases each source has.
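The shape barely changes from source to source. A hedged skeleton of that loop, where `pull_since`, `land_batch`, and the cursor file are hypothetical stand-ins rather than a real framework:

```python
# Connector skeleton: authenticate, pull what's new, write it down, remember where you stopped.
# `pull_since`, `land_batch`, and the cursor file are stand-ins, not a real framework.
from pathlib import Path

CURSOR_FILE = Path("/var/lib/connectors/crm.cursor")  # illustrative location

def run_connector(pull_since, land_batch) -> None:
    cursor = CURSOR_FILE.read_text().strip() if CURSOR_FILE.exists() else "1970-01-01T00:00:00Z"
    records, new_cursor = pull_since(cursor)   # talk to the source's API or database
    if records:
        land_batch(records)                    # write untouched records to the raw zone
    CURSOR_FILE.write_text(new_cursor)         # advance only after the write succeeded
```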
Generic connectors optimize for the average SaaS. Your business isn't average — you have one internal system of record that actually matters, two vendors with non-standard APIs, and a spreadsheet somebody swears is temporary. We write the connector per source so every ingest path has the same ownership, the same retry semantics, the same observability hooks.
What this covers: your Postgres / Mongo / internal API · Salesforce · HubSpot · Stripe · Google & Meta Ads · Zendesk · Intercom · Segment · Amplitude · CSV drops · partner SFTP · one-off APIs.
Same deploys, same secrets, same on-call rotation, same logging. One connector that breaks the pattern is worth more incidents than five that follow it.
Under a million rows, refreshing the whole table nightly is simpler than anything incremental. Fewer moving parts, fewer lies. Optimize when it actually hurts.
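A sketch of what the nightly full refresh looks like in practice; the DSN, table name, and lake layout below are illustrative assumptions, not a prescription:

```python
# Full-refresh sketch: every night, pull the whole table and write a dated snapshot.
# `SOURCE_DSN`, the table name, and the lake path are placeholders.
from datetime import date
from pathlib import Path
import pandas as pd
import sqlalchemy

SOURCE_DSN = "postgresql://ingest_ro@prod-replica/app"   # hypothetical read-only replica
LAKE_ROOT = Path("/lake/raw/app/customers")

def nightly_full_refresh() -> Path:
    engine = sqlalchemy.create_engine(SOURCE_DSN)
    df = pd.read_sql("SELECT * FROM customers", engine)   # whole table; fine under ~1M rows
    out = LAKE_ROOT / f"load_date={date.today().isoformat()}" / "customers.parquet"
    out.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out, index=False)   # rerunning tonight's job just overwrites tonight's file
    return out
```

Rerunning the job overwrites today's snapshot and nothing else, which is the point: no cursors, no merge logic, nothing to drift.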
Raw tables are append-only, untouched, ugly. JSON blobs where JSON blobs came from. You'll thank yourself when you need to reprocess history because the business definition of "customer" changed.
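A sketch of how pulls land in that raw zone, with the path layout and metadata fields as assumptions rather than a fixed contract: one new file per pull, payloads untouched, nothing rewritten.

```python
# Append-only raw landing: one new file per pull, payloads exactly as received.
# The path layout and metadata fields here are illustrative, not a fixed contract.
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

RAW_ROOT = Path("/lake/raw")

def land_batch(source: str, stream: str, records: list[dict]) -> Path:
    """Write one newline-delimited JSON file per pull; never modify earlier files."""
    now = datetime.now(timezone.utc)
    path = (RAW_ROOT / source / stream
            / f"dt={now:%Y-%m-%d}" / f"{now:%H%M%S}_{uuid.uuid4().hex}.jsonl")
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as f:
        for rec in records:
            # Keep the payload exactly as received; only wrap it with load metadata.
            f.write(json.dumps({"_loaded_at": now.isoformat(),
                                "_source": source,
                                "payload": rec}) + "\n")
    return path
```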
Polling is checking the mailbox every 15 minutes. You ask the source: "what changed since I last looked?" The source replies with the new stuff. Simple. Works for almost anything. You miss deletes unless the source tells you about them, and there's always a little lag (15 min, 5 min, whatever you set).
CDC (change data capture) is subscribing to a phone call. The source calls you every time a row is inserted, updated, or deleted — usually by reading the database's own transaction log. You get changes within seconds, including deletes. But you need a persistent connection and some plumbing (Debezium, a replication slot, etc.) that needs babysitting.
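To make the plumbing concrete, here is a hedged sketch of registering a Debezium Postgres connector through the Kafka Connect REST API; the endpoint, credentials, and table list are placeholders, and your setup may use a different decoding plugin or sink entirely.

```python
# Sketch: register a Debezium Postgres CDC connector via the Kafka Connect REST API.
# Endpoint, credentials, and table list are placeholders for illustration.
import httpx

connect_url = "http://kafka-connect:8083/connectors"   # assumed Connect endpoint

config = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",                 # logical decoding via a replication slot
        "database.hostname": "prod-db",
        "database.port": "5432",
        "database.user": "cdc_reader",
        "database.password": "<from-secrets-manager>",
        "database.dbname": "app",
        "topic.prefix": "app",                     # topics become app.public.orders, etc.
        "table.include.list": "public.orders,public.customers",
    },
}

r = httpx.post(connect_url, json=config, timeout=30)
r.raise_for_status()   # inserts, updates, and deletes now stream as events within seconds
```

The replication slot is the piece that needs babysitting: if the consumer stalls, the slot holds WAL on the primary until someone notices.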
For a vendor API whose `updated_at` is trustworthy, polling is fewer moving parts. For your own transactional database where you care about deletes and sub-minute freshness, CDC earns the extra moving parts. Most real deployments use both — polling for vendors, CDC for systems of record. A polling connector is just `WHERE updated_at > last_cursor` on a schedule; new rows land in raw.

```python
from time import sleep
from random import uniform

import httpx

BASE = "https://a.klaviyo.com/api"
RATE_LIMIT_RPS = 10  # Klaviyo's published cap; halve for prod headroom.
AUTH = {"Authorization": "Klaviyo-API-Key <private-key>"}  # placeholder; pull the key from your secrets manager

def fetch_paginated(endpoint: str, since: str, *, max_retries: int = 5):
    """Pull every record updated since `since`. Retries with jitter on 429/5xx."""
    url, attempt = f"{BASE}/{endpoint}?filter=greater-than(updated,{since})", 0
    while url:
        try:
            r = httpx.get(url, headers=AUTH, timeout=30)
            r.raise_for_status()
            body = r.json()
            yield from body["data"]
            url = body.get("links", {}).get("next")  # absolute URL of the next page, or None
            sleep(1 / RATE_LIMIT_RPS)
        except httpx.HTTPStatusError as e:
            if e.response.status_code in (429, 502, 503, 504) and attempt < max_retries:
                attempt += 1
                sleep((2 ** attempt) + uniform(0, 1))  # exponential backoff + jitter
                continue
            raise
```
Load-bearing: the `+ uniform(0, 1)` jitter is non-negotiable. Without it, every connector retries at the same instant after a 429 storm and the API rate-limits you twice as long.