Layer 2 of 5 · ~10 min read
Ingest — connectors

Ingest — get data in. Reliably.

Once L1 has identified what sources you need, this layer is how their data reaches the lake. Three patterns — real-time webhooks, scheduled polling, batch — picked per source. Most people try to make it clever. We try to make it predictable.

Our take

Custom connectors — every time. Built for the business they plug into. Idempotent writes. Reconciliation backstops on every webhook.

First, the words
Connector

A connector is a little courier that picks up data from a system and drops it in your storage.

Imagine you have three suppliers — one sends you invoices, one sends receipts, one sends orders. A connector is a courier you hire to visit that supplier on a schedule, collect whatever's new, and leave a copy on your loading dock (the raw zone). One courier per supplier. Same uniform, same clipboard, same handoff procedure.

In software terms: a connector is a small program that authenticates to a source — your CRM, your payments processor, your app's database — pulls new records, and writes them as files (to a lake) or rows (to a warehouse). Nothing fancy. The hard part isn't the code; it's the hundred edge cases each source has.
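In code, that shape is small. A minimal sketch, assuming a REST source that filters on an updated_after parameter — the endpoint, token handling, and file layout here are illustrative, not any real vendor's API:

import pathlib
import httpx

def run_connector(base_url: str, token: str, cursor: str, out_dir: str) -> None:
    """Authenticate, pull everything newer than `cursor`, land it untouched."""
    resp = httpx.get(
        f"{base_url}/records",                       # hypothetical endpoint
        params={"updated_after": cursor},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    # One file per run, keyed by cursor — rerunning the same window
    # overwrites the same file instead of duplicating records.
    out = pathlib.Path(out_dir) / f"records_{cursor}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(resp.text)                        # exactly what the API returned

The hundred edge cases — pagination, rate limits, retries — are what the artifact at the end of this page deals with.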

Why this matters
"Connecting to a new data source" sounds like one thing, but each source has its own auth, its own rate limits, its own way of telling you what changed. Every business ends up with a handful of these — and they're the layer most people underestimate.
Our default

Every source gets its own connector — written for this business, this schema, this SLA.

Generic connectors optimize for the average SaaS. Your business isn't average — you have one internal system of record that actually matters, two vendors with non-standard APIs, and a spreadsheet somebody swears is temporary. We write the connector per source so every ingest path has the same ownership, the same retry semantics, the same observability hooks.

What this covers: your Postgres / Mongo / internal API · Salesforce · HubSpot · Stripe · Google & Meta Ads · Zendesk · Intercom · Segment · Amplitude · CSV drops · partner SFTP · one-off APIs.

For context
Third-party tools — Fivetran · Airbyte · Stitch — are legitimate and plenty of teams run on them happily. If the long tail of vendor APIs is the only thing you're ingesting and you have nothing custom, use them. For any build we take on, the custom path wins on total cost: one code style, one alerting path, one way to explain what went wrong.
Why we commit up front

The one thing managed connectors can't give you is consistency with the rest of your system.

Same deploys, same secrets, same on-call rotation, same logging. One connector that breaks the pattern causes more incidents than five that follow it.

Heuristic

Full refresh is fine until it isn't.

Under a million rows, refreshing the whole table nightly is simpler than anything incremental. Fewer moving parts, fewer lies. Optimize when it actually hurts.
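Concretely, the full-refresh path is a handful of lines. A sketch, assuming a reachable Postgres and illustrative names — no cursors, no merge logic, the landed file simply is the table:

import csv
import psycopg2  # or whatever driver matches your source

def full_refresh(dsn: str, table: str, out_path: str) -> None:
    """Dump the whole table; rerunning overwrites the file."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(f"SELECT * FROM {table}")  # table name from config, never user input
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([col.name for col in cur.description])  # header row
            writer.writerows(cur)  # fine under ~1M rows; revisit when it hurts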

Rule 01

Land it raw. Never transform on the way in.

Raw tables are append-only, untouched, ugly. JSON blobs where JSON blobs came from. You'll thank yourself when you need to reprocess history because the business definition of "customer" changed.
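What that looks like at the write site — a sketch with illustrative paths, where the object key is derived from source and batch so a retried write lands on the same object instead of a near-duplicate:

import hashlib
import pathlib

def land_raw(payload: bytes, source: str, batch_id: str, lake_root: str) -> pathlib.Path:
    """Append-only landing: bytes in, bytes out, no parsing, no cleanup."""
    key = hashlib.sha256(f"{source}:{batch_id}".encode()).hexdigest()[:16]
    path = pathlib.Path(lake_root) / "raw" / source / f"{batch_id}_{key}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)  # exactly what the source sent, JSON blobs and all
    return path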

Pitfall
Don't let anyone "clean up" raw in a one-off script. The moment raw isn't reproducible, every downstream mart becomes suspect.
Two ways to stay in sync
Polling vs CDC

Two ways to know what changed in a source. Check the mailbox, or get a phone call.

Polling is checking the mailbox every 15 minutes. You ask the source: "what changed since I last looked?" The source replies with the new stuff. Simple. Works for almost anything. You miss deletes unless the source tells you about them, and there's always a little lag (15 min, 5 min, whatever you set).

CDC (change data capture) is subscribing to a phone call. The source calls you every time a row is inserted, updated, or deleted — usually by reading the database's own transaction log. You get changes within seconds, including deletes. But you need a persistent connection and some plumbing (Debezium, a replication slot, etc.) that needs babysitting.
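The plumbing for the Postgres case, sketched with psycopg2's logical replication support. This assumes a replication slot already created with a decoding plugin such as wal2json; the slot name, DSN, and the land writer are illustrative:

import psycopg2
import psycopg2.extras

def land(payload: str) -> None:
    print(payload)  # stand-in for your raw writer

conn = psycopg2.connect(
    "dbname=app user=replicator",  # illustrative DSN
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="ingest_slot", decode=True)

def on_change(msg):
    land(msg.payload)                                   # insert/update/delete, seconds old
    msg.cursor.send_feedback(flush_lsn=msg.data_start)  # ack it, or the slot backs up

cur.consume_stream(on_change)  # blocks forever — this is the babysitting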

Which to pick
Polling for SaaS APIs you don't control. CDC for your own database when you care about deletes or want sub-minute freshness. Most real pipelines use both.
Polling · simple, defensible
select * where updated_at > last_sync
[Diagram: rows written at 10:00, 10:05, 10:10, 10:15 reach storage on each 15-minute poll.]
Good for
SaaS APIs, sources you don't control, tables where updated_at is trustworthy.
Watch for
Back-dated writes and hard-deletes — you need a reconcile pass or a tombstone convention (sketched after this card).
Latency
Whatever the poll interval is. 5–15m is usual.
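The reconcile pass mentioned above, sketched at its simplest — source_ids and stored_ids stand in for a full ID pull from the source and a key scan of what you've already landed:

def reconcile(source_ids: set[str], stored_ids: set[str]) -> list[dict]:
    """Turn hard-deletes polling missed into explicit tombstone records."""
    deleted = stored_ids - source_ids
    return [{"id": i, "_deleted": True} for i in sorted(deleted)]

Run it nightly; the tombstones land in raw like any other record.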
Log-based CDC · tight coupling, near real-time
Read the write-ahead log directly
[Diagram: every INSERT, UPDATE, and DELETE in the source DB's WAL streams to storage within seconds, in commit order, deletes included.]
Good for
Your own Postgres / Mongo / MySQL — anywhere you control the log.
Watch for
Replication slot babysitting, schema changes, and keeping the consumer caught up.
Latency
Seconds, if you want it.
Both are fine
Neither is the "right" answer. Pick the one that matches the source. For a SaaS API with a sane updated_at, polling is fewer moving parts. For your own transactional database where you care about deletes and sub-minute freshness, CDC earns the extra moving parts. Most real deployments use both — polling for vendors, CDC for systems of record.
Default — Polling with updated_at
How it works: the connector queries WHERE updated_at > last_cursor on a schedule; new rows land in raw.
Cost: lower operational cost. One cron job, one cursor. Easy to reason about, easy to backfill.
Pick when: SaaS API, modest freshness needs (5–30 min), no hard requirement to track deletes.

When needed — CDC (Debezium / DMS / native)
How it works: log-based — tail the source DB’s WAL/binlog and stream every change, deletes included.
Cost: higher operational cost. Streaming infra, ordering guarantees, replay strategy. Earns its weight when you need it.
Pick when: your own transactional database, sub-minute freshness, deletes matter (refunds, soft deletes, GDPR).
Artifact
connectors/klaviyo/extract.py · ~25 lines · Python 3.11
from time import sleep
from random import uniform
import httpx

BASE = "https://a.klaviyo.com/api"
RATE_LIMIT_RPS = 10  # Klaviyo's published cap; halve for prod headroom.
AUTH = {"Authorization": "Klaviyo-API-Key <private-key>"}  # placeholder — real headers come from secrets

def fetch_paginated(endpoint: str, since: str, *, max_retries: int = 5):
    """Pull every record updated since `since`. Retries with jitter on 429/5xx."""
    url, attempt = f"{BASE}/{endpoint}?filter=greater-than(updated,{since})", 0
    while url:
        try:
            r = httpx.get(url, headers=AUTH, timeout=30)
            r.raise_for_status()
            payload = r.json()
            yield from payload["data"]
            url = payload.get("links", {}).get("next")  # cursor pagination; None ends the loop
            attempt = 0                                 # reset the retry budget per page
            sleep(1 / RATE_LIMIT_RPS)
        except httpx.HTTPStatusError as e:
            if e.response.status_code in (429, 502, 503, 504) and attempt < max_retries:
                attempt += 1
                sleep((2 ** attempt) + uniform(0, 1))   # exponential backoff + jitter
                continue
            raise

Load-bearing
The + uniform(0, 1) jitter is non-negotiable. Without it, every connector retries at the same instant after a 429 storm and the API rate-limits you twice as long.
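For completeness, a hypothetical wiring of extract.py into a run. The cursor location and the attributes.updated record shape are assumptions, and the cursor only advances after the raw write succeeds:

import json
import pathlib

CURSOR_FILE = pathlib.Path("state/klaviyo.cursor")  # illustrative; in practice, wherever state lives

def run(endpoint: str = "events") -> None:
    since = CURSOR_FILE.read_text().strip() if CURSOR_FILE.exists() else "2020-01-01"
    batch = list(fetch_paginated(endpoint, since))  # from extract.py above
    if not batch:
        return
    out = pathlib.Path(f"raw/klaviyo/{endpoint}_{since}.json")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(batch))               # land raw first
    latest = max(r["attributes"]["updated"] for r in batch)  # assumed record shape
    CURSOR_FILE.parent.mkdir(parents=True, exist_ok=True)
    CURSOR_FILE.write_text(latest)                  # advance only after landing succeeds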