RAG pipeline in a Next.js SaaS with Supabase pgvector
Most 'AI templates' wrap a `fetch()` to OpenAI in a system prompt and call it shipped. That's chat without retrieval — useful for Q&A demos, useless for any product where the answer needs to be grounded in customer-uploaded data. A real RAG pipeline has five distinct stages, each with its own failure modes and operational concerns.
Stage 1: ingest — uploads and source-of-truth content
Users upload PDFs, Markdown, plain text, or paste content. The first job is to extract clean text — PDFs with poor OCR or HTML with unclosed tags will quietly poison retrieval if you skip cleanup. SaaSForge AI standardizes on text extraction at upload time and stores the cleaned content alongside the original blob, scoped to the user's workspace via RLS.
Workspace scoping at the database layer is the same isolation pattern used for the rest of your multi-tenant data. Without it, a careless retrieval query could surface another tenant's chunks in the LLM context window — the worst-case privacy failure for an AI product.
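A minimal sketch of that scoping as a Supabase RLS policy, assuming a `chunks` table with a `workspace_id` column and a `workspace_members` join table (the table and policy names here are illustrative, not the shipped schema):

```sql
-- Enable row-level security so every query on chunks is filtered per workspace.
alter table chunks enable row level security;

-- Only members of the owning workspace can read its chunks.
-- Assumes a workspace_members (workspace_id, user_id) table; adjust to your schema.
create policy "chunks are workspace-scoped"
  on chunks for select
  using (
    workspace_id in (
      select workspace_id from workspace_members
      where user_id = auth.uid()
    )
  );
```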
Stage 2: chunk — splitting source text into retrievable pieces
Embedding a 50-page PDF as a single vector is useless for retrieval; the vector represents an average of everything in the document and matches no specific query well. Chunking splits text into ~500-1000 token segments with overlap, so each chunk is small enough to be a coherent retrieval unit and big enough to carry surrounding context.
Chunk overlap exists to handle the case where the answer to a query straddles a chunk boundary. Default overlap of 100-200 tokens is a reasonable starting point; tune per-corpus when retrieval feels brittle.
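A minimal sliding-window splitter along these lines — word counts stand in for real tokenization here, and the sizes are just the starting points described above, not a tuned setting:

```ts
// Sliding-window chunker: ~800-"token" chunks with 150 tokens of overlap.
// Splitting on whitespace approximates tokens; swap in your embedding
// model's tokenizer when you need accurate budgets.
export function chunkText(
  text: string,
  chunkSize = 800,
  overlap = 150,
): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = chunkSize - overlap;

  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last window covers the tail
  }
  return chunks;
}
```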
Stage 3: embed — turning chunks into vectors
Each chunk is sent to an embedding model (text-embedding-3-small for OpenAI, voyage-3 for Voyage, etc.) which returns a fixed-length vector. The vector is stored in a `vector(1536)` column in Postgres via pgvector. Embedding is the most cost-sensitive stage at ingest time — chunk well so you embed efficiently.
Embedding API errors (rate limits, timeouts) need a retry loop with backoff. SaaSForge AI ships a queue-friendly pattern so a 1000-page corpus doesn't fail halfway and leave the workspace in a half-embedded state.
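One way to wrap the embedding call, assuming the OpenAI Node SDK — the retry budget and delays are illustrative, not the shipped defaults:

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// Embed one chunk, retrying transient failures (429s, timeouts) with
// exponential backoff so a long ingest job doesn't die halfway through.
async function embedChunk(text: string, maxRetries = 5): Promise<number[]> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const res = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: text,
      });
      return res.data[0].embedding;
    } catch (err) {
      if (attempt === maxRetries) throw err;
      // 1s, 2s, 4s, ... capped at 30s
      const delay = Math.min(1000 * 2 ** attempt, 30_000);
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw new Error("unreachable");
}
```

Each successful embedding is then written next to its chunk content in a single insert: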
```ts
await sql`
  insert into chunks (workspace_id, document_id, content, embedding)
  values (
    ${workspaceId},
    ${documentId},
    ${chunkText},
    ${'[' + embedding.join(',') + ']'}::vector
  )
`;
```

Stage 4: retrieve — top-k by cosine similarity
On each chat turn, embed the user's question with the same model, then run a similarity query against the chunks table scoped to the active workspace. pgvector exposes operators (`<->` for L2, `<=>` for cosine) and supports approximate-nearest-neighbor indexes (HNSW or IVFFlat) once your chunk count grows.
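Creating the ANN index once the table grows — this assumes the `embedding` column from the insert above, and `vector_cosine_ops` is the operator class that matches the `<=>` operator:

```sql
-- HNSW index for approximate nearest-neighbor search on cosine distance.
-- A sequential scan is fine for small tables; add this once chunk counts grow.
create index chunks_embedding_hnsw
  on chunks using hnsw (embedding vector_cosine_ops);
```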
Top-k is usually 3-8. Too few and the context misses the answer; too many and the model loses focus or burns tokens. SaaSForge AI defaults to 5 with a similarity threshold so irrelevant chunks are dropped before they reach the prompt.
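The retrieval query in the same tagged-template style as the insert above; `questionEmbedding` and `workspaceId` come from the surrounding request handler, and the 0.5 similarity cutoff is a placeholder (pgvector's `<=>` returns cosine distance, so similarity is `1 - distance`):

```ts
const questionVec = "[" + questionEmbedding.join(",") + "]";

// Top-5 chunks for this workspace, dropping anything below the
// similarity threshold before it reaches the prompt.
const matches = await sql`
  select id, content, 1 - (embedding <=> ${questionVec}::vector) as similarity
  from chunks
  where workspace_id = ${workspaceId}
    and 1 - (embedding <=> ${questionVec}::vector) > 0.5
  order by embedding <=> ${questionVec}::vector
  limit 5
`;
```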
Stage 5: generate — streaming with Claude or OpenAI
The retrieved chunks become context in a structured prompt. The model streams the answer back to the UI via the Vercel AI SDK. Provider switching (Claude vs OpenAI) is a config flag, so you can compare quality and cost across the same retrieval result.
Citations matter for trust: SaaSForge AI threads chunk IDs through the prompt and surfaces them in the UI so users see which source supports which sentence. Citation-friendly retrieval is the difference between an LLM toy and a defensible knowledge product.
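A sketch of the generation step using AI SDK v4-style `streamText`, wired to the `matches` result from the retrieval query above. The provider flag, model names, and prompt wording are illustrative assumptions, not the template's exact configuration:

```ts
import { streamText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { openai } from "@ai-sdk/openai";

// Provider switching is a config flag; both models see the same retrieved context.
const model =
  process.env.AI_PROVIDER === "anthropic"
    ? anthropic("claude-3-5-sonnet-latest")
    : openai("gpt-4o-mini");

// Tag each chunk with its ID so the model can cite sources like [chunk:abc123],
// which the UI maps back to the underlying documents.
const context = matches
  .map((m) => `[chunk:${m.id}]\n${m.content}`)
  .join("\n\n");

const result = streamText({
  model,
  system:
    "Answer using only the context below. Cite the chunk IDs that support each claim.\n\n" +
    context,
  messages: [{ role: "user", content: question }],
});

// Stream tokens back to the Next.js route handler's response.
return result.toTextStreamResponse();
```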