Feature deep-dive · SaaSForge AI

RAG pipeline in a Next.js SaaS with Supabase pgvector

Most 'AI templates' wrap a `fetch()` call to OpenAI in a system prompt and call it shipped. That's chat without retrieval — useful for Q&A demos, useless for any product where the answer needs to be grounded in customer-uploaded data. A real RAG pipeline has five distinct stages, each with its own failure modes and operational concerns.

Stage 1: ingest — uploads and source-of-truth content

Users upload PDFs, Markdown, plain text, or paste content. The first job is to extract clean text — PDFs with poor OCR or HTML with unclosed tags will quietly poison retrieval if you skip cleanup. SaaSForge AI standardizes on text extraction at upload time and stores the cleaned content alongside the original blob, scoped to the user's workspace via RLS.

Workspace scoping at the database layer is the same isolation pattern used for the rest of the multi-tenant data. Without it, a careless retrieval query could surface another tenant's chunks in the LLM context window — the worst-case privacy failure for an AI product.

Stage 2: chunk — splitting source text into retrievable pieces

Embedding a 50-page PDF as a single vector is useless for retrieval; the vector represents an average of everything in the document and matches no specific query well. Chunking splits text into ~500-1000 token segments with overlap, so each chunk is small enough to be a coherent retrieval unit and big enough to carry surrounding context.

Chunk overlap exists to handle the case where the answer to a query straddles a chunk boundary. Default overlap of 100-200 tokens is a reasonable starting point; tune per-corpus when retrieval feels brittle.
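
To make the mechanics concrete, here is a minimal chunking sketch — illustrative, not the template's actual splitter — that approximates tokens by character count; a real tokenizer gives tighter boundaries.
// Split text into overlapping windows sized by an approximate token
// count (~4 characters per token). The 500-token window and 150-token
// overlap sit inside the ranges discussed above.
function chunkText(text: string, maxTokens = 500, overlapTokens = 150): string[] {
  const charsPerToken = 4; // rough heuristic, not a real tokenizer
  const windowChars = maxTokens * charsPerToken;
  const overlapChars = overlapTokens * charsPerToken;
  const chunks: string[] = [];

  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + windowChars, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    start = end - overlapChars; // step back so boundaries overlap
  }
  return chunks;
}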

Stage 3: embed — turning chunks into vectors

Each chunk is sent to an embedding model (text-embedding-3-small for OpenAI, voyage-3 for Voyage, etc.) which returns a fixed-length vector. The vector is stored in a `vector(1536)` column in Postgres via pgvector. Embedding is the most cost-sensitive stage at ingest time — chunk well so you embed efficiently.
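
One way to batch the embedding calls is the Vercel AI SDK's `embedMany` helper — a sketch, since the template may wire this differently; the model name is just the OpenAI example from above.
import { embedMany } from 'ai';
import { openai } from '@ai-sdk/openai';

// Embed a batch of chunk texts in one call; each result is a
// fixed-length vector that matches the vector(1536) column.
const { embeddings } = await embedMany({
  model: openai.embedding('text-embedding-3-small'),
  values: chunks, // string[] from the chunking stage
});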

Embedding API errors (rate limits, timeouts) need a retry loop with backoff. SaaSForge AI ships a queue-friendly pattern so a 1000-page corpus doesn't fail halfway and leave the workspace in a half-embedded state.
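
A minimal backoff wrapper, as a sketch of the idea rather than the template's queue implementation: wrap each embedding batch so a transient rate limit or timeout retries instead of aborting the whole corpus.
// Retry an async call with exponential backoff before giving up.
async function withBackoff<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  let delayMs = 1_000;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      delayMs *= 2; // 1s, 2s, 4s, ...
    }
  }
}
// usage: await withBackoff(() => embedMany({ ... })) per batch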

Storing an embedding in pgvector
// `embedding` is the number[] returned by the embedding model for this
// chunk; pgvector accepts it serialized as '[x1,x2,...]' and cast to vector.
await sql`
  insert into chunks (workspace_id, document_id, content, embedding)
  values (
    ${workspaceId},
    ${documentId},
    ${chunkText},
    ${'[' + embedding.join(',') + ']'}::vector
  )
`;

Stage 4: retrieve — top-k by cosine similarity

On each chat turn, embed the user's question with the same model, then run a similarity query against the chunks table scoped to the active workspace. pgvector exposes operators (`<->` for L2, `<=>` for cosine) and supports approximate-nearest-neighbor indexes (HNSW or IVFFlat) once your chunk count grows.

Top-k is usually 3-8. Too few and the context misses the answer; too many and the model loses focus or burns tokens. SaaSForge AI defaults to 5 with a similarity threshold so irrelevant chunks are dropped before they reach the prompt.
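
A retrieval sketch in the same style as the insert above (`queryVector` is the question's embedding serialized the same way as at ingest; the 0.75 threshold is illustrative).
// `<=>` is pgvector's cosine distance, so 1 - distance is cosine
// similarity. Scope to the workspace, take the k nearest chunks, then
// drop anything below the similarity threshold before prompt assembly.
const k = 5;
const minSimilarity = 0.75; // tune per corpus

const rows = await sql`
  select id, content, 1 - (embedding <=> ${queryVector}::vector) as similarity
  from chunks
  where workspace_id = ${workspaceId}
  order by embedding <=> ${queryVector}::vector
  limit ${k}
`;
const contextChunks = rows.filter((row) => row.similarity >= minSimilarity);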

Stage 5: generate — streaming with Claude or OpenAI

The retrieved chunks become context in a structured prompt. The model streams the answer back to the UI via the Vercel AI SDK. Provider switching (Claude vs OpenAI) is a config flag, so you can compare quality and cost across the same retrieval result.
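
A route-handler sketch of the generation step, assuming the AI SDK's `streamText` (exact helper names vary slightly across SDK versions, and the model IDs here are only examples).
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { openai } from '@ai-sdk/openai';

// Provider switching is a config flag; retrieved chunks arrive as a
// pre-rendered context block (see the citation sketch below).
const model =
  process.env.AI_PROVIDER === 'openai'
    ? openai('gpt-4o-mini')
    : anthropic('claude-3-5-sonnet-latest');

const result = await streamText({
  model,
  system: `Answer using only the context below and cite chunk ids.\n\n${contextBlock}`,
  messages,
});
return result.toDataStreamResponse(); // streams tokens to the chat UI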

Citations matter for trust: SaaSForge AI threads chunk IDs through the prompt and surfaces them in the UI so users see which source supports which sentence. Citation-friendly retrieval is the difference between an LLM toy and a defensible knowledge product.
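
One way to thread chunk IDs through the prompt — a sketch, with an illustrative marker format — is to prefix each retrieved chunk with a stable id the model can echo back, then map those markers to sources in the UI.
// Render retrieved chunks as the context block used in the prompt
// above; [chunk:id] markers let the UI link cited ids back to sources.
const contextBlock = contextChunks
  .map((chunk) => `[chunk:${chunk.id}]\n${chunk.content}`)
  .join('\n\n');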

Frequently asked

Why pgvector instead of a dedicated vector database?
pgvector lives in your existing Postgres, so retrieval queries can join against domain tables (workspace, document, user) without cross-system queries. For B2B SaaS where workspace isolation matters, that's a much simpler operational story than running a separate Pinecone or Weaviate alongside Supabase. The trade-off is that very large corpora (10M+ chunks per workspace) eventually want a dedicated store; you can swap the retrieval layer when you hit that scale.
How is the credit economy wired?
Each chat turn consumes tokens — embedding the question, the LLM input prompt, the LLM output. SaaSForge AI tracks token use per workspace and converts it to credits at a configurable rate. Credits decrement against a Stripe subscription's monthly allotment, so the billing surface matches the operational cost.
What about hallucinations?
RAG reduces hallucinations by grounding answers in retrieved chunks, but doesn't eliminate them — the model can still ignore the context. Mitigations in the template: instruct the model to cite chunks explicitly, refuse to answer when retrieval scores are below threshold, and surface citations in the UI so users verify before trusting.
Can I swap Claude for a different provider?
Yes. The provider layer uses the Vercel AI SDK's provider abstraction, so swapping Claude → OpenAI → Mistral → Bedrock is a config and SDK call, not a rewrite. Embedding provider can also be swapped, but re-embedding the corpus is a one-time migration cost when you do.
Ships in SaaSForge AI

See SaaSForge AI. Skip the deliberation.

Full source code. Lifetime updates. Polar Merchant-of-Record checkout. Private GitHub repo on purchase.