RAG starter kit: a guide to shipping retrieval in Next.js

Published Apr 29, 2026 · 7 min read

Retrieval-Augmented Generation sounds heavy in research papers and trivial in tutorial blog posts. The truth is in the middle: the pipeline is small, but the operational choices (chunk size, embedding model, retrieval top-k, eval loop) are what make it useful versus embarrassing.

The four-step pipeline

A production RAG pipeline in Next.js looks like this:

1. Upload. A route handler that accepts a file (PDF, DOCX, plain text, markdown), extracts the text, and writes a documents row with the workspace id, filename, and content hash. Cap upload size, validate MIME types, and reject anything you cannot extract: a corrupt PDF should fail the upload, not silently embed garbage.
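
A minimal sketch of that handler for an App Router project. extractText and db are hypothetical stand-ins for your extraction library and database client; the size cap and MIME allowlist are illustrative defaults.

```ts
// app/api/documents/route.ts — sketch only; extractText and db are
// hypothetical stand-ins for your extractor and database client.
import { createHash } from "node:crypto";
import { NextResponse } from "next/server";
import { extractText } from "@/lib/extract"; // hypothetical
import { db, getWorkspaceId } from "@/lib/db"; // hypothetical

const MAX_BYTES = 20 * 1024 * 1024;
const ALLOWED = new Set([
  "application/pdf",
  "text/plain",
  "text/markdown",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
]);

export async function POST(req: Request) {
  const file = (await req.formData()).get("file");
  if (!(file instanceof File)) {
    return NextResponse.json({ error: "no file" }, { status: 400 });
  }
  if (file.size > MAX_BYTES || !ALLOWED.has(file.type)) {
    return NextResponse.json(
      { error: "too large or unsupported type" },
      { status: 400 }
    );
  }

  const bytes = Buffer.from(await file.arrayBuffer());
  const contentHash = createHash("sha256").update(bytes).digest("hex");

  let text: string;
  try {
    text = await extractText(bytes, file.type);
  } catch {
    // A corrupt PDF fails here, loudly, instead of silently embedding garbage.
    return NextResponse.json({ error: "could not extract text" }, { status: 422 });
  }

  const doc = await db.documents.insert({
    workspaceId: await getWorkspaceId(req),
    filename: file.name,
    contentHash,
    content: text,
  });
  return NextResponse.json({ id: doc.id });
}
```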

2. Chunk. Split the extracted text into overlapping chunks. Common defaults: 500-800 tokens per chunk with 100-token overlap. Use a recursive splitter (paragraphs → sentences → tokens) so chunks respect document structure when possible. Write each chunk to a chunks table with the parent document id and the position.
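
A sketch of that recursive split, using a rough 4-characters-per-token heuristic in place of a real tokenizer; swap in tiktoken or similar for exact budgets.

```ts
// Recursive splitter sketch: paragraphs → sentences → hard token windows.
// approxTokens is a crude heuristic; use a real tokenizer in production.
const CHUNK_TOKENS = 600;
const OVERLAP_TOKENS = 100;
const approxTokens = (s: string) => Math.ceil(s.length / 4);

function hardWindows(text: string): string[] {
  const size = CHUNK_TOKENS * 4;
  const step = size - OVERLAP_TOKENS * 4;
  const out: string[] = [];
  for (let i = 0; i < text.length; i += step) out.push(text.slice(i, i + size));
  return out;
}

export function splitRecursive(
  text: string,
  separators = ["\n\n", ". "]
): string[] {
  if (approxTokens(text) <= CHUNK_TOKENS) return [text];
  const [sep, ...rest] = separators;
  if (!sep) return hardWindows(text); // no structure left to respect

  const chunks: string[] = [];
  let current = "";
  for (const part of text.split(sep)) {
    const candidate = current ? current + sep + part : part;
    if (approxTokens(candidate) > CHUNK_TOKENS && current) {
      chunks.push(...splitRecursive(current, rest));
      // Start the next chunk with the tail of this one as overlap.
      current = current.slice(-OVERLAP_TOKENS * 4) + sep + part;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(...splitRecursive(current, rest));
  return chunks;
}
```

Each returned string then becomes one row in the chunks table, with its array index as the position.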

3. Embed. Call an embedding model for each chunk and store the resulting vector in a pgvector column. Cost scales with total tokens embedded; for a long PDF this is the most expensive step at runtime, so cache by content hash to make re-uploading the same file free. Embedding models worth supporting in mid-2026: OpenAI's text-embedding-3 family, or Voyage AI for higher-quality retrieval (verify in current docs; embedding model names move).
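
A sketch of the embedding job using the OpenAI SDK. The db client, its tables, the batch size, and the model name are assumptions to verify; hashing per chunk gives the same re-upload-is-free property as hashing the whole file, plus cache hits across documents.

```ts
// Embedding job sketch. db and its tables are hypothetical; batch size and
// model name are assumptions — check current provider docs before shipping.
import OpenAI from "openai";
import { createHash } from "node:crypto";
import { db } from "@/lib/db"; // hypothetical

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function embedChunks(chunks: { id: string; content: string }[]) {
  // Cache key: hash of the chunk text, so identical content is never re-billed.
  const withHash = chunks.map((c) => ({
    ...c,
    hash: createHash("sha256").update(c.content).digest("hex"),
  }));
  const cached = await db.embeddings.findByHashes(withHash.map((c) => c.hash));

  for (const c of withHash.filter((c) => cached.has(c.hash))) {
    await db.chunks.setEmbedding(c.id, cached.get(c.hash)!);
  }

  const missing = withHash.filter((c) => !cached.has(c.hash));
  for (let i = 0; i < missing.length; i += 100) {
    const batch = missing.slice(i, i + 100);
    const res = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: batch.map((c) => c.content),
    });
    for (const [j, c] of batch.entries()) {
      await db.chunks.setEmbedding(c.id, res.data[j].embedding);
      await db.embeddings.cache(c.hash, res.data[j].embedding);
    }
  }
}
```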

4. Retrieve. On every chat turn, embed the user's question, do a cosine-similarity search against the workspace's chunks (order by embedding <=> :query), take the top 5-10 results, and inject them into the system prompt as context. The model now has the right passages in its context window without re-reading the whole document.
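
Step 4 as a sketch. The query helper is a hypothetical parameterized-SQL wrapper; <=> is pgvector's cosine-distance operator, and pgvector accepts vectors as '[0.1,0.2,...]' string literals, which JSON.stringify happens to produce.

```ts
// Retrieval sketch: embed the question, nearest-neighbor search scoped to
// the workspace, return the top-k passages joined into one context block.
import OpenAI from "openai";
import { query } from "@/lib/db"; // hypothetical parameterized-SQL helper

const openai = new OpenAI();

export async function retrieveContext(
  workspaceId: string,
  question: string,
  k = 5
) {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question,
  });
  const qVec = JSON.stringify(data[0].embedding);

  const rows = await query<{ content: string }>(
    `SELECT c.content
       FROM chunks c
       JOIN documents d ON d.id = c.document_id
      WHERE d.workspace_id = $1
      ORDER BY c.embedding <=> $2::vector
      LIMIT $3`,
    [workspaceId, qVec, k]
  );
  return rows.map((r) => r.content).join("\n---\n");
}
```

The joined string goes into the system prompt ahead of the user's question.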

The full pipeline, including the document UI, the chunking job, the embedding queue, and the retrieval at chat time, ships in /features/pgvector-chat as production code.

What "good RAG" actually means

Three operational concerns separate a working RAG demo from a useful one:

  • Chunk size matters more than embedding model. A poorly chunked document embedded with the best model retrieves worse than a well-chunked one embedded with a cheap model. Test chunk sizes against your real documents before tuning anything else.
  • Top-k tuning is product-specific. Top-5 is the default; some products need top-20 for breadth or top-3 for cost. Track the per-query token cost; retrieval cost is invisible until you read it on the Stripe invoice.
  • Eval loop. Maintain a small set of question-answer pairs against your real documents. Re-run them every time you change chunk size, embedding model, or retrieval strategy. Without an eval loop, "we changed the chunking and it feels better" is the entire change log. A minimal harness is sketched after this list.
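
Here is that harness, under stated assumptions: a hand-maintained evals.json of question/must-contain pairs and the retrieveContext() sketch from step 4. It scores retrieval hit rate only, which is cheap, deterministic, and enough to catch chunking regressions.

```ts
// scripts/eval-retrieval.ts — run with tsx after any pipeline change.
// evals.json and the "ws_eval" workspace are hypothetical fixtures.
import evals from "./evals.json"; // [{ "question": "...", "mustContain": "..." }]
import { retrieveContext } from "../lib/retrieve";

async function main() {
  const cases = evals as { question: string; mustContain: string }[];
  let hits = 0;
  for (const { question, mustContain } of cases) {
    const context = await retrieveContext("ws_eval", question);
    if (context.toLowerCase().includes(mustContain.toLowerCase())) hits++;
    else console.log(`MISS: ${question}`);
  }
  console.log(`retrieval hit rate: ${hits}/${cases.length}`);
}

main();
```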

What boilerplates skip and why it bites

Most "AI templates" wire a fetch() to OpenAI and call it RAG. The pieces that get skipped:

  • Auth on uploaded documents. A user from workspace A should not retrieve chunks from workspace B. RLS on the chunks and documents tables is the only enforcement that survives a missing WHERE clause in your code (a policy sketch follows this list).
  • Token cost tracking. Embedding is paid per token; retrieval is paid per chat turn. Without a credit ledger, your unit economics are opaque.
  • Re-embedding on chunk size changes. If you change the chunker, every existing document needs re-embedding. A migration that does not handle this leaves stale chunks in production.
  • Failed-extraction handling. PDFs with scanned images, corrupt DOCX files, encrypted documents: all of these fail extraction. Show the user a clear error; do not silently leave the document in an "embedding..." state forever.
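
For the RLS bullet above, a sketch of the policies as a raw-SQL migration string. It assumes a Supabase-style auth.uid() and a workspace_members table; adapt the membership check to your schema.

```ts
// migrations/0007_rls.ts — raw SQL run by a hypothetical migration runner.
// auth.uid() and workspace_members are assumptions (Supabase-style setup).
export const up = /* sql */ `
  ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
  ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;

  -- A user sees a document only through a workspace they belong to.
  CREATE POLICY member_documents ON documents FOR SELECT
    USING (workspace_id IN (
      SELECT workspace_id FROM workspace_members WHERE user_id = auth.uid()
    ));

  -- Chunks inherit the check through their parent document, so a missing
  -- WHERE clause in application code cannot leak another workspace's data.
  CREATE POLICY member_chunks ON chunks FOR SELECT
    USING (document_id IN (
      SELECT id FROM documents WHERE workspace_id IN (
        SELECT workspace_id FROM workspace_members WHERE user_id = auth.uid()
      )
    ));
`;
```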

The full set, RAG pipeline plus auth plus credit metering plus the chat UI, ships in /saasforge-ai. If you are building an AI product where retrieval quality and unit economics both matter, starting from a working integration is faster than wiring four services together from scratch. Read the pipeline shape at /features/pgvector-chat before you build your own; even if you do not buy the template, the shape is the same.
