Anyone can wire OpenAI to a vector database. Production RAG is harder — chunking strategies that respect document structure, hybrid retrieval, rerankers, evaluation pipelines that catch silent quality drift, and citation grounding that holds up in front of legal. Our RAG engineers ship that. With evals.
$20/hr
Senior RAG engineer
5 days
Free PoC, your data
85%+
Typical answer accuracy at GA
We'll build a working RAG pipeline against your real corpus and report accuracy with citations — in 5 days, free.
NDA-friendly · Replies in 4 business hours
You're not the first team to hit these. The work is in knowing which lever to pull, in what order, and how to measure whether it helped.
Usually a chunking problem, not a model problem. We add hybrid retrieval (dense + BM25), tune chunk size per document type, and add a Cohere or Voyage reranker. Recall@10 typically jumps 25–40 points.
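How the dense and BM25 result lists get merged is the crux of hybrid retrieval. A minimal sketch of one common fusion step, Reciprocal Rank Fusion, assuming each retriever returns a ranked list of document IDs (the function name and `k=60` constant are illustrative, not a specific library API):

```python
def rrf_fuse(dense_ranked, bm25_ranked, k=60):
    """Reciprocal Rank Fusion: merge two ranked lists of doc IDs.

    Each doc scores sum(1 / (k + rank)) across the lists it appears in;
    k=60 is the conventional damping constant.
    """
    scores = {}
    for ranking in (dense_ranked, bm25_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked well by both retrievers beats one that only a single
# retriever surfaced.
fused = rrf_fuse(["a", "b", "c"], ["b", "c", "a"])
```

The fused list then goes to the reranker, which sees only the top candidates and can afford a heavier cross-encoder pass.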
We rebuild the ingestion pipeline with page-aware splitters, store source metadata at chunk level, and have the LLM emit citations as structured output. Citations link to the exact page and bounding box.
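A sketch of the structured-output shape we have in mind, assuming chunk-level metadata carries page and bounding box (the class and field names here are illustrative, not a fixed schema):

```python
from dataclasses import dataclass

@dataclass
class Citation:
    doc_id: str
    page: int
    bbox: tuple  # (x0, y0, x1, y1) in PDF points

def attach_citations(answer: str, citations: list) -> dict:
    """Package the answer with machine-checkable citations -- the shape
    we ask the LLM to emit via structured output (JSON mode / tool call)."""
    return {
        "answer": answer,
        "citations": [
            {"doc_id": c.doc_id, "page": c.page, "bbox": list(c.bbox)}
            for c in citations
        ],
    }
```

Because page and bbox travel with every chunk from ingestion onward, the frontend can deep-link each citation to the exact highlighted region.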
Tables get extracted with Camelot or LlamaParse and serialized as markdown. Figures get captioned by a vision model and indexed alongside the surrounding text. The LLM can finally reason about quarterly numbers.
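The serialization step itself is simple once extraction has done the hard part. A minimal sketch, assuming the extractor hands back a header row and data rows:

```python
def table_to_markdown(header, rows):
    """Serialize an extracted table (e.g. from Camelot or pdfplumber)
    as markdown so the LLM sees row/column structure, not a word soup."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(c) for c in row) + " |")
    return "\n".join(lines)
```

The markdown string is embedded and indexed as its own chunk, with the surrounding narrative text stored as adjacent context.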
Metadata filtering at retrieval time, namespace isolation per tenant, and signed retrieval calls. Row-level security in pgvector or per-tenant indices in Pinecone — chosen to fit your auth model.
We build an eval pipeline first: golden dataset, Ragas metrics, LLM-as-judge, regression tests on every chunking or prompt change. Quality stops being vibes — it becomes a number you can defend.
Chunk cache, embedding cache, prompt cache, smaller embeddings (Matryoshka or quantized), small-model first with escalation, and pruning low-value documents. Typical 4–8x cost reduction without quality loss.
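The small-model-first pattern is the biggest single lever. A sketch of the routing logic, assuming `small_llm`, `big_llm`, and a confidence `judge` are callables you supply (all hypothetical names, and the threshold is a tunable):

```python
def answer_with_escalation(question, context, small_llm, big_llm,
                           judge, threshold=0.7):
    """Route to the cheap model first; escalate to the expensive model
    only when the judge scores the draft below threshold."""
    draft = small_llm(question, context)
    if judge(question, draft) >= threshold:
        return draft, "small"       # cheap path, most queries land here
    return big_llm(question, context), "big"  # escalation path
```

In practice most queries clear the threshold, so the expensive model only sees the hard tail — which is where the bulk of the 4–8x savings comes from.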
Not a one-size-fits-all. But this is the skeleton most production deployments converge to — and the pieces our engineers can stand up in days, not months.
We start every engagement by building evals before chasing accuracy. If we can't measure it, we can't improve it — and you can't trust it.
We collect 30–50 of your hardest real questions, label gold answers and source documents, and stand up Ragas in CI. This becomes your accuracy baseline.
End-to-end RAG against your real corpus. Chunking, hybrid retrieval, reranker, generation, citations. We report eval scores at the end of day 5.
Ingestion automation, access control, scale testing, observability dashboards, and answer quality tuning until eval scores hit your bar.
Weekly eval runs, content freshness alerts, query log analysis, and a quarterly retraining or model-upgrade window.
Three engagement models. Start with the free PoC — see real eval numbers on your real documents before you commit a dollar.
End-to-end on your data
Free
Working RAG pipeline against your real corpus with measured accuracy. You see the eval scores. We walk away if it’s not impressive.
6–10 weeks
$20K – $80K
Full ingestion, retrieval, generation, eval, and observability. Multi-tenant ready, scale tested, handed off with a runbook.
Monthly
$20 – $32/hr
Fractional or full-time RAG specialist embedded with your team. Best for ongoing tuning and adding new document sources.
The RAG ecosystem is full of demos. Production RAG requires a different muscle — measurement first, then iteration.
We refuse to ship a RAG system without a Ragas dashboard. Quality is a number, not a vibe.
BM25 + dense + reranker is our starting point — not the optimization we get to in month three.
Access control is built into retrieval, not bolted on. Your enterprise deal won’t fail security review.
If your problem is a 200-page PDF and 10 users, we’ll tell you to skip RAG and just stuff the context. We don’t over-engineer.
Because the gap between a 30-minute notebook RAG and production RAG is enormous. Production RAG means hybrid retrieval (dense + BM25 + reranker), chunking strategies tuned per document type, query rewriting, citation grounding, freshness/staleness handling, multi-tenant isolation, evaluation pipelines that catch retrieval quality drift, and cost controls. Most prototypes hit a wall at 60% answer accuracy. Our engineers know how to push past that wall.
It depends on scale, latency, filter complexity, and cost. We default to pgvector when you already run Postgres and your corpus is under 5M chunks — it’s simpler, cheaper, and the metadata filtering is excellent. Weaviate or Qdrant for tighter latency requirements with hybrid out-of-the-box. Pinecone for managed simplicity at scale. Milvus when you need GPU acceleration. We’ll recommend based on your real constraints, not vendor preference.
Chunking is half the battle. We use document-type-aware chunking — recursive character splits for prose, semantic chunking for narrative, layout-aware chunking via Unstructured.io or LlamaParse for slides and complex PDFs, and dedicated table extraction (Camelot, pdfplumber) for financial documents where row context matters. Tables get serialized to markdown so the LLM can reason about them.
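For the prose case, the core idea of recursive character splitting fits in a dozen lines. A simplified sketch (real splitters like LangChain's also handle overlap and length-by-tokens, which this omits):

```python
def recursive_split(text, max_len=500, separators=("\n\n", "\n", ". ", " ")):
    """Recursive character splitting: try the coarsest separator first,
    then recurse into oversized pieces with the finer separators."""
    if len(text) <= max_len:
        return [text]
    for i, sep in enumerate(separators):
        if sep in text:
            chunks = []
            for part in text.split(sep):
                if part:
                    chunks.extend(recursive_split(part, max_len, separators[i + 1:]))
            return chunks
    # No separator applies: hard-cut as a last resort.
    return [text[j:j + max_len] for j in range(0, len(text), max_len)]
```

The point of the separator hierarchy is that chunk boundaries fall on paragraph and sentence edges whenever possible, so retrieved chunks read as coherent units.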
We build an evaluation pipeline before we ship anything. That means a labeled dataset of 50–500 real questions with gold-standard answers and source documents, automated metrics (recall@k, MRR, faithfulness, answer relevance via Ragas or LangSmith), LLM-as-judge for nuance, and regression tests on every change to chunking, embeddings, or prompts. Without evals, RAG is unfalsifiable — and it will silently degrade.
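The retrieval metrics are simple enough to compute by hand. A sketch of recall@k and MRR over a labeled set, assuming each query has a list of retrieved doc IDs and its gold sources:

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of gold documents that appear in the top-k retrieved."""
    hits = sum(1 for doc in relevant if doc in retrieved[:k])
    return hits / len(relevant)

def mrr(queries):
    """Mean Reciprocal Rank over (retrieved_list, gold_doc) pairs:
    1/rank of the first gold hit, 0 if it never appears."""
    total = 0.0
    for retrieved, gold in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc == gold:
                total += 1.0 / rank
                break
    return total / len(queries)
```

These run in CI on every chunking, embedding, or prompt change; a drop beyond tolerance fails the build, which is what "regression tests" means in practice here.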
Yes — and we treat this as a first-class concern, not an afterthought. We implement metadata-based filtering at retrieval time (not post-filtering), scope embeddings to per-tenant namespaces, encrypt sensitive chunks at rest, and audit every retrieval call. Common patterns: row-level security via pgvector + RLS policies, namespace isolation in Pinecone, or per-tenant indices in OpenSearch.
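What "filtering at retrieval time" means for the pgvector pattern: the tenant predicate lives in the WHERE clause of the similarity query itself, so out-of-tenant chunks can never enter the candidate set. A sketch, with illustrative table and column names (the query vector is bound at execution time):

```python
def tenant_scoped_query(tenant_id: str, k: int = 10):
    """Build a pgvector similarity query with the tenant filter in the
    WHERE clause, so isolation happens inside the database rather than
    by post-filtering results in application code."""
    sql = (
        "SELECT chunk_id, content, source_page FROM chunks "
        "WHERE tenant_id = %(tenant_id)s "        # tenant isolation
        "ORDER BY embedding <=> %(qvec)s "        # cosine-distance ANN sort
        "LIMIT %(k)s"
    )
    return sql, {"tenant_id": tenant_id, "k": k}
```

With Postgres RLS policies layered on top, even a buggy application query can't leak another tenant's chunks.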
We design the ingestion pipeline first. Change-data-capture from SharePoint, Confluence, S3, Google Drive, or your DB triggers re-embedding only the changed chunks. We version embeddings so a model upgrade doesn’t require a full re-index. We TTL stale content. And we add “freshness” as a retrieval signal so newer documents rank higher when the question is time-sensitive.
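One way the freshness signal can work: blend vector similarity with an exponential recency decay. A sketch; the 180-day half-life and the 70/30 blend are illustrative starting points we'd tune against your eval set:

```python
def freshness_weighted(similarity, doc_age_days, half_life_days=180):
    """Blend vector similarity with exponential recency decay so newer
    documents outrank equally relevant stale ones on time-sensitive
    queries. A doc loses half its recency bonus every half_life_days."""
    decay = 0.5 ** (doc_age_days / half_life_days)
    return similarity * (0.7 + 0.3 * decay)
```

A query classifier decides when to apply the weighting at all, so evergreen questions (policy definitions, product specs) still retrieve on pure relevance.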
PoC: free 5-day end-to-end pipeline on a slice of your data. Production build: $20K–$80K depending on document volume, source systems, and eval rigour. Dedicated RAG engineer: $20–$32/hr. Most teams start with the PoC, see the answer quality improvement, then commit to a 6–10 week production build.
Send us your hardest 30 questions and a slice of your corpus. We'll ship a working pipeline and the eval numbers — no contract, no fees.