AI/LLM
December 10, 2025 · 12 min read

Beyond Document Q&A: Building Production RAG Systems That Actually Scale

Most RAG tutorials end at 'it works in a notebook.' The gap to production—1000+ QPS, sub-50ms latency—is where things get interesting.

RAG
Vector Search
Production ML
Performance
System Design
Dr. Jody-Ann Jones

Founder & CEO, The Data Sensei

Every RAG tutorial ends the same way: you've got a Jupyter notebook that answers questions about PDFs. It works. Ship it, right?

Wrong.

The gap between a RAG demo and a production RAG system is enormous. I learned this building CineRAG, a movie recommendation engine that needed to handle real traffic, real latency requirements, and real user expectations.

Here's what I discovered—and what most tutorials won't tell you.

The 7-Stage RAG Pipeline Nobody Talks About

Most tutorials focus on two stages: embed and retrieve. Production RAG has seven:

Ingestion → Embedding → Vector Storage → Query Processing → Retrieval → Evaluation → Optimization

Let's break down each stage and what actually matters in production.

Stage 1: Ingestion

Tutorials show you LangChain's DirectoryLoader and move on. Reality isn't that clean.

Movies have titles, descriptions, genres, cast, ratings, and images, and the source data behind all of them is messier than any tutorial admits. Here's what actually matters:

  • Normalize heterogeneous data sources (APIs, databases, files)
  • Enrich sparse data with external sources (I used TMDB API)
  • Validate data quality before embedding (garbage in, garbage out)
  • Version your data (when did this movie's rating change?)

What I Built: A pipeline that pulls from MovieLens, enriches with TMDB, validates completeness, and versions snapshots.
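
To make that concrete, here is a rough sketch of the shape of that pipeline. Everything in it is illustrative: tmdb_client, the field names, and the snapshot layout are stand-ins for this post, not CineRAG's actual code.

    import hashlib
    import json
    from datetime import datetime, timezone

    REQUIRED_FIELDS = {"title", "overview", "genres", "release_year"}

    def validate(record: dict) -> bool:
        """Reject records missing core fields before they reach the embedder."""
        return REQUIRED_FIELDS.issubset(k for k, v in record.items() if v)

    def enrich(record: dict, tmdb_client) -> dict:
        """Fill sparse fields from TMDB; tmdb_client is a hypothetical API wrapper."""
        details = tmdb_client.movie_details(record["tmdb_id"])
        if not record.get("overview"):
            record["overview"] = details.get("overview", "")
        record.setdefault("keywords", details.get("keywords", []))
        return record

    def snapshot(records: list[dict], path: str) -> str:
        """Version a snapshot by content hash, so 'when did this rating change?' has an answer."""
        payload = json.dumps(records, sort_keys=True).encode()
        version = hashlib.sha256(payload).hexdigest()[:12]
        with open(f"{path}/movies_{version}.json", "w") as f:
            json.dump({"created": datetime.now(timezone.utc).isoformat(),
                       "version": version,
                       "records": records}, f)
        return version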

Stage 2: Embedding

Most tutorials just call OpenAIEmbeddings() and move on. But embedding strategy makes or breaks retrieval quality.

Key decisions:

  • What to embed: Just titles? Titles + descriptions? Concatenated metadata?
  • Chunking strategy: For movies, each item is one "chunk." For documents, you need smarter splitting.
  • Model choice: OpenAI is expensive. Sentence-Transformers is free and often just as good.

What I Built: Composite embeddings combining title, overview, genres, and keywords—384 dimensions using all-MiniLM-L6-v2.
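
Here is roughly what that looks like with Sentence-Transformers. The field mix is the part worth experimenting with; the sample record below is made up for illustration.

    from sentence_transformers import SentenceTransformer

    # all-MiniLM-L6-v2 produces 384-dimensional embeddings
    model = SentenceTransformer("all-MiniLM-L6-v2")

    def composite_text(movie: dict) -> str:
        # Field names are illustrative; the point is concatenating every field with semantic signal.
        parts = [
            movie.get("title", ""),
            movie.get("overview", ""),
            "Genres: " + ", ".join(movie.get("genres", [])),
            "Keywords: " + ", ".join(movie.get("keywords", [])),
        ]
        return ". ".join(p for p in parts if p)

    movies = [{"title": "Interstellar",
               "overview": "A team travels through a wormhole in search of a new home for humanity.",
               "genres": ["Adventure", "Drama", "Science Fiction"],
               "keywords": ["space", "wormhole", "time dilation"]}]

    vectors = model.encode([composite_text(m) for m in movies],
                           batch_size=64, normalize_embeddings=True)
    print(vectors.shape)  # (1, 384)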

Why This Matters: I tested title-only embeddings vs. composite. Relevance jumped from 71% to 89% with richer text.

Stage 3: Vector Storage

The Tutorial Version: In-memory FAISS or a free Pinecone tier.

The Reality: Your vector database is infrastructure, not a demo tool.

Requirements I had:

  • Persistence: Data survives restarts
  • Scalability: Handle growth from 10K to 100K vectors
  • Filtering: Query by metadata (genre, year, rating)
  • Self-hosting: No vendor lock-in

What I Chose: Qdrant. It's open-source, fast, and has excellent filtering. HNSW indexing gives me sub-linear search.
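
A minimal sketch with the qdrant-client Python package, with the caveat that the collection name, payload keys, and query text are placeholders and the exact client calls can shift between versions:

    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, FieldCondition, Filter, MatchValue, Range, VectorParams
    from sentence_transformers import SentenceTransformer

    client = QdrantClient(url="http://localhost:6333")

    # HNSW is Qdrant's default index; cosine distance matches normalized MiniLM vectors.
    client.recreate_collection(
        collection_name="movies",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

    # ...points upserted with 384-d vectors and payload fields like genres/year (omitted)...

    model = SentenceTransformer("all-MiniLM-L6-v2")
    query_vec = model.encode("astronauts stranded far from Earth").tolist()

    # Semantic search constrained by metadata: a genre match plus a year range.
    hits = client.search(
        collection_name="movies",
        query_vector=query_vec,
        query_filter=Filter(must=[
            FieldCondition(key="genres", match=MatchValue(value="Science Fiction")),
            FieldCondition(key="year", range=Range(gte=2000)),
        ]),
        limit=10,
    )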

Trade-off I Made: Pinecone is simpler. But I wanted full control and zero recurring costs.

Stage 4: Query Processing

The naive approach: pass the user query directly to the retriever.

The problem: user queries are awful. Typos, vague descriptions, mixed intent.

What I implemented:

  • Intent detection: Is this a genre query? Actor query? Similarity query?
  • Query expansion: "sci-fi" → "science fiction, futuristic, space"
  • Spell correction: "Intersteler" → "Interstellar"
  • Normalization: Lowercase, strip punctuation, handle synonyms

Impact: Query processing alone improved relevance by 15%.
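
A stripped-down sketch of the normalization and expansion steps. The synonym map here is a toy; in practice it's data-driven, and spell correction is usually a fuzzy match against the catalog's known titles.

    import re

    SYNONYMS = {
        "sci-fi": ["science fiction", "futuristic", "space"],
        "rom-com": ["romantic comedy", "romance"],
    }

    def normalize(query: str) -> str:
        query = query.lower().strip()
        return re.sub(r"[^\w\s'-]", "", query)   # strip punctuation, keep apostrophes/hyphens

    def expand(query: str) -> str:
        terms = [query]
        for key, extras in SYNONYMS.items():
            if key in query:
                terms.extend(extras)
        return " ".join(terms)

    print(expand(normalize("Sci-Fi movies like Interstellar!")))
    # sci-fi movies like interstellar science fiction futuristic space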

Stage 5: Retrieval

The Tutorial Version: vectorstore.similarity_search(query, k=5)

The Reality: Vector search is just the beginning.

My retrieval stack:

  1. Vector search: Semantic similarity (top 50)
  2. Keyword matching: BM25 for exact terms like actor names
  3. Metadata filtering: Year range, minimum rating
  4. Reranking: Cross-encoder to refine top 20 → top 10
  5. Diversity: Ensure varied genres in results

Why Hybrid Search: "Tom Hanks space movies" needs:

  • Vector for "space" (semantic concept)
  • Keyword for "Tom Hanks" (exact match)
  • Filter for actor metadata

Pure vector search missed exact matches 23% of the time. Hybrid dropped that to 4%.
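
One way to combine the two result lists is reciprocal-rank fusion, followed by a cross-encoder rerank. The sketch below uses toy inputs and an example cross-encoder checkpoint, and leaves out the diversity pass:

    from collections import defaultdict
    from sentence_transformers import CrossEncoder

    def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
        """Reciprocal-rank fusion: merge rankings without calibrating their scores."""
        scores: dict[str, float] = defaultdict(float)
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking):
                scores[doc_id] += 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    # Toy inputs standing in for the real retrievers' output (ranked movie IDs).
    query = "tom hanks space movies"
    doc_text = {"m1": "Apollo 13 (1995) ... Tom Hanks ...",
                "m2": "Gravity (2013) ... stranded in orbit ...",
                "m3": "Cast Away (2000) ... Tom Hanks ..."}
    vector_ids = ["m2", "m1", "m3"]   # semantic similarity (Qdrant)
    bm25_ids = ["m1", "m3", "m2"]     # exact keyword match (BM25)

    fused = rrf([vector_ids, bm25_ids])[:20]

    # Cross-encoder rerank of the fused candidates down to the final list.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, doc_text[d]) for d in fused])
    final = [d for _, d in sorted(zip(scores, fused), reverse=True)][:10]

The appeal of fusing by rank is that vector scores and BM25 scores live on different scales, so you never have to reconcile them.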

Stage 6: Evaluation

"It looks good!" is not a metric. Without actual numbers, you're guessing.

Metrics I track:

  • NDCG (Normalized Discounted Cumulative Gain): Are relevant items ranked higher?
  • MAP (Mean Average Precision): Overall ranking quality
  • MRR (Mean Reciprocal Rank): How quickly do we find the first relevant item?
  • Recall@K: What percentage of relevant items are in top K?

How I Use Them: Every query logs its results. I have a test set of 500 queries with known-good results. CI runs evaluation on every PR.
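
Recall@K and MRR are each only a few lines, which is part of why skipping them is hard to excuse. A toy version, with a stub search() and a two-query test set standing in for the real 500-query suite:

    def recall_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
        return len(relevant & set(retrieved[:k])) / len(relevant)

    def reciprocal_rank(relevant: set[str], retrieved: list[str]) -> float:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                return 1.0 / rank
        return 0.0

    def search(query: str) -> list[str]:
        """Stand-in for the real retrieval stack; returns ranked movie IDs."""
        return ["m2", "m1", "m3"]

    test_set = {"tom hanks space movies": {"m1"},
                "stranded in orbit": {"m2"}}

    recalls = [recall_at_k(gold, search(q), k=10) for q, gold in test_set.items()]
    mrr = sum(reciprocal_rank(gold, search(q)) for q, gold in test_set.items()) / len(test_set)
    print(f"Recall@10: {sum(recalls) / len(recalls):.2f}  MRR: {mrr:.2f}")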

Stage 7: Optimization

The Tutorial Version: "It's fast enough." (on your laptop)

The Reality: Production has different standards.

My optimizations:

  • Multi-tier caching: LRU (hot) + Redis (distributed)
  • Query batching: Combine similar queries
  • Connection pooling: Reuse database connections
  • Async processing: Non-blocking I/O everywhere

Results:

  • Before optimization: 180ms average latency
  • After optimization: 19-45ms average latency
  • Cache hit rate: 40%+ (versus a typical 20-30%)
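
To show what the caching bullet means in practice, here is a stripped-down sketch of the two cache tiers, assuming a local Redis and an async search_fn that runs the full pipeline on a miss:

    import hashlib
    import json
    from collections import OrderedDict

    import redis.asyncio as redis   # redis-py's asyncio client

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    local: OrderedDict[str, list] = OrderedDict()   # tier 1: in-process LRU
    LOCAL_MAX = 2048

    def _key(query: str) -> str:
        return "rag:" + hashlib.sha256(query.encode()).hexdigest()

    async def cached_search(query: str, search_fn) -> list:
        key = _key(query)
        if key in local:                               # tier 1 hit: no network at all
            local.move_to_end(key)
            return local[key]
        if (hit := await r.get(key)) is not None:      # tier 2 hit: shared across workers
            results = json.loads(hit)
        else:                                          # miss: pay for the full pipeline once
            results = await search_fn(query)
            await r.set(key, json.dumps(results), ex=3600)
        local[key] = results
        if len(local) > LOCAL_MAX:
            local.popitem(last=False)                  # evict the least-recently-used entry
        return results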

The Performance Numbers That Actually Matter

When people ask "how fast is your RAG system?" they usually mean latency. But production cares about more:

Metric | What It Measures | My Target | My Result
p50 Latency | Typical response time | Under 50ms | 19ms
p99 Latency | Worst-case (matters!) | Under 200ms | 85ms
Throughput | Requests/second | 500 QPS | 1000+ QPS
Cache Hit Rate | Efficiency | 30% | 40%+
Error Rate | Reliability | Under 0.1% | 0.02%

The p99 Trap: Average latency is misleading. If your p50 is 20ms but p99 is 2 seconds, 1 in 100 users has a terrible experience. Monitor percentiles.
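
A quick way to convince yourself, with toy numbers:

    import numpy as np

    # 980 requests served in 20 ms, 20 requests stuck at 2000 ms
    latencies_ms = np.concatenate([np.full(980, 20.0), np.full(20, 2000.0)])
    print(round(latencies_ms.mean(), 1))      # 59.6   -- the average looks healthy
    print(np.percentile(latencies_ms, 50))    # 20.0   -- so does the median
    print(np.percentile(latencies_ms, 99))    # 2000.0 -- the tail your users actually feel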

Why Most RAG Systems Fail in Production

Failure Mode 1: No Caching

Vector operations are expensive. Every query hits the embedding model, then the vector DB. Without caching:

  • Latency spikes under load
  • Costs explode with traffic
  • You're re-computing identical queries

Fix: Multi-tier caching. Hot queries in memory (LRU), warm queries in Redis, cold queries hit the full pipeline.

Failure Mode 2: Single-Point Retrieval

Vector search alone isn't enough. It misses:

  • Exact matches (proper nouns, IDs)
  • Filtered queries ("movies from 2020")
  • Negation ("not horror")

Fix: Hybrid search combining vector + keyword + metadata filtering.

Failure Mode 3: No Monitoring

"Works on my machine" is the production ML meme. Without monitoring:

  • You don't know when quality degrades
  • You can't debug user complaints
  • You're blind to drift

Fix: Log everything. Track latency, relevance scores, cache rates. Alert on anomalies.
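
A minimal instrumentation sketch with prometheus_client; the metric names, buckets, and port are illustrative, not CineRAG's actual configuration:

    import time
    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("rag_requests_total", "Retrieval requests", ["cache"])
    LATENCY = Histogram("rag_latency_seconds", "End-to-end retrieval latency",
                        buckets=(0.01, 0.025, 0.05, 0.1, 0.2, 0.5, 1.0))

    def instrumented_search(query: str, search_fn, cache_lookup):
        start = time.perf_counter()
        cached = cache_lookup(query)
        results = cached if cached is not None else search_fn(query)
        LATENCY.observe(time.perf_counter() - start)
        REQUESTS.labels(cache="hit" if cached is not None else "miss").inc()
        return results

    start_http_server(9100)   # expose /metrics for Prometheus to scrape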

Failure Mode 4: Ignoring Cold Start

First query for a new term has no cache. First request of the day warms up connections. Cold start can be 10x slower than steady state.

Fix: Cache warming on startup. Pre-compute embeddings for common queries. Connection pooling.
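
With FastAPI, warming fits naturally into a lifespan hook. The query list and warm() below are placeholders for your own top queries and pipeline:

    from contextlib import asynccontextmanager
    from fastapi import FastAPI

    TOP_QUERIES = ["best sci-fi movies", "tom hanks space movies", "feel-good comedies"]

    async def warm(query: str) -> None:
        """Stand-in for: embed the query, run retrieval, and let the cache tiers fill."""
        ...

    @asynccontextmanager
    async def lifespan(app: FastAPI):
        for q in TOP_QUERIES:      # pre-compute embeddings and populate caches before traffic
            await warm(q)
        yield                      # connection pools and caches stay warm from the first request

    app = FastAPI(lifespan=lifespan)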

What I'd Do Differently

Building CineRAG taught me lessons I'll apply to every future RAG project:

  1. Start with evaluation: Define success metrics before writing code. I added evaluation late and had to retrofit.

  2. Cache earlier: I treated caching as optimization. It should be architecture.

  3. Build hybrid from day one: I started with pure vector search, then added keyword. Hybrid should be the default.

  4. Invest in query processing: 15% relevance gain from query preprocessing. Should have done more.

  5. Test at scale: My laptop handled 10 QPS fine. Production needs 1000+. Test with realistic load early.

The RAG Stack I Recommend

After building CineRAG, here's my recommended stack for production RAG:

Component | My Choice | Why
Vector DB | Qdrant | Self-hostable, great filtering, fast
Embeddings | Sentence-Transformers | Free, local, good quality
Keyword Search | BM25 (Elasticsearch/custom) | Exact match capability
Cache | Redis + LRU | Distributed + in-memory layers
Backend | FastAPI | Async, auto-docs, Python ecosystem
Monitoring | Custom + Prometheus | Full visibility
Deployment | Docker + K8s | Portable, scalable

Notable Absence: I'm not using LangChain in production. It's great for prototyping but adds abstraction where I want control.

RAG Beyond Documents

The biggest lesson from CineRAG: RAG isn't just for document Q&A.

The same patterns work for:

  • Product search: "Comfortable running shoes for flat feet"
  • Music discovery: "Jazz that sounds like late-night city vibes"
  • Job matching: "Engineering roles with mentorship culture"
  • Support routing: Match tickets to the right agent

Anywhere you have structured content and natural language queries, RAG applies.

Getting Started

If you're building a production RAG system, here's my advice:

  1. Define your metrics before writing code
  2. Start hybrid (vector + keyword) from day one
  3. Cache aggressively (it's not premature optimization)
  4. Test under load early and often
  5. Monitor everything in production

The gap between demo and production is real—but it's bridgeable. The techniques aren't secret. They just require engineering discipline.


I wrote a deeper breakdown of CineRAG with architecture diagrams and code samples in the case study. There's also a checklist I use before any RAG deployment.

If you're building something similar and hitting walls, reach out—I'm always happy to talk shop.
