Every RAG tutorial ends the same way: you've got a Jupyter notebook that answers questions about PDFs. It works. Ship it, right?
Wrong.
The gap between a RAG demo and a production RAG system is enormous. I learned this building CineRAG, a movie recommendation engine that needed to handle real traffic, real latency requirements, and real user expectations.
Here's what I discovered—and what most tutorials won't tell you.
The 7-Stage RAG Pipeline Nobody Talks About
Most tutorials focus on two stages: embed and retrieve. Production RAG has seven:
Ingestion → Embedding → VectorStore → Query Processing → Retrieval → Evaluation → Optimization
Let's break down each stage and what actually matters in production.
Stage 1: Ingestion
Tutorials show you LangChain's DirectoryLoader and move on. Reality isn't that clean.
Movies have titles, descriptions, genres, cast, ratings, and images, and your source data is messier than any tutorial dataset. Here's what actually matters:
- Normalize heterogeneous data sources (APIs, databases, files)
- Enrich sparse data with external sources (I used TMDB API)
- Validate data quality before embedding (garbage in, garbage out)
- Version your data (when did this movie's rating change?)
What I Built: A pipeline that pulls from MovieLens, enriches with TMDB, validates completeness, and versions snapshots.
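Here's a minimal sketch of the validation step, the "garbage in, garbage out" gate. The Movie fields and thresholds are illustrative, not CineRAG's actual schema:

```python
# Sketch of a pre-embedding validation gate. Fields and rules are illustrative.
from dataclasses import dataclass, field

@dataclass
class Movie:
    movie_id: int
    title: str
    overview: str = ""
    genres: list[str] = field(default_factory=list)
    rating: float | None = None

REQUIRED_TEXT_FIELDS = ("title", "overview")

def validate(movie: Movie) -> list[str]:
    """Return data-quality problems; an empty list means the record is safe to embed."""
    problems = []
    for name in REQUIRED_TEXT_FIELDS:
        if not getattr(movie, name).strip():
            problems.append(f"missing {name}")
    if not movie.genres:
        problems.append("no genres")
    if movie.rating is not None and not 0.0 <= movie.rating <= 10.0:
        problems.append("rating out of range")
    return problems

movies = [
    Movie(1, "Interstellar", "A team travels through a wormhole.", ["Sci-Fi"], 8.6),
    Movie(2, "Untitled Project", "", [], None),  # sparse record: route to enrichment, not embedding
]
embeddable = [m for m in movies if not validate(m)]
print([m.title for m in embeddable])  # -> ['Interstellar']
```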
Stage 2: Embedding
Most tutorials just call OpenAIEmbeddings() and move on. But embedding strategy makes or breaks retrieval quality.
Key decisions:
- What to embed: Just titles? Titles + descriptions? Concatenated metadata?
- Chunking strategy: For movies, each item is one "chunk." For documents, you need smarter splitting.
- Model choice: OpenAI is expensive. Sentence-Transformers is free and often just as good.
What I Built: Composite embeddings combining title, overview, genres, and keywords—384 dimensions using all-MiniLM-L6-v2.
Why This Matters: I tested title-only embeddings vs. composite. Relevance jumped from 71% to 89% with richer text.
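As a rough sketch of the composite approach (the field names are illustrative, but the model is the same all-MiniLM-L6-v2):

```python
# Sketch of composite embeddings: embed a richer text per movie, not the title alone.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

def composite_text(movie: dict) -> str:
    # Concatenate title, overview, genres, and keywords into one string to embed.
    parts = [
        movie["title"],
        movie.get("overview", ""),
        " ".join(movie.get("genres", [])),
        " ".join(movie.get("keywords", [])),
    ]
    return ". ".join(p for p in parts if p)

movies = [
    {"title": "Interstellar", "overview": "A team travels through a wormhole in space.",
     "genres": ["Sci-Fi", "Drama"], "keywords": ["space", "wormhole", "time dilation"]},
]
vectors = model.encode([composite_text(m) for m in movies], normalize_embeddings=True)
print(vectors.shape)  # (1, 384)
```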
Stage 3: Vector Storage
The Tutorial Version: In-memory FAISS or a free Pinecone tier.
The Reality: Your vector database is infrastructure, not a demo tool.
Requirements I had:
- Persistence: Data survives restarts
- Scalability: Handle growth from 10K to 100K vectors
- Filtering: Query by metadata (genre, year, rating)
- Self-hosting: No vendor lock-in
What I Chose: Qdrant. It's open-source, fast, and has excellent filtering. HNSW indexing gives me sub-linear search.
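A minimal Qdrant setup looks roughly like this, assuming a self-hosted instance on localhost and the classic qdrant-client interface; the collection name and payload fields are illustrative:

```python
# Sketch: a persistent collection with metadata payloads so queries can filter
# by genre, year, or rating. Names and values are illustrative.
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue, Range,
)

client = QdrantClient(url="http://localhost:6333")  # assumes a self-hosted instance

client.create_collection(
    collection_name="movies",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # matches all-MiniLM-L6-v2
)

client.upsert(
    collection_name="movies",
    points=[PointStruct(
        id=1,
        vector=[0.0] * 384,  # real vectors come from the embedding stage
        payload={"title": "Interstellar", "genres": ["Sci-Fi"], "year": 2014, "rating": 8.6},
    )],
)

# Semantic search constrained by metadata: Sci-Fi released 2010 or later.
hits = client.search(
    collection_name="movies",
    query_vector=[0.0] * 384,
    query_filter=Filter(must=[
        FieldCondition(key="genres", match=MatchValue(value="Sci-Fi")),
        FieldCondition(key="year", range=Range(gte=2010)),
    ]),
    limit=10,
)
```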
Trade-off I Made: Pinecone is simpler. But I wanted full control and zero recurring costs.
Stage 4: Query Processing
The naive approach: pass the user query directly to the retriever.
The problem: user queries are awful. Typos, vague descriptions, mixed intent.
What I implemented:
- Intent detection: Is this a genre query? Actor query? Similarity query?
- Query expansion: "sci-fi" → "science fiction, futuristic, space"
- Spell correction: "Intersteler" → "Interstellar"
- Normalization: Lowercase, strip punctuation, handle synonyms
Impact: Query processing alone improved relevance by 15%.
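A stripped-down version of the preprocessing step might look like this; the synonym table and rules are placeholders, not the production logic:

```python
# Sketch of query preprocessing: normalize, then expand known shorthands
# before the query reaches the retriever. The synonym map is illustrative.
import re

SYNONYMS = {
    "sci-fi": "science fiction futuristic space",
    "rom-com": "romantic comedy",
}

def preprocess(query: str) -> str:
    q = query.lower().strip()
    q = re.sub(r"[^\w\s-]", " ", q)                    # strip punctuation
    q = re.sub(r"\s+", " ", q)                         # collapse whitespace
    tokens = [SYNONYMS.get(t, t) for t in q.split()]   # expand known shorthands
    return " ".join(tokens)

print(preprocess("Sci-fi, but not horror!"))
# -> "science fiction futuristic space but not horror"
```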
Stage 5: Retrieval
The Tutorial Version: vectorstore.similarity_search(query, k=5)
The Reality: Vector search is just the beginning.
My retrieval stack:
- Vector search: Semantic similarity (top 50)
- Keyword matching: BM25 for exact terms like actor names
- Metadata filtering: Year range, minimum rating
- Reranking: Cross-encoder to refine top 20 → top 10
- Diversity: Ensure varied genres in results
Why Hybrid Search: "Tom Hanks space movies" needs:
- Vector for "space" (semantic concept)
- Keyword for "Tom Hanks" (exact match)
- Filter for actor metadata
Pure vector search missed exact matches 23% of the time. Hybrid dropped that to 4%.
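One common way to fuse the two rankings is reciprocal rank fusion (RRF). This sketch uses the rank_bm25 package and a made-up vector ranking; it illustrates the fusion idea, not CineRAG's exact scoring:

```python
# Sketch of hybrid retrieval via reciprocal rank fusion (RRF):
# merge a BM25 ranking and a vector ranking into one list of doc ids.
from rank_bm25 import BM25Okapi

def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Fuse several ranked lists of doc ids; a better rank in any list boosts the score."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

corpus = ["Apollo 13 Tom Hanks space rescue",
          "Interstellar wormhole space travel",
          "Forrest Gump Tom Hanks drama"]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
keyword_ranking = list(bm25.get_scores("tom hanks space".split()).argsort()[::-1])

vector_ranking = [1, 0, 2]  # pretend output of the vector store for the same query

print(rrf([keyword_ranking, vector_ranking]))  # doc 0 (Apollo 13) surfaces at the top
```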
Stage 6: Evaluation
"It looks good!" is not a metric. Without actual numbers, you're guessing.
Metrics I track:
- NDCG (Normalized Discounted Cumulative Gain): Are relevant items ranked higher?
- MAP (Mean Average Precision): Overall ranking quality
- MRR (Mean Reciprocal Rank): How quickly do we find the first relevant item?
- Recall@K: What percentage of relevant items are in top K?
How I Use Them: Every query logs its results. I have a test set of 500 queries with known-good results. CI runs evaluation on every PR.
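The metrics themselves are only a few lines of Python. Here's a sketch of the offline loop; the test-set format is an illustrative assumption:

```python
# Sketch of offline evaluation: standard ranking metrics (binary relevance)
# averaged over a test set of queries with known-good results.
import math

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

def mrr(ranked: list[str], relevant: set[str]) -> float:
    for i, doc in enumerate(ranked):
        if doc in relevant:
            return 1.0 / (i + 1)
    return 0.0

def recall_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0

# One entry per test query: the system's ranking and the known-good ids.
test_set = [(["m1", "m7", "m3"], {"m1", "m3"}), (["m9", "m2"], {"m2"})]
print(sum(ndcg_at_k(r, rel) for r, rel in test_set) / len(test_set))
print(sum(mrr(r, rel) for r, rel in test_set) / len(test_set))
```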
Stage 7: Optimization
The Tutorial Version: "It's fast enough." (on your laptop)
The Reality: Production has different standards.
My optimizations:
- Multi-tier caching: LRU (hot) + Redis (distributed)
- Query batching: Combine similar queries
- Connection pooling: Reuse database connections
- Async processing: Non-blocking I/O everywhere
Results:
- Before optimization: 180ms average latency
- After optimization: 19-45ms average latency
- Cache hit rate: 40%+ (versus a typical 20-30%)
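The caching layer is conceptually simple. A sketch of the two-tier idea, with illustrative keys, TTLs, and a placeholder pipeline call:

```python
# Sketch of multi-tier caching: in-process LRU in front of Redis,
# with the full pipeline as the cold path. Key scheme and TTL are illustrative.
import json
from functools import lru_cache

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def run_full_pipeline(query: str) -> list[dict]:
    # Placeholder for the real embed -> retrieve -> rerank path.
    return [{"title": "Interstellar", "score": 0.93}]

@lru_cache(maxsize=2048)                      # tier 1: hot queries, in memory
def cached_search(normalized_query: str) -> str:
    key = f"rag:{normalized_query}"
    hit = r.get(key)                          # tier 2: warm queries, shared via Redis
    if hit is not None:
        return hit
    payload = json.dumps(run_full_pipeline(normalized_query))  # tier 3: cold path
    r.setex(key, 3600, payload)               # keep warm for an hour
    return payload

print(json.loads(cached_search("tom hanks space movies")))
```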
The Performance Numbers That Actually Matter
When people ask "how fast is your RAG system?" they usually mean latency. But production cares about more:
| Metric | What It Measures | My Target | My Result |
|---|---|---|---|
| p50 Latency | Typical response time | Under 50ms | 19ms |
| p99 Latency | Worst-case (matters!) | Under 200ms | 85ms |
| Throughput | Requests/second | 500 QPS | 1000+ QPS |
| Cache Hit Rate | Efficiency | 30% | 40%+ |
| Error Rate | Reliability | Under 0.1% | 0.02% |
The p99 Trap: Average latency is misleading. If your p50 is 20ms but p99 is 2 seconds, 1 in 100 users has a terrible experience. Monitor percentiles.
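A quick simulation makes the point; the latency distributions here are made up:

```python
# Why averages mislead: 1% of requests hitting a slow tail barely moves the
# mean but dominates p99.
import numpy as np

latencies_ms = np.concatenate([np.random.normal(20, 5, 990), np.random.normal(2000, 200, 10)])
print(f"mean: {latencies_ms.mean():.0f}ms")               # still looks healthy
print(f"p50:  {np.percentile(latencies_ms, 50):.0f}ms")
print(f"p99:  {np.percentile(latencies_ms, 99):.0f}ms")   # the number those users feel
```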
Why Most RAG Systems Fail in Production
Failure Mode 1: No Caching
Vector operations are expensive. Every query hits the embedding model, then the vector DB. Without caching:
- Latency spikes under load
- Costs explode with traffic
- You're re-computing identical queries
Fix: Multi-tier caching. Hot queries in memory (LRU), warm queries in Redis, cold queries hit the full pipeline.
Failure Mode 2: Single-Point Retrieval
Vector search alone isn't enough. It misses:
- Exact matches (proper nouns, IDs)
- Filtered queries ("movies from 2020")
- Negation ("not horror")
Fix: Hybrid search combining vector + keyword + metadata filtering.
Failure Mode 3: No Monitoring
"Works on my machine" is the production ML meme. Without monitoring:
- You don't know when quality degrades
- You can't debug user complaints
- You're blind to drift
Fix: Log everything. Track latency, relevance scores, cache rates. Alert on anomalies.
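A sketch of what that instrumentation can look like with prometheus_client; the metric names and placeholder pipeline call are illustrative:

```python
# Sketch of basic RAG instrumentation: latency histogram, cache counters,
# and a relevance gauge exposed for Prometheus to scrape.
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUEST_LATENCY = Histogram("rag_request_latency_seconds", "End-to-end query latency")
CACHE_HITS = Counter("rag_cache_hits_total", "Cache hits", ["tier"])
TOP_RESULT_SCORE = Gauge("rag_top_result_score", "Similarity score of the best hit")

def handle_query(query: str) -> list[dict]:
    with REQUEST_LATENCY.time():              # records duration into the histogram
        results = [{"title": "Interstellar", "score": 0.91}]  # placeholder pipeline call
        TOP_RESULT_SCORE.set(results[0]["score"])
        return results

if __name__ == "__main__":
    start_http_server(9100)                   # Prometheus scrapes /metrics on this port
    while True:
        handle_query("tom hanks space movies")
        CACHE_HITS.labels(tier="lru").inc()
        time.sleep(1)
```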
Failure Mode 4: Ignoring Cold Start
First query for a new term has no cache. First request of the day warms up connections. Cold start can be 10x slower than steady state.
Fix: Cache warming on startup. Pre-compute embeddings for common queries. Connection pooling.
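Here's a sketch of startup warming using FastAPI's lifespan hook; the query list and state names are illustrative assumptions:

```python
# Sketch of cache warming at startup: load the embedding model and pre-encode
# common queries so the first real request is already warm.
from contextlib import asynccontextmanager

from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

COMMON_QUERIES = ["best sci-fi movies", "tom hanks movies", "feel good comedies"]

@asynccontextmanager
async def lifespan(app: FastAPI):
    model = SentenceTransformer("all-MiniLM-L6-v2")   # loading here avoids a cold first request
    app.state.model = model
    app.state.warm_vectors = model.encode(COMMON_QUERIES, normalize_embeddings=True)
    yield

app = FastAPI(lifespan=lifespan)

@app.get("/health")
def health():
    return {"warmed": len(app.state.warm_vectors)}
```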
What I'd Do Differently
Building CineRAG taught me lessons I'll apply to every future RAG project:
- Start with evaluation: Define success metrics before writing code. I added evaluation late and had to retrofit.
- Cache earlier: I treated caching as an optimization. It should be architecture.
- Build hybrid from day one: I started with pure vector search, then added keyword. Hybrid should be the default.
- Invest in query processing: Preprocessing alone bought a 15% relevance gain. I should have done more.
- Test at scale: My laptop handled 10 QPS fine. Production needs 1000+. Test with realistic load early.
The RAG Stack I Recommend
After building CineRAG, here's my recommended stack for production RAG:
| Component | My Choice | Why |
|---|---|---|
| Vector DB | Qdrant | Self-hostable, great filtering, fast |
| Embeddings | Sentence-Transformers | Free, local, good quality |
| Keyword Search | BM25 (Elasticsearch/custom) | Exact match capability |
| Cache | Redis + LRU | Distributed + in-memory layers |
| Backend | FastAPI | Async, auto-docs, Python ecosystem |
| Monitoring | Custom + Prometheus | Full visibility |
| Deployment | Docker + K8s | Portable, scalable |
Notable Absence: I'm not using LangChain in production. It's great for prototyping but adds abstraction where I want control.
RAG Beyond Documents
The biggest lesson from CineRAG: RAG isn't just for document Q&A.
The same patterns work for:
- Product search: "Comfortable running shoes for flat feet"
- Music discovery: "Jazz that sounds like late-night city vibes"
- Job matching: "Engineering roles with mentorship culture"
- Support routing: Match tickets to the right agent
Anywhere you have structured content and natural language queries, RAG applies.
Getting Started
If you're building a production RAG system, here's my advice:
- Define your metrics before writing code
- Start hybrid (vector + keyword) from day one
- Cache aggressively (it's not premature optimization)
- Test under load early and often
- Monitor everything in production
The gap between demo and production is real—but it's bridgeable. The techniques aren't secret. They just require engineering discipline.
I wrote a deeper breakdown of CineRAG with architecture diagrams and code samples in the case study. There's also a checklist I use before any RAG deployment.
If you're building something similar and hitting walls, reach out—I'm always happy to talk shop.



