Introduction
Retrieval-Augmented Generation (RAG) has become the go-to pattern for building AI-powered search, Q&A, and recommendation systems. But the gap between a working prototype and a production system is vast.
After building CineRAG—a movie recommendation engine handling 1000+ QPS with sub-50ms latency—I've compiled this checklist of everything you need to verify before deploying a RAG system.
Use this as a pre-deployment review. If you can't confidently check off most of these items, you're not ready for production.
Phase 1: Data Ingestion
1. Data Sources Are Documented
✅ What to verify:
- All data sources are identified and documented
- Data formats, schemas, and update frequencies are known
- Access credentials are securely stored
- Data lineage is traceable
Why it matters: You can't debug retrieval issues if you don't know where the data came from.
2. Data Enrichment Is Automated
✅ What to verify:
- External data sources (APIs, databases) are integrated
- Enrichment pipeline handles failures gracefully
- Rate limits and quotas are respected
- Fallback values exist for missing enrichments
Example: For CineRAG, I enriched MovieLens data with the TMDB API (posters, descriptions, cast). Without this, the embedded text was too sparse to produce useful vectors. A minimal sketch of the failure-handling pattern follows.
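Here's one way to structure that pattern, assuming the standard TMDB movie endpoint; the field names, retry budget, and fallback shape are illustrative, not the exact CineRAG code:

```python
import time

import requests

TMDB_URL = "https://api.themoviedb.org/3/movie/{movie_id}"

def enrich_movie(movie_id: int, api_key: str, retries: int = 3) -> dict:
    """Fetch TMDB metadata with retries; fall back to empty fields on failure."""
    for attempt in range(retries):
        try:
            resp = requests.get(TMDB_URL.format(movie_id=movie_id),
                                params={"api_key": api_key}, timeout=5)
            if resp.status_code == 429:          # rate limited: back off and retry
                time.sleep(2 ** attempt)
                continue
            resp.raise_for_status()
            data = resp.json()
            return {"overview": data.get("overview", ""),
                    "poster_path": data.get("poster_path")}
        except requests.RequestException:
            time.sleep(2 ** attempt)
    # Fallback: keep the record usable and flag it for later re-enrichment
    return {"overview": "", "poster_path": None, "enrichment_failed": True}
```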
3. Data Quality Checks Are In Place
✅ What to verify:
- Automated validation for null values, outliers, and schema changes
- Tests run on every data refresh
- Alerts trigger when quality thresholds are breached
- Bad records are quarantined, not silently dropped
Why it matters: Garbage in, garbage out: bad data creates bad embeddings, which create bad results.
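A minimal sketch of quarantine-style validation; the field names and thresholds are assumptions to adapt to your schema:

```python
def validate_record(record: dict) -> list[str]:
    """Return a list of quality violations; an empty list means the record passes."""
    errors = []
    if not record.get("title"):
        errors.append("missing title")
    rating = record.get("rating")
    if rating is not None and not (0.0 <= rating <= 10.0):
        errors.append(f"rating out of range: {rating}")
    if len(record.get("description", "")) < 20:
        errors.append("description too short for a useful embedding")
    return errors

def partition(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Quarantine bad records for inspection instead of silently dropping them."""
    good, quarantined = [], []
    for r in records:
        (quarantined if validate_record(r) else good).append(r)
    return good, quarantined
```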
4. Data Versioning Is Implemented
✅ What to verify:
- Data snapshots are versioned and stored
- Can reproduce any index by referencing its data version
- Rollback to previous data version is possible
Why it matters: "The results were better last week" requires knowing what data powered last week's index.
Phase 2: Embedding & Indexing
5. Embedding Strategy Is Documented
✅ What to verify:
- What text is embedded (title only? title + description? metadata?)
- Chunking strategy for long documents
- Embedding model choice and version
- Dimensionality and similarity metric
Best practice: Document your embedding formula:
```python
embedding = embed(f"{title}. {description}. Genres: {genres}. Keywords: {keywords}")
```
6. Embedding Model Is Appropriate
✅ What to verify:
- Model is evaluated on your domain (not just general benchmarks)
- Latency is acceptable for your use case
- Cost is sustainable at scale
- Model is versioned (avoid drift from API updates)
Trade-offs:
| Model | Latency | Cost | Quality |
|---|---|---|---|
| OpenAI ada-002 | ~100ms | $0.0001/1K tokens | Excellent |
| Sentence-Transformers | ~10ms | Free (self-hosted) | Very Good |
| Cohere embed | ~50ms | $0.0001/1K tokens | Excellent |
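For the self-hosted route, here's a minimal sketch with Sentence-Transformers, applying the embedding formula documented above; `all-MiniLM-L6-v2` is one common choice, not a universal recommendation:

```python
from sentence_transformers import SentenceTransformer

# Pin the model name (and ideally the revision) in config to avoid silent drift
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim vectors, cosine-friendly

def embed_movie(title: str, description: str, genres: str, keywords: str):
    text = f"{title}. {description}. Genres: {genres}. Keywords: {keywords}"
    return model.encode(text, normalize_embeddings=True)
```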
7. Chunking Strategy Is Validated
✅ What to verify:
- Chunk size balances context and retrieval precision
- Overlapping chunks prevent boundary issues
- Document structure is preserved (headings, sections)
- Edge cases tested (very short/long documents)
For CineRAG: each movie is a single "chunk", since movies are atomic items. For long documents, I recommend 512 tokens with 50-token overlap.
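A minimal sliding-window sketch; it splits on whitespace as a stand-in for real model tokens, which you'd count with your embedding model's tokenizer in practice:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Sliding-window chunking with overlap to avoid losing context at boundaries."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # final window already covers the tail
    return chunks
```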
8. Vector Index Is Optimized
✅ What to verify:
- Index type matches your scale (HNSW for under 1M vectors, IVF for larger)
- Index parameters are tuned (ef, M for HNSW)
- Recall is measured and acceptable
- Rebuild strategy exists for index drift
Why it matters: Default index settings prioritize build speed, not query performance. Tune for your use case.
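With Qdrant, both the build-time and query-time knobs are exposed directly; the values below are starting points to tune against measured recall, not recommendations:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, SearchParams, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Build time: larger m / ef_construct trade indexing speed for recall
client.create_collection(
    collection_name="movies",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=32, ef_construct=256),
)

# Query time: raise hnsw_ef to trade latency for recall
hits = client.search(
    collection_name="movies",
    query_vector=[0.0] * 384,  # placeholder vector
    limit=10,
    search_params=SearchParams(hnsw_ef=128),
)
```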
Phase 3: Query Processing
9. Query Preprocessing Is Robust
✅ What to verify:
- Normalization (lowercase, punctuation handling)
- Spell correction for common errors
- Synonym expansion (optional but valuable)
- Input validation and sanitization
Impact: Query preprocessing improved CineRAG relevance by 15%.
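A minimal normalization sketch; spell correction and synonym expansion would layer on top of this:

```python
import re

def preprocess_query(query: str, max_len: int = 256) -> str:
    """Normalize user input before it reaches retrieval."""
    q = query.strip()[:max_len]          # validation: bound input length
    q = q.lower()                        # normalization
    q = re.sub(r"[^\w\s'-]", " ", q)     # drop stray punctuation, keep apostrophes/hyphens
    q = re.sub(r"\s+", " ", q).strip()   # collapse whitespace
    return q
```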
10. Intent Detection Is Implemented
✅ What to verify:
- Different query types are identified (search, filter, similarity)
- Routing logic is documented
- Fallback behavior exists for ambiguous queries
- Intent accuracy is measured
Example intents:
- "action movies from 2020" → filter query
- "movies like Inception" → similarity query
- "best Tom Hanks performances" → hybrid query
11. Query Expansion Is Tested
✅ What to verify:
- Synonyms improve recall without hurting precision
- Expansion doesn't over-generalize (keep it relevant)
- Performance impact is acceptable
- Expansion can be disabled for exact-match needs
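A minimal sketch with a hand-curated synonym map (the entries are illustrative); note the kill switch for exact-match needs:

```python
SYNONYMS = {  # small, domain-curated map; keep it tight to avoid over-generalizing
    "movie": ["film"],
    "scary": ["horror"],
    "funny": ["comedy"],
}

def expand_query(query: str, enabled: bool = True) -> str:
    """Append synonyms for recall; disable for exact-match lookups."""
    if not enabled:
        return query
    extra = [syn for word in query.split() for syn in SYNONYMS.get(word, [])]
    return f"{query} {' '.join(extra)}" if extra else query
```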
Phase 4: Retrieval
12. Hybrid Search Is Implemented
✅ What to verify:
- Vector search for semantic matching
- Keyword search (BM25) for exact matches
- Weighted combination is tuned
- Each component can be debugged independently
Why hybrid: Pure vector search missed exact matches (actor names, titles) 23% of the time for CineRAG.
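A minimal fusion sketch: normalize each score set, then blend with a tunable weight. The 0.7 default is an assumption to tune on your own relevance data; reciprocal rank fusion is a common alternative:

```python
def hybrid_scores(vector_hits: dict[str, float],
                  bm25_hits: dict[str, float],
                  alpha: float = 0.7) -> dict[str, float]:
    """Min-max normalize each score set, then blend; alpha weights the vector side."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    v, b = normalize(vector_hits), normalize(bm25_hits)
    ids = set(v) | set(b)
    return {i: alpha * v.get(i, 0.0) + (1 - alpha) * b.get(i, 0.0) for i in ids}
```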
13. Metadata Filtering Works
✅ What to verify:
- Filters are applied efficiently (before vector search if possible)
- Filter combinations are tested
- Empty results are handled gracefully
- Filter UI matches backend capabilities
Example filters:
- Year range: 2015-2020
- Minimum rating: 7.0+
- Genres: include "Action", exclude "Horror"
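With Qdrant, those example filters look like this; the payload field names are assumptions about your schema:

```python
from qdrant_client.models import FieldCondition, Filter, MatchValue, Range

movie_filter = Filter(
    must=[
        FieldCondition(key="year", range=Range(gte=2015, lte=2020)),
        FieldCondition(key="rating", range=Range(gte=7.0)),
        FieldCondition(key="genres", match=MatchValue(value="Action")),
    ],
    must_not=[
        FieldCondition(key="genres", match=MatchValue(value="Horror")),
    ],
)

# Pass query_filter=movie_filter to client.search(); Qdrant applies it during
# the vector scan rather than post-filtering the result set.
```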
14. Reranking Is Evaluated
✅ What to verify:
- Reranking model improves relevance metrics
- Latency overhead is acceptable (typically 50-100ms)
- Reranking is optional/tunable
- Fallback exists if reranking fails
When to use: Reranking shines when initial retrieval returns 20+ candidates. Cross-encoder reranking can improve NDCG by 10-20%.
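A minimal cross-encoder sketch with Sentence-Transformers; the model name is one widely used public checkpoint, not necessarily the best for your domain:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    """Score (query, candidate) pairs jointly and keep the best top_k."""
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```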
15. Result Diversity Is Considered
✅ What to verify:
- Results aren't all from the same cluster
- MMR (Maximal Marginal Relevance) or similar is implemented
- Diversity level is tunable
- Business rules for diversity are documented
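A minimal MMR sketch, assuming unit-normalized NumPy vectors so dot products equal cosine similarity; `lam` trades relevance against diversity:

```python
import numpy as np

def mmr(query_vec: np.ndarray, doc_vecs: np.ndarray,
        k: int = 10, lam: float = 0.7) -> list[int]:
    """Greedily pick items that are relevant but dissimilar to those already chosen."""
    relevance = doc_vecs @ query_vec
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        if not selected:
            best = max(remaining, key=lambda i: relevance[i])
        else:
            chosen = doc_vecs[selected]
            best = max(remaining,
                       key=lambda i: lam * relevance[i]
                                     - (1 - lam) * float(np.max(chosen @ doc_vecs[i])))
        selected.append(best)
        remaining.remove(best)
    return selected
```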
Phase 5: Caching & Performance
16. Multi-Tier Caching Is Implemented
✅ What to verify:
- Hot cache (in-memory LRU) for frequent queries
- Warm cache (Redis) for distributed access
- Cache invalidation strategy is defined
- Cache hit rate is monitored
Target: 30%+ cache hit rate. CineRAG achieves 40%+.
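A minimal two-tier sketch; the capacity, TTL, and key prefix are illustrative values, and TTL expiry is the simplest invalidation strategy to start with:

```python
import json
from collections import OrderedDict

import redis

r = redis.Redis(host="localhost", port=6379)
hot = OrderedDict()          # tier 1: in-process LRU
HOT_CAPACITY = 10_000
WARM_TTL_SECONDS = 3600      # tier 2: Redis, shared across API pods

def cached_search(query: str, search_fn):
    if query in hot:                         # tier 1 hit
        hot.move_to_end(query)
        return hot[query]
    warm = r.get(f"rag:{query}")             # tier 2 hit
    if warm is not None:
        results = json.loads(warm)
    else:
        results = search_fn(query)           # miss: run the real retrieval
        r.setex(f"rag:{query}", WARM_TTL_SECONDS, json.dumps(results))
    hot[query] = results
    if len(hot) > HOT_CAPACITY:
        hot.popitem(last=False)              # evict least-recently-used entry
    return results
```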
17. Latency Targets Are Met
✅ What to verify:
- p50, p95, p99 latencies are measured
- Targets are defined for each percentile
- Latency is monitored in production
- Alerts trigger on degradation
Typical targets:
- p50: under 50ms
- p95: under 100ms
- p99: under 200ms
18. Throughput Is Load Tested
✅ What to verify:
- System handles 10x expected traffic
- Graceful degradation under overload
- Auto-scaling is configured (if applicable)
- Load tests run regularly
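A minimal Locust sketch for the regular load tests; the `/search` endpoint and parameter name are assumptions about your API:

```python
from locust import HttpUser, between, task

class SearchUser(HttpUser):
    wait_time = between(0.1, 1.0)  # simulate think time between requests

    @task
    def search(self):
        self.client.get("/search", params={"q": "movies like Inception"})
```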
19. Cold Start Is Addressed
✅ What to verify:
- First query latency is acceptable
- Cache warming on startup
- Connection pools are pre-initialized
- Lazy loading doesn't cause timeouts
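With FastAPI, warming can hook into the lifespan handler; `TOP_QUERIES` and `run_retrieval` here are illustrative stand-ins for your query log and retrieval pipeline:

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI

TOP_QUERIES = ["action movies from 2020", "movies like Inception"]  # from query logs

def run_retrieval(query: str) -> list:
    return []  # stand-in for the real retrieval pipeline

@asynccontextmanager
async def lifespan(app: FastAPI):
    for query in TOP_QUERIES:   # populate caches before the first user request
        run_retrieval(query)
    yield                       # app serves traffic; close pools after this line

app = FastAPI(lifespan=lifespan)
```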
Phase 6: Evaluation & Monitoring
20. Relevance Metrics Are Tracked
✅ What to verify:
- NDCG, MAP, MRR, Recall@K are implemented
- Metrics run on a test set regularly
- Thresholds for acceptable performance are defined
- Alerts trigger on metric degradation
Why it matters: Without metrics, "did we break something?" is unanswerable.
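Minimal binary-relevance versions of three of these metrics; graded relevance labels extend NDCG naturally:

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    dcg = sum(1.0 / math.log2(r + 1)
              for r, doc in enumerate(retrieved[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```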
21. Query Logs Are Captured
✅ What to verify:
- All queries logged (anonymized if needed)
- Results and relevance scores logged
- User feedback (if available) is captured
- Logs are searchable and queryable
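A minimal structured-logging sketch; JSON lines keep logs queryable downstream, and the field set is an assumption to adapt:

```python
import json
import logging
import time

logger = logging.getLogger("rag.queries")

def log_query(query: str, result_ids: list[str], scores: list[float],
              feedback=None) -> None:
    """Emit one JSON line per query; anonymize upstream if queries contain PII."""
    logger.info(json.dumps({
        "ts": time.time(),
        "query": query,
        "result_ids": result_ids,
        "scores": scores,
        "feedback": feedback,
    }))
```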
22. Error Handling Is Comprehensive
✅ What to verify:
- Errors return helpful messages (not stack traces)
- Fallback results for failed retrievals
- Circuit breakers for external dependencies
- Error rates are monitored
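A minimal circuit-breaker sketch (simplified; production libraries add half-open probes and per-error policies):

```python
import time

class CircuitBreaker:
    """Trip open after repeated failures so a dead dependency fails fast."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures, self.reset_after = max_failures, reset_after
        self.failures, self.opened_at = 0, 0.0

    def call(self, fn, *args, fallback=None):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback          # open: skip the dependency entirely
            self.failures = 0            # cooldown elapsed: allow a trial call
        try:
            result = fn(*args)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            self.opened_at = time.monotonic()
            return fallback              # degrade gracefully, never 500 the user
```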
23. Dashboards Exist
✅ What to verify:
- Query volume and latency
- Cache hit rates
- Error rates
- Relevance metrics over time
Phase 7: Deployment & Operations
24. Containerization Is Complete
✅ What to verify:
- All components are containerized
- Dependencies are pinned
- Health checks are implemented
- Environment configuration is externalized
CineRAG: a single `docker-compose up` brings up the entire stack (API, Qdrant, Redis, frontend).
25. Rollback Plan Exists
✅ What to verify:
- Previous version can be deployed in under 5 minutes
- Data rollback procedure is documented
- Post-incident review process exists
- Runbooks exist for common failures
Quick Reference Checklist
Copy this for your next deployment review:
## RAG Deployment Checklist
### Data Ingestion
- [ ] Data sources documented
- [ ] Enrichment pipeline automated
- [ ] Quality checks in place
- [ ] Data versioning implemented
### Embedding & Indexing
- [ ] Embedding strategy documented
- [ ] Model evaluated on domain
- [ ] Chunking validated
- [ ] Vector index optimized
### Query Processing
- [ ] Preprocessing robust
- [ ] Intent detection implemented
- [ ] Query expansion tested
### Retrieval
- [ ] Hybrid search implemented
- [ ] Metadata filtering works
- [ ] Reranking evaluated
- [ ] Result diversity considered
### Performance
- [ ] Multi-tier caching implemented
- [ ] Latency targets met (p50 under 50ms, p99 under 200ms)
- [ ] Throughput load tested
- [ ] Cold start addressed
### Evaluation & Monitoring
- [ ] Relevance metrics tracked (NDCG, MAP, MRR)
- [ ] Query logs captured
- [ ] Error handling comprehensive
- [ ] Dashboards exist
### Deployment
- [ ] Containerization complete
- [ ] Rollback plan exists
RAG Architecture Template
For reference, here's the production architecture I recommend:
```
┌─────────────────────────────────────────────────┐
│           Production RAG Architecture           │
└─────────────────────────────────────────────────┘

                ┌──────────────┐
                │    Client    │
                │ (Web/Mobile) │
                └──────┬───────┘
                       │
                       ▼
                ┌──────────────┐
                │Load Balancer │
                └──────┬───────┘
                       │
        ┌──────────────┼──────────────┐
        │              │              │
        ▼              ▼              ▼
  ┌───────────┐  ┌───────────┐  ┌───────────┐
  │  API Pod  │  │  API Pod  │  │  API Pod  │
  │ (FastAPI) │  │ (FastAPI) │  │ (FastAPI) │
  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
        │              │              │
        └──────────────┼──────────────┘
                       │
        ┌──────────────┼──────────────┐
        │              │              │
        ▼              ▼              ▼
  ┌───────────┐  ┌───────────┐  ┌──────────────┐
  │   Redis   │  │  Qdrant   │  │  Monitoring  │
  │  (Cache)  │  │ (Vectors) │  │ (Prometheus) │
  └───────────┘  └───────────┘  └──────────────┘
```
Technology Recommendations
Based on CineRAG and other production deployments:
| Component | Recommended | Alternative | Notes |
|---|---|---|---|
| Vector DB | Qdrant | Pinecone, Weaviate | Qdrant for self-hosting, Pinecone for managed |
| Embeddings | Sentence-Transformers | OpenAI, Cohere | ST for cost/latency, OpenAI for max quality |
| Keyword Search | BM25 | Elasticsearch | Custom BM25 for simple cases, ES for complex |
| Cache | Redis | Memcached | Redis for persistence and data structures |
| Backend | FastAPI | Flask, Django | FastAPI for async and auto-docs |
| Deployment | Kubernetes | Docker Compose | K8s for scale, Compose for simplicity |
Conclusion
Production RAG systems require engineering discipline across seven phases: ingestion, embedding, query processing, retrieval, caching, evaluation, and deployment.
This checklist captures lessons from building real systems. Use it to catch issues before they become production incidents.
The most common failures:
- No caching (latency spikes under load)
- Pure vector search (misses exact matches)
- No evaluation metrics (can't measure quality)
- Ignoring cold start (first query is terrible)
Address these, and you're ahead of 90% of RAG deployments.
Building a RAG system? Let's discuss your architecture and requirements.
Check out the CineRAG case study for a complete implementation example.