
The RAG Engineering Checklist: 25 Things to Verify Before Deploying Your Retrieval System

Dr. Jody-Ann Jones
December 10, 2025
15 min read

A comprehensive checklist for building production-ready RAG systems. Covers ingestion, embedding, retrieval, caching, evaluation, and deployment.

Tags: RAG · LLM · Vector Search · Production · Checklist · MLOps

Introduction

Retrieval-Augmented Generation (RAG) has become the go-to pattern for building AI-powered search, Q&A, and recommendation systems. But the gap between a working prototype and a production system is vast.

After building CineRAG—a movie recommendation engine handling 1000+ QPS with sub-50ms latency—I've compiled this checklist of everything you need to verify before deploying a RAG system.

Use this as a pre-deployment review. If you can't confidently check off most of these items, you're not ready for production.


Phase 1: Data Ingestion

1. Data Sources Are Documented

What to verify:

  • All data sources are identified and documented
  • Data formats, schemas, and update frequencies are known
  • Access credentials are securely stored
  • Data lineage is traceable

Why it matters: You can't debug retrieval issues if you don't know where the data came from.


2. Data Enrichment Is Automated

What to verify:

  • External data sources (APIs, databases) are integrated
  • Enrichment pipeline handles failures gracefully
  • Rate limits and quotas are respected
  • Fallback values exist for missing enrichments

Example: For CineRAG, I enriched MovieLens data with the TMDB API (posters, descriptions, cast). Without this enrichment, the embeddings were too sparse.


3. Data Quality Checks Are In Place

What to verify:

  • Automated validation for null values, outliers, and schema changes
  • Tests run on every data refresh
  • Alerts trigger when quality thresholds are breached
  • Bad records are quarantined, not silently dropped

Why it matters: Garbage in, garbage out. Bad data produces bad embeddings, which in turn produce bad results.
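
A minimal sketch of what such checks might look like, assuming a pandas DataFrame of movie records with illustrative column names (title, year, rating):

```python
import pandas as pd

def validate_batch(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split a refresh batch into clean rows and quarantined rows."""
    problems = pd.Series(False, index=df.index)

    # Required fields must be present.
    problems |= df["title"].isna()

    # Simple range checks catch outliers and schema drift.
    problems |= ~df["year"].between(1870, 2030)
    problems |= ~df["rating"].between(0.0, 10.0)

    quarantined = df[problems]   # keep for inspection; never drop silently
    clean = df[~problems]

    if len(quarantined) / max(len(df), 1) > 0.05:
        # In production this would fire an alert rather than raise.
        raise ValueError(f"Quality threshold breached: {len(quarantined)} bad rows")
    return clean, quarantined
```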


4. Data Versioning Is Implemented

What to verify:

  • Data snapshots are versioned and stored
  • Can reproduce any index by referencing its data version
  • Rollback to previous data version is possible

Why it matters: "The results were better last week" requires knowing what data powered last week's index.


Phase 2: Embedding & Indexing

5. Embedding Strategy Is Documented

What to verify:

  • What text is embedded (title only? title + description? metadata?)
  • Chunking strategy for long documents
  • Embedding model choice and version
  • Dimensionality and similarity metric

Best practice: Document your embedding formula:

embedding = embed(f"{title}. {description}. Genres: {genres}. Keywords: {keywords}")
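
A sketch of how that formula might be applied with Sentence-Transformers; the model name and field names here are illustrative, not the exact CineRAG code:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim; pin the exact version you deploy

def build_embedding_text(movie: dict) -> str:
    # Keep the formula in one place so the index and query-time embeddings stay consistent.
    return (
        f"{movie['title']}. {movie['description']}. "
        f"Genres: {', '.join(movie['genres'])}. Keywords: {', '.join(movie['keywords'])}"
    )

movies = [{
    "title": "Inception",
    "description": "A thief steals corporate secrets through dream-sharing technology.",
    "genres": ["Action", "Sci-Fi"],
    "keywords": ["dreams", "heist"],
}]

texts = [build_embedding_text(m) for m in movies]
vectors = model.encode(texts, normalize_embeddings=True, batch_size=64)
```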

6. Embedding Model Is Appropriate

What to verify:

  • Model is evaluated on your domain (not just general benchmarks)
  • Latency is acceptable for your use case
  • Cost is sustainable at scale
  • Model is versioned (avoid drift from API updates)

Trade-offs:

| Model | Latency | Cost | Quality |
| --- | --- | --- | --- |
| OpenAI ada-002 | ~100ms | $0.0001 / 1K tokens | Excellent |
| Sentence-Transformers | ~10ms | Free (self-hosted) | Very Good |
| Cohere embed | ~50ms | $0.0001 / 1K tokens | Excellent |

7. Chunking Strategy Is Validated

What to verify:

  • Chunk size balances context and retrieval precision
  • Overlapping chunks prevent boundary issues
  • Document structure is preserved (headings, sections)
  • Edge cases tested (very short/long documents)

For CineRAG: Each movie is its own "chunk," since items are atomic. For documents, I recommend 512-token chunks with a 50-token overlap.
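
For document corpora, a token-window chunker along those lines might look like this (whitespace tokenization is used for illustration; a real tokenizer such as tiktoken would be more accurate):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of roughly chunk_size tokens."""
    tokens = text.split()  # crude whitespace tokenization for illustration
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # avoid emitting a near-duplicate tail chunk
    return chunks
```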


8. Vector Index Is Optimized

What to verify:

  • Index type matches your scale (HNSW for under 1M vectors, IVF for larger)
  • Index parameters are tuned (ef, M for HNSW)
  • Recall is measured and acceptable
  • Rebuild strategy exists for index drift

Why it matters: Default index settings prioritize build speed, not query performance. Tune for your use case.
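
As one example, Qdrant (used later in this post) exposes the HNSW parameters mentioned above at collection-creation and query time; the values below are starting points to tune, not universal recommendations:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, SearchParams, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Build-time parameters: a denser graph costs memory and indexing time, buys recall.
client.recreate_collection(
    collection_name="movies",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=32, ef_construct=256),
)

# Query-time ef trades latency for recall; measure Recall@K against exact search to tune it.
hits = client.search(
    collection_name="movies",
    query_vector=[0.1] * 384,              # placeholder query embedding
    limit=10,
    search_params=SearchParams(hnsw_ef=128),
)
```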


Phase 3: Query Processing

9. Query Preprocessing Is Robust

What to verify:

  • Normalization (lowercase, punctuation handling)
  • Spell correction for common errors
  • Synonym expansion (optional but valuable)
  • Input validation and sanitization

Impact: Query preprocessing improved CineRAG relevance by 15%.
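
A minimal preprocessing pass along these lines (the synonym map and length cap are illustrative):

```python
import re

SYNONYMS = {"sci fi": "science fiction", "romcom": "romantic comedy"}  # illustrative

def preprocess_query(raw: str, max_len: int = 256) -> str:
    q = raw.strip().lower()[:max_len]           # validation: bound input size
    q = re.sub(r"[^\w\s'-]", " ", q)            # drop stray punctuation
    q = re.sub(r"\s+", " ", q).strip()          # collapse whitespace
    for phrase, expansion in SYNONYMS.items():  # cheap synonym normalization
        q = q.replace(phrase, expansion)
    return q

print(preprocess_query("  Sci Fi movies!!  like   Interstellar "))
# -> "science fiction movies like interstellar"
```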


10. Intent Detection Is Implemented

What to verify:

  • Different query types are identified (search, filter, similarity)
  • Routing logic is documented
  • Fallback behavior exists for ambiguous queries
  • Intent accuracy is measured

Example intents:

  • "action movies from 2020" → filter query
  • "movies like Inception" → similarity query
  • "best Tom Hanks performances" → hybrid query

11. Query Expansion Is Tested

What to verify:

  • Synonyms improve recall without hurting precision
  • Expansion doesn't over-generalize (keep it relevant)
  • Performance impact is acceptable
  • Expansion can be disabled for exact-match needs

Phase 4: Retrieval

12. Hybrid Search Is Implemented

What to verify:

  • Vector search for semantic matching
  • Keyword search (BM25) for exact matches
  • Weighted combination is tuned
  • Each component can be debugged independently

Why hybrid: Pure vector search missed exact matches (actor names, titles) 23% of the time for CineRAG.
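
One common way to combine the two signals is a weighted sum over min-max-normalized scores (Reciprocal Rank Fusion is a popular alternative); a sketch assuming each retriever returns a doc-ID-to-score mapping:

```python
def hybrid_merge(vector_scores: dict, bm25_scores: dict, alpha: float = 0.7) -> list:
    """Blend semantic and keyword scores; alpha is tuned on a labeled set."""
    def normalize(scores: dict) -> dict:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    v, k = normalize(vector_scores), normalize(bm25_scores)
    combined = {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * k.get(doc, 0.0)
        for doc in set(v) | set(k)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

Because each component is scored separately before the merge, you can log and debug the vector and keyword sides independently.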


13. Metadata Filtering Works

What to verify:

  • Filters are applied efficiently (before vector search if possible)
  • Filter combinations are tested
  • Empty results are handled gracefully
  • Filter UI matches backend capabilities

Example filters:

  • Year range: 2015-2020
  • Minimum rating: 7.0+
  • Genres: include "Action", exclude "Horror"
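
With Qdrant, filters like these become payload conditions evaluated alongside the vector search rather than as a post-filter; a sketch with illustrative payload field names:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue, Range

client = QdrantClient(url="http://localhost:6333")
query_vector = [0.1] * 384                    # stand-in for the embedded user query

movie_filter = Filter(
    must=[
        FieldCondition(key="year", range=Range(gte=2015, lte=2020)),
        FieldCondition(key="rating", range=Range(gte=7.0)),
        FieldCondition(key="genres", match=MatchValue(value="Action")),
    ],
    must_not=[
        FieldCondition(key="genres", match=MatchValue(value="Horror")),
    ],
)

hits = client.search(
    collection_name="movies",
    query_vector=query_vector,
    query_filter=movie_filter,                # applied inside the index, not post-hoc
    limit=10,
)
```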

14. Reranking Is Evaluated

What to verify:

  • Reranking model improves relevance metrics
  • Latency overhead is acceptable (typically 50-100ms)
  • Reranking is optional/tunable
  • Fallback exists if reranking fails

When to use: Reranking shines when initial retrieval returns 20+ candidates. Cross-encoder reranking can improve NDCG by 10-20%.
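
A sketch of cross-encoder reranking with Sentence-Transformers (the model name is illustrative; in production, wrap the call in a timeout and fall back to the original order on failure):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # small, fast cross-encoder

def rerank(query: str, candidates: list[dict], top_k: int = 10) -> list[dict]:
    """Re-score first-stage candidates; each candidate carries a 'text' field."""
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```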


15. Result Diversity Is Considered

What to verify:

  • Results aren't all from the same cluster
  • MMR (Maximal Marginal Relevance) or similar is implemented
  • Diversity level is tunable
  • Business rules for diversity are documented
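
A minimal MMR implementation, assuming unit-normalized numpy embeddings; lambda trades relevance against novelty:

```python
import numpy as np

def mmr(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 10, lam: float = 0.7) -> list[int]:
    """Pick k documents balancing query relevance and novelty (vectors unit-normalized)."""
    relevance = doc_vecs @ query_vec                      # cosine similarity to the query
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        if not selected:
            idx = int(np.argmax(relevance[candidates]))
        else:
            sim_to_selected = doc_vecs[candidates] @ doc_vecs[selected].T   # (n_cand, n_sel)
            penalty = sim_to_selected.max(axis=1)          # similarity to closest already-picked doc
            idx = int(np.argmax(lam * relevance[candidates] - (1 - lam) * penalty))
        best = candidates[idx]
        selected.append(best)
        candidates.remove(best)
    return selected
```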

Phase 5: Caching & Performance

16. Multi-Tier Caching Is Implemented

What to verify:

  • Hot cache (in-memory LRU) for frequent queries
  • Warm cache (Redis) for distributed access
  • Cache invalidation strategy is defined
  • Cache hit rate is monitored

Target: 30%+ cache hit rate. CineRAG achieves 40%+.
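
A sketch of the hot/warm layering: an in-process LRU in front of Redis, with the full retrieval pipeline as the final fallback (key scheme and TTL are illustrative):

```python
import hashlib
import json
from collections import OrderedDict

import redis

r = redis.Redis(host="localhost", port=6379)
HOT_CAPACITY = 10_000
hot: OrderedDict[str, list] = OrderedDict()    # tier 1: in-process LRU

def cache_key(query: str, filters: dict) -> str:
    raw = json.dumps({"q": query, "f": filters}, sort_keys=True)
    return "rag:" + hashlib.sha256(raw.encode()).hexdigest()

def get_results(query: str, filters: dict, retrieve) -> list:
    key = cache_key(query, filters)
    if key in hot:                              # tier 1: memory, microseconds
        hot.move_to_end(key)
        return hot[key]
    cached = r.get(key)                         # tier 2: Redis, ~1ms
    if cached is not None:
        results = json.loads(cached)
    else:
        results = retrieve(query, filters)      # tier 3: full retrieval pipeline
        r.setex(key, 3600, json.dumps(results))
    hot[key] = results
    if len(hot) > HOT_CAPACITY:
        hot.popitem(last=False)                 # evict least-recently-used entry
    return results
```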


17. Latency Targets Are Met

What to verify:

  • p50, p95, p99 latencies are measured
  • Targets are defined for each percentile
  • Latency is monitored in production
  • Alerts trigger on degradation

Typical targets:

  • p50: under 50ms
  • p95: under 100ms
  • p99: under 200ms

18. Throughput Is Load Tested

What to verify:

  • System handles 10x expected traffic
  • Graceful degradation under overload
  • Auto-scaling is configured (if applicable)
  • Load tests run regularly

19. Cold Start Is Addressed

What to verify:

  • First query latency is acceptable
  • Cache warming on startup
  • Connection pools are pre-initialized
  • Lazy loading doesn't cause timeouts

Phase 6: Evaluation & Monitoring

20. Relevance Metrics Are Tracked

What to verify:

  • NDCG, MAP, MRR, Recall@K are implemented
  • Metrics run on a test set regularly
  • Thresholds for acceptable performance are defined
  • Alerts trigger on metric degradation

Why it matters: Without metrics, "did we break something?" is unanswerable.
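
Recall@K and MRR are only a few lines each; a sketch over a labeled test set where ground truth is a set of relevant IDs per query:

```python
def recall_at_k(retrieved: list, relevant: set, k: int = 10) -> float:
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def mrr(retrieved: list, relevant: set) -> float:
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# Run over the whole test set on every index rebuild and alert on regressions.
test_set = [(["m1", "m7", "m3"], {"m3", "m9"})]   # (retrieved IDs, relevant IDs) - illustrative
avg_recall = sum(recall_at_k(r, rel) for r, rel in test_set) / len(test_set)
avg_mrr = sum(mrr(r, rel) for r, rel in test_set) / len(test_set)
```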


21. Query Logs Are Captured

What to verify:

  • All queries logged (anonymized if needed)
  • Results and relevance scores logged
  • User feedback (if available) is captured
  • Logs are searchable and queryable

22. Error Handling Is Comprehensive

What to verify:

  • Errors return helpful messages (not stack traces)
  • Fallback results for failed retrievals
  • Circuit breakers for external dependencies
  • Error rates are monitored

23. Dashboards Exist

What to verify:

  • Query volume and latency
  • Cache hit rates
  • Error rates
  • Relevance metrics over time

Phase 7: Deployment & Operations

24. Containerization Is Complete

What to verify:

  • All components are containerized
  • Dependencies are pinned
  • Health checks are implemented
  • Environment configuration is externalized

CineRAG: a single docker-compose up brings up the entire stack (API, Qdrant, Redis, frontend).


25. Rollback Plan Exists

What to verify:

  • Previous version can be deployed in under 5 minutes
  • Data rollback procedure is documented
  • Post-incident review process exists
  • Runbooks exist for common failures

Quick Reference Checklist

Copy this for your next deployment review:

## RAG Deployment Checklist

### Data Ingestion
- [ ] Data sources documented
- [ ] Enrichment pipeline automated
- [ ] Quality checks in place
- [ ] Data versioning implemented

### Embedding & Indexing
- [ ] Embedding strategy documented
- [ ] Model evaluated on domain
- [ ] Chunking validated
- [ ] Vector index optimized

### Query Processing
- [ ] Preprocessing robust
- [ ] Intent detection implemented
- [ ] Query expansion tested

### Retrieval
- [ ] Hybrid search implemented
- [ ] Metadata filtering works
- [ ] Reranking evaluated
- [ ] Result diversity considered

### Performance
- [ ] Multi-tier caching implemented
- [ ] Latency targets met (p50 under 50ms, p99 under 200ms)
- [ ] Throughput load tested
- [ ] Cold start addressed

### Evaluation & Monitoring
- [ ] Relevance metrics tracked (NDCG, MAP, MRR)
- [ ] Query logs captured
- [ ] Error handling comprehensive
- [ ] Dashboards exist

### Deployment
- [ ] Containerization complete
- [ ] Rollback plan exists

RAG Architecture Template

For reference, here's the production architecture I recommend:

┌─────────────────────────────────────────────────────────────────┐
│                    Production RAG Architecture                   │
└─────────────────────────────────────────────────────────────────┘

                        ┌─────────────┐
                        │   Client    │
                        │ (Web/Mobile)│
                        └──────┬──────┘
                               │
                               ▼
                        ┌─────────────┐
                        │Load Balancer│
                        └──────┬──────┘
                               │
         ┌─────────────────────┼─────────────────────┐
         │                     │                     │
         ▼                     ▼                     ▼
   ┌───────────┐         ┌───────────┐         ┌───────────┐
   │  API Pod  │         │  API Pod  │         │  API Pod  │
   │ (FastAPI) │         │ (FastAPI) │         │ (FastAPI) │
   └─────┬─────┘         └─────┬─────┘         └─────┬─────┘
         │                     │                     │
         └─────────────────────┼─────────────────────┘
                               │
              ┌────────────────┼────────────────┐
              │                │                │
              ▼                ▼                ▼
         ┌───────────┐   ┌───────────┐   ┌────────────┐
         │   Redis   │   │  Qdrant   │   │ Monitoring │
         │  (Cache)  │   │ (Vectors) │   │(Prometheus)│
         └───────────┘   └───────────┘   └────────────┘

Technology Recommendations

Based on CineRAG and other production deployments:

| Component | Recommended | Alternative | Notes |
| --- | --- | --- | --- |
| Vector DB | Qdrant | Pinecone, Weaviate | Qdrant for self-hosting, Pinecone for managed |
| Embeddings | Sentence-Transformers | OpenAI, Cohere | ST for cost/latency, OpenAI for max quality |
| Keyword Search | BM25 | Elasticsearch | Custom BM25 for simple cases, ES for complex |
| Cache | Redis | Memcached | Redis for persistence and data structures |
| Backend | FastAPI | Flask, Django | FastAPI for async and auto-docs |
| Deployment | Kubernetes | Docker Compose | K8s for scale, Compose for simplicity |

Conclusion

Production RAG systems require engineering discipline across seven phases: ingestion, embedding, query processing, retrieval, caching, evaluation, and deployment.

This checklist captures lessons from building real systems. Use it to catch issues before they become production incidents.

The most common failures:

  1. No caching (latency spikes under load)
  2. Pure vector search (misses exact matches)
  3. No evaluation metrics (can't measure quality)
  4. Ignoring cold start (first query is terrible)

Address these, and you're ahead of most RAG deployments.


Building a RAG system? Let's discuss your architecture and requirements.

Check out the CineRAG case study for a complete implementation example.
