Introduction
The gap between a working Jupyter notebook and a production ML system is enormous. We've deployed dozens of ML systems over the years, and we've learned (often the hard way) that certain failure modes repeat across projects.
This checklist distills those lessons into 20 concrete items to verify before deploying your model. It's organized by phase: Data, Model, Infrastructure, and Operations.
Use this as a pre-deployment review. If you can't confidently check off most of these items, you're not ready for production.
Phase 1: Data
1. Data Quality Checks Are Automated
✅ What to verify:
- Automated tests for nulls, outliers, and schema changes
- Tests run on every batch of new data
- Alerts trigger when checks fail
Why it matters: Models trained on clean data will fail spectacularly when fed garbage. Automate checks with tools like Great Expectations or dbt tests.
Example:
# Great Expectations example
# (`validator` is a Great Expectations Validator obtained from your data context/batch)
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_between("age", min_value=18, max_value=100)
2. Training/Serving Skew Is Minimized
✅ What to verify:
- Feature computation logic is identical in training and inference
- No time-travel leaks (features computed from data that wouldn't have been available at prediction time)
- Same data preprocessing pipeline in both environments
Why it matters: Your model looks great offline but underperforms in production because serving computes slightly different features than training did. Classic trap.
Best practice: Use feature stores (Feast, Tecton) or shared feature pipelines.
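If a feature store is overkill for your project, even a single shared module goes a long way. A minimal sketch, assuming hypothetical raw columns like total_spend and platform:
# features.py: imported by BOTH the training job and the inference service
import numpy as np
import pandas as pd

def compute_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Single source of truth for feature logic."""
    out = pd.DataFrame(index=raw.index)
    out["log_spend"] = np.log1p(raw["total_spend"].clip(lower=0))
    out["is_mobile"] = (raw["platform"] == "mobile").astype(int)
    return out

# training:  X_train = compute_features(historical_df)
# serving:   X_live  = compute_features(request_df)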
3. Data Versioning Is In Place
✅ What to verify:
- Training data snapshots are versioned and stored
- Can reproduce any model by referencing its training data version
- Data lineage is tracked
Why it matters: "It worked last week" is a common refrain. Without data versioning, you can't debug regressions.
Tools: DVC, MLflow Data, Delta Lake
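Even without dedicated tooling, you can at least make every model traceable to its data. A minimal sketch, assuming a single file-based training snapshot and MLflow for run tracking (the path is illustrative):
import hashlib
import mlflow

def file_sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

with mlflow.start_run():
    # Log the snapshot fingerprint next to the model so the run can be reproduced
    mlflow.log_param("training_data_sha256", file_sha256("data/train.parquet"))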
4. Input Data Has Schema Validation
✅ What to verify:
- Inference requests are validated against a schema
- Unexpected fields are rejected (or logged)
- Type mismatches cause clear errors
Why it matters: A missing field can crash your API or silently produce wrong predictions.
Example (Pydantic):
from pydantic import BaseModel

class PredictionRequest(BaseModel):
    user_id: int
    features: dict[str, float]
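For completeness, here's what a type mismatch looks like at the boundary (illustrative values):
from pydantic import ValidationError

try:
    PredictionRequest(user_id="not-an-int", features={"age": 42.0})
except ValidationError as exc:
    print(exc)  # names the offending field instead of silently producing a bad prediction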
Phase 2: Model
5. Model Metrics Are Clearly Defined
✅ What to verify:
- Metrics align with business outcomes (not just accuracy)
- Metrics are tracked over time
- Thresholds for acceptable performance are documented
Why it matters: Precision, recall, F1: which of these actually maps to a business outcome? If a churned customer costs $10K, weight your evaluation toward catching the customers you can least afford to lose.
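One way to make that concrete is to score candidate models on expected cost instead of raw accuracy. A sketch with placeholder dollar figures (swap in your own costs for a missed churner vs. an unnecessary retention offer):
from sklearn.metrics import confusion_matrix

COST_FALSE_NEGATIVE = 10_000  # missed churner (placeholder)
COST_FALSE_POSITIVE = 50      # retention offer sent to someone who wasn't leaving

def expected_cost(y_true, y_pred) -> float:
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE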
6. Model Has Been Tested on Edge Cases
✅ What to verify:
- Tested on minority classes, rare inputs, and adversarial examples
- Behavior on null/missing features is understood
- No catastrophic failures on out-of-distribution data
Why it matters: Models behave unpredictably at the edges. Test them explicitly.
Example edge cases:
- All-zero input vectors
- Extremely large or small feature values
- Inputs with missing required features
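These edge cases are easy to encode as regression tests. A minimal pytest sketch, assuming a `model` fixture from your test suite and a placeholder FEATURE_COUNT:
import numpy as np

FEATURE_COUNT = 20  # placeholder: match your feature vector length

def test_all_zero_input_does_not_crash(model):
    preds = model.predict(np.zeros((1, FEATURE_COUNT)))
    assert np.isfinite(preds).all()

def test_extreme_values_keep_probabilities_valid(model):
    proba = model.predict_proba(np.full((1, FEATURE_COUNT), 1e9))
    assert ((proba >= 0) & (proba <= 1)).all()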
7. Inference Latency Meets Requirements
✅ What to verify:
- p50, p95, p99 latencies measured under realistic load
- Batch prediction vs. real-time trade-offs understood
- Model size is optimized (quantization, pruning if needed)
Why it matters: A 10-second prediction is useless in a live customer interaction.
Target latencies:
- Real-time APIs: <100ms p95
- Batch scoring: depends on SLA (hourly? daily?)
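A proper load test (Locust, k6) is the real answer, but a quick sequential check catches gross problems early. A sketch, assuming a callable predict(payload):
import time
import numpy as np

def measure_latency(predict, payload, n=1000):
    samples_ms = []
    for _ in range(n):
        start = time.perf_counter()
        predict(payload)
        samples_ms.append((time.perf_counter() - start) * 1000)
    p50, p95, p99 = np.percentile(samples_ms, [50, 95, 99])
    print(f"p50={p50:.1f}ms  p95={p95:.1f}ms  p99={p99:.1f}ms")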
8. Model Explainability Is Sufficient
✅ What to verify:
- Can explain individual predictions (SHAP, LIME)
- Feature importance is documented
- Non-technical stakeholders can interpret outputs
Why it matters: "The model said so" isn't acceptable in regulated industries or when debugging errors.
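A minimal SHAP sketch for a tree-based model (`model` and `X_sample`, a small DataFrame of rows to explain, are assumed to exist):
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_sample)   # per-row, per-feature contributions
shap.summary_plot(shap_values, X_sample)        # global view of feature importance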
Phase 3: Infrastructure
9. Model Versioning Is Implemented
✅ What to verify:
- Models are versioned and stored in a registry
- Each deployment references a specific model version
- Rollback to previous version is simple
Why it matters: New model performing worse? Roll back immediately.
Tools: MLflow Model Registry, Weights & Biases, custom S3 + metadata DB
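With MLflow, for example, registering and pinning a version looks roughly like this (the run ID and model name are placeholders):
import mlflow

# after training: promote the run's model into the registry
mlflow.register_model("runs:/<run_id>/model", "churn-model")

# at deploy time: reference an explicit version, never an implicit "latest"
model = mlflow.pyfunc.load_model("models:/churn-model/3")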
10. Prediction API Has Rate Limiting
✅ What to verify:
- Rate limits protect against abuse and runaway requests
- Graceful degradation under load (return cached results or errors, don't crash)
Why it matters: A DDoS or a bug in a client can take down your entire service.
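Rate limiting usually belongs at the API gateway, but if you need something in-process, a token bucket is a simple sketch (per-process only; a shared store like Redis is needed across replicas):
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate, self.capacity = rate_per_sec, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond with HTTP 429, not crash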
11. Containerization & Reproducibility
✅ What to verify:
- Model + dependencies are containerized (Docker)
- Environment is reproducible (pinned versions)
- No "works on my machine" issues
Why it matters: If you can't reproduce the environment, you can't debug production issues.
12. Horizontal Scaling Is Possible
✅ What to verify:
- Can add more replicas to handle increased load
- No single points of failure
- Load balancer distributes traffic
Why it matters: Traffic spikes happen. Plan for 10x your normal load.
Phase 4: Operations
13. Monitoring Dashboards Exist
✅ What to verify:
- Track prediction volume, latency, error rates
- Business metrics (conversion, churn, revenue) if applicable
- Dashboards are reviewed regularly
Why it matters: If you're not monitoring it, you don't know when it breaks.
Key metrics:
- Requests per second
- Prediction latency (p50, p95, p99)
- Error rate
- Model confidence distribution
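Instrumenting these with prometheus_client is straightforward; the metric names and `model` object below are illustrative:
from prometheus_client import Counter, Histogram

PREDICTIONS = Counter("predictions_total", "Prediction requests", ["model_version"])
LATENCY = Histogram("prediction_latency_seconds", "Prediction latency")

@LATENCY.time()
def predict_with_metrics(features):
    PREDICTIONS.labels(model_version="v3").inc()
    return model.predict(features)  # `model` assumed to be loaded elsewhere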
14. Model Drift Detection Is Active
✅ What to verify:
- Input distribution drift is monitored (feature drift)
- Prediction distribution drift is tracked
- Alerts trigger when drift exceeds thresholds
Why it matters: Models degrade over time as data changes. Catch it early.
Tools: Evidently AI, WhyLabs, custom statistical tests
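A custom check can be as simple as a two-sample Kolmogorov-Smirnov test per feature, comparing training values against a recent serving window (the threshold is illustrative):
from scipy.stats import ks_2samp

def feature_drifted(train_values, live_values, alpha=0.01) -> bool:
    stat, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha  # small p-value: distributions differ, worth an alert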
15. Alerts Are Actionable
✅ What to verify:
- Alerts go to the right people
- Runbooks exist for common alerts
- Alert fatigue is minimized (no spam)
Why it matters: Too many alerts? People ignore them. Too few? Issues go unnoticed.
Good alert: "Churn model latency p99 >500ms for 5 minutes → Check DB connection"
16. Logging Is Comprehensive
✅ What to verify:
- Log inputs, outputs, timestamps, model version
- Logs are searchable and queryable
- PII is redacted
Why it matters: Debugging production issues requires visibility into what the model saw and predicted.
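A sketch of structured prediction logging with redaction; the field names and redaction list are examples, and the prediction is assumed to be JSON-serializable:
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("predictions")
REDACT_FIELDS = {"email", "phone"}  # example PII fields

def log_prediction(features: dict, prediction, model_version: str):
    safe = {k: ("<redacted>" if k in REDACT_FIELDS else v) for k, v in features.items()}
    logger.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "features": safe,
        "prediction": prediction,
    }))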
17. Rollback Plan Exists
✅ What to verify:
- Can roll back to previous model version in <5 minutes
- Process is documented and tested
- Post-mortems are written after incidents
Why it matters: New model causing issues? Revert fast, debug later.
18. A/B Testing Framework Is Ready
✅ What to verify:
- Can deploy new models to a subset of traffic
- Metrics are compared between control and treatment
- Statistical significance is calculated
Why it matters: Don't deploy a new model to 100% of users without validating it first.
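The traffic split itself can be a deterministic hash, so each user always sees the same variant; the 10% rollout share below is just an example:
import hashlib

def assign_variant(user_id: str, treatment_share: float = 0.10) -> str:
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 1000
    return "treatment" if bucket < treatment_share * 1000 else "control"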
19. Retraining Pipeline Is Automated
✅ What to verify:
- Model retraining is scheduled (daily? weekly?)
- New models are evaluated before deployment
- Human approval gate exists for production
Why it matters: Models degrade. Automate retraining or you'll fall behind.
20. Business Impact Is Measured
✅ What to verify:
- ML system's impact on KPIs is tracked
- ROI is understood (e.g., "Churn model saves $X/month")
- Stakeholders receive regular updates
Why it matters: If you can't measure impact, you can't justify the investment.
Conclusion
Production ML is more than training a model—it's building a reliable, monitored, maintainable system that creates business value.
Use this checklist as a pre-flight check before deploying. The more items you can confidently check off, the fewer 3am pages you'll get.
Need help building production ML systems? We've deployed dozens of models and know where the gotchas are. Get in touch.
Bonus: Downloadable Checklist
## Production ML Deployment Checklist
### Data
- [ ] Data quality checks automated
- [ ] Training/serving skew minimized
- [ ] Data versioning in place
- [ ] Input schema validation
### Model
- [ ] Metrics align with business outcomes
- [ ] Edge cases tested
- [ ] Latency meets requirements
- [ ] Explainability sufficient
### Infrastructure
- [ ] Model versioning implemented
- [ ] API rate limiting enabled
- [ ] Containerized & reproducible
- [ ] Horizontally scalable
### Operations
- [ ] Monitoring dashboards exist
- [ ] Model drift detection active
- [ ] Alerts are actionable
- [ ] Logging comprehensive
- [ ] Rollback plan documented
- [ ] A/B testing ready
- [ ] Retraining automated
- [ ] Business impact measured
Copy this and use it in your next deployment review!