Introduction
The gap between a working Jupyter notebook and a production ML system is enormous. We've deployed dozens of ML systems over the years, and we've learned (often the hard way) that certain failure modes repeat across projects.
This checklist distills those lessons into 20 concrete items to verify before deploying your model. It's organized by phase: Data, Model, Infrastructure, and Operations.
Use this as a pre-deployment review. If you can't confidently check off most of these items, you're not ready for production.
Phase 1: Data
1. Data Quality Checks Are Automated
✅ What to verify:
- Automated tests for nulls, outliers, and schema changes
- Tests run on every batch of new data
- Alerts trigger when checks fail
Why it matters: Models trained on clean data will fail spectacularly when fed garbage. Automate checks with tools like Great Expectations or dbt tests.
Example:
# Great Expectations example
# (`validator` is a Great Expectations Validator obtained from your data context/batch)
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_between("age", min_value=18, max_value=100)
2. Training/Serving Skew Is Minimized
✅ What to verify:
- Feature computation logic is identical in training and inference
- No time-travel leaks (features computed from data that wouldn't have been available at prediction time)
- Same data preprocessing pipeline in both environments
Why it matters: Your model looks great offline but underperforms in production because serving computes slightly different features than training did. Classic trap.
Best practice: Use feature stores (Feast, Tecton) or shared feature pipelines.
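If a feature store is overkill for your project, even a single shared module goes a long way. A minimal sketch, assuming hypothetical raw columns like total_spend and platform:
# features.py: imported by BOTH the training job and the inference service
import numpy as np
import pandas as pd

def compute_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Single source of truth for feature logic."""
    out = pd.DataFrame(index=raw.index)
    out["log_spend"] = np.log1p(raw["total_spend"].clip(lower=0))
    out["is_mobile"] = (raw["platform"] == "mobile").astype(int)
    return out

# training:  X_train = compute_features(historical_df)
# serving:   X_live  = compute_features(request_df)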
3. Data Versioning Is In Place
✅ What to verify:
- Training data snapshots are versioned and stored
- Can reproduce any model by referencing its training data version
- Data lineage is tracked
Why it matters: "It worked last week" is a common refrain. Without data versioning, you can't debug regressions.
Tools: DVC, MLflow Data, Delta Lake
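Even without dedicated tooling, you can at least make every model traceable to its data. A minimal sketch, assuming a single file-based training snapshot and MLflow for run tracking (the path is illustrative):
import hashlib
import mlflow

def file_sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

with mlflow.start_run():
    # Log the snapshot fingerprint next to the model so the run can be reproduced
    mlflow.log_param("training_data_sha256", file_sha256("data/train.parquet"))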
4. Input Data Has Schema Validation
✅ What to verify:
- Inference requests are validated against a schema
- Unexpected fields are rejected (or logged)
- Type mismatches cause clear errors
Why it matters: A missing field can crash your API or silently produce wrong predictions.
Example (Pydantic):
from pydantic import BaseModel

class PredictionRequest(BaseModel):
    user_id: int
    features: dict[str, float]
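For completeness, here's what a type mismatch looks like at the boundary (illustrative values):
from pydantic import ValidationError

try:
    PredictionRequest(user_id="not-an-int", features={"age": 42.0})
except ValidationError as exc:
    print(exc)  # names the offending field instead of silently producing a bad prediction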
Phase 2: Model
5. Model Metrics Are Clearly Defined
✅ What to verify:
- Metrics align with business outcomes (not just accuracy)
- Metrics are tracked over time
- Thresholds for acceptable performance are documented
Why it matters: Precision, recall, F1: which of these actually maps to a business outcome? If a churned customer costs $10K, weight your evaluation toward catching the customers you can least afford to lose.
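One way to make that concrete is to score candidate models on expected cost instead of raw accuracy. A sketch with placeholder dollar figures (swap in your own costs for a missed churner vs. an unnecessary retention offer):
from sklearn.metrics import confusion_matrix

COST_FALSE_NEGATIVE = 10_000  # missed churner (placeholder)
COST_FALSE_POSITIVE = 50      # retention offer sent to someone who wasn't leaving

def expected_cost(y_true, y_pred) -> float:
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE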
6. Model Has Been Tested on Edge Cases
✅ What to verify:
- Tested on minority classes, rare inputs, and adversarial examples
- Behavior on null/missing features is understood
- No catastrophic failures on out-of-distribution data
Why it matters: Models behave unpredictably at the edges. Test them explicitly.
Example edge cases:
- All-zero input vectors
- Extremely large or small feature values
- Inputs with missing required features
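These edge cases are easy to encode as regression tests. A minimal pytest sketch, assuming a `model` fixture from your test suite and a placeholder FEATURE_COUNT:
import numpy as np

FEATURE_COUNT = 20  # placeholder: match your feature vector length

def test_all_zero_input_does_not_crash(model):
    preds = model.predict(np.zeros((1, FEATURE_COUNT)))
    assert np.isfinite(preds).all()

def test_extreme_values_keep_probabilities_valid(model):
    proba = model.predict_proba(np.full((1, FEATURE_COUNT), 1e9))
    assert ((proba >= 0) & (proba <= 1)).all()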
7. Inference Latency Meets Requirements
✅ What to verify:
- p50, p95, p99 latencies measured under realistic load
- Batch prediction vs. real-time trade-offs understood
- Model size is optimized (quantization, pruning if needed)
Why it matters: A 10-second prediction is useless in a live customer interaction.
Target latencies:
- Real-time APIs: <100ms p95
- Batch scoring: depends on SLA (hourly? daily?)
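A proper load test (Locust, k6) is the real answer, but a quick sequential check catches gross problems early. A sketch, assuming a callable predict(payload):
import time
import numpy as np

def measure_latency(predict, payload, n=1000):
    samples_ms = []
    for _ in range(n):
        start = time.perf_counter()
        predict(payload)
        samples_ms.append((time.perf_counter() - start) * 1000)
    p50, p95, p99 = np.percentile(samples_ms, [50, 95, 99])
    print(f"p50={p50:.1f}ms  p95={p95:.1f}ms  p99={p99:.1f}ms")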
8. Model Explainability Is Sufficient
✅ What to verify:
- Can explain individual predictions (SHAP, LIME)
- Feature importance is documented
- Non-technical stakeholders can interpret outputs
Why it matters: "The model said so" isn't acceptable in regulated industries or when debugging errors.
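A minimal SHAP sketch for a tree-based model (`model` and `X_sample`, a small DataFrame of rows to explain, are assumed to exist):
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_sample)   # per-row, per-feature contributions
shap.summary_plot(shap_values, X_sample)        # global view of feature importance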
Phase 3: Infrastructure
9. Model Versioning Is Implemented
✅ What to verify:
- Models are versioned and stored in a registry
- Each deployment references a specific model version
- Rollback to previous version is simple
Why it matters: New model performing worse? Roll back immediately.
Tools: MLflow Model Registry, Weights & Biases, custom S3 + metadata DB
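With MLflow, for example, registering and pinning a version looks roughly like this (the run ID and model name are placeholders):
import mlflow

# after training: promote the run's model into the registry
mlflow.register_model("runs:/<run_id>/model", "churn-model")

# at deploy time: reference an explicit version, never an implicit "latest"
model = mlflow.pyfunc.load_model("models:/churn-model/3")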
10. Prediction API Has Rate Limiting
✅ What to verify:
- Rate limits protect against abuse and runaway requests
- Graceful degradation under load (return cached results or errors, don't crash)
Why it matters: A DDoS or a bug in a client can take down your entire service.
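Rate limiting usually belongs at the API gateway, but if you need something in-process, a token bucket is a simple sketch (per-process only; a shared store like Redis is needed across replicas):
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate, self.capacity = rate_per_sec, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond with HTTP 429, not crash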
11. Containerization & Reproducibility
✅ What to verify:
- Model + dependencies are containerized (Docker)
- Environment is reproducible (pinned versions)
- No "works on my machine" issues
Why it matters: If you can't reproduce the environment, you can't debug production issues.
12. Horizontal Scaling Is Possible
✅ What to verify:
- Can add more replicas to handle increased load
- No single points of failure
- Load balancer distributes traffic
Why it matters: Traffic spikes happen. Plan for 10x your normal load.
Phase 4: Operations
13. Monitoring Dashboards Exist
✅ What to verify:
- Track prediction volume, latency, error rates
- Business metrics (conversion, churn, revenue) if applicable
- Dashboards are reviewed regularly
Why it matters: If you're not monitoring it, you don't know when it breaks.
Key metrics:
- Requests per second
- Prediction latency (p50, p95, p99)
- Error rate
- Model confidence distribution
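Instrumenting these with prometheus_client is straightforward; the metric names and `model` object below are illustrative:
from prometheus_client import Counter, Histogram

PREDICTIONS = Counter("predictions_total", "Prediction requests", ["model_version"])
LATENCY = Histogram("prediction_latency_seconds", "Prediction latency")

@LATENCY.time()
def predict_with_metrics(features):
    PREDICTIONS.labels(model_version="v3").inc()
    return model.predict(features)  # `model` assumed to be loaded elsewhere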
14. Model Drift Detection Is Active
✅ What to verify:
- Input distribution drift is monitored (feature drift)
- Prediction distribution drift is tracked
- Alerts trigger when drift exceeds thresholds
Why it matters: Models degrade over time as data changes. Catch it early.
Tools: Evidently AI, WhyLabs, custom statistical tests
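A custom check can be as simple as a two-sample Kolmogorov-Smirnov test per feature, comparing training values against a recent serving window (the threshold is illustrative):
from scipy.stats import ks_2samp

def feature_drifted(train_values, live_values, alpha=0.01) -> bool:
    stat, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha  # small p-value: distributions differ, worth an alert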
15. Alerts Are Actionable
✅ What to verify:
- Alerts go to the right people
- Runbooks exist for common alerts
- Alert fatigue is minimized (no spam)
Why it matters: Too many alerts? People ignore them. Too few? Issues go unnoticed.
Good alert: "Churn model latency p99 >500ms for 5 minutes → Check DB connection"
16. Logging Is Comprehensive
✅ What to verify:
- Log inputs, outputs, timestamps, model version
- Logs are searchable and queryable
- PII is redacted
Why it matters: Debugging production issues requires visibility into what the model saw and predicted.
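A sketch of structured prediction logging with redaction; the field names and redaction list are examples, and the prediction is assumed to be JSON-serializable:
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("predictions")
REDACT_FIELDS = {"email", "phone"}  # example PII fields

def log_prediction(features: dict, prediction, model_version: str):
    safe = {k: ("<redacted>" if k in REDACT_FIELDS else v) for k, v in features.items()}
    logger.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "features": safe,
        "prediction": prediction,
    }))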
17. Rollback Plan Exists
✅ What to verify:
- Can roll back to previous model version in <5 minutes
- Process is documented and tested
- Post-mortems are written after incidents
Why it matters: New model causing issues? Revert fast, debug later.
18. A/B Testing Framework Is Ready
✅ What to verify:
- Can deploy new models to a subset of traffic
- Metrics are compared between control and treatment
- Statistical significance is calculated
Why it matters: Don't deploy a new model to 100% of users without validating it first.
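The traffic split itself can be a deterministic hash, so each user always sees the same variant; the 10% rollout share below is just an example:
import hashlib

def assign_variant(user_id: str, treatment_share: float = 0.10) -> str:
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 1000
    return "treatment" if bucket < treatment_share * 1000 else "control"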
19. Retraining Pipeline Is Automated
✅ What to verify:
- Model retraining is scheduled (daily? weekly?)
- New models are evaluated before deployment
- Human approval gate exists for production
Why it matters: Models degrade. Automate retraining or you'll fall behind.
20. Business Impact Is Measured
✅ What to verify:
- ML system's impact on KPIs is tracked
- ROI is understood (e.g., "Churn model saves $X/month")
- Stakeholders receive regular updates
Why it matters: If you can't measure impact, you can't justify the investment.
Conclusion
Production ML is more than training a model—it's building a reliable, monitored, maintainable system that creates business value.
Use this checklist as a pre-flight check before deploying. The more items you can confidently check off, the fewer 3am pages you'll get.
Need help building production ML systems? We've deployed dozens of models and know where the gotchas are. Get in touch.
Bonus: Downloadable Checklist
## Production ML Deployment Checklist
### Data
- [ ] Data quality checks automated
- [ ] Training/serving skew minimized
- [ ] Data versioning in place
- [ ] Input schema validation
### Model
- [ ] Metrics align with business outcomes
- [ ] Edge cases tested
- [ ] Latency meets requirements
- [ ] Explainability sufficient
### Infrastructure
- [ ] Model versioning implemented
- [ ] API rate limiting enabled
- [ ] Containerized & reproducible
- [ ] Horizontally scalable
### Operations
- [ ] Monitoring dashboards exist
- [ ] Model drift detection active
- [ ] Alerts are actionable
- [ ] Logging comprehensive
- [ ] Rollback plan documented
- [ ] A/B testing ready
- [ ] Retraining automated
- [ ] Business impact measured
Copy this and use it in your next deployment review!