Best Practices

The Production ML Checklist: 20 Things to Verify Before Deploying Your Model

The Data Sensei Team
March 1, 2025
10 min read

A practical checklist drawn from deploying dozens of ML systems. Use this to catch issues before they cause 3am pages.

Machine Learning
MLOps
Production
Checklist

Introduction

The gap between a working Jupyter notebook and a production ML system is enormous. We've deployed dozens of ML systems over the years, and we've learned (often the hard way) that certain failure modes repeat across projects.

This checklist distills those lessons into 20 concrete items to verify before deploying your model. It's organized by phase: Data, Model, Infrastructure, and Operations.

Use this as a pre-deployment review. If you can't confidently check off most of these items, you're not ready for production.


Phase 1: Data

1. Data Quality Checks Are Automated

What to verify:

  • Automated tests for nulls, outliers, and schema changes
  • Tests run on every batch of new data
  • Alerts trigger when checks fail

Why it matters: Models trained on clean data will fail spectacularly when fed garbage. Automate checks with tools like Great Expectations or dbt tests.

Example:

# Great Expectations example (assumes `validator` is a Validator bound to a batch of new data)
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_between("age", min_value=18, max_value=100)

2. Training/Serving Skew Is Minimized

What to verify:

  • Feature computation logic is identical in training and inference
  • No time-travel leaks (using future data to predict the past)
  • Same data preprocessing pipeline in both environments

Why it matters: Your model looks great offline, then underperforms in production because it's scoring features computed slightly differently than the ones it was trained on. Classic trap.

Best practice: Use feature stores (Feast, Tecton) or shared feature pipelines.
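
Example (a sketch, not a full feature store): the same function computes features for both training and serving, so the logic can't silently diverge. The column names here are illustrative.

import pandas as pd

def compute_features(raw: pd.DataFrame) -> pd.DataFrame:
    # One definition per feature, imported by both the training job and the
    # inference service, so training and serving can't drift apart.
    features = pd.DataFrame(index=raw.index)
    features["days_since_signup"] = (raw["as_of_date"] - raw["signup_date"]).dt.days
    features["orders_per_month"] = raw["order_count"] / raw["tenure_months"].clip(lower=1)
    return features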


3. Data Versioning Is In Place

What to verify:

  • Training data snapshots are versioned and stored
  • Can reproduce any model by referencing its training data version
  • Data lineage is tracked

Why it matters: "It worked last week" is a common refrain. Without data versioning, you can't debug regressions.

Tools: DVC, MLflow Data, Delta Lake
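
Example (a sketch using DVC's Python API; the path and tag are illustrative): load the exact data snapshot a model was trained on by referencing its revision.

import dvc.api
import pandas as pd

# Read the training data exactly as it existed at the tagged revision
with dvc.api.open("data/train.csv", rev="churn-model-v1.2") as f:
    train_df = pd.read_csv(f)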


4. Input Data Has Schema Validation

What to verify:

  • Inference requests are validated against a schema
  • Unexpected fields are rejected (or logged)
  • Type mismatches cause clear errors

Why it matters: A missing field can crash your API or silently produce wrong predictions.

Example (Pydantic):

from pydantic import BaseModel, ConfigDict

class PredictionRequest(BaseModel):
    model_config = ConfigDict(extra="forbid")  # reject unexpected fields (Pydantic v2)

    user_id: int
    features: dict[str, float]
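
With this in place, a malformed request fails loudly with a clear error instead of silently producing a wrong prediction:

from pydantic import ValidationError

try:
    PredictionRequest(user_id=42)  # "features" is missing
except ValidationError as err:
    print(err)  # names the missing field and the expected type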

Phase 2: Model

5. Model Metrics Are Clearly Defined

What to verify:

  • Metrics align with business outcomes (not just accuracy)
  • Metrics are tracked over time
  • Thresholds for acceptable performance are documented

Why it matters: Precision, recall, F1: which of these actually matters to your business? If a churned customer costs $10K, optimize for catching high-value at-risk customers, not for raw accuracy.
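
Example (a sketch; the dollar costs are illustrative): score the model by expected business cost rather than accuracy, so a missed churner weighs more than a wasted retention offer.

from sklearn.metrics import confusion_matrix

def expected_cost(y_true, y_pred, fn_cost=10_000, fp_cost=500):
    # False negatives (missed churners) cost far more than false positives
    # (unnecessary retention offers), so minimize total cost, not error count.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fn * fn_cost + fp * fp_cost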


6. Model Has Been Tested on Edge Cases

What to verify:

  • Tested on minority classes, rare inputs, and adversarial examples
  • Behavior on null/missing features is understood
  • No catastrophic failures on out-of-distribution data

Why it matters: Models behave unpredictably at the edges. Test them explicitly.

Example edge cases:

  • All-zero input vectors
  • Extremely large or small feature values
  • Inputs with missing required features
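
These cases can live as explicit tests. A sketch, assuming a scikit-learn-style predict() and a pytest fixture that provides the model (the feature count is illustrative):

import numpy as np

N_FEATURES = 12  # illustrative feature count

def test_all_zero_input(model):
    preds = model.predict(np.zeros((1, N_FEATURES)))
    assert np.isfinite(preds).all()  # no NaN/inf on a degenerate input

def test_extreme_feature_values(model):
    preds = model.predict(np.full((1, N_FEATURES), 1e9))
    assert np.isfinite(preds).all()  # no overflow-driven nonsense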

7. Inference Latency Meets Requirements

What to verify:

  • p50, p95, p99 latencies measured under realistic load
  • Batch prediction vs. real-time trade-offs understood
  • Model size is optimized (quantization, pruning if needed)

Why it matters: A 10-second prediction is useless in a live customer interaction.

Target latencies:

  • Real-time APIs: <100ms p95
  • Batch scoring: depends on SLA (hourly? daily?)
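
Example (a single-process sketch for a first pass; use a load-testing tool like Locust or k6 for realistic concurrent load). Here predict_fn stands in for your real inference call:

import time
import numpy as np

def measure_latency_ms(predict_fn, sample_input, n_requests=1000):
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        predict_fn(sample_input)
        latencies.append((time.perf_counter() - start) * 1000)
    p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
    return p50, p95, p99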

8. Model Explainability Is Sufficient

What to verify:

  • Can explain individual predictions (SHAP, LIME)
  • Feature importance is documented
  • Non-technical stakeholders can interpret outputs

Why it matters: "The model said so" isn't acceptable in regulated industries or when debugging errors.


Phase 3: Infrastructure

9. Model Versioning Is Implemented

What to verify:

  • Models are versioned and stored in a registry
  • Each deployment references a specific model version
  • Rollback to previous version is simple

Why it matters: New model performing worse? Roll back immediately.

Tools: MLflow Model Registry, Weights & Biases, custom S3 + metadata DB
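
Example (a sketch using the MLflow Model Registry; the run ID placeholder and model name are illustrative):

import mlflow

# After training: register the run's logged model under a named registry entry
mlflow.register_model("runs:/<run_id>/model", "churn_model")

# At serving time: pin an explicit version so rollback is a one-line change
model = mlflow.pyfunc.load_model("models:/churn_model/3")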


10. Prediction API Has Rate Limiting

What to verify:

  • Rate limits protect against abuse and runaway requests
  • Graceful degradation under load (return cached results or errors, don't crash)

Why it matters: A DDoS or a bug in a client can take down your entire service.
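
In practice, rate limiting usually lives in your API gateway or a middleware library, but the core idea is a token bucket, sketched here:

import time

class TokenBucket:
    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self):
        # Refill proportionally to elapsed time, then spend one token per request
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False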


11. Containerization & Reproducibility

What to verify:

  • Model + dependencies are containerized (Docker)
  • Environment is reproducible (pinned versions)
  • No "works on my machine" issues

Why it matters: If you can't reproduce the environment, you can't debug production issues.


12. Horizontal Scaling Is Possible

What to verify:

  • Can add more replicas to handle increased load
  • No single points of failure
  • Load balancer distributes traffic

Why it matters: Traffic spikes happen. Plan for 10x your normal load.


Phase 4: Operations

13. Monitoring Dashboards Exist

What to verify:

  • Track prediction volume, latency, error rates
  • Business metrics (conversion, churn, revenue) if applicable
  • Dashboards are reviewed regularly

Why it matters: If you're not monitoring it, you don't know when it breaks.

Key metrics:

  • Requests per second
  • Prediction latency (p50, p95, p99)
  • Error rate
  • Model confidence distribution
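
Example (a sketch using the prometheus_client library; metric and label names are illustrative):

from prometheus_client import Counter, Histogram

PREDICTIONS = Counter("predictions_total", "Prediction requests", ["model_version"])
ERRORS = Counter("prediction_errors_total", "Failed predictions", ["model_version"])
LATENCY = Histogram("prediction_latency_seconds", "Prediction latency", ["model_version"])

def predict_with_metrics(model, features, version="v3"):
    PREDICTIONS.labels(model_version=version).inc()
    with LATENCY.labels(model_version=version).time():  # records request duration
        try:
            return model.predict(features)
        except Exception:
            ERRORS.labels(model_version=version).inc()
            raise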

14. Model Drift Detection Is Active

What to verify:

  • Input distribution drift is monitored (feature drift)
  • Prediction distribution drift is tracked
  • Alerts trigger when drift exceeds thresholds

Why it matters: Models degrade over time as data changes. Catch it early.

Tools: Evidently AI, WhyLabs, custom statistical tests
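
A custom statistical test can be as simple as a per-feature two-sample Kolmogorov-Smirnov check (a sketch; the threshold is illustrative):

from scipy.stats import ks_2samp

def feature_drifted(train_values, prod_values, p_threshold=0.01):
    # A small p-value suggests the production distribution no longer
    # matches the distribution the model was trained on.
    statistic, p_value = ks_2samp(train_values, prod_values)
    return p_value < p_threshold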


15. Alerts Are Actionable

What to verify:

  • Alerts go to the right people
  • Runbooks exist for common alerts
  • Alert fatigue is minimized (no spam)

Why it matters: Too many alerts? People ignore them. Too few? Issues go unnoticed.

Good alert: "Churn model latency p99 >500ms for 5 minutes → Check DB connection"


16. Logging Is Comprehensive

What to verify:

  • Log inputs, outputs, timestamps, model version
  • Logs are searchable and queryable
  • PII is redacted

Why it matters: Debugging production issues requires visibility into what the model saw and predicted.
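
Example (a structured-logging sketch; field names are illustrative):

import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("predictions")

def log_prediction(request_id, model_version, inputs, output):
    # Redact or hash any PII in `inputs` before it reaches this point.
    logger.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "model_version": model_version,
        "inputs": inputs,
        "output": output,
    }))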


17. Rollback Plan Exists

What to verify:

  • Can roll back to previous model version in <5 minutes
  • Process is documented and tested
  • Post-mortems are written after incidents

Why it matters: New model causing issues? Revert fast, debug later.


18. A/B Testing Framework Is Ready

What to verify:

  • Can deploy new models to a subset of traffic
  • Metrics are compared between control and treatment
  • Statistical significance is calculated

Why it matters: Don't deploy a new model to 100% of users without validating it first.
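
Example (a sketch of deterministic traffic splitting; the 10% fraction is illustrative):

import hashlib

def assign_variant(user_id, treatment_fraction=0.10):
    # Hash-based bucketing keeps the assignment stable across requests for the same user
    bucket = int(hashlib.sha256(str(user_id).encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < treatment_fraction * 100 else "control"

Hash-based assignment also makes the experiment reproducible: the same user always sees the same model version.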


19. Retraining Pipeline Is Automated

What to verify:

  • Model retraining is scheduled (daily? weekly?)
  • New models are evaluated before deployment
  • Human approval gate exists for production

Why it matters: Models degrade. Automate retraining or you'll fall behind.
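
The evaluation step can be as simple as a comparison gate (a sketch; the metric and threshold are illustrative), with the human approval step layered on top:

def evaluate_candidate(candidate_auc, production_auc, min_improvement=0.005):
    # Reject candidates that are not clearly better; anything that passes
    # still waits for explicit human sign-off before reaching production.
    if candidate_auc < production_auc + min_improvement:
        return "reject"
    return "pending_approval"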


20. Business Impact Is Measured

What to verify:

  • ML system's impact on KPIs is tracked
  • ROI is understood (e.g., "Churn model saves $X/month")
  • Stakeholders receive regular updates

Why it matters: If you can't measure impact, you can't justify the investment.


Conclusion

Production ML is more than training a model—it's building a reliable, monitored, maintainable system that creates business value.

Use this checklist as a pre-flight check before deploying. The more items you can confidently check off, the fewer 3am pages you'll get.

Need help building production ML systems? We've deployed dozens of models and know where the gotchas are. Get in touch.


Bonus: Downloadable Checklist

## Production ML Deployment Checklist

### Data
- [ ] Data quality checks automated
- [ ] Training/serving skew minimized
- [ ] Data versioning in place
- [ ] Input schema validation

### Model
- [ ] Metrics align with business outcomes
- [ ] Edge cases tested
- [ ] Latency meets requirements
- [ ] Explainability sufficient

### Infrastructure
- [ ] Model versioning implemented
- [ ] API rate limiting enabled
- [ ] Containerized & reproducible
- [ ] Horizontally scalable

### Operations
- [ ] Monitoring dashboards exist
- [ ] Model drift detection active
- [ ] Alerts are actionable
- [ ] Logging comprehensive
- [ ] Rollback plan documented
- [ ] A/B testing ready
- [ ] Retraining automated
- [ ] Business impact measured

Copy this and use it in your next deployment review!
