ML Serving Platform
Data scientists' laptops are not production environments. To serve models at scale, you need a system that tracks model versions, monitors data drift, and scales inference on GPUs.
Architecture
- Model Registry (MLflow): The “Git” for models. Tracks experiments, parameters, and artifacts (model.pkl).
- Serving Layer (KServe/Seldon): A standardized inference protocol (V2) that abstracts away the framework (PyTorch, TF, XGBoost); see the client sketch after this list.
- Inference Gateway: Handles Canary rollouts (Traffic Splitting) between Model V1 and Model V2.
- Monitoring: Detecting “Drift” (model accuracy degrading over time because real-world data has changed).
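To make the Serving Layer concrete, here is a minimal client sketch against a KServe/Open Inference Protocol (V2) REST endpoint. The host name, model name, and feature values are placeholders for illustration, not part of the platform above.

# v2_client.py (illustrative; host, model name, and features are placeholders)
import requests

# The V2 protocol keeps the request shape identical whether the model
# behind the endpoint is PyTorch, TF, or XGBoost.
url = "http://fraud-model.example.com/v2/models/fraud-model/infer"

payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [0.1, 42.0, 3.5, 0.0],  # one transaction, four features
        }
    ]
}

resp = requests.post(url, json=payload, timeout=0.05)  # 50 ms latency budget
resp.raise_for_status()
print(resp.json()["outputs"][0]["data"])  # e.g. a fraud score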
Use Cases
- Fraud Detection: Scoring transactions in real time (< 50 ms).
- Recommendation Engine: A/B testing two personalization algorithms on live users.
- Computer Vision: Resizing and batching images for specialized GPU instances.
Implementation Guide
We will use MLflow to track a model and a custom Python wrapper to serve it.
Prerequisites
- Python 3.9+
- Docker
- MLflow installed locally
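Note: train.py below points its tracking URI at http://localhost:5000. If you do not already have a tracking server running, `mlflow server --host 127.0.0.1 --port 5000` starts one locally; alternatively, drop the `set_tracking_uri` call and MLflow will log to a local ./mlruns directory.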
Step 1: Training & Tracking (MLflow)
# train.py
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.ensemble import RandomForestRegressor

mlflow.set_tracking_uri("http://localhost:5000")

def train():
    # Placeholder training data; replace with your real features and targets.
    X_train = np.random.rand(100, 4)
    y_train = np.random.rand(100)

    with mlflow.start_run():
        params = {"n_estimators": 100, "max_depth": 5}
        model = RandomForestRegressor(**params)
        model.fit(X_train, y_train)

        # Log params & model artifact
        mlflow.log_params(params)
        mlflow.sklearn.log_model(model, "my_model")
        print("Model saved to MLflow")

if __name__ == "__main__":
    train()
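Step 2 resolves the model with the registry URI models:/my_model/Production, which only works once the logged model has been registered and promoted. A minimal sketch of that promotion step, assuming the registry lives on the same tracking server (the run ID is a placeholder; newer MLflow releases favor model aliases over stages):

# register.py -- promote the logged model so serve.py can load it
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://localhost:5000")

run_id = "<RUN_ID_FROM_STEP_1>"  # placeholder: copy it from the MLflow UI

# Register the run's "my_model" artifact under a registry name...
version = mlflow.register_model(f"runs:/{run_id}/my_model", "my_model")

# ...and move that version into the Production stage.
MlflowClient().transition_model_version_stage(
    name="my_model", version=version.version, stage="Production"
)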
Step 2: Serving Wrapper
In production, we often wrap the model in a consistent API.
# serve.py
import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the model from the MLflow Model Registry
# (the artifacts themselves typically live in S3 or another object store).
model_uri = "models:/my_model/Production"
model = mlflow.pyfunc.load_model(model_uri)

class InferenceRequest(BaseModel):
    data: list

@app.post("/predict")
async def predict(req: InferenceRequest):
    df = pd.DataFrame(req.data)
    prediction = model.predict(df)
    return {"prediction": prediction.tolist()}
Step 3: Deployment Strategy (Canary)
Use Istio (or another SMI-compatible mesh) to split traffic between the two model versions.
# istio/virtual-service.yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: fraud-model
spec:
  hosts:
    - fraud-model
  http:
    - route:
        - destination:
            host: fraud-model-v1
          weight: 90
        - destination:
            host: fraud-model-v2
          weight: 10
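This assumes fraud-model-v1 and fraud-model-v2 are exposed as separate Kubernetes Services (or as subsets in an Istio DestinationRule). Start V2 at a small weight and shift traffic gradually while its latency and error rates stay within budget; the same VirtualService can also mirror traffic to V2, which is one way to implement the shadow-mode item in the checklist below.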
Production Readiness Checklist
- [ ] Model Registry: Ensure every model in production can be traced back to the specific Git commit and dataset used to train it (lineage).
- [ ] Batching: If using GPUs, ensure your serving layer batches requests (e.g., waiting 10 ms to group 50 requests) to maximize throughput.
- [ ] Drift Detection: Compare the statistical distribution of inputs in production vs. training and alert if they diverge (see the sketch below).
- [ ] Fallback: If the model service times out, do you have a heuristic fallback (e.g., “default to not fraud”)?
- [ ] Shadow Mode: Deploy the new model (V2) to receive traffic but discard its results, just to test latency and errors before promoting it.
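For the drift-detection item, a minimal sketch of the idea: compare a production feature's distribution against its training distribution with a two-sample Kolmogorov–Smirnov test and alert when they diverge. The arrays and threshold here are synthetic placeholders.

# drift_check.py -- illustrative drift check for a single numeric feature
import numpy as np
from scipy.stats import ks_2samp

def has_drifted(train_values, prod_values, p_threshold=0.01):
    """Flag drift when the two samples are unlikely to share a distribution."""
    _, p_value = ks_2samp(train_values, prod_values)
    return p_value < p_threshold

# Synthetic example: production values have shifted relative to training.
train_amounts = np.random.normal(loc=0.0, scale=1.0, size=10_000)
prod_amounts = np.random.normal(loc=0.5, scale=1.0, size=1_000)

if has_drifted(train_amounts, prod_amounts):
    print("ALERT: input distribution drifted; investigate and consider retraining.")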