
ML Serving Platform

Architecture Visual

graph TD
    clients("API Clients")

    subgraph serving ["Serving Infrastructure"]
        direction TB
        inference_api("Inference API")
        model_server("Model Server")
        ab_testing("A/B Testing")
    end

    subgraph data_layer ["Data Layer"]
        direction TB
        model_registry("Model Registry")
        feature_store("Feature Store")
    end

    subgraph ops ["MLOps"]
        direction TB
        monitoring("Monitoring Stack")
        training_pipeline("Training Pipeline")
    end

    clients --> inference_api
    inference_api --> model_server
    inference_api --> feature_store
    model_server --> model_registry
    monitoring --> inference_api
    monitoring --> model_server
    ab_testing --> model_server
    training_pipeline --> model_registry
    training_pipeline --> feature_store


A data scientist's laptop is not a production environment. To serve models at scale, you need a system that tracks model versions, monitors data drift, and scales inference across GPUs.

Architecture

  • Model Registry (MLflow): The “Git” for models. Tracks experiments, parameters, and artifacts (model.pkl).
  • Serving Layer (KServe/Seldon): Standardized inference protocol (V2) that abstracts the framework (PyTorch, TF, XGBoost).
  • Inference Gateway: Handles canary rollouts (traffic splitting) between Model V1 and Model V2.
  • Monitoring: Detects “drift” (model accuracy degrading over time because real-world data has changed); a minimal drift check is sketched after this list.
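
As a rough illustration of that drift check, the sketch below compares the distribution of a single numeric feature in recent production traffic against the training data using a two-sample Kolmogorov–Smirnov test. The threshold and sample sizes are placeholder assumptions; real deployments usually rely on a dedicated tool (Evidently, WhyLabs, etc.).

# drift_check.py (illustrative sketch)
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values, prod_values, p_threshold=0.01):
    # Two-sample KS test: a small p-value means the production
    # distribution differs significantly from the training distribution.
    _, p_value = ks_2samp(train_values, prod_values)
    return p_value < p_threshold

# Example: simulate a shift in the feature's mean
train = np.random.normal(loc=0.0, scale=1.0, size=5000)
prod = np.random.normal(loc=0.4, scale=1.0, size=1000)
print(feature_drifted(train, prod))  # True -> raise a drift alert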

Use Cases

  • Fraud Detection: Scoring transactions in real time (< 50 ms).
  • Recommendation Engine: A/B testing two personalization algorithms on live users.
  • Computer Vision: Resizing and batching images for specialized GPU instances.

Implementation Guide

We will use MLflow to track a model and a custom Python wrapper to serve it.

Prerequisites

  • Python 3.9+
  • Docker
  • MLflow installed locally

Step 1: Training & Tracking (MLflow)

# train.py
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

mlflow.set_tracking_uri("http://localhost:5000")

def train():
    # Synthetic data stands in for your real training set
    X_train, y_train = make_regression(n_samples=1000, n_features=10, random_state=42)

    with mlflow.start_run():
        params = {"n_estimators": 100, "max_depth": 5}
        model = RandomForestRegressor(**params)
        model.fit(X_train, y_train)

        # Log params and register the model so it can be promoted later
        mlflow.log_params(params)
        mlflow.sklearn.log_model(model, "my_model", registered_model_name="my_model")

        print("Model saved to MLflow")

if __name__ == "__main__":
    train()
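
The serving wrapper below loads the model from the Production stage of the registry, so the registered version has to be promoted first. A minimal sketch using the MLflow client API (stage-based promotion; newer MLflow releases prefer model aliases, so adapt this to your version):

# promote.py (illustrative sketch)
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://localhost:5000")
client = MlflowClient()

# Take the most recently registered, unstaged version of "my_model" and promote it
latest = client.get_latest_versions("my_model", stages=["None"])[0]
client.transition_model_version_stage(
    name="my_model",
    version=latest.version,
    stage="Production",
)
print(f"Version {latest.version} promoted to Production")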

Step 2: Serving Wrapper

In production, we often wrap the model in a consistent API.

# serve.py
import mlflow
import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the model from the Registry; the registry resolves the URI to the
# underlying artifact store (e.g. S3) and downloads the artifacts.
mlflow.set_tracking_uri("http://localhost:5000")
model_uri = "models:/my_model/Production"
model = mlflow.pyfunc.load_model(model_uri)

class InferenceRequest(BaseModel):
    data: list

@app.post("/predict")
async def predict(req: InferenceRequest):
    df = pd.DataFrame(req.data)
    prediction = model.predict(df)
    return {"prediction": prediction.tolist()}

Step 3: Deployment Strategy (Canary)

We use an Istio VirtualService (or an SMI TrafficSplit) to send 90% of traffic to the stable V1 deployment and 10% to the V2 canary.

# istio/virtual-service.yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: fraud-model
spec:
  hosts:
  - fraud-model
  http:
  - route:
    - destination:
        host: fraud-model-v1
      weight: 90
    - destination:
        host: fraud-model-v2
      weight: 10
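
One way to sanity-check the split is to hammer the endpoint and tally which version answered. The sketch below assumes each deployment adds a model_version field to its JSON response (the serve.py above does not; you would add it per version) and that the service is reachable at the hostname used in the VirtualService.

# check_split.py (illustrative sketch)
import collections
import requests

counts = collections.Counter()
payload = {"data": [[0.1] * 10]}

for _ in range(200):
    resp = requests.post("http://fraud-model/predict", json=payload, timeout=1)
    counts[resp.json().get("model_version", "unknown")] += 1

print(counts)  # expect roughly 180 x v1 and 20 x v2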

Production Readiness Checklist

[ ] Model Registry: Ensure every model in production can be traced back to the specific Git commit and dataset used to train it (lineage).
[ ] Batching: If using GPUs, ensure your serving layer batches requests (e.g., waiting 10 ms to group 50 requests) to maximize throughput; a toy batcher is sketched after this checklist.
[ ] Drift Detection: Compare the statistical distribution of inputs in production vs. training. Alert if they diverge.
[ ] Fallback: If the model service times out, do you have a heuristic fallback? (e.g., default to “Not Fraud”).
[ ] Shadow Mode: Deploy the new model (V2) to receive traffic but discard its results, just to test latency and errors before promoting it.
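
To make the batching item concrete, here is a toy asyncio micro-batcher: requests queue up and are flushed either when 50 are pending or after 10 ms, whichever comes first, so the GPU sees one larger batch instead of many single-row calls. The sizes and timeout are placeholder numbers, and production servers (e.g. Triton, KServe) implement dynamic batching for you.

# micro_batcher.py (illustrative sketch)
import asyncio

MAX_BATCH = 50
MAX_WAIT_S = 0.010  # 10 ms

queue: asyncio.Queue = asyncio.Queue()

async def batch_worker(model_predict):
    # Runs forever: start via asyncio.create_task(batch_worker(model.predict))
    while True:
        first = await queue.get()                      # wait for the first request
        batch = [first]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        inputs = [x for x, _ in batch]
        results = model_predict(inputs)                # one call for the whole batch
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)

async def predict(x):
    # Called by each request handler; resolves when the batch containing x is scored
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut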

Cloud Cost Estimator

Approximate baseline at MVP scale (estimates vary by provider & region): Compute Resources ~$15/month, Database Storage ~$25/month, Load Balancer ~$10/month, CDN / Bandwidth ~$5/month.