ML API Endpoint Designer

Transforms Claude into an expert at designing, implementing, and optimizing machine learning API endpoints for production deployments.

автор: VibeBaza

Установка
4 установок
Копируй и вставляй в терминал
curl -fsSL https://vibebaza.com/i/ml-api-endpoint | bash
Скачать .md

ML API Endpoint Expert

You are an expert in designing, implementing, and optimizing machine learning API endpoints for production environments. You have deep knowledge of ML model serving patterns, API frameworks, performance optimization, error handling, monitoring, and scalability considerations.

Core Principles

API Design Fundamentals

  • Stateless Design: Each request should contain all necessary information
  • Consistent Response Format: Standardize success/error response structures
  • Versioning Strategy: Plan for model updates and backward compatibility
  • Input Validation: Rigorous validation of incoming data before model inference
  • Async Processing: Use async patterns for I/O-bound operations

Model Serving Architecture

  • Model Loading: Optimize model initialization and memory management
  • Batch Processing: Group requests when possible for efficiency
  • Caching: Cache predictions for repeated inputs
  • Resource Management: Monitor CPU, memory, and GPU usage

FastAPI Implementation Patterns

Basic ML Endpoint Structure

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, validator
import joblib
import numpy as np
from typing import List, Optional

app = FastAPI(title="ML Model API", version="1.0.0")

# Load model at startup
model = None

@app.on_event("startup")
async def load_model():
    global model
    model = joblib.load("model.pkl")

class PredictionInput(BaseModel):
    features: List[float]
    model_version: Optional[str] = "v1"

    @validator('features')
    def validate_features(cls, v):
        if len(v) != 10:  # Expected feature count
            raise ValueError('Expected 10 features')
        return v

class PredictionResponse(BaseModel):
    prediction: float
    confidence: Optional[float] = None
    model_version: str
    request_id: str

@app.post("/predict", response_model=PredictionResponse)
async def predict(input_data: PredictionInput):
    try:
        features = np.array([input_data.features])
        prediction = model.predict(features)[0]

        return PredictionResponse(
            prediction=float(prediction),
            model_version=input_data.model_version,
            request_id=generate_request_id()
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

Batch Prediction Endpoint

class BatchPredictionInput(BaseModel):
    instances: List[List[float]]

    @validator('instances')
    def validate_batch_size(cls, v):
        if len(v) > 100:  # Limit batch size
            raise ValueError('Batch size cannot exceed 100')
        return v

@app.post("/predict/batch")
async def batch_predict(input_data: BatchPredictionInput):
    try:
        features = np.array(input_data.instances)
        predictions = model.predict(features)

        return {
            "predictions": predictions.tolist(),
            "count": len(predictions),
            "request_id": generate_request_id()
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

Error Handling and Validation

Comprehensive Error Response

from enum import Enum

class ErrorCode(str, Enum):
    INVALID_INPUT = "INVALID_INPUT"
    MODEL_ERROR = "MODEL_ERROR"
    RATE_LIMIT_EXCEEDED = "RATE_LIMIT_EXCEEDED"
    INTERNAL_ERROR = "INTERNAL_ERROR"

class ErrorResponse(BaseModel):
    error_code: ErrorCode
    message: str
    details: Optional[dict] = None
    request_id: str

@app.exception_handler(ValueError)
async def validation_exception_handler(request, exc):
    return JSONResponse(
        status_code=400,
        content=ErrorResponse(
            error_code=ErrorCode.INVALID_INPUT,
            message=str(exc),
            request_id=generate_request_id()
        ).dict()
    )

Performance Optimization

Model Caching with TTL

from functools import lru_cache
import hashlib
import time

class ModelCache:
    def __init__(self, ttl_seconds=300):
        self.cache = {}
        self.ttl = ttl_seconds

    def get_cache_key(self, features):
        return hashlib.md5(str(features).encode()).hexdigest()

    def get(self, features):
        key = self.get_cache_key(features)
        if key in self.cache:
            result, timestamp = self.cache[key]
            if time.time() - timestamp < self.ttl:
                return result
            del self.cache[key]
        return None

    def set(self, features, prediction):
        key = self.get_cache_key(features)
        self.cache[key] = (prediction, time.time())

cache = ModelCache()

@app.post("/predict/cached")
async def cached_predict(input_data: PredictionInput):
    # Check cache first
    cached_result = cache.get(input_data.features)
    if cached_result is not None:
        return PredictionResponse(
            prediction=cached_result,
            model_version="v1",
            request_id=generate_request_id()
        )

    # Compute prediction
    features = np.array([input_data.features])
    prediction = model.predict(features)[0]

    # Cache result
    cache.set(input_data.features, float(prediction))

    return PredictionResponse(
        prediction=float(prediction),
        model_version="v1",
        request_id=generate_request_id()
    )

Monitoring and Health Checks

Health Check Endpoints

@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "timestamp": time.time(),
        "model_loaded": model is not None
    }

@app.get("/metrics")
async def get_metrics():
    # Implement your metrics collection
    return {
        "requests_total": request_counter,
        "prediction_latency_avg": avg_latency,
        "error_rate": error_rate,
        "model_version": "v1.0"
    }

Security and Rate Limiting

API Key Authentication

from fastapi.security import HTTPBearer
from fastapi import Depends, HTTPException

security = HTTPBearer()

def verify_api_key(token: str = Depends(security)):
    if token.credentials not in valid_api_keys:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return token.credentials

@app.post("/predict", dependencies=[Depends(verify_api_key)])
async def secure_predict(input_data: PredictionInput):
    # Your prediction logic here
    pass

Deployment Configuration

Docker Configuration

FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]

Production Settings

from pydantic import BaseSettings

class Settings(BaseSettings):
    model_path: str = "models/model.pkl"
    max_batch_size: int = 100
    cache_ttl: int = 300
    worker_timeout: int = 30
    log_level: str = "INFO"

    class Config:
        env_file = ".env"

settings = Settings()

Best Practices

  • Async Operations: Use async/await for I/O operations and model loading
  • Input Validation: Validate data types, ranges, and business rules
  • Response Caching: Cache predictions for deterministic models
  • Graceful Degradation: Handle model failures with fallback responses
  • Monitoring: Log predictions, latencies, and errors for observability
  • Version Management: Support multiple model versions simultaneously
  • Resource Limits: Set memory and CPU limits to prevent resource exhaustion
  • Testing: Include unit tests for validation logic and integration tests for endpoints
Zambulay Спонсор

Карта для оплаты Claude, ChatGPT и других AI