The Complete Guide to ML Model Serving
Architectures, Optimizations & Operations for Production AI
From training to serving—the real challenge begins when your model goes live. Master the discipline that converts frozen weights into production-grade, revenue-generating AI systems.
The lifecycle of a machine learning model splits into two distinct phases: the computationally intensive process of training and the operational, reliability-focused phase of serving (inference).
While training optimizes weights via gradient descent, serving applies these frozen capabilities to new data in production. In 2025, model serving has evolved from simple REST API wrappers to complex distributed systems orchestrating multi-modal LLMs, autonomous agents, and real-time decision engines.
What Makes Serving Different
Training is about optimization—minimizing loss over hours or weeks. Serving is about reliability—responding to millions of requests with consistent latency, managing GPU memory efficiently, and maintaining uptime SLAs. The infrastructure, metrics, and engineering challenges are fundamentally different.
Serving Paradigms
The selection of a serving paradigm is the foundational architectural decision that dictates infrastructure costs, latency profiles, and system complexity. Four primary paradigms dominate the modern landscape.
Serving Paradigm Explorer
Real-Time (Online)
- Trigger: user action / API call
- Latency requirement: critical (ms to s)
- Traffic pattern: variable (bursty)
- Results: immediate
Real-Time Metrics
For LLMs, real-time performance is quantified by Time to First Token (TTFT)—latency before the user sees the first character—and Time Per Output Token (TPOT), which governs generation smoothness. A typical chatbot SLA is TTFT under 200ms.
Tail Latency
The critical engineering challenge is minimizing tail latency—ensuring even the 99th percentile of requests are served quickly despite "noisy neighbor" effects in multi-tenant cloud environments.
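These figures are straightforward to derive from request logs. A minimal sketch, assuming you record per-request timestamps and output token counts yourself:

```python
import numpy as np

def latency_metrics(request_start, first_token_time, last_token_time, n_output_tokens):
    """Compute TTFT, TPOT, and tail latency from per-request timing logs (seconds)."""
    ttft = np.array(first_token_time) - np.array(request_start)        # time to first token
    gen_time = np.array(last_token_time) - np.array(first_token_time)  # decoding time
    tpot = gen_time / np.maximum(np.array(n_output_tokens) - 1, 1)     # time per output token
    return {
        "ttft_p50_ms": np.percentile(ttft, 50) * 1000,
        "ttft_p99_ms": np.percentile(ttft, 99) * 1000,   # tail latency checked against the SLA
        "tpot_p50_ms": np.percentile(tpot, 50) * 1000,
    }
```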
Privacy & Confidential Computing
Traditional security protects data at rest (disk encryption) and in transit (TLS). But the unique vulnerability of ML inference lies in data in use. To perform matrix multiplications, data must be decrypted in memory—creating exposure windows.
Confidential Computing addresses this gap by isolating computation within hardware-protected enclaves called Trusted Execution Environments (TEEs). The NVIDIA H100 was the first GPU architecture to support this natively.
Zero-Trust Inference Flow
Hardware-based security with Confidential Computing
Step 1: Enclave Initialization
Inference server boots inside a Confidential VM (CVM) with AMD SEV-SNP or Intel TDX protection.
Trusted Execution Environment Architecture
Data-in-Use Protection
Memory is encrypted at the hardware level—inaccessible to OS, hypervisor, or cloud admins.
Zero Trust Model
Infrastructure provider is assumed untrusted. Client verifies enclave integrity before sending data.
GDPR/HIPAA Compliant
Process sensitive data in public cloud while maintaining technical compliance with data sovereignty requirements.
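The client-side contract implied by this model can be sketched as follows. Every name here (the endpoint, the /attestation route, verify_report) is a placeholder standing in for a vendor attestation SDK, not a real API:

```python
# Hypothetical client-side zero-trust flow. The endpoint, the /attestation route,
# and verify_report() are placeholders for what a real SEV-SNP / TDX verifier provides.
import requests

ENDPOINT = "https://inference.example.internal"      # assumed confidential-VM endpoint
EXPECTED_MEASUREMENT = "sha384:<known-good-hash>"    # hash of the approved server image

def verify_report(report: dict) -> bool:
    # Placeholder: a real verifier checks the vendor certificate chain and the
    # report signature; here we only compare the reported launch measurement.
    return report.get("measurement") == EXPECTED_MEASUREMENT

def confidential_inference(prompt: str) -> str:
    # 1. Fetch a signed attestation report (SEV-SNP / TDX quote) from the enclave.
    report = requests.get(f"{ENDPOINT}/attestation", timeout=10).json()
    # 2. Refuse to send anything if the enclave cannot prove its integrity.
    if not verify_report(report):
        raise RuntimeError("Attestation failed: enclave is not the expected image")
    # 3. Only now send the sensitive payload to the attested endpoint.
    resp = requests.post(f"{ENDPOINT}/v1/completions", json={"prompt": prompt}, timeout=60)
    return resp.json()["text"]
```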
CPU-Based Isolation Technologies
AMD SEV-SNP
AMD SEV-SNP provides VM-level isolation where memory is opaque to the hypervisor. Used for "Confidential Containers" in Kubernetes.
Intel TDX
Intel TDX creates secure "Trust Domains" with encrypted memory protected from the platform. Powers Azure Confidential VMs.
The GPU Hierarchy
Hardware selection dominates the economics of model serving. The NVIDIA H100 remains the premier choice for LLMs, but its scarcity and cost drive adoption of lower-cost alternatives such as the previous-generation A100.
GPU Hierarchy Compare
Compare performance, cost, and capabilities
NVIDIA H100 SXM5 vs. NVIDIA A100 80GB
Cost/Performance Analysis
- H100: ~233 TFLOPS per dollar-hour of rental
- A100: ~97 TFLOPS per dollar-hour of rental
GPU Virtualization & Partitioning
To maximize ROI on expensive GPUs, single-tenant allocation is often replaced by multi-tenancy. Three mechanisms exist: Multi-Instance GPU (MIG), time-slicing, and the Multi-Process Service (MPS).
GPU Virtualization Explorer
Compare MIG, Time-Slicing, and MPS strategies
Multi-Instance GPU (MIG)
Hardware partitioning of SMs, memory, and cache into up to 7 isolated instances, with very low cross-tenant interference.
Best for: multi-tenant SaaS, strict SLAs, mixed-workload clusters.
Limitations:
- Only A100/H100-class GPUs
- Fixed partition sizes
- Reduces available memory per instance
Token Economics vs. Iron Economics
The prevailing public model—paying per token—scales linearly. The millionth token costs the same as the first. Private hosting involves high upfront costs but drives marginal inference cost toward zero.
The break-even arrives faster than most expect. For organizations handling on the order of 100K-1M requests per day, self-hosting typically wins on total cost of ownership.
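A rough break-even model makes the crossover concrete. Every number below is an illustrative assumption, not a quote from any provider:

```python
# Rough break-even model; every constant here is an assumption for illustration only.
API_COST_PER_1M_TOKENS = 3.00   # assumed blended $/1M tokens (input + output)
TOKENS_PER_REQUEST = 1_500      # assumed average prompt + completion size
GPU_HOURLY_COST = 4.00          # assumed $/hour for a rented H100-class GPU
GPUS_NEEDED = 2                 # assumed capacity for the target throughput
FIXED_MONTHLY_OPS = 5_000       # assumed engineering / monitoring overhead

def monthly_costs(requests_per_day: float) -> tuple[float, float]:
    tokens_per_month = requests_per_day * 30 * TOKENS_PER_REQUEST
    api = tokens_per_month / 1e6 * API_COST_PER_1M_TOKENS                       # scales linearly
    self_hosted = GPUS_NEEDED * GPU_HOURLY_COST * 24 * 30 + FIXED_MONTHLY_OPS   # roughly flat
    return api, self_hosted

for rpd in (10_000, 100_000, 500_000):
    api, hosted = monthly_costs(rpd)
    print(f"{rpd:>8} req/day   API ${api:>10,.0f}/mo   self-hosted ${hosted:>10,.0f}/mo")
```

At the assumed prices, the API bill grows linearly with volume while the self-hosted line stays nearly flat, which is why the crossover arrives in the hundreds of thousands of requests per day.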
Inference Cost Calculator
API pricing vs. self-hosting economics
Workload Profile
Model Class
API Provider
Self-Hosted
Example result: at 500K requests/day, self-hosting saves roughly 87% of annual spend (about $755.6K/year).
Semantic caching can reduce API calls by 20-30% for common queries (see the sketch below).
Route simple queries to cheap models, complex ones to frontier models.
Use spot/preemptible instances for 60-90% savings on batch workloads.
* Estimates based on typical cloud pricing. Actual costs vary by provider, region, and volume discounts.
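A semantic cache is only a few lines of logic: embed the query and reuse a stored answer when a previous query is close enough. A minimal sketch, with the embedding model and similarity threshold as assumptions (production caches such as GPTCache add eviction and TTLs):

```python
# Minimal semantic cache sketch; model choice and threshold are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")    # small embedding model
_cache: list[tuple[np.ndarray, str]] = []          # (query embedding, cached answer)
THRESHOLD = 0.92                                   # assumed cosine-similarity cutoff

def cached_call(query: str, llm_call) -> str:
    q = model.encode(query, normalize_embeddings=True)
    for emb, answer in _cache:
        if float(np.dot(q, emb)) >= THRESHOLD:     # cosine similarity (normalized vectors)
            return answer                          # cache hit: skip the paid API call
    answer = llm_call(query)                       # cache miss: pay for the API call
    _cache.append((q, answer))
    return answer
```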
Serving Runtimes Deep Dive
The "Serving Runtime" executes the neural network's computational graph. In 2025, the ecosystem has bifurcated: LLM-specific engines like and optimize for autoregressive generation, while general-purpose runtimes handle diverse workloads.
The Memory Bottleneck
LLM inference is memory-bound, not compute-bound, during decoding. Each token requires reading the entire KV cache from memory. PagedAttention (vLLM's innovation) manages this memory like OS virtual memory, eliminating fragmentation for 2-4x higher throughput.
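For illustration, this is roughly what vLLM's offline Python API looks like; the model name and sampling values are placeholders, and the same engine backs its OpenAI-compatible server:

```python
# Sketch of vLLM's offline batch API; model id and sampling values are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed Hugging Face model id
    gpu_memory_utilization=0.90,               # fraction of VRAM vLLM may use (weights + KV-cache pool)
    max_model_len=8192,
)
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

prompts = ["Explain PagedAttention in one paragraph.",
           "What is time to first token?"]
outputs = llm.generate(prompts, params)        # requests are continuously batched internally
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text[:80])
```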
Serving Runtime Benchmark
Compare performance, flexibility, and features
vLLM
PagedAttention eliminates KV cache fragmentation, achieving 2-4x higher throughput
Best for: high-throughput LLM serving, production chatbots, API endpoints
Relative scores (0-100, higher is better):
| Runtime | Latency | Throughput | Flexibility | Focus |
|---|---|---|---|---|
| vLLM | 85 | 95 | 80 | LLM Serving |
| TensorRT-LLM | 98 | 98 | 50 | Maximum Performance |
| TGI | 80 | 85 | 75 | Production-Ready |
| TorchServe | 70 | 70 | 95 | PyTorch Native |
| Ray Serve | 75 | 80 | 100 | Composite AI |
| ONNX Runtime | 80 | 75 | 90 | Universal |
When to Use What
- vLLM: high-throughput production LLM serving; OpenAI-compatible API.
- TensorRT-LLM: lowest latency on NVIDIA silicon; compiler-based optimization.
- Ray Serve: composite AI, multi-model pipelines, fractional GPU allocation.
Quantization & Optimization
To make serving economically viable, engineers compress models through quantization—reducing precision from 16-bit to 8-bit or 4-bit. Modern techniques like AWQ and GPTQ preserve accuracy while dramatically reducing memory footprint.
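A back-of-the-envelope calculation shows why precision dominates deployability. The sketch below counts weights only, ignoring KV cache and runtime overhead:

```python
# Weights-only memory estimate; KV cache and runtime overhead come on top.
PARAMS = 70e9  # 70B-parameter model

for name, bits in [("FP16", 16), ("FP8", 8), ("INT8", 8), ("AWQ/GPTQ 4-bit", 4)]:
    gib = PARAMS * bits / 8 / 2**30
    print(f"{name:>15}: ~{gib:6.1f} GiB of weights")

# FP16 is ~130 GiB (multiple GPUs), while 4-bit is ~33 GiB (fits a single 80 GB card).
```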
Quantization Explorer
Compare precision trade-offs and memory savings
AWQ (4-bit)
Activation-aware weight quantization that protects salient weights.
GPU compatibility: any CUDA-capable GPU.
FP8: Best Quality
Native H100 format. Nearly lossless with 2x compression. Use when you have the hardware.
AWQ: Best Balance
4x compression with 97% accuracy. Protects important weights. Most versatile choice.
GPTQ: Consumer HW
4-bit inference on RTX 4090s. Hessian-based optimization for consumer deployments.
FP8: The New Standard
FP8 is native to H100 GPUs. Unlike INT8, its floating-point nature naturally represents neural network value ranges. Nearly lossless with 2x compression.
S-LoRA: Multi-Tenant Efficiency
S-LoRA hosts thousands of LoRA adapters on a single GPU, swapping them into VRAM on demand. One endpoint, infinite tasks.
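S-LoRA is a research system with its own server; the same idea is exposed in vLLM's multi-LoRA support, sketched below with placeholder model and adapter paths:

```python
# Multi-adapter serving sketch in the S-LoRA style, using vLLM's LoRA support.
# Base model and adapter paths are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True, max_loras=8)
params = SamplingParams(max_tokens=128)

# Each tenant/task gets its own adapter, swapped into VRAM on demand.
sql_adapter = LoRARequest("sql-tenant", 1, "/adapters/sql-lora")
sum_adapter = LoRARequest("summarize-tenant", 2, "/adapters/summarize-lora")

print(llm.generate(["Write a SQL query for monthly revenue."], params,
                   lora_request=sql_adapter)[0].outputs[0].text)
print(llm.generate(["Summarize the following report."], params,
                   lora_request=sum_adapter)[0].outputs[0].text)
```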
Speculative Decoding
Speculative decoding exploits the memory-bandwidth bottleneck. A small draft model generates candidate tokens, then a large verifier validates them in parallel—achieving 2-3x speedup by trading compute for bandwidth.
Speculative Decoding Demo
Watch draft + verify parallelism in action
Draft Model (7B): fast, generates candidate tokens quickly but with lower accuracy.
Verifier Model (70B): accurate, verifies all draft tokens in parallel (single forward pass).
1. Draft Phase
Small model (e.g., 7B) quickly generates N candidate tokens autoregressively.
2. Verify Phase
Large model (e.g., 70B) verifies ALL candidates in a single parallel forward pass.
3. Accept/Reject
Verified tokens are accepted. Rejected tokens trigger correction. 2-3x overall speedup.
LLM decoding is memory-bound during single-token generation—the GPU has excess compute capacity. Speculative decoding leverages this by trading extra compute (verifying multiple tokens) for reduced memory bandwidth (fewer large-model forward passes).
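The accept/reject loop can be reduced to a toy greedy variant, shown below. The draft and verifier are plain callables over token IDs; real engines use a probabilistic rejection-sampling rule and a genuinely parallel verification pass:

```python
# Toy greedy speculative decoding over token IDs. `draft_next` and `verifier_next`
# stand in for the small and large models; real systems use rejection sampling.
from typing import Callable, List

def speculative_decode(prompt: List[int],
                       draft_next: Callable[[List[int]], int],
                       verifier_next: Callable[[List[int]], int],
                       k: int = 4, max_new: int = 32) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. Draft phase: the small model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Verify phase: the large model scores every position (one parallel
        #    forward pass in a real engine; simulated position by position here).
        accepted = []
        for t in proposal:
            expected = verifier_next(tokens + accepted)
            if t == expected:
                accepted.append(t)          # 3. Accept the matching prefix
            else:
                accepted.append(expected)   # first mismatch: take the verifier's token, stop
                break
        tokens.extend(accepted)
    return tokens
```

Because the verifier agrees with the draft most of the time, several tokens are committed per large-model step, which is where the 2-3x speedup comes from.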
Edge AI & Mobile Inference
Edge AI pushes computation to the device—eliminating network latency and keeping data private. The constraints are severe: limited battery, thermal envelopes, and memory capacity.
ExecuTorch
PyTorch's edge runtime. Ahead-of-Time compilation to a lightweight (<50KB) C++ executor.
Platforms: Android, iOS, embedded

Core ML
Apple's on-device framework. Leverages the Neural Engine (ANE). INT4 support enables 7B models on the iPhone 15 Pro.
Platforms: iOS, macOS, visionOS

TensorFlow Lite
Google's edge runtime. Uses Delegates to offload to GPU, DSP, or NPU.
Platforms: Android, Linux, IoT

Quantization for Edge
Post-Training Quantization (PTQ)
Convert FP32 → INT8 after training. Simple but may degrade accuracy for sensitive models.
Quantization-Aware Training (QAT)
Simulates quantization effects during training so the model adapts to reduced precision. Recovers up to 96% of PTQ's accuracy loss. Essential for mobile.
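As a concrete PTQ example, TensorFlow Lite's converter performs full-integer INT8 quantization roughly as follows; the SavedModel path and calibration data are placeholders:

```python
# Post-training INT8 quantization with the TFLite converter; the SavedModel path
# and calibration data are placeholders.
import numpy as np
import tensorflow as tf

def representative_data():
    # In practice, ~100 calibration samples drawn from real inference traffic.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("exported/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8     # fully integer pipeline for NPUs/DSPs
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```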
Deployment Strategies & MLOps
Deploying a model is just the beginning. "Day 2" operations focus on reliability, observability, and safe rollouts. Risk management relies on established patterns: Canary Rollouts, Shadow Deployment, and Blue/Green switching.
ML Deployment Strategies
Compare risk management approaches for model updates
Canary Rollout
Best for: gradual rollouts, iterative improvements, gathering user metrics.
Rollback: fast. Overhead: ~5-10% extra compute. Complexity: medium.
Advantages
- Limited blast radius
- Real user feedback
- Gradual confidence building
- Lower cost than shadow deployment
Limitations
- Some users affected by issues
- Requires traffic-splitting infrastructure
- Metric aggregation complexity
- Rollback affects some users
Traffic flow: requests are split between the stable version and the canary version.
Recommendation
For ML model deployments, start with Shadow Deployment for new architectures, then use Canary Rollouts for iterative improvements. Reserve Blue/Green for major version changes when you have high confidence.
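The traffic split itself is usually handled by a gateway or service mesh, but the core routing logic is just stable, weighted assignment. A sketch with placeholder endpoints and a 5% canary weight:

```python
# Weighted canary routing sketch; endpoints and the 5% weight are placeholders.
# Real deployments do this in a gateway/service mesh and pin users via hashing.
import hashlib

STABLE = "http://model-v1.internal/predict"
CANARY = "http://model-v2.internal/predict"
CANARY_FRACTION = 0.05   # 5% of traffic to the canary

def route(user_id: str) -> str:
    # Hash the user id so each user consistently hits the same variant
    # (stable assignment makes metric comparison and rollback cleaner).
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return CANARY if bucket < CANARY_FRACTION * 10_000 else STABLE
```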
ML Observability Stack
Infrastructure metrics: latency (P50/P99), throughput, GPU utilization, VRAM saturation.
Model quality: data drift and concept drift monitoring.
LLM observability: hallucination rate, token usage, trace visualization (Langfuse, Arize).
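Infrastructure metrics are typically exported directly from the serving process. A minimal sketch using prometheus_client, with metric names and buckets as assumptions:

```python
# Minimal metrics-exporter sketch; metric names and buckets are assumptions.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("inference_latency_seconds", "End-to-end request latency",
                            buckets=(0.05, 0.1, 0.2, 0.5, 1, 2, 5))   # P50/P99 derived from these
TOKENS_GENERATED = Counter("tokens_generated_total", "Output tokens served")

def run_model(prompt: str) -> str:
    return "ok"                                      # stub so the sketch is self-contained

def handle_request(prompt: str) -> str:
    start = time.perf_counter()
    completion = run_model(prompt)                   # placeholder for the actual engine call
    REQUEST_LATENCY.observe(time.perf_counter() - start)
    TOKENS_GENERATED.inc(len(completion.split()))    # rough token count for illustration
    return completion

start_http_server(9100)                              # /metrics endpoint for Prometheus to scrape
```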
RAG & Agentic Architectures
Modern serving rarely involves a single model endpoint. Retrieval-Augmented Generation (RAG) systems ground LLM responses in proprietary data, while agents introduce loops, decision-making, and multi-step tool execution.
RAG Pipeline Architecture
Retrieval-Augmented Generation serving flow
Query
User query is embedded into a vector representation
query = "What are the side effects of ibuprofen?"
embedding = embed_model.encode(query)  # [0.023, -0.412, ...]

Pipeline components:
- Vector DB: Milvus / Pinecone
- Embedding Model: E5 / BGE / Cohere
- Reranker: Cross-Encoder / Cohere
- LLM: GPT-4 / Claude / Llama
Hybrid Search
Combines dense embeddings (semantic) with sparse BM25 (keyword) retrieval so that exact term matches are not missed.
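One common way to fuse the two result lists is reciprocal rank fusion. A sketch using the rank_bm25 package, with a toy corpus and a placeholder dense ranking:

```python
# Hybrid retrieval sketch: BM25 (sparse) fused with dense-vector ranks via
# reciprocal rank fusion. The corpus and the dense ranking are placeholders.
from rank_bm25 import BM25Okapi

docs = ["ibuprofen side effects include nausea",
        "acetaminophen dosage guidelines",
        "ibuprofen interacts with blood thinners"]
query = "ibuprofen side effects"

# Sparse ranking from BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([d.split() for d in docs])
bm25_scores = bm25.get_scores(query.split())
sparse_rank = sorted(range(len(docs)), key=lambda i: -bm25_scores[i])

# Dense ranking as it would come back from the vector DB (placeholder order).
dense_rank = [0, 2, 1]

def rrf(rankings, k=60):
    """Reciprocal rank fusion: sum 1/(k + rank) across the ranking lists."""
    fused = {}
    for ranking in rankings:
        for pos, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + pos + 1)
    return sorted(fused, key=fused.get, reverse=True)

print(rrf([sparse_rank, dense_rank]))   # fused ordering fed to the reranker / LLM
```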
Durable Execution
Agents may run for hours. Durable execution persists state after every step—if the worker crashes, the run resumes exactly where it left off.
LangGraph: Agents as State Machines
LangGraph models agents as nodes and edges—a graph of actions and decisions. Deployed to Kubernetes with persistent checkpointing (Postgres), it enables long-running, stateful "human-in-the-loop" interactions and has become a standard choice for production agent architectures.
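Stripped to its essentials, both patterns reduce to a checkpointed state machine: run a node, persist the state, move to the next node, and resume from the last checkpoint after a crash. A minimal sketch with placeholder node functions and a JSON file standing in for Temporal's history or LangGraph's Postgres checkpointer:

```python
# Minimal checkpointed agent loop; node functions and the JSON checkpoint file
# are placeholders for what Temporal / LangGraph-style runtimes provide.
import json, os

CHECKPOINT = "agent_state.json"

def plan(state):
    state["plan"] = f"look up: {state['task']}"   # stand-in for an LLM planning call
    return "act"

def act(state):
    state["result"] = "tool output"               # stand-in for a tool call
    return "done"

NODES = {"plan": plan, "act": act}

def run(task: str) -> dict:
    # Resume from the last checkpoint if a previous run crashed mid-way.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            state, node = json.load(f)
    else:
        state, node = {"task": task}, "plan"

    while node != "done":
        next_node = NODES[node](state)            # execute one node of the graph
        with open(CHECKPOINT, "w") as f:
            json.dump([state, next_node], f)      # persist state after every step
        node = next_node
    os.remove(CHECKPOINT)
    return state

print(run("What are the side effects of ibuprofen?"))
```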
Conclusion: From Weights to Value
The landscape of ML Model Serving in 2025 is defined by specialization and security. The "Flask app wrapper" era is over.
Successful deployment now requires Confidential Computing for sensitive data, specialized engines like vLLM and TensorRT-LLM for GPU efficiency, quantization for memory optimization, and durable orchestration for agentic workflows.
As models scale in size and capability, the serving layer remains the critical bridge— converting potential intelligence into tangible business value.
Training creates potential.
Serving creates value.