# 12. Distributed Inference
Distributed inference enables large language models (LLMs) to serve predictions efficiently across multiple GPUs or nodes. As models grow in size, ranging from tens to hundreds of billions of parameters, single-GPU deployment becomes impractical. This section introduces how distributed inference is achieved using the open-source library vLLM.
## 12.1. Overview
vLLM is a high-performance inference engine designed to serve large transformer models efficiently. It supports two forms of model parallelism:

- **Pipeline Parallelism (PP):** splits the model's layers into stages placed on different GPUs or nodes.
- **Tensor Parallelism (TP):** shards the weights and computation of each individual layer across GPUs.
Together, these features enable scalable, memory-efficient, and high-throughput inference, making vLLM well suited to deploying very large models in production environments or on AI clusters.
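As a concrete illustration, the sketch below selects TP and PP through vLLM's offline Python API. The model ID and parallel sizes are placeholders, not settings taken from the deployment guides; vLLM will expect `tensor_parallel_size × pipeline_parallel_size` GPUs (here, 8) to be visible.

```python
# Minimal sketch of distributed inference with vLLM's offline API.
# The model ID and parallel sizes are illustrative placeholders;
# consult the per-model deployment guides for tested settings.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed Hugging Face ID
    tensor_parallel_size=4,    # TP: shard each layer across 4 GPUs
    pipeline_parallel_size=2,  # PP: split the layer stack into 2 stages
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```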
## 12.2. Supported Models
The following models have been tested and configured for distributed inference using vLLM on AI clusters:
| Model | Model Size | Hugging Face | Deployment Guide |
|---|---|---|---|
| Llama 3.1 | 70B | | |
| Llama 3.1 | 405B | | |
| DeepSeek-R1 | 671B | | |
| DeepSeek-R1-0528 | 671B | | |
Each model's deployment page includes:

- Instructions for launching vLLM with multi-GPU settings.
- Example configurations for PP and/or TP (a generic sketch follows this list).
- Tips for optimizing throughput and latency.
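For a rough sense of what such a configuration looks like, the sketch below spreads a large model over two nodes with eight GPUs each. The model ID, parallel sizes, and the Ray backend are assumptions for illustration only, and a Ray cluster spanning the nodes is assumed to be running already; refer to the deployment guides for the tested, cluster-specific settings.

```python
# Hypothetical multi-node configuration sketch (not taken from the guides).
# TP shards each layer across the 8 GPUs of a node; PP splits the model
# into 2 stages, one per node, for 16 GPUs total.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # assumed Hugging Face ID
    tensor_parallel_size=8,              # GPUs per node
    pipeline_parallel_size=2,            # pipeline stages (one per node)
    distributed_executor_backend="ray",  # multi-node execution via Ray
)
```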
## 12.3. Best Practices
- Use a fast interconnect between GPUs or nodes (e.g., NVLink, InfiniBand).
- Fine-tune the balance between PP and TP for your hardware.
- Monitor memory usage and load balancing to avoid bottlenecks (a tuning sketch follows this list).
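As one way to act on the memory point, vLLM exposes knobs that bound how much GPU memory the engine reserves and how large the KV cache can grow. The sketch below shows them with placeholder values; the model ID, parallel sizes, and limits are assumptions to be adjusted for your hardware.

```python
# Sketch of memory-related tuning knobs; all values are placeholders.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # assumed Hugging Face ID
    tensor_parallel_size=8,           # keep TP within a node's fast interconnect
    pipeline_parallel_size=2,         # place PP stages across nodes
    gpu_memory_utilization=0.90,      # fraction of each GPU reserved for weights + KV cache
    max_model_len=8192,               # cap context length to bound KV-cache memory
)
```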
For full source code, documentation, and additional details, visit the GitHub repository: [KempnerInstitute/distributed-inference-vllm](https://github.com/KempnerInstitute/distributed-inference-vllm).
> **Note:** Stay up to date with the repository for new model support and configuration guides.