https://ozyphus.github.io/ml_infrastructure_guide.html

Introduction: Building expertise in scalable, high-performance machine learning (ML) infrastructure requires a broad understanding of compute, storage, networking, and benchmarking for ML workloads. These systems must support intense training and inference demands for tasks like large language models (LLMs) and computer vision (CV) while also meeting business goals (throughput, latency) in a cost-effective way. Below, we provide curated learning resources (blogs, papers, videos) and then dive into key topics with in-depth explanations and examples. This guide covers everything from GPU hardware basics to advanced cluster design, training vs. inference considerations (cloud and edge), and how to plan resources (e.g. GPUs needed, network, storage) for different scenarios. Citations are included for further reading.

Recommended Resources (Blogs, Papers, Videos)

Key Topics and In-Depth Guide

With the above resources as a foundation, let’s break down the core topics you need to master. For each topic, we provide an explanation and connect it to practical examples in LLMs (NLP) and computer vision. We cover both training and inference, mixing conceptual and hands-on knowledge.

  1. High-Performance Compute for ML (GPUs and Specialized Hardware)

GPUs vs CPUs: Modern ML training is dominated by graphics processing units (GPUs) because of their massively parallel architecture. A CPU has a relatively small number of powerful cores, while a GPU contains thousands of simpler cores that perform arithmetic operations in parallel. This makes GPUs extremely efficient at the linear algebra (matrix multiply-accumulate) at the heart of deep learning. Two key advantages of GPUs are: (a) much higher memory bandwidth and (b) many parallel ALUs (arithmetic logic units). A typical CPU might have ~90 GB/s of memory bandwidth, whereas a modern GPU like the NVIDIA A100 has on the order of 2,000 GB/s[1]. A GPU can therefore load large model weights and data batches much faster, which keeps its compute units busy. GPUs are also designed with large register files and caches near the execution units, so they can keep more data “close” to where computation happens[15]. A single CPU core is faster on single-threaded tasks, but it cannot compete when a huge batch of data has to be processed concurrently. A common analogy: a CPU is like a fast race car that carries one passenger, while a GPU is like a bus, slower per thread but able to carry dozens of passengers in one trip[15]. For ML, where the same operation is applied to many data points (e.g. multiplying millions of weights by activations), the GPU “bus” is ideal.
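To make the bandwidth gap concrete, here is a minimal sketch that computes how long it takes to stream a model's weights once from memory at the two bandwidth figures quoted above. The 1B-parameter FP16 model and the helper name `stream_time_ms` are hypothetical, chosen only for illustration; real throughput also depends on caching, access patterns, and compute overlap.

```python
# Rough estimate: time to read a model's weights once at a given memory bandwidth.
# Bandwidth values are the approximate figures quoted in the text, not measurements.

def stream_time_ms(num_bytes: float, bandwidth_gb_s: float) -> float:
    """Milliseconds needed to read num_bytes at bandwidth_gb_s (1 GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e3

# Hypothetical model: 1 billion parameters stored in FP16 (2 bytes each).
weights_bytes = 1e9 * 2

cpu_ms = stream_time_ms(weights_bytes, 90)    # ~90 GB/s CPU memory bandwidth
gpu_ms = stream_time_ms(weights_bytes, 2000)  # ~2,000 GB/s A100-class bandwidth

print(f"CPU: {cpu_ms:.1f} ms per pass over the weights")
print(f"GPU: {gpu_ms:.1f} ms per pass over the weights")
print(f"Ratio: {cpu_ms / gpu_ms:.1f}x")
```

Under these assumptions the GPU reads the weights about 22 times faster, which is simply the ratio of the two bandwidth figures.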

Compute Requirements for Training: Training an ML model involves iteratively processing large datasets through the model and updating millions or billions of parameters, which is extremely compute-intensive. For example, training a cutting-edge LLM like GPT-3 (175 billion parameters) from scratch takes hundreds of thousands of GPU-hours: GPT-3 was reportedly trained on a cluster of 1,024 GPUs over ~1 month (roughly 740,000 GPU-hours), and estimates put the compute cost in the millions of dollars (one source estimated ~$4.6M for GPT-3, and GPT-4’s training cost has been estimated at over $100 million)[16]. To plan infrastructure, you need to estimate the FLOPs (floating-point operations) and memory required. Guideline: for dense models, full-precision training needs at least 2-3× the model’s parameter size in GPU memory to hold the weights, gradients, and optimizer state. Optimizers that keep two moment estimates per parameter (such as Adam) push this to about 4×, and activations come on top. For instance, a model with 1 billion parameters (around 4 GB in FP32) needs ~8–12 GB of GPU memory per copy under the 2-3× rule, and closer to 16 GB with Adam. That is why GPUs with large VRAM (24 GB, 40 GB, even 80 GB) are preferred for big models. When the model or batch does not fit on one GPU, distributed training splits the workload across many GPUs. Data parallelism (each GPU holds a full copy of the model and processes different data) scales throughput, while model parallelism (the model’s layers or weights are split across GPUs) makes it possible to train models that do not fit in one GPU’s memory.
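The rule of thumb and the GPU-hour arithmetic above can be sketched as a small calculator. This is a back-of-the-envelope sketch, not a sizing tool: the function names and the `state_multiplier` parameter are hypothetical, the multiplier covers weights, gradients, and optimizer state only, and activation memory (which depends on batch size and architecture) is deliberately left out.

```python
# Back-of-the-envelope training estimates based on the guideline in the text.
# Activation memory is NOT included; treat the results as lower bounds.

def training_memory_gb(num_params: float,
                       bytes_per_param: int = 4,
                       state_multiplier: float = 3.0) -> float:
    """GPU memory (GB) for weights + gradients + optimizer state.

    state_multiplier is the total footprint relative to the weights alone:
    2.0 = weights + gradients, 3.0 = adds one optimizer buffer (e.g. momentum),
    4.0 = adds two optimizer buffers (e.g. Adam's two moment estimates).
    """
    return num_params * bytes_per_param * state_multiplier / 1e9

def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours for num_gpus running continuously for the given days."""
    return num_gpus * days * 24

# 1B-parameter model in FP32 (4 bytes/parameter).
low = training_memory_gb(1e9, state_multiplier=2.0)   # 8.0 GB
high = training_memory_gb(1e9, state_multiplier=3.0)  # 12.0 GB
adam = training_memory_gb(1e9, state_multiplier=4.0)  # 16.0 GB

# 1,024 GPUs for roughly one month (30 days assumed).
hours = gpu_hours(1024, 30)  # 737,280 GPU-hours

print(f"1B params, FP32: {low:.0f}-{high:.0f} GB (about {adam:.0f} GB with Adam)")
print(f"1,024 GPUs x 30 days = {hours:,.0f} GPU-hours")
```

Multiplying the GPU-hour total by an hourly GPU price gives a first-order cost estimate, which is how multi-million-dollar figures for large training runs arise.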

Compute Requirements for Inference: For inference (using a trained model to make predictions), compute needs depend on the model size and the desired latency. A large LLM with tens of billions of parameters may require a powerful GPU (or several) to serve queries with low latency. One practical formula for estimating the GPU memory needed for LLM inference is: Memory ≈ (number of parameters × bytes per parameter / compression factor) × overhead[17]. For example, a 70B-parameter model in 16-bit precision (2 bytes/parameter) needs 70B × 2 bytes ≈ 140 GB for the weights alone (about 280 GB at FP32), plus overhead for activations, etc. Much smaller figures quoted for a 70B LLaMA model, on the order of ~21–42 GB[18], are only reachable with aggressive quantization (roughly 2 to 4 bits per parameter), not at FP16 or FP32. In practice, you need multiple GPUs to host such a model at FP16: the weights alone exceed the 80 GB of a single A100, before accounting for large batch sizes or long sequences[19]. This is why serving GPT-3 class models usually involves sharding the model across GPUs or using specialized hardware. By contrast, a smaller computer vision model like ResNet-50 (~25 million parameters, ~100 MB in FP32) easily fits in a single GPU’s memory; the bottleneck for CV inference is usually throughput (images per second) rather than memory. CV models can also use lower precision (INT8) and small batch sizes to run efficiently on edge devices.
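The memory formula above can be written out as a short calculator. This is a sketch under stated assumptions: the function names are hypothetical, the overhead factor of 1.2 (20% on top of the weights) is an assumed value that the text does not specify, and the compression factor is taken as the ratio between the stored precision and the quantized precision (for example, 4 when FP16 weights are quantized to 4 bits).

```python
import math

# Memory ≈ (parameters × bytes per parameter / compression factor) × overhead
# The default overhead of 1.2 is an assumption for illustration, not a measured value.

def inference_memory_gb(num_params: float,
                        bytes_per_param: float = 2.0,
                        compression_factor: float = 1.0,
                        overhead: float = 1.2) -> float:
    """Estimated GPU memory (GB, 1 GB = 1e9 bytes) needed to serve a model."""
    return num_params * bytes_per_param / compression_factor * overhead / 1e9

def gpus_needed(memory_gb: float, gpu_memory_gb: float = 80.0) -> int:
    """Minimum number of GPUs whose combined memory covers memory_gb."""
    return math.ceil(memory_gb / gpu_memory_gb)

# 70B-parameter LLM at FP16, weights only (no overhead): 140 GB.
weights_fp16 = inference_memory_gb(70e9, overhead=1.0)

# Same model with the assumed 20% overhead: about 168 GB.
fp16_total = inference_memory_gb(70e9)

# Same model quantized from FP16 to 4 bits (compression factor 4): about 42 GB.
int4_total = inference_memory_gb(70e9, compression_factor=4.0)

# ResNet-50-sized model: ~25M parameters in FP32, weights only: about 0.1 GB.
resnet_weights = inference_memory_gb(25e6, bytes_per_param=4.0, overhead=1.0)

print(f"70B FP16 weights: {weights_fp16:.0f} GB")
print(f"70B FP16 with overhead: {fp16_total:.0f} GB "
      f"-> {gpus_needed(fp16_total)} x 80 GB GPUs")
print(f"70B 4-bit with overhead: {int4_total:.0f} GB "
      f"-> {gpus_needed(int4_total)} x 80 GB GPU")
print(f"ResNet-50 FP32 weights: {resnet_weights:.2f} GB")
```

The estimate ignores how memory is actually partitioned across devices, so the GPU count is a floor rather than a deployment recommendation.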

GPUs and AI Chips: The dominant compute platform for ML is GPUs (NVIDIA in particular), but there are other specialized chips worth knowing:

- TPUs (Tensor Processing Units): Google’s custom ML accelerators used in their cloud, optimized for matrix multiplies.
- FPGAs: Reconfigurable hardware that can be tailored for specific models; occasionally used for low-latency inference or on-prem appliances.
- ASICs: Application-specific integrated circuits, e.g. Tesla Dojo for Autopilot vision, or Graphcore IPUs, designed to accelerate ML workloads with a different architecture.

Understanding these isn’t strictly required for a generalist, but as an expert you should be aware of their existence and trade-offs. For instance, GPUs are highly programmable and supported by popular frameworks (PyTorch, TensorFlow), which makes them versatile. TPUs offer high throughput but mostly within Google’s ecosystem. FPGAs and ASICs can yield efficiency gains for fixed workloads but are harder to program and less flexible.