GPU VRAM for LLM Inference: Practical Sizing for Engineering Teams

Choosing the right hardware for LLM inference isn't just about raw speed. We break down VRAM requirements, cost-per-token trade-offs, and why memory bandwidth is the silent bottleneck in your AI stack.

Krapton Engineering
Reviewed by a senior engineer | 4 min read

For engineering teams deploying LLMs, the conversation has shifted from "how many GPUs do we need" to "how much VRAM can we afford to keep active." In our recent work architecting inference pipelines, we found that throughput isn't constrained by raw FLOPS as often as it is by memory bandwidth and VRAM capacity. When you are serving models like Llama 3 or Mistral, the hardware choice dictates your entire cost-per-token model.

TL;DR: VRAM capacity determines if a model fits, but memory bandwidth determines if it serves users fast enough. For inference, prioritize VRAM capacity first to avoid offloading to system RAM, then focus on memory bandwidth to reduce latency.

Key takeaways

  • VRAM is the hard limit: Your model weights must fit entirely in VRAM to avoid the performance cliff of CPU-offloading.
  • Bandwidth matters more than TFLOPS: For inference (unlike training), memory bandwidth is the primary factor limiting your tokens-per-second.
  • Quantization is mandatory: 4-bit or 8-bit quantization is now industry standard for production inference to balance quality and cost.
  • Cloud vs. On-Prem: Renting H100s is cost-effective for high-traffic spikes; purchasing RTX 4090s or A6000s remains the winner for steady-state, lower-latency internal tooling.

The VRAM Math: Why Capacity Dictates Performance

In a recent client engagement involving a custom RAG (Retrieval-Augmented Generation) pipeline, we observed a 10x latency jump when a model spilled over from GPU VRAM into system RAM. This happens because offloaded weights have to cross the PCIe bus, which is far slower than on-card memory: a PCIe 4.0 x16 link, for example, tops out around 32 GB/s per direction, versus roughly 1 TB/s for the GDDR6X on an RTX 4090 and over 3 TB/s for the HBM3 on an H100 SXM. When your model size exceeds your VRAM, you aren't just losing speed; you are breaking the user experience.

To estimate your minimum VRAM, use this rule of thumb: model parameters (in billions) × 2 bytes (for FP16) + context window overhead, which gives a figure in gigabytes. For a 70B parameter model, you need at least 140GB of VRAM just to load the weights. Quantizing to 4-bit cuts the weights to roughly 35GB (0.5 bytes per parameter, plus a small overhead for quantization metadata), but you must still account for the KV cache, which grows linearly with both your context window length and the number of concurrent sequences.
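The rule of thumb above can be sketched as a small calculator. This is a minimal illustration, not a sizing tool: the function names and the `overhead_gb` parameter are hypothetical, and the overhead figure is something you would need to measure for your own serving stack.

```python
def weight_vram_gb(params_billions: float, bits_per_param: float = 16) -> float:
    """Approximate VRAM needed for the weights alone, in decimal GB."""
    bytes_per_param = bits_per_param / 8
    # (params_billions * 1e9 params) * bytes_per_param / (1e9 bytes per GB)
    return params_billions * bytes_per_param


def min_vram_gb(params_billions: float, bits_per_param: float = 16,
                overhead_gb: float = 0.0) -> float:
    """Weights plus a caller-supplied allowance (assumption) for KV cache
    and runtime buffers."""
    return weight_vram_gb(params_billions, bits_per_param) + overhead_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"70B at {bits}-bit: ~{weight_vram_gb(70, bits):.0f} GB for weights")
```

Real quantized checkpoints tend to come out slightly larger than this estimate because they store scales and other metadata alongside the packed weights.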

Hardware Comparison for AI Inference

Hardware Class                  | Typical VRAM           | Best For                                      | Cost Profile
Consumer (RTX 4090)             | 24GB                   | Local dev, fine-tuning, small-scale inference | Low CapEx
Prosumer (RTX 6000 Ada)         | 48GB                   | Mid-sized models, production edge             | Mid-range
Enterprise (A100/H100)          | 80GB+                  | High-concurrency LLM APIs                     | High OpEx/Rental
Apple Silicon (M3/M4 Max/Ultra) | Up to 128GB+ (Unified) | Local inference, dev workstations             | High CapEx
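To connect the table to the sizing math, the sketch below estimates how many cards of each class a given memory requirement would need. The per-GPU figures come from the table; the `usable_fraction` default of 0.9 is an assumption standing in for memory the runtime reserves, and the result is a lower bound because splitting a model across GPUs adds its own communication buffers.

```python
import math

# Per-GPU VRAM in GB, taken from the comparison table.
VRAM_GB = {"RTX 4090": 24, "RTX 6000 Ada": 48, "A100/H100 (80GB)": 80}


def gpus_needed(required_gb: float, per_gpu_gb: float,
                usable_fraction: float = 0.9) -> int:
    """Lower bound on GPU count; usable_fraction is an assumed allowance
    for memory that is not available to weights and KV cache."""
    usable_gb = per_gpu_gb * usable_fraction
    return math.ceil(required_gb / usable_gb)


if __name__ == "__main__":
    # 70B weights: 140 GB at FP16, 35 GB at 4-bit (weights only).
    for label, required_gb in (("FP16", 140), ("4-bit", 35)):
        for gpu, vram in VRAM_GB.items():
            print(f"70B {label} on {gpu}: at least {gpus_needed(required_gb, vram)} GPU(s)")
```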

The Memory Bandwidth Bottleneck

As of 2026, memory bandwidth is the true arbiter of your tokens-per-second (TPS). While many engineers focus on CUDA core counts, token-by-token generation at low batch sizes is memory-bound: every generated token requires streaming the full set of model weights through the memory controller. According to NVIDIA’s developer documentation, maximizing memory throughput is the most effective way to optimize inference latency for autoregressive models.

In our own internal testing, we found that using an RTX 4090 with faster GDDR6X memory often outperformed older enterprise cards with higher raw compute but slower memory clocks for simple text generation tasks. If your application requires low-latency responses, prioritize memory bus width and clock speed over raw CUDA core count.
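A back-of-the-envelope way to see the bandwidth ceiling: if every token requires reading all weights once, single-stream decode speed cannot exceed memory bandwidth divided by model size. The sketch below applies that bound. The bandwidth values are approximate published peak figures and the function name is hypothetical; real throughput will land below these numbers, and batching changes the picture because one pass over the weights then serves several sequences.

```python
# Approximate peak bandwidth in GB/s (assumed round figures, not measurements).
BANDWIDTH_GB_S = {
    "RTX 4090 (GDDR6X)": 1008,
    "H100 SXM (HBM3)": 3350,
    "PCIe 4.0 x16 (offload path)": 32,
}


def max_decode_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream tokens/s when decoding is memory-bound:
    each token needs one full read of the weights."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    weights_gb = 16  # e.g. an 8B-parameter model at FP16
    for name, bandwidth in BANDWIDTH_GB_S.items():
        print(f"{name}: at most ~{max_decode_tps(bandwidth, weights_gb):.0f} tokens/s")
```

The PCIe row is there to show why spilling weights into system RAM is so costly: the same model that is bounded at tens of tokens per second from VRAM is bounded at a handful when the weights must cross the bus.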

Apple Silicon: The Dev Workstation Wildcard

Apple’s Unified Memory Architecture (UMA) is a game-changer for local AI development. Because the GPU shares memory with the CPU, you can load models that would require multiple enterprise GPUs on a single workstation. We frequently use M-series Max chips to test quantized models locally before pushing to AWS cloud infrastructure.

However, Apple Silicon is not a replacement for data center-grade hardware. Latency is higher than on H100s, and you lack the massive parallel throughput required for high-concurrency production endpoints. It remains the best tool for local experimentation and fine-tuning scripts.

When NOT to use this approach

Avoid building a high-concurrency production inference cluster on local workstations or consumer-grade hardware. While tempting for cost savings, you will face severe stability issues, power delivery failures, and lack of support for enterprise-grade orchestration tools like Kubernetes with GPU scheduling. If you need to serve thousands of concurrent requests, stick to managed cloud instances or dedicated server-grade hardware.

Common Pitfalls in Inference Setup

  • Ignoring KV Cache: Developers often calculate memory for weights but forget the KV cache. As your context window increases (e.g., to 128k tokens), the KV cache can consume tens of gigabytes of VRAM, and it scales with the number of concurrent sequences as well.
  • PCIe Bottlenecks: If you are running multi-GPU setups on consumer motherboards, ensure you have enough PCIe lanes. A constrained bus will throttle your model loading and inter-GPU communication.
  • Power Constraints: Enterprise GPUs require dedicated power delivery. We have seen production rollouts fail because a standard rack power distribution unit (PDU) could not handle the transient power spikes of high-end GPUs.
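To put a number on the KV cache pitfall, the sketch below uses the standard sizing formula for a decoder-only transformer: two tensors (keys and values) per layer, each holding one vector per KV head per token. The model shape in the example (32 layers, 8 KV heads, head dimension 128) is an assumption chosen to resemble an 8B-class model with grouped-query attention; substitute the values from your own model's config.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """KV cache size in bytes. The leading 2 accounts for storing both
    keys and values; bytes_per_value=2 corresponds to FP16."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


if __name__ == "__main__":
    # Assumed shape for illustration: 32 layers, 8 KV heads, head_dim 128, FP16.
    for tokens in (8 * 1024, 32 * 1024, 128 * 1024):
        size_gib = kv_cache_bytes(32, 8, 128, tokens) / 1024**3
        print(f"{tokens:>7} tokens: {size_gib:.1f} GiB per sequence")
```

Under these assumptions a single 128k-token sequence needs about 16 GiB, which is as much as the FP16 weights of the model itself, and each additional concurrent sequence adds its own cache.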

Next Steps for Your AI Infrastructure

Selecting the right hardware is only half the battle. Once the hardware is in place, you need to optimize your serving layer (vLLM, TGI, or a custom Triton implementation) to fully utilize that bandwidth. Whether you are building an on-prem cluster or optimizing your cloud spend, having the right architecture is critical.

About the author

Krapton Engineering is a team of principal-level developers and architects who have spent years shipping high-performance web applications, AI-integrated SaaS products, and complex backend infrastructure. We focus on pragmatic, performance-first engineering.
