GPT-OSS-120B inference: which GPUs make sense for hosting it in 2026?

6 min read

Running GPT-OSS-120B in production sounds like a pure compute problem. In practice, it's a memory problem first, then everything else. DevOps teams want predictable latency and clean scaling. CTOs want a platform choice that won't stall delivery. CFOs want a cost line they can defend.

GPT-OSS-120B is a 117B-parameter Mixture-of-Experts model, yet only about 5.1B parameters are active per token. That lowers compute compared with dense 120B models, but it doesn't magically remove VRAM pressure. Weights, KV cache (especially with long context), and batching still drive GPU choice.

This guide gives a practical shortlist (best, good, budget) and simple sizing rules for single-GPU and multi-GPU setups.

https://www.youtube.com/watch?v=YfKdj7GtJ80

What GPT-OSS-120B demands from a GPU (VRAM first, then bandwidth, then compute)

Illustration: a high-end GPU card (H100-class) in a server rack, with its VRAM modules highlighted.

If you only remember one thing, make it this: VRAM sets the floor. GPT-OSS-120B uses an MoE Transformer (36 layers, 128 experts, top-4 routing), supports up to 128k context, and uses RoPE. Those details matter because they change how memory is used, even when active compute stays modest.

Quantisation helps a lot. In particular, GPT-OSS-120B is distributed with MXFP4 quantisation on the MoE weights, which is what brings the "fits-on-one-GPU" path into reach for 80GB cards in many real deployments. Still, "fits" depends on your context length and concurrency, not just the weight file.

For the authoritative spec summary, start with the GPT-OSS-120B model card, then work backwards into your VRAM budget.

Two practical rules of thumb:

  • 80GB VRAM is the clean target for a single-GPU deployment with sensible quantisation and headroom.
  • If you try to squeeze onto smaller cards, you often pay for it with shorter context, tiny batches, and harder ops.

Latency and throughput pull you in different directions. Low latency (fast time-to-first-token) likes single-GPU setups with enough VRAM to avoid paging and model sharding. High throughput (many tokens per second across many users) benefits from batching, but batching raises KV cache and eats memory.

A plain-English VRAM checklist for real workloads

Use this checklist before you pick a GPU:

  • Model weights: smaller with quantisation, larger with FP16.
  • KV cache: grows with context length and concurrency; it's the silent VRAM killer at 64k to 128k.
  • Batching and concurrency: more parallel requests need more cache and workspace memory.
  • Safety margin: plan for 10 to 20 percent headroom to avoid fragmentation crashes.
  • Framework overhead: the serving stack needs VRAM too (allocator, graphs, kernels).

A quick example: a customer chat service at 8k context with modest batching often leaves comfortable headroom on an 80GB card when quantised. In contrast, an internal analytics workflow pushing 64k context can crowd VRAM quickly, even with fewer users, because the KV cache dominates.

If long context is non-negotiable, budget VRAM for KV cache first, then decide how much batching you can afford.

Why memory bandwidth and interconnect matter once VRAM is covered

After you clear VRAM, memory bandwidth decides how quickly the GPU can feed its compute units. That shows up as smoother tokens-per-second under load, especially when you run multi-tenant inference with batching.

Interconnect matters when you split the model across GPUs. NVLink (or similar high-speed links) reduces the tax you pay when layers and activations hop between cards. On the other hand, single-GPU deployments avoid most interconnect pain, which is why 80GB-class cards are operationally attractive.

Don't ignore the rest of the box, though. A 16-core (or better) server CPU, 128GB or more system RAM, and fast NVMe storage reduce cold-start and model load time. They won't fix a VRAM shortage, but they stop avoidable bottlenecks.

GPU options that make sense for hosting GPT-OSS-120B in 2026

Illustration: four GPUs (H100, MI300X, A100, RTX) compared side by side, with VRAM bars and performance indicators.

In 2026, the "sensible" list is short because GPT-OSS-120B is big enough that VRAM determines your options. The goal is simple: keep the model on one GPU if you can, then scale with more replicas. Only split across multiple GPUs when you must.

Below is a decision-focused shortlist, tuned for US sourcing and cloud availability realities.

Best all-rounder for production: NVIDIA H100 80GB

H100 80GB is the default recommendation for teams that need predictable latency. It's a clean single-GPU target for quantised GPT-OSS-120B, and it has strong software support across common inference stacks.

It's a great fit for customer-facing chat, tool use, and steady multi-tenant traffic where batching improves cost per token. The hourly rate is high, but fewer GPUs, simpler deployment, and fewer weird failure modes can lower total cost of ownership.

Strong alternative: AMD MI300X (192GB) when price or supply is better

MI300X is attractive when your bottleneck is memory capacity and bandwidth, or when you can source AMD capacity faster. For long-context workloads, its 192GB of HBM3 can give you breathing room that changes the design from "multi-GPU required" to "single GPU per replica".

The trade-off is rarely raw hardware. It's the software path and the team's experience. A short proof-of-concept helps you validate kernels, quantisation support, and stability under your exact load. If you want a vendor-neutral comparison to set expectations, see this MI300X vs H100 inference discussion.

Common "good enough" choice: NVIDIA A100 80GB

A100 80GB remains widely deployed in clouds and data centres, so it's often easier to procure. It's also a known quantity for driver stability and tooling.

Compared with H100, you should expect lower throughput at the same load, and higher latency once you push concurrency. Still, it can be a smart choice when H100 stock is tight, or when the project needs stable, well-tested configs over peak performance.

Budget and edge deployments: consumer RTX cards (what works, what breaks)

Consumer RTX cards can run GPT-OSS-120B for pilots, demos, and internal tools, but VRAM limits bite fast. Long context, higher concurrency, and strict latency targets will feel painful sooner than people expect.

If you go this route, set guardrails early: keep context shorter, keep batch sizes small, and accept that you're learning, not scaling. Also plan an upgrade path to an 80GB-class card once usage becomes real. For reference files, configs, and community notes, the GPT-OSS-120B Hugging Face page is a useful starting point.

How to choose the right setup for your team (latency, users, and cost)

Illustration: a Kubernetes cluster dashboard showing GPU nodes scaling replicas, with latency and user charts.

Hardware choice gets easier when you agree on three numbers: target context length, peak concurrent users, and acceptable latency. From there, the simplest cost model usually wins: keep each replica self-contained on one GPU, then scale replicas horizontally.

Here's a quick matrix for planning:

| Goal | Recommended setup | Why it works | What to watch |
| --- | --- | --- | --- |
| Lowest ops risk | Single 80GB GPU per replica | No sharding, stable latency | GPU availability, hourly cost |
| Must run with less VRAM | 2 to 4 GPUs per replica (model split) | Fits when 80GB cards aren't available | Interconnect overhead, more failure points |
| Cheapest way to learn | Consumer RTX pilot | Fast feedback for apps and prompts | Context limits, unpredictable latency |

DevOps details matter at runtime. GPU memory fragmentation can creep up after many loads. Model warm-up affects first request latency. Autoscaling needs a real signal (queue depth, GPU utilisation, p95 latency). Monitoring should include VRAM usage, KV cache growth, and token throughput.
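Those autoscaling signals can be combined into a simple scale decision. A sketch with made-up thresholds (the field names and limits are assumptions to adapt to your own metrics pipeline, not part of any particular autoscaler's API):

```python
from dataclasses import dataclass

@dataclass
class ReplicaStats:
    queue_depth: int        # requests waiting per replica
    p95_latency_ms: float   # end-to-end p95 latency
    vram_used_frac: float   # VRAM usage as a fraction, 0..1

def desired_replicas(current: int, stats: ReplicaStats,
                     max_queue: int = 4, max_p95_ms: float = 2000,
                     max_vram: float = 0.9) -> int:
    """Scale out if ANY signal is saturated; scale in only when all are calm."""
    if (stats.queue_depth > max_queue
            or stats.p95_latency_ms > max_p95_ms
            or stats.vram_used_frac > max_vram):
        return current + 1
    if (stats.queue_depth == 0
            and stats.p95_latency_ms < max_p95_ms * 0.5
            and current > 1):
        return current - 1
    return current

# Queue is backed up, so add a replica even though latency looks fine
print(desired_replicas(2, ReplicaStats(queue_depth=7,
                                       p95_latency_ms=1500,
                                       vram_used_frac=0.8)))  # 3
```

The design choice worth copying is the asymmetry: any one saturated signal triggers scale-out, but scale-in requires every signal to be healthy, which avoids flapping when load is spiky.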

Stack8s fits well here because it lets you run privately, then place inference where it's cheapest while meeting data and latency needs. That helps when US regions, on-prem, and other providers all have different GPU supply and pricing.

A simple sizing recipe you can copy

  • Pick a target context length (for example, 8k for chat, 64k for analysis).
  • Estimate peak concurrency and whether you can batch requests.
  • Decide your latency target (p95 matters more than averages).
  • Choose a quantisation level that keeps quality acceptable.
  • Select the GPU class; aim for one 80GB GPU per replica where possible.
  • Keep 10 to 20 percent VRAM headroom for stability.
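As a worked example of the recipe, here is a back-of-envelope replica count. The per-replica batch capacity is an assumption you would confirm in a load test, not a published figure:

```python
import math

def replicas_needed(peak_concurrent_users: int, batch_per_replica: int) -> int:
    """One self-contained 80GB GPU per replica; scale horizontally to cover peak load."""
    return max(1, math.ceil(peak_concurrent_users / batch_per_replica))

# e.g. 120 peak concurrent users, each replica comfortably batching 16 requests
print(replicas_needed(120, 16))  # 8 replicas
```

If the measured batch capacity drops (say, because you raised the context length and KV cache squeezed out batch slots), the replica count and the cost line rise with it, which is exactly the trade-off the recipe is meant to surface.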

Deployment patterns that reduce risk on Stack8s

Start with one 80GB GPU per replica, then scale out replicas as users grow. That keeps rollbacks simple and avoids multi-GPU complexity until you truly need it.

Next, use multi-region or hybrid placement when data rules or latency require it. With GitOps-style releases and private registries, you can roll forward safely, and keep your inference stack consistent across clouds and on-prem.

For broader context on picking inference-optimised GPUs and how teams think about cost and latency, this guide on GPUs for LLM inference workloads is a helpful reference.

Conclusion

For GPT-OSS-120B inference, VRAM sets the floor. If you can, choose an 80GB-class GPU and keep each replica on a single card. In 2026, H100 80GB is the safest production pick, while MI300X and A100 80GB are strong alternatives depending on supply, pricing, and team experience. Consumer RTX cards make sense for pilots, but they hit limits fast with long context and concurrency.

Next step: run a small load test using your real context length and peak users, then scale by adding replicas across regions or providers as demand grows.