
LLM Recommendations for On-Premises Deployment

Quick reference for selecting models based on available GPU hardware.

Methodology

Disclaimer: This list does not replace testing and verification on the target hardware. Our own testing consisted only of basic smoke testing on a single H100 or a single L40S.

Hardware Tiers

| Tier | Example GPUs | Total VRAM | Typical Use |
|---|---|---|---|
| Enterprise | 4x H100 | 320 GB | High-volume RAG, multiple concurrent users |
| Professional | 4x L40S | 192 GB | Medium RAG workloads, team usage |
| Workstation | 2x RTX 4090 | 48 GB | Light RAG, single-user scenarios |
| Entry | 1x L4 or 1x 24GB GPU | 24 GB | Testing, low-volume usage |

4x H100 (320 GB VRAM)

Best for: High-volume RAG, large context windows, production workloads.

| Model | Quantization | Context | Concurrent Users | HuggingFace |
|---|---|---|---|---|
| Llama 3.3 70B Instruct | FP8 | 65k | 10-12 | Link |
| Llama 3.3 70B Instruct | FP8 | 32k | 20-25 | Link |
| Qwen2.5 72B Instruct | Int8 | 65k | 10-12 | Link |
| DeepSeek R1 Distill 70B | FP8 | 32k | 10-12 | Link |

Notes:

  • Recommended: Llama 3.3 70B FP8 with 65k context for RAG-heavy workloads
  • FP8/Int8 preserves quality while halving memory vs FP16; prefer over AWQ when VRAM allows
  • Qwen2.5 72B shows faster response times than Llama 3.3 70B in testing
  • Alternative: Qwen2.5 72B AWQ if maximum context/concurrency is needed over precision
  • Qwen 3 large models: Only Qwen 3 32B tested; larger Qwen 3 variants (if available) not yet validated with TGI

4x L40S (192 GB VRAM)

Best for: Team usage, medium RAG workloads.

| Model | Quantization | Context | Concurrent Users | HuggingFace |
|---|---|---|---|---|
| Llama 3.3 70B Instruct | FP8 | 65k | 5-6 | Link |
| Llama 3.3 70B Instruct | FP8 | 32k | 10-12 | Link |
| Qwen2.5 72B Instruct | Int8 | 32k | 5-6 | Link |
| Qwen 3 32B | BNB-4bit | 32k | 8-10 | Link |

Notes:

  • Recommended: Llama 3.3 70B FP8 for best quality/context balance
  • Qwen 3 32B requires TGI 3.3.0+ and the /no_think suffix to disable thinking mode (see the request sketch below)
  • Context and concurrency scale with tensor parallelism across 4 GPUs
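
As a concrete illustration of the /no_think note, the sketch below appends the suffix to the user turn when querying TGI's OpenAI-compatible chat endpoint via huggingface_hub. The endpoint URL is a placeholder and the client library is an assumption; any HTTP client works the same way.

```python
# Minimal sketch, assuming a TGI >= 3.3.0 instance serving Qwen 3 32B at this URL.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # placeholder endpoint

def ask(question: str) -> str:
    # Appending /no_think tells Qwen 3 to skip its reasoning ("thinking") block.
    messages = [{"role": "user", "content": f"{question} /no_think"}]
    response = client.chat_completion(messages=messages, max_tokens=512)
    return response.choices[0].message.content

print(ask("List the delivery terms mentioned in the uploaded contract."))
```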

2x RTX 4090 (48 GB VRAM)

Best for: Single-user, light RAG, development.

| Model | Quantization | Context | Concurrent Users | HuggingFace |
|---|---|---|---|---|
| Llama 3.1 8B Instruct | FP8 | 128k | 1-2 | Link |
| DeepSeek R1 Distill 8B | AWQ | 32k | 1-2 | Link |
| Mistral 7B Instruct v0.3 | FP16 | 32k | 1-2 | Link |

Notes:

  • 8B models fit comfortably with room for large contexts
  • EETQ quantization improves tokens/sec for Llama 3.1 8B
  • DeepSeek R1 requires backend handling of thinking tokens
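
"Backend handling of thinking tokens" means stripping the model's reasoning block before the answer reaches the user. A minimal sketch, assuming the block is wrapped in <think>…</think> tags as DeepSeek R1 models typically emit; verify the exact format against your deployment.

```python
import re

# Assumed delimiters for the DeepSeek R1 reasoning block.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(completion: str) -> str:
    """Drop the reasoning block so only the final answer reaches the user."""
    return THINK_BLOCK.sub("", completion).strip()

raw = "<think>The user asks about payment terms...</think>Payment is due within 30 days."
print(strip_thinking(raw))  # -> "Payment is due within 30 days."
```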

1x L4 / 24 GB GPU

Best for: Testing, demos, low-volume single user.

| Model | Quantization | Context | Concurrent Users | HuggingFace |
|---|---|---|---|---|
| Qwen3 4B Instruct | FP16 | 30k | 1 | Link |
| Llama 3.1 8B Instruct | FP8 | 32k | 1 | Link |
| Mistral 7B Instruct v0.3 | FP8 | 16k | 1 | Link |

Notes:

  • Qwen3 4B offers the best throughput at this tier (23 tokens/sec) and the most reliable long-context behavior
  • All models validated for German language support and EU compliance
  • Max practical context around 30k tokens

Quick Selection Guide

| Your Situation | Recommended Model |
|---|---|
| Large context RAG (65k) | Llama 3.3 70B FP8 on 4x H100 or 4x L40S |
| Balanced quality and concurrency | Llama 3.3 70B FP8 with 32k context |
| Faster response times | Qwen2.5 72B Int8 on 4x H100 |
| Limited VRAM, need good results | Llama 3.1 8B FP8 |
| Minimum hardware, testing | Qwen3 4B Instruct |
| Need reasoning capabilities | DeepSeek R1 Distill (8B or 70B) |

VRAM Requirements Reference

Approximate VRAM per model (inference only, excludes KV cache overhead):

| Model Size | FP16 | FP8/Int8 | Int4/AWQ |
|---|---|---|---|
| 70-72B | ~145 GB | ~75 GB | ~38 GB |
| 32B | ~68 GB | ~35 GB | ~18 GB |
| 8B | ~18 GB | ~10 GB | ~5 GB |
| 4B | ~10 GB | ~5 GB | ~3 GB |

Context length significantly increases memory usage, since longer contexts require additional VRAM for the KV cache; see the next section for the trade-off.
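
The per-model figures above follow the usual rule of thumb of one byte per parameter per 8 bits of precision, plus a small runtime overhead. The sketch below reproduces that estimate; the ~5% overhead factor is an assumption, not a measured value.

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_param: int,
                            overhead: float = 1.05) -> float:
    """Parameters (in billions) x bytes per parameter gives GB; add runtime overhead."""
    return params_billion * (bits_per_param / 8) * overhead

# Roughly matches the table above for a 70B model:
print(round(estimate_weight_vram_gb(70, 16)))  # ~147 GB (FP16)
print(round(estimate_weight_vram_gb(70, 8)))   # ~74 GB  (FP8/Int8)
print(round(estimate_weight_vram_gb(70, 4)))   # ~37 GB  (Int4/AWQ)
```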

Context Size vs Concurrent Users

There is a direct trade-off between context window size and the number of concurrent users:

  • KV cache grows with context: Each token in the context requires memory for key-value cache. A 65k context uses roughly 4x the KV cache memory of a 16k context.
  • Concurrent requests multiply memory: Each concurrent user needs their own KV cache allocation.
  • Practical formula: Required VRAM ≈ Model weights + (Context size × Concurrent users × KV cache per token); see the estimator sketch below
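
The sketch below turns that formula into a rough concurrency estimator. The per-token KV cache size for Llama 3.3 70B is derived from its architecture (80 layers, 8 KV heads of dimension 128, FP16 cache) and is an estimate; real engines also reserve memory for activations and scheduling, so treat the results as upper bounds.

```python
def max_concurrent_users(total_vram_gb: float, weights_gb: float,
                         context_tokens: int, kv_bytes_per_token: int) -> int:
    """Required VRAM = weights + context * users * KV-per-token, solved for users."""
    kv_budget_bytes = (total_vram_gb - weights_gb) * 1e9
    return int(kv_budget_bytes // (context_tokens * kv_bytes_per_token))

# Llama 3.3 70B FP8 on 4x H100: 320 GB total, ~75 GB weights (see the table above).
# Assumed FP16 KV cache: 2 (K and V) x 80 layers x 8 KV heads x 128 head dim x 2 bytes.
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2  # ~0.31 MB per token per user

for context in (65_536, 32_768, 16_384):
    users = max_concurrent_users(320, 75, context, KV_BYTES_PER_TOKEN)
    print(f"{context:>6} tokens -> ~{users} users")
```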

Example for Llama 3.3 70B FP8 on 4x H100 (320 GB):

| Context | Max Concurrent Users | Use Case |
|---|---|---|
| 65k | 10-12 | Large document RAG |
| 32k | 20-25 | Standard RAG, team usage |
| 16k | 40-50 | Short queries, high concurrency |

Recommendation: For RAG workloads, prioritize context size over concurrent users. A 65k context allows processing larger documents and more retrieved chunks, improving answer quality.

KV cache optimization: Most inference engines support KV cache quantization, which can roughly double concurrent user capacity at a slight quality cost. The estimates above assume default (FP16) KV cache.


GPU Allocation: LLM vs RAG Server

The basebox stack includes two GPU-capable components:

| Component | GPU Usage | Purpose |
|---|---|---|
| LLM Server (TGI) | High VRAM, latency-sensitive | Text generation, chat |
| RAG Server | Low to moderate GPU usage | Embedding, reranking, OCR |

Recommendation: Dedicate GPUs to LLM

In most deployments, run the RAG server in CPU mode and allocate all GPUs to the LLM:

| Hardware | LLM Allocation | RAG Mode | Notes |
|---|---|---|---|
| 4x H100 | 4x H100 | CPU | H100 is overkill for RAG; CPU handles embedding/reranking fine |
| 4x L40S | 4x L40S | CPU | Keep all GPUs for LLM context and concurrency |
| 2x RTX 4090 | 2x RTX 4090 | CPU | No VRAM to spare for RAG |
| 1x 24GB | 1x 24GB | CPU | Single GPU needed entirely for LLM |
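
What "CPU mode" means in practice depends on the RAG server's configuration, but the pattern is simply to load the embedding and reranking models on CPU. A purely illustrative sketch using sentence-transformers; the library, the environment variable, and the model names are assumptions, not necessarily what the basebox RAG server uses.

```python
import os
from sentence_transformers import CrossEncoder, SentenceTransformer

# Assumed toggle: default to CPU, opt into GPU only when dedicated hardware exists.
device = "cuda" if os.environ.get("RAG_USE_GPU") == "1" else "cpu"

# Placeholder model names; substitute whatever the RAG server is configured with.
embedder = SentenceTransformer("intfloat/multilingual-e5-large", device=device)
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", device=device)

chunks = ["Invoice total: 1,200 EUR", "Payment due within 30 days."]
embeddings = embedder.encode(chunks)                                  # CPU is usually sufficient
scores = reranker.predict([("payment deadline", c) for c in chunks])  # reranking a handful of chunks
```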

When to Use GPU for RAG

Enable GPU mode for RAG only if:

  1. OCR is required for scanned PDFs or image files
  2. Reranking performance is critical and CPU is too slow
  3. Dedicated RAG hardware is available (e.g., separate machine or spare GPU)

If OCR is needed on a multi-GPU setup (e.g., 4x L40S), you can either:

  • Dedicate 1 GPU to RAG: Predictable performance, but LLM uses only 3 GPUs
  • Share 1 GPU: RAG runs on one of the LLM GPUs; works if document ingestion is infrequent, but OCR spikes may briefly affect LLM latency

Known Issues

  • Llama 3.3 70B AWQ: May show high perplexity; tune repetition penalty and temperature
  • DeepSeek R1 models: Backend must filter thinking tokens from responses
  • Qwen 3 32B: Only BNB-4bit works with TGI; append /no_think to user prompts
  • DeepSeek V3: Does not load with current TGI versions; check later TGI versions
  • GGUF models: Require compiling the llama.cpp backend on the target machine; not recommended