LLM Recommendations for On-Premises Deployment
Quick reference for selecting models based on available GPU hardware.
Methodology
Disclaimer: This list does not replace testing and verification on the target hardware. Our own testing consisted only of basic "smoke testing" on 1x H100 or 1x L40S.
Hardware Tiers
| Tier | Example GPUs | Total VRAM | Typical Use |
|---|---|---|---|
| Enterprise | 4x H100 | 320 GB | High-volume RAG, multiple concurrent users |
| Professional | 4x L40S | 192 GB | Medium RAG workloads, team usage |
| Workstation | 2x RTX 4090 | 48 GB | Light RAG, single-user scenarios |
| Entry | 1x L4 or 1x 24GB GPU | 24 GB | Testing, low-volume usage |
Recommended Models by Hardware
4x H100 (320 GB VRAM)
Best for: High-volume RAG, large context windows, production workloads.
| Model | Quantization | Context | Concurrent Users | HuggingFace |
|---|---|---|---|---|
| Llama 3.3 70B Instruct | FP8 | 65k | 10-12 | Link |
| Llama 3.3 70B Instruct | FP8 | 32k | 20-25 | Link |
| Qwen2.5 72B Instruct | Int8 | 65k | 10-12 | Link |
| DeepSeek R1 Distill 70B | FP8 | 32k | 10-12 | Link |
Notes:
- Recommended: Llama 3.3 70B FP8 with 65k context for RAG-heavy workloads
- FP8/Int8 preserves quality while halving memory vs FP16; prefer over AWQ when VRAM allows
- Qwen2.5 72B shows faster response times than Llama 3.3 70B in testing
- Alternative: Qwen2.5 72B AWQ if maximum context/concurrency is needed over precision
- Qwen 3 large models: Only Qwen 3 32B tested; larger Qwen 3 variants (if available) not yet validated with TGI
4x L40S (192 GB VRAM)
Best for: Team usage, medium RAG workloads.
| Model | Quantization | Context | Concurrent Users | HuggingFace |
|---|---|---|---|---|
| Llama 3.3 70B Instruct | FP8 | 65k | 5-6 | Link |
| Llama 3.3 70B Instruct | FP8 | 32k | 10-12 | Link |
| Qwen2.5 72B Instruct | Int8 | 32k | 5-6 | Link |
| Qwen 3 32B | BNB-4bit | 32k | 8-10 | Link |
Notes:
- Recommended: Llama 3.3 70B FP8 for best quality/context balance
- Qwen 3 32B requires TGI 3.3.0+ and the `/no_think` suffix to disable thinking mode (see the sketch below)
- Context and concurrency scale with tensor parallelism across 4 GPUs
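A minimal sketch of applying the `/no_think` soft switch in the backend, assuming OpenAI-style chat messages are assembled before the request is forwarded to TGI (the function name and message structure are illustrative, not part of the basebox API):

```python
def disable_qwen3_thinking(messages: list[dict]) -> list[dict]:
    """Append Qwen 3's /no_think soft switch to the latest user message.

    Assumes an OpenAI-style list of {"role": ..., "content": ...} dicts;
    adjust to the actual request schema used by the backend.
    """
    patched = [dict(m) for m in messages]  # avoid mutating the caller's list
    for msg in reversed(patched):
        if msg.get("role") == "user":
            msg["content"] = f"{msg['content']} /no_think"
            break
    return patched

# Example:
# disable_qwen3_thinking([{"role": "user", "content": "Summarize the report."}])
# -> [{"role": "user", "content": "Summarize the report. /no_think"}]
```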
2x RTX 4090 (48 GB VRAM)
Best for: Single-user, light RAG, development.
| Model | Quantization | Context | Concurrent Users | HuggingFace |
|---|---|---|---|---|
| Llama 3.1 8B Instruct | FP8 | 128k | 1-2 | Link |
| DeepSeek R1 Distill 8B | AWQ | 32k | 1-2 | Link |
| Mistral 7B Instruct v0.3 | FP16 | 32k | 1-2 | Link |
Notes:
- 8B models fit comfortably with room for large contexts
- EETQ quantization improves tokens/sec for Llama 3.1 8B
- DeepSeek R1 requires backend handling of thinking tokens
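DeepSeek R1 models emit their reasoning between `<think>` and `</think>` tags. A minimal sketch of stripping these blocks from generated text before it reaches the client (the exact integration point depends on the backend):

```python
import re

# DeepSeek R1 wraps its reasoning in <think>...</think>; strip it before
# returning the answer. DOTALL lets the pattern span newlines.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Remove reasoning blocks (and any unclosed trailing block) from model output."""
    cleaned = THINK_BLOCK.sub("", text)
    # If generation was cut off mid-thought, drop everything after an unclosed tag.
    cleaned = re.sub(r"<think>.*\Z", "", cleaned, flags=re.DOTALL)
    return cleaned.strip()

# Example:
# strip_thinking("<think>chain of thought...</think>The answer is 42.")
# -> "The answer is 42."
```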
1x L4 / 24 GB GPU
Best for: Testing, demos, low-volume single user.
| Model | Quantization | Context | Concurrent Users | HuggingFace |
|---|---|---|---|---|
| Qwen3 4B Instruct | FP16 | 30k | 1 | Link |
| Llama 3.1 8B Instruct | FP8 | 32k | 1 | Link |
| Mistral 7B Instruct v0.3 | FP8 | 16k | 1 | Link |
Notes:
- Qwen3 4B has the best throughput (23 tokens/sec) and long-context reliability of the models in this tier
- All models validated for German language support and EU compliance
- Max practical context around 30k tokens
Quick Selection Guide
| Your Situation | Recommended Model |
|---|---|
| Large context RAG (65k) | Llama 3.3 70B FP8 on 4x H100 or 4x L40S |
| Balanced quality and concurrency | Llama 3.3 70B FP8 with 32k context |
| Faster response times | Qwen2.5 72B Int8 on 4x H100 |
| Limited VRAM, need good results | Llama 3.1 8B FP8 |
| Minimum hardware, testing | Qwen3 4B Instruct |
| Need reasoning capabilities | DeepSeek R1 Distill (8B or 70B) |
VRAM Requirements Reference
Approximate VRAM per model (inference only, excludes KV cache overhead):
| Model | FP16 | FP8/Int8 | Int4/AWQ |
|---|---|---|---|
| 70-72B | ~145 GB | ~75 GB | ~38 GB |
| 32B | ~68 GB | ~35 GB | ~18 GB |
| 8B | ~18 GB | ~10 GB | ~5 GB |
| 4B | ~10 GB | ~5 GB | ~3 GB |
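These figures follow roughly from parameter count × bytes per weight plus runtime overhead. A minimal sketch of that estimate (the 5% overhead factor is an assumption; actual usage varies by inference engine and model):

```python
# Approximate bytes per weight for common quantization schemes.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5, "awq": 0.5}

def estimate_weight_vram_gb(params_billion: float, quant: str,
                            overhead: float = 1.05) -> float:
    """Weight-only VRAM estimate in GB (no KV cache, no activations)."""
    return round(params_billion * BYTES_PER_PARAM[quant] * overhead, 1)

# Compare with the table above:
# estimate_weight_vram_gb(70, "fp16") -> ~147 GB   (table: ~145 GB)
# estimate_weight_vram_gb(70, "fp8")  -> ~73.5 GB  (table: ~75 GB)
# estimate_weight_vram_gb(32, "int4") -> ~16.8 GB  (table: ~18 GB)
```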
Context length significantly increases memory usage. Longer contexts require additional VRAM for KV cache.
Context Size vs Concurrent Users
There is a direct trade-off between context window size and the number of concurrent users:
- KV cache grows with context: Each token in the context requires memory for key-value cache. A 65k context uses roughly 4x the KV cache memory of a 16k context.
- Concurrent requests multiply memory: Each concurrent user needs their own KV cache allocation.
- Practical formula (see the sketch after the example table below):
Required VRAM ≈ Model weights + (Context size × Concurrent users × KV cache per token), which must fit within the available VRAM
Example for Llama 3.3 70B FP8 on 4x H100 (320 GB):
| Context | Max Concurrent Users | Use Case |
|---|---|---|
| 65k | 10-12 | Large document RAG |
| 32k | 20-25 | Standard RAG, team usage |
| 16k | 40-50 | Short queries, high concurrency |
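A minimal sketch of how these figures can be approximated, assuming the Llama 3 70B architecture (80 layers, 8 KV heads, head dimension 128) and an FP16 KV cache at 2 bytes per value; treat the results as ballpark estimates, not capacity guarantees:

```python
def kv_cache_bytes_per_token(layers: int = 80, kv_heads: int = 8,
                             head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Per-token KV cache: keys + values across all layers and KV heads."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value  # ~320 KB for Llama 3.3 70B

def max_concurrent_users(total_vram_gb: float, weights_gb: float,
                         context_tokens: int) -> int:
    """How many full-context KV caches fit beside the model weights."""
    kv_per_user_gb = context_tokens * kv_cache_bytes_per_token() / 1024**3
    return int((total_vram_gb - weights_gb) // kv_per_user_gb)

# Llama 3.3 70B FP8 (~75 GB weights) on 4x H100 (320 GB total):
# max_concurrent_users(320, 75, 65_000) -> ~12  (table: 10-12)
# max_concurrent_users(320, 75, 32_000) -> ~25  (table: 20-25)
# max_concurrent_users(320, 75, 16_000) -> ~50  (table: 40-50)
# An FP8 KV cache (bytes_per_value=1) would roughly double these figures.
```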
Recommendation: For RAG workloads, prioritize context size over concurrent users. A 65k context allows processing larger documents and more retrieved chunks, improving answer quality.
KV cache optimization: Most inference engines support KV cache quantization, which can roughly double concurrent user capacity at a slight quality cost. The estimates above assume default (FP16) KV cache.
GPU Allocation: LLM vs RAG Server
The basebox stack includes two GPU-capable components:
| Component | GPU Usage | Purpose |
|---|---|---|
| LLM Server (TGI) | High VRAM, latency-sensitive | Text generation, chat |
| RAG Server | Low-moderate GPU usage | Embedding, reranking, OCR |
Recommendation: Dedicate GPUs to LLM
In most deployments, run the RAG server in CPU mode and allocate all GPUs to the LLM:
| Hardware | LLM Allocation | RAG Mode | Notes |
|---|---|---|---|
| 4x H100 | 4x H100 | CPU | H100 is overkill for RAG; CPU handles embedding/reranking fine |
| 4x L40S | 4x L40S | CPU | Keep all GPUs for LLM context and concurrency |
| 2x RTX 4090 | 2x RTX 4090 | CPU | No VRAM to spare for RAG |
| 1x 24GB | 1x 24GB | CPU | Single GPU needed entirely for LLM |
When to Use GPU for RAG
Enable GPU mode for RAG only if:
- OCR is required for scanned PDFs or image files
- Reranking performance is critical and CPU is too slow
- Dedicated RAG hardware is available (e.g., separate machine or spare GPU)
If OCR is needed on a multi-GPU setup (e.g., 4x L40S), you can either:
- Dedicate 1 GPU to RAG: Predictable performance, but LLM uses only 3 GPUs
- Share 1 GPU: RAG runs on one of the LLM GPUs; works if document ingestion is infrequent, but OCR spikes may briefly affect LLM latency
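In either case, the RAG/OCR process can be pinned to a specific GPU by setting CUDA_VISIBLE_DEVICES before any CUDA library initializes. A minimal sketch (the device index 3 is only an example):

```python
import os

# Must be set before torch / CUDA is initialized in this process, so the
# RAG/OCR worker only ever sees the one GPU it is allowed to use.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "3")  # e.g. reserve the 4th GPU for RAG/OCR

import torch  # noqa: E402  (imported after pinning on purpose)

device = "cuda:0" if torch.cuda.is_available() else "cpu"  # cuda:0 maps to physical GPU 3
print(f"RAG worker running on {device}")
```

The LLM server is then started with only the remaining GPUs visible, for example via the same environment variable or the container runtime's per-container GPU device selection.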
Known Issues
- Llama 3.3 70B AWQ: May show high perplexity; tune repetition penalty and temperature
- DeepSeek R1 models: Backend must filter thinking tokens from responses
- Qwen 3 32B: Only BNB-4bit works with TGI; append `/no_think` to user prompts
- DeepSeek V3: Does not load with current TGI versions; check later TGI versions
- GGUF models: Require llama.cpp backend compilation on the target machine; not recommended