Inference
Overview
The Inference Server provides Large Language Model (LLM) inference capabilities for the basebox AI platform. It hosts and serves LLM models for text generation, chat completions, and other AI-powered features. The service typically requires GPU acceleration for optimal performance.
Deployment
The Inference Server is deployed via Helm chart to Kubernetes clusters with GPU support.
Helm Chart Configuration
Basic Settings
| Parameter | Default | Description |
|---|---|---|
| replicaCount | 1 | Number of inference server pod replicas |
| image.repository | gitea.basebox.health/basebox-distribution/inference | Container image repository |
| image.pullPolicy | IfNotPresent | Image pull policy |
| image.tag | latest | Image tag to deploy |
| fullnameOverride | inference | Override the full name of the deployment |
Service Configuration
| Parameter | Default | Description |
|---|---|---|
| service.type | ClusterIP | Kubernetes service type |
| service.port | 8080 | Service port |
Resource Management
| Parameter | Default | Description |
|---|---|---|
| resources | {} | CPU/memory/GPU resource requests and limits |
| autoscaling.enabled | false | Enable horizontal pod autoscaling |
| autoscaling.minReplicas | 1 | Minimum number of replicas |
| autoscaling.maxReplicas | 100 | Maximum number of replicas |
| autoscaling.targetCPUUtilizationPercentage | 80 | Target CPU utilization for scaling |
Health Checks
| Parameter | Default | Description |
|---|---|---|
| livenessProbe | {} | Liveness probe configuration (disabled by default) |
| readinessProbe | {} | Readiness probe configuration (disabled by default) |
Environment Variables
Core Configuration
| Variable | Default | Description |
|---|---|---|
| API_KEY | Required | API key for authentication |
| HOSTNAME | 0.0.0.0 | Server hostname to bind to |
| PORT | 8080 | Server port |
Model Configuration
| Variable | Default | Description |
|---|---|---|
| MODEL_ID | meta-llama/Llama-3.1-8B-Instruct | HuggingFace model identifier |
| MAX_INPUT_TOKENS | 16000 | Maximum input token length |
| HF_TOKEN | Required | HuggingFace API token for model downloads |
GPU Configuration
| Variable | Default | Description |
|---|---|---|
| NUM_GPUS | 1 | Number of GPUs to use |
| SHM_SIZE | 16gb | Shared memory size for GPU operations |
Cache Configuration
| Variable | Description |
|---|---|
| NUMBA_CACHE_DIR | Numba JIT compilation cache directory |
| TRITON_CACHE_DIR | Triton kernel cache directory |
| HF_HOME | HuggingFace home directory (optional) |
| TRANSFORMERS_CACHE | Transformers model cache directory (optional) |
| HF_HUB_OFFLINE | Enable offline mode (use cached models only) |
| HF_HUB_ENABLE_HF_TRANSFER | Enable faster HuggingFace transfers |
Storage Configuration
Model Cache Volumes
For persistent model and cache storage, define the following volumes and mounts:
volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: inference-models
  - name: cache-dir
    persistentVolumeClaim:
      claimName: inference-cache

volumeMounts:
  - name: model-cache
    mountPath: /data/.cache/huggingface
  - name: cache-dir
    mountPath: /data
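The claims referenced above (inference-models and inference-cache) must already exist in the target namespace. A minimal sketch of the two PersistentVolumeClaims, assuming a dynamic storage provisioner; the sizes and storage class are illustrative and should be sized for your models:

# pvc-inference.yaml (illustrative sizes and storage class)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: inference-models
  namespace: basebox
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 200Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: inference-cache
  namespace: basebox
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 50Gi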
Configuration Examples
Production Configuration
# values-production.yaml
replicaCount: 2

image:
  repository: gitea.basebox.health/basebox-distribution/inference
  tag: "v1.2.3"
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 8080

resources:
  requests:
    cpu: 4000m
    memory: 32Gi
    nvidia.com/gpu: 2
  limits:
    cpu: 8000m
    memory: 64Gi
    nvidia.com/gpu: 2

livenessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 300
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 180
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 5

volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: inference-models
  - name: cache-dir
    emptyDir:
      sizeLimit: 50Gi

volumeMounts:
  - name: model-cache
    mountPath: /data/.cache/huggingface
  - name: cache-dir
    mountPath: /data

nodeSelector:
  nvidia.com/gpu: "true"
  gpu-type: "a100"

env:
  API_KEY: "<generate-secure-api-key>"
  HOSTNAME: "0.0.0.0"
  PORT: "8080"
  # Model configuration
  MODEL_ID: "meta-llama/Llama-3.1-70B-Instruct"
  MAX_INPUT_TOKENS: "32000"
  HF_TOKEN: "<your-huggingface-token>"
  # GPU configuration
  NUM_GPUS: "2"
  SHM_SIZE: "32gb"
  # Cache configuration
  NUMBA_CACHE_DIR: "/data/numba_cache"
  TRITON_CACHE_DIR: "/data/triton_cache"
  HF_HOME: "/data/.cache/huggingface"
  TRANSFORMERS_CACHE: "/data/.cache/huggingface/transformers"

extraObjects:
  inference-models:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: inference-models
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 200Gi
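Before installing, the rendered manifests can be inspected locally to confirm the values above are picked up as intended. A sketch using standard Helm commands (Helm 3.8+ for OCI chart references):

# Render the chart locally with the production values and review the output
helm template inference oci://hub.basebox.ai/helm/inference \
  --values values-production.yaml \
  --namespace basebox > inference-rendered.yaml

# Spot-check GPU requests, volumes, and env wiring
grep -n "nvidia.com/gpu\|persistentVolumeClaim\|MODEL_ID" inference-rendered.yaml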
Offline Mode Configuration
For environments without internet access (pre-downloaded models):
# values-offline.yaml
replicaCount: 1

resources:
  requests:
    cpu: 4000m
    memory: 32Gi
    nvidia.com/gpu: 1
  limits:
    cpu: 8000m
    memory: 64Gi
    nvidia.com/gpu: 1

volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: inference-models-preloaded

volumeMounts:
  - name: model-cache
    mountPath: /data/.cache/huggingface

env:
  API_KEY: "<secure-api-key>"
  MODEL_ID: "meta-llama/Llama-3.1-8B-Instruct"
  HF_TOKEN: "<huggingface-token>"
  NUM_GPUS: "1"
  # Offline mode
  HF_HUB_OFFLINE: "1"
  HF_HOME: "/data/.cache/huggingface"
  TRANSFORMERS_CACHE: "/data/.cache/huggingface/transformers"
  NUMBA_CACHE_DIR: "/data/numba_cache"
  TRITON_CACHE_DIR: "/data/triton_cache"
Small Model Configuration
For resource-constrained environments:
# values-small.yaml
replicaCount: 1

resources:
  requests:
    cpu: 2000m
    memory: 16Gi
    nvidia.com/gpu: 1
  limits:
    cpu: 4000m
    memory: 32Gi
    nvidia.com/gpu: 1

env:
  API_KEY: "<secure-api-key>"
  MODEL_ID: "meta-llama/Llama-3.2-3B-Instruct"
  MAX_INPUT_TOKENS: "8000"
  HF_TOKEN: "<huggingface-token>"
  NUM_GPUS: "1"
  SHM_SIZE: "8gb"
Multi-GPU Configuration
For large models requiring multiple GPUs:
# values-multi-gpu.yaml
replicaCount: 1

resources:
  requests:
    cpu: 8000m
    memory: 128Gi
    nvidia.com/gpu: 4
  limits:
    cpu: 16000m
    memory: 256Gi
    nvidia.com/gpu: 4

nodeSelector:
  nvidia.com/gpu: "true"
  gpu-type: "a100"

tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

env:
  API_KEY: "<secure-api-key>"
  MODEL_ID: "meta-llama/Llama-3.1-405B-Instruct"
  MAX_INPUT_TOKENS: "64000"
  HF_TOKEN: "<huggingface-token>"
  NUM_GPUS: "4"
  SHM_SIZE: "64gb"
  HOSTNAME: "0.0.0.0"
  PORT: "8080"
Installation
Prerequisites
- Kubernetes cluster (1.23+)
- Helm 3.x
- GPU nodes with NVIDIA GPU Operator
- Storage provisioner with dynamic provisioning
- HuggingFace account with API token
- Sufficient storage for model files (50GB-500GB depending on model)
Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--wait
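Before deploying the inference server, it is worth confirming that the operator pods are healthy and that the GPU nodes actually advertise nvidia.com/gpu capacity; a quick check with standard kubectl:

# GPU Operator components should be Running or Completed
kubectl get pods -n gpu-operator

# GPU nodes should list nvidia.com/gpu under Capacity/Allocatable
kubectl describe nodes | grep -i "nvidia.com/gpu"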
Pre-download Models (Optional)
For faster startup and offline deployments, models can be fetched ahead of time. Note that the quick job below downloads into the job pod's ephemeral cache unless HF_TOKEN is set and a persistent volume is mounted; a complete Job manifest that writes to a PVC is shown under Model Download and Caching.
# Create a job to download models
kubectl create job model-download \
--image=gitea.basebox.health/basebox-distribution/inference:latest \
--namespace basebox \
-- python -c "from transformers import AutoModel; AutoModel.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')"
Install Inference Server
# Install with custom values
helm install inference oci://hub.basebox.ai/helm/inference \
--values values-production.yaml \
--namespace basebox \
--create-namespace
# Verify installation
kubectl get pods -n basebox -l app.kubernetes.io/name=inference
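Because the first start includes model download and loading, it can help to wait for the rollout explicitly rather than polling pods by hand. A sketch; the deployment name assumes the default fullnameOverride of inference, and the timeout is illustrative:

# Wait for the inference deployment to become ready
kubectl rollout status deployment/inference -n basebox --timeout=30m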
Upgrade
helm upgrade inference oci://hub.basebox.ai/helm/inference \
--values values-production.yaml \
--namespace basebox
Uninstall
helm uninstall inference --namespace basebox
# Delete PVCs if needed
kubectl delete pvc -n basebox inference-models
Verification
Check Deployment
# Check pods
kubectl get pods -n basebox -l app.kubernetes.io/name=inference
# View logs
kubectl logs -n basebox -l app.kubernetes.io/name=inference --tail=100 -f
# Check GPU allocation
kubectl describe pod -n basebox -l app.kubernetes.io/name=inference | grep -A5 "nvidia.com/gpu"
Test Inference Endpoint
# Port forward
kubectl port-forward -n basebox svc/inference 8080:8080
# Test API (example with curl)
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "X-API-Key: your-api-key" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
Monitor Model Loading
Model loading can take several minutes on first startup:
# Watch logs for model download progress
kubectl logs -n basebox -l app.kubernetes.io/name=inference -f | grep -i "download\|loading"
GPU Configuration
GPU Requirements
- NVIDIA GPU with CUDA support (Compute Capability 7.0+)
- Sufficient VRAM for model:
- 7B models: 16GB+ VRAM
- 13B models: 24GB+ VRAM
- 70B models: 80GB+ VRAM (A100)
- 405B models: 320GB+ VRAM (4x A100)
GPU Resource Allocation
resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1

# For multi-GPU
resources:
  requests:
    nvidia.com/gpu: 4
  limits:
    nvidia.com/gpu: 4
Verify GPU Access
# Check GPU inside pod
kubectl exec -n basebox <pod-name> -- nvidia-smi
# Check GPU utilization
kubectl exec -n basebox <pod-name> -- nvidia-smi dmon
Model Selection
Supported Models
Model Size Guidelines
| Model Size | VRAM Required | GPU Type | Use Case |
|---|---|---|---|
| 1B-3B | 8-16GB | T4, L4 | Small deployments, testing |
| 7B-13B | 16-24GB | A10, L40 | General purpose |
| 70B | 80GB | A100 | High quality responses |
| 405B | 320GB | 4x A100 | Maximum quality |
Integration with Other Services
AISRV Integration
AISRV connects to the inference server for LLM capabilities:
# AISRV configuration
env:
  AISRV_LLM_URL: "http://inference:8080"
  AISRV_LLM_CHAT_ENDPOINT: "/v1/chat/completions"
  AISRV_LLM_API_KEY: "<must-match-inference-api-key>"
  AISRV_LLM_MODEL: "meta-llama/Llama-3.1-8B-Instruct"
Security Considerations
API Key Management
- Generate cryptographically secure API keys
- Store API keys in Kubernetes secrets (see the sketch after this list)
- Rotate API keys regularly
- Use different keys per environment
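A minimal sketch of keeping the API key in a Kubernetes Secret instead of a values file. The secret name and generation command are illustrative; whether the chart can reference the Secret directly depends on its templates, so this example injects the value at install time with --set-string:

# Create the API key secret once per environment
kubectl create secret generic inference-api-key \
  --namespace basebox \
  --from-literal=API_KEY="$(openssl rand -hex 32)"

# Read it back at install/upgrade time instead of committing it to a values file
API_KEY="$(kubectl get secret inference-api-key -n basebox -o jsonpath='{.data.API_KEY}' | base64 -d)"
helm upgrade --install inference oci://hub.basebox.ai/helm/inference \
  --values values-production.yaml \
  --namespace basebox \
  --set-string env.API_KEY="$API_KEY"

Remember that AISRV_LLM_API_KEY must be set to the same value (see AISRV Integration below).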
Model Access
- Protect HuggingFace tokens as secrets
- Use private model repositories for proprietary models
- Implement access logging for audit trails
Network Security
- Use ClusterIP for internal-only access
- Implement Kubernetes NetworkPolicies (see the example below)
- Apply rate limiting at the ingress level if the service is exposed externally
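A minimal NetworkPolicy sketch that only allows AISRV pods in the same namespace to reach the inference service port. The inference pod label matches the selector used elsewhere in this document; the AISRV label is an assumption and must be adjusted to the actual pod labels:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-allow-aisrv
  namespace: basebox
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: inference
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: aisrv   # assumption: adjust to the real AISRV pod labels
      ports:
        - protocol: TCP
          port: 8080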
Resource Isolation
- Use node selectors for GPU nodes
- Implement resource quotas (see the sketch below)
- Monitor GPU memory usage
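A sketch of a namespace-level ResourceQuota that caps the total number of GPUs workloads in basebox can request; the limit value is illustrative:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: basebox
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # total GPUs requestable in the namespace (illustrative)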
Performance Tuning
Shared Memory
Increase shared memory for better GPU performance:
env:
  SHM_SIZE: "32gb"

# Also configure at pod level
podSecurityContext:
  fsGroup: 2000

volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 32Gi

volumeMounts:
  - name: dshm
    mountPath: /dev/shm
Batch Processing
For high throughput:
- Increase MAX_INPUT_TOKENS for longer contexts
- Enable batching in model server configuration
Model Loading Time
Reduce startup time:
- Pre-download models to persistent volumes
- Use HF_HUB_OFFLINE=1 with cached models
- Use faster storage (NVMe SSD)
Monitoring and Metrics
Health Checks
Configure health probes:
livenessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 300  # Allow time for model loading
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 180
  periodSeconds: 10
GPU Monitoring
Monitor GPU metrics:
- GPU utilization
- VRAM usage
- Temperature
- Power consumption
Use NVIDIA DCGM for Prometheus integration.
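If the Prometheus Operator is installed and the GPU Operator's DCGM exporter is enabled, alerts can be defined directly on the standard DCGM metrics. A sketch with an illustrative threshold; the rule may also need labels matching your Prometheus ruleSelector:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inference-gpu-alerts
  namespace: basebox
spec:
  groups:
    - name: gpu
      rules:
        - alert: GpuMemoryNearlyFull
          # DCGM_FI_DEV_FB_USED / FB_FREE are framebuffer memory metrics from the DCGM exporter
          expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "GPU framebuffer memory above 95% for 10 minutes"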
Application Metrics
Monitor inference performance:
- Request latency
- Tokens per second
- Queue depth
- Error rates
Model Download and Caching
Pre-downloading Models
Create a PVC and download models before deployment:
# model-download-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
  namespace: basebox
spec:
  template:
    spec:
      containers:
        - name: downloader
          image: gitea.basebox.health/basebox-distribution/inference:latest
          env:
            - name: HF_TOKEN
              value: "<your-token>"
            - name: HF_HOME
              value: "/models"
          command:
            - python
            - -c
            - |
              from transformers import AutoModelForCausalLM, AutoTokenizer
              model_id = "meta-llama/Llama-3.1-8B-Instruct"
              AutoModelForCausalLM.from_pretrained(model_id)
              AutoTokenizer.from_pretrained(model_id)
          volumeMounts:
            - name: model-cache
              mountPath: /models
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: inference-models
      restartPolicy: OnFailure
Offline Deployment
After pre-downloading:
env:
  HF_HUB_OFFLINE: "1"
  HF_HOME: "/models"
  TRANSFORMERS_CACHE: "/models/transformers"

volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: inference-models

volumeMounts:
  - name: model-cache
    mountPath: /models
Cost Optimization
GPU Selection
- T4/L4: Cost-effective for smaller models (7B-13B)
- A10/L40: Balanced performance/cost for medium models
- A100: Required for large models; the most expensive option