Inference

Overview

The Inference Server provides Large Language Model (LLM) inference capabilities for the basebox AI platform. It hosts and serves LLM models for text generation, chat completions, and other AI-powered features. The service typically requires GPU acceleration for optimal performance.

Deployment

The Inference Server is deployed via Helm chart to Kubernetes clusters with GPU support.

Helm Chart Configuration

Basic Settings

| Parameter | Default | Description |
| --- | --- | --- |
| replicaCount | 1 | Number of inference server pod replicas |
| image.repository | gitea.basebox.health/basebox-distribution/inference | Container image repository |
| image.pullPolicy | IfNotPresent | Image pull policy |
| image.tag | latest | Image tag to deploy |
| fullnameOverride | inference | Override the full name of the deployment |

Service Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| service.type | ClusterIP | Kubernetes service type |
| service.port | 8080 | Service port |

Resource Management

| Parameter | Default | Description |
| --- | --- | --- |
| resources | {} | CPU/memory/GPU resource requests and limits |
| autoscaling.enabled | false | Enable horizontal pod autoscaling |
| autoscaling.minReplicas | 1 | Minimum number of replicas |
| autoscaling.maxReplicas | 100 | Maximum number of replicas |
| autoscaling.targetCPUUtilizationPercentage | 80 | Target CPU utilization for scaling |

Health Checks

| Parameter | Default | Description |
| --- | --- | --- |
| livenessProbe | {} | Liveness probe configuration (disabled by default) |
| readinessProbe | {} | Readiness probe configuration (disabled by default) |

Environment Variables

Core Configuration

| Variable | Default | Description |
| --- | --- | --- |
| API_KEY | Required | API key for authentication |
| HOSTNAME | 0.0.0.0 | Server hostname to bind to |
| PORT | 8080 | Server port |

Model Configuration

| Variable | Default | Description |
| --- | --- | --- |
| MODEL_ID | meta-llama/Llama-3.1-8B-Instruct | HuggingFace model identifier |
| MAX_INPUT_TOKENS | 16000 | Maximum input token length |
| HF_TOKEN | Required | HuggingFace API token for model downloads |

GPU Configuration

| Variable | Default | Description |
| --- | --- | --- |
| NUM_GPUS | 1 | Number of GPUs to use |
| SHM_SIZE | 16gb | Shared memory size for GPU operations |

Cache Configuration

| Variable | Description |
| --- | --- |
| NUMBA_CACHE_DIR | Numba JIT compilation cache directory |
| TRITON_CACHE_DIR | Triton kernel cache directory |
| HF_HOME | HuggingFace home directory (optional) |
| TRANSFORMERS_CACHE | Transformers model cache directory (optional) |
| HF_HUB_OFFLINE | Enable offline mode (use cached models only) |
| HF_HUB_ENABLE_HF_TRANSFER | Enable faster HuggingFace transfers |

Storage Configuration

Model Cache Volumes

For persistent storage of downloaded models and caches:

volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: inference-models
  - name: cache-dir
    persistentVolumeClaim:
      claimName: inference-cache

volumeMounts:
  - name: model-cache
    mountPath: /data/.cache/huggingface
  - name: cache-dir
    mountPath: /data
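
The claims referenced above (inference-models and inference-cache) are not created by these settings. If the cluster does not already provide them, they can be declared along the lines of the following sketch; the size and storage class values are illustrative and should match your provisioner:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: inference-models
  namespace: basebox
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd      # illustrative; use a class offered by your cluster
  resources:
    requests:
      storage: 200Gi              # sized for the model weights you plan to cache
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: inference-cache
  namespace: basebox
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 50Gi               # Numba/Triton compilation caches are comparatively small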

Configuration Examples

Production Configuration

# values-production.yaml
replicaCount: 2

image:
  repository: gitea.basebox.health/basebox-distribution/inference
  tag: "v1.2.3"
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 8080

resources:
  requests:
    cpu: 4000m
    memory: 32Gi
    nvidia.com/gpu: 2
  limits:
    cpu: 8000m
    memory: 64Gi
    nvidia.com/gpu: 2

livenessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 300
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 180
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 5

volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: inference-models
  - name: cache-dir
    emptyDir:
      sizeLimit: 50Gi

volumeMounts:
  - name: model-cache
    mountPath: /data/.cache/huggingface
  - name: cache-dir
    mountPath: /data

nodeSelector:
  nvidia.com/gpu: "true"
  gpu-type: "a100"

env:
  API_KEY: "<generate-secure-api-key>"
  HOSTNAME: "0.0.0.0"
  PORT: "8080"

  # Model configuration
  MODEL_ID: "meta-llama/Llama-3.1-70B-Instruct"
  MAX_INPUT_TOKENS: "32000"
  HF_TOKEN: "<your-huggingface-token>"

  # GPU configuration
  NUM_GPUS: "2"
  SHM_SIZE: "32gb"

  # Cache configuration
  NUMBA_CACHE_DIR: "/data/numba_cache"
  TRITON_CACHE_DIR: "/data/triton_cache"
  HF_HOME: "/data/.cache/huggingface"
  TRANSFORMERS_CACHE: "/data/.cache/huggingface/transformers"

extraObjects:
  inference-models:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: inference-models
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 200Gi

Offline Mode Configuration

For environments without internet access (pre-downloaded models):

# values-offline.yaml
replicaCount: 1

resources:
  requests:
    cpu: 4000m
    memory: 32Gi
    nvidia.com/gpu: 1
  limits:
    cpu: 8000m
    memory: 64Gi
    nvidia.com/gpu: 1

volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: inference-models-preloaded

volumeMounts:
  - name: model-cache
    mountPath: /data/.cache/huggingface

env:
  API_KEY: "<secure-api-key>"
  MODEL_ID: "meta-llama/Llama-3.1-8B-Instruct"
  HF_TOKEN: "<huggingface-token>"
  NUM_GPUS: "1"

  # Offline mode
  HF_HUB_OFFLINE: "1"
  HF_HOME: "/data/.cache/huggingface"
  TRANSFORMERS_CACHE: "/data/.cache/huggingface/transformers"
  NUMBA_CACHE_DIR: "/data/numba_cache"
  TRITON_CACHE_DIR: "/data/triton_cache"

Small Model Configuration

For resource-constrained environments:

# values-small.yaml
replicaCount: 1

resources:
  requests:
    cpu: 2000m
    memory: 16Gi
    nvidia.com/gpu: 1
  limits:
    cpu: 4000m
    memory: 32Gi
    nvidia.com/gpu: 1

env:
  API_KEY: "<secure-api-key>"
  MODEL_ID: "meta-llama/Llama-3.2-3B-Instruct"
  MAX_INPUT_TOKENS: "8000"
  HF_TOKEN: "<huggingface-token>"
  NUM_GPUS: "1"
  SHM_SIZE: "8gb"

Multi-GPU Configuration

For large models requiring multiple GPUs:

# values-multi-gpu.yaml
replicaCount: 1

resources:
  requests:
    cpu: 8000m
    memory: 128Gi
    nvidia.com/gpu: 4
  limits:
    cpu: 16000m
    memory: 256Gi
    nvidia.com/gpu: 4

nodeSelector:
  nvidia.com/gpu: "true"
  gpu-type: "a100"

tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

env:
  API_KEY: "<secure-api-key>"
  MODEL_ID: "meta-llama/Llama-3.1-405B-Instruct"
  MAX_INPUT_TOKENS: "64000"
  HF_TOKEN: "<huggingface-token>"
  NUM_GPUS: "4"
  SHM_SIZE: "64gb"
  HOSTNAME: "0.0.0.0"
  PORT: "8080"

Installation

Prerequisites

  • Kubernetes cluster (1.23+)
  • Helm 3.x
  • GPU nodes with NVIDIA GPU Operator
  • Storage provisioner with dynamic provisioning
  • HuggingFace account with API token
  • Sufficient storage for model files (50GB-500GB depending on model)
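
A quick pre-flight check of these prerequisites might look like this (GPU capacity only becomes visible to Kubernetes after the GPU Operator is installed in the next step):

# Client, cluster, and Helm versions
kubectl version
helm version

# A storage class with dynamic provisioning must be available
kubectl get storageclass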

Install NVIDIA GPU Operator

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --wait
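
Before installing the inference server, confirm that the operator pods are healthy and that the nodes now advertise GPUs as allocatable resources:

# All GPU Operator components should reach Running/Completed
kubectl get pods -n gpu-operator

# GPU nodes should now report nvidia.com/gpu under Allocatable
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'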

Pre-download Models (Optional)

For faster startup and offline deployments. Note that this quick job does not set HF_TOKEN or mount a persistent volume, so it is mainly useful for a quick test; a full Job manifest that sets the token and persists models to a PVC is shown under Model Download and Caching below.

# Create a job to download models
kubectl create job model-download \
  --image=gitea.basebox.health/basebox-distribution/inference:latest \
  --namespace basebox \
  -- python -c "from transformers import AutoModel; AutoModel.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')"

Install Inference Server

# Install with custom values
helm install inference oci://hub.basebox.ai/helm/inference \
  --values values-production.yaml \
  --namespace basebox \
  --create-namespace

# Verify installation
kubectl get pods -n basebox -l app.kubernetes.io/name=inference

Upgrade

helm upgrade inference oci://hub.basebox.ai/helm/inference \
  --values values-production.yaml \
  --namespace basebox

Uninstall

helm uninstall inference --namespace basebox

# Delete PVCs if needed
kubectl delete pvc -n basebox inference-models

Verification

Check Deployment

# Check pods
kubectl get pods -n basebox -l app.kubernetes.io/name=inference

# View logs
kubectl logs -n basebox -l app.kubernetes.io/name=inference --tail=100 -f

# Check GPU allocation
kubectl describe pod -n basebox -l app.kubernetes.io/name=inference | grep -A5 "nvidia.com/gpu"

Test Inference Endpoint

# Port forward
kubectl port-forward -n basebox svc/inference 8080:8080

# Test API (example with curl)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-api-key" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
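
Assuming the server implements the OpenAI-compatible chat completion schema implied by the /v1/chat/completions path, a successful reply looks roughly like the following (all field values are illustrative):

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Hello! How can I help you today?" },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 11, "completion_tokens": 9, "total_tokens": 20 }
}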

Monitor Model Loading

Model loading can take several minutes on first startup:

# Watch logs for model download progress
kubectl logs -n basebox -l app.kubernetes.io/name=inference -f | grep -i "download\|loading"

GPU Configuration

GPU Requirements

  • NVIDIA GPU with CUDA support (Compute Capability 7.0+)
  • Sufficient VRAM for the model:
      • 7B models: 16GB+ VRAM
      • 13B models: 24GB+ VRAM
      • 70B models: 80GB+ VRAM (A100)
      • 405B models: 320GB+ VRAM (4x A100)

GPU Resource Allocation

resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1

# For multi-GPU
resources:
  requests:
    nvidia.com/gpu: 4
  limits:
    nvidia.com/gpu: 4

Verify GPU Access

# Check GPU inside pod
kubectl exec -n basebox <pod-name> -- nvidia-smi

# Check GPU utilization
kubectl exec -n basebox <pod-name> -- nvidia-smi dmon

Model Selection

Supported Models

Model Size Guidelines

| Model Size | VRAM Required | GPU Type | Use Case |
| --- | --- | --- | --- |
| 1B-3B | 8-16GB | T4, L4 | Small deployments, testing |
| 7B-13B | 16-24GB | A10, L40 | General purpose |
| 70B | 80GB | A100 | High quality responses |
| 405B | 320GB | 4x A100 | Maximum quality |

Integration with Other Services

AISRV Integration

AISRV connects to the inference server for LLM capabilities:

# AISRV configuration
env:
  AISRV_LLM_URL: "http://inference:8080"
  AISRV_LLM_CHAT_ENDPOINT: "/v1/chat/completions"
  AISRV_LLM_API_KEY: "<must-match-inference-api-key>"
  AISRV_LLM_MODEL: "meta-llama/Llama-3.1-8B-Instruct"

Security Considerations

API Key Management

  • Generate cryptographically secure API keys
  • Store API keys in Kubernetes secrets (see the sketch after this list)
  • Rotate API keys regularly
  • Use different keys per environment
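
The values files in this document pass API_KEY and HF_TOKEN as plain chart values. A common hardening step is to keep them in a Kubernetes Secret instead; the commands below create such a Secret, and the env wiring is a raw pod-spec style sketch that only applies if the chart (or a post-render patch) supports referencing an existing Secret. Names and keys are illustrative:

# Create one Secret per environment with freshly generated credentials
kubectl create secret generic inference-secrets \
  --namespace basebox \
  --from-literal=API_KEY="$(openssl rand -hex 32)" \
  --from-literal=HF_TOKEN="<your-huggingface-token>"

# Pod-spec style wiring (only if the chart or a patch supports it)
env:
  - name: API_KEY
    valueFrom:
      secretKeyRef:
        name: inference-secrets
        key: API_KEY
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: inference-secrets
        key: HF_TOKEN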

Model Access

  • Protect HuggingFace tokens as secrets
  • Use private model repositories for proprietary models
  • Implement access logging for audit trails

Network Security

  • Use ClusterIP for internal-only access
  • Implement Network Policies (see the sketch after this list)
  • Rate limiting at ingress level (if exposed externally)
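
A minimal NetworkPolicy sketch that only admits traffic from AISRV pods on the service port; the aisrv pod label is an assumption and must match your actual deployment labels:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-allow-aisrv
  namespace: basebox
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: inference
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: aisrv   # assumed label of the AISRV pods
      ports:
        - protocol: TCP
          port: 8080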

Resource Isolation

  • Use node selectors for GPU nodes
  • Implement resource quotas (a ResourceQuota sketch follows this list)
  • Monitor GPU memory usage
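
GPU consumption in the namespace can be capped with a ResourceQuota on the extended GPU resource; the limit below is an arbitrary example:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: inference-gpu-quota
  namespace: basebox
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # total GPUs that pods in this namespace may request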

Performance Tuning

Shared Memory

Increase shared memory for better GPU performance:

env:
  SHM_SIZE: "32gb"

# Also configure at pod level
podSecurityContext:
  fsGroup: 2000

volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 32Gi

volumeMounts:
  - name: dshm
    mountPath: /dev/shm

Batch Processing

For high throughput:

  • Increase MAX_INPUT_TOKENS for longer contexts
  • Enable batching in the model server configuration

Model Loading Time

Reduce startup time:

  • Pre-download models to persistent volumes
  • Use HF_HUB_OFFLINE=1 with cached models
  • Use faster storage (NVMe SSD)

Monitoring and Metrics

Health Checks

Configure health probes:

livenessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 300  # Allow time for model loading
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 180
  periodSeconds: 10

GPU Monitoring

Monitor GPU metrics:

  • GPU utilization
  • VRAM usage
  • Temperature
  • Power consumption

Use NVIDIA DCGM for Prometheus integration.
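
The GPU Operator deploys dcgm-exporter by default; assuming that component is enabled, its Prometheus metrics can be spot-checked before wiring up scraping (the service name and port below follow the default operator install):

# Port-forward the exporter
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400

# In a second terminal: per-GPU utilization and framebuffer usage
curl -s http://localhost:9400/metrics | grep -E "DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_FB_USED"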

Application Metrics

Monitor inference performance:

  • Request latency
  • Tokens per second
  • Queue depth
  • Error rates

Model Download and Caching

Pre-downloading Models

Create a PVC and download models before deployment:

# model-download-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
  namespace: basebox
spec:
  template:
    spec:
      containers:
      - name: downloader
        image: gitea.basebox.health/basebox-distribution/inference:latest
        env:
        - name: HF_TOKEN
          value: "<your-token>"
        - name: HF_HOME
          value: "/models"
        command:
        - python
        - -c
        - |
          from transformers import AutoModelForCausalLM, AutoTokenizer
          model_id = "meta-llama/Llama-3.1-8B-Instruct"
          AutoModelForCausalLM.from_pretrained(model_id)
          AutoTokenizer.from_pretrained(model_id)
        volumeMounts:
        - name: model-cache
          mountPath: /models
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: inference-models
      restartPolicy: OnFailure
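
Apply the Job and wait for it to complete before deploying the inference server against the same claim:

kubectl apply -f model-download-job.yaml

# Follow the download and wait for completion
kubectl logs -n basebox job/model-download -f
kubectl wait -n basebox --for=condition=complete job/model-download --timeout=2h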

Offline Deployment

After pre-downloading:

env:
  HF_HUB_OFFLINE: "1"
  HF_HOME: "/models"
  TRANSFORMERS_CACHE: "/models/transformers"

volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: inference-models

volumeMounts:
  - name: model-cache
    mountPath: /models

Cost Optimization

GPU Selection

  • T4/L4: Cost-effective for smaller models (7B-13B)
  • A10/L40: Balanced performance/cost for medium models
  • A100: Required for large models, expensive