Inference

Overview

The Inference Server provides Large Language Model (LLM) inference capabilities for the basebox AI platform. It hosts and serves LLM models for text generation, chat completions, and other AI-powered features. The service typically requires GPU acceleration for optimal performance.

Deployment

The Inference Server is deployed via Helm chart to Kubernetes clusters with GPU support.

Helm Chart Configuration

Basic Settings

| Parameter | Default | Description |
| --- | --- | --- |
| replicaCount | 1 | Number of inference server pod replicas |
| image.repository | gitea.basebox.health/basebox-distribution/inference | Container image repository |
| image.pullPolicy | IfNotPresent | Image pull policy |
| image.tag | latest | Image tag to deploy |
| fullnameOverride | inference | Override the full name of the deployment |

Service Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| service.type | ClusterIP | Kubernetes service type |
| service.port | 8080 | Service port |

Resource Management

| Parameter | Default | Description |
| --- | --- | --- |
| resources | {} | CPU/memory/GPU resource requests and limits |
| autoscaling.enabled | false | Enable horizontal pod autoscaling |
| autoscaling.minReplicas | 1 | Minimum number of replicas |
| autoscaling.maxReplicas | 100 | Maximum number of replicas |
| autoscaling.targetCPUUtilizationPercentage | 80 | Target CPU utilization for scaling |

Health Checks

| Parameter | Default | Description |
| --- | --- | --- |
| livenessProbe | {} | Liveness probe configuration (disabled by default) |
| readinessProbe | {} | Readiness probe configuration (disabled by default) |

Environment Variables

Core Configuration

| Variable | Default | Description |
| --- | --- | --- |
| API_KEY | Required | API key for authentication |
| HOSTNAME | 0.0.0.0 | Server hostname to bind to |
| PORT | 8080 | Server port |

Model Configuration

| Variable | Default | Description |
| --- | --- | --- |
| MODEL_ID | meta-llama/Llama-3.1-8B-Instruct | HuggingFace model identifier |
| MAX_INPUT_TOKENS | 16000 | Maximum input token length |
| HF_TOKEN | Required | HuggingFace API token for model downloads |

GPU Configuration

| Variable | Default | Description |
| --- | --- | --- |
| NUM_GPUS | 1 | Number of GPUs to use |
| SHM_SIZE | 16gb | Shared memory size for GPU operations |

Cache Configuration

| Variable | Description |
| --- | --- |
| NUMBA_CACHE_DIR | Numba JIT compilation cache directory |
| TRITON_CACHE_DIR | Triton kernel cache directory |
| HF_HOME | HuggingFace home directory (optional) |
| TRANSFORMERS_CACHE | Transformers model cache directory (optional) |
| HF_HUB_OFFLINE | Enable offline mode (use cached models only) |
| HF_HUB_ENABLE_HF_TRANSFER | Enable faster HuggingFace transfers |

Storage Configuration

Model Cache Volumes

For persistent storage of downloaded models and caches:

volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: inference-models
  - name: cache-dir
    persistentVolumeClaim:
      claimName: inference-cache

volumeMounts:
  - name: model-cache
    mountPath: /data/.cache/huggingface
  - name: cache-dir
    mountPath: /data
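
The claims referenced above (inference-models and inference-cache) are not created by these settings. If the cluster does not already provide them, they can be declared along the lines of the following sketch; the size and storage class values are illustrative and should match your provisioner:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: inference-models
  namespace: basebox
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd      # illustrative; use a class offered by your cluster
  resources:
    requests:
      storage: 200Gi              # sized for the model weights you plan to cache
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: inference-cache
  namespace: basebox
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 50Gi               # Numba/Triton compilation caches are comparatively small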

Configuration Examples

Production Configuration

# values-production.yaml
replicaCount: 2

image:
  repository: gitea.basebox.health/basebox-distribution/inference
  tag: "v1.2.3"
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 8080

resources:
  requests:
    cpu: 4000m
    memory: 32Gi
    nvidia.com/gpu: 2
  limits:
    cpu: 8000m
    memory: 64Gi
    nvidia.com/gpu: 2

livenessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 300
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 180
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 5

volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: inference-models
  - name: cache-dir
    emptyDir:
      sizeLimit: 50Gi

volumeMounts:
  - name: model-cache
    mountPath: /data/.cache/huggingface
  - name: cache-dir
    mountPath: /data

nodeSelector:
  nvidia.com/gpu: "true"
  gpu-type: "a100"

env:
  API_KEY: "<generate-secure-api-key>"
  HOSTNAME: "0.0.0.0"
  PORT: "8080"

  # Model configuration
  MODEL_ID: "meta-llama/Llama-3.1-70B-Instruct"
  MAX_INPUT_TOKENS: "32000"
  HF_TOKEN: "<your-huggingface-token>"

  # GPU configuration
  NUM_GPUS: "2"
  SHM_SIZE: "32gb"

  # Cache configuration
  NUMBA_CACHE_DIR: "/data/numba_cache"
  TRITON_CACHE_DIR: "/data/triton_cache"
  HF_HOME: "/data/.cache/huggingface"
  TRANSFORMERS_CACHE: "/data/.cache/huggingface/transformers"

extraObjects:
  inference-models:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: inference-models
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 200Gi

Offline Mode Configuration

For environments without internet access (pre-downloaded models):

# values-offline.yaml
replicaCount: 1

resources:
  requests:
    cpu: 4000m
    memory: 32Gi
    nvidia.com/gpu: 1
  limits:
    cpu: 8000m
    memory: 64Gi
    nvidia.com/gpu: 1

volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: inference-models-preloaded

volumeMounts:
  - name: model-cache
    mountPath: /data/.cache/huggingface

env:
  API_KEY: "<secure-api-key>"
  MODEL_ID: "meta-llama/Llama-3.1-8B-Instruct"
  HF_TOKEN: "<huggingface-token>"
  NUM_GPUS: "1"

  # Offline mode
  HF_HUB_OFFLINE: "1"
  HF_HOME: "/data/.cache/huggingface"
  TRANSFORMERS_CACHE: "/data/.cache/huggingface/transformers"
  NUMBA_CACHE_DIR: "/data/numba_cache"
  TRITON_CACHE_DIR: "/data/triton_cache"

Small Model Configuration

For resource-constrained environments:

# values-small.yaml
replicaCount: 1

resources:
  requests:
    cpu: 2000m
    memory: 16Gi
    nvidia.com/gpu: 1
  limits:
    cpu: 4000m
    memory: 32Gi
    nvidia.com/gpu: 1

env:
  API_KEY: "<secure-api-key>"
  MODEL_ID: "meta-llama/Llama-3.2-3B-Instruct"
  MAX_INPUT_TOKENS: "8000"
  HF_TOKEN: "<huggingface-token>"
  NUM_GPUS: "1"
  SHM_SIZE: "8gb"

Multi-GPU Configuration

For large models requiring multiple GPUs:

# values-multi-gpu.yaml
replicaCount: 1

resources:
  requests:
    cpu: 8000m
    memory: 128Gi
    nvidia.com/gpu: 4
  limits:
    cpu: 16000m
    memory: 256Gi
    nvidia.com/gpu: 4

nodeSelector:
  nvidia.com/gpu: "true"
  gpu-type: "a100"

tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

env:
  API_KEY: "<secure-api-key>"
  MODEL_ID: "meta-llama/Llama-3.1-405B-Instruct"
  MAX_INPUT_TOKENS: "64000"
  HF_TOKEN: "<huggingface-token>"
  NUM_GPUS: "4"
  SHM_SIZE: "64gb"
  HOSTNAME: "0.0.0.0"
  PORT: "8080"

Installation

Prerequisites

  • Kubernetes cluster (1.23+)
  • Helm 3.x
  • GPU nodes with NVIDIA GPU Operator
  • Storage provisioner with dynamic provisioning
  • HuggingFace account with API token
  • Sufficient storage for model files (50GB-500GB depending on model)
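
A quick pre-flight check of these prerequisites might look like this (GPU capacity only becomes visible to Kubernetes after the GPU Operator is installed in the next step):

# Client, cluster, and Helm versions
kubectl version
helm version

# A storage class with dynamic provisioning must be available
kubectl get storageclass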

Install NVIDIA GPU Operator

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --wait
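
Before installing the inference server, confirm that the operator pods are healthy and that the nodes now advertise GPUs as allocatable resources:

# All GPU Operator components should reach Running/Completed
kubectl get pods -n gpu-operator

# GPU nodes should now report nvidia.com/gpu under Allocatable
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'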

Pre-download Models (Optional)

For faster startup and offline deployments. Note that this quick job does not set HF_TOKEN or mount a persistent volume, so it is mainly useful for a quick test; a full Job manifest that sets the token and persists models to a PVC is shown under Model Download and Caching below.

# Create a job to download models
kubectl create job model-download \
  --image=gitea.basebox.health/basebox-distribution/inference:latest \
  --namespace basebox \
  -- python -c "from transformers import AutoModel; AutoModel.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')"

Install Inference Server

# Install with custom values
helm install inference oci://hub.basebox.ai/helm/inference \
  --values values-production.yaml \
  --namespace basebox \
  --create-namespace

# Verify installation
kubectl get pods -n basebox -l app.kubernetes.io/name=inference

Upgrade

helm upgrade inference oci://hub.basebox.ai/helm/inference \
  --values values-production.yaml \
  --namespace basebox

Uninstall

helm uninstall inference --namespace basebox

# Delete PVCs if needed
kubectl delete pvc -n basebox inference-models

Verification

Check Deployment

# Check pods
kubectl get pods -n basebox -l app.kubernetes.io/name=inference

# View logs
kubectl logs -n basebox -l app.kubernetes.io/name=inference --tail=100 -f

# Check GPU allocation
kubectl describe pod -n basebox -l app.kubernetes.io/name=inference | grep -A5 "nvidia.com/gpu"

Test Inference Endpoint

# Port forward
kubectl port-forward -n basebox svc/inference 8080:8080

# Test API (example with curl)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-api-key" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
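
Assuming the server implements the OpenAI-compatible chat completion schema implied by the /v1/chat/completions path, a successful reply looks roughly like the following (all field values are illustrative):

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Hello! How can I help you today?" },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 11, "completion_tokens": 9, "total_tokens": 20 }
}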

Monitor Model Loading

Model loading can take several minutes on first startup:

# Watch logs for model download progress
kubectl logs -n basebox -l app.kubernetes.io/name=inference -f | grep -i "download\|loading"

GPU Configuration

GPU Requirements

  • NVIDIA GPU with CUDA support (Compute Capability 7.0+)
  • Sufficient VRAM for the model:
      • 7B models: 16GB+ VRAM
      • 13B models: 24GB+ VRAM
      • 70B models: 80GB+ VRAM (A100)
      • 405B models: 320GB+ VRAM (4x A100)

GPU Resource Allocation

resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1

# For multi-GPU
resources:
  requests:
    nvidia.com/gpu: 4
  limits:
    nvidia.com/gpu: 4

Verify GPU Access

# Check GPU inside pod
kubectl exec -n basebox <pod-name> -- nvidia-smi

# Check GPU utilization
kubectl exec -n basebox <pod-name> -- nvidia-smi dmon

Model Selection

Supported Models

Model Size Guidelines

| Model Size | VRAM Required | GPU Type | Use Case |
| --- | --- | --- | --- |
| 1B-3B | 8-16GB | T4, L4 | Small deployments, testing |
| 7B-13B | 16-24GB | A10, L40 | General purpose |
| 70B | 80GB | A100 | High quality responses |
| 405B | 320GB | 4x A100 | Maximum quality |

Integration with Other Services

AISRV Integration

AISRV connects to the inference server for LLM capabilities:

# AISRV configuration
env:
  AISRV_LLM_URL: "http://inference:8080"
  AISRV_LLM_CHAT_ENDPOINT: "/v1/chat/completions"
  AISRV_LLM_API_KEY: "<must-match-inference-api-key>"
  AISRV_LLM_MODEL: "meta-llama/Llama-3.1-8B-Instruct"

Security Considerations

API Key Management

  • Generate cryptographically secure API keys
  • Store API keys in Kubernetes secrets (see the sketch after this list)
  • Rotate API keys regularly
  • Use different keys per environment
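
The values files in this document pass API_KEY and HF_TOKEN as plain chart values. A common hardening step is to keep them in a Kubernetes Secret instead; the commands below create such a Secret, and the env wiring is a raw pod-spec style sketch that only applies if the chart (or a post-render patch) supports referencing an existing Secret. Names and keys are illustrative:

# Create one Secret per environment with freshly generated credentials
kubectl create secret generic inference-secrets \
  --namespace basebox \
  --from-literal=API_KEY="$(openssl rand -hex 32)" \
  --from-literal=HF_TOKEN="<your-huggingface-token>"

# Pod-spec style wiring (only if the chart or a patch supports it)
env:
  - name: API_KEY
    valueFrom:
      secretKeyRef:
        name: inference-secrets
        key: API_KEY
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: inference-secrets
        key: HF_TOKEN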

Model Access

  • Protect HuggingFace tokens as secrets
  • Use private model repositories for proprietary models
  • Implement access logging for audit trails

Network Security

  • Use ClusterIP for internal-only access
  • Implement Network Policies (see the sketch after this list)
  • Rate limiting at ingress level (if exposed externally)
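
A minimal NetworkPolicy sketch that only admits traffic from AISRV pods on the service port; the aisrv pod label is an assumption and must match your actual deployment labels:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-allow-aisrv
  namespace: basebox
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: inference
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: aisrv   # assumed label of the AISRV pods
      ports:
        - protocol: TCP
          port: 8080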

Resource Isolation

  • Use node selectors for GPU nodes
  • Implement resource quotas (a ResourceQuota sketch follows this list)
  • Monitor GPU memory usage
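
GPU consumption in the namespace can be capped with a ResourceQuota on the extended GPU resource; the limit below is an arbitrary example:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: inference-gpu-quota
  namespace: basebox
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # total GPUs that pods in this namespace may request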

Performance Tuning

Shared Memory

Increase shared memory for better GPU performance:

env:
  SHM_SIZE: "32gb"

# Also configure at pod level
podSecurityContext:
  fsGroup: 2000

volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 32Gi

volumeMounts:
  - name: dshm
    mountPath: /dev/shm

Batch Processing

For high throughput:

  • Increase MAX_INPUT_TOKENS for longer contexts
  • Enable batching in the model server configuration

Model Loading Time

Reduce startup time:

  • Pre-download models to persistent volumes
  • Use HF_HUB_OFFLINE=1 with cached models
  • Use faster storage (NVMe SSD)

Monitoring and Metrics

Health Checks

Configure health probes:

livenessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 300  # Allow time for model loading
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 180
  periodSeconds: 10

GPU Monitoring

Monitor GPU metrics:

  • GPU utilization
  • VRAM usage
  • Temperature
  • Power consumption

Use NVIDIA DCGM for Prometheus integration.
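
The GPU Operator deploys dcgm-exporter by default; assuming that component is enabled, its Prometheus metrics can be spot-checked before wiring up scraping (the service name and port below follow the default operator install):

# Port-forward the exporter
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400

# In a second terminal: per-GPU utilization and framebuffer usage
curl -s http://localhost:9400/metrics | grep -E "DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_FB_USED"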

Application Metrics

Monitor inference performance:

  • Request latency
  • Tokens per second
  • Queue depth
  • Error rates

Model Download and Caching

Pre-downloading Models

Create a PVC and download models before deployment:

# model-download-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
  namespace: basebox
spec:
  template:
    spec:
      containers:
      - name: downloader
        image: gitea.basebox.health/basebox-distribution/inference:latest
        env:
        - name: HF_TOKEN
          value: "<your-token>"
        - name: HF_HOME
          value: "/models"
        command:
        - python
        - -c
        - |
          from transformers import AutoModelForCausalLM, AutoTokenizer
          model_id = "meta-llama/Llama-3.1-8B-Instruct"
          AutoModelForCausalLM.from_pretrained(model_id)
          AutoTokenizer.from_pretrained(model_id)
        volumeMounts:
        - name: model-cache
          mountPath: /models
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: inference-models
      restartPolicy: OnFailure
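
Apply the Job and wait for it to complete before deploying the inference server against the same claim:

kubectl apply -f model-download-job.yaml

# Follow the download and wait for completion
kubectl logs -n basebox job/model-download -f
kubectl wait -n basebox --for=condition=complete job/model-download --timeout=2h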

Offline Deployment

After pre-downloading:

env:
  HF_HUB_OFFLINE: "1"
  HF_HOME: "/models"
  TRANSFORMERS_CACHE: "/models/transformers"

volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: inference-models

volumeMounts:
  - name: model-cache
    mountPath: /models

Cost Optimization

GPU Selection

  • T4/L4: Cost-effective for smaller models (7B-13B)
  • A10/L40: Balanced performance/cost for medium models
  • A100: Required for large models, expensive