Licensed to be used in conjunction with basebox, only.
BaseBox - Quick Start Guide
Get your BaseBox up and running in minutes using the OCI Helm registry.
Prerequisites
- Kubernetes cluster (1.24+)
- kubectl configured
- Helm 3.8+ installed (OCI support required)
- Ingress controller (nginx recommended)
- Storage provisioner
- GPU nodes (optional, for inference service)
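Before starting, it can help to confirm the tool versions meet these requirements. The snippet below is a sketch of a portable "version A is at least version B" check using `sort -V`; the commented-out line shows how it might be applied to your local Helm client.

```shell
# Sketch: portable "is version A >= version B" check using sort -V,
# useful for verifying the Helm 3.8+ and Kubernetes 1.24+ prerequisites.
version_ge() {
  # Succeeds when $1 >= $2 in version order
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# With Helm on your PATH you could check the real client version:
#   version_ge "$(helm version --template '{{.Version}}' | tr -d v)" "3.8.0"
version_ge "3.12.1" "3.8.0" && echo "Helm 3.12.1 satisfies the 3.8+ requirement"
```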
Step 1: Install CloudNativePG Operator
```bash
# Add CloudNativePG Helm repository
helm repo add cnpg https://cloudnative-pg.github.io/charts
helm repo update

# Install the operator
helm upgrade --install cnpg \
  --namespace cnpg-system \
  --create-namespace \
  cnpg/cloudnative-pg

# Verify installation
kubectl wait --for=condition=Available --timeout=300s \
  deployment/cnpg-cloudnative-pg \
  -n cnpg-system
```
Step 2: Create Database Clusters
Save this as databases.yaml:
```yaml
---
# Cluster for aisrv
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: aisrv-db
spec:
  instances: 3
  storage:
    size: 10Gi
    storageClass: standard  # Change to your storage class
  bootstrap:
    initdb:
      database: aisrv
      owner: aisrv
      secret:
        name: aisrv-db-secret
      postInitApplicationSQL:
        - CREATE EXTENSION IF NOT EXISTS vector;
---
apiVersion: v1
kind: Secret
metadata:
  name: aisrv-db-secret
type: kubernetes.io/basic-auth
stringData:
  username: aisrv
  password: "530D67F4B07E42F9ABFD2EB3B0E82D13"  # CHANGE IN PRODUCTION
---
# Cluster for idp (Keycloak)
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: idp-db
spec:
  instances: 3
  storage:
    size: 5Gi
    storageClass: standard  # Change to your storage class
  bootstrap:
    initdb:
      database: idp
      owner: idp
      secret:
        name: idp-db-secret
---
apiVersion: v1
kind: Secret
metadata:
  name: idp-db-secret
type: kubernetes.io/basic-auth
stringData:
  username: idp
  password: "064AD2B3AA0C4AC389930F9A52F1DAA7"  # CHANGE IN PRODUCTION
---
# Cluster for storesrv
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: storesrv-db
spec:
  instances: 3
  storage:
    size: 10Gi
    storageClass: standard  # Change to your storage class
  bootstrap:
    initdb:
      database: storesrv
      owner: storesrv
      secret:
        name: storesrv-db-secret
---
apiVersion: v1
kind: Secret
metadata:
  name: storesrv-db-secret
type: kubernetes.io/basic-auth
stringData:
  username: storesrv
  password: "530D67F4B07E42F9ABFD2EB3B0E82D13"  # CHANGE IN PRODUCTION
```
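The placeholder passwords above should be replaced before any real deployment. One way to generate a fresh 32-character hex value for each secret (assuming `openssl` is installed):

```shell
# Generate a random 32-character hex password to replace the
# CHANGE IN PRODUCTION placeholders above (one per database secret).
openssl rand -hex 16
```

Run it once per secret and keep the same value in the matching section of `my-values.yaml`.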
Apply the databases:
```bash
# Create the target namespace if it does not exist yet
kubectl create namespace basebox --dry-run=client -o yaml | kubectl apply -f -

kubectl apply -f databases.yaml -n basebox

# Wait for clusters to be ready
kubectl wait --for=condition=Ready --timeout=600s cluster/aisrv-db -n basebox
kubectl wait --for=condition=Ready --timeout=600s cluster/idp-db -n basebox
kubectl wait --for=condition=Ready --timeout=600s cluster/storesrv-db -n basebox
```
Step 3: Configure Your Values
Create my-values.yaml and customize for your environment:
```yaml
global:
  # Storage configuration - applies to all components
  storageClass: "standard"  # Change to your storage class (gp3, premium-rwo, etc.)
  # Domain configuration
  domain: "basebox.example.com"  # Change to your domain
  # Common ingress configuration
  ingress:
    enabled: true
    className: "nginx"
    tls:
      enabled: true
      secretName: basebox-tls

# Identity Provider (Keycloak)
idp:
  database:
    host: idp-db-rw
    port: 5432
    username: idp
    password: "064AD2B3AA0C4AC389930F9A52F1DAA7"  # CHANGE IN PRODUCTION
    name: idp

# AI Service
aisrv:
  database:
    host: aisrv-db-rw
    port: 5432
    user: aisrv
    password: "530D67F4B07E42F9ABFD2EB3B0E82D13"  # CHANGE IN PRODUCTION
    dbname: aisrv
  # Application-level LLM configuration
  env:
    AISRV_LLM_MODEL: "casperhansen/llama-3.3-70b-instruct-awq"
    AISRV_LLM_PROVIDER: "vLLM"       # or "TGI" for text-generation-inference
    AISRV_LLM_CONTEXT_SIZE: "32000"  # Max context your app will request
    AISRV_LLM_WORD_LIMIT: "24000"    # Max output tokens
    AISRV_LLM_TEMPERATURE: "0.6"     # Default temperature
    AISRV_LLM_TOP_P: "0.9"           # Default top_p sampling

# Store Service
storesrv:
  database:
    host: storesrv-db-rw
    port: 5432
    user: storesrv
    password: "530D67F4B07E42F9ABFD2EB3B0E82D13"  # CHANGE IN PRODUCTION
    dbname: storesrv

# RAG Service
ragsrv:
  enabled: true
  mode: "cpu"  # Set to "gpu" if you have GPU nodes
  # Uncomment if using GPU mode
  # resources:
  #   limits:
  #     nvidia.com/gpu: 1
  #   requests:
  #     nvidia.com/gpu: 1

# RAG Support Service
ragsrv-support:
  enabled: true
  mode: "cpu"  # Set to "gpu" if you have GPU nodes
  # Uncomment if using GPU mode
  # resources:
  #   limits:
  #     nvidia.com/gpu: 1
  #   requests:
  #     nvidia.com/gpu: 1

# Inference Service (vLLM)
inference:
  enabled: true  # Set to false if you don't have GPU nodes
  # Persistent storage for model cache
  persistence:
    enabled: true
    size: 200Gi  # Adjust based on model size (see sizing guide below)
    accessModes:
      - ReadWriteOnce
    storageClassName: ""  # Uses global.storageClass if empty
  # GPU resources (required for vLLM)
  resources:
    limits:
      nvidia.com/gpu: 1  # Adjust based on model requirements
      memory: 32Gi
    requests:
      nvidia.com/gpu: 1
      memory: 16Gi
  # vLLM environment variables
  env:
    # Model configuration
    MODEL_NAME: "casperhansen/llama-3.3-70b-instruct-awq"
    HUGGING_FACE_HUB_TOKEN: ""  # Add token for gated models (e.g., Llama, Mistral)
    # Cache configuration (points to persistent volume)
    HF_HOME: "~/.cache/huggingface/hub"
    HF_HUB_DISABLE_SYMLINKS_WARNING: "1"
    # Server configuration
    VLLM_HOST: "0.0.0.0"
    VLLM_PORT: "8000"
    # Performance settings (must accommodate AISRV settings)
    VLLM_GPU_MEMORY_UTILIZATION: "0.9"
    VLLM_MAX_MODEL_LEN: "32768"  # Must be >= AISRV_LLM_CONTEXT_SIZE
    VLLM_MAX_NUM_SEQS: "128"     # Concurrent sequences
    VLLM_MAX_TOKENS: "24000"     # Default max output tokens
    # Optimization
    VLLM_DTYPE: "auto"           # Auto-detect best dtype
    VLLM_ENFORCE_EAGER: "false"  # Enable CUDA graphs for performance
    # Quantization (if using quantized models)
    # VLLM_QUANTIZATION: "awq"  # Uncomment for AWQ models
    # Miscellaneous
    VLLM_NO_USAGE_STATS: "1"
  # Health check probes (allow time for model download on first start)
  livenessProbe:
    httpGet:
      path: /health
      port: 8000
    initialDelaySeconds: 600  # 10 minutes for initial model download
    periodSeconds: 30
    timeoutSeconds: 10
    failureThreshold: 3
  readinessProbe:
    httpGet:
      path: /health
      port: 8000
    initialDelaySeconds: 600
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3
```
Step 4: Install BaseBox from OCI Registry
Install the Complete Platform (Umbrella Chart)
```bash
# Install latest version
helm upgrade --install basebox \
  oci://gitea.basebox.health/basebox-distribution/helm/basebox.ai \
  --namespace basebox \
  --create-namespace \
  --values my-values.yaml \
  --wait \
  --timeout 15m  # Increased timeout for model download

# Or install a specific version
helm upgrade --install basebox \
  oci://gitea.basebox.health/basebox-distribution/helm/basebox.ai \
  --version 1.2.3 \
  --namespace basebox \
  --create-namespace \
  --values my-values.yaml \
  --wait \
  --timeout 15m
```
Note: First deployment will take longer (5-15 minutes) as the inference service downloads the model. Subsequent deployments will be much faster as the model is cached.
Step 5: Verify Installation
```bash
# Check all pods are running
kubectl get pods -n basebox

# Watch inference pod startup (model download progress)
kubectl logs -f -n basebox deployment/basebox-inference

# Check ingress
kubectl get ingress -n basebox

# Check database clusters
kubectl get clusters -n basebox

# Check PVCs (verify model cache is bound)
kubectl get pvc -n basebox

# Check all resources
kubectl get all -n basebox
```
Expected Pod Status
```
NAME                           READY   STATUS    RESTARTS   AGE
basebox-aisrv-xxxxx            1/1     Running   0          5m
basebox-storesrv-xxxxx         1/1     Running   0          5m
basebox-ragsrv-xxxxx           1/1     Running   0          5m
basebox-ragsrv-support-xxxxx   1/1     Running   0          5m
basebox-inference-xxxxx        1/1     Running   0          10m   # Takes longer on first start
basebox-idp-xxxxx              1/1     Running   0          5m
basebox-frontend-xxxxx         1/1     Running   0          5m
```
Step 6: Access Your Installation
Once all pods are running, access your installation at:
- Frontend: https://basebox.example.com
- GraphQL API: https://basebox.example.com/graphql
- Keycloak Admin: https://basebox.example.com/auth/admin
Default admin credentials (change after first login):
- Username: admin
- Password: Check your Keycloak configuration
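If the admin password is stored in a Kubernetes secret (the secret name below, `basebox-idp-admin`, is a placeholder; list the secrets in the namespace to find the real one), it can be read back with `kubectl`. Secret data is base64-encoded, so the final decoding step is shown here with a literal value:

```shell
# Placeholder secret name; verify with: kubectl get secrets -n basebox
# kubectl get secret basebox-idp-admin -n basebox \
#   -o jsonpath='{.data.password}' | base64 -d

# Secret values come back base64-encoded; decoding works like this:
echo 'cGFzc3dvcmQ=' | base64 -d
```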
Working with OCI Charts
View Chart Information
```bash
# Show chart details
helm show chart oci://gitea.basebox.health/basebox-distribution/helm/basebox.ai

# Show all chart information (values, readme, etc.)
helm show all oci://gitea.basebox.health/basebox-distribution/helm/basebox.ai

# Show default values
helm show values oci://gitea.basebox.health/basebox-distribution/helm/basebox.ai
```
Pull Charts Locally
```bash
# Pull chart to local directory
helm pull oci://gitea.basebox.health/basebox-distribution/helm/basebox.ai

# Pull and extract
helm pull oci://gitea.basebox.health/basebox-distribution/helm/basebox.ai --untar

# Template the chart locally (without installing)
helm template basebox oci://gitea.basebox.health/basebox-distribution/helm/basebox.ai \
  --values my-values.yaml > rendered-manifests.yaml
```
Update Your Installation
```bash
# Upgrade to latest version
helm upgrade basebox \
  oci://gitea.basebox.health/basebox-distribution/helm/basebox.ai \
  --namespace basebox \
  --values my-values.yaml \
  --wait \
  --timeout 15m

# Upgrade to specific version
helm upgrade basebox \
  oci://gitea.basebox.health/basebox-distribution/helm/basebox.ai \
  --version 1.2.4 \
  --namespace basebox \
  --values my-values.yaml \
  --wait \
  --timeout 15m
```
Troubleshooting
Pods not starting
```bash
# Check pod status
kubectl describe pod <pod-name> -n basebox

# Check logs
kubectl logs <pod-name> -n basebox

# Check events
kubectl get events -n basebox --sort-by='.lastTimestamp'
```
Inference Service Issues
Model Download Taking Too Long
```bash
# Check download progress
kubectl logs -f -n basebox deployment/basebox-inference

# Verify PVC is bound
kubectl get pvc -n basebox

# Check available storage
kubectl describe pvc basebox-inference-model-cache -n basebox
```
Out of Memory Errors
```bash
# Check GPU memory usage
kubectl exec -it -n basebox deployment/basebox-inference -- nvidia-smi

# Solution: Reduce VLLM_GPU_MEMORY_UTILIZATION or use a smaller model
```
Model Not Found / Authentication Errors
```bash
# For gated models (Llama, Mistral), verify token is set
kubectl get secret -n basebox

# Add HuggingFace token
helm upgrade basebox \
  oci://gitea.basebox.health/basebox-distribution/helm/basebox.ai \
  --namespace basebox \
  --set inference.env.HUGGING_FACE_HUB_TOKEN="hf_xxxxxxxxxxxxx" \
  --reuse-values
```
vLLM Server Not Responding
```bash
# Check vLLM logs
kubectl logs -n basebox deployment/basebox-inference --tail=100

# Common issues:
# - Model path incorrect: Check MODEL_NAME matches the HuggingFace repo
# - Context size too large: Reduce VLLM_MAX_MODEL_LEN
# - Out of GPU memory: Reduce VLLM_GPU_MEMORY_UTILIZATION or model size
```
Database connection issues
```bash
# Check database cluster status
kubectl get cluster -n basebox
kubectl describe cluster aisrv-db -n basebox

# Check database pods
kubectl get pods -l cnpg.io/cluster=aisrv-db -n basebox

# Test database connectivity
kubectl run -it --rm debug --image=postgres:16 --restart=Never -n basebox -- \
  psql -h aisrv-db-rw -U aisrv -d aisrv
```
Ingress not working
```bash
# Check ingress controller
kubectl get pods -n ingress-nginx

# Check ingress resources
kubectl describe ingress -n basebox

# Check TLS certificate (if using cert-manager)
kubectl get certificate -n basebox
kubectl describe certificate basebox-tls -n basebox
```
Helm installation issues
```bash
# List installed releases
helm list -n basebox

# Get release status
helm status basebox -n basebox

# Get release history
helm history basebox -n basebox

# Rollback if needed
helm rollback basebox 1 -n basebox
```
Configuration Examples
Using a Different Model
To use a different model (e.g., Mistral-7B):
```yaml
aisrv:
  env:
    AISRV_LLM_MODEL: "mistralai/Mistral-7B-Instruct-v0.2"

inference:
  persistence:
    size: 100Gi  # Smaller model, less storage needed
  resources:
    limits:
      nvidia.com/gpu: 1
      memory: 24Gi
    requests:
      nvidia.com/gpu: 1
      memory: 12Gi
  env:
    MODEL_NAME: "mistralai/Mistral-7B-Instruct-v0.2"
    VLLM_MAX_MODEL_LEN: "32768"
    HUGGING_FACE_HUB_TOKEN: "hf_xxxxxxxxxxxxx"  # Required for Mistral
```
Using Quantized Models
For AWQ or GPTQ quantized models:
```yaml
inference:
  env:
    MODEL_NAME: "casperhansen/llama-3.3-70b-instruct-awq"
    VLLM_QUANTIZATION: "awq"  # or "gptq"
    VLLM_MAX_MODEL_LEN: "32768"
```
Multiple GPU Setup
For models requiring multiple GPUs:
```yaml
inference:
  resources:
    limits:
      nvidia.com/gpu: 2  # Use 2 GPUs
      memory: 64Gi
    requests:
      nvidia.com/gpu: 2
      memory: 32Gi
  env:
    VLLM_TENSOR_PARALLEL_SIZE: "2"  # Split model across 2 GPUs
```
Database Connection Endpoints
Your applications connect to these endpoints when using CloudNativePG:
- aisrv database:
  - Read-Write: `aisrv-db-rw:5432`
  - Read-Only: `aisrv-db-ro:5432`
- idp database:
  - Read-Write: `idp-db-rw:5432`
  - Read-Only: `idp-db-ro:5432`
- storesrv database:
  - Read-Write: `storesrv-db-rw:5432`
  - Read-Only: `storesrv-db-ro:5432`
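As an illustration, a libpq-style connection URL for the aisrv read-write endpoint can be assembled from these values (the `<password>` placeholder is deliberately left unfilled):

```shell
# Build a PostgreSQL connection URL from the CNPG service endpoints above
DB_USER=aisrv
DB_HOST=aisrv-db-rw
DB_PORT=5432
DB_NAME=aisrv
echo "postgresql://${DB_USER}:<password>@${DB_HOST}:${DB_PORT}/${DB_NAME}"
```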
Performance Tuning
vLLM Configuration Constraints
Ensure your settings are compatible: AISRV_LLM_CONTEXT_SIZE plus AISRV_LLM_WORD_LIMIT must fit within VLLM_MAX_MODEL_LEN.
Example:
- VLLM_MAX_MODEL_LEN: "32768"
- AISRV_LLM_CONTEXT_SIZE: "20000"
- AISRV_LLM_WORD_LIMIT: "12000"
- Total: 32000 tokens ✓ (fits within 32768)
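The arithmetic above can be checked with a quick shell sketch (values mirror the example; substitute your own before deploying):

```shell
# Sanity check: context + output budget must fit the vLLM max model length
CONTEXT=20000        # AISRV_LLM_CONTEXT_SIZE
WORD_LIMIT=12000     # AISRV_LLM_WORD_LIMIT
MAX_MODEL_LEN=32768  # VLLM_MAX_MODEL_LEN
TOTAL=$((CONTEXT + WORD_LIMIT))
if [ "$TOTAL" -le "$MAX_MODEL_LEN" ]; then
  echo "OK: $TOTAL tokens fit within $MAX_MODEL_LEN"
else
  echo "Incompatible: $TOTAL tokens exceed $MAX_MODEL_LEN" >&2
fi
```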
GPU Memory Optimization
If experiencing OOM errors:
- Reduce VLLM_GPU_MEMORY_UTILIZATION from 0.9 to 0.8
- Reduce VLLM_MAX_MODEL_LEN to match your actual needs
- Reduce VLLM_MAX_NUM_SEQS to limit concurrent requests
- Use quantized models (AWQ, GPTQ)
Storage Class Selection
Choose an appropriate storage class for your cluster and/or provider: