diff --git a/docs/model-serving/generative-inference/llmisvc/llmisvc-configuration.md b/docs/model-serving/generative-inference/llmisvc/llmisvc-configuration.md
index 49e1a64b2..e7e47c5ab 100644
--- a/docs/model-serving/generative-inference/llmisvc/llmisvc-configuration.md
+++ b/docs/model-serving/generative-inference/llmisvc/llmisvc-configuration.md
@@ -169,6 +169,44 @@ spec:
 
 ---
 
+## LoRA Adapter Configuration
+
+LLMInferenceService supports Low-Rank Adaptation (LoRA) adapters for task-specific model fine-tuning. LoRA allows you to serve multiple adapted versions of a base model efficiently, reducing storage and memory requirements while enabling multi-tenant deployments.
+
+### Quick Example
+
+```yaml
+spec:
+  model:
+    uri: hf://Qwen/Qwen2.5-7B-Instruct
+    name: Qwen/Qwen2.5-7B-Instruct
+    lora:
+      adapters:
+        - name: sql-adapter
+          uri: hf://my-org/qwen-sql-lora
+        - name: code-adapter
+          uri: s3://my-bucket/adapters/code-lora
+        - name: domain-adapter
+          uri: pvc://adapter-pvc/domain-lora
+```
+
+### Supported URI Schemes
+
+- **`hf://`** - HuggingFace Hub adapters
+- **`s3://`** - S3-compatible storage (AWS S3, MinIO, Ceph)
+- **`pvc://`** - PersistentVolumeClaim (pre-downloaded, air-gapped)
+
+### Key Benefits
+
+- **Storage Efficiency**: 50-500MB per adapter vs 10-100GB for full models
+- **Multi-Tenancy**: Multiple task-specific models from a single deployment
+- **Dynamic Switching**: Per-request adapter selection with ~1-5ms overhead
+- **Automatic Integration**: Controller handles downloads, mounts, and vLLM configuration
+
+For detailed configuration, examples, and troubleshooting, see the **[LoRA Adapters Guide](./lora-adapters.md)**.
+
+---
+
 ## Workload Specification
 
 ### Workload Types Overview
diff --git a/docs/model-serving/generative-inference/llmisvc/llmisvc-overview.md b/docs/model-serving/generative-inference/llmisvc/llmisvc-overview.md
index 3a1ee0c20..1dcba9a34 100644
--- a/docs/model-serving/generative-inference/llmisvc/llmisvc-overview.md
+++ b/docs/model-serving/generative-inference/llmisvc/llmisvc-overview.md
@@ -91,9 +91,9 @@ Defines the LLM model source, name, and characteristics:
 
 - Model URI (HuggingFace, S3, PVC)
 - Model name for API requests
 - Scheduling criticality
-- LoRA adapters (optional)
+- **LoRA adapters** (optional) - Attach task-specific Low-Rank Adaptation modules for efficient multi-tenant serving
 
-**Learn more**: [Configuration Guide](./llmisvc-configuration.md#model-specification)
+**Learn more**: [Configuration Guide](./llmisvc-configuration.md#model-specification) | [LoRA Adapters](./lora-adapters.md)
 
 ---
diff --git a/docs/model-serving/generative-inference/llmisvc/lora-adapters.md b/docs/model-serving/generative-inference/llmisvc/lora-adapters.md
new file mode 100644
index 000000000..fc96a6fd7
--- /dev/null
+++ b/docs/model-serving/generative-inference/llmisvc/lora-adapters.md
@@ -0,0 +1,493 @@
---
sidebar_label: "LoRA Adapters"
sidebar_position: 2
title: "LoRA Adapters for LLMInferenceService"
---

# LoRA Adapters for LLMInferenceService

## Overview

**LoRA (Low-Rank Adaptation)** is a parameter-efficient fine-tuning technique that allows you to adapt large language models to specific tasks without modifying the base model weights. LLMInferenceService provides native support for serving multiple LoRA adapters alongside a base model, enabling efficient multi-tenant deployments and task-specific model specialization.

### Why Use LoRA Adapters?

- **Storage Efficiency**: Share a single base model across multiple task-specific adaptations (typically 50-500MB per adapter vs 10-100GB for full models)
- **Multi-Tenancy**: Serve multiple specialized versions of the same model from a single deployment
- **Fast Iteration**: Update task-specific adapters without redeploying the base model
- **Cost Optimization**: Reduce GPU memory and storage costs compared to deploying multiple full models

:::tip
LoRA adapters are loaded at service startup, and vLLM can switch between them dynamically per request with minimal overhead (~1-5ms).
:::

---

## Prerequisites

Before configuring LoRA adapters, ensure:

- **vLLM Runtime**: LoRA support requires vLLM (the default runtime for LLMInferenceService)
- **Storage Initializer**: Enabled for `hf://` and `s3://` adapters (enabled by default)
- **Base Model Compatibility**: Adapters must be trained against the same base model architecture they are served with
- **Kubernetes Resources**: Sufficient GPU memory to load the base model plus all adapters

> **Note**: Each adapter typically requires 50-500MB of GPU memory depending on rank and model size.

---

## Configuration

### Basic LoRA Configuration

Add LoRA adapters to your LLMInferenceService using the `spec.model.lora.adapters` field:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: my-llm-service
spec:
  model:
    uri: hf://Qwen/Qwen2.5-7B-Instruct
    name: Qwen/Qwen2.5-7B-Instruct
    lora:
      adapters:
        - name: sql-adapter
          uri: hf://my-org/qwen-sql-lora
        - name: code-adapter
          uri: s3://my-bucket/adapters/code-lora
```

### Field Reference

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `spec.model.lora.adapters` | array | No | List of LoRA adapters to attach to the base model |
| `spec.model.lora.adapters[].name` | string | Yes | Unique adapter name used for inference requests |
| `spec.model.lora.adapters[].uri` | string | Yes | Adapter source URI (must use `hf://`, `s3://`, or `pvc://` scheme) |
| `spec.model.lora.maxRank` | integer | No | Maximum LoRA rank supported by the runtime (maps to vLLM `--max-lora-rank`). If not set, vLLM's default (16) applies. |
| `spec.model.lora.maxAdapters` | integer | No | Maximum number of LoRA adapters held in GPU memory simultaneously (maps to vLLM `--max-loras`). If not set, vLLM's default (1) applies. |
| `spec.model.lora.maxCpuAdapters` | integer | No | Maximum number of LoRA adapters cached in CPU memory (maps to vLLM `--max-cpu-loras`). If not set, vLLM defaults this to `maxAdapters`. |

### Constraints

- Adapter names must be unique within a service
- Adapter names must differ from the base model name
- Adapter names are case-sensitive
- All adapters are loaded at startup (no dynamic loading of new adapters at runtime)

---

## Supported URI Schemes

### HuggingFace Hub (`hf://`)

Download adapters directly from HuggingFace Hub.

**Format**: `hf://organization/repository` or `hf://organization/repository/subdirectory`

```yaml
lora:
  adapters:
    - name: my-adapter
      uri: hf://edbeeching/opt-125m-lora
```

**Authentication** (for private repositories):

```yaml
template:
  containers:
    - name: storage-initializer
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-secret
              key: token
```

:::note
HuggingFace adapters require the storage-initializer to be enabled (the default behavior).
:::
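
Before wiring a Hub-hosted adapter into the service, you can run a quick pre-flight check against its repository. The sketch below is illustrative only: it assumes the `huggingface-hub` package is installed, uses the placeholder repository name from the examples above, and relies on the `adapter_config.json` file that PEFT-format adapters publish.

```python
# Pre-deployment sanity check for a HuggingFace-hosted LoRA adapter.
# Assumes: pip install huggingface-hub; the repo name is a placeholder.
import json

from huggingface_hub import hf_hub_download


def check_adapter(repo_id: str, max_rank: int = 16) -> None:
    # PEFT-format adapters describe their LoRA settings in adapter_config.json.
    config_path = hf_hub_download(repo_id=repo_id, filename="adapter_config.json")
    with open(config_path) as f:
        config = json.load(f)

    rank = config.get("r")
    base_model = config.get("base_model_name_or_path")
    print(f"{repo_id}: rank={rank}, trained against base model {base_model}")

    # 16 mirrors vLLM's default --max-lora-rank; see the Field Reference above.
    if rank is not None and rank > max_rank:
        print(f"-> set spec.model.lora.maxRank to at least {rank}")


check_adapter("my-org/qwen-sql-lora")
```

This catches the two most common misconfigurations early: an adapter trained against a different base model, and a rank higher than the runtime's `maxRank`.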
+::: + +--- + +### S3-Compatible Storage (`s3://`) + +Use adapters from S3, MinIO, Ceph, or any S3-compatible object storage. + +**Format**: `s3://bucket-name/path/to/adapter` + +```yaml +lora: + adapters: + - name: my-adapter + uri: s3://my-bucket/adapters/domain-lora +``` + +**S3 Configuration with Credentials**: + +```yaml +template: + containers: + - name: storage-initializer + env: + - name: AWS_ACCESS_KEY_ID + valueFrom: + secretKeyRef: + name: s3-config + key: AWS_ACCESS_KEY_ID + - name: AWS_SECRET_ACCESS_KEY + valueFrom: + secretKeyRef: + name: s3-config + key: AWS_SECRET_ACCESS_KEY + - name: S3_ENDPOINT + value: "https://minio.example.com" + - name: S3_USE_HTTPS + value: "1" +``` + +**Supported S3-Compatible Providers**: +- AWS S3 +- MinIO +- Ceph Object Gateway +- Google Cloud Storage (S3 compatibility mode) +- Azure Blob Storage (S3 compatibility mode) + +--- + +### PersistentVolumeClaim (`pvc://`) + +Use pre-downloaded adapters from a Kubernetes PVC for fastest startup or air-gapped environments. + +**Format**: `pvc://pvc-name/path/within/pvc` + +```yaml +lora: + adapters: + - name: my-adapter + uri: pvc://adapter-pvc/domain-lora +``` + +**PVC Requirements**: +- PVC must exist in the same namespace +- Access mode: `ReadOnlyMany` or `ReadWriteMany` (for multiple replicas) +- Contains adapter files in safetensors or PyTorch format + +**Example PVC Setup**: + +```yaml +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: adapter-pvc +spec: + accessModes: + - ReadOnlyMany + resources: + requests: + storage: 10Gi + storageClassName: nfs-storage +``` + +:::tip +PVC adapters provide the fastest service startup time since no download phase is required. This is ideal for production deployments and air-gapped environments. +::: + +--- + +## Complete Examples + +### Example 1: Single HuggingFace Adapter + +Simple deployment with one public adapter for SQL code generation: + +```yaml +apiVersion: serving.kserve.io/v1alpha1 +kind: LLMInferenceService +metadata: + name: qwen-sql +spec: + model: + uri: hf://Qwen/Qwen2.5-7B-Instruct + name: Qwen/Qwen2.5-7B-Instruct + lora: + adapters: + - name: sql-adapter + uri: hf://my-org/qwen-sql-lora + + replicas: 2 + + template: + containers: + - name: main + image: vllm/vllm-openai:latest + resources: + limits: + nvidia.com/gpu: "1" + cpu: "8" + memory: 32Gi +``` + +--- + +### Example 2: Multiple Adapters from Different Sources + +Multi-tenant deployment serving adapters for different tasks: + +```yaml +apiVersion: serving.kserve.io/v1alpha1 +kind: LLMInferenceService +metadata: + name: qwen-multi-tenant +spec: + model: + uri: hf://Qwen/Qwen2.5-7B-Instruct + name: Qwen/Qwen2.5-7B-Instruct + lora: + adapters: + - name: sql-adapter + uri: hf://my-org/qwen-sql-lora + - name: code-adapter + uri: s3://my-bucket/adapters/code-lora + - name: domain-adapter + uri: pvc://adapter-pvc/domain-lora + + replicas: 3 + + template: + containers: + - name: main + image: vllm/vllm-openai:latest + resources: + limits: + nvidia.com/gpu: "1" + cpu: "8" + memory: 32Gi + - name: storage-initializer + env: + - name: AWS_ACCESS_KEY_ID + valueFrom: + secretKeyRef: + name: s3-config + key: AWS_ACCESS_KEY_ID + - name: AWS_SECRET_ACCESS_KEY + valueFrom: + secretKeyRef: + name: s3-config + key: AWS_SECRET_ACCESS_KEY +``` + +--- + +## Usage at Inference Time + +### OpenAI-Compatible API + +Once deployed, select adapters by specifying the adapter name in the `model` parameter: + +**Using an Adapter**: + +```bash +curl -k https:///v1/chat/completions \ + -H "Content-Type: 

---

## How It Works

### Automatic Integration

When you configure LoRA adapters, the LLMInferenceService controller automatically:

1. **Download Phase** (`hf://` and `s3://` adapters):
   - Injects the storage-initializer as an init container
   - Downloads all adapters in parallel
   - Mounts adapters under `/mnt/lora/<adapter-name>`

2. **Mount Phase** (`pvc://` adapters):
   - Creates volume mounts for each PVC adapter
   - Mounts them under `/mnt/lora/<adapter-name>` (read-only)
   - No download required

3. **vLLM Configuration**:
   - Automatically adds the `--enable-lora` flag
   - Sets `--max-lora-rank`, `--max-loras`, and `--max-cpu-loras` only when explicitly configured in `spec.model.lora`; vLLM's own defaults apply otherwise
   - Adds `--lora-modules <name>=<path> <name>=<path> ...`

### Path Sanitization

Adapter names are sanitized for filesystem compatibility:

- Invalid characters (`/`, `:`, etc.) are replaced with `-`
- Example: `my/adapter:v1` becomes `my-adapter-v1`

### Resource Considerations

**GPU Memory Usage**:

- Each adapter typically requires 50-500MB of GPU memory depending on rank and model size
- Formula: `adapter_memory ≈ rank × num_layers × hidden_dim × 2 × sizeof(fp16)` per adapted projection
- All adapters are loaded into GPU memory simultaneously

**Download Time**:

- Depends on adapter size and network bandwidth
- HuggingFace: typically 10-60 seconds per adapter
- S3: depends on endpoint proximity and bandwidth
- PVC: no download time (instant)
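
As a rough illustration of the formula, the sketch below plugs in assumed dimensions for a Qwen2.5-7B-class model (28 layers, hidden size 3584). Real adapters usually adapt several projections per layer (q/k/v/o and the MLP), so scale the per-projection result accordingly:

```python
# Back-of-envelope GPU memory estimate for one LoRA adapter, following the
# formula above. The model dimensions are assumptions for this example.
def lora_adapter_memory_mib(rank: int, num_layers: int, hidden_dim: int,
                            bytes_per_param: int = 2) -> float:
    # Two low-rank matrices (A: hidden_dim x rank, B: rank x hidden_dim)
    # per adapted projection per layer, stored as fp16 (2 bytes each).
    params = rank * num_layers * hidden_dim * 2
    return params * bytes_per_param / (1024 ** 2)

for rank in (8, 16, 64):
    mib = lora_adapter_memory_mib(rank, num_layers=28, hidden_dim=3584)
    print(f"rank {rank:>2}: ~{mib:.0f} MiB per adapted projection")
```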
+::: + +--- + +## Monitoring and Troubleshooting + +### Verification + +Check that adapters loaded successfully by viewing pod logs: + +```bash +kubectl logs -c storage-initializer +# Look for: "Successfully downloaded adapter to /mnt/lora/" + +kubectl logs -c main +# Look for: "Loading LoRA adapters" and adapter names +``` + +### Common Issues + +| Issue | Cause | Solution | +|-------|-------|----------| +| **Download failure** | Invalid HF/S3 credentials | Verify `HF_TOKEN` or S3 credentials in environment variables | +| **PVC mount failure** | PVC doesn't exist or wrong namespace | Ensure PVC exists in same namespace as LLMInferenceService | +| **Adapter not found at inference** | Adapter name mismatch | Use exact adapter name from `spec.model.lora.adapters[].name` in `model` parameter | +| **OOM errors** | Too many adapters or insufficient GPU memory | Reduce number of adapters or increase GPU memory allocation | +| **Adapter name conflict** | Duplicate adapter names | Ensure all adapter names are unique | + +### Storage Initializer Dependency + +:::warning +If you disable the storage-initializer (`storageInitializer.enabled: false`), `hf://` and `s3://` adapters will fail to download. Only `pvc://` adapters will work. +::: + +--- + +## Limitations + +### Unsupported URI Schemes + +**OCI Registries (`oci://`)**: Currently not supported for LoRA adapters. + +**Workaround**: Download the adapter to a PVC manually and use `pvc://` scheme: + +```yaml +# Pre-download job +apiVersion: batch/v1 +kind: Job +metadata: + name: download-adapter +spec: + template: + spec: + containers: + - name: downloader + image: python:3.11 + command: ["sh", "-c"] + args: + - | + pip install huggingface-hub + python -c "from huggingface_hub import snapshot_download; snapshot_download('my-org/my-lora', local_dir='/mnt/adapter')" + volumeMounts: + - name: adapter-storage + mountPath: /mnt/adapter + volumes: + - name: adapter-storage + persistentVolumeClaim: + claimName: adapter-pvc + restartPolicy: Never +``` + +--- + +## Related Documentation + +- **[Configuration Guide](./llmisvc-configuration.md)**: Detailed spec reference for LLMInferenceService +- **[Model Storage](../../storage/overview.md)**: Supported storage backends for base models +- **[Dependencies](./llmisvc-dependencies.md)**: Required infrastructure components + +--- + +## Summary + +LoRA adapters in LLMInferenceService provide: + +- ✅ **Three URI schemes**: HuggingFace Hub, S3-compatible storage, and PVC +- ✅ **Automatic integration**: Controller handles downloads, mounts, and vLLM configuration +- ✅ **Dynamic switching**: Per-request adapter selection with minimal overhead +- ✅ **Multi-tenancy**: Serve multiple task-specific models from a single deployment +- ✅ **Production-ready**: Support for private repositories, custom endpoints, and air-gapped deployments + +For complete working examples, see the [KServe samples repository](https://github.com/kserve/kserve/tree/master/docs/samples/llmisvc/lora-adapters). diff --git a/sidebars.ts b/sidebars.ts index 272e83fd1..f60adf27c 100644 --- a/sidebars.ts +++ b/sidebars.ts @@ -111,6 +111,7 @@ const sidebars: SidebarsConfig = { }, items: [ "model-serving/generative-inference/llmisvc/llmisvc-overview", + "model-serving/generative-inference/llmisvc/lora-adapters", "model-serving/generative-inference/llmisvc/llmisvc-configuration", "model-serving/generative-inference/llmisvc/llmisvc-dependencies", "model-serving/generative-inference/llmisvc/llmisvc-envoy-ai-gateway",