|
1 | 1 | # Fluid Roadmap |
2 | 2 |
|
3 | | -## Fluid 2025 Roadmap |
4 | | - |
5 | | -### **1. Data Anyway** |
6 | | -**Objective**: Enable fluid data access **regardless of infrastructure constraints** (e.g., storage types, runtime environments) without developing controller code. |
7 | | - |
8 | | -- **Unified Cache Runtime Framework** |
9 | | - - Enable integration of new cache runtimes (e.g., CubeFS, Dragonfly) via a **generic Cache Runtime interface** with minimal code changes. |
10 | | - - Standardize APIs for cache engine compatibility (e.g., Alluxio, Vineyard, JuiceFS). |
11 | | -- **Adaptive Data Access**: |
12 | | - - Data Access Mode based on Scheduler's Decision: |
13 | | - - *Shared-Kernel Nodes* → Use CSI plugins for direct mounting. |
14 | | - - *Kata Containers* → Switch to sidecar-based container. |
15 | | -- **ThinRuntime Productization**: |
16 | | - - Improve stability and performance for large-scale deployments. |
17 | | - - Minimum container permission (remove the privileged permission of FUSE Pod) |
18 | | - |
19 | | - |
20 | | -### **2. Data Anywhere** |
21 | | -**Objective**: Achieve **cross-region, cross-cluster, and cross-platform** data mobility and accessibility. |
22 | | - |
23 | | -- **Multi-Cluster Dataset Unified Management** |
24 | | - - **Global Dataset**: Create datasets pointing to the same data source across clusters. |
25 | | - - **Queue Integration**: Orchestrate dependencies between data preparation and task scheduling. |
26 | | - - **Persistent Data Mirroring** |
27 | | - - **Region-Aware Replication**: Automatically mirror datasets across clouds/regions. |
28 | | - - **Consistency Guarantees**: Support both eventual and strong consistency models. |
29 | | - |
30 | | -- **Efficient Data Prewarming & Migration** |
31 | | - - **Distributed Prewarming**: Maximize bandwidth utilization for fast data loading. |
32 | | - - **Throttling Control**: Limit bandwidth usage during prewarming to avoid saturation. |
33 | | - - **Rsync Optimization**: Improve cross-region sync efficiency. |
34 | | - |
35 | | -- **Elastic Caching & Scheduling**: |
36 | | - - **Disk-Aware Scheduling**: Optimize workload placement based on disk capacity, utilization, and locality. |
37 | | - - **Intelligent Scaling**: |
38 | | - - Recommend underutilized Pods for scaling (cost/performance-aware). |
39 | | - - Ensure cache engines adapt to dynamic throughput post-scaling. |
40 | | - - **Cloud-Agnostic Recovery**: Rebuild caches across regions using cloud disk snapshots. |
41 | | - |
42 | | -- **Observability-Driven Optimization** |
43 | | - - **Pattern Recognition**: Analyze data access patterns to auto-inject acceleration components (e.g., caching, prefetching). |
44 | | - - **Idle Dataset Detection**: Identify unused datasets via reference counting and access history. |
45 | | - |
46 | | -- **Application-Side Acceleration** |
47 | | - - **Transparent Prefetching**: |
48 | | - - Inject sidecar containers to prefetch data dynamically (e.g., Alluxio/Fluid Runtime). |
49 | | - - Auto-adjust prefetch strategies (block size, concurrency) based on access patterns. |
50 | | - - **Dynamic SDK Injection**: Attach acceleration SDKs to Pods via Fluid Admission Controller (no base image modification). |
51 | | - |
52 | | - |
53 | | -### **3. Data Anytime** |
54 | | -**Core Goal**: Ensure **real-time, adaptive, and intelligent** data availability for workloads. |
55 | | - |
56 | | -- **Temporal Workflows with Kueue**: |
57 | | - - Trigger ML jobs (TFJob, PyTorchJob) **after prewarming completes**. |
58 | | - - Automate post-job cleanup (data migration/cache eviction). |
59 | | -- **Dynamic Volume Mounting**: |
60 | | - - Support dynamic volume mounting capabilities for multi-cloud/hybrid-cloud scenarios. |
61 | | - - Enable dynamic data mount operations in the Python SDK. |
| 3 | +## Fluid 2026 Roadmap |
62 | 4 |
|
| 5 | +### 1. Data Anyway |
| 6 | + |
| 7 | +> **Objective:** Enable fluid data access **regardless of infrastructure constraints** (e.g., storage types, runtime environments) without developing controller code. |
| 8 | +
|
| 9 | +#### Generic Cache Runtime |
| 10 | + |
| 11 | +- **Pluggable Architecture:** Standardized Cache Runtime Interface for rapid integration of new engines (CubeFS, Dragonfly, Vineyard) with minimal boilerplate. |
| 12 | +- **Orchestration Based on AdvancedStatefulSet:** Migrate from StatefulSet to AdvancedStatefulSet for fine-grained Pod lifecycle management, ordered rollout, and enhanced failover capabilities. |
| 13 | + |
| 14 | +#### Runtime Dynamic Configuration |
| 15 | + |
| 16 | +- **Zero-Downtime Tuning:** Adjust cache replicas, storage media tiers (SSD/HDD/RAM), and eviction policies without Dataset reconstruction or workload restart. |
| 17 | +- **Hot Parameter Swapping:** Runtime modification of cache engine configurations (e.g., Alluxio thread pool, Jindo worker threads) for traffic spike handling. |
| 18 | + |
| 19 | +#### API Upgrade to `v1alpha2` |
| 20 | + |
| 21 | +- Standardized Conditions, `ObservedGeneration`, and phase transition semantics for improved GitOps and tooling compatibility. |
| 22 | +- Conversion webhook support for seamless `v1alpha1` → `v1alpha2` migration. |
| 23 | + |
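As an illustration of why standardized Conditions and `ObservedGeneration` help GitOps tooling, the following is a minimal sketch of a readiness check a tool might run against a resource. It assumes the usual Kubernetes API conventions (`metadata.generation`, `status.observedGeneration`, and a `status.conditions` list with `type` and `status` fields). The `Ready` condition type and the function name `is_reconciled` are hypothetical; the roadmap does not specify the actual `v1alpha2` schema.

```python
def is_reconciled(obj: dict, condition_type: str = "Ready") -> bool:
    """Return True only if the controller has observed the latest spec
    and reports the given condition as "True".

    `obj` is a resource decoded into a plain dict. The condition type
    "Ready" is an assumption made for illustration.
    """
    generation = obj.get("metadata", {}).get("generation")
    status = obj.get("status", {})

    # A missing or stale observedGeneration means the status describes
    # an older spec, so its conditions cannot be trusted yet.
    if generation is None or status.get("observedGeneration") != generation:
        return False

    return any(
        c.get("type") == condition_type and c.get("status") == "True"
        for c in status.get("conditions", [])
    )
```

Comparing `observedGeneration` with `generation` before reading conditions lets a tool distinguish "ready for the current spec" from "ready for a spec that has since changed".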
| 24 | +#### Validation Webhook |
| 25 | + |
| 26 | +- Admission-time CRD validation with auto-correction suggestions to prevent misconfigurations. |
| 27 | +- Policy enforcement for resource quotas and security constraints. |
| 28 | + |
| 29 | +#### ThinRuntime Productization |
| 30 | + |
| 31 | +- Production-ready stability for large-scale deployments with **minimum container privileges** (eliminate privileged FUSE Pod requirements). |
| 32 | + |
| 33 | +--- |
| 34 | + |
| 35 | +### 2. Data Anywhere |
| 36 | + |
| 37 | +> **Objective:** Achieve **cross-region, cross-cluster, and cross-platform** data mobility and accessibility. |
| 38 | +
|
| 39 | +#### LLM KV Cache Orchestration |
| 40 | + |
| 41 | +- **Disaggregated KV Cache:** Externalize vLLM/SGLang KV Cache to Fluid-managed distributed storage, enabling 10x+ throughput improvement for long-context inference. |
| 42 | +- **Cross-Pod Cache Sharing:** Live migration of KV Cache between inference instances for preemptive scheduling and spot instance tolerance. |
| 43 | +- **Mooncake Integration:** Official partnership for high-performance KV Cache backend with RDMA acceleration. |
| 44 | + |
| 45 | +#### Efficient Data Prewarming & Migration |
| 46 | + |
| 47 | +- **Distributed Prewarming:** Maximize bandwidth utilization for fast data loading. |
| 48 | +- **Throttling Control:** Limit bandwidth usage during prewarming to avoid saturation. |
| 49 | +- **Rsync Optimization:** Improve cross-region sync efficiency. |
| 50 | + |
| 51 | +#### JindoRuntime High Availability |
| 52 | + |
| 53 | +- **Master Pod Crash Recovery:** Automatic re-setup and state reconstruction after cache master failure without data loss. |
| 54 | +- **Metadata Persistence:** WAL-based metadata recovery for rapid failover. |
| 55 | + |
| 56 | +#### Observability-Driven Optimization |
| 57 | + |
| 58 | +- **Access Pattern Recognition:** ML-based analysis to auto-inject acceleration strategies (prefetching, block size optimization). |
| 59 | +- **Dataset Garbage Collection:** Idle dataset detection via reference counting and access history analysis. |
| 60 | + |
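The idle dataset detection mentioned above could combine the two signals roughly as follows. This is only a sketch under stated assumptions: the `DatasetUsage` record, the `find_idle` function, and the seven-day default threshold are all hypothetical and are not part of Fluid's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional


@dataclass
class DatasetUsage:
    """Hypothetical usage record for one dataset."""
    name: str
    ref_count: int                    # workloads currently referencing the dataset
    last_access: Optional[datetime]   # None if no access has been recorded


def find_idle(
    usages: Iterable[DatasetUsage],
    now: datetime,
    idle_after: timedelta = timedelta(days=7),  # assumed threshold
) -> List[str]:
    """Return names of datasets with no references and no recent access."""
    idle = []
    for usage in usages:
        # Any live reference keeps the dataset, regardless of access history.
        if usage.ref_count > 0:
            continue
        # Unreferenced and either never accessed or not accessed recently.
        if usage.last_access is None or now - usage.last_access >= idle_after:
            idle.append(usage.name)
    return idle
```

Checking the reference count first means a dataset that is mounted but rarely read is never reported as idle; only unreferenced datasets are judged by their access history.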
| 61 | +--- |
| 62 | + |
| 63 | +### 3. Data Anytime |
| 64 | + |
| 65 | +> **Objective:** Ensure **real-time, adaptive, and intelligent** data availability for workloads. |
| 66 | +
|
| 67 | +#### Temporal Workflow Integration |
| 68 | + |
| 69 | +- **Kueue-Driven Pipelines:** Trigger training/inference jobs automatically upon DataLoad completion; automate post-job cache eviction and data migration. |
| 70 | +- **Event-Driven Policies:** Flexible metadata synchronization triggered by workload lifecycle events. |
| 71 | + |
| 72 | +#### Developer Experience |
| 73 | + |
| 74 | +- **Fluid kubectl Plugin:** Native CLI extension (`kubectl fluid`) for: |
| 75 | + - Dataset status inspection and health diagnostics |
| 76 | + - On-demand prewarming triggering (`kubectl fluid warmup`) |
| 77 | + - Cache performance profiling and bottleneck analysis |
| 78 | + - Runtime configuration hot-updates |