2026: Add Fluid roadmap (#5672)

cheyang · web-flow · commit f966f0f6a8a6 · 2026-02-28T09:28:03.000Z
* 2026: Add Fluid roadmap

Signed-off-by: cheyang <cheyang.cy@alibaba-inc.com>

---------

Signed-off-by: cheyang <cheyang.cy@alibaba-inc.com>
diff --git a/ROADMAP.md b/ROADMAP.md
@@ -1,62 +1,78 @@
 # Fluid Roadmap
 
-## Fluid 2025 Roadmap
-
-### **1. Data Anyway**  
-**Objective**: Enable fluid data access **regardless of infrastructure constraints** (e.g., storage types, runtime environments) without developing controller code.
-
-- **Unified Cache Runtime Framework**  
-  - Enable integration of new cache runtimes(e.g., Cubefs, DragonFly) via a **generic Cache Runtime interface** with minimal code changes.    
-  - Standardize APIs for cache engine compatibility (e.g., Alluxio, Vineyard, JuiceFS).  
-- **Adaptive Data Access**:  
-  - Data Access Mode based on Scheduler's Decsion:  
-    - *Shared-Kernel Nodes* → Use CSI plugins for direct mounting.  
-    - *Kata Containers* → Switch to sidecar-based container. 
-- **ThinRuntime Productization**:  
-  - Improve stability and performance for large-scale deployments.  
-  - Minimum container permission (remove the privileged permission of FUSE Pod)
-
-
-### **2. Data Anywhere**  
-**Objective**: Achieve **cross-region, cross-cluster, and cross-platform** data mobility and accessibility.  
-
-- **Multi-Cluster Dataset Unified Management**  
-  - **Global Dataset**: Create datasets pointing to the same data source across clusters.  
-  - **Queue Integration**: Orchestrate dependencies between data preparation and task scheduling.  
-  - **Persistent Data Mirroring**  
-    - **Region-Aware Replication**: Automatically mirror datasets across clouds/regions.  
-    - **Consistency Guarantees**: Support both eventual and strong consistency models.  
-
-- **Efficient Data Prewarming & Migration**  
-  - **Distributed Prewarming**: Maximize bandwidth utilization for fast data loading.  
-  - **Throttling Control**: Limit bandwidth usage during prewarming to avoid saturation.  
-  - **Rsync Optimization**: Improve cross-region sync efficiency.  
-
-- **Elastic Caching & Scheduling**:  
-  - **Disk-Aware Scheduling**: Optimize workload placement based on disk capacity, utilization, and locality.  
-  - **Intelligent Scaling**:  
-    - Recommend underutilized Pods for scaling (cost/performance-aware).  
-    - Ensure cache engines adapt to dynamic throughput post-scaling.  
-  - **Cloud-Agnostic Recovery**: Rebuild caches across regions using cloud disk snapshots.  
-
-- **Observability-Driven Optimization**  
-  - **Pattern Recognition**: Analyze data access patterns to auto-inject acceleration components (e.g., caching, prefetching).  
-  - **Idle Dataset Detection**: Identify unused datasets via reference counting and access history.  
-
-- **Application-Side Acceleration**  
-  - **Transparent Prefetching**:  
-    - Inject sidecar containers to prefetch data dynamically (e.g., Alluxio/Fluid Runtime).  
-    - Auto-adjust prefetch strategies (block size, concurrency) based on access patterns.  
-  - **Dynamic SDK Injection**: Attach acceleration SDKs to Pods via Fluid Admission Controller (no base image modification).  
-
-
-### **3. Data Anytime**  
-**Core Goal**: Ensure **real-time, adaptive, and intelligent** data availability for workloads.  
-
-- **Temporal Workflows with Kueue**:  
-  - Trigger ML jobs (TFJob, PyTorchJob) **after prewarming completes**.  
-  - Automate post-job cleanup (data migration/cache eviction).  
-- **Dynamic Volume Mounting**:  
-  - Support dynamic volume mounting capabilities for multi-cloud/hybrid-cloud scenarios.  
-  - Enable dyanmic data mount operations in Python SDK. 
+## Fluid 2026 Roadmap
 
+### 1. Data Anyway
+
+> **Objective:** Enable fluid data access **regardless of infrastructure constraints** (e.g., storage types, runtime environments) without developing controller code.
+
+#### Generic Cache Runtime
+
+- **Pluggable Architecture:** Standardized Cache Runtime Interface for rapid integration of new engines (CubeFS, Dragonfly, Vineyard) with minimal boilerplate.
+- **Orchestration Based on AdvancedStatefulSet:** Migrate from StatefulSet to AdvancedStatefulSet for fine-grained Pod lifecycle management, ordered rollout, and enhanced failover capabilities.
+
+#### Runtime Dynamic Configuration
+
+- **Zero-Downtime Tuning:** Adjust cache replicas, storage media tiers (SSD/HDD/RAM), and eviction policies without recreating the Dataset or restarting workloads.
+- **Hot Parameter Swapping:** Modify cache engine configurations at runtime (e.g., Alluxio thread pools, Jindo worker threads) to absorb traffic spikes.
+
+#### API Upgrade to `v1alpha2`
+
+- Standardized Conditions, `ObservedGeneration`, and phase transition semantics for improved GitOps and tooling compatibility.
+- Conversion webhook support for seamless `v1alpha1` → `v1alpha2` migration.
+
+#### Validation Webhook
+
+- Admission-time validation of Fluid custom resources, with auto-correction suggestions, to prevent misconfigurations.
+- Policy enforcement for resource quotas and security constraints.
+
+#### ThinRuntime Productization
+
+- Production-ready stability for large-scale deployments with **minimum container privileges** (eliminate privileged FUSE Pod requirements).
+
+---
+
+### 2. Data Anywhere
+
+> **Objective:** Achieve **cross-region, cross-cluster, and cross-platform** data mobility and accessibility.
+
+#### LLM KV Cache Orchestration
+
+- **Disaggregated KV Cache:** Externalize vLLM/SGLang KV Cache to Fluid-managed distributed storage, targeting a 10x+ throughput improvement for long-context inference.
+- **Cross-Pod Cache Sharing:** Live migration of KV Cache between inference instances for preemptive scheduling and spot instance tolerance.
+- **Mooncake Integration:** Official partnership with Mooncake to provide a high-performance, RDMA-accelerated KV Cache backend.
+
+#### Efficient Data Prewarming & Migration
+
+- **Distributed Prewarming:** Maximize bandwidth utilization for fast data loading.
+- **Throttling Control:** Limit bandwidth usage during prewarming to avoid saturation.
+- **Rsync Optimization:** Improve cross-region sync efficiency.
+
+#### JindoRuntime High Availability
+
+- **Master Pod Crash Recovery:** Automatically re-initialize the cache master and reconstruct its state after a failure, without data loss.
+- **Metadata Persistence:** WAL-based metadata recovery for rapid failover.
+
+#### Observability-Driven Optimization
+
+- **Access Pattern Recognition:** ML-based analysis to auto-inject acceleration strategies (prefetching, block size optimization).
+- **Dataset Garbage Collection:** Idle dataset detection via reference counting and access history analysis.
+
+---
+
+### 3. Data Anytime
+
+> **Objective:** Ensure **real-time, adaptive, and intelligent** data availability for workloads.
+
+#### Temporal Workflow Integration
+
+- **Kueue-Driven Pipelines:** Trigger training/inference jobs automatically upon DataLoad completion; automate post-job cache eviction and data migration.
+- **Event-Driven Policies:** Flexible metadata synchronization triggered by workload lifecycle events.
+
+#### Developer Experience
+
+- **Fluid kubectl Plugin:** Native CLI extension (`kubectl fluid`) for:
+  - Dataset status inspection and health diagnostics
+  - On-demand prewarming triggering (`kubectl fluid warmup`)
+  - Cache performance profiling and bottleneck analysis
+  - Runtime configuration hot-updates