Commit f966f0f

2026: Add Fluid roadmap (#5672)
* 2026: Add Fluid roadmap

Signed-off-by: cheyang <cheyang.cy@alibaba-inc.com>
1 parent d2aff7a commit f966f0f

1 file changed

Lines changed: 75 additions & 59 deletions

ROADMAP.md

@@ -1,62 +1,78 @@
11
# Fluid Roadmap
22

3-
## Fluid 2025 Roadmap
4-
5-
### **1. Data Anyway**
6-
**Objective**: Enable fluid data access **regardless of infrastructure constraints** (e.g., storage types, runtime environments) without developing controller code.
7-
8-
- **Unified Cache Runtime Framework**
9-
  - Enable integration of new cache runtimes (e.g., CubeFS, Dragonfly) via a **generic Cache Runtime interface** with minimal code changes.
10-
- Standardize APIs for cache engine compatibility (e.g., Alluxio, Vineyard, JuiceFS).
11-
- **Adaptive Data Access**:
12-
  - Data Access Mode based on Scheduler's Decision:
13-
- *Shared-Kernel Nodes* → Use CSI plugins for direct mounting.
14-
- *Kata Containers* → Switch to sidecar-based container.
15-
- **ThinRuntime Productization**:
16-
- Improve stability and performance for large-scale deployments.
17-
- Minimum container permission (remove the privileged permission of FUSE Pod)
18-
19-
20-
### **2. Data Anywhere**
21-
**Objective**: Achieve **cross-region, cross-cluster, and cross-platform** data mobility and accessibility.
22-
23-
- **Multi-Cluster Dataset Unified Management**
24-
- **Global Dataset**: Create datasets pointing to the same data source across clusters.
25-
- **Queue Integration**: Orchestrate dependencies between data preparation and task scheduling.
26-
- **Persistent Data Mirroring**
27-
- **Region-Aware Replication**: Automatically mirror datasets across clouds/regions.
28-
- **Consistency Guarantees**: Support both eventual and strong consistency models.
29-
30-
- **Efficient Data Prewarming & Migration**
31-
- **Distributed Prewarming**: Maximize bandwidth utilization for fast data loading.
32-
- **Throttling Control**: Limit bandwidth usage during prewarming to avoid saturation.
33-
- **Rsync Optimization**: Improve cross-region sync efficiency.
34-
35-
- **Elastic Caching & Scheduling**:
36-
- **Disk-Aware Scheduling**: Optimize workload placement based on disk capacity, utilization, and locality.
37-
- **Intelligent Scaling**:
38-
- Recommend underutilized Pods for scaling (cost/performance-aware).
39-
- Ensure cache engines adapt to dynamic throughput post-scaling.
40-
- **Cloud-Agnostic Recovery**: Rebuild caches across regions using cloud disk snapshots.
41-
42-
- **Observability-Driven Optimization**
43-
- **Pattern Recognition**: Analyze data access patterns to auto-inject acceleration components (e.g., caching, prefetching).
44-
- **Idle Dataset Detection**: Identify unused datasets via reference counting and access history.
45-
46-
- **Application-Side Acceleration**
47-
- **Transparent Prefetching**:
48-
- Inject sidecar containers to prefetch data dynamically (e.g., Alluxio/Fluid Runtime).
49-
- Auto-adjust prefetch strategies (block size, concurrency) based on access patterns.
50-
- **Dynamic SDK Injection**: Attach acceleration SDKs to Pods via Fluid Admission Controller (no base image modification).
51-
52-
53-
### **3. Data Anytime**
54-
**Core Goal**: Ensure **real-time, adaptive, and intelligent** data availability for workloads.
55-
56-
- **Temporal Workflows with Kueue**:
57-
- Trigger ML jobs (TFJob, PyTorchJob) **after prewarming completes**.
58-
- Automate post-job cleanup (data migration/cache eviction).
59-
- **Dynamic Volume Mounting**:
60-
- Support dynamic volume mounting capabilities for multi-cloud/hybrid-cloud scenarios.
61-
  - Enable dynamic data mount operations in Python SDK.
3+
## Fluid 2026 Roadmap
62 4

5+
### 1. Data Anyway
6+
7+
> **Objective:** Enable fluid data access **regardless of infrastructure constraints** (e.g., storage types, runtime environments) without developing controller code.
8+
9+
#### Generic Cache Runtime
10+
11+
- **Pluggable Architecture:** Standardized Cache Runtime Interface for rapid integration of new engines (CubeFS, Dragonfly, Vineyard) with minimal boilerplate.
12+
- **Orchestration Based on AdvancedStatefulSet:** Migrate from StatefulSet to AdvancedStatefulSet for fine-grained Pod lifecycle management, ordered rollout, and enhanced failover capabilities.
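
To illustrate what a pluggable runtime interface can look like, here is a minimal sketch in Python. It is not Fluid's actual interface (Fluid itself is written in Go, and the roadmap does not specify the API); `CacheRuntime`, `register`, `create_runtime`, and the `example-fs` engine are all hypothetical names chosen for this example. The idea shown is only that a new engine plugs in by implementing one small interface and registering itself, without changes to the code that consumes it.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type


class CacheRuntime(ABC):
    """Hypothetical interface that every cache engine would implement."""

    @abstractmethod
    def mount_command(self, dataset: str, mount_point: str) -> List[str]:
        """Return the command used to mount `dataset` at `mount_point`."""


# Registry mapping an engine name to its implementation class.
_REGISTRY: Dict[str, Type[CacheRuntime]] = {}


def register(name: str) -> Callable[[Type[CacheRuntime]], Type[CacheRuntime]]:
    """Class decorator that adds an engine to the registry under `name`."""
    def decorator(cls: Type[CacheRuntime]) -> Type[CacheRuntime]:
        _REGISTRY[name] = cls
        return cls
    return decorator


@register("example-fs")
class ExampleFSRuntime(CacheRuntime):
    """Made-up engine; a real integration would wrap the engine's own tooling."""

    def mount_command(self, dataset: str, mount_point: str) -> List[str]:
        return ["example-fs-mount", dataset, mount_point]


def create_runtime(name: str) -> CacheRuntime:
    """Instantiate a registered engine, or fail with a clear error."""
    if name not in _REGISTRY:
        raise ValueError(f"unknown cache runtime: {name}")
    return _REGISTRY[name]()
```

With this shape, supporting another engine means adding one decorated class; callers keep using `create_runtime(name).mount_command(...)` unchanged.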
13+
14+
#### Runtime Dynamic Configuration
15+
16+
- **Zero-Downtime Tuning:** Adjust cache replicas, storage media tiers (SSD/HDD/RAM), and eviction policies without Dataset reconstruction or workload restart.
17+
- **Hot Parameter Swapping:** Runtime modification of cache engine configurations (e.g., Alluxio thread pool, Jindo worker threads) for traffic spike handling.
18+
19+
#### API Upgrade to `v1alpha2`
20+
21+
- Standardized Conditions, `ObservedGeneration`, and phase transition semantics for improved GitOps and tooling compatibility.
22+
- Conversion webhook support for seamless `v1alpha1` to `v1alpha2` migration.
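
As a sketch of why standardized Conditions and `ObservedGeneration` help tooling, the following Python function shows the check a GitOps tool typically performs on a Kubernetes object: the status is trusted only if the controller has observed the latest spec generation, and only then is the condition read. The `v1alpha2` schema is not published in this roadmap, so the `Ready` condition type and the exact field layout below are assumptions based on common Kubernetes conventions, not Fluid's confirmed API.

```python
from typing import Any, Dict


def is_reconciled(obj: Dict[str, Any], condition_type: str = "Ready") -> bool:
    """Return True if the status reflects the latest spec and the condition is True.

    `obj` is a Kubernetes object as a plain dict (e.g. parsed from JSON).
    The "Ready" condition type is an assumption made for this example.
    """
    generation = obj.get("metadata", {}).get("generation")
    status = obj.get("status", {})
    observed = status.get("observedGeneration")

    # A missing or older observedGeneration means the status is stale:
    # the controller has not yet processed the most recent spec change.
    if generation is None or observed is None or observed < generation:
        return False

    for cond in status.get("conditions", []):
        if cond.get("type") == condition_type:
            # Condition status is the string "True", "False", or "Unknown".
            return cond.get("status") == "True"
    return False
```

Without `observedGeneration`, a tool cannot distinguish "ready for the spec I just applied" from "ready for the previous spec", which is the gap this roadmap item addresses.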
23+
24+
#### Validation Webhook
25+
26+
- Admission-time CRD validation with auto-correction suggestions to prevent misconfigurations.
27+
- Policy enforcement for resource quotas and security constraints.
28+
29+
#### ThinRuntime Productization
30+
31+
- Production-ready stability for large-scale deployments with **minimum container privileges** (eliminate privileged FUSE Pod requirements).
32+
33+
---
34+
35+
### 2. Data Anywhere
36+
37+
> **Objective:** Achieve **cross-region, cross-cluster, and cross-platform** data mobility and accessibility.
38+
39+
#### LLM KV Cache Orchestration
40+
41+
- **Disaggregated KV Cache:** Externalize vLLM/SGLang KV Cache to Fluid-managed distributed storage, enabling 10x+ throughput improvement for long-context inference.
42+
- **Cross-Pod Cache Sharing:** Live migration of KV Cache between inference instances for preemptive scheduling and spot instance tolerance.
43+
- **Mooncake Integration:** Official partnership for high-performance KV Cache backend with RDMA acceleration.
44+
45+
#### Efficient Data Prewarming & Migration
46+
47+
- **Distributed Prewarming:** Maximize bandwidth utilization for fast data loading.
48+
- **Throttling Control:** Limit bandwidth usage during prewarming to avoid saturation.
49+
- **Rsync Optimization:** Improve cross-region sync efficiency.
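
One common way to implement bandwidth throttling of the kind described above is a token bucket. The sketch below is a generic illustration, not Fluid's implementation; the class name, parameters, and the injectable `clock` are assumptions made for this example.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `burst_bytes`, refilling at `rate_bytes_per_s`."""

    def __init__(
        self,
        rate_bytes_per_s: float,
        burst_bytes: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def _refill(self) -> None:
        # Add tokens for the elapsed time, capped at the bucket capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_consume(self, nbytes: float) -> bool:
        """Return True and deduct tokens if `nbytes` may be sent now."""
        self._refill()
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

A prewarming worker would call `try_consume(len(chunk))` before each transfer and wait briefly when it returns `False`, which bounds the sustained rate while still permitting short bursts.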
50+
51+
#### JindoRuntime High Availability
52+
53+
- **Master Pod Crash Recovery:** Automatic re-setup and state reconstruction after cache master failure without data loss.
54+
- **Metadata Persistence:** WAL-based metadata recovery for rapid failover.
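
The general write-ahead-log (WAL) technique can be sketched as follows: every mutation is appended and synced to a log before it is applied in memory, and recovery replays the log to rebuild state. This is a generic, simplified illustration in Python and says nothing about how JindoRuntime actually stores metadata; the `MetadataStore` class, the JSON-lines record format, and the file layout are assumptions for this example.

```python
import json
import os
from typing import Any, Dict


class MetadataStore:
    """In-memory key/value metadata backed by an append-only JSON-lines WAL."""

    def __init__(self, wal_path: str) -> None:
        self.wal_path = wal_path
        self.state: Dict[str, Any] = {}
        self._replay()  # rebuild state from any existing log
        self._wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self) -> None:
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final write: stop at the first unreadable record
                self._apply(record)

    def _apply(self, record: Dict[str, Any]) -> None:
        if record["op"] == "put":
            self.state[record["key"]] = record["value"]
        elif record["op"] == "delete":
            self.state.pop(record["key"], None)

    def _log(self, record: Dict[str, Any]) -> None:
        # Persist first, then apply, so a crash never loses an applied change.
        self._wal.write(json.dumps(record) + "\n")
        self._wal.flush()
        os.fsync(self._wal.fileno())
        self._apply(record)

    def put(self, key: str, value: Any) -> None:
        self._log({"op": "put", "key": key, "value": value})

    def delete(self, key: str) -> None:
        self._log({"op": "delete", "key": key})

    def close(self) -> None:
        self._wal.close()
```

A replacement master that opens the same log path reconstructs the same state, which is the property that makes failover fast. A production design would also add checkpoints so that replay time stays bounded.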
55+
56+
#### Observability-Driven Optimization
57+
58+
- **Access Pattern Recognition:** ML-based analysis to auto-inject acceleration strategies (prefetching, block size optimization).
59+
- **Dataset Garbage Collection:** Idle dataset detection via reference counting and access history analysis.
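
The combination of reference counting and access history can be sketched as a simple predicate: a dataset is a garbage-collection candidate only if nothing currently references it and it has not been accessed for some time. The `DatasetUsage` record, the `find_idle` helper, and the seven-day default threshold are hypothetical choices for this illustration, not values from the roadmap.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class DatasetUsage:
    name: str
    ref_count: int         # workloads currently referencing the dataset
    last_access: datetime  # most recent recorded access


def find_idle(
    usages: Iterable[DatasetUsage],
    now: datetime,
    idle_after: timedelta = timedelta(days=7),
) -> List[str]:
    """Return names of datasets with no references and no recent access."""
    return [
        u.name
        for u in usages
        if u.ref_count == 0 and now - u.last_access >= idle_after
    ]
```

Requiring both signals avoids two failure modes: reclaiming a dataset that is mounted but momentarily quiet, and reclaiming one that was just released but is likely to be reused soon.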
60+
61+
---
62+
63+
### 3. Data Anytime
64+
65+
> **Objective:** Ensure **real-time, adaptive, and intelligent** data availability for workloads.
66+
67+
#### Temporal Workflow Integration
68+
69+
- **Kueue-Driven Pipelines:** Trigger training/inference jobs automatically upon DataLoad completion; automate post-job cache eviction and data migration.
70+
- **Event-Driven Policies:** Flexible metadata synchronization triggered by workload lifecycle events.
71+
72+
#### Developer Experience
73+
74+
- **Fluid kubectl Plugin:** Native CLI extension (`kubectl fluid`) for:
75+
- Dataset status inspection and health diagnostics
76+
- On-demand prewarming triggering (`kubectl fluid warmup`)
77+
- Cache performance profiling and bottleneck analysis
78+
- Runtime configuration hot-updates
