# Fluid Configuration Guide: Best Practices and Tuning

This document serves as a deep-dive into the configuration knobs of Fluid. While Fluid works out-of-the-box with sensible defaults, achieving production-grade performance requires tuning based on your specific storage backend and workload characteristics.

## 1. Dataset: The Foundation

The `Dataset` resource defines **where** your data lives and **how** it should be accessed.

### Key Considerations
* **Mount Point Naming**: When mounting multiple sources, use explicit `name` fields. Fluid uses these names to create the internal directory structure. Without them, you risk path collisions if two sources have similar root structures.
* **Read-Only vs. Read-Write**: For most AI training workloads, set `readOnly: true` in your mounts. This allows the caching engine (like Alluxio) to optimize for read-heavy traffic and avoid the overhead of consistency checks for writes.

| Config Point | Why it matters |
| :--- | :--- |
| `spec.placement: Exclusive` | **Performance Isolation.** Prevents other datasets from "stealing" cache space on the same node. Essential for low-latency requirements. |
| `spec.nodeAffinity` | **Disk Type Targeting.** If your cluster has a mix of HDD and NVMe nodes, use affinity to ensure Fluid only caches data on the high-speed nodes. |
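
Putting these points together, here is a minimal `Dataset` sketch. The bucket paths, dataset name, and the `disktype` node label are illustrative placeholders, not values Fluid requires:

```yaml
# Sketch: two named read-only mounts, exclusive placement, and
# caching pinned to NVMe-labeled nodes.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-dataset
spec:
  placement: Exclusive
  mounts:
    - mountPoint: oss://training-images/   # illustrative bucket
      name: images                         # explicit name avoids path collisions
      readOnly: true
    - mountPoint: oss://training-labels/
      name: labels
      readOnly: true
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: disktype                # hypothetical node label
              operator: In
              values: ["nvme"]
```
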
---

## 2. AlluxioRuntime: High-Performance Caching

Alluxio is the "engine" for most Fluid deployments. Its configuration determines your data-plane throughput.

### Tuning the Memory Tier (MEM)
For the fastest possible access, use `/dev/shm` (ramdisk).
* **Best Practice**: Ensure your `tieredstore` levels point to a medium of type `MEM`.
* **Gotcha**: If your node runs out of RAM, the Alluxio Worker may be OOMKilled. Always set `resources.limits.memory` slightly higher than your total `quota`; see the sketch below.
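
A minimal `AlluxioRuntime` sketch applying both bullets. The replica count, quota, and memory limit are illustrative; the point is the headroom between the tier quota and the worker limit:

```yaml
# Sketch: one MEM tier on /dev/shm, with the worker memory limit
# set above the cache quota to reduce the chance of OOMKills.
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: demo-dataset        # must match the Dataset's name
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 8Gi
  worker:
    resources:
      limits:
        memory: 10Gi        # headroom above the 8Gi tier quota
```
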
### JVM Heap Management
Since Alluxio is Java-based, `jvmOptions` are critical. If you have millions of small files, the Master node needs more heap space to track metadata.
```yaml
# Example: Increasing Master Heap for large metadata
master:
  jvmOptions:
    - "-Xms4g"
    - "-Xmx4g"
```

---

## 3. JuiceFSRuntime: Cloud-Native POSIX

JuiceFS is excellent for environments where POSIX compliance is a hard requirement.

### Metadata vs. Data
JuiceFS separates metadata (Redis/MySQL/TiKV) from data (S3/OSS).
* **Optimization**: Use the `attr-cache` option in `spec.fuse.options`. Setting it to `60` (the value is in seconds) or higher can drastically reduce the load on your metadata service during repetitive tasks like `ls -R`.
* **Worker Caching**: Size the worker's local cache through `spec.tieredstore.levels`; this is the recommended way to set JuiceFS cache capacity and directories, and it keeps the cache from filling the node's root partition. Older configurations may still set the keys `cache-size` and `cache-dir` (no leading dashes) in `spec.worker.options`, but both are deprecated in favor of `tieredstore.levels`. See the sketch below.
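
A minimal `JuiceFSRuntime` sketch combining both points. The cache directory, quota, and the 60-second value are illustrative:

```yaml
# Sketch: attribute caching on the FUSE client, plus a worker cache
# tier sized via tieredstore and kept off the root partition.
apiVersion: data.fluid.io/v1alpha1
kind: JuiceFSRuntime
metadata:
  name: demo-dataset
spec:
  replicas: 2
  fuse:
    options:
      attr-cache: "60"                 # seconds of attribute caching
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /var/lib/juicefs/cache   # illustrative cache directory
        quota: 40Gi
```
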
---

## 4. JindoRuntime: Alibaba Cloud Optimization

If you are running on ACK (Alibaba Cloud Container Service for Kubernetes), JindoRuntime provides native optimizations for OSS.
* **Credential Management**: Avoid hardcoding your AccessKey ID/Secret (AK/SK) in the YAML. Use `hadoopConfig` to reference a ConfigMap containing a `core-site.xml` with your OSS credentials.
* **Log Bloat**: Jindo can be chatty. Set `spec.fuse.logConfig` to `level: warn` in stable production environments to save disk space on logs; see the sketch below.
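
A minimal sketch combining both points, assuming a ConfigMap named `oss-credentials` (an illustrative name) that holds your `core-site.xml`:

```yaml
# Sketch: credentials referenced via hadoopConfig, quieter FUSE logging.
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: demo-dataset
spec:
  replicas: 2
  hadoopConfig: oss-credentials   # ConfigMap containing core-site.xml
  fuse:
    logConfig:
      level: warn                 # cut log volume in production
```
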
---

## 5. ThinRuntime: The "Universal" Adapter

ThinRuntime is intended for storage systems that don't have a dedicated Fluid runtime controller (e.g., NFS, Ceph).
* **Standardization**: Leverage `ThinRuntimeProfile`. It lets you define the "how-to-mount" logic once and reuse it across multiple datasets; a sketch follows this list.
* **Health Probes**: Since ThinRuntime relies on external FUSE binaries, always define a `livenessProbe`. This lets Kubernetes auto-restart the FUSE pod when the mount point goes stale or starts returning "transport endpoint is not connected" errors.
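
A minimal `ThinRuntimeProfile` sketch for an NFS-style mount. The image, entrypoint script, probe command, and mount path are all hypothetical placeholders, and the probe stanza assumes your Fluid version accepts a standard probe on the FUSE spec, as the health-probe advice above implies:

```yaml
# Sketch: reusable "how-to-mount" profile for NFS-like storage.
apiVersion: data.fluid.io/v1alpha1
kind: ThinRuntimeProfile
metadata:
  name: nfs-profile
spec:
  fileSystemType: nfs
  fuse:
    image: example.registry.com/nfs-fuse   # hypothetical FUSE image
    imageTag: "v1.0"
    imagePullPolicy: IfNotPresent
    command:
      - "/usr/local/bin/mount-nfs.sh"      # hypothetical mount script
    livenessProbe:                         # assumes probe support (see above)
      exec:
        command: ["ls", "/runtime-mnt"]    # hypothetical mount root
      periodSeconds: 10
      failureThreshold: 3
```
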
---

## Common Production Checklist

1. **Resource Quotas**: Never run workers without `limits`. A caching engine will naturally try to consume all available resources; see the sketch below.
2. **Pull Secrets**: If your images live in a private registry, define `imagePullSecrets` at the spec level so the Master, Worker, and Fuse pods can all pull successfully.
3. **Tiered Locality**: Use `storage-network` labels if your storage and compute sit on separate network planes, to avoid cross-switch bottlenecks.
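
A closing sketch applying items 1 and 2 to a runtime spec. The secret name and resource sizes are illustrative, and the spec-level `imagePullSecrets` placement follows the claim in item 2:

```yaml
# Sketch: hard resource limits plus a spec-level pull secret.
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: demo-dataset
spec:
  imagePullSecrets:
    - name: private-registry-secret   # illustrative secret name
  worker:
    resources:
      limits:
        cpu: "4"
        memory: 16Gi                  # caps the cache engine's appetite
```
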