Detailed Analysis Traces for GPU-Disaggregated Deep Learning Recommendation Models
This dataset records latency-sensitive inference instances from a production GPU-disaggregated serving system for Deep Learning Recommendation Models (DLRMs), accompanying the NSDI'25 paper on GPU-disaggregated serving at scale. It contains per-instance resource reservations and lifecycle timestamps, capturing real-world operational characteristics of a large-scale production environment and supporting analysis of resource allocation patterns, temporal dynamics, scheduling behavior, and capacity planning for latency-sensitive ML workloads.
- Total rows: 23,871
- Unique apps: 156
- Role split: CN 16,485; HN 7,386
Each record includes: instance ID; role (CN or HN); application group; requests and limits for CPU, GPU, RDMA, memory, and disk; a deployment-density cap per node; and creation, scheduling, and deletion timestamps relative to the trace start.
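The schema above can be explored with pandas. The column names below are illustrative placeholders, not the trace's actual header (check the released files before relying on them), shown here on a tiny synthetic sample:

```python
import pandas as pd

# Hypothetical column names mirroring the schema described above.
# Timestamps are in seconds relative to the trace start.
df = pd.DataFrame({
    "instance_id":  ["i1", "i2", "i3"],
    "role":         ["CN", "HN", "CN"],
    "app_group":    ["app_a", "app_b", "app_a"],
    "cpu_request":  [64, 12, 48],
    "gpu_request":  [0, 1, 0],
    "mem_gib":      [320, 80, 240],
    "rdma_percent": [1, 25, 1],
    "max_per_node": [0, 4, 0],    # assumed encoding: 0 = no density cap
    "create_ts":    [0, 10, 20],
    "sched_ts":     [0, 11, 20],
    "delete_ts":    [100, 200, 150],
})

# Role split, analogous to the CN/HN counts reported above
print(df["role"].value_counts().to_dict())
```

On the full trace, the same `value_counts` call reproduces the CN/HN split in the summary statistics.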
- All workloads are marked latency-sensitive. Instances are typically long-running with high priority, as stated by the authors.
- Scheduling-delay and runtime distributions are included in the figures. Concurrency over time gives a view of system load.
- RDMA percentage and the maximum number of instances per node expose placement constraints that influence packing on heterogeneous nodes.
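Concurrency over time can be recovered from the lifecycle timestamps with a simple sweep line: +1 at each scheduling event, -1 at each deletion, then a cumulative sum. A minimal sketch on synthetic data (the column names `sched_ts` and `delete_ts` are assumptions):

```python
import pandas as pd

# Hypothetical lifecycle columns: an instance is live from its
# scheduling timestamp until its deletion timestamp.
df = pd.DataFrame({
    "sched_ts":  [0, 5, 10, 12],
    "delete_ts": [20, 15, 30, 14],
})

# Sweep line: +1 when an instance starts, -1 when it ends.
events = pd.concat([
    pd.Series(1, index=df["sched_ts"]),
    pd.Series(-1, index=df["delete_ts"]),
]).sort_index()

# Cumulative sum over time-ordered events = concurrency at each event time.
concurrency = events.groupby(level=0).sum().cumsum()
print(concurrency.max())  # peak number of simultaneously live instances
```

The resulting series, plotted against its index, gives the concurrency-over-time view of system load mentioned above.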
- Total Instances: 23,871 inference instances
- Services: 156 unique inference services (applications)
- Workload Type: 100% Latency-Sensitive (LS) workloads
- Priority Level: High-priority, long-running inference instances
- System Architecture: GPU-disaggregated architecture separating CPU and GPU resources
- CPU Nodes (CN): 16,485 instances (69.1%)
- Pure CPU-based inference workloads
- No GPU allocation
- Lower RDMA requirements (mean: 3.4%)
- Heterogeneous GPU Nodes (HN): 7,386 instances (30.9%)
- GPU-accelerated inference workloads
- All instances allocated exactly 1 GPU
- Higher RDMA requirements (mean: 20.5%)
- CN Instances: Average 55.9 cores (range: 2-96 cores)
- HN Instances: Average 13.1 cores (range: 2-64 cores)
- Clear separation between CPU-intensive (CN) and GPU-accelerated (HN) workloads
- CN Instances: Average 286.5 GiB (range: 4-600 GiB)
- HN Instances: Average 91.4 GiB (range: 4-460 GiB)
- Strong correlation (0.97) between CPU and memory allocation
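The reported CPU-memory correlation is a plain Pearson correlation over per-instance requests. A sketch on synthetic data (column names `cpu_request` and `mem_gib` are assumptions about the trace header):

```python
import pandas as pd

# Hypothetical per-instance requests; memory tracks CPU closely,
# mimicking the tight coupling reported for the full trace.
df = pd.DataFrame({
    "cpu_request": [64, 48, 12, 96, 8],
    "mem_gib":     [320, 240, 80, 480, 40],
})

# Pearson correlation between CPU and memory requests;
# the full trace reports ~0.97.
r = df["cpu_request"].corr(df["mem_gib"])
print(f"{r:.3f}")
```

A correlation this strong suggests memory is provisioned roughly proportionally to CPU, so one dimension largely determines the other when packing CN instances.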
- 30.9% of instances require GPU acceleration
- All GPU instances allocated exactly 1 GPU
- GPU instances exclusively on HN nodes
- Critical for disaggregated architecture communication
- CN: Lower bandwidth needs (1-50% of RNIC)
- HN: Higher bandwidth needs (1-100% of RNIC)
- Reflects data movement patterns in disaggregated systems
- CN Instances: Average 269.9 GiB disk space
- HN Instances: Average 649.0 GiB disk space
- HN instances require 2.4x more storage (likely for model caching)
- Captures instance lifecycle events: creation, scheduling, and deletion
- Includes both pre-existing and newly created instances
- Represents production system behavior over an extended period
- Most instances scheduled immediately upon creation
- Minimal scheduling delays observed
- Reflects optimized resource allocation in production
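Scheduling delay is simply the gap between the creation and scheduling timestamps. A sketch on synthetic data (the timestamp column names are assumptions):

```python
import pandas as pd

# Hypothetical timestamp columns, in seconds relative to the trace start.
df = pd.DataFrame({
    "create_ts": [0, 10, 20, 25],
    "sched_ts":  [0, 10, 22, 25],
})

# Per-instance scheduling delay and the fraction scheduled immediately.
delay = df["sched_ts"] - df["create_ts"]
print((delay == 0).mean())   # fraction with zero delay
print(delay.describe())      # delay distribution summary
```

On the full trace, this fraction and distribution underpin the "most instances scheduled immediately" observation above.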
- Mix of short-lived and long-running instances
- Characteristic of production inference workloads
- Supports both batch and real-time serving patterns
- 56.7% of instances have no deployment density constraints
- Remaining instances have limits ranging from 1-32 instances per node
- Application-specific anti-affinity rules for fault tolerance
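The density cap acts like an anti-affinity rule: a scheduler must not co-locate more than the cap's worth of an application's instances on one node. A sketch of measuring how many instances are unconstrained, assuming (this encoding is a guess) that a cap of 0 means "no limit":

```python
import pandas as pd

# Hypothetical column: max_per_node, with 0 assumed to mean no cap.
df = pd.DataFrame({
    "instance_id":  ["i1", "i2", "i3", "i4"],
    "max_per_node": [0, 4, 1, 0],
})

# Fraction of instances without a density constraint;
# the full trace reports 56.7%.
unconstrained = (df["max_per_node"] == 0).mean()
print(f"{unconstrained:.1%} without a density cap")
```

A cap of 1 is the strictest case: at most one instance of that application per node, which spreads replicas across failure domains.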
Top 3 most frequent configurations:
- Profile 1 (8,236 instances): CN with 64 CPU cores, 320 GiB memory, 1% RDMA
- Profile 2 (3,593 instances): HN with 12 CPU cores, 1 GPU, 80 GiB memory, 25% RDMA
- Profile 3 (1,919 instances): CN with 48 CPU cores, 240 GiB memory, 1% RDMA
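Profiles like the three above fall out of counting distinct resource-request tuples. A sketch on synthetic rows (the column names are placeholders for whatever the trace actually uses):

```python
import pandas as pd

# Hypothetical resource-request columns; a "profile" is a distinct
# (role, cpu, gpu, memory, rdma) combination.
df = pd.DataFrame({
    "role":         ["CN", "CN", "HN", "CN", "HN"],
    "cpu_request":  [64, 64, 12, 48, 12],
    "gpu_request":  [0, 0, 1, 0, 1],
    "mem_gib":      [320, 320, 80, 240, 80],
    "rdma_percent": [1, 1, 25, 1, 25],
})

# Count occurrences of each distinct configuration, most frequent first.
profiles = df.value_counts(
    ["role", "cpu_request", "gpu_request", "mem_gib", "rdma_percent"]
).head(3)
print(profiles)
```

Run on the full trace, the head of this series yields the three dominant profiles listed above.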
- Clear bimodal distribution between CPU and GPU workloads
- CN instances optimized for CPU-intensive operations
- HN instances balanced for GPU acceleration with supporting CPU resources
- Tight coupling between CPU and memory allocation (correlation: 0.97)
- Independent scaling of GPU resources from CPU/memory
- RDMA bandwidth scaled based on disaggregation communication needs
- All workloads classified as latency-sensitive
- High-priority, long-running inference services
- Immediate scheduling indicates sufficient resource availability
- Efficient resource utilization through separation of concerns
- CN nodes handle CPU-intensive preprocessing/postprocessing
- HN nodes focus on GPU-accelerated model inference
- RDMA enables efficient data movement between disaggregated components
This dataset enables research in:
- Resource Allocation: Optimal scheduling strategies for disaggregated systems
- Performance Modeling: Understanding latency-throughput tradeoffs
- System Design: Architectural decisions for ML serving infrastructure
- Workload Characterization: Production DLRM inference patterns
- Capacity Planning: Resource provisioning for ML workloads
- Fault Tolerance: Instance distribution and anti-affinity strategies
This dataset represents one of the first publicly available production traces for GPU-disaggregated DLRM serving, providing:
- Real-world validation data for system research
- Baseline for performance comparisons
- Foundation for reproducible research in ML systems
- Insights into production-scale ML infrastructure
This dataset provides a unique window into production GPU-disaggregated systems, offering researchers and practitioners valuable insights for advancing the field of large-scale ML serving infrastructure.