Based on our comprehensive benchmarking suite (v0.3.1), here's a detailed analysis of performance characteristics and optimization opportunities.
LWW Register Operations:
- Creation (1 item): ~1.05 µs (microsecond range)
- Creation (10 items): ~12.0 µs (linear scaling)
- Creation (100 items): ~113.5 µs (linear scaling)
- Creation (1000 items): ~975.8 µs (linear scaling)
LWW Register Merges:
- Merge (1 item): ~24.4 ns (nanosecond range)
- Merge (10 items): ~288.2 ns (nanosecond range)
- Merge (100 items): ~2.47 µs (microsecond range)
- Merge (1000 items): ~23.1 µs (microsecond range)
GCounter Operations:
- Increment (10 items): ~110.6 ns (nanosecond range)
- Increment (100 items): ~1.26 µs (microsecond range)
- Increment (1000 items): ~11.7 µs (microsecond range)
- Increment (10000 items): ~119.8 µs (microsecond range)
Insert Operations:
- 10 items: ~1.63 µs
- 100 items: ~13.6 µs
- 1000 items: ~144.7 µs
- 10000 items: ~1.44 ms
Get Operations:
- 10 items: ~3.94 ns (nanosecond range)
- 100 items: ~42.5 ns (nanosecond range)
- 1000 items: ~382.7 ns (nanosecond range)
Warning Signs:
The benchmark harness (Criterion) reported:
```
Warning: Unable to complete 100 samples in 5.0s.
You may wish to increase target time to 5.2s, enable flat sampling,
or reduce sample count to 60.
```
Affected Operations:
- LWW Register creation (1000 items): Target time exceeded
- LWW Map operations (10000 items): Target time exceeded (see the configuration sketch below)
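The warning maps directly onto Criterion's per-group configuration; a minimal sketch of the remedies it suggests (the group and benchmark names are illustrative, not the suite's actual ones):

```rust
use std::time::Duration;
use criterion::{criterion_group, criterion_main, Criterion, SamplingMode};

fn bench_lww_creation(c: &mut Criterion) {
    let mut group = c.benchmark_group("lww_register_creation");
    // Raise the target time beyond the 5.0s default...
    group.measurement_time(Duration::from_secs(10));
    // ...or reduce the sample count, as the warning suggests.
    group.sample_size(60);
    // Flat sampling runs the same iteration count for every sample,
    // which suits slower, microsecond-scale operations.
    group.sampling_mode(SamplingMode::Flat);
    group.bench_function("1000_items", |b| {
        b.iter(|| {
            // benchmark body elided
        });
    });
    group.finish();
}

criterion_group!(benches, bench_lww_creation);
criterion_main!(benches);
```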
Root Cause Analysis:
- Memory Allocation: Large vector allocations for bulk operations
- Serialization Overhead: JSON serialization scales poorly with size
- Hash Map Resizing: HashMap reallocation during bulk inserts (see the pre-allocation sketch below)
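Where a batch's final size is known up front, pre-sizing the containers removes the reallocation and rehashing cost; a minimal sketch (`load_batch` is an illustrative helper, not an existing API):

```rust
use std::collections::HashMap;

/// Bulk-load a batch without intermediate reallocations:
/// `with_capacity` reserves all buckets up front, so the table is
/// never rehashed while the batch is inserted.
fn load_batch(items: Vec<(String, Vec<u8>)>) -> HashMap<String, Vec<u8>> {
    let mut map = HashMap::with_capacity(items.len());
    for (key, value) in items {
        map.insert(key, value);
    }
    map
}
```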
Storage Performance Drop:
Set Operations:
- Set (10 items): ~2.99 µs
- Set (100 items): ~235.5 µs (78x slower)
- Set (1000 items): ~4.29 ms (1435x slower)
Get Operations:
- Get (10 items): ~427.1 ns
- Get (100 items): ~8.55 µs (20x slower)
- Get (1000 items): ~541.8 µs (1269x slower)
Root Cause:
- Linear Search: O(n) complexity in storage implementations (see the index sketch below)
- Memory Fragmentation: Poor memory locality for large datasets
- Serialization Bottleneck: JSON encoding/decoding overhead
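A side index keyed by id turns that O(n) scan into an average O(1) hash lookup; a sketch with illustrative types (`FlatStore` is not an existing type in the crate):

```rust
use std::collections::HashMap;

struct FlatStore {
    // O(n): a get that scans this vector touches every entry.
    entries: Vec<(String, Vec<u8>)>,
    // O(1) average: id -> position in `entries`.
    by_id: HashMap<String, usize>,
}

impl FlatStore {
    fn get(&self, id: &str) -> Option<&[u8]> {
        let idx = *self.by_id.get(id)?;
        self.entries.get(idx).map(|(_, value)| value.as_slice())
    }
}
```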
Memory Allocation Times:
- LWW Register: ~9.53 µs
- LWW Map: ~130.2 µs (13.7x slower)
- GCounter: ~15.9 µs
Analysis:
- HashMap Overhead: LWW Map has significant memory allocation cost
- Metadata Storage: Additional metadata increases memory footprint
- Fragmentation: Large data structures cause memory fragmentation
Batch Operations:
```rust
// Current: individual operations, one awaited call per item
for item in items {
    collection.insert(&item.id, &item).await?;
}

// Optimized: batch operations
let batch: Vec<_> = items.into_iter()
    .map(|item| (item.id, item))
    .collect();
collection.insert_batch(batch).await?;
```

Memory Pooling:
```rust
use std::sync::Arc;
use parking_lot::Mutex;

struct CRDTMemoryPool {
    lww_registers: Arc<Mutex<Vec<LwwRegister<String>>>>,
    lww_maps: Arc<Mutex<Vec<LwwMap<String, String>>>>,
    gcounters: Arc<Mutex<Vec<GCounter>>>,
}

impl CRDTMemoryPool {
    /// Reuse a pooled register, or allocate fresh if the pool is empty.
    fn get_lww_register(&self) -> LwwRegister<String> {
        self.lww_registers.lock().pop()
            .unwrap_or_else(|| LwwRegister::new(String::new(), ReplicaId::default()))
    }

    /// Return a register to the pool so its allocation can be reused.
    fn return_lww_register(&self, register: LwwRegister<String>) {
        self.lww_registers.lock().push(register);
    }
}
```

Serialization:
```rust
// Use bincode instead of JSON for better performance.
use bincode::{serialize, deserialize};
use serde::{de::DeserializeOwned, Serialize};

// Assumes CRDT<T> itself derives Serialize/Deserialize.
impl<T: Serialize + DeserializeOwned> CRDT<T> {
    fn to_bytes(&self) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
        Ok(serialize(self)?)
    }

    fn from_bytes(bytes: &[u8]) -> Result<Self, Box<dyn std::error::Error>> {
        Ok(deserialize(bytes)?)
    }
}
```

Indexed Storage:
```rust
use std::collections::{BTreeMap, HashMap};
struct IndexedStorage {
    primary: HashMap<String, Vec<u8>>,
    indices: BTreeMap<String, Vec<String>>, // "index_name:value" -> [id]
}

impl IndexedStorage {
    /// Resolve ids through the secondary index instead of scanning primary.
    fn get_by_index(&self, index_name: &str, value: &str) -> Vec<String> {
        self.indices.get(&format!("{}:{}", index_name, value))
            .cloned()
            .unwrap_or_default()
    }
}
```

Lazy Loading:
```rust
use std::sync::Arc;
use serde::de::DeserializeOwned;

struct LazyCRDT<T> {
    id: String,
    data: Option<T>,
    storage: Arc<dyn Storage>,
}

impl<T: DeserializeOwned> LazyCRDT<T> {
    /// Deserialize from storage on first access; later calls hit memory.
    async fn get(&mut self) -> Result<&T, Box<dyn std::error::Error>> {
        if self.data.is_none() {
            let bytes = self.storage.get(&self.id).await?;
            self.data = Some(bincode::deserialize(&bytes)?);
        }
        Ok(self.data.as_ref().unwrap())
    }
}
```

Compression:
```rust
use flate2::{Compress, Decompress, Compression};
struct CompressedStorage {
    inner: Box<dyn Storage>,
    compression_level: Compression,
}

impl CompressedStorage {
    async fn set_compressed(&self, key: &str, value: &[u8]) -> Result<(), Box<dyn std::error::Error>> {
        let mut compress = Compress::new(self.compression_level, false);
        // compress_vec writes into the Vec's spare capacity, so reserve
        // room up front; Vec::new() would give it nothing to write into.
        let mut compressed = Vec::with_capacity(value.len() + 64);
        compress.compress_vec(value, &mut compressed, flate2::FlushCompress::Finish)?;
        self.inner.set(key, &compressed).await?;
        Ok(())
    }
}
```
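Reads need the matching decompression step, which the sketch above omits; a possible counterpart in the same style (not part of the current API):

```rust
impl CompressedStorage {
    async fn get_decompressed(&self, key: &str) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
        let compressed = self.inner.get(key).await?;
        let mut decompress = Decompress::new(false);
        // Like compress_vec, decompress_vec writes into spare capacity;
        // real code would grow the buffer and loop until Status::StreamEnd.
        let mut output = Vec::with_capacity(compressed.len() * 4);
        decompress.decompress_vec(&compressed, &mut output, flate2::FlushDecompress::Finish)?;
        Ok(output)
    }
}
```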
Streaming CRDTs:
```rust
use tokio::sync::mpsc;

struct StreamingCRDT<T> {
    sender: mpsc::Sender<CRDTOperation<T>>,
    receiver: mpsc::Receiver<CRDTOperation<T>>,
}

enum CRDTOperation<T> {
    Insert(String, T),
    Update(String, T),
    Delete(String),
    Merge(String, T),
}

impl<T> StreamingCRDT<T> {
    /// Drain queued operations; the bounded channel provides backpressure.
    async fn process_operations(&mut self) {
        while let Some(op) = self.receiver.recv().await {
            match op {
                CRDTOperation::Insert(key, value) => {
                    // Process insert
                }
                CRDTOperation::Update(key, value) => {
                    // Process update
                }
                // Delete and Merge are handled the same way
                _ => {}
            }
        }
    }
}
```

Hierarchical CRDTs:
```rust
struct HierarchicalCRDT<T> {
    root: LwwRegister<T>,
    children: LwwMap<String, Box<HierarchicalCRDT<T>>>,
}

impl<T> HierarchicalCRDT<T> {
    /// Walk the tree along `path`; an empty path resolves to the root value.
    fn get_path(&self, path: &[String]) -> Option<&T> {
        if path.is_empty() {
            Some(self.root.value())
        } else {
            let child = self.children.get(&path[0])?;
            child.get_path(&path[1..])
        }
    }
}
```

Performance Targets:
- LWW Register Creation (1000 items): < 500 µs (2x improvement)
- Storage Set (1000 items): < 2 ms (2x improvement)
- Memory Allocation: < 50% reduction in allocation time
- Large Scale Operations: Support 100k+ items efficiently
- Memory Usage: 30% reduction in memory footprint
- Serialization: 5x improvement in large data serialization
- Real-time Performance: < 1ms latency for 10k+ item operations
- Memory Efficiency: 50% reduction in memory usage
- Scalability: Support 1M+ items with linear performance
Benchmark Comparison:
```bash
# Before optimization: save a baseline
cargo bench --package leptos-sync-core --bench crdt_benchmarks -- --save-baseline before

# After optimization: Criterion reports the change against the saved baseline
cargo bench --package leptos-sync-core --bench crdt_benchmarks -- --baseline before
```

Memory Profiling:
```bash
# Install memory profiler
cargo install memory-profiler

# Profile memory usage
memory-profiler --output memory-profile.json cargo bench
```

Regression Testing:
```rust
#[test]
fn test_performance_regression() {
    let start = std::time::Instant::now();

    // 1000 merges against a fresh register should stay under 1 ms
    let mut crdt = LwwRegister::new("test".to_string(), ReplicaId::default());
    for i in 0..1000 {
        crdt.merge(&LwwRegister::new(format!("value_{}", i), ReplicaId::default())).unwrap();
    }

    let duration = start.elapsed();
    assert!(duration < std::time::Duration::from_millis(1),
        "Performance regression: operation took {:?}", duration);
}
```

Optimization Recommendations:
Phase 1 (immediate impact):
- Bulk Operations: Implement batch insert/update operations
- Serialization: Switch from JSON to bincode for internal operations
- Memory Pool: Implement object pooling for frequently used CRDTs
- Indexed Storage: Add secondary indices for faster lookups
Phase 2 (sustained improvements):
- Lazy Loading: Implement on-demand data loading
- Compression: Add compression for large data sets
- Streaming CRDTs: Implement async streaming operations
- Hierarchical CRDTs: Add tree-based data structures
- Advanced Caching: Implement intelligent cache invalidation
Monitoring Metrics:
- Operation Latency: P50, P95, P99 percentiles (see the percentile sketch below)
- Memory Usage: Peak and average memory consumption
- Throughput: Operations per second at various scales
- Error Rates: Failed operations and retry counts
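The latency percentiles can be computed from a window of raw samples with nothing beyond the standard library; a minimal nearest-rank sketch:

```rust
use std::time::Duration;

/// Nearest-rank percentile over a window of latency samples.
/// `p` is in (0.0, 100.0]; returns None for an empty window.
/// e.g. percentile(&mut window, 99.0) gives the P99.
fn percentile(samples: &mut [Duration], p: f64) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_unstable();
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    Some(samples[rank.saturating_sub(1).min(samples.len() - 1)])
}
```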
Alert Thresholds (encoded in the sketch below):
- Latency: > 10 ms for single operations
- Memory: > 100 MB for 1k items
- Throughput: < 1,000 ops/sec for 10k items
- Errors: > 1% failure rate
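These thresholds reduce to a single alert predicate; a sketch with this section's numbers as illustrative defaults (the struct is hypothetical, not an existing API):

```rust
use std::time::Duration;

struct AlertThresholds {
    max_latency: Duration, // alert when a single operation exceeds 10 ms
    max_memory_bytes: u64, // alert when 1k items exceed 100 MB
    min_throughput: f64,   // alert when throughput drops below 1,000 ops/sec
    max_error_rate: f64,   // alert when more than 1% of operations fail
}

impl AlertThresholds {
    fn breached(&self, latency: Duration, memory_bytes: u64, throughput: f64, error_rate: f64) -> bool {
        latency > self.max_latency
            || memory_bytes > self.max_memory_bytes
            || throughput < self.min_throughput
            || error_rate > self.max_error_rate
    }
}

// Illustrative defaults drawn from the thresholds above.
impl Default for AlertThresholds {
    fn default() -> Self {
        Self {
            max_latency: Duration::from_millis(10),
            max_memory_bytes: 100 * 1024 * 1024,
            min_throughput: 1_000.0,
            max_error_rate: 0.01,
        }
    }
}
```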
Next Steps: Start with Phase 1 optimizations for immediate impact, then move to Phase 2 for sustained improvements. Monitor performance metrics continuously to ensure optimizations deliver expected results.