diff --git a/data-raw/plan-v2.md b/data-raw/plan-v2.md new file mode 100644 index 0000000..eeec274 --- /dev/null +++ b/data-raw/plan-v2.md @@ -0,0 +1,223 @@ +# Plan: Profile Data Structure v2 + +## Motivation + +The current profile data structure (v1) was designed before `dm` existed and has several limitations: + +1. **Pre-aggregation**: Identical stack traces are collapsed with a `value` count, losing individual sample identity. This prevents attaching per-sample metadata (timestamps, memory snapshots). +2. **Nested list columns**: `samples$locations` stores a list of tibbles, which complicates querying and is not representable in a relational model without unnesting. +3. **No memory profiling support**: The format has no type-stable way to store optional memory profiling data alongside time samples. +4. **No provenance**: There is no way to identify the source of profiling data, making it impossible to combine profiles from multiple runs while tracking origin. +5. **Container**: The primary container is a named list (`profile_data`), rather than a `dm` object with built-in constraint checking. + +## Requirements + +1. Store raw data faithfully, without pre-aggregation +2. Allow optional memory profiling alongside time profiling, in a type-stable way (long form) +3. Use `dm` as the primary container +4. Store the source/provenance of profiling data +5. Versioned format with conversion from v1 (legacy) + +## Current Data Model (v1) + +``` +profile_data (named list of tibbles) +├── meta: key (chr), value (chr) +├── sample_types: type (chr), unit (chr) +├── samples: value (int), locations (list of tibbles with location_id) +├── locations: location_id (int), function_id (int), line (int) +└── functions: function_id (int), name (chr), system_name (chr), filename (chr), start_line (int) +``` + +Key issues: +- `samples$locations` is a nested list column — not relational +- `samples$value` aggregates identical traces — loses individual samples +- No provenance tracking +- `dm_from_profile()` already unnests into a bridge table (`samples_locations`) — this should be the canonical form + +## Proposed Data Model (v2) + +The new data structure uses a `dm` object as the primary container with six tables: + +``` +dm object +├── meta: key (chr), value (chr) +├── sources: source_id (int), source_type (chr), source_uri (chr), source_timestamp (dbl) +├── samples: sample_id (int), source_id (int) +├── sample_values: sample_id (int), type (chr), unit (chr), value (dbl) +├── sample_locations: sample_id (int), depth (int), location_id (int) +├── locations: location_id (int), function_id (int), line (int) +└── functions: function_id (int), name (chr), system_name (chr), filename (chr), start_line (int) +``` + +### Table descriptions + +#### `meta` + +Unchanged from v1, except the version value becomes `"2.0"`. + +| Column | Type | Description | +|--------|------|-------------| +| `key` | chr | Metadata key | +| `value`| chr | Metadata value | + +Required row: `key = "version"`, `value = "2.0"`. + +#### `sources` + +New table. Tracks the provenance of profiling data so that multiple profiles can be combined. + +| Column | Type | Description | +|--------|------|-------------| +| `source_id` | int | Primary key | +| `source_type` | chr | Origin type: `"rprof"`, `"pprof"`, `"manual"`, etc. | +| `source_uri` | chr | File path, URL, or identifier of the source | +| `source_timestamp` | dbl | Epoch timestamp (seconds since 1970-01-01 UTC) when the profile was captured, `NA` if unknown | + +Primary key: `source_id`. + +#### `samples` + +One row per raw sample (no aggregation). Each sample is linked to a source. + +| Column | Type | Description | +|--------|------|-------------| +| `sample_id` | int | Primary key | +| `source_id` | int | Foreign key → `sources` | + +Primary key: `sample_id`. +Foreign key: `source_id` → `sources.source_id`. + +#### `sample_values` + +Long-form table for sample measurements. This is the type-stable way to store both time and memory profiling data. + +| Column | Type | Description | +|--------|------|-------------| +| `sample_id` | int | Foreign key → `samples` | +| `type` | chr | Measurement type: `"samples"`, `"alloc_size"`, `"dealloc_size"`, etc. | +| `unit` | chr | Measurement unit: `"count"`, `"bytes"`, etc. | +| `value` | dbl | The measured value | + +Foreign key: `sample_id` → `samples.sample_id`. +Compound key: (`sample_id`, `type`). + +This replaces the separate `sample_types` table and the `samples$value` column from v1. Each sample can have multiple measurements (e.g., one row for the time sample count and one row for allocated bytes), enabling optional memory profiling in a type-stable long form. + +#### `sample_locations` + +Bridge table linking samples to their stack trace locations. Replaces the nested list column `samples$locations` from v1. + +| Column | Type | Description | +|--------|------|-------------| +| `sample_id` | int | Foreign key → `samples` | +| `depth` | int | Position in the stack trace (1 = innermost) | +| `location_id` | int | Foreign key → `locations` | + +Foreign key: `sample_id` → `samples.sample_id`. +Foreign key: `location_id` → `locations.location_id`. +Compound key: (`sample_id`, `depth`). + +#### `locations` + +Unchanged from v1. + +| Column | Type | Description | +|--------|------|-------------| +| `location_id` | int | Primary key | +| `function_id` | int | Foreign key → `functions`, `NA` allowed | +| `line` | int | Source line number, 0 if unknown, `NA` allowed | + +#### `functions` + +Unchanged from v1. + +| Column | Type | Description | +|--------|------|-------------| +| `function_id` | int | Primary key | +| `name` | chr | Demangled function name | +| `system_name` | chr | Mangled/raw function name | +| `filename` | chr | Source file name | +| `start_line` | int | Start line in source file, 0 if unknown | + +### Relational diagram + +``` +sources 1──* samples 1──* sample_values + 1──* sample_locations *──1 locations *──1 functions +``` + +### Key design decisions + +1. **No pre-aggregation**: Each sample gets its own row in `samples`. Aggregation (run-length encoding of identical traces) is a downstream concern for analysis, not storage. + +2. **Long-form sample values**: The `sample_values` table stores measurements in long form. A time-only profile has one row per sample; a profile with memory data has multiple rows per sample. This is type-stable (always the same columns) and extensible (new measurement types require no schema changes). + +3. **Bridge table for locations**: The `sample_locations` table with a `depth` column replaces the nested list column approach. This is fully relational and works naturally with `dm`. + +4. **Provenance via `sources`**: Each sample is linked to a `source`, enabling combination of multiple profiles. The `source_type` and `source_uri` columns provide traceability. + +5. **`dm` as primary container**: The `dm` object is the canonical format, with primary and foreign keys defined. The `profile_data` class wraps or is replaced by `dm`. + +## Migration: v1 → v2 + +### Conversion function + +`profile_v2_from_v1(x)` converts a v1 `profile_data` object to a v2 `dm` object: + +1. Create a single `sources` row with `source_type` inferred from the hidden `.rprof` or `.msg` component, `source_uri = NA` (not stored in v1). +2. Expand `samples`: unnest the `value` column by repeating each sample `value` times, assigning unique `sample_id`s. +3. Create `sample_values`: one row per sample with `type = "samples"`, `unit = "count"`, `value = 1`. +4. Create `sample_locations`: unnest `samples$locations`, adding `depth` based on row position within each sample. +5. Copy `locations` and `functions` unchanged. +6. Set `meta$value` to `"2.0"`. + +### Version detection + +`validate_profile()` inspects `meta$version` to determine which validation rules to apply. The v1 validation remains for backward compatibility. + +### Backward compatibility + +- `read_rprof()` and `read_pprof()` return v2 format by default, with an option to return v1 for backward compatibility. +- `write_rprof()` and `write_pprof()` accept both v1 and v2 formats. +- `dm_from_profile()` is retained for v1 objects; v2 objects are already `dm` objects. + +## Implementation plan + +### Phase 1: Foundation + +- [ ] Add `dm` to `Imports` (currently in `Suggests`) +- [ ] Define `new_profile_v2()` constructor that creates a `dm` with all keys +- [ ] Define `validate_profile_v2()` validation for the v2 format +- [ ] Implement `profile_v2_from_v1()` conversion function + +### Phase 2: Reader updates + +- [ ] Update `rprof_to_ds()` to produce v2 format directly +- [ ] Update `msg_to_ds()` to produce v2 format directly +- [ ] Add `source_uri` parameter to `read_rprof()` and `read_pprof()` +- [ ] Store provenance in the `sources` table + +### Phase 3: Writer updates + +- [ ] Update `write_rprof()` to accept v2 format +- [ ] Update `write_pprof()` to accept v2 format +- [ ] Handle aggregation (collapsing identical traces) in the writer layer + +### Phase 4: Memory profiling support + +- [ ] Support reading Rprof memory profiling data into `sample_values` +- [ ] Support writing memory profiling data to pprof format + +### Phase 5: Profile combination + +- [ ] Implement `combine_profiles()` to merge multiple v2 profiles +- [ ] Ensure `source_id` and other IDs are remapped to avoid conflicts +- [ ] Preserve provenance across combinations + +## Open questions + +1. Should `dm` move from `Suggests` to `Imports`? This adds a dependency but makes the v2 format first-class. Alternative: keep `dm` in `Suggests` and use a plain list internally, constructing `dm` on demand. +2. Should the `profile_data` class be retained as a wrapper around `dm`, or should v2 objects simply be `dm` objects with a subclass? +3. Should `sample_values` use `dbl` for `value` to accommodate both counts and byte sizes, or should it be polymorphic (e.g., a `vctrs` type)? +4. Should `depth` in `sample_locations` be 1-indexed (R convention) or 0-indexed (pprof convention)?