The current profile data structure (v1) was designed before dm existed and has several limitations:
- Pre-aggregation: Identical stack traces are collapsed with a `value` count, losing individual sample identity. This prevents attaching per-sample metadata (timestamps, memory snapshots).
- Nested list columns: `samples$locations` stores a list of tibbles, which complicates querying and is not representable in a relational model without unnesting.
- No memory profiling support: The format has no type-stable way to store optional memory profiling data alongside time samples.
- No provenance: There is no way to identify the source of profiling data, making it impossible to combine profiles from multiple runs while tracking origin.
- Container: The primary container is a named list (`profile_data`), rather than a `dm` object with built-in constraint checking.

The v2 format is designed to meet the following goals:
- Store raw data faithfully, without pre-aggregation
- Allow optional memory profiling alongside time profiling, in a type-stable way (long form)
- Use `dm` as the primary container
- Store the source/provenance of profiling data
- Versioned format with conversion from v1 (legacy)
profile_data (named list of tibbles)
├── meta: key (chr), value (chr)
├── sample_types: type (chr), unit (chr)
├── samples: value (int), locations (list of tibbles with location_id)
├── locations: location_id (int), function_id (int), line (int)
└── functions: function_id (int), name (chr), system_name (chr), filename (chr), start_line (int)
Key issues:
- `samples$locations` is a nested list column, which is not relational
- `samples$value` aggregates identical traces and loses individual samples
- No provenance tracking
- `dm_from_profile()` already unnests into a bridge table (`samples_locations`); this should be the canonical form
The new data structure uses a `dm` object as the primary container with seven tables:
dm object
├── meta: key (chr), value (chr)
├── sources: source_id (int), source_type (chr), source_uri (chr), source_timestamp (dbl)
├── samples: sample_id (int), source_id (int)
├── sample_values: sample_id (int), type (chr), unit (chr), value (dbl)
├── sample_locations: sample_id (int), depth (int), location_id (int)
├── locations: location_id (int), function_id (int), line (int)
└── functions: function_id (int), name (chr), system_name (chr), filename (chr), start_line (int)
The `meta` table is unchanged from v1, except that the version value becomes "2.0".
| Column | Type | Description |
|---|---|---|
| `key` | `chr` | Metadata key |
| `value` | `chr` | Metadata value |
Required row: key = "version", value = "2.0".
The `sources` table is new. It tracks the provenance of profiling data so that multiple profiles can be combined.
| Column | Type | Description |
|---|---|---|
| `source_id` | `int` | Primary key |
| `source_type` | `chr` | Origin type: "rprof", "pprof", "manual", etc. |
| `source_uri` | `chr` | File path, URL, or identifier of the source |
| `source_timestamp` | `dbl` | Epoch timestamp (seconds since 1970-01-01 UTC) when the profile was captured, NA if unknown |
Primary key: source_id.
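As a minimal sketch, a `sources` table for a single Rprof capture could look like the data frame below. It uses a plain base R data frame rather than a `dm` table so it runs without extra packages. The file name `profile.out` and the capture time are made-up values for illustration.

```r
# Hypothetical example: one provenance row for an Rprof output file.
# "profile.out" and the timestamp are illustrative values only.
sources <- data.frame(
  source_id = 1L,
  source_type = "rprof",
  source_uri = "profile.out",
  # Epoch seconds stored as a double; use NA_real_ when the time is unknown
  source_timestamp = as.numeric(as.POSIXct("2024-01-01 00:00:00", tz = "UTC")),
  stringsAsFactors = FALSE
)

# The primary key must be non-missing and unique
stopifnot(!anyNA(sources$source_id), anyDuplicated(sources$source_id) == 0L)
```

Storing the timestamp as a double keeps the column type-stable whether or not the capture time is known.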
The `samples` table has one row per raw sample (no aggregation). Each sample is linked to a source.
| Column | Type | Description |
|---|---|---|
| `sample_id` | `int` | Primary key |
| `source_id` | `int` | Foreign key → sources |
Primary key: sample_id.
Foreign key: source_id → sources.source_id.
The `sample_values` table is a long-form table for sample measurements. This is the type-stable way to store both time and memory profiling data.
| Column | Type | Description |
|---|---|---|
| `sample_id` | `int` | Foreign key → samples |
| `type` | `chr` | Measurement type: "samples", "alloc_size", "dealloc_size", etc. |
| `unit` | `chr` | Measurement unit: "count", "bytes", etc. |
| `value` | `dbl` | The measured value |
Foreign key: sample_id → samples.sample_id.
Compound key: (sample_id, type).
This replaces the separate sample_types table and the samples$value column from v1. Each sample can have multiple measurements (e.g., one row for the time sample count and one row for allocated bytes), enabling optional memory profiling in a type-stable long form.
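To illustrate the long form, the sketch below stores a time-only sample next to a sample that also carries an allocation size, then pivots to a wide matrix for analysis. The values are invented, and base R data frames stand in for the `dm` table.

```r
# Long-form measurements for two samples (illustrative values only).
# Sample 1 has time data only; sample 2 also carries an allocation size.
sample_values <- data.frame(
  sample_id = c(1L, 2L, 2L),
  type = c("samples", "samples", "alloc_size"),
  unit = c("count", "count", "bytes"),
  value = c(1, 1, 4096),
  stringsAsFactors = FALSE
)

# The compound key (sample_id, type) must be unique
stopifnot(anyDuplicated(sample_values[c("sample_id", "type")]) == 0L)

# Pivot to a sample-by-type matrix; measurements that are absent become NA
wide <- tapply(
  sample_values$value,
  list(sample_values$sample_id, sample_values$type),
  sum
)
```

The columns of `sample_values` are the same in both cases. Only the number of rows per sample changes when memory data is present.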
The `sample_locations` table is a bridge table linking samples to their stack trace locations. It replaces the nested list column `samples$locations` from v1.
| Column | Type | Description |
|---|---|---|
| `sample_id` | `int` | Foreign key → samples |
| `depth` | `int` | Position in the stack trace (1 = innermost) |
| `location_id` | `int` | Foreign key → locations |
Foreign key: sample_id → samples.sample_id.
Foreign key: location_id → locations.location_id.
Compound key: (sample_id, depth).
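As a rough sketch of how the bridge table is queried, the code below rebuilds the stack of one sample by ordering on `depth` and looking up `locations` and `functions`. The table contents and the helper `stack_of()` are hypothetical and only serve the example.

```r
# Illustrative tables; IDs and function names are made up.
sample_locations <- data.frame(
  sample_id = c(1L, 1L, 1L, 2L, 2L),
  depth = c(1L, 2L, 3L, 1L, 2L),
  location_id = c(3L, 2L, 1L, 2L, 1L)
)
locations <- data.frame(
  location_id = 1:3,
  function_id = 1:3,
  line = c(10L, 20L, 30L)
)
functions <- data.frame(
  function_id = 1:3,
  name = c("main", "helper", "leaf"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: function names of one sample's stack, innermost first
stack_of <- function(id) {
  rows <- sample_locations[sample_locations$sample_id == id, ]
  rows <- rows[order(rows$depth), ]
  loc <- locations[match(rows$location_id, locations$location_id), ]
  functions$name[match(loc$function_id, functions$function_id)]
}

stack_of(1L)
```

Because `depth` is an explicit column, the frame order does not depend on row order in the table.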
The `locations` table is unchanged from v1.
| Column | Type | Description |
|---|---|---|
| `location_id` | `int` | Primary key |
| `function_id` | `int` | Foreign key → functions, NA allowed |
| `line` | `int` | Source line number, 0 if unknown, NA allowed |
The `functions` table is unchanged from v1.
| Column | Type | Description |
|---|---|---|
| `function_id` | `int` | Primary key |
| `name` | `chr` | Demangled function name |
| `system_name` | `chr` | Mangled/raw function name |
| `filename` | `chr` | Source file name |
| `start_line` | `int` | Start line in source file, 0 if unknown |
sources 1──* samples 1──* sample_values
             samples 1──* sample_locations *──1 locations *──1 functions
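The relationships above can be checked by hand on plain data frames. The sketch below is not how `dm` checks constraints. It is a base R approximation with two hypothetical helpers, `pk_ok()` and `fk_ok()`, applied to a tiny made-up profile.

```r
# Hypothetical helpers, not part of any package.
# Primary key: no missing values and no duplicates
pk_ok <- function(key) !anyNA(key) && anyDuplicated(key) == 0L
# Foreign key: every non-missing child value exists in the parent key
fk_ok <- function(child, parent) all(is.na(child) | child %in% parent)

# A tiny illustrative profile with two samples from one source
tables <- list(
  sources = data.frame(source_id = 1L),
  samples = data.frame(sample_id = 1:2, source_id = c(1L, 1L)),
  sample_values = data.frame(sample_id = 1:2, value = c(1, 1)),
  sample_locations = data.frame(
    sample_id = c(1L, 1L, 2L),
    depth = c(1L, 2L, 1L),
    location_id = c(2L, 1L, 1L)
  ),
  locations = data.frame(location_id = 1:2, function_id = c(1L, NA)),
  functions = data.frame(function_id = 1L, name = "main", stringsAsFactors = FALSE)
)

checks <- c(
  sources_pk = pk_ok(tables$sources$source_id),
  samples_pk = pk_ok(tables$samples$sample_id),
  samples_fk = fk_ok(tables$samples$source_id, tables$sources$source_id),
  values_fk = fk_ok(tables$sample_values$sample_id, tables$samples$sample_id),
  stack_fk_samples = fk_ok(tables$sample_locations$sample_id, tables$samples$sample_id),
  stack_fk_locations = fk_ok(tables$sample_locations$location_id, tables$locations$location_id),
  stack_key = anyDuplicated(tables$sample_locations[c("sample_id", "depth")]) == 0L,
  locations_fk = fk_ok(tables$locations$function_id, tables$functions$function_id)
)
```

The `NA` in `locations$function_id` passes the foreign key check on purpose, matching the "NA allowed" note in the `locations` table.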
- No pre-aggregation: Each sample gets its own row in `samples`. Aggregation (run-length encoding of identical traces) is a downstream concern for analysis, not storage.
- Long-form sample values: The `sample_values` table stores measurements in long form. A time-only profile has one row per sample; a profile with memory data has multiple rows per sample. This is type-stable (always the same columns) and extensible (new measurement types require no schema changes).
- Bridge table for locations: The `sample_locations` table with a `depth` column replaces the nested list column approach. This is fully relational and works naturally with `dm`.
- Provenance via `sources`: Each sample is linked to a `source`, enabling combination of multiple profiles. The `source_type` and `source_uri` columns provide traceability.
- `dm` as primary container: The `dm` object is the canonical format, with primary and foreign keys defined. The `profile_data` class wraps or is replaced by `dm`.
`profile_v2_from_v1(x)` converts a v1 `profile_data` object to a v2 `dm` object:
- Create a single `sources` row with `source_type` inferred from the hidden `.rprof` or `.msg` component and `source_uri = NA` (not stored in v1).
- Expand `samples`: unnest the `value` column by repeating each sample `value` times, assigning unique `sample_id`s.
- Create `sample_values`: one row per sample with `type = "samples"`, `unit = "count"`, `value = 1`.
- Create `sample_locations`: unnest `samples$locations`, adding `depth` based on row position within each sample.
- Copy `locations` and `functions` unchanged.
- Set the `value` of the `version` row in `meta` to `"2.0"`.
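The conversion steps can be sketched in base R as follows. This is an approximation under stated assumptions, not the package implementation. It returns a plain list of data frames instead of a `dm`, it takes `source_type` as an argument instead of inferring it from hidden components, and the name `profile_v2_from_v1_sketch()` and the v1 version string "1.0" are made up for the example.

```r
# Hypothetical sketch of the v1 -> v2 conversion on plain data frames.
profile_v2_from_v1_sketch <- function(x, source_type = NA_character_) {
  # Repeat each v1 row index `value` times so every raw sample gets a row
  origin <- rep(seq_len(nrow(x$samples)), times = x$samples$value)
  sample_id <- seq_along(origin)
  n <- length(sample_id)

  # Unnest the list column; depth is the row position within each trace
  per_sample <- lapply(sample_id, function(id) {
    loc <- x$samples$locations[[origin[id]]]
    data.frame(
      sample_id = rep(id, nrow(loc)),
      depth = seq_len(nrow(loc)),
      location_id = loc$location_id
    )
  })

  meta <- x$meta
  meta$value[meta$key == "version"] <- "2.0"

  list(
    meta = meta,
    sources = data.frame(
      source_id = 1L, source_type = source_type,
      source_uri = NA_character_, source_timestamp = NA_real_,
      stringsAsFactors = FALSE
    ),
    samples = data.frame(sample_id = sample_id, source_id = rep(1L, n)),
    sample_values = data.frame(
      sample_id = sample_id, type = rep("samples", n),
      unit = rep("count", n), value = rep(1, n),
      stringsAsFactors = FALSE
    ),
    # Note: do.call(rbind, ...) yields NULL for a profile with no samples
    sample_locations = do.call(rbind, per_sample),
    locations = x$locations,
    functions = x$functions
  )
}

# Illustrative v1 input: one trace seen twice, another seen once
v1 <- list(
  meta = data.frame(key = "version", value = "1.0", stringsAsFactors = FALSE),
  samples = data.frame(value = c(2L, 1L)),
  locations = data.frame(location_id = 1:2, function_id = 1:2, line = c(0L, 5L)),
  functions = data.frame(function_id = 1:2, name = c("f", "g"), stringsAsFactors = FALSE)
)
v1$samples$locations <- list(
  data.frame(location_id = c(2L, 1L)),
  data.frame(location_id = 1L)
)

v2 <- profile_v2_from_v1_sketch(v1, source_type = "rprof")
```

The two aggregated v1 rows expand into three raw samples. Their original order within each aggregated group cannot be recovered, since v1 did not store it.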
`validate_profile()` inspects the `version` entry in `meta` to determine which validation rules to apply. The v1 validation remains for backward compatibility.
- `read_rprof()` and `read_pprof()` return v2 format by default, with an option to return v1 for backward compatibility.
- `write_rprof()` and `write_pprof()` accept both v1 and v2 formats.
- `dm_from_profile()` is retained for v1 objects; v2 objects are already `dm` objects.
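Version-based dispatch in the validator might look like the sketch below. The function names `profile_version()` and `validate_profile_sketch()` are hypothetical, and treating every version other than "2.0" as v1 is an assumption of this example, not something the design specifies.

```r
# Hypothetical helper: read the version string from the key/value meta table
profile_version <- function(x) {
  version <- x$meta$value[x$meta$key == "version"]
  if (length(version) != 1L) {
    stop("`meta` must contain exactly one \"version\" row")
  }
  version
}

# Hypothetical dispatcher: returns the name of the rule set that would apply.
# A real validator would call the v1 or v2 checks here instead.
validate_profile_sketch <- function(x) {
  if (profile_version(x) == "2.0") "v2" else "v1"
}

new_style <- list(meta = data.frame(key = "version", value = "2.0", stringsAsFactors = FALSE))
old_style <- list(meta = data.frame(key = "version", value = "1.0", stringsAsFactors = FALSE))
```

Failing early when the `version` row is missing or duplicated keeps the two rule sets from being applied to an object of unknown format.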
- Add `dm` to `Imports` (currently in `Suggests`)
- Define `new_profile_v2()` constructor that creates a `dm` with all keys
- Define `validate_profile_v2()` validation for the v2 format
- Implement `profile_v2_from_v1()` conversion function
- Update `rprof_to_ds()` to produce v2 format directly
- Update `msg_to_ds()` to produce v2 format directly
- Add `source_uri` parameter to `read_rprof()` and `read_pprof()`
- Store provenance in the `sources` table
- Update `write_rprof()` to accept v2 format
- Update `write_pprof()` to accept v2 format
- Handle aggregation (collapsing identical traces) in the writer layer
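Aggregation in the writer layer could be sketched as below: each sample's ordered `location_id`s are pasted into a signature string, and samples sharing a signature are counted. This is one possible approach on made-up data, not the writers' actual implementation. It counts all identical traces, not only consecutive ones.

```r
# Illustrative bridge table: samples 1 and 2 share the same stack
sample_locations <- data.frame(
  sample_id = c(1L, 1L, 2L, 2L, 3L),
  depth = c(1L, 2L, 1L, 2L, 1L),
  location_id = c(2L, 1L, 2L, 1L, 1L)
)

# Order by sample and depth so signatures do not depend on row order
ord <- order(sample_locations$sample_id, sample_locations$depth)
sl <- sample_locations[ord, ]

# One signature string per sample, e.g. "2,1"
signature <- vapply(
  split(sl$location_id, sl$sample_id),
  paste, character(1),
  collapse = ","
)

# Count how many raw samples share each signature
counts <- table(signature)
aggregated <- data.frame(
  trace = names(counts),
  value = as.integer(counts),
  stringsAsFactors = FALSE
)
```

The `value` column here plays the role of the v1 `samples$value` count, computed on demand rather than stored.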
- Support reading Rprof memory profiling data into `sample_values`
- Support writing memory profiling data to pprof format
- Implement `combine_profiles()` to merge multiple v2 profiles
- Ensure `source_id` and other IDs are remapped to avoid conflicts
- Preserve provenance across combinations
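ID remapping when combining profiles can be sketched with simple offsets. The example covers only `sources`, `samples`, and `sample_values`, and it assumes non-empty tables. `combine_profiles_sketch()` and `make_profile()` are hypothetical names, and a full implementation would also need to remap (and possibly deduplicate) `locations` and `functions`.

```r
# Hypothetical sketch: shift the IDs of `b` past the largest IDs used in `a`
combine_profiles_sketch <- function(a, b) {
  source_offset <- max(a$sources$source_id)
  sample_offset <- max(a$samples$sample_id)

  b$sources$source_id <- b$sources$source_id + source_offset
  b$samples$source_id <- b$samples$source_id + source_offset
  b$samples$sample_id <- b$samples$sample_id + sample_offset
  b$sample_values$sample_id <- b$sample_values$sample_id + sample_offset

  list(
    sources = rbind(a$sources, b$sources),
    samples = rbind(a$samples, b$samples),
    sample_values = rbind(a$sample_values, b$sample_values)
  )
}

# Hypothetical helper building a tiny time-only profile with n samples
make_profile <- function(uri, n) {
  list(
    sources = data.frame(
      source_id = 1L, source_type = "rprof", source_uri = uri,
      stringsAsFactors = FALSE
    ),
    samples = data.frame(sample_id = seq_len(n), source_id = rep(1L, n)),
    sample_values = data.frame(
      sample_id = seq_len(n), type = rep("samples", n),
      unit = rep("count", n), value = rep(1, n),
      stringsAsFactors = FALSE
    )
  )
}

combined <- combine_profiles_sketch(make_profile("a.out", 2), make_profile("b.out", 3))
```

After combining, each sample still points at the `sources` row it came from, so the origin of every sample stays traceable through `source_uri`.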
- Should `dm` move from `Suggests` to `Imports`? This adds a dependency but makes the v2 format first-class. Alternative: keep `dm` in `Suggests` and use a plain list internally, constructing `dm` on demand.
- Should the `profile_data` class be retained as a wrapper around `dm`, or should v2 objects simply be `dm` objects with a subclass?
- Should `sample_values` use `dbl` for `value` to accommodate both counts and byte sizes, or should it be polymorphic (e.g., a `vctrs` type)?
- Should `depth` in `sample_locations` be 1-indexed (R convention) or 0-indexed (pprof convention)?