Plan: Profile Data Structure v2

Motivation

The current profile data structure (v1) was designed before dm existed and has several limitations:

Pre-aggregation: Identical stack traces are collapsed with a value count, losing individual sample identity. This prevents attaching per-sample metadata (timestamps, memory snapshots).
Nested list columns: samples$locations stores a list of tibbles, which complicates querying and is not representable in a relational model without unnesting.
No memory profiling support: The format has no type-stable way to store optional memory profiling data alongside time samples.
No provenance: There is no way to identify the source of profiling data, making it impossible to combine profiles from multiple runs while tracking origin.
Container: The primary container is a named list (profile_data), rather than a dm object with built-in constraint checking.

Requirements

Store raw data faithfully, without pre-aggregation
Allow optional memory profiling alongside time profiling, in a type-stable way (long form)
Use dm as the primary container
Store the source/provenance of profiling data
Versioned format with conversion from v1 (legacy)

Current Data Model (v1)

profile_data (named list of tibbles)
├── meta:         key (chr), value (chr)
├── sample_types: type (chr), unit (chr)
├── samples:      value (int), locations (list of tibbles with location_id)
├── locations:    location_id (int), function_id (int), line (int)
└── functions:    function_id (int), name (chr), system_name (chr), filename (chr), start_line (int)

Key issues:

samples$locations is a nested list column — not relational
samples$value aggregates identical traces — loses individual samples
No provenance tracking
dm_from_profile() already unnests into a bridge table (samples_locations) — this should be the canonical form

Proposed Data Model (v2)

The new data structure uses a dm object as the primary container with seven tables:

dm object
├── meta:              key (chr), value (chr)
├── sources:           source_id (int), source_type (chr), source_uri (chr), source_timestamp (dbl)
├── samples:           sample_id (int), source_id (int)
├── sample_values:     sample_id (int), type (chr), unit (chr), value (dbl)
├── sample_locations:  sample_id (int), depth (int), location_id (int)
├── locations:         location_id (int), function_id (int), line (int)
└── functions:         function_id (int), name (chr), system_name (chr), filename (chr), start_line (int)

Table descriptions

`meta`

Unchanged from v1, except the version value becomes "2.0".

| Column | Type | Description |
| --- | --- | --- |
| `key` | chr | Metadata key |
| `value` | chr | Metadata value |

Required row: key = "version", value = "2.0".

`sources`

New table. Tracks the provenance of profiling data so that multiple profiles can be combined.

| Column | Type | Description |
| --- | --- | --- |
| `source_id` | int | Primary key |
| `source_type` | chr | Origin type: `"rprof"`, `"pprof"`, `"manual"`, etc. |
| `source_uri` | chr | File path, URL, or identifier of the source |
| `source_timestamp` | dbl | Epoch timestamp (seconds since 1970-01-01 UTC) when the profile was captured, `NA` if unknown |

Primary key: source_id.

`samples`

One row per raw sample (no aggregation). Each sample is linked to a source.

| Column | Type | Description |
| --- | --- | --- |
| `sample_id` | int | Primary key |
| `source_id` | int | Foreign key → `sources` |

Primary key: sample_id. Foreign key: source_id → sources.source_id.

`sample_values`

Long-form table for sample measurements. This is the type-stable way to store both time and memory profiling data.

| Column | Type | Description |
| --- | --- | --- |
| `sample_id` | int | Foreign key → `samples` |
| `type` | chr | Measurement type: `"samples"`, `"alloc_size"`, `"dealloc_size"`, etc. |
| `unit` | chr | Measurement unit: `"count"`, `"bytes"`, etc. |
| `value` | dbl | The measured value |

Foreign key: sample_id → samples.sample_id. Compound key: (sample_id, type).

This replaces the separate sample_types table and the samples$value column from v1. Each sample can have multiple measurements (e.g., one row for the time sample count and one row for allocated bytes), enabling optional memory profiling in a type-stable long form.
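As an illustration of the long form, the following sketch builds a small `sample_values` table in which one sample is time-only and another also carries memory measurements. The numbers are invented for the example, and the uniqueness check is only one possible way to enforce the compound key.

```r
library(tibble)

# Illustrative values only: sample 1 is time-only, sample 2 also carries
# memory measurements.
sample_values <- tibble(
  sample_id = c(1L, 2L, 2L, 2L),
  type      = c("samples", "samples", "alloc_size", "dealloc_size"),
  unit      = c("count", "count", "bytes", "bytes"),
  value     = c(1, 1, 4096, 1024)
)

# The compound key (sample_id, type) must not contain duplicates.
key_is_unique <- !anyDuplicated(sample_values[c("sample_id", "type")])

# Time-only consumers filter on `type` and never see the memory rows.
time_only <- sample_values[sample_values$type == "samples", ]
```

The columns are the same whether or not memory data is present; only the number of rows per sample changes.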

`sample_locations`

Bridge table linking samples to their stack trace locations. Replaces the nested list column samples$locations from v1.

| Column | Type | Description |
| --- | --- | --- |
| `sample_id` | int | Foreign key → `samples` |
| `depth` | int | Position in the stack trace (1 = innermost) |
| `location_id` | int | Foreign key → `locations` |

Foreign key: sample_id → samples.sample_id. Foreign key: location_id → locations.location_id. Compound key: (sample_id, depth).

`locations`

Unchanged from v1.

| Column | Type | Description |
| --- | --- | --- |
| `location_id` | int | Primary key |
| `function_id` | int | Foreign key → `functions`, `NA` allowed |
| `line` | int | Source line number, 0 if unknown, `NA` allowed |

`functions`

Unchanged from v1.

| Column | Type | Description |
| --- | --- | --- |
| `function_id` | int | Primary key |
| `name` | chr | Demangled function name |
| `system_name` | chr | Mangled/raw function name |
| `filename` | chr | Source file name |
| `start_line` | int | Start line in source file, 0 if unknown |

Relational diagram

sources 1──* samples 1──* sample_values
                     1──* sample_locations *──1 locations *──1 functions
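
The relationships in the diagram could be declared with dm roughly as follows. This is a minimal sketch, not the planned constructor: `empty_profile_v2()` is a hypothetical name, the compound keys of `sample_values` and `sample_locations` are registered as primary keys by assumption, and the native pipe requires R 4.1 or later.

```r
library(dm)
library(tibble)

# Hypothetical helper: an empty v2 profile with all keys declared.
empty_profile_v2 <- function() {
  dm(
    meta = tibble(key = "version", value = "2.0"),
    sources = tibble(
      source_id = integer(), source_type = character(),
      source_uri = character(), source_timestamp = double()
    ),
    samples = tibble(sample_id = integer(), source_id = integer()),
    sample_values = tibble(
      sample_id = integer(), type = character(),
      unit = character(), value = double()
    ),
    sample_locations = tibble(
      sample_id = integer(), depth = integer(), location_id = integer()
    ),
    locations = tibble(
      location_id = integer(), function_id = integer(), line = integer()
    ),
    functions = tibble(
      function_id = integer(), name = character(), system_name = character(),
      filename = character(), start_line = integer()
    )
  ) |>
    # Primary keys; compound keys are treated as primary keys (assumption).
    dm_add_pk(sources, source_id) |>
    dm_add_pk(samples, sample_id) |>
    dm_add_pk(sample_values, c(sample_id, type)) |>
    dm_add_pk(sample_locations, c(sample_id, depth)) |>
    dm_add_pk(locations, location_id) |>
    dm_add_pk(functions, function_id) |>
    # Foreign keys, one per edge in the relational diagram.
    dm_add_fk(samples, source_id, sources) |>
    dm_add_fk(sample_values, sample_id, samples) |>
    dm_add_fk(sample_locations, sample_id, samples) |>
    dm_add_fk(sample_locations, location_id, locations) |>
    dm_add_fk(locations, function_id, functions)
}

profile <- empty_profile_v2()
```

With the keys in place, `dm_examine_constraints(profile)` should be usable as a basic integrity check on populated profiles.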

Key design decisions

No pre-aggregation: Each sample gets its own row in samples. Aggregation (collapsing identical traces into a single row with a count) is a downstream concern for analysis, not storage.
Long-form sample values: The sample_values table stores measurements in long form. A time-only profile has one row per sample; a profile with memory data has multiple rows per sample. This is type-stable (always the same columns) and extensible (new measurement types require no schema changes).
Bridge table for locations: The sample_locations table with a depth column replaces the nested list column approach. This is fully relational and works naturally with dm.
Provenance via sources: Each sample is linked to a source, enabling combination of multiple profiles. The source_type and source_uri columns provide traceability.
dm as primary container: The dm object is the canonical format, with primary and foreign keys defined. Whether the profile_data class wraps the dm object or is replaced by it is still an open question.

Migration: v1 → v2

Conversion function

profile_v2_from_v1(x) converts a v1 profile_data object to a v2 dm object:

Create a single sources row with source_type inferred from the hidden .rprof or .msg component, source_uri = NA (not stored in v1).
Expand samples: repeat each row of samples value times (undoing the v1 aggregation), then assign a unique sample_id to every resulting row.
Create sample_values: one row per sample with type = "samples", unit = "count", value = 1.
Create sample_locations: unnest samples$locations, adding depth based on row position within each sample.
Copy locations and functions unchanged.
Set the value of the meta row with key = "version" to "2.0".
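
The expansion steps above might look roughly like this with dplyr and tidyr. It is a sketch on a toy v1 `samples` table, not the planned `profile_v2_from_v1()`; it assumes a single source with `source_id = 1L` and that the first row of each nested tibble is the innermost frame.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy v1 `samples` table: two distinct traces, the first observed 3 times.
samples_v1 <- tibble(
  value = c(3L, 1L),
  locations = list(
    tibble(location_id = c(2L, 1L)),
    tibble(location_id = 1L)
  )
)

# Repeat each row `value` times, then number the raw samples.
expanded <- samples_v1 |>
  uncount(value) |>
  mutate(sample_id = row_number(), source_id = 1L)

samples <- expanded |> select(sample_id, source_id)

# Every raw sample counts exactly once.
sample_values <- samples |>
  transmute(sample_id, type = "samples", unit = "count", value = 1)

# Unnest the traces; depth is the row position within each sample.
sample_locations <- expanded |>
  select(sample_id, locations) |>
  unnest(locations) |>
  group_by(sample_id) |>
  mutate(depth = row_number()) |>
  ungroup() |>
  select(sample_id, depth, location_id)
```

Summing `value` in `sample_values` gives back the total sample count of the v1 table, which is a cheap sanity check for the conversion.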

Version detection

validate_profile() inspects the version entry in meta (the value of the row with key = "version") to determine which validation rules to apply. The v1 validation remains for backward compatibility.
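
Because meta stores key/value pairs, the version is the value in the row whose key is "version". A minimal sketch of the lookup follows; `profile_version()` and `is_v2()` are hypothetical helper names, and treating every version other than "2.0" as legacy is an assumption.

```r
# Hypothetical helper: read the format version from a `meta` table.
profile_version <- function(meta) {
  version <- meta$value[meta$key == "version"]
  if (length(version) != 1L) {
    stop("`meta` must contain exactly one row with key \"version\".")
  }
  version
}

# Hypothetical predicate: anything that is not "2.0" is treated as legacy.
is_v2 <- function(meta) {
  identical(profile_version(meta), "2.0")
}

meta_v2 <- data.frame(key = "version", value = "2.0", stringsAsFactors = FALSE)
is_v2(meta_v2)
```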

Backward compatibility

read_rprof() and read_pprof() return v2 format by default, with an option to return v1 for backward compatibility.
write_rprof() and write_pprof() accept both v1 and v2 formats.
dm_from_profile() is retained for v1 objects; v2 objects are already dm objects.

Implementation plan

Phase 1: Foundation

Add dm to Imports (currently in Suggests)
Define new_profile_v2() constructor that creates a dm with all keys
Define validate_profile_v2() validation for the v2 format
Implement profile_v2_from_v1() conversion function

Phase 2: Reader updates

Update rprof_to_ds() to produce v2 format directly
Update msg_to_ds() to produce v2 format directly
Add source_uri parameter to read_rprof() and read_pprof()
Store provenance in the sources table

Phase 3: Writer updates

Update write_rprof() to accept v2 format
Update write_pprof() to accept v2 format
Handle aggregation (collapsing identical traces) in the writer layer
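
One way the writer layer could collapse identical traces is to build a signature per sample and count the signatures. The sketch below uses base R on a toy `sample_locations` table; the `;` separator and the `trace` column name are illustrative choices, not part of the plan.

```r
# Toy v2 bridge table: samples 1 and 3 share the trace (2, 1).
sample_locations <- data.frame(
  sample_id   = c(1L, 1L, 2L, 3L, 3L),
  depth       = c(1L, 2L, 1L, 1L, 2L),
  location_id = c(2L, 1L, 1L, 2L, 1L)
)

# Order by sample and depth so each signature follows the depth order.
ordered <- sample_locations[
  order(sample_locations$sample_id, sample_locations$depth),
]

# One signature string per sample, e.g. "2;1".
signature <- tapply(
  ordered$location_id, ordered$sample_id, paste, collapse = ";"
)

# Count how often each distinct trace occurs (the v1-style `value`).
counts <- table(signature)
aggregated <- data.frame(
  trace = names(counts),
  value = as.integer(counts),
  stringsAsFactors = FALSE
)
```

The sum of `value` equals the number of raw samples, so no sample is lost by the aggregation.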

Phase 4: Memory profiling support

Support reading Rprof memory profiling data into sample_values
Support writing memory profiling data to pprof format

Phase 5: Profile combination

Implement combine_profiles() to merge multiple v2 profiles
Ensure source_id and other IDs are remapped to avoid conflicts
Preserve provenance across combinations
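
The ID remapping could be done by offsetting the IDs of the second profile by the largest ID already in use. The following sketch shows this for `sources` and `samples` only, using plain lists of data frames instead of dm objects; `shift_ids()` is a hypothetical helper, and non-empty tables are assumed.

```r
# Hypothetical helper: shift an integer ID column by an offset (NA stays NA).
shift_ids <- function(ids, offset) ids + as.integer(offset)

# Two toy profiles, each numbering its sources and samples from 1.
a <- list(
  sources = data.frame(source_id = 1L, source_type = "rprof",
                       stringsAsFactors = FALSE),
  samples = data.frame(sample_id = 1:2, source_id = 1L)
)
b <- list(
  sources = data.frame(source_id = 1L, source_type = "pprof",
                       stringsAsFactors = FALSE),
  samples = data.frame(sample_id = 1:3, source_id = 1L)
)

# Offsets are the largest IDs already used by `a`.
source_offset <- max(a$sources$source_id)
sample_offset <- max(a$samples$sample_id)

# Remap each primary key together with every foreign key pointing at it.
b$sources$source_id <- shift_ids(b$sources$source_id, source_offset)
b$samples$source_id <- shift_ids(b$samples$source_id, source_offset)
b$samples$sample_id <- shift_ids(b$samples$sample_id, sample_offset)

combined <- list(
  sources = rbind(a$sources, b$sources),
  samples = rbind(a$samples, b$samples)
)
```

A full implementation would apply the same pattern to `sample_values`, `sample_locations`, `locations` and `functions`, and would likely deduplicate functions and locations shared between profiles.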

Open questions

Should dm move from Suggests to Imports? This adds a dependency but makes the v2 format first-class. Alternative: keep dm in Suggests and use a plain list internally, constructing dm on demand.
Should the profile_data class be retained as a wrapper around dm, or should v2 objects simply be dm objects with a subclass?
Should sample_values use dbl for value to accommodate both counts and byte sizes, or should it be polymorphic (e.g., a vctrs type)?
Should depth in sample_locations be 1-indexed (R convention) or 0-indexed (pprof convention)?
