Skip to content

fix: decouple visual styling from tbl_hierarchical_rate_by_grade to prevent Cartesian join explosion#228

Open
Melkiades wants to merge 61 commits into
mainfrom
js/fix-cartesian-join-explosion
Open

fix: decouple visual styling from tbl_hierarchical_rate_by_grade to prevent Cartesian join explosion#228
Melkiades wants to merge 61 commits into
mainfrom
js/fix-cartesian-join-explosion

Conversation

@Melkiades
Copy link
Copy Markdown
Contributor

@Melkiades Melkiades commented May 6, 2026

Fixes #226

Description

tbl_hierarchical_rate_by_grade() was blanking the label column for grade rows (label = "") before returning. When tbl_with_pools() called tbl_merge(), the join on (variable, row_type, var_label, label) lost row uniqueness — every blank-label grade row matched every other, producing a Cartesian cross-join with massive row duplication and column suffixing.

Fix: Decouple data generation from visual presentation:

  • tbl_hierarchical_rate_by_grade() now returns a structurally pristine table where label retains unique grade text. Injects custom_info metadata for downstream use.
  • add_grade_column() (new exported function) applies all visual formatting (label_grade column, label blanking, headers, indentation) as a post-processing step.

New workflow

# Standalone
tbl_hierarchical_rate_by_grade(...) |> add_grade_column()

# With pooled columns (the fix)
tbl_with_pools(..., .tbl_fun = tbl_hierarchical_rate_by_grade) |> add_grade_column()

How to test

# This previously caused a Cartesian explosion — now completes instantly
tbl_with_pools(
  data = adae, pools = my_pools, by = "TRTA",
  denominator = adsl,
  .tbl_fun = tbl_hierarchical_rate_by_grade,
  variables = c(AEBODSYS, AEDECOD, AETOXGR),
  grade_groups = list("Grade 1-2" = c("1", "2"))
) |> add_grade_column()

Run devtools::document() then devtools::test(). Accept new snapshots with testthat::snapshot_accept().

Files changed

File Change
R/tbl_hierarchical_rate_by_grade.R Removed visual styling, added custom_info metadata injection
R/add_grade_column.R New exported post-processing function
tests/testthat/test-add_grade_column.R 8 test cases
tests/testthat/test-tbl_hierarchical_rate_by_grade.R Updated to pipe through add_grade_column()
tests/testthat/test-tbl_with_pools.R Added end-to-end pipeline regression test

@Melkiades
Copy link
Copy Markdown
Contributor Author

@jszczypinski sorry for adding you, I just asked for last committer ahah

Melkiades and others added 6 commits May 6, 2026 10:19
…revent Cartesian join explosion

tbl_hierarchical_rate_by_grade() was blanking the label column for grade
rows before returning, causing tbl_merge() inside tbl_with_pools() to
lose row uniqueness and produce a Cartesian cross-join.

Split into two functions:
- tbl_hierarchical_rate_by_grade(): returns structurally pristine table
  with unique label values, injects custom_info metadata
- add_grade_column(): post-processing function that applies all visual
  formatting (label_grade column, label blanking, headers, indentation)

Co-authored-by: Ona <no-reply@ona.com>
…revent Cartesian join explosion

tbl_hierarchical_rate_by_grade() was blanking the label column for grade
rows before returning, causing tbl_merge() inside tbl_with_pools() to
lose row uniqueness and produce a Cartesian cross-join.

Split into two functions:
- tbl_hierarchical_rate_by_grade(): returns structurally pristine table
  with unique label values, injects custom_info metadata
- add_grade_column(): post-processing function that applies all visual
  formatting (label_grade column, label blanking, headers, indentation)
- add_grade_column() now returns early if label_grade already exists
- metadata lookup iterates over x$tbls instead of hardcoding tbls[[1]]
- added idempotency test (test #8)
- fixed broken comment block in tbl_with_pools test

Co-authored-by: Ona <no-reply@ona.com>
- add_grade_column() now returns early if label_grade already exists
- metadata lookup iterates over x$tbls instead of hardcoding tbls[[1]]
- added idempotency test (test #8)
- fixed broken comment block in tbl_with_pools test
Co-authored-by: Ona <no-reply@ona.com>
@Melkiades Melkiades force-pushed the js/fix-cartesian-join-explosion branch from 1afacd6 to 8becdc8 Compare May 6, 2026 10:19
@Melkiades
Copy link
Copy Markdown
Contributor Author

@jszczypinski I did revert them ahah crazy stuff

Copy link
Copy Markdown
Contributor Author

@Melkiades Melkiades left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review after fixes. All previously identified issues are resolved:

  • Idempotency guard added (line 75): early return if label_grade already exists, prevents silent corruption on double-call.
  • Robust metadata lookup (line 78): Find() iterates over x$tbls instead of hardcoding tbls[[1]].
  • Idempotency test added (test #8): verifies table_body is identical after second call and grade labels are not corrupted.
  • Comment block fixed in test-tbl_with_pools.R.
  • Docs updated to match implementation.

Remaining notes:

  • Output snapshots for tbl_hierarchical_rate_by_grade were cleared and need regeneration on first devtools::test() run (testthat::snapshot_accept()).
  • devtools::document() must be run to export add_grade_column in NAMESPACE.

Melkiades and others added 6 commits May 6, 2026 12:52
…elect deprecation

Three tidyselect contexts in tbl_mmrm() were passing string variables
(arm, visit) as bare names instead of using all_of(). This triggered
deprecation warnings from tidyselect 1.1.0+.

Co-authored-by: Ona <no-reply@ona.com>
…elect deprecation

Three tidyselect contexts in tbl_mmrm() were passing string variables
(arm, visit) as bare names instead of using all_of(). This triggered
deprecation warnings from tidyselect 1.1.0+.
@Melkiades Melkiades force-pushed the js/fix-cartesian-join-explosion branch from 51ffb39 to 74c8419 Compare May 6, 2026 11:50
jszczypinski and others added 2 commits May 6, 2026 12:37
…elect deprecation

Three tidyselect contexts in tbl_mmrm() were passing string variables
(arm, visit) as bare names instead of using all_of(). This triggered
deprecation warnings from tidyselect 1.1.0+.

Also added AGENTS.md to .gitignore.

Co-authored-by: Ona <no-reply@ona.com>
…elect deprecation

Three tidyselect contexts in tbl_mmrm() were passing string variables
(arm, visit) as bare names instead of using all_of(). This triggered
deprecation warnings from tidyselect 1.1.0+.

Also added AGENTS.md to .gitignore.
@Melkiades Melkiades force-pushed the js/fix-cartesian-join-explosion branch from 803c081 to 76f4c0b Compare May 6, 2026 12:37
@Melkiades Melkiades marked this pull request as ready for review May 6, 2026 12:53
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

Unit Tests Summary

  1 files  233 suites   3m 10s ⏱️
233 tests 233 ✅ 0 💤 0 ❌
682 runs  682 ✅ 0 💤 0 ❌

Results for commit 76ef8f4.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

Unit Test Performance Difference

Test Suite $Status$ Time on main $±Time$ $±Tests$ $±Skipped$ $±Failures$ $±Errors$
add_grade_column 👶 $+0.00$ $+3$ $0$ $0$ $0$
Additional test case details
Test Suite $Status$ Time on main $±Time$ Test Case
add_forest 💚 $5.99$ $-3.41$ add_forest_table_engine_flextable_works
add_grade_column 👶 $+0.00$ add_grade_column_blanks_stats_for_SOC_label_rows
add_grade_column 👶 $+0.00$ add_grade_column_errors_on_non_gtsummary_input
add_grade_column 👶 $+0.00$ add_grade_column_errors_when_custom_info_is_missing
add_grade_column 👶 $+0.00$ add_grade_column_extracts_custom_info_from_merged_tables
add_grade_column 👶 $+0.00$ add_grade_column_is_idempotent_when_called_twice
add_grade_column 👶 $+0.00$ add_grade_column_recodes_zero_statistics
add_grade_column 👶 $+0.00$ add_grade_column_sets_correct_header_labels
add_grade_column 👶 $+20.67$ add_grade_column_works_on_a_standalone_tbl_hierarchical_rate_by_grade
add_grade_column 👶 $+0.00$ add_grade_column_works_without_grade_groups
add_hierarchical_count_row 💚 $3.96$ $-2.52$ add_hierarchical_count_row_works
adjust_stat_columns_wrap 💚 $2.19$ $-1.54$ adjust_stat_columns_wrap_validates_input_correctly
annotate_gg 💚 $9.49$ $-3.91$ annotate_lineplot_df_warns_if_plot_is_not_generated_by_gg_lineplot
tbl_baseline_chg 💔 $12.00$ $+2.18$ tbl_baseline_chg_works
tbl_hierarchical_incidence_rate 💚 $20.79$ $-1.36$ tbl_hierarchical_incidence_rate_runs_with_minimum_arguments
tbl_hierarchical_rate_and_count 💚 $14.00$ $-2.19$ tbl_hierarchical_rate_and_count_works
tbl_hierarchical_rate_by_grade 👶 $+0.00$ tbl_hierarchical_rate_by_grade_appends_missing_grade_group_levels_to_character_grade
tbl_hierarchical_rate_by_grade 💔 $17.42$ $+3.91$ tbl_hierarchical_rate_by_grade_works
tbl_roche_subgroups 💔 $0.95$ $+2.92$ tbl_roche_subgroups_time_to_event_NULL_works
tbl_roche_summary 💚 $7.58$ $-4.16$ tbl_roche_summary_works
tbl_shift 💚 $9.96$ $-2.42$ tbl_shift_strata_location_
tbl_with_pools 👶 $+0.00$ tbl_with_pools_tbl_hierarchical_rate_by_grade_add_grade_column_pipeline_works
tbl_with_pools 💔 $2.24$ $+9.64$ tbl_with_pools_validates_inputs_correctly

Results for commit 5e67faa

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

badge

Code Coverage Summary

Filename                                 Stmts    Miss  Cover    Missing
-------------------------------------  -------  ------  -------  ------------------------------------------------------------------------------------------------
R/add_blank_rows.R                          63       0  100.00%
R/add_difference_row.R                     101       0  100.00%
R/add_forest_utils.R                        97      10  89.69%   76-79, 94-100
R/add_forest.R                             139       0  100.00%
R/add_hierarchical_count_row.R              33       0  100.00%
R/adjust_stat_columns_wrap.R                29       1  96.55%   53
R/annotate_gg_km.R                         135       0  100.00%
R/annotate_gg_pkc.R                         92       0  100.00%
R/annotate_gg.R                             81       0  100.00%
R/ard_tabulate_abnormal_by_baseline.R       65       0  100.00%
R/crane-package.R                            2       2  0.00%    26-27
R/deprecated.R                              21      21  0.00%    15-51
R/df_add_poolings.R                         41       0  100.00%
R/get_cox_pairwise_df.R                     96       6  93.75%   226-231
R/gg_km_utils.R                             35      14  60.00%   18-35
R/gg_km.R                                  143      37  74.13%   54-57, 74, 101, 175-180, 183-186, 196-198, 203-204, 238-240, 247-250, 254, 265-269, 282, 284-286
R/gg_lineplot.R                             94       0  100.00%
R/gg_mmrm_lineplot.R                       102       1  99.02%   106
R/gg_pkc_lineplot.R                         98       0  100.00%
R/gg_utils.R                               221       0  100.00%
R/label_roche.R                             72       0  100.00%
R/modify_header_rm_md.R                     18       2  88.89%   35-36
R/modify_zero_recode.R                      20       1  95.00%   64
R/reverse_difference_ci.R                   33       0  100.00%
R/tbl_baseline_chg.R                       188       0  100.00%
R/tbl_coxph.R                               90       1  98.89%   219
R/tbl_hierarchical_incidence_rate.R        209       0  100.00%
R/tbl_hierarchical_rate_and_count.R        339      13  96.17%   343, 425, 446-456
R/tbl_hierarchical_rate_by_grade.R         317       3  99.05%   169-171
R/tbl_listing.R                             35       0  100.00%
R/tbl_mmrm.R                               254       1  99.61%   393
R/tbl_null_report.R                          9       0  100.00%
R/tbl_rmpt.R                               157      12  92.36%   297-302, 314-319
R/tbl_roche_subgroups.R                    155       0  100.00%
R/tbl_roche_summary.R                       64       0  100.00%
R/tbl_shift.R                              116       0  100.00%
R/tbl_survfit_quantiles.R                  132       1  99.24%   295
R/tbl_survfit_times.R                       92       0  100.00%
R/tbl_with_pools.R                          64       0  100.00%
R/theme_gtsummary_roche.R                   84       1  98.81%   61
R/utils.R                                   36       0  100.00%
TOTAL                                     4172     127  96.96%

Diff against main

Filename                              Stmts    Miss  Cover
----------------------------------  -------  ------  -------
R/tbl_hierarchical_rate_by_grade.R      +43       0  +0.15%
TOTAL                                   +43       0  +0.03%

Results for commit: 76ef8f4

Minimum allowed coverage is 80%

♻️ This comment has been updated with latest results

jszczypinski and others added 6 commits May 6, 2026 13:20
Tightly coupled function now shares the same file and help page via
@Rdname. Removed standalone R/add_grade_column.R and its man page.

Co-authored-by: Ona <no-reply@ona.com>
Tightly coupled function now shares the same file and help page via
@Rdname. Removed standalone R/add_grade_column.R and its man page.
Co-authored-by: Ona <no-reply@ona.com>
…engineering/crane into js/fix-cartesian-join-explosion
…engineering/crane into js/fix-cartesian-join-explosion
Melkiades and others added 4 commits May 8, 2026 13:13
…engineering/crane into js/fix-cartesian-join-explosion
…engineering/crane into js/fix-cartesian-join-explosion
Bare `label` inside modify_table_body() lambda is not visible to
R CMD check. Use .data$label for all references.

Co-authored-by: Ona <no-reply@ona.com>
Bare `label` inside modify_table_body() lambda is not visible to
R CMD check. Use .data$label for all references.
Comment thread tests/testthat/_snaps/tbl_hierarchical_rate_by_grade.md Outdated
Melkiades and others added 13 commits May 8, 2026 13:29
- Fix test #2: assert no label_grade indent rows exist (was always-true)
- Add comment to test #7: explain post_fmt_fun is a proxy check
- Use slice_head(n = 30) for deterministic row selection
- Add local_wide_snapshot() helper for consistent snapshot width

Co-authored-by: Ona <no-reply@ona.com>
- Fix test #2: assert no label_grade indent rows exist (was always-true)
- Add comment to test #7: explain post_fmt_fun is a proxy check
- Use slice_head(n = 30) for deterministic row selection
- Add local_wide_snapshot() helper for consistent snapshot width
Signed-off-by: Davide Garolini <dgarolini@gmail.com>
Comment on lines +1 to +6
# Set a wide console width for snapshot tests to prevent line-wrapping
# differences across platforms. Call inside test_that() blocks before
# expect_snapshot() — withr::local_options() resets automatically on exit.
local_wide_snapshot <- function(width = 220, .local_envir = parent.frame()) {
withr::local_options(list(width = width), .local_envir = .local_envir)
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@llrs-roche do you think we need this? is this the right place?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a function for this.
If you want to set a fixed width, we can directly add the call to withr on each test.
If there were more options or other configurations for the snapshots then it might make sense to have a specific function.

@Melkiades Melkiades requested a review from jszczypinski May 20, 2026 15:19
@Melkiades
Copy link
Copy Markdown
Contributor Author

ready for review @shajoezhu @llrs-roche (Jan is out till monday I believe)

@Melkiades Melkiades force-pushed the js/fix-cartesian-join-explosion branch from a6da1a5 to 8c7f6de Compare May 20, 2026 17:34
Melkiades and others added 3 commits May 20, 2026 19:38
Signed-off-by: Davide Garolini <dgarolini@gmail.com>
Signed-off-by: Davide Garolini <dgarolini@gmail.com>
Copy link
Copy Markdown
Contributor

@llrs-roche llrs-roche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have yet to check the PR locally or verified that the issue is indeed fixed but I have reviewed the code.

See comments below for some concerns

# apply visual formatting to the table body
x <- x |>
gtsummary::modify_table_body(
\(table_body) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider if these anonymous functions can be extracted from the functions, simplified, and reused across the package. Not to address it now but please open an issue

Comment on lines +1 to +6
# Set a wide console width for snapshot tests to prevent line-wrapping
# differences across platforms. Call inside test_that() blocks before
# expect_snapshot() — withr::local_options() resets automatically on exit.
local_wide_snapshot <- function(width = 220, .local_envir = parent.frame()) {
withr::local_options(list(width = width), .local_envir = .local_envir)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a function for this.
If you want to set a fixed width, we can directly add the call to withr on each test.
If there were more options or other configurations for the snapshots then it might make sense to have a specific function.

Comment on lines +428 to +433
ADSL_pipe <- cards::ADSL
ADAE_pipe <- cards::ADAE |>
dplyr::filter(
AESOC %in% unique(cards::ADAE$AESOC)[1:3],
AETERM %in% unique(cards::ADAE$AETERM)[1:3]
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the beginning of the file there is already an ADSL and ADAE objects created. Aren't they enough? Why do we need these modifications here?

Comment on lines +422 to +426
test_that("tbl_with_pools() + tbl_hierarchical_rate_by_grade() + add_grade_column() pipeline works", {
# Regression test for the Cartesian join explosion.
# Previously, tbl_hierarchical_rate_by_grade() blanked the `label` column for
# grade rows before returning, causing tbl_merge() inside tbl_with_pools() to
# lose row uniqueness and produce a Cartesian cross-join.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment should be summarized and be the title of the test

Comment on lines +495 to +496
#' @export
add_grade_column <- function(x) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is added with tbl_hierarchical_rate_by_grade() if it must be called also after any merging?

@llrs-roche
Copy link
Copy Markdown
Contributor

Do we need a special version of cards or anything?
When running this PR locally I see many failures:

devtools::test(filter = 'add_grade_column')
[ FAIL 7 | WARN 0 | SKIP 0 | PASS 2 ]

@llrs-roche llrs-roche self-assigned this May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug fix for tbl_with_pools in cases where custom label columns are added (tbl_hierarchical_*)

3 participants