Adding diagnostics mode for proof failures #2060

Open
sergey3bv wants to merge 2 commits into lightninglabs:main from sergey3bv:feat/diagnostic-mode

Conversation

@sergey3bv
Contributor

Should close #1867

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a diagnostics mode designed to capture and persist artifacts related to proof validation failures. By asynchronously writing these failures to a specified directory, it provides better visibility into issues encountered during the asset transfer process, aiding in debugging and troubleshooting.

Highlights

  • New Diagnostics Package: Introduced a new diagnostics package to handle asynchronous persistence of proof validation failures, including a service for writing failure artifacts to disk.
  • Integration with ChainPorter: Updated ChainPorter to capture and report proof validation failures during both pre-broadcast and post-broadcast stages.
  • Configuration Updates: Added a new --diagnostics-dir configuration flag to enable and configure the diagnostics service.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new diagnostics service for Taproot Assets, enabling asynchronous persistence of proof validation failures to disk. The service includes a background writer, a queue for non-blocking reporting, and logic to sanitize and store failure metadata alongside binary proof artifacts. Feedback includes a critical fix for a potential race condition in the cloneFailure function where pointer fields were not being deep-copied, a request to add missing documentation for the writeFailureReport function per the style guide, and a suggestion to refactor pre-broadcast failure reporting in the ChainPorter to use existing helper methods for better consistency.
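
The deep-copy concern raised in this summary can be illustrated with a minimal sketch. The `failure` type and its field names below are hypothetical stand-ins, not the PR's actual types; the point is only that a plain struct copy still shares pointer and slice fields with the caller, so a clone handed to a background writer should copy those fields explicitly.

```go
package main

import "fmt"

// failure is a hypothetical stand-in for the PR's failure report type. The
// field names are assumptions chosen for illustration.
type failure struct {
	Stage       string
	OutputIndex *int
	ProofBlob   []byte
}

// cloneFailure returns a copy that shares no memory with f, so the caller may
// keep mutating f after handing the clone to a background writer.
func cloneFailure(f failure) failure {
	c := f // copies value fields; pointers and slices still alias f here

	if f.OutputIndex != nil {
		idx := *f.OutputIndex
		c.OutputIndex = &idx
	}
	if f.ProofBlob != nil {
		c.ProofBlob = append([]byte(nil), f.ProofBlob...)
	}

	return c
}

func main() {
	idx := 1
	orig := failure{
		Stage:       "pre-broadcast",
		OutputIndex: &idx,
		ProofBlob:   []byte{0xaa},
	}
	clone := cloneFailure(orig)

	// Mutating the original must not affect the clone.
	*orig.OutputIndex = 7
	orig.ProofBlob[0] = 0xbb

	fmt.Println(*clone.OutputIndex, clone.ProofBlob[0] == 0xaa)
}
```

Without the two explicit copies, both mutations in `main` would be visible through the clone, which is the kind of data race a concurrent writer goroutine would hit.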

Comment thread diagnostics/service.go
Comment thread diagnostics/service.go
Comment thread tapfreighter/chain_porter.go Outdated
@sergey3bv sergey3bv force-pushed the feat/diagnostic-mode branch from 55087ee to e39586e Compare April 10, 2026 13:57
@sergey3bv
Contributor Author

Hey @jtobin, make itest-parallel runs completely fine locally yet fails in CI. Could you please take a look?

@kaldun-tech
Contributor

Hey @sergey3bv it looks like you have a bad merge or rebase on your branch that caused the integration test issues. Found the root cause using Claude Opus.

  1. limit_constraints_test.go - The entire file was deleted including the entire test for RFQ limit-order constraints (critical coverage)
  2. decode_invoice_test.go - Removed the test coverage for multi-tranche group key decoding.
  3. liquidity_test.go - Synchronization code was removed which caused this CI failure

Here are Claude's recommendations. Once you have these fixed we can proceed with the rest of the review:

  1. Immediate: Revert all changes to test files:
    git checkout main -- itest/custom_channels/decode_invoice_test.go
    git checkout main -- itest/custom_channels/limit_constraints_test.go
    git checkout main -- itest/custom_channels/liquidity_test.go
    git checkout main -- itest/custom_channels/passive_assets_test.go
  2. Then: Re-apply only the necessary net.Miner → net.Miner.Client changes if needed for API compatibility
  3. Review: The PR should be rebased cleanly on main with only the diagnostics-related changes.

@sergey3bv sergey3bv force-pushed the feat/diagnostic-mode branch from e39586e to d4af801 Compare May 12, 2026 07:42
@sergey3bv
Contributor Author

Hey, @kaldun-tech, I updated the PR according to your comment, could you please take a look.

cc @jtobin

@sergey3bv sergey3bv force-pushed the feat/diagnostic-mode branch 3 times, most recently from eb74718 to 45ea017 Compare May 12, 2026 12:10

return nil
}, ccTransferTimeout)
}, ccTransferConfirmTimeout)
Contributor

These changes to itest look appropriate to improve CI stability via diagnostics testing

Contributor

Sorry to flip-flop on this. My concern is that the itest changes are extensive enough to be out of scope for the PR. They look good but are orthogonal to diagnostics. It could introduce a regression and make the commit history muddled.

It's up to the repo maintainers whether they want to keep the itest changes bundled in this change or split them into a separate PR.

Comment thread tapfreighter/chain_porter.go Outdated
Comment thread diagnostics/service.go
TransferOutputIndex *int `json:"transfer_output_idx,omitempty"`
OutputProofFiles []string `json:"output_proof_files,omitempty"`
InputProofFiles []string `json:"input_proof_files,omitempty"`
}
Contributor

@kaldun-tech kaldun-tech May 13, 2026

Minor: The metadata.json schema is not documented. Consider adding a tapd version field in a future PR.

Something like: TapdVersion string `json:"tapd_version,omitempty"`

This would help support teams know which tapd version produced the diagnostics dump when it happens

Contributor

Please add explanations of why the fields that are marked omitempty are safe to be omitted. A reader can't tell from the struct definition alone which fields appear in which stage's reports.
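
As a rough illustration of what such field comments might look like, here is a sketch that reuses the three fields from the diff excerpt above. The comment wording and the file name `input_0.proof` are assumptions for illustration, not the PR's text. It also shows why a pointer is useful for the output index: with `omitempty`, a nil pointer is dropped, while a pointer to `0` is still serialized.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metadata mirrors the three fields visible in the diff excerpt. The comments
// are assumptions about intent, not the PR's wording.
type metadata struct {
	// TransferOutputIndex is a pointer so that output index 0 is still
	// serialized; only a nil pointer is omitted.
	TransferOutputIndex *int `json:"transfer_output_idx,omitempty"`

	// OutputProofFiles is omitted when no output proofs were written.
	OutputProofFiles []string `json:"output_proof_files,omitempty"`

	// InputProofFiles is omitted when no input proofs were written.
	InputProofFiles []string `json:"input_proof_files,omitempty"`
}

// encode marshals m to a JSON string, or returns the error text.
func encode(m metadata) string {
	b, err := json.Marshal(m)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	zero := 0

	// All fields unset: every key is omitted.
	fmt.Println(encode(metadata{}))

	// Index 0 is kept because the pointer itself is non-nil.
	fmt.Println(encode(metadata{
		TransferOutputIndex: &zero,
		InputProofFiles:     []string{"input_0.proof"},
	}))
}
```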

Contributor

Good improvement!

Comment thread diagnostics/service.go
}
}

func (s *Service) writer() {
Contributor

Is there a reason to skip a comment block on these new functions?

Comment thread tapconfig/config.go Outdated
Comment thread diagnostics/service.go
}

return clones
}
Contributor

File level concern: You don't have a mechanism for disk-space management. Ex:

  • Limit total directory size
  • Limit number of failure reports
  • Delete old reports (retention policy)
  • Rotate/archive old runs

Every proof validation failure can write to disk indefinitely. So the risk is if a node has persistent proof validation issues, the diagnostics directory could grow unbounded.

This seems acceptable for version 1 as the feature is explicitly for debugging. In a future enhancement you could add flags such as --diagnostics-max-size or --diagnostics-retention-days.
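
One of the options listed above, limiting the number of failure reports, could be sketched roughly as follows. This is not part of the PR: `pruneReports` is a hypothetical helper, and the sketch assumes report directory names sort chronologically (for example via a timestamp prefix).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// pruneReports keeps at most maxReports entries in dir and deletes the rest,
// oldest first. It assumes entry names sort chronologically, which is an
// assumption of this sketch.
func pruneReports(dir string, maxReports int) (int, error) {
	if maxReports < 0 {
		maxReports = 0
	}

	entries, err := os.ReadDir(dir) // entries come back sorted by name
	if err != nil {
		return 0, err
	}

	excess := len(entries) - maxReports
	if excess <= 0 {
		return 0, nil
	}

	for _, e := range entries[:excess] {
		if err := os.RemoveAll(filepath.Join(dir, e.Name())); err != nil {
			return 0, err
		}
	}

	return excess, nil
}

// demo creates three fake report directories in a temp dir, prunes down to
// two, and returns the number removed plus the names that remain.
func demo() (int, []string, error) {
	dir, err := os.MkdirTemp("", "diagnostics-demo")
	if err != nil {
		return 0, nil, err
	}
	defer os.RemoveAll(dir)

	names := []string{"20260101-a", "20260102-b", "20260103-c"}
	for _, name := range names {
		if err := os.Mkdir(filepath.Join(dir, name), 0o755); err != nil {
			return 0, nil, err
		}
	}

	removed, err := pruneReports(dir, 2)
	if err != nil {
		return 0, nil, err
	}

	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, nil, err
	}
	var left []string
	for _, e := range entries {
		left = append(left, e.Name())
	}

	return removed, left, nil
}

func main() {
	fmt.Println(demo())
}
```

A count-based cap is the simplest of the listed policies; a size-based or age-based limit would need to stat each entry as well.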

Contributor

@kaldun-tech kaldun-tech left a comment

Overall this looks solid. The only change I would strongly recommend is where Gemini flagged the missing function comments. The rest is nits that we can take care of in a future PR.

There's clean separation of concerns here and correct async patterns. Good test coverage too.

@sergey3bv sergey3bv force-pushed the feat/diagnostic-mode branch from 45ea017 to 6203bd0 Compare May 15, 2026 09:09
@sergey3bv sergey3bv requested a review from kaldun-tech May 15, 2026 09:10
@sergey3bv
Contributor Author

Hey, @kaldun-tech, I updated the PR according to your comments, could you please take a look.

Comment thread diagnostics/service.go Outdated
Comment thread tapcfg/server.go Outdated
return nil
}

func (p *ChainPorter) reportProofValidationFailure(
Contributor

Nit: functions could use clarifying comments on what they are meant to do. verifyOutputProofPreBroadcast on line 1489 and verifyPacketInputProofs on line 1731 have comments for example.


if p.cfg.DiagnosticsRecorder == nil {
return
}
Contributor

In reportProofValidationFailure we return early when diagnostics are disabled. But every call to reportProofValidationFailure passes in a ProofValidationFailure, so we build a parameter that is not necessarily used.

You can optimize this to avoid unnecessary I/O and allocations by guarding the entire block
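
One way to guard the whole block is to pass a builder function instead of a finished value, so nothing is fetched or allocated unless a recorder is configured. The sketch below uses hypothetical stand-in types (`porter`, `recorder`, `failure`, `memRecorder`), not the PR's `ChainPorter` or `ProofValidationFailure`.

```go
package main

import "fmt"

// failure and recorder are hypothetical stand-ins for the PR's diagnostics
// types.
type failure struct{ Proofs [][]byte }

type recorder interface{ Record(failure) }

type porter struct {
	rec    recorder
	builds int // counts how often the expensive payload was built
}

// reportFailure takes a builder instead of a finished failure, so proof
// fetching and allocation only happen when diagnostics are enabled.
func (p *porter) reportFailure(build func() failure) {
	if p.rec == nil {
		return
	}
	p.rec.Record(build())
}

// buildFailure simulates the expensive work of assembling a report.
func (p *porter) buildFailure() failure {
	p.builds++
	return failure{Proofs: [][]byte{{0x01}}}
}

// memRecorder stores reports in memory for the demonstration.
type memRecorder struct{ got []failure }

func (m *memRecorder) Record(f failure) { m.got = append(m.got, f) }

func main() {
	disabled := &porter{}
	disabled.reportFailure(disabled.buildFailure)

	rec := &memRecorder{}
	enabled := &porter{rec: rec}
	enabled.reportFailure(enabled.buildFailure)

	fmt.Println(disabled.builds, enabled.builds, len(rec.got))
}
```

An equivalent alternative is a plain `if p.cfg.DiagnosticsRecorder != nil { ... }` around each call site; the builder form just keeps that check in one place.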

Comment thread diagnostics/types.go Outdated
@kaldun-tech
Contributor

In my view you're on the right track in this iteration, apart from a few nitpicks and minor issues. The deep-copy logic towards the bottom of diagnostics/service.go is smart.

Contributor

@kaldun-tech kaldun-tech left a comment

For the most part the code here is solid and addresses the original issue. I believe it's worthwhile to add explanatory comments in new and updated files and remove the added nolint directive.

Comment thread diagnostics/service.go
return writtenNames, nil
}

func sanitizeFileName(name string) string {
Contributor

I recommend adding more comments, at least on the added functions, as Gemini pointed out before. What does the reader need to understand about the intent of each function? In this case, it is stripping characters other than letters, digits, dots, underscores and hyphens from a filename.

Comment thread diagnostics/service.go
defaultQueueSize = 64
)

var fileNameSanitizer = regexp.MustCompile(`[^a-zA-Z0-9._-]+`)
Contributor

Would also benefit from a comment
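
For illustration, a commented version of the sanitizer and its helper might look like the sketch below. The regular expression is the one from the diff excerpt above; the function body and the choice of `_` as the replacement are assumptions, since the PR's implementation is not shown here.

```go
package main

import (
	"fmt"
	"regexp"
)

// fileNameSanitizer matches every run of characters that is not an ASCII
// letter, digit, dot, underscore or hyphen.
var fileNameSanitizer = regexp.MustCompile(`[^a-zA-Z0-9._-]+`)

// sanitizeFileName replaces each disallowed run with a single underscore, so
// path separators and whitespace cannot end up in the file name. The "_"
// replacement is an assumption of this sketch.
func sanitizeFileName(name string) string {
	return fileNameSanitizer.ReplaceAllString(name, "_")
}

func main() {
	fmt.Println(sanitizeFileName("proof #1 (input).bin"))
	fmt.Println(sanitizeFileName("a/b c"))
}
```

Note that dots are allowed by this pattern, so a name consisting only of `.` or `..` passes through unchanged and may need separate handling by the caller.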

Comment thread diagnostics/service.go Outdated
inputProofFile, err := p.fetchInputProof(ctx, inputs[idx])
if err != nil {
return nil, fmt.Errorf("fetch input proof %d: %w",
idx, err)
Contributor

This function fails fast if an internal error occurs while iterating over the inputs. That is sensible because the function is only called when proof verification has already failed, so we want to report the error promptly and move on. Partial proof artifacts could also be misleading during a support investigation.

Readers would benefit from explanatory comments.
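
A sketch of the fail-fast pattern with the kind of explanatory comment being asked for. `collectInputProofs`, `fakeFetch` and `errMissing` are hypothetical names for illustration; only the error format string is taken from the diff excerpt above.

```go
package main

import (
	"errors"
	"fmt"
)

// errMissing is a hypothetical error used to simulate a failed fetch.
var errMissing = errors.New("proof not found")

// fakeFetch stands in for the real proof lookup.
func fakeFetch(id string) ([]byte, error) {
	if id == "missing" {
		return nil, errMissing
	}
	return []byte(id), nil
}

// collectInputProofs fetches one proof per input and fails fast: on the first
// error it returns nil rather than a partial set, because an incomplete set
// of artifacts could mislead whoever later inspects the diagnostics dump.
func collectInputProofs(inputs []string,
	fetch func(string) ([]byte, error)) ([][]byte, error) {

	proofs := make([][]byte, 0, len(inputs))
	for idx := range inputs {
		blob, err := fetch(inputs[idx])
		if err != nil {
			return nil, fmt.Errorf("fetch input proof %d: %w",
				idx, err)
		}
		proofs = append(proofs, blob)
	}

	return proofs, nil
}

func main() {
	proofs, err := collectInputProofs(
		[]string{"a", "missing", "c"}, fakeFetch,
	)
	fmt.Println(len(proofs), err, errors.Is(err, errMissing))
}
```

Wrapping with `%w` keeps the underlying error inspectable via `errors.Is`, while the index in the message tells the reader which input failed.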

@github-project-automation github-project-automation Bot moved this from 🆕 New to 👀 In review in Taproot-Assets Project Board May 19, 2026
@sergey3bv sergey3bv force-pushed the feat/diagnostic-mode branch 2 times, most recently from 836a7a0 to d3fc9be Compare May 26, 2026 13:34
@sergey3bv sergey3bv force-pushed the feat/diagnostic-mode branch from d3fc9be to 759fd49 Compare May 26, 2026 15:17
@sergey3bv sergey3bv requested a review from kaldun-tech May 27, 2026 06:42
@sergey3bv
Contributor Author

Hey, @kaldun-tech, I updated the PR according to your comments. Could you please take a look?

Contributor

@kaldun-tech kaldun-tech left a comment

Looks good to me! Curious to hear from the maintainers on the disk space and itest concerns.


Comment thread diagnostics/service.go
case s.queue <- queued:
default:
atomic.AddUint64(&s.dropped, 1)
log.Warnf("Diagnostics queue full, dropping proof failure "+
Contributor

Warning for queue full seems sensible for an initial version. If there is concern about log spam we could switch to rate-limited warnings
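
If log spam did become a concern, a rate-limited variant could look roughly like the sketch below. It keeps the non-blocking `select`/`default` send and the atomic drop counter from the excerpt above; the `service` struct, the `warnGap` field and the injected `now` parameter are assumptions added for illustration and testability.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// service is a hypothetical, reduced version of the diagnostics service.
// The 64-bit counters come first so they stay aligned for atomic access on
// 32-bit platforms.
type service struct {
	dropped  uint64
	warnings uint64
	lastWarn int64 // unix nanoseconds of the last emitted warning

	queue   chan string
	warnGap time.Duration
}

// enqueue never blocks. It returns false when the queue is full, and logs at
// most one warning per warnGap no matter how many items are dropped.
func (s *service) enqueue(item string, now time.Time) bool {
	select {
	case s.queue <- item:
		return true
	default:
	}

	dropped := atomic.AddUint64(&s.dropped, 1)

	last := atomic.LoadInt64(&s.lastWarn)
	if now.UnixNano()-last >= int64(s.warnGap) &&
		atomic.CompareAndSwapInt64(&s.lastWarn, last, now.UnixNano()) {

		atomic.AddUint64(&s.warnings, 1)
		fmt.Printf("diagnostics queue full, %d reports dropped so far\n",
			dropped)
	}

	return false
}

// demo fills a one-slot queue and then drops three more items at different
// times, returning the drop and warning counters.
func demo() (dropped, warnings uint64) {
	s := &service{queue: make(chan string, 1), warnGap: time.Second}
	t0 := time.Unix(1000, 0)

	s.enqueue("a", t0)                          // accepted
	s.enqueue("b", t0)                          // dropped, warns
	s.enqueue("c", t0.Add(10*time.Millisecond)) // dropped, silent
	s.enqueue("d", t0.Add(2*time.Second))       // dropped, warns again

	return atomic.LoadUint64(&s.dropped), atomic.LoadUint64(&s.warnings)
}

func main() {
	fmt.Println(demo())
}
```

The compare-and-swap ensures that when several goroutines hit a full queue at once, only one of them emits the warning for that interval.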

Comment thread diagnostics/service.go
func NewService(rootDir, tapdVersion string) (*Service, error) {
if strings.TrimSpace(rootDir) == "" {
return nil, fmt.Errorf(
"diagnostics root directory cannot be empty",
Contributor

Did a scan of all the error messages here. They seem appropriate.

@lightninglabs-deploy

@sergey3bv, remember to re-request review from reviewers when ready

Labels

None yet

Projects

Status: 👀 In review

Development

Successfully merging this pull request may close these issues.

[feature]: Add diagnostics mode to capture proof failures and support artefacts

3 participants