Adding diagnostics mode for proof failures#2060
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request implements a diagnostics mode designed to capture and persist artifacts related to proof validation failures. By asynchronously writing these failures to a specified directory, it provides better visibility into issues encountered during the asset transfer process, aiding in debugging and troubleshooting. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a new diagnostics service for Taproot Assets, enabling asynchronous persistence of proof validation failures to disk. The service includes a background writer, a queue for non-blocking reporting, and logic to sanitize and store failure metadata alongside binary proof artifacts. Feedback includes a critical fix for a potential race condition in the cloneFailure function where pointer fields were not being deep-copied, a request to add missing documentation for the writeFailureReport function per the style guide, and a suggestion to refactor pre-broadcast failure reporting in the ChainPorter to use existing helper methods for better consistency.
55087ee to
e39586e
Compare
|
Hey, @jtobin, |
|
Hey @sergey3bv it looks like you have a bad merge or rebase on your branch that caused the integration test issues. Found the root cause using Claude Opus.
Here's Claude's recommendations. Once you have these fixed we can proceed with the rest of the review:
|
e39586e to
d4af801
Compare
|
Hey, @kaldun-tech, I updated the PR according to your comment, could you please take a look. cc @jtobin |
eb74718 to
45ea017
Compare
|
|
||
| return nil | ||
| }, ccTransferTimeout) | ||
| }, ccTransferConfirmTimeout) |
There was a problem hiding this comment.
These changes to itest look appropriate to improve CI stability via diagnostics testing
There was a problem hiding this comment.
Sorry to flip-flop on this. My concern is that the itest changes are extensive enough to be out of scope for the PR. They look good but are orthogonal to diagnostics. It could introduce a regression and make the commit history muddled.
It's up to the repo maintainers in on whether they want to keep the itest changes bundled in this change or split them into a separate PR.
| TransferOutputIndex *int `json:"transfer_output_idx,omitempty"` | ||
| OutputProofFiles []string `json:"output_proof_files,omitempty"` | ||
| InputProofFiles []string `json:"input_proof_files,omitempty"` | ||
| } |
There was a problem hiding this comment.
Minor: The metadata.json schema is not documented. Consider adding a tapd version field in a future PR.
Something like: TapdVersion string json:"tapd_version,omitempty"
This would help support teams know which tapd version produced the diagnostics dump when it happens
There was a problem hiding this comment.
Please add explanations of why the fields that are marked omitempty are safe to be omitted. A reader can't tell from the struct definition alone which fields appear in which stage's reports.
| } | ||
| } | ||
|
|
||
| func (s *Service) writer() { |
There was a problem hiding this comment.
Is there a reason to skip a comment block on these new functions?
| } | ||
|
|
||
| return clones | ||
| } |
There was a problem hiding this comment.
File level concern: You don't have a mechanism for disk-space management. Ex:
- Limit total directory size
- Limit number of failure reports
- Delete old reports (retention policy)
- Rotate/archive old runs
Every proof validation failure can write to disk indefinitely. So the risk is if a node has persistent proof validation issues, the diagnostics directory could grow unbounded.
This seems acceptable for version 1 as the feature is expliclty for debugging. In a future enhancement you could add flags --diagnostics-max-size or --diagnostics-retention-days
kaldun-tech
left a comment
There was a problem hiding this comment.
Overall this looks solid. The only change I would strongly recommend is where Gemini flagged the missing function comments. The rest is nits that we can take care of in a future PR.
There's clean separation of concerns here and correct async patterns. Good test coverage too.
45ea017 to
6203bd0
Compare
|
Hey, @kaldun-tech, I updated the PR according to your comments, could you please take a look. |
| return nil | ||
| } | ||
|
|
||
| func (p *ChainPorter) reportProofValidationFailure( |
There was a problem hiding this comment.
Nit: functions could use clarifying comments on what they are meant to do. verifyOutputProofPreBroadcast on line 1489 and verifyPacketInputProofs on line 1731 have comments for example.
|
|
||
| if p.cfg.DiagnosticsRecorder == nil { | ||
| return | ||
| } |
There was a problem hiding this comment.
In reportProofValidationFailure for disabled diagnostics we return nil. But the calls to reportProofValidationFailure passes in a ProofValidationFailure. So we build a parameter that is not necessarily used.
You can optimize this to avoid unnecessary I/O and allocations by guarding the entire block
|
In my view you're on the right track in this iteration except for a few nitpicks & minor issues. The deep copy logic is smart towards the bottom of diagnostics/service.go. |
kaldun-tech
left a comment
There was a problem hiding this comment.
For the most part the code here is solid and addresses the original issue. I believe it's worthwhile to add explanatory comments in new and updated files and remove the added nolint directive.
| TransferOutputIndex *int `json:"transfer_output_idx,omitempty"` | ||
| OutputProofFiles []string `json:"output_proof_files,omitempty"` | ||
| InputProofFiles []string `json:"input_proof_files,omitempty"` | ||
| } |
There was a problem hiding this comment.
Please add explanations of why the fields that are marked omitempty are safe to be omitted. A reader can't tell from the struct definition alone which fields appear in which stage's reports.
| return writtenNames, nil | ||
| } | ||
|
|
||
| func sanitizeFileName(name string) string { |
There was a problem hiding this comment.
I recommend to add more comments on at least the added functions as Gemini pointed out before. What does the reader need to understand about the intent of each function? In this case stripping non-alphanumeric characters from a filename
| defaultQueueSize = 64 | ||
| ) | ||
|
|
||
| var fileNameSanitizer = regexp.MustCompile(`[^a-zA-Z0-9._-]+`) |
There was a problem hiding this comment.
Would also benefit from a comment
| inputProofFile, err := p.fetchInputProof(ctx, inputs[idx]) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("fetch input proof %d: %w", | ||
| idx, err) |
There was a problem hiding this comment.
This function fails fast if there's an internal error while iterating over the inputs. The reason this is sensible is this function is called in cases where proof verification failed so we want to report the error promptly and move on. Partial proof artifacts could be misleading during support investigation.
Readers would benefit from explanatory comments.
836a7a0 to
d3fc9be
Compare
d3fc9be to
759fd49
Compare
|
Hey, @kaldun-tech, I updated the PR according to your comments. Could you please take a look? |
kaldun-tech
left a comment
There was a problem hiding this comment.
Lookds good to me! Curious to hear from the maintainers on the disk space and itest concerns.
|
|
||
| return nil | ||
| }, ccTransferTimeout) | ||
| }, ccTransferConfirmTimeout) |
There was a problem hiding this comment.
Sorry to flip-flop on this. My concern is that the itest changes are extensive enough to be out of scope for the PR. They look good but are orthogonal to diagnostics. It could introduce a regression and make the commit history muddled.
It's up to the repo maintainers in on whether they want to keep the itest changes bundled in this change or split them into a separate PR.
| case s.queue <- queued: | ||
| default: | ||
| atomic.AddUint64(&s.dropped, 1) | ||
| log.Warnf("Diagnostics queue full, dropping proof failure "+ |
There was a problem hiding this comment.
Warning for queue full seems sensible for an initial version. If there is concern about log spam we could switch to rate-limited warnings
| func NewService(rootDir, tapdVersion string) (*Service, error) { | ||
| if strings.TrimSpace(rootDir) == "" { | ||
| return nil, fmt.Errorf( | ||
| "diagnostics root directory cannot be empty", |
There was a problem hiding this comment.
Did a scan of all the error messages here. They seem appropriate.
|
@sergey3bv, remember to re-request review from reviewers when ready |
Should close #1867