
fix!: retrieve transactions from the mempool async, ignore irrelevant rounds, and don't start processing messages until switching from blocksync to consensus #1695

Merged
evan-forbes merged 29 commits into feature/recovery from evan/recovery/debug-mamo on Apr 2, 2025

Conversation

@evan-forbes
Member

@evan-forbes evan-forbes commented Mar 26, 2025

Description

This PR started as a simple fix, but its scope grew significantly as we found more bugs while getting the testnet to work.

It:

  • makes retrieving transactions from the mempool async during propagation.
  • adjusts gossip priorities
  • stops CAT from automatically broadcasting full transactions and sends only SeenTx messages instead (this stops peers from downloading transactions more than once)
  • makes minor optimizations around not verifying proofs that the node itself just generated.
  • NOTE: this PR reverts fix: remove go routines for RecheckTx #1553 and fix: mempool locking mechanism in v1 and cat #1582 because those break CAT and we need a working mempool that doesn't re-gossip everything
  • ignores irrelevant rounds
  • processes only compact blocks until the consensus reactor starts
  • fixes a few other minor bugs
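The async mempool retrieval in the first bullet can be sketched as below. This is a minimal illustration of the pattern under assumed names (`fetchTxsAsync`, `Tx` are hypothetical), not the actual celestia-core code: the propagation path hands the slow mempool read to a goroutine and receives the result on a channel instead of blocking.

```go
package main

import (
	"fmt"
	"time"
)

// Tx is a stand-in for a mempool transaction (hypothetical type).
type Tx []byte

// fetchTxsAsync runs the mempool read in a goroutine and delivers the
// result on a buffered channel, so the caller never blocks on the mempool.
func fetchTxsAsync(retrieve func() []Tx) <-chan []Tx {
	out := make(chan []Tx, 1)
	go func() {
		out <- retrieve()
		close(out)
	}()
	return out
}

func main() {
	slowMempool := func() []Tx {
		time.Sleep(10 * time.Millisecond) // simulate a slow mempool read
		return []Tx{Tx("tx1"), Tx("tx2")}
	}
	txsCh := fetchTxsAsync(slowMempool)
	// the caller is free to do other propagation work here
	txs := <-txsCh
	fmt.Println(len(txs))
}
```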

@evan-forbes evan-forbes self-assigned this Mar 26, 2025
@evan-forbes evan-forbes requested a review from a team as a code owner March 26, 2025 18:37
@evan-forbes evan-forbes requested review from rach-id and rootulp and removed request for a team March 26, 2025 18:37
@celestia-bot celestia-bot requested a review from tzdybal March 26, 2025 18:37
Comment thread mempool/cat/pool.go
Comment thread mempool/cat/pool.go
Comment thread mempool/cat/store_test.go
Comment thread mempool/v1/mempool.go
Comment thread mempool/tx.go Outdated
Comment thread store/store.go Outdated
@celestia-bot celestia-bot requested a review from rach-id March 27, 2025 18:44
Comment thread consensus/propagation/have_wants.go
func (p *ProposalCache) AddProposal(cb *proptypes.CompactBlock) (added bool) {
p.pmtx.Lock()
defer p.pmtx.Unlock()
if cb.Proposal.Height <= p.consensusHeight {
Member


so are we no longer storing past proposals? or does this mean we already have it?

Comment thread consensus/propagation/catchup.go Outdated
@celestia-bot celestia-bot requested a review from rach-id March 30, 2025 21:24
@evan-forbes evan-forbes changed the title from "fix: retrieve transactions from the mempool async" to "fix: retrieve transactions from the mempool async, ignore irrelevant rounds, and don't start processing messages until switching from blocksync to consensus" Mar 31, 2025
Comment thread consensus/propagation/commitment_state.go Outdated
Comment thread consensus/propagation/commitment_state.go
Comment thread consensus/propagation/types/unmarshalling_test.go
@celestia-bot celestia-bot requested a review from rach-id April 2, 2025 16:26
Member Author

@evan-forbes evan-forbes left a comment


TIL GitHub keeps all comments pending, if a review was started somewhere, until that review is finished...

Comment on lines +54 to +57
for _, ind := range original.BitArray().GetTrueIndices() {
ps.totalMap.SetIndex(ind, true)
}
return ps
Member Author


the only reason catchup was working before this addition was because we were caching so many past blocks that, most of the time, nodes could still rely on the compact block being in their peers' cache

Comment on lines +22 to +28
data := blockProp.unfinishedHeights()
peers := blockProp.getPeers()
for _, prop := range data {
height, round := prop.compactBlock.Proposal.Height, prop.compactBlock.Proposal.Round

if height == currentHeight && round == currentRound {
// don't re-request parts for any round on the current height
if height == currentHeight {
Member Author


each catchup message contributes to congestion, so if we leave old rounds here we introduce a feedback loop that eventually halts the chain as nodes fall further and further behind. This PR only applies catchup to the highest round this node has seen per height, and it also ignores all rounds of the current height.
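The filter described above can be sketched as a small predicate. The names here (`shouldRequestParts`, `highestSeenRound`) are illustrative assumptions, not the actual celestia-core API: skip every round of the current height, and for past heights only act on the highest round this node has seen.

```go
package main

import "fmt"

// shouldRequestParts decides whether catch-up may re-request parts for a
// proposal: never for the current height, and for past heights only for
// the highest round seen, so stale rounds cannot feed the congestion loop.
func shouldRequestParts(height, round, currentHeight, highestSeenRound int64) bool {
	if height == currentHeight {
		return false // don't re-request parts for any round on the current height
	}
	return round == highestSeenRound
}

func main() {
	fmt.Println(shouldRequestParts(10, 0, 10, 2)) // current height: skip
	fmt.Println(shouldRequestParts(9, 1, 10, 2))  // stale past round: skip
	fmt.Println(shouldRequestParts(9, 2, 10, 2))  // highest seen round: request
}
```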

Comment on lines -68 to +78
missing.Sub(mc)
missing = missing.Sub(mc)
Member Author


oof, annoying bug

bit arrays are mutated through their pointer everywhere else, but Sub (and probably Add?) returns a new value instead of mutating the receiver, so the result has to be reassigned
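The pitfall in the diff above can be reproduced with a toy bit array. `BitArray` here is a stand-in sketch of the real bits type, reduced to just what is needed to show why `missing.Sub(mc)` alone is a no-op while `missing = missing.Sub(mc)` works:

```go
package main

import "fmt"

// BitArray is a toy stand-in for the real bits.BitArray, kept only to
// demonstrate the value-vs-mutation pitfall.
type BitArray struct{ bits []bool }

func NewBitArray(n int) *BitArray { return &BitArray{bits: make([]bool, n)} }

// SetIndex mutates the receiver in place, like most of the API.
func (b *BitArray) SetIndex(i int, v bool) { b.bits[i] = v }

// Sub returns a NEW bit array with o's bits cleared; unlike SetIndex it
// does not touch the receiver, so its result must be reassigned.
func (b *BitArray) Sub(o *BitArray) *BitArray {
	out := NewBitArray(len(b.bits))
	for i := range b.bits {
		out.bits[i] = b.bits[i] && !o.bits[i]
	}
	return out
}

func (b *BitArray) Count() int {
	n := 0
	for _, v := range b.bits {
		if v {
			n++
		}
	}
	return n
}

func main() {
	missing := NewBitArray(3)
	missing.SetIndex(0, true)
	missing.SetIndex(1, true)
	mc := NewBitArray(3)
	mc.SetIndex(0, true)

	missing.Sub(mc)              // buggy: returned value discarded
	fmt.Println(missing.Count()) // still 2

	missing = missing.Sub(mc)    // fixed: reassign the result
	fmt.Println(missing.Count()) // now 1
}
```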

Comment thread consensus/propagation/catchup.go Outdated
Comment thread consensus/propagation/have_wants.go
Comment on lines -90 to +96
Priority: 20,
Priority: 45,
SendQueueCapacity: 20000,
RecvMessageCapacity: maxMsgSize,
MessageType: &propproto.Message{},
},
{
ID: DataChannel,
Priority: 15,
Priority: 40,
Member Author


in general, raising the priorities of the important channels and lowering the mempool's priorities helps ensure we only gossip mempool txs when we have spare bandwidth.

Comment thread consensus/propagation/types/unmarshalling.go
Comment thread mempool/cat/reactor.go
Comment on lines 393 to +397
func (memR *Reactor) broadcastNewTx(wtx *wrappedTx) {
msg := &protomem.Message{
Sum: &protomem.Message_Txs{
Txs: &protomem.Txs{
Txs: [][]byte{wtx.tx.Tx},
Sum: &protomem.Message_SeenTx{
SeenTx: &protomem.SeenTx{
TxKey: wtx.tx.Hash(),
Member Author


note the change here

Naively it makes sense to broadcast txs automatically, but the traces show this causes many nodes to download a transaction more than once: they see a SeenTx from another peer before they finish downloading the tx themselves, and then request it again
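The SeenTx-only behavior can be sketched with a tiny dedup tracker. The types here (`downloader`, `txKey`, `onSeenTx`) are hypothetical and not the CAT reactor's actual code; the point is only that a peer which already has, or is already fetching, a tx issues no second request when another SeenTx for the same key arrives.

```go
package main

import "fmt"

// txKey is a stand-in for the transaction hash used as the SeenTx key.
type txKey string

// downloader tracks txs that are present or in flight, so a SeenTx for a
// known key never triggers a duplicate download.
type downloader struct {
	known map[txKey]bool
}

// onSeenTx returns true only the first time a key is seen, i.e. when
// exactly one request should be issued.
func (d *downloader) onSeenTx(k txKey) (requested bool) {
	if d.known[k] {
		return false // already have it or already fetching it
	}
	d.known[k] = true
	return true
}

func main() {
	d := &downloader{known: map[txKey]bool{}}
	fmt.Println(d.onSeenTx("abc")) // first SeenTx: request the tx
	fmt.Println(d.onSeenTx("abc")) // SeenTx from another peer: skip
}
```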

Comment thread consensus/propagation/commitment_state.go Outdated
Comment thread consensus/propagation/commitment_state.go
@evan-forbes evan-forbes enabled auto-merge (squash) April 2, 2025 16:28
@evan-forbes evan-forbes disabled auto-merge April 2, 2025 19:48
Member

@rach-id rach-id left a comment


:shipit:


_, partSet, _, found := blockProp.getAllState(cb.Proposal.Height, cb.Proposal.Round, false)
if !found {
blockProp.Logger.Error("failed to get all state for this node's proposal", "height", cb.Proposal.Height, "round", cb.Proposal.Round)
Member

@rach-id rach-id Apr 2, 2025


if we don't have the compact block or the partset, shouldn't we stop at this point?

Comment on lines +112 to +120
if height <= p.consensusHeight {
return false
}

if round < p.consensusRound {
return false
}

return true
Member


potential refactor:

return height > p.consensusHeight && round >= p.consensusRound
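One detail worth checking mechanically: when collapsing sequential early-return guards into a single expression, the conditions combine with `&&`, not `||` (each guard must pass for the function to return true). A small exhaustive check, using standalone copies of both forms, confirms the equivalence:

```go
package main

import "fmt"

// guard-style form, as in the diff above
func relevantGuards(height, consensusHeight, round, consensusRound int64) bool {
	if height <= consensusHeight {
		return false
	}
	if round < consensusRound {
		return false
	}
	return true
}

// single-expression form: negating each early return and AND-ing them
func relevantExpr(height, consensusHeight, round, consensusRound int64) bool {
	return height > consensusHeight && round >= consensusRound
}

func main() {
	agree := true
	// exhaustively compare the two forms over a small grid
	for h := int64(0); h < 3; h++ {
		for r := int64(0); r < 3; r++ {
			if relevantGuards(h, 1, r, 1) != relevantExpr(h, 1, r, 1) {
				agree = false
			}
		}
	}
	fmt.Println(agree)
}
```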

Comment thread consensus/propagation/types/unmarshalling.go
@celestia-bot celestia-bot requested a review from rach-id April 2, 2025 22:16
Comment on lines -46 to +61
func TestReactorBroadcastTxsMessage(t *testing.T) {
config := cfg.TestConfig()
const N = 5
reactors := makeAndConnectReactors(t, config, N)

txs := checkTxs(t, reactors[0].mempool, numTxs, mempool.UnknownPeerID)
sort.Slice(txs, func(i, j int) bool {
return txs[i].priority > txs[j].priority // N.B. higher priorities first
})
transactions := make(types.Txs, len(txs))
for idx, tx := range txs {
transactions[idx] = tx.tx
}

waitForTxsOnReactors(t, transactions, reactors)
}
// todo: re-add this test after debugging it
// func TestReactorBroadcastTxsMessage(t *testing.T) {
// config := cfg.TestConfig()
// const N = 5
// reactors := makeAndConnectReactors(t, config, N)

// txs := checkTxs(t, reactors[0].mempool, numTxs, mempool.UnknownPeerID)
// sort.Slice(txs, func(i, j int) bool {
// return txs[i].priority > txs[j].priority // N.B. higher priorities first
// })
// transactions := make(types.Txs, len(txs))
// for idx, tx := range txs {
// transactions[idx] = tx.tx
// }

// waitForTxsOnReactors(t, transactions, reactors)
// }
Member Author


there's a weird bug in CAT at the moment where not all txs from the RPC are broadcast, at least in this test: the original broadcaster simply isn't sending the SeenTxs

@evan-forbes evan-forbes enabled auto-merge (squash) April 2, 2025 22:26
@evan-forbes evan-forbes merged commit 98db3ce into feature/recovery Apr 2, 2025
18 of 20 checks passed
@evan-forbes evan-forbes deleted the evan/recovery/debug-mamo branch April 2, 2025 22:33
@evan-forbes evan-forbes restored the evan/recovery/debug-mamo branch April 3, 2025 18:47
rach-id pushed a commit that referenced this pull request May 21, 2025
…rounds, and don't start processing messages until switching from blocksync to consensus (#1695)

