Lock Sliding Sync connections when inserting lazy members, to prevent repeated deadlocks.#19826

Open
reivilibre wants to merge 5 commits into develop from rei/ss_deadlock

@reivilibre

Contributor

Got paged today for this. The sliding sync worker in question had loads of deadlocks in the logs.
I restarted it and it got unwedged, but we should have a more robust defence, which this PR proposes.

psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 257324 waits for ShareLock on transaction 688227036; blocked by process 254908.
Process 254908 waits for ShareLock on transaction 688222971; blocked by process 256179.
Process 256179 waits for ExclusiveLock on tuple (302352,92) of relation 2962200779 of database 16403; blocked by process 257213.
Process 257213 waits for ShareLock on transaction 688225005; blocked by process 254905.
Process 254905 waits for ShareLock on transaction 688228814; blocked by process 257324.
HINT:  See server log for query details.
CONTEXT:  while inserting index tuple (183070,103) in relation "sliding_sync_connection_lazy_members"

I wonder if an unfortunate side effect is that these repeated attempts leave a lot of dead tuples on the table,
which would then slow down the next attempt to insert the tuples
and, I suspect, make it more likely that they will deadlock again (?).


By acquiring a FOR NO KEY UPDATE lock upfront before beginning work, we can ensure that one
of the transactions gets queued behind the other one, meaning the first one can succeed unimpeded.

FOR NO KEY UPDATE blocks other FOR NO KEY UPDATE locks and is the weakest lock level that blocks itself.
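
As a rough sketch of the idea (not the actual diff in this PR), the lock can be taken as the first statement of the transaction. The helper names `lock_connection_for_write` and `insert_lazy_members`, and the `RecordingCursor` stand-in, are hypothetical and exist only so the snippet runs without a database. The `?` placeholder follows the query style quoted in this thread, and the column list is taken from the unique index definition.

```python
from typing import Any, List, Tuple

LOCK_CONNECTION_SQL = """
    SELECT 1
    FROM sliding_sync_connections
    WHERE connection_key = ?
    FOR NO KEY UPDATE
"""


class RecordingCursor:
    """Hypothetical stand-in for a DB cursor: records statements instead of running them."""

    def __init__(self) -> None:
        self.statements: List[Tuple[str, Tuple[Any, ...]]] = []

    def execute(self, sql: str, args: Tuple[Any, ...] = ()) -> None:
        # Collapse whitespace so the recorded SQL is easy to inspect.
        self.statements.append((" ".join(sql.split()), args))


def lock_connection_for_write(txn: Any, connection_key: int) -> None:
    """Take the row lock first, so a concurrent writer for the same
    connection queues here instead of interleaving its inserts with ours."""
    txn.execute(LOCK_CONNECTION_SQL, (connection_key,))


def insert_lazy_members(
    txn: Any, connection_key: int, room_id: str, user_ids: List[str]
) -> None:
    # Assumption: only the three indexed columns are shown; the real table may have more.
    lock_connection_for_write(txn, connection_key)
    for user_id in user_ids:
        txn.execute(
            "INSERT INTO sliding_sync_connection_lazy_members"
            " (connection_key, room_id, user_id) VALUES (?, ?, ?)",
            (connection_key, room_id, user_id),
        )
```

Against a real Postgres connection, the second transaction to reach the `SELECT ... FOR NO KEY UPDATE` would wait until the first commits or rolls back, so the inserts never run interleaved for the same `connection_key`.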

@reivilibre reivilibre marked this pull request as ready for review June 5, 2026 02:57
@reivilibre reivilibre requested a review from a team as a code owner June 5, 2026 02:57
# https://www.postgresql.org/docs/current/explicit-locking.html#LOCKING-ROWS
# (We could also consider sorting our insertions, but not clear if Postgres
# guarantees to preserve the insertion order)
txn.execute(

Member

We probably want to go further than this and block all concurrent writes on a given connection to also avoid serialisation failures. Generally it's also good to do as much locking as possible at the start of the transaction.

We already fetch the connection_key from sliding_sync_connections as the first query in the transaction, so it might just be as easy as adding the locking there?

Contributor Author

Yup derp, I did mean to go to the top but it didn't occur to me to check whether this was called as part of something else. I will blame the 3am factor.

Signed-off-by: Olivier 'reivilibre <oliverw@matrix.org>
Comment on lines +217 to +222
# Specifically, the statements seen to deadlock against
# each other were
# `INSERT INTO sliding_sync_connection_lazy_members`
# with conflicting tuples on
# "sliding_sync_connection_lazy_members_idx" UNIQUE, btree
# (connection_key, room_id, user_id)

Contributor

For my own reference, can someone explain the real-life situation that causes this?

Someone is sending multiple concurrent requests with the same connection_key?

Contributor Author

It would appear so, I guess? There were multiple users involved.

I suppose it's possible that some of these were retries after a connection dropped/timed out/... or something like that.
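
To make the suspected scenario concrete, here is a toy model in plain Python threads, not Postgres: two concurrent requests on the same `connection_key` insert overlapping members in opposite orders. Each `threading.Lock` in `row_locks` stands in for a contended unique-index entry, and the per-connection gate stands in for the upfront `FOR NO KEY UPDATE` lock. All names here are hypothetical, and the real deadlock in the log involved five processes rather than two.

```python
import threading
from collections import defaultdict
from typing import Dict, List, Set

# One "row lock" per user_id stands in for a contended unique-index entry.
row_locks: Dict[str, threading.Lock] = defaultdict(threading.Lock)
# One gate per connection_key stands in for the upfront row lock.
connection_gates: Dict[int, threading.Lock] = defaultdict(threading.Lock)
# Guards the two dictionaries above while locks are looked up or created.
registry_guard = threading.Lock()
inserted: Set[str] = set()


def fake_transaction(connection_key: int, user_ids: List[str]) -> None:
    with registry_guard:
        gate = connection_gates[connection_key]
    # Holding the gate for the whole "transaction" means the other request
    # waits here, before it has taken any row locks of its own.
    with gate:
        held = []
        for user_id in user_ids:
            with registry_guard:
                lock = row_locks[user_id]
            lock.acquire()
            held.append(lock)
            inserted.add(user_id)
        # "Commit": release every row lock taken by this transaction.
        for lock in reversed(held):
            lock.release()


def run_concurrently() -> bool:
    """Run two requests with opposite insertion orders; True if both finish."""
    a = threading.Thread(
        target=fake_transaction, args=(1, ["@alice", "@bob"]), daemon=True
    )
    b = threading.Thread(
        target=fake_transaction, args=(1, ["@bob", "@alice"]), daemon=True
    )
    a.start()
    b.start()
    a.join(timeout=5)
    b.join(timeout=5)
    return not (a.is_alive() or b.is_alive())
```

Without the gate, each thread could hold one row lock while waiting for the other's, which is the classic lock-ordering deadlock. With the gate, one request simply queues behind the other.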

SELECT 1
FROM sliding_sync_connections
WHERE connection_key = ?
FOR NO KEY UPDATE

@erikjohnston erikjohnston Jun 11, 2026

Member

I don't suppose we can optionally add this to the query above if it's Postgres? Or does it not work for more complex select statements?

@reivilibre reivilibre Jun 15, 2026

Contributor Author

That's on a different table.
We need a lock over the connection_key so it feels like putting it on the sliding_sync_connections table might be best.

I'm not really seeing a sensible way to rejig this otherwise: in the other branch of the if-else, there may not be any rows on sliding sync connection positions to lock.

Member

The select above joins on sliding_sync_connections so I think should work? You can also do FOR NO KEY UPDATE OF sliding_sync_connections by the looks of it

Contributor Author

Ahhhh right I wasn't aware of OF xxx. That'll do it
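
A minimal sketch of what that suggestion might look like, assuming a simplified join: the `OF` clause limits the row lock to `sliding_sync_connections`, and it is only appended on Postgres because SQLite has no row-level locking clause. The function name `build_connection_select`, the `connection_position` column, and the table definitions in the demo are assumptions for illustration, not the actual query or schema from the PR.

```python
import sqlite3
from typing import List, Tuple


def build_connection_select(is_postgres: bool) -> str:
    """Build the (hypothetical) first query of the transaction."""
    sql = """
        SELECT connection_key
        FROM sliding_sync_connection_positions
        INNER JOIN sliding_sync_connections USING (connection_key)
        WHERE sliding_sync_connection_positions.connection_position = ?
    """
    if is_postgres:
        # Lock only the sliding_sync_connections row, not the joined positions row.
        sql += " FOR NO KEY UPDATE OF sliding_sync_connections"
    return sql


def demo_sqlite() -> List[Tuple[int]]:
    """Run the non-Postgres variant against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sliding_sync_connections (connection_key INTEGER PRIMARY KEY);
        CREATE TABLE sliding_sync_connection_positions (
            connection_position INTEGER PRIMARY KEY,
            connection_key INTEGER NOT NULL
        );
        INSERT INTO sliding_sync_connections VALUES (7);
        INSERT INTO sliding_sync_connection_positions VALUES (100, 7);
        """
    )
    rows = conn.execute(build_connection_select(is_postgres=False), (100,)).fetchall()
    conn.close()
    return rows
```

Folding the lock into the existing first query keeps all locking at the start of the transaction and avoids a separate round trip for the `SELECT 1 ... FOR NO KEY UPDATE` statement.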

@reivilibre reivilibre requested a review from erikjohnston June 15, 2026 11:28

@erikjohnston erikjohnston left a comment

Member

Thanks!

3 participants