Lock Sliding Sync connections when inserting lazy members, to prevent repeated deadlocks.#19826
Lock Sliding Sync connections when inserting lazy members, to prevent repeated deadlocks.#19826reivilibre wants to merge 5 commits into
Conversation
| # https://www.postgresql.org/docs/current/explicit-locking.html#LOCKING-ROWS | ||
| # (We could also consider sorting our insertions, but not clear if Postgres | ||
| # guarantees to preservee the insertion order) | ||
| txn.execute( |
There was a problem hiding this comment.
We probably want to go further than this and block all concurrent writes on a given connection to also avoid serialisation failures. Generally it's also good to do as much locking as possible at the start of the transaction.
We already fetch the connection_key from sliding_sync_connections as the first query in the transaction, so it might just be as easy as adding the locking there?
There was a problem hiding this comment.
Yup derp, I did mean to go to the top but didn't occur to me to check whether this was called as part of something else. I will blame the 3am factor.
8c2d312 to
b1c313e
Compare
| # Specifically, the statements seen to deadlock against | ||
| # each other were | ||
| # `INSERT INTO sliding_sync_connection_lazy_members` | ||
| # with conflicting tuples on | ||
| # "sliding_sync_connection_lazy_members_idx" UNIQUE, btree | ||
| # (connection_key, room_id, user_id) |
There was a problem hiding this comment.
For my own reference, can someone explain the real life situations that causes this?
Someone is sending multiple concurrent requests with the same connection_key?
There was a problem hiding this comment.
It would appear so, I guess? There were multiple users involved.
I suppose it's possible that some of these were retries after a connection dropped/timed out/... or something like that.
| SELECT 1 | ||
| FROM sliding_sync_connections | ||
| WHERE connection_key = ? | ||
| FOR NO KEY UPDATE |
There was a problem hiding this comment.
I don't suppose we can optionally add this to the query above if its postgres? Or does it not work for more complex select statements?
There was a problem hiding this comment.
That's on a different table.
We need a lock over the connection_key so it feels like putting it on the sliding_sync_connections table might be best.
I'm not really seeing a sensible way to rejig this otherwise; in the other branch of the if-else there may not be rows on sliding sync connection positions to lock
There was a problem hiding this comment.
The select above joins on sliding_sync_connections so I think should work? You can also do FOR NO KEY UPDATE OF sliding_sync_connections by the looks of it
There was a problem hiding this comment.
Ahhhh right I wasn't aware of OF xxx. That'll do it
Got paged today for this. The sliding sync worker in question had loads of deadlocks in the logs.
I restarted it and it got unwedged, but we should have a more robust defence, which this PR proposes.
I wonder if an unfortunate side effect is that these repeated attempts leave a lot of dead tuples on the table,
which would then harm the performance of the next attempt to insert the tuples,
I suspect making it more likely that they will deadlock again (?).
By acquring a
FOR NO KEY UPDATElock upfront before beginning work, we can ensure that oneof the transactions gets queued behind the other one, meaning the first one can succeed unimpeded.
FOR NO KEY UPDATEblocks otherFOR NO KEY UPDATElocks and is the weakest lock level that blocks itself.