functional tests: wait for sync and wait for height helpers, expose timeouts to environment variables, timeouts for stallment to avoid misleading breaks#1038
Conversation
3944590 to
504f98f
Compare
| except (ReadTimeout, RequestsConnectionError) as exc: | ||
| last_exc = exc | ||
| self.log.debug( | ||
| self.log_msg(f"Transient error on {method}, retrying: {exc}") | ||
| ) | ||
| time.sleep(1) | ||
| continue |
There was a problem hiding this comment.
NIT: tested locally with this snippet:
@pytest.mark.rpc
def test_rpc_after_daemon_killed(florestad_node):
florestad = florestad_node
florestad.rpc.get_blockchain_info() # alive
florestad.daemon.process.kill()
florestad.daemon.process.wait()
start = time.time()
try:
florestad.rpc.get_blockchain_info()
except Exception:
elapsed = time.time() - start
pytest.fail(f"RPC took {elapsed:.2f}s to fail")
ran on both master and on this PR
an RPC call after the daemon is killed fails in 0.00s on master but on this branch it takes 60.27s before failing which is exactly the REQUEST_STALE_TIMEOUT default
the retry on ConnectionError makes sense for the startup race under -n auto....but it also fires after a successful RPC, so a daemon crash mid-test eats the full 60s before failing......
could the retry distinguish "haven't connected yet" from "was connected, now isn't"?
orrrrr Is the 60s wait time not much of a big deal??
There was a problem hiding this comment.
Nice catch!
orrrrr Is the 60s wait time not much of a big deal??
No, it is, were trying to save time here, the intent for those timeouts is to help slow runners.
I think its possible to know wheter a node died or is just non-responsive. Another nice thing for this pr is the centralization for those requests, so this should be straightforward.
There was a problem hiding this comment.
alright....make sense
|
For the "wait fro sync" part, I think core have a RPC for that. It basically hangs until the daemon reaches a certain height. We could do that as well |
This is an excellent idea, but the error that will occur if the sync hasn’t finished will be a request timeout. In that case, we just need to map that error to indicate the correct cause. Currently, the maximum time we wait for a request to complete is 15 seconds. For chain sync, we may need to increase that limit, because on weaker computers 15 seconds might not be enough. I think we can safely raise it to somewhere between 45 and 60 seconds. |
| prev_height = None | ||
| prev_validated = None | ||
|
|
||
| while True: |
There was a problem hiding this comment.
Using while true is not a good idea. The ideal approach is to handle the timeout case here, because inside the while, if that happens, it can exit by returning immediately. If it keeps going after the timeout, it ends up causing an error.
In this PR here: #897 I added a wait_until helper that already does this automatically, so it makes things easier.
Yeah, this is really interesting... Moves the timeout to the node side and perhaps will get rid of some python overhead. But for this Pr I insist on the current python-centric approach, I think its better to leave this specific rpc to work in the future and bring a more elegant solution for it. |
…y logic Add wait_for_sync and wait_for_height to FlorestaTestFramework, both using stale-state detection: the countdown resets on progress and only fires when the node truly stalls. Both tolerate transient RPC errors (ReadTimeout, ConnectionError) by retrying within the stale window instead of failing immediately. Add retry logic to BaseRPC.perform_request: transient errors are retried for up to REQUEST_STALE_TIMEOUT before propagating. Rename BaseRPC.TIMEOUT to REQUEST_TIMEOUT. Extract all hardcoded timeouts into environment-variable-backed constants so slow runners can increase tolerance: - FLORESTA_SYNC_TIMEOUT (default: 120s) - FLORESTA_PEER_CONNECTION_TIMEOUT (default: 30s) - FLORESTA_REQUEST_TIMEOUT (default: 15s) - FLORESTA_REQUEST_STALE_TIMEOUT (default: 60s) Document all environment variables in doc/running-tests.md
…_height Replace inline sync-waiting loops across test files with the new framework helpers. Tests syncing from utreexod use wait_for_sync (requires IBD completion), while getblock (syncing from bitcoind) uses wait_for_height.
504f98f to
a39ef8d
Compare
| ) | ||
|
|
||
|
|
||
| def wait_until(predicate, *, timeout=SYNC_TIMEOUT, interval=0.5): |
There was a problem hiding this comment.
Move this to util.py
Description and Notes
Here i present some general DX improvements for the functional tests, with simple changes on how we deal with timeouts.
The tests are not bounded by CPU speed but bounded by IO heavy usage, thats why the stale check helps to provide preciser timeouts.
Heres a breakdown of the impact of the changes on my machine:
With this we can add the
-n autoflag to CI and see if anything breaks, Ill do that after the checks for this PR are completed and ill add them here.CI runs
Yeah, CI is a lost cause for performance... luisschwab When you will give us a CI runner ?
How to verify the changes you have done?
Run the prepare, the general
just test-functionaldoes not accept args and will run prepare and build binaries whitout need:just test-functional-prepare
In master and checked in this branch:
just test-functional-run '-n auto'
and youll see that tests will break on master but not on this branch.
Speculation
The major bound still IO, this is proven by the tests breaking because of READ timeouts. Perhaps we could make tests to run totally on ram ? Can we cache datadirs ? Can we avoid to write in disk on test environments ? The logs are being more usefull to debug running phase than datadirs itself
The tests got faster on more workers not necessarily because of CPU working more but because more tests were running at the same time, IO usage spiked even more in this case.
Also, this show us that we can optimize tests even more... Why ? you ask me.
Its obvious! Faster and well engineered test framework will allow us to run those tests on more environments, being able to verify floresta functionality in more cases and even add more extensive heavy load tests whitout bothering about the execution phase.