functional tests: wait for sync and wait for height helpers, expose timeouts to environment variables, timeouts for stallment to avoid misleading breaks by jaoleal · Pull Request #1038 · getfloresta/Floresta

jaoleal · 2026-05-07T18:12:30Z

Description and Notes

Here I present some general DX improvements for the functional tests, with simple changes to how we deal with timeouts.

The tests are bound not by CPU speed but by heavy IO usage, which is why the stale check helps provide more precise timeouts.

Here's a breakdown of the impact of the changes on my machine:

- master, no workers definition (defaults to 4) - 28 passed in 252.09s (0:04:12).

- master, auto workers definition (14) - 3 failed, 22 passed in 90.21s (0:01:30). test_get_ping, test_add_node_v2 and getconnectioncount failed because an IO read reached its timeout.

- this branch, no workers definition (defaults to 4) - 28 passed in 257.95s (0:04:17).

- this branch, auto workers definition (14) - 28 passed in 143.31s (0:02:23).

With this we can add the -n auto flag to CI and see if anything breaks. I'll do that after the checks for this PR are completed and add the results here.

CI runs

master, no workers definition (defaults to 4) - 28 passed in 70.67s (0:01:10)

this branch, no workers definition (defaults to 4) - 28 passed in 73.71s (0:01:13)

this branch, workers on auto (still 4) - 28 passed in 142.72s (0:02:22)

Yeah, CI is a lost cause for performance... luisschwab, when will you give us a CI runner?

How to verify the changes you have done?

Run the prepare step first. The general just test-functional recipe does not accept arguments and would rerun prepare and rebuild the binaries unnecessarily:

just test-functional-prepare

Then, both on master and with this branch checked out, run:

just test-functional-run '-n auto'

You'll see that the tests break on master but not on this branch.

Speculation

The major bound is still IO, as shown by the tests breaking because of READ timeouts. Perhaps we could make the tests run entirely in RAM? Can we cache datadirs? Can we avoid writing to disk in test environments? The logs are more useful for debugging the running phase than the datadirs themselves.

The tests got faster with more workers, not necessarily because the CPU was doing more work but because more tests were running at the same time; IO usage spiked even more in this case.

Also, this shows us that we can optimize the tests even more... Why, you ask?

It's obvious! A faster, well-engineered test framework will allow us to run those tests in more environments, verifying Floresta functionality in more cases, and even to add more extensive heavy-load tests without worrying about the execution phase.

Micah-Shallom · 2026-05-10T16:11:58Z

+            except (ReadTimeout, RequestsConnectionError) as exc:
+                last_exc = exc
+                self.log.debug(
+                    self.log_msg(f"Transient error on {method}, retrying: {exc}")
+                )
+                time.sleep(1)
+                continue


NIT: tested locally with this snippet:

import time

import pytest


@pytest.mark.rpc
def test_rpc_after_daemon_killed(florestad_node):
    florestad = florestad_node
    florestad.rpc.get_blockchain_info()  # alive

    # Kill the daemon and wait until the process has really exited.
    florestad.daemon.process.kill()
    florestad.daemon.process.wait()

    # Measure how long the next RPC call takes to fail.
    start = time.time()
    try:
        florestad.rpc.get_blockchain_info()
    except Exception:
        elapsed = time.time() - start
        pytest.fail(f"RPC took {elapsed:.2f}s to fail")

Ran it on both master and this PR.

An RPC call after the daemon is killed fails in 0.00s on master, but on this branch it takes 60.27s before failing, which matches the REQUEST_STALE_TIMEOUT default (60s).

The retry on ConnectionError makes sense for the startup race under -n auto, but it also fires after a successful RPC, so a daemon crash mid-test eats the full 60s before failing.

could the retry distinguish "haven't connected yet" from "was connected, now isn't"?

Or is the 60s wait time not much of a big deal?

Nice catch!

Or is the 60s wait time not much of a big deal?

No, it is. We're trying to save time here; the intent of those timeouts is to help slow runners.

I think it's possible to know whether a node died or is just non-responsive. Another nice thing about this PR is that it centralizes those requests, so this should be straightforward.

Alright, makes sense.
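One way the "died vs. just non-responsive" distinction could look, as a rough sketch and not the PR's code. It assumes the framework keeps a subprocess.Popen handle for the daemon (the test snippet above reaches it as florestad.daemon.process) and uses Popen.poll(), which returns None while the process is running. call_with_retry and its parameters are hypothetical names, and the built-in ConnectionError and TimeoutError stand in for the requests exceptions the PR catches, to keep the sketch dependency-free.

```python
import subprocess
import sys
import time


def call_with_retry(call, process, stale_timeout=60.0, interval=1.0):
    """Retry `call` on transient errors, but only while `process` is alive.

    Hypothetical helper: `call` is any zero-argument function performing
    the RPC, `process` is the daemon's subprocess.Popen handle.
    """
    deadline = time.monotonic() + stale_timeout
    while True:
        try:
            return call()
        except (ConnectionError, TimeoutError) as exc:
            # poll() returns None while running, the exit code otherwise.
            if process.poll() is not None:
                raise RuntimeError(
                    f"daemon exited with code {process.returncode}"
                ) from exc
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)


def refused():
    # Stand-in for an RPC call against a daemon that is gone.
    raise ConnectionError("connection refused")


# A child process that exits immediately plays the role of a crashed daemon.
child = subprocess.Popen([sys.executable, "-c", "pass"])
child.wait()

start = time.monotonic()
try:
    call_with_retry(refused, child)
except RuntimeError as err:
    print(f"failed after {time.monotonic() - start:.2f}s: {err}")
```

With a check like this, a crash mid-test would surface on the first failed request instead of after the full stale window, while the startup race under -n auto would still be retried as long as the process is running.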

Davidson-Souza · 2026-05-12T18:17:47Z

For the "wait for sync" part, I think Core has an RPC for that. It basically hangs until the daemon reaches a certain height. We could do that as well.

moisesPompilio · 2026-05-12T18:29:39Z

For the "wait for sync" part, I think Core has an RPC for that. It basically hangs until the daemon reaches a certain height. We could do that as well.

This is an excellent idea, but the error that will occur if the sync hasn’t finished will be a request timeout. In that case, we just need to map that error to indicate the correct cause.

Currently, the maximum time we wait for a request to complete is 15 seconds. For chain sync, we may need to increase that limit, because on weaker computers 15 seconds might not be enough. I think we can safely raise it to somewhere between 45 and 60 seconds.
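The error mapping described here could be sketched roughly like this. Everything in it is hypothetical: the thread does not confirm such a blocking RPC in Floresta, so `call` is just a placeholder for a function that blocks until the height is reached or raises TimeoutError, and SyncTimeoutError is an invented name.

```python
class SyncTimeoutError(AssertionError):
    """Raised when a blocking 'wait for height' style call times out."""


def wait_for_height_blocking(call, target_height, request_timeout=60.0):
    """Run a blocking wait call and translate a timeout into a sync error.

    `call` is a hypothetical function taking (target_height, timeout) that
    blocks until the node reaches the height or raises TimeoutError.
    """
    try:
        return call(target_height, request_timeout)
    except TimeoutError as exc:
        # Report the real cause (sync not finished) instead of a bare
        # request timeout.
        raise SyncTimeoutError(
            f"node did not reach height {target_height} "
            f"within {request_timeout:.0f}s"
        ) from exc
```

The default of 60s only reflects the upper end of the 45 to 60 second range suggested above.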

moisesPompilio · 2026-05-12T18:34:13Z

+        prev_height = None
+        prev_validated = None
+
+        while True:


Using while True is not a good idea. Ideally the timeout is handled in the loop itself, so that when it expires the loop can exit by returning immediately. If the loop keeps going after the timeout, it ends up causing an error.

In PR #897 I added a wait_until helper that already does this automatically, which makes things easier.
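For readers who have not seen #897, a helper with that shape might look roughly like the sketch below. Only the signature is taken from the diff quoted later in this thread; the body, the TimeoutError and the 120s placeholder for SYNC_TIMEOUT are assumptions, not the actual implementation.

```python
import time

SYNC_TIMEOUT = 120  # seconds; placeholder value for this sketch


def wait_until(predicate, *, timeout=SYNC_TIMEOUT, interval=0.5):
    """Poll `predicate` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    # The deadline is part of the loop condition, so the loop cannot
    # outlive the timeout the way an unbounded `while True` could.
    while time.monotonic() < deadline:
        if predicate():
            return
        time.sleep(interval)
    # Final check: the condition may have become true during the last sleep.
    if predicate():
        return
    raise TimeoutError(f"condition not met within {timeout}s")


# Usage: the predicate becomes true on its fourth call.
counter = iter(range(5))
wait_until(lambda: next(counter) >= 3, timeout=5, interval=0.01)
```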

jaoleal · 2026-05-18T12:28:36Z

For the "wait for sync" part, I think Core has an RPC for that. It basically hangs until the daemon reaches a certain height. We could do that as well.

Yeah, this is really interesting... It moves the timeout to the node side and perhaps gets rid of some Python overhead.

But for this PR I insist on the current Python-centric approach. I think it's better to leave this specific RPC as future work and bring a more elegant solution for it then.

…y logic

Add wait_for_sync and wait_for_height to FlorestaTestFramework, both using stale-state detection: the countdown resets on progress and only fires when the node truly stalls. Both tolerate transient RPC errors (ReadTimeout, ConnectionError) by retrying within the stale window instead of failing immediately.

Add retry logic to BaseRPC.perform_request: transient errors are retried for up to REQUEST_STALE_TIMEOUT before propagating. Rename BaseRPC.TIMEOUT to REQUEST_TIMEOUT.

Extract all hardcoded timeouts into environment-variable-backed constants so slow runners can increase tolerance:

- FLORESTA_SYNC_TIMEOUT (default: 120s)
- FLORESTA_PEER_CONNECTION_TIMEOUT (default: 30s)
- FLORESTA_REQUEST_TIMEOUT (default: 15s)
- FLORESTA_REQUEST_STALE_TIMEOUT (default: 60s)

Document all environment variables in doc/running-tests.md
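As an illustration of what "environment-variable-backed constants" could mean in practice, here is a minimal sketch. The variable names and defaults come from the commit message; the env_seconds helper, the PEER_CONNECTION_TIMEOUT constant name and the validation rules are assumptions, and the PR's actual code may differ.

```python
import os


def env_seconds(name, default):
    """Read a timeout in seconds from the environment, falling back to `default`."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return float(default)
    value = float(raw)  # raises ValueError on malformed input
    if value <= 0:
        raise ValueError(f"{name} must be positive, got {raw!r}")
    return value


SYNC_TIMEOUT = env_seconds("FLORESTA_SYNC_TIMEOUT", 120)
PEER_CONNECTION_TIMEOUT = env_seconds("FLORESTA_PEER_CONNECTION_TIMEOUT", 30)
REQUEST_TIMEOUT = env_seconds("FLORESTA_REQUEST_TIMEOUT", 15)
REQUEST_STALE_TIMEOUT = env_seconds("FLORESTA_REQUEST_STALE_TIMEOUT", 60)
```

A slow runner would then presumably export a larger value, such as FLORESTA_SYNC_TIMEOUT=300, before invoking the test run.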

…_height

Replace inline sync-waiting loops across test files with the new framework helpers. Tests syncing from utreexod use wait_for_sync (requires IBD completion), while getblock (syncing from bitcoind) uses wait_for_height.
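The stale-state idea behind these helpers (the countdown resets on progress and only fires on a real stall) can be sketched as follows. This is an illustration under assumptions, not the code from the commits: wait_for_progress, get_state and is_done are hypothetical names, and the built-in ConnectionError and TimeoutError stand in for the requests exceptions named in the commit message.

```python
import time


def wait_for_progress(get_state, is_done, stale_timeout=120.0, interval=0.5):
    """Poll until `is_done(state)`; fail only if the state stops changing.

    The deadline is pushed forward every time `get_state()` returns a value
    different from the previous one, so a slow but advancing node never
    times out.
    """
    prev_state = None
    deadline = time.monotonic() + stale_timeout
    while time.monotonic() < deadline:
        try:
            state = get_state()
        except (ConnectionError, TimeoutError):
            # Transient RPC error: retry within the current stale window.
            time.sleep(interval)
            continue
        if is_done(state):
            return state
        if state != prev_state:
            # Progress observed: restart the countdown.
            prev_state = state
            deadline = time.monotonic() + stale_timeout
        time.sleep(interval)
    raise TimeoutError(
        f"no progress for {stale_timeout}s, last state: {prev_state!r}"
    )


# Usage: a fake node whose height advances on most polls.
heights = iter([1, 2, 2, 3, 5])
final = wait_for_progress(
    lambda: next(heights), lambda h: h >= 5, stale_timeout=1.0, interval=0.01
)
print(final)  # prints 5
```

For wait_for_sync the state could be something like a (height, validated) pair, in line with the prev_height and prev_validated variables visible in the review diff above.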

moisesPompilio · 2026-05-18T15:25:25Z

+)
+
+
+def wait_until(predicate, *, timeout=SYNC_TIMEOUT, interval=0.5):


Move this to util.py

jaoleal requested review from Davidson-Souza and moisesPompilio May 7, 2026 18:12

jaoleal self-assigned this May 7, 2026

jaoleal marked this pull request as ready for review May 7, 2026 18:12

Davidson-Souza added the functional tests label May 7, 2026

Davidson-Souza added this to Floresta May 7, 2026

github-project-automation Bot moved this to Backlog in Floresta May 7, 2026

Davidson-Souza moved this from Backlog to Needs review in Floresta May 7, 2026

jaoleal force-pushed the test/wait-for-sync-helper branch from 3944590 to 504f98f Compare May 7, 2026 18:31

csgui added this to the Q2/2026 milestone May 8, 2026

jaoleal mentioned this pull request May 9, 2026

ci: run build and test for each commit in pushed range #1040

Closed

Micah-Shallom reviewed May 10, 2026

moisesPompilio reviewed May 12, 2026

jaoleal mentioned this pull request May 12, 2026

rpcserver: Support named and null parameters, Review optionals on rpc internal methods, Stronger Response and Error codes #831

Merged

3 tasks

jaoleal added 2 commits May 18, 2026 11:26

jaoleal force-pushed the test/wait-for-sync-helper branch from 504f98f to a39ef8d Compare May 18, 2026 15:05

moisesPompilio reviewed May 18, 2026

jaoleal wants to merge 2 commits into getfloresta:master from jaoleal:test/wait-for-sync-helper

5 participants
