Skip to content

functional tests: wait for sync and wait for height helpers, expose timeouts to environment variables, timeouts for stallment to avoid misleading breaks#1038

Open
jaoleal wants to merge 2 commits into
getfloresta:masterfrom
jaoleal:test/wait-for-sync-helper
Open

functional tests: wait for sync and wait for height helpers, expose timeouts to environment variables, timeouts for stallment to avoid misleading breaks#1038
jaoleal wants to merge 2 commits into
getfloresta:masterfrom
jaoleal:test/wait-for-sync-helper

Conversation

@jaoleal
Copy link
Copy Markdown
Member

@jaoleal jaoleal commented May 7, 2026

Description and Notes

Here i present some general DX improvements for the functional tests, with simple changes on how we deal with timeouts.

The tests are not bounded by CPU speed but bounded by IO heavy usage, thats why the stale check helps to provide preciser timeouts.

Heres a breakdown of the impact of the changes on my machine:

- master, no workers definition (defaults to 4) - 28 passed in 252.09s (0:04:12).

- master, auto workers definition (14) - 3 failed, 22 passed in 90.21s (0:01:30). 
(failure because of IO read reached timeout).
 test_get_ping, test_add_node_v2, getconnectioncount failed

- this branch, no workers definition (default to 4) - 28 passed in 257.95s (0:04:17) 

- this branch, auto workers definition (14) - 28 passed in 143.31s (0:02:23)

With this we can add the -n auto flag to CI and see if anything breaks, Ill do that after the checks for this PR are completed and ill add them here.

CI runs

master, no workers definition ( defaults to 4 ) - 28 passed in 70.67s (0:01:10)  

this branch, no workers definition ( defaults to  4 ) - 28 passed in 73.71s (0:01:13)

this branch, workers on auto (still 4) - 28 passed in 142.72s (0:02:22)

Yeah, CI is a lost cause for performance... luisschwab When you will give us a CI runner ?

How to verify the changes you have done?

Run the prepare, the general just test-functional does not accept args and will run prepare and build binaries whitout need:

just test-functional-prepare

In master and checked in this branch:

just test-functional-run '-n auto'

and youll see that tests will break on master but not on this branch.

Speculation

The major bound still IO, this is proven by the tests breaking because of READ timeouts. Perhaps we could make tests to run totally on ram ? Can we cache datadirs ? Can we avoid to write in disk on test environments ? The logs are being more usefull to debug running phase than datadirs itself

The tests got faster on more workers not necessarily because of CPU working more but because more tests were running at the same time, IO usage spiked even more in this case.

Also, this show us that we can optimize tests even more... Why ? you ask me.

Its obvious! Faster and well engineered test framework will allow us to run those tests on more environments, being able to verify floresta functionality in more cases and even add more extensive heavy load tests whitout bothering about the execution phase.

@jaoleal jaoleal self-assigned this May 7, 2026
@jaoleal jaoleal marked this pull request as ready for review May 7, 2026 18:12
@github-project-automation github-project-automation Bot moved this to Backlog in Floresta May 7, 2026
@Davidson-Souza Davidson-Souza moved this from Backlog to Needs review in Floresta May 7, 2026
@jaoleal jaoleal force-pushed the test/wait-for-sync-helper branch from 3944590 to 504f98f Compare May 7, 2026 18:31
@csgui csgui added this to the Q2/2026 milestone May 8, 2026
Comment on lines +161 to +167
except (ReadTimeout, RequestsConnectionError) as exc:
last_exc = exc
self.log.debug(
self.log_msg(f"Transient error on {method}, retrying: {exc}")
)
time.sleep(1)
continue
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: tested locally with this snippet:

@pytest.mark.rpc
def test_rpc_after_daemon_killed(florestad_node):
    florestad = florestad_node
    florestad.rpc.get_blockchain_info()           # alive
    florestad.daemon.process.kill()
    florestad.daemon.process.wait()
    start = time.time()
    try:
        florestad.rpc.get_blockchain_info()
    except Exception:
        elapsed = time.time() - start
        pytest.fail(f"RPC took {elapsed:.2f}s to fail")

ran on both master and on this PR

an RPC call after the daemon is killed fails in 0.00s on master but on this branch it takes 60.27s before failing which is exactly the REQUEST_STALE_TIMEOUT default

the retry on ConnectionError makes sense for the startup race under -n auto....but it also fires after a successful RPC, so a daemon crash mid-test eats the full 60s before failing......

could the retry distinguish "haven't connected yet" from "was connected, now isn't"?

orrrrr Is the 60s wait time not much of a big deal??

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

orrrrr Is the 60s wait time not much of a big deal??

No, it is, were trying to save time here, the intent for those timeouts is to help slow runners.

I think its possible to know wheter a node died or is just non-responsive. Another nice thing for this pr is the centralization for those requests, so this should be straightforward.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright....make sense

@Davidson-Souza
Copy link
Copy Markdown
Member

For the "wait fro sync" part, I think core have a RPC for that. It basically hangs until the daemon reaches a certain height. We could do that as well

@moisesPompilio
Copy link
Copy Markdown
Collaborator

For the "wait fro sync" part, I think core have a RPC for that. It basically hangs until the daemon reaches a certain height. We could do that as well

This is an excellent idea, but the error that will occur if the sync hasn’t finished will be a request timeout. In that case, we just need to map that error to indicate the correct cause.

Currently, the maximum time we wait for a request to complete is 15 seconds. For chain sync, we may need to increase that limit, because on weaker computers 15 seconds might not be enough. I think we can safely raise it to somewhere between 45 and 60 seconds.

prev_height = None
prev_validated = None

while True:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using while true is not a good idea. The ideal approach is to handle the timeout case here, because inside the while, if that happens, it can exit by returning immediately. If it keeps going after the timeout, it ends up causing an error.

In this PR here: #897 I added a wait_until helper that already does this automatically, so it makes things easier.

@jaoleal
Copy link
Copy Markdown
Member Author

jaoleal commented May 18, 2026

For the "wait fro sync" part, I think core have a RPC for that. It basically hangs until the daemon reaches a certain height. We could do that as well

Yeah, this is really interesting... Moves the timeout to the node side and perhaps will get rid of some python overhead.

But for this Pr I insist on the current python-centric approach, I think its better to leave this specific rpc to work in the future and bring a more elegant solution for it.

jaoleal added 2 commits May 18, 2026 11:26
…y logic

Add wait_for_sync and wait_for_height to FlorestaTestFramework, both
using stale-state detection: the countdown resets on progress and only
fires when the node truly stalls.

Both tolerate transient RPC errors (ReadTimeout, ConnectionError) by
retrying within the stale window instead of failing immediately.

Add retry logic to BaseRPC.perform_request: transient errors are
retried for up to REQUEST_STALE_TIMEOUT before propagating.

Rename BaseRPC.TIMEOUT to REQUEST_TIMEOUT.

Extract all hardcoded timeouts into environment-variable-backed constants
so slow runners can increase tolerance:

- FLORESTA_SYNC_TIMEOUT (default: 120s)
- FLORESTA_PEER_CONNECTION_TIMEOUT (default: 30s)
- FLORESTA_REQUEST_TIMEOUT (default: 15s)
- FLORESTA_REQUEST_STALE_TIMEOUT (default: 60s)

Document all environment variables in doc/running-tests.md
…_height

Replace inline sync-waiting loops across test files with the new
framework helpers. Tests syncing from utreexod use wait_for_sync
(requires IBD completion), while getblock (syncing from bitcoind)
uses wait_for_height.
@jaoleal jaoleal force-pushed the test/wait-for-sync-helper branch from 504f98f to a39ef8d Compare May 18, 2026 15:05
)


def wait_until(predicate, *, timeout=SYNC_TIMEOUT, interval=0.5):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to util.py

Comment thread tests/test_framework/__init__.py
Comment thread tests/test_framework/__init__.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Needs review

Development

Successfully merging this pull request may close these issues.

5 participants