Issue 7578 - schema - attribute refcount is not maintained properly#7582
Issue 7578 - schema - attribute refcount is not maintained properly#7582mreynolds389 wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Hey - I've found 8 issues, and left some high level feedback:
- The new schema reload stress test uses module-level globals (stop_event, crash_detected, stats) without resetting them at the start of the test; if this file ever ends up with multiple tests or gets rerun in the same process, stale state could leak between runs, so consider reinitializing these globals at the beginning of test_schema_reload_under_load.
- In test_schema_reload_under_load, you log reloads_timeout and searches_timeout but never update those counters anywhere; either implement the timeout accounting or drop those fields from the summary to avoid confusing, always-zero metrics.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new schema reload stress test uses module-level globals (stop_event, crash_detected, stats) without resetting them at the start of the test; if this file ever ends up with multiple tests or gets rerun in the same process, stale state could leak between runs, so consider reinitializing these globals at the beginning of test_schema_reload_under_load.
- In test_schema_reload_under_load, you log reloads_timeout and searches_timeout but never update those counters anywhere; either implement the timeout accounting or drop those fields from the summary to avoid confusing, always-zero metrics.
## Individual Comments
### Comment 1
<location path="ldap/servers/slapd/attrsyntax.c" line_range="384-385" />
<code_context>
if (delete_it) {
- if (asi->asi_marked_for_delete) { /* one final check */
+ if (slapi_atomic_load_64(&asi->asi_refcnt, __ATOMIC_ACQUIRE) == 0 &&
+ asi->asi_marked_for_delete) /* one final check */
+ {
if (use_lock) {
</code_context>
<issue_to_address>
**issue (bug_risk):** The final delete check should also re-validate `asi_refcnt` under the write lock to avoid a refcount race.
Between the atomic refcount check and acquiring the write lock, another thread can increment `asi_refcnt` from 0 to 1. Under the write lock you only re-check `asi_marked_for_delete`, so you could free an object whose refcount is no longer zero. After taking the write lock, re-read and validate `asi_refcnt == 0` (or use a CAS-style pattern) before calling `attr_syntax_free(asi)`.
</issue_to_address>
### Comment 2
<location path="ldap/servers/slapd/schema.c" line_range="2864-2865" />
<code_context>
schema_errprefix_at, newasip->asi_name,
"Could not be added (OID is \"%s\")",
newasip->asi_oid);
+ attr_syntax_return(newasip);
attr_syntax_free(newasip);
goto clean_up_and_return;
}
</code_context>
<issue_to_address>
**issue (bug_risk):** Calling `attr_syntax_return` followed by `attr_syntax_free` on the same `asi` instance may over-release it.
Here `newasip` is handed back to the attr-syntax layer and then freed again. If `attr_syntax_return(newasip)` already decrements the refcount and may free the object, calling `attr_syntax_free(newasip)` afterwards risks a double free or state corruption. Please use only one of these, consistent with the intended ownership/refcounting contract for `attr_syntax` objects.
</issue_to_address>
### Comment 3
<location path="ldap/servers/slapd/schema.c" line_range="4223-4224" />
<code_context>
primary_file /* force user defined? */,
schema_ds4x_compat, 0)) != 0)
break;
+ attr_syntax_return(tmpasip);
attr_syntax_free(tmpasip); /* trash it */
} else {
if ((*returncode = parse_at_str(s, NULL, returntext,
</code_context>
<issue_to_address>
**question (bug_risk):** The `attr_syntax_return(tmpasip)` + `attr_syntax_free(tmpasip)` pattern may indicate inconsistent ownership handling for parsed attribute syntaxes.
This sequence assumes `parse_at_str` both registers `tmpasip` globally and transfers ownership to the caller. If registration uses a shared refcount, `attr_syntax_return` should likely drop only the registry’s reference; calling `attr_syntax_free` immediately after may be unsafe if the registry still holds (or can reacquire) references. Please verify the intended lifecycle (parse + insert → drop registry ref → free caller’s copy) and confirm that `attr_syntax_free` aligns with that model.
</issue_to_address>
### Comment 4
<location path="ldap/servers/slapd/schema.c" line_range="6049-6050" />
<code_context>
slapi_log_err(SLAPI_LOG_ERR, "schema_berval_to_atlist",
"parse_at_str(%s) failed - %s\n",
at_berval[i]->bv_val, errorbuf[0] ? errorbuf : "unknown");
+ attr_syntax_return(at);
attr_syntax_free(at);
break;
}
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Error handling in `schema_berval_to_atlist` also uses `attr_syntax_return` followed by `attr_syntax_free`, with the same potential double-release concern.
Here you both `attr_syntax_return(at)` and then `attr_syntax_free(at)` on the error path, while the success path also calls `attr_syntax_return(at)` after linking into `at_list`. This makes ownership of `at` unclear and risks double free or leaks. Please clarify the contract of `attr_syntax_return` (does it fully drop the registry’s reference?) and standardize the pattern here and at other call sites (e.g., either always `return` and never `free`, or vice versa).
Suggested implementation:
```c
slapi_log_err(SLAPI_LOG_ERR, "schema_berval_to_atlist",
"parse_at_str(%s) failed - %s\n",
at_berval[i]->bv_val, errorbuf[0] ? errorbuf : "unknown");
attr_syntax_return(at);
break;
```
```c
schema_errprefix_at, newasip->asi_name,
"Could not be added (OID is \"%s\")",
newasip->asi_oid);
attr_syntax_return(newasip);
goto clean_up_and_return;
```
</issue_to_address>
### Comment 5
<location path="dirsrvtests/tests/suites/schema/schema_reload_heap_use_after_free_test.py" line_range="27-30" />
<code_context>
+# ---------------------------------------------------------------------------
+# Globals for cross-thread coordination
+# ---------------------------------------------------------------------------
+stop_event = threading.Event()
+crash_detected = threading.Event()
+stats_lock = threading.Lock()
+stats = Counter()
+
+ITERATIONS = 2
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Reset or scope the global threading Events and stats per test run to avoid state leakage across tests.
Because `stop_event`, `crash_detected`, and `stats` are module-level, they keep their state across test invocations in the same pytest run (e.g. `--lf`/`--ff`, or additional tests in this module). If an event remains set or `stats` is non-empty from a prior run, later runs may exit early or report incorrect results. Please either move this logic into a fixture that creates fresh `Event`/`Counter` instances per test, or reset the events and `stats` at the start of `test_schema_reload_under_load()`.
Suggested implementation:
```python
def test_schema_reload_under_load(topology_st):
# Reset global coordination state for each test invocation to avoid
# leaking state across tests or pytest runs (e.g. --lf/--ff).
stop_event.clear()
crash_detected.clear()
with stats_lock:
stats.clear()
```
The exact signature of `test_schema_reload_under_load` in your file may differ (extra fixtures/parameters, different fixture name, etc.). If so, adjust the `SEARCH` pattern to match the actual definition line of that test function.
The important part is to insert the three reset calls as the first statements in `test_schema_reload_under_load`'s body:
```python
stop_event.clear()
crash_detected.clear()
with stats_lock:
stats.clear()
```
If any other tests in this module also rely on these globals, you should add the same reset snippet at the beginning of those tests as well, or factor it into a small helper and call that helper from each test.
</issue_to_address>
### Comment 6
<location path="dirsrvtests/tests/suites/schema/schema_reload_heap_use_after_free_test.py" line_range="166-168" />
<code_context>
+ except Exception:
+ pass
+
+ task = schema.reload()
+ task.wait()
+ assert task.get_exit_code() == 0
+
+ time.sleep(0.01)
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against a stuck schema reload task by adding a timeout to task.wait().
Without a timeout, a stuck `schema.reload()` (or its task) can hang the entire test run indefinitely. Please use `task.wait(timeout=<seconds>)` and either assert it completes or fail with a clear error if the timeout is exceeded, so we can distinguish a real hang from the intended long-running stress scenario and keep CI reliable.
</issue_to_address>
### Comment 7
<location path="dirsrvtests/tests/suites/schema/schema_reload_heap_use_after_free_test.py" line_range="61-70" />
<code_context>
+ reload_thread.start()
+
+ start_time = time.monotonic()
+ try:
+ while reload_thread.is_alive():
+ reload_thread.join(timeout=5.0)
+ elapsed = time.monotonic() - start_time
+ with stats_lock:
+ s = dict(stats)
+ reloads = s.get('reloads_ok', 0)
+ searches = s.get('searches_ok', 0)
+ conn_errs = (s.get('connection_errors', 0) +
+ s.get('reload_connection_errors', 0))
+ crashed = s.get('crash_detected', 0)
+
+ status = 'CRASH DETECTED' if crashed else 'running'
+ log.info(f' [{elapsed:6.1f}s] reloads={reloads} searches={searches} '
+ f'conn_errors={conn_errs} status={status}')
+
+ if crash_detected.is_set():
+ break
+
</code_context>
<issue_to_address>
**suggestion (testing):** Consider enforcing a global maximum duration for the main stress loop to prevent hangs.
Because this loop only stops when the reload thread exits or `crash_detected` is set, a reload thread stuck in `schema_reload_worker` or a long LDAP call can cause the test to hang indefinitely while the server stays reachable. Please add a hard upper bound (e.g. `pytest.mark.timeout` or a `max_duration` check that fails after N seconds) so the test always terminates even if the reload thread wedges.
Suggested implementation:
```python
stats = Counter()
MAX_STRESS_TEST_DURATION = 300 # seconds; hard upper bound to avoid hanging tests
```
```python
reload_thread.start()
start_time = time.monotonic()
try:
while reload_thread.is_alive():
reload_thread.join(timeout=5.0)
elapsed = time.monotonic() - start_time
# Enforce a hard upper bound so the test cannot hang indefinitely
if elapsed > MAX_STRESS_TEST_DURATION:
pytest.fail(
f"schema reload stress test exceeded max duration "
f"{MAX_STRESS_TEST_DURATION}s; aborting to avoid hang"
)
with stats_lock:
s = dict(stats)
reloads = s.get('reloads_ok', 0)
searches = s.get('searches_ok', 0)
conn_errs = (s.get('connection_errors', 0) +
s.get('reload_connection_errors', 0))
crashed = s.get('crash_detected', 0)
status = 'CRASH DETECTED' if crashed else 'running'
log.info(
f' [{elapsed:6.1f}s] reloads={reloads} searches={searches} '
f'conn_errors={conn_errs} status={status}'
)
if crash_detected.is_set():
break
```
1. Ensure `pytest` is imported at the top of `schema_reload_heap_use_after_free_test.py`, e.g. add `import pytest` alongside the other imports.
2. Optionally, if you prefer a different upper bound, adjust `MAX_STRESS_TEST_DURATION` to the desired number of seconds.
3. If your test suite already uses `pytest.mark.timeout`, you could instead (or additionally) put an explicit `@pytest.mark.timeout(MAX_STRESS_TEST_DURATION + safety_margin)` decorator on the test function containing this loop.
</issue_to_address>
### Comment 8
<location path="dirsrvtests/tests/suites/schema/schema_reload_heap_use_after_free_test.py" line_range="327-329" />
<code_context>
+ log.info('Results')
+ log.info('-' * 70)
+ log.info(f' Elapsed time: {elapsed:.1f}s')
+ log.info(f' Schema reloads: {s.get("reloads_ok", 0)} ok, '
+ f'{s.get("reloads_fail", 0)} fail, '
+ f'{s.get("reloads_timeout", 0)} timeout')
+ log.info(f' LDAP searches: {s.get("searches_ok", 0)} ok, '
+ f'{s.get("searches_fail", 0)} fail, '
</code_context>
<issue_to_address>
**nitpick:** Either initialise or remove the unused *timeout counters in the final stats log.
`reloads_timeout` and `searches_timeout` are never updated in this test, so the log will always report `0 timeout` even if timeouts actually occurred elsewhere. Please either connect these fields to real timeout-handling logic or drop them from the output to avoid misleading diagnostics.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| stop_event = threading.Event() | ||
| crash_detected = threading.Event() | ||
| stats_lock = threading.Lock() | ||
| stats = Counter() |
There was a problem hiding this comment.
suggestion (bug_risk): Reset or scope the global threading Events and stats per test run to avoid state leakage across tests.
Because stop_event, crash_detected, and stats are module-level, they keep their state across test invocations in the same pytest run (e.g. --lf/--ff, or additional tests in this module). If an event remains set or stats is non-empty from a prior run, later runs may exit early or report incorrect results. Please either move this logic into a fixture that creates fresh Event/Counter instances per test, or reset the events and stats at the start of test_schema_reload_under_load().
Suggested implementation:
def test_schema_reload_under_load(topology_st):
# Reset global coordination state for each test invocation to avoid
# leaking state across tests or pytest runs (e.g. --lf/--ff).
stop_event.clear()
crash_detected.clear()
with stats_lock:
stats.clear()The exact signature of test_schema_reload_under_load in your file may differ (extra fixtures/parameters, different fixture name, etc.). If so, adjust the SEARCH pattern to match the actual definition line of that test function.
The important part is to insert the three reset calls as the first statements in test_schema_reload_under_load's body:
stop_event.clear()
crash_detected.clear()
with stats_lock:
stats.clear()If any other tests in this module also rely on these globals, you should add the same reset snippet at the beginning of those tests as well, or factor it into a small helper and call that helper from each test.
| log.info(f' Schema reloads: {s.get("reloads_ok", 0)} ok, ' | ||
| f'{s.get("reloads_fail", 0)} fail, ' | ||
| f'{s.get("reloads_timeout", 0)} timeout') |
There was a problem hiding this comment.
nitpick: Either initialise or remove the unused *timeout counters in the final stats log.
reloads_timeout and searches_timeout are never updated in this test, so the log will always report 0 timeout even if timeouts actually occurred elsewhere. Please either connect these fields to real timeout-handling logic or drop them from the output to avoid misleading diagnostics.
|
Congratulations! One of the builds has completed. 🍾 You can install the built RPMs by following these steps:
Please note that the RPMs should be used only in a testing environment. |
5b58ee2 to
67fdb68
Compare
67fdb68 to
5a6f060
Compare
| } | ||
|
|
||
| newasip->asi_flags |= SLAPI_ATTR_FLAG_KEEP; | ||
| attr_syntax_return(newasip); |
There was a problem hiding this comment.
I think this return might be unsafe for this ownership path... newasip comes from parse_at_str(_, &newasip, _), so it was newly allocated rather than acquired through an attr_syntax_get_*() lookup. IIUC, calling attr_syntax_return() here can decrement a refcount that was never incremented; after attr_syntax_add() succeeds, the schema table should own the object.
| primary_file /* force user defined? */, | ||
| schema_ds4x_compat, 0)) != 0) | ||
| break; | ||
| attr_syntax_return(tmpasip); |
There was a problem hiding this comment.
This looks like the same ownership mismatch as above. tmpasip is only a temporary parse result used for validation, not a refcounted lookup result, Should it be released directly with attr_syntax_free() without first calling attr_syntax_return()?
| at->asi_prev = at_list; | ||
| at_list = at; | ||
| } | ||
| attr_syntax_return(at); |
There was a problem hiding this comment.
Probably, a similar issue here too? The parsed at node is linked into the private remote_at_list and later freed by schema_atlist_free(), so calling attr_syntax_return() while building the list might underflow the refcount
| while (global_at) { | ||
| next = global_at->asi_next; | ||
| attr_syntax_free(global_at); | ||
| global_at->asi_marked_for_delete = 1; |
There was a problem hiding this comment.
Maybe I'm overthinking this, but I think deferred old nodes still keep their asi_prev/asi_next links into the old global_at list. If they are later freed via attr_syntax_return(), could attr_syntax_remove() follow stale links or disturb the new global_at? Should we detach them during the swap or skip unlinking swap-orphans?
| except Exception: | ||
| pass | ||
|
|
||
| task = schema.reload() |
There was a problem hiding this comment.
This assertion runs inside the background reload thread, so pytest may not fail the test reliably if the reload task exits non-zero. Could we record reload task failures in stats and assert them from the main test thread, along with the existing crash/server-alive checks?
| while (global_at) { | ||
| next = global_at->asi_next; | ||
| attr_syntax_free(global_at); | ||
| global_at->asi_marked_for_delete = 1; |
There was a problem hiding this comment.
I agree with that change but I would expect code review tools to report a risk of leak in case refcnt is not null.
| "Could not be added (OID is \"%s\")", | ||
| newasip->asi_oid); | ||
| attr_syntax_free(newasip); | ||
| attr_syntax_return(newasip); |
There was a problem hiding this comment.
Sorry I can not find where the refcnt of 'newasip' is increased in attr_sync_add or before.
So attr_sync_return may set refcnt to -1. Why not just keeping attr_syntax_free ?
| primary_file /* force user defined? */, | ||
| schema_ds4x_compat, 0)) != 0) | ||
| break; | ||
| attr_syntax_return(tmpasip); |
There was a problem hiding this comment.
I think this one is useless as refcnt was not increase in that path
| struct asyntaxinfo *tmpasip = NULL; | ||
| rc = parse_at_str(attr_str, &tmpasip, errorbuf, sizeof(errorbuf), | ||
| DSE_SCHEMA_NO_GLOCK | schema_flags, 0, 0, 0); | ||
| attr_syntax_return(tmpasip); |
| "parse_at_str(%s) failed - %s\n", | ||
| at_berval[i]->bv_val, errorbuf[0] ? errorbuf : "unknown"); | ||
| attr_syntax_free(at); | ||
| attr_syntax_return(at); |
There was a problem hiding this comment.
Same remark I believe that parse_at_str does not increase refcnt
| at->asi_prev = at_list; | ||
| at_list = at; | ||
| } | ||
| attr_syntax_return(at); |
34d6dc7 to
3750475
Compare
Description: The refcount mechanism for attributes is not correctly used, and this prevents its use for synchronizing hashtable access. We increment the refcount, but we don't decrement it correctly. This can potentially cause a rare race condition with schema reload task and heap-use-after-free with searches running at the same time as the reload task. **CWE**: CWE-416 (Use After Free) via CWE-362 (Race Condition) CI test assisted by: Cursor relates: 389ds#7578 Reviewed by: progier, tbordaz, and spichugi(Thanks!!!)
3750475 to
313b853
Compare
Description:
The refcount mechanism for attributes is not correctly used, and this prevents its use for synchronizing hashtable access. We increment the refcount, but we don't decrement it correctly. This can potentially cause a rare race condition with schema reload task and heap-use-after-free with searches running at the same time as the reload task.
CWE: CWE-416 (Use After Free) via CWE-362 (Race Condition)
CI test assisted by: Cursor
relates: #7578
Summary by Sourcery
Fix attribute syntax reference counting and cleanup during schema reload to prevent use-after-free race conditions, and add a stress test that exercises concurrent searches with schema reloads to detect crashes.
Bug Fixes:
Tests: