Issue 7578 - schema - attribute refcount is not maintained properly by mreynolds389 · Pull Request #7582 · 389ds/389-ds-base

mreynolds389 · 2026-06-15T14:31:21Z

Description:

The refcount mechanism for attributes is not correctly used, and this prevents its use for synchronizing hashtable access. We increment the refcount, but we don't decrement it correctly. This can potentially cause a rare race condition with schema reload task and heap-use-after-free with searches running at the same time as the reload task.

CWE: CWE-416 (Use After Free) via CWE-362 (Race Condition)

CI test assisted by: Cursor

relates: #7578

Summary by Sourcery

Fix attribute syntax reference counting and cleanup during schema reload to prevent use-after-free race conditions, and add a stress test that exercises concurrent searches with schema reloads to detect crashes.

Bug Fixes:

Correct attribute syntax reference counting by ensuring attr_syntax_return is called wherever references are taken and by guarding deletion on both refcount and deletion flags.
Prevent premature freeing of global attribute syntax structures during hash table swaps by honoring reference counts and delete markers, avoiding use-after-free during schema reloads.
Improve error logging in schema reload paths with clearer messages and ensure temporary DSE structures are destroyed on failure.

Tests:

Add a multithreaded schema reload stress test that runs concurrent searches and schema reload tasks, with crash monitoring and detailed statistics, to expose heap-use-after-free or race-condition crashes in attribute syntax handling.

sourcery-ai

Hey - I've found 8 issues, and left some high level feedback:

The new schema reload stress test uses module-level globals (stop_event, crash_detected, stats) without resetting them at the start of the test; if this file ever ends up with multiple tests or gets rerun in the same process, stale state could leak between runs, so consider reinitializing these globals at the beginning of test_schema_reload_under_load.
In test_schema_reload_under_load, you log reloads_timeout and searches_timeout but never update those counters anywhere; either implement the timeout accounting or drop those fields from the summary to avoid confusing, always-zero metrics.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The new schema reload stress test uses module-level globals (stop_event, crash_detected, stats) without resetting them at the start of the test; if this file ever ends up with multiple tests or gets rerun in the same process, stale state could leak between runs, so consider reinitializing these globals at the beginning of test_schema_reload_under_load.
- In test_schema_reload_under_load, you log reloads_timeout and searches_timeout but never update those counters anywhere; either implement the timeout accounting or drop those fields from the summary to avoid confusing, always-zero metrics.

## Individual Comments

### Comment 1
<location path="ldap/servers/slapd/attrsyntax.c" line_range="384-385" />
<code_context>

         if (delete_it) {
-            if (asi->asi_marked_for_delete) { /* one final check */
+            if (slapi_atomic_load_64(&asi->asi_refcnt, __ATOMIC_ACQUIRE) == 0 &&
+                asi->asi_marked_for_delete) /* one final check */
+            {
                 if (use_lock) {
</code_context>
<issue_to_address>
**issue (bug_risk):** The final delete check should also re-validate `asi_refcnt` under the write lock to avoid a refcount race.

Between the atomic refcount check and acquiring the write lock, another thread can increment `asi_refcnt` from 0 to 1. Under the write lock you only re-check `asi_marked_for_delete`, so you could free an object whose refcount is no longer zero. After taking the write lock, re-read and validate `asi_refcnt == 0` (or use a CAS-style pattern) before calling `attr_syntax_free(asi)`.
</issue_to_address>

### Comment 2
<location path="ldap/servers/slapd/schema.c" line_range="2864-2865" />
<code_context>
                                        schema_errprefix_at, newasip->asi_name,
                                        "Could not be added (OID is \"%s\")",
                                        newasip->asi_oid);
+                attr_syntax_return(newasip);
                 attr_syntax_free(newasip);
                 goto clean_up_and_return;
             }
</code_context>
<issue_to_address>
**issue (bug_risk):** Calling `attr_syntax_return` followed by `attr_syntax_free` on the same `asi` instance may over-release it.

Here `newasip` is handed back to the attr-syntax layer and then freed again. If `attr_syntax_return(newasip)` already decrements the refcount and may free the object, calling `attr_syntax_free(newasip)` afterwards risks a double free or state corruption. Please use only one of these, consistent with the intended ownership/refcounting contract for `attr_syntax` objects.
</issue_to_address>

### Comment 3
<location path="ldap/servers/slapd/schema.c" line_range="4223-4224" />
<code_context>
                                                 primary_file /* force user defined? */,
                                                 schema_ds4x_compat, 0)) != 0)
                     break;
+                attr_syntax_return(tmpasip);
                 attr_syntax_free(tmpasip); /* trash it */
             } else {
                 if ((*returncode = parse_at_str(s, NULL, returntext,
</code_context>
<issue_to_address>
**question (bug_risk):** The `attr_syntax_return(tmpasip)` + `attr_syntax_free(tmpasip)` pattern may indicate inconsistent ownership handling for parsed attribute syntaxes.

This sequence assumes `parse_at_str` both registers `tmpasip` globally and transfers ownership to the caller. If registration uses a shared refcount, `attr_syntax_return` should likely drop only the registry’s reference; calling `attr_syntax_free` immediately after may be unsafe if the registry still holds (or can reacquire) references. Please verify the intended lifecycle (parse + insert → drop registry ref → free caller’s copy) and confirm that `attr_syntax_free` aligns with that model.
</issue_to_address>

### Comment 4
<location path="ldap/servers/slapd/schema.c" line_range="6049-6050" />
<code_context>
                 slapi_log_err(SLAPI_LOG_ERR, "schema_berval_to_atlist",
                               "parse_at_str(%s) failed - %s\n",
                               at_berval[i]->bv_val, errorbuf[0] ? errorbuf : "unknown");
+                attr_syntax_return(at);
                 attr_syntax_free(at);
                 break;
             }
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Error handling in `schema_berval_to_atlist` also uses `attr_syntax_return` followed by `attr_syntax_free`, with the same potential double-release concern.

Here you both `attr_syntax_return(at)` and then `attr_syntax_free(at)` on the error path, while the success path also calls `attr_syntax_return(at)` after linking into `at_list`. This makes ownership of `at` unclear and risks double free or leaks. Please clarify the contract of `attr_syntax_return` (does it fully drop the registry’s reference?) and standardize the pattern here and at other call sites (e.g., either always `return` and never `free`, or vice versa).

Suggested implementation:

```c
                slapi_log_err(SLAPI_LOG_ERR, "schema_berval_to_atlist",
                              "parse_at_str(%s) failed - %s\n",
                              at_berval[i]->bv_val, errorbuf[0] ? errorbuf : "unknown");
                attr_syntax_return(at);
                break;

```

```c
                                       schema_errprefix_at, newasip->asi_name,
                                       "Could not be added (OID is \"%s\")",
                                       newasip->asi_oid);
                attr_syntax_return(newasip);
                goto clean_up_and_return;

```
</issue_to_address>

### Comment 5
<location path="dirsrvtests/tests/suites/schema/schema_reload_heap_use_after_free_test.py" line_range="27-30" />
<code_context>
+# ---------------------------------------------------------------------------
+# Globals for cross-thread coordination
+# ---------------------------------------------------------------------------
+stop_event = threading.Event()
+crash_detected = threading.Event()
+stats_lock = threading.Lock()
+stats = Counter()
+
+ITERATIONS = 2
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Reset or scope the global threading Events and stats per test run to avoid state leakage across tests.

Because `stop_event`, `crash_detected`, and `stats` are module-level, they keep their state across test invocations in the same pytest run (e.g. `--lf`/`--ff`, or additional tests in this module). If an event remains set or `stats` is non-empty from a prior run, later runs may exit early or report incorrect results. Please either move this logic into a fixture that creates fresh `Event`/`Counter` instances per test, or reset the events and `stats` at the start of `test_schema_reload_under_load()`.

Suggested implementation:

```python
def test_schema_reload_under_load(topology_st):
    # Reset global coordination state for each test invocation to avoid
    # leaking state across tests or pytest runs (e.g. --lf/--ff).
    stop_event.clear()
    crash_detected.clear()
    with stats_lock:
        stats.clear()

```

The exact signature of `test_schema_reload_under_load` in your file may differ (extra fixtures/parameters, different fixture name, etc.). If so, adjust the `SEARCH` pattern to match the actual definition line of that test function.

The important part is to insert the three reset calls as the first statements in `test_schema_reload_under_load`'s body:

```python
    stop_event.clear()
    crash_detected.clear()
    with stats_lock:
        stats.clear()
```

If any other tests in this module also rely on these globals, you should add the same reset snippet at the beginning of those tests as well, or factor it into a small helper and call that helper from each test.
</issue_to_address>

### Comment 6
<location path="dirsrvtests/tests/suites/schema/schema_reload_heap_use_after_free_test.py" line_range="166-168" />
<code_context>
+        except Exception:
+            pass
+
+        task = schema.reload()
+        task.wait()
+        assert task.get_exit_code() == 0
+
+        time.sleep(0.01)
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against a stuck schema reload task by adding a timeout to task.wait().

Without a timeout, a stuck `schema.reload()` (or its task) can hang the entire test run indefinitely. Please use `task.wait(timeout=<seconds>)` and either assert it completes or fail with a clear error if the timeout is exceeded, so we can distinguish a real hang from the intended long-running stress scenario and keep CI reliable.
</issue_to_address>

### Comment 7
<location path="dirsrvtests/tests/suites/schema/schema_reload_heap_use_after_free_test.py" line_range="61-70" />
<code_context>
+    reload_thread.start()
+
+    start_time = time.monotonic()
+    try:
+        while reload_thread.is_alive():
+            reload_thread.join(timeout=5.0)
+            elapsed = time.monotonic() - start_time
+            with stats_lock:
+                s = dict(stats)
+            reloads = s.get('reloads_ok', 0)
+            searches = s.get('searches_ok', 0)
+            conn_errs = (s.get('connection_errors', 0) +
+                         s.get('reload_connection_errors', 0))
+            crashed = s.get('crash_detected', 0)
+
+            status = 'CRASH DETECTED' if crashed else 'running'
+            log.info(f'  [{elapsed:6.1f}s] reloads={reloads}  searches={searches}  '
+                  f'conn_errors={conn_errs}  status={status}')
+
+            if crash_detected.is_set():
+                break
+
</code_context>
<issue_to_address>
**suggestion (testing):** Consider enforcing a global maximum duration for the main stress loop to prevent hangs.

Because this loop only stops when the reload thread exits or `crash_detected` is set, a reload thread stuck in `schema_reload_worker` or a long LDAP call can cause the test to hang indefinitely while the server stays reachable. Please add a hard upper bound (e.g. `pytest.mark.timeout` or a `max_duration` check that fails after N seconds) so the test always terminates even if the reload thread wedges.

Suggested implementation:

```python
stats = Counter()
MAX_STRESS_TEST_DURATION = 300  # seconds; hard upper bound to avoid hanging tests

```

```python
    reload_thread.start()

    start_time = time.monotonic()
    try:
        while reload_thread.is_alive():
            reload_thread.join(timeout=5.0)
            elapsed = time.monotonic() - start_time

            # Enforce a hard upper bound so the test cannot hang indefinitely
            if elapsed > MAX_STRESS_TEST_DURATION:
                pytest.fail(
                    f"schema reload stress test exceeded max duration "
                    f"{MAX_STRESS_TEST_DURATION}s; aborting to avoid hang"
                )

            with stats_lock:
                s = dict(stats)
            reloads = s.get('reloads_ok', 0)
            searches = s.get('searches_ok', 0)
            conn_errs = (s.get('connection_errors', 0) +
                         s.get('reload_connection_errors', 0))
            crashed = s.get('crash_detected', 0)

            status = 'CRASH DETECTED' if crashed else 'running'
            log.info(
                f'  [{elapsed:6.1f}s] reloads={reloads}  searches={searches}  '
                f'conn_errors={conn_errs}  status={status}'
            )

            if crash_detected.is_set():
                break

```

1. Ensure `pytest` is imported at the top of `schema_reload_heap_use_after_free_test.py`, e.g. add `import pytest` alongside the other imports.
2. Optionally, if you prefer a different upper bound, adjust `MAX_STRESS_TEST_DURATION` to the desired number of seconds.
3. If your test suite already uses `pytest.mark.timeout`, you could instead (or additionally) put an explicit `@pytest.mark.timeout(MAX_STRESS_TEST_DURATION + safety_margin)` decorator on the test function containing this loop.
</issue_to_address>

### Comment 8
<location path="dirsrvtests/tests/suites/schema/schema_reload_heap_use_after_free_test.py" line_range="327-329" />
<code_context>
+    log.info('Results')
+    log.info('-' * 70)
+    log.info(f'  Elapsed time:      {elapsed:.1f}s')
+    log.info(f'  Schema reloads:    {s.get("reloads_ok", 0)} ok, '
+          f'{s.get("reloads_fail", 0)} fail, '
+          f'{s.get("reloads_timeout", 0)} timeout')
+    log.info(f'  LDAP searches:     {s.get("searches_ok", 0)} ok, '
+          f'{s.get("searches_fail", 0)} fail, '
</code_context>
<issue_to_address>
**nitpick:** Either initialise or remove the unused *timeout counters in the final stats log.

`reloads_timeout` and `searches_timeout` are never updated in this test, so the log will always report `0 timeout` even if timeouts actually occurred elsewhere. Please either connect these fields to real timeout-handling logic or drop them from the output to avoid misleading diagnostics.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2026-06-15T14:33:48Z

+stop_event = threading.Event()
+crash_detected = threading.Event()
+stats_lock = threading.Lock()
+stats = Counter()


suggestion (bug_risk): Reset or scope the global threading Events and stats per test run to avoid state leakage across tests.

Because stop_event, crash_detected, and stats are module-level, they keep their state across test invocations in the same pytest run (e.g. --lf/--ff, or additional tests in this module). If an event remains set or stats is non-empty from a prior run, later runs may exit early or report incorrect results. Please either move this logic into a fixture that creates fresh Event/Counter instances per test, or reset the events and stats at the start of test_schema_reload_under_load().

Suggested implementation:

def test_schema_reload_under_load(topology_st): # Reset global coordination state for each test invocation to avoid # leaking state across tests or pytest runs (e.g. --lf/--ff). stop_event.clear() crash_detected.clear() with stats_lock: stats.clear()

The exact signature of test_schema_reload_under_load in your file may differ (extra fixtures/parameters, different fixture name, etc.). If so, adjust the SEARCH pattern to match the actual definition line of that test function.

The important part is to insert the three reset calls as the first statements in test_schema_reload_under_load's body:

stop_event.clear() crash_detected.clear() with stats_lock: stats.clear()

If any other tests in this module also rely on these globals, you should add the same reset snippet at the beginning of those tests as well, or factor it into a small helper and call that helper from each test.

sourcery-ai · 2026-06-15T14:33:48Z

+    log.info(f'  Schema reloads:    {s.get("reloads_ok", 0)} ok, '
+          f'{s.get("reloads_fail", 0)} fail, '
+          f'{s.get("reloads_timeout", 0)} timeout')


nitpick: Either initialise or remove the unused *timeout counters in the final stats log.

reloads_timeout and searches_timeout are never updated in this test, so the log will always report 0 timeout even if timeouts actually occurred elsewhere. Please either connect these fields to real timeout-handling logic or drop them from the output to avoid misleading diagnostics.

packit-as-a-service · 2026-06-15T14:48:10Z

Congratulations! One of the builds has completed. 🍾

You can install the built RPMs by following these steps:

sudo dnf install -y 'dnf*-command(copr)'
dnf copr enable packit/389ds-389-ds-base-7582
And now you can install the packages.

Please note that the RPMs should be used only in a testing environment.

progier389

LGTM

droideck · 2026-06-23T21:55:06Z

            }

            newasip->asi_flags |= SLAPI_ATTR_FLAG_KEEP;
+            attr_syntax_return(newasip);


I think this return might be unsafe for this ownership path... newasip comes from parse_at_str(_, &newasip, _), so it was newly allocated rather than acquired through an attr_syntax_get_*() lookup. IIUC, calling attr_syntax_return() here can decrement a refcount that was never incremented; after attr_syntax_add() succeeds, the schema table should own the object.

droideck · 2026-06-23T21:56:43Z

                                                primary_file /* force user defined? */,
                                                schema_ds4x_compat, 0)) != 0)
                    break;
+                attr_syntax_return(tmpasip);


This looks like the same ownership mismatch as above. tmpasip is only a temporary parse result used for validation, not a refcounted lookup result, Should it be released directly with attr_syntax_free() without first calling attr_syntax_return()?

droideck · 2026-06-23T21:58:20Z

                at->asi_prev = at_list;
                at_list = at;
            }
+            attr_syntax_return(at);


Probably, a similar issue here too? The parsed at node is linked into the private remote_at_list and later freed by schema_atlist_free(), so calling attr_syntax_return() while building the list might underflow the refcount

droideck · 2026-06-23T22:10:39Z

    while (global_at) {
        next = global_at->asi_next;
-        attr_syntax_free(global_at);
+        global_at->asi_marked_for_delete = 1;


Maybe I'm overthinking this, but I think deferred old nodes still keep their asi_prev/asi_next links into the old global_at list. If they are later freed via attr_syntax_return(), could attr_syntax_remove() follow stale links or disturb the new global_at? Should we detach them during the swap or skip unlinking swap-orphans?

droideck · 2026-06-23T22:12:46Z

+        except Exception:
+            pass
+
+        task = schema.reload()


This assertion runs inside the background reload thread, so pytest may not fail the test reliably if the reload task exits non-zero. Could we record reload task failures in stats and assert them from the main test thread, along with the existing crash/server-alive checks?

tbordaz

Few comments

tbordaz · 2026-06-24T10:13:21Z

    while (global_at) {
        next = global_at->asi_next;
-        attr_syntax_free(global_at);
+        global_at->asi_marked_for_delete = 1;


I agree with that change but I would expect code review tools to report a risk of leak in case refcnt is not null.

tbordaz · 2026-06-24T10:20:59Z

                                       "Could not be added (OID is \"%s\")",
                                       newasip->asi_oid);
-                attr_syntax_free(newasip);
+                attr_syntax_return(newasip);


Sorry I can not find where the refcnt of 'newasip' is increased in attr_sync_add or before.
So attr_sync_return may set refcnt to -1. Why not just keeping attr_syntax_free ?

tbordaz · 2026-06-24T10:28:34Z

                                                primary_file /* force user defined? */,
                                                schema_ds4x_compat, 0)) != 0)
                    break;
+                attr_syntax_return(tmpasip);


I think this one is useless as refcnt was not increase in that path

tbordaz · 2026-06-24T10:29:11Z

            struct asyntaxinfo *tmpasip = NULL;
            rc = parse_at_str(attr_str, &tmpasip, errorbuf, sizeof(errorbuf),
                              DSE_SCHEMA_NO_GLOCK | schema_flags, 0, 0, 0);
+            attr_syntax_return(tmpasip);


same remark

tbordaz · 2026-06-24T10:33:00Z

                              "parse_at_str(%s) failed - %s\n",
                              at_berval[i]->bv_val, errorbuf[0] ? errorbuf : "unknown");
-                attr_syntax_free(at);
+                attr_syntax_return(at);


Same remark I believe that parse_at_str does not increase refcnt

tbordaz · 2026-06-24T10:33:12Z

                at->asi_prev = at_list;
                at_list = at;
            }
+            attr_syntax_return(at);


same remark

mreynolds389 · 2026-06-24T14:19:24Z

@droideck @tbordaz thanks for the comments. I need to run more tests to verify, but I recall that all these changes were needed to avoid ASAN errors and to fix the original heap corruption issue. This patch was heavily tested without any issues, but perhaps some of the changes are not needed...

Description: The refcount mechanism for attributes is not correctly used, and this prevents its use for synchronizing hashtable access. We increment the refcount, but we don't decrement it correctly. This can potentially cause a rare race condition with schema reload task and heap-use-after-free with searches running at the same time as the reload task. **CWE**: CWE-416 (Use After Free) via CWE-362 (Race Condition) CI test assisted by: Cursor relates: 389ds#7578 Reviewed by: progier, tbordaz, and spichugi(Thanks!!!)

sourcery-ai Bot reviewed Jun 15, 2026

View reviewed changes

mreynolds389 force-pushed the schema-reload-cve branch 2 times, most recently from 5b58ee2 to 67fdb68 Compare June 15, 2026 20:05

progier389 approved these changes Jun 23, 2026

View reviewed changes

mreynolds389 force-pushed the schema-reload-cve branch from 67fdb68 to 5a6f060 Compare June 23, 2026 19:07

droideck requested changes Jun 23, 2026

View reviewed changes

tbordaz reviewed Jun 24, 2026

View reviewed changes

mreynolds389 force-pushed the schema-reload-cve branch 2 times, most recently from 34d6dc7 to 3750475 Compare June 25, 2026 14:31

mreynolds389 force-pushed the schema-reload-cve branch from 3750475 to 313b853 Compare June 25, 2026 21:07

Uh oh!

Conversation

mreynolds389 commented Jun 15, 2026 • edited by sourcery-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by Sourcery

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sourcery-ai Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

sourcery-ai Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

packit-as-a-service Bot commented Jun 15, 2026

Uh oh!

progier389 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tbordaz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mreynolds389 commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mreynolds389 commented Jun 15, 2026 •

edited by sourcery-ai Bot

Loading