Skip to content

Issue 7578 - schema - attribute refcount is not maintained properly#7582

Open
mreynolds389 wants to merge 1 commit into
389ds:mainfrom
mreynolds389:schema-reload-cve
Open

Issue 7578 - schema - attribute refcount is not maintained properly#7582
mreynolds389 wants to merge 1 commit into
389ds:mainfrom
mreynolds389:schema-reload-cve

Conversation

@mreynolds389

@mreynolds389 mreynolds389 commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Description:

The refcount mechanism for attributes is not correctly used, and this prevents its use for synchronizing hashtable access. We increment the refcount, but we don't decrement it correctly. This can potentially cause a rare race condition with schema reload task and heap-use-after-free with searches running at the same time as the reload task.

CWE: CWE-416 (Use After Free) via CWE-362 (Race Condition)

CI test assisted by: Cursor

relates: #7578

Summary by Sourcery

Fix attribute syntax reference counting and cleanup during schema reload to prevent use-after-free race conditions, and add a stress test that exercises concurrent searches with schema reloads to detect crashes.

Bug Fixes:

  • Correct attribute syntax reference counting by ensuring attr_syntax_return is called wherever references are taken and by guarding deletion on both refcount and deletion flags.
  • Prevent premature freeing of global attribute syntax structures during hash table swaps by honoring reference counts and delete markers, avoiding use-after-free during schema reloads.
  • Improve error logging in schema reload paths with clearer messages and ensure temporary DSE structures are destroyed on failure.

Tests:

  • Add a multithreaded schema reload stress test that runs concurrent searches and schema reload tasks, with crash monitoring and detailed statistics, to expose heap-use-after-free or race-condition crashes in attribute syntax handling.

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 8 issues, and left some high level feedback:

  • The new schema reload stress test uses module-level globals (stop_event, crash_detected, stats) without resetting them at the start of the test; if this file ever ends up with multiple tests or gets rerun in the same process, stale state could leak between runs, so consider reinitializing these globals at the beginning of test_schema_reload_under_load.
  • In test_schema_reload_under_load, you log reloads_timeout and searches_timeout but never update those counters anywhere; either implement the timeout accounting or drop those fields from the summary to avoid confusing, always-zero metrics.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The new schema reload stress test uses module-level globals (stop_event, crash_detected, stats) without resetting them at the start of the test; if this file ever ends up with multiple tests or gets rerun in the same process, stale state could leak between runs, so consider reinitializing these globals at the beginning of test_schema_reload_under_load.
- In test_schema_reload_under_load, you log reloads_timeout and searches_timeout but never update those counters anywhere; either implement the timeout accounting or drop those fields from the summary to avoid confusing, always-zero metrics.

## Individual Comments

### Comment 1
<location path="ldap/servers/slapd/attrsyntax.c" line_range="384-385" />
<code_context>

         if (delete_it) {
-            if (asi->asi_marked_for_delete) { /* one final check */
+            if (slapi_atomic_load_64(&asi->asi_refcnt, __ATOMIC_ACQUIRE) == 0 &&
+                asi->asi_marked_for_delete) /* one final check */
+            {
                 if (use_lock) {
</code_context>
<issue_to_address>
**issue (bug_risk):** The final delete check should also re-validate `asi_refcnt` under the write lock to avoid a refcount race.

Between the atomic refcount check and acquiring the write lock, another thread can increment `asi_refcnt` from 0 to 1. Under the write lock you only re-check `asi_marked_for_delete`, so you could free an object whose refcount is no longer zero. After taking the write lock, re-read and validate `asi_refcnt == 0` (or use a CAS-style pattern) before calling `attr_syntax_free(asi)`.
</issue_to_address>

### Comment 2
<location path="ldap/servers/slapd/schema.c" line_range="2864-2865" />
<code_context>
                                        schema_errprefix_at, newasip->asi_name,
                                        "Could not be added (OID is \"%s\")",
                                        newasip->asi_oid);
+                attr_syntax_return(newasip);
                 attr_syntax_free(newasip);
                 goto clean_up_and_return;
             }
</code_context>
<issue_to_address>
**issue (bug_risk):** Calling `attr_syntax_return` followed by `attr_syntax_free` on the same `asi` instance may over-release it.

Here `newasip` is handed back to the attr-syntax layer and then freed again. If `attr_syntax_return(newasip)` already decrements the refcount and may free the object, calling `attr_syntax_free(newasip)` afterwards risks a double free or state corruption. Please use only one of these, consistent with the intended ownership/refcounting contract for `attr_syntax` objects.
</issue_to_address>

### Comment 3
<location path="ldap/servers/slapd/schema.c" line_range="4223-4224" />
<code_context>
                                                 primary_file /* force user defined? */,
                                                 schema_ds4x_compat, 0)) != 0)
                     break;
+                attr_syntax_return(tmpasip);
                 attr_syntax_free(tmpasip); /* trash it */
             } else {
                 if ((*returncode = parse_at_str(s, NULL, returntext,
</code_context>
<issue_to_address>
**question (bug_risk):** The `attr_syntax_return(tmpasip)` + `attr_syntax_free(tmpasip)` pattern may indicate inconsistent ownership handling for parsed attribute syntaxes.

This sequence assumes `parse_at_str` both registers `tmpasip` globally and transfers ownership to the caller. If registration uses a shared refcount, `attr_syntax_return` should likely drop only the registry’s reference; calling `attr_syntax_free` immediately after may be unsafe if the registry still holds (or can reacquire) references. Please verify the intended lifecycle (parse + insert → drop registry ref → free caller’s copy) and confirm that `attr_syntax_free` aligns with that model.
</issue_to_address>

### Comment 4
<location path="ldap/servers/slapd/schema.c" line_range="6049-6050" />
<code_context>
                 slapi_log_err(SLAPI_LOG_ERR, "schema_berval_to_atlist",
                               "parse_at_str(%s) failed - %s\n",
                               at_berval[i]->bv_val, errorbuf[0] ? errorbuf : "unknown");
+                attr_syntax_return(at);
                 attr_syntax_free(at);
                 break;
             }
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Error handling in `schema_berval_to_atlist` also uses `attr_syntax_return` followed by `attr_syntax_free`, with the same potential double-release concern.

Here you both `attr_syntax_return(at)` and then `attr_syntax_free(at)` on the error path, while the success path also calls `attr_syntax_return(at)` after linking into `at_list`. This makes ownership of `at` unclear and risks double free or leaks. Please clarify the contract of `attr_syntax_return` (does it fully drop the registry’s reference?) and standardize the pattern here and at other call sites (e.g., either always `return` and never `free`, or vice versa).

Suggested implementation:

```c
                slapi_log_err(SLAPI_LOG_ERR, "schema_berval_to_atlist",
                              "parse_at_str(%s) failed - %s\n",
                              at_berval[i]->bv_val, errorbuf[0] ? errorbuf : "unknown");
                attr_syntax_return(at);
                break;

```

```c
                                       schema_errprefix_at, newasip->asi_name,
                                       "Could not be added (OID is \"%s\")",
                                       newasip->asi_oid);
                attr_syntax_return(newasip);
                goto clean_up_and_return;

```
</issue_to_address>

### Comment 5
<location path="dirsrvtests/tests/suites/schema/schema_reload_heap_use_after_free_test.py" line_range="27-30" />
<code_context>
+# ---------------------------------------------------------------------------
+# Globals for cross-thread coordination
+# ---------------------------------------------------------------------------
+stop_event = threading.Event()
+crash_detected = threading.Event()
+stats_lock = threading.Lock()
+stats = Counter()
+
+ITERATIONS = 2
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Reset or scope the global threading Events and stats per test run to avoid state leakage across tests.

Because `stop_event`, `crash_detected`, and `stats` are module-level, they keep their state across test invocations in the same pytest run (e.g. `--lf`/`--ff`, or additional tests in this module). If an event remains set or `stats` is non-empty from a prior run, later runs may exit early or report incorrect results. Please either move this logic into a fixture that creates fresh `Event`/`Counter` instances per test, or reset the events and `stats` at the start of `test_schema_reload_under_load()`.

Suggested implementation:

```python
def test_schema_reload_under_load(topology_st):
    # Reset global coordination state for each test invocation to avoid
    # leaking state across tests or pytest runs (e.g. --lf/--ff).
    stop_event.clear()
    crash_detected.clear()
    with stats_lock:
        stats.clear()

```

The exact signature of `test_schema_reload_under_load` in your file may differ (extra fixtures/parameters, different fixture name, etc.). If so, adjust the `SEARCH` pattern to match the actual definition line of that test function.

The important part is to insert the three reset calls as the first statements in `test_schema_reload_under_load`'s body:

```python
    stop_event.clear()
    crash_detected.clear()
    with stats_lock:
        stats.clear()
```

If any other tests in this module also rely on these globals, you should add the same reset snippet at the beginning of those tests as well, or factor it into a small helper and call that helper from each test.
</issue_to_address>

### Comment 6
<location path="dirsrvtests/tests/suites/schema/schema_reload_heap_use_after_free_test.py" line_range="166-168" />
<code_context>
+        except Exception:
+            pass
+
+        task = schema.reload()
+        task.wait()
+        assert task.get_exit_code() == 0
+
+        time.sleep(0.01)
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against a stuck schema reload task by adding a timeout to task.wait().

Without a timeout, a stuck `schema.reload()` (or its task) can hang the entire test run indefinitely. Please use `task.wait(timeout=<seconds>)` and either assert it completes or fail with a clear error if the timeout is exceeded, so we can distinguish a real hang from the intended long-running stress scenario and keep CI reliable.
</issue_to_address>

### Comment 7
<location path="dirsrvtests/tests/suites/schema/schema_reload_heap_use_after_free_test.py" line_range="61-70" />
<code_context>
+    reload_thread.start()
+
+    start_time = time.monotonic()
+    try:
+        while reload_thread.is_alive():
+            reload_thread.join(timeout=5.0)
+            elapsed = time.monotonic() - start_time
+            with stats_lock:
+                s = dict(stats)
+            reloads = s.get('reloads_ok', 0)
+            searches = s.get('searches_ok', 0)
+            conn_errs = (s.get('connection_errors', 0) +
+                         s.get('reload_connection_errors', 0))
+            crashed = s.get('crash_detected', 0)
+
+            status = 'CRASH DETECTED' if crashed else 'running'
+            log.info(f'  [{elapsed:6.1f}s] reloads={reloads}  searches={searches}  '
+                  f'conn_errors={conn_errs}  status={status}')
+
+            if crash_detected.is_set():
+                break
+
</code_context>
<issue_to_address>
**suggestion (testing):** Consider enforcing a global maximum duration for the main stress loop to prevent hangs.

Because this loop only stops when the reload thread exits or `crash_detected` is set, a reload thread stuck in `schema_reload_worker` or a long LDAP call can cause the test to hang indefinitely while the server stays reachable. Please add a hard upper bound (e.g. `pytest.mark.timeout` or a `max_duration` check that fails after N seconds) so the test always terminates even if the reload thread wedges.

Suggested implementation:

```python
stats = Counter()
MAX_STRESS_TEST_DURATION = 300  # seconds; hard upper bound to avoid hanging tests

```

```python
    reload_thread.start()

    start_time = time.monotonic()
    try:
        while reload_thread.is_alive():
            reload_thread.join(timeout=5.0)
            elapsed = time.monotonic() - start_time

            # Enforce a hard upper bound so the test cannot hang indefinitely
            if elapsed > MAX_STRESS_TEST_DURATION:
                pytest.fail(
                    f"schema reload stress test exceeded max duration "
                    f"{MAX_STRESS_TEST_DURATION}s; aborting to avoid hang"
                )

            with stats_lock:
                s = dict(stats)
            reloads = s.get('reloads_ok', 0)
            searches = s.get('searches_ok', 0)
            conn_errs = (s.get('connection_errors', 0) +
                         s.get('reload_connection_errors', 0))
            crashed = s.get('crash_detected', 0)

            status = 'CRASH DETECTED' if crashed else 'running'
            log.info(
                f'  [{elapsed:6.1f}s] reloads={reloads}  searches={searches}  '
                f'conn_errors={conn_errs}  status={status}'
            )

            if crash_detected.is_set():
                break

```

1. Ensure `pytest` is imported at the top of `schema_reload_heap_use_after_free_test.py`, e.g. add `import pytest` alongside the other imports.
2. Optionally, if you prefer a different upper bound, adjust `MAX_STRESS_TEST_DURATION` to the desired number of seconds.
3. If your test suite already uses `pytest.mark.timeout`, you could instead (or additionally) put an explicit `@pytest.mark.timeout(MAX_STRESS_TEST_DURATION + safety_margin)` decorator on the test function containing this loop.
</issue_to_address>

### Comment 8
<location path="dirsrvtests/tests/suites/schema/schema_reload_heap_use_after_free_test.py" line_range="327-329" />
<code_context>
+    log.info('Results')
+    log.info('-' * 70)
+    log.info(f'  Elapsed time:      {elapsed:.1f}s')
+    log.info(f'  Schema reloads:    {s.get("reloads_ok", 0)} ok, '
+          f'{s.get("reloads_fail", 0)} fail, '
+          f'{s.get("reloads_timeout", 0)} timeout')
+    log.info(f'  LDAP searches:     {s.get("searches_ok", 0)} ok, '
+          f'{s.get("searches_fail", 0)} fail, '
</code_context>
<issue_to_address>
**nitpick:** Either initialise or remove the unused *timeout counters in the final stats log.

`reloads_timeout` and `searches_timeout` are never updated in this test, so the log will always report `0 timeout` even if timeouts actually occurred elsewhere. Please either connect these fields to real timeout-handling logic or drop them from the output to avoid misleading diagnostics.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread ldap/servers/slapd/attrsyntax.c
Comment thread ldap/servers/slapd/schema.c Outdated
Comment thread ldap/servers/slapd/schema.c Outdated
Comment thread ldap/servers/slapd/schema.c Outdated
Comment on lines +27 to +30
stop_event = threading.Event()
crash_detected = threading.Event()
stats_lock = threading.Lock()
stats = Counter()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): Reset or scope the global threading Events and stats per test run to avoid state leakage across tests.

Because stop_event, crash_detected, and stats are module-level, they keep their state across test invocations in the same pytest run (e.g. --lf/--ff, or additional tests in this module). If an event remains set or stats is non-empty from a prior run, later runs may exit early or report incorrect results. Please either move this logic into a fixture that creates fresh Event/Counter instances per test, or reset the events and stats at the start of test_schema_reload_under_load().

Suggested implementation:

def test_schema_reload_under_load(topology_st):
    # Reset global coordination state for each test invocation to avoid
    # leaking state across tests or pytest runs (e.g. --lf/--ff).
    stop_event.clear()
    crash_detected.clear()
    with stats_lock:
        stats.clear()

The exact signature of test_schema_reload_under_load in your file may differ (extra fixtures/parameters, different fixture name, etc.). If so, adjust the SEARCH pattern to match the actual definition line of that test function.

The important part is to insert the three reset calls as the first statements in test_schema_reload_under_load's body:

    stop_event.clear()
    crash_detected.clear()
    with stats_lock:
        stats.clear()

If any other tests in this module also rely on these globals, you should add the same reset snippet at the beginning of those tests as well, or factor it into a small helper and call that helper from each test.

Comment thread dirsrvtests/tests/suites/schema/schema_reload_heap_use_after_free_test.py Outdated
Comment on lines +327 to +329
log.info(f' Schema reloads: {s.get("reloads_ok", 0)} ok, '
f'{s.get("reloads_fail", 0)} fail, '
f'{s.get("reloads_timeout", 0)} timeout')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: Either initialise or remove the unused *timeout counters in the final stats log.

reloads_timeout and searches_timeout are never updated in this test, so the log will always report 0 timeout even if timeouts actually occurred elsewhere. Please either connect these fields to real timeout-handling logic or drop them from the output to avoid misleading diagnostics.

@packit-as-a-service

Copy link
Copy Markdown

Congratulations! One of the builds has completed. 🍾

You can install the built RPMs by following these steps:

  • sudo dnf install -y 'dnf*-command(copr)'
  • dnf copr enable packit/389ds-389-ds-base-7582
  • And now you can install the packages.

Please note that the RPMs should be used only in a testing environment.

@mreynolds389 mreynolds389 force-pushed the schema-reload-cve branch 2 times, most recently from 5b58ee2 to 67fdb68 Compare June 15, 2026 20:05

@progier389 progier389 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread ldap/servers/slapd/schema.c Outdated
}

newasip->asi_flags |= SLAPI_ATTR_FLAG_KEEP;
attr_syntax_return(newasip);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this return might be unsafe for this ownership path... newasip comes from parse_at_str(_, &newasip, _), so it was newly allocated rather than acquired through an attr_syntax_get_*() lookup. IIUC, calling attr_syntax_return() here can decrement a refcount that was never incremented; after attr_syntax_add() succeeds, the schema table should own the object.

Comment thread ldap/servers/slapd/schema.c Outdated
primary_file /* force user defined? */,
schema_ds4x_compat, 0)) != 0)
break;
attr_syntax_return(tmpasip);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like the same ownership mismatch as above. tmpasip is only a temporary parse result used for validation, not a refcounted lookup result, Should it be released directly with attr_syntax_free() without first calling attr_syntax_return()?

Comment thread ldap/servers/slapd/schema.c Outdated
at->asi_prev = at_list;
at_list = at;
}
attr_syntax_return(at);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, a similar issue here too? The parsed at node is linked into the private remote_at_list and later freed by schema_atlist_free(), so calling attr_syntax_return() while building the list might underflow the refcount

while (global_at) {
next = global_at->asi_next;
attr_syntax_free(global_at);
global_at->asi_marked_for_delete = 1;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm overthinking this, but I think deferred old nodes still keep their asi_prev/asi_next links into the old global_at list. If they are later freed via attr_syntax_return(), could attr_syntax_remove() follow stale links or disturb the new global_at? Should we detach them during the swap or skip unlinking swap-orphans?

except Exception:
pass

task = schema.reload()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion runs inside the background reload thread, so pytest may not fail the test reliably if the reload task exits non-zero. Could we record reload task failures in stats and assert them from the main test thread, along with the existing crash/server-alive checks?

@tbordaz tbordaz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments

while (global_at) {
next = global_at->asi_next;
attr_syntax_free(global_at);
global_at->asi_marked_for_delete = 1;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that change but I would expect code review tools to report a risk of leak in case refcnt is not null.

Comment thread ldap/servers/slapd/schema.c Outdated
"Could not be added (OID is \"%s\")",
newasip->asi_oid);
attr_syntax_free(newasip);
attr_syntax_return(newasip);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I can not find where the refcnt of 'newasip' is increased in attr_sync_add or before.
So attr_sync_return may set refcnt to -1. Why not just keeping attr_syntax_free ?

Comment thread ldap/servers/slapd/schema.c Outdated
primary_file /* force user defined? */,
schema_ds4x_compat, 0)) != 0)
break;
attr_syntax_return(tmpasip);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this one is useless as refcnt was not increase in that path

Comment thread ldap/servers/slapd/schema.c Outdated
struct asyntaxinfo *tmpasip = NULL;
rc = parse_at_str(attr_str, &tmpasip, errorbuf, sizeof(errorbuf),
DSE_SCHEMA_NO_GLOCK | schema_flags, 0, 0, 0);
attr_syntax_return(tmpasip);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same remark

Comment thread ldap/servers/slapd/schema.c Outdated
"parse_at_str(%s) failed - %s\n",
at_berval[i]->bv_val, errorbuf[0] ? errorbuf : "unknown");
attr_syntax_free(at);
attr_syntax_return(at);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same remark I believe that parse_at_str does not increase refcnt

Comment thread ldap/servers/slapd/schema.c Outdated
at->asi_prev = at_list;
at_list = at;
}
attr_syntax_return(at);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same remark

@mreynolds389

Copy link
Copy Markdown
Contributor Author

@droideck @tbordaz thanks for the comments. I need to run more tests to verify, but I recall that all these changes were needed to avoid ASAN errors and to fix the original heap corruption issue. This patch was heavily tested without any issues, but perhaps some of the changes are not needed...

@mreynolds389 mreynolds389 force-pushed the schema-reload-cve branch 2 times, most recently from 34d6dc7 to 3750475 Compare June 25, 2026 14:31
Description:

The refcount mechanism for attributes is not correctly used, and this prevents
its use for synchronizing hashtable access. We increment the refcount, but we
don't decrement it correctly. This can potentially cause a rare race condition
with schema reload task and heap-use-after-free with searches running at the
same time as the reload task.

**CWE**: CWE-416 (Use After Free) via CWE-362 (Race Condition)

CI test assisted by: Cursor

relates: 389ds#7578

Reviewed by: progier, tbordaz, and spichugi(Thanks!!!)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants