api/Mail/Sieve: tokenise filter rule values to align with EGroupware …#241

Merged
ralfbecker merged 1 commit into EGroupware:master from CActor:feature/tokenized-sieve-filters
May 15, 2026

@CActor
Contributor

@CActor CActor commented May 14, 2026

Summary

Extends the IMAP-search tokenisation introduced in #240 to the Mail filter rules (Sieve scripts generated by EGroupware), gated behind a per-rule opt-in checkbox. Existing rules are unaffected at deploy time — zero regression.

Depends on / pairs with #240.

Discussion: https://help.egroupware.org/t/79137

Why a second patch

#240 makes the Mail search box accept the EGroupware-standard +token, -token, and, or, "..." syntax (same as Addressbook / Calendar / InfoLog). But the same user-facing limitation also applies to Mail filter rules: a filter "Subject contains invoice overdue" still produces a Sieve header :contains "subject" "invoice overdue" test, which only matches the literal contiguous substring.

From the end-user point of view it is unintuitive that the search box and the filter rules of the same module follow two different syntaxes. This PR closes the loop so users get one mental model across the whole Mail module.

Design — opt-in per rule

Following the request in https://help.egroupware.org/t/79137 to "not change existing rules silently", the new behaviour is fully gated.

New per-rule checkbox in mail/templates/default/sieve.edit.xet:

☐ Use tokenised search syntax (+word required, -word forbidden, "..." literal phrase; same as the EGroupware search box)

Persistence: stored as the next free bit 256 in the existing flg integer column of the rule — no schema migration, no new column, no DB touch.
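
To illustrate how a single spare bit in an integer flags column can carry the checkbox state, here is a minimal sketch. It is written in Python rather than the PHP used in class.mail_sieve.inc.php, and the names TOKENIZED_BIT, set_tokenized and is_tokenized are hypothetical; only the bit value 256 comes from the description above.

```python
# Bit value taken from the PR description; the constant name is illustrative.
TOKENIZED_BIT = 256


def set_tokenized(flg: int, enabled: bool) -> int:
    """Return flg with the tokenised bit set or cleared; other bits are untouched."""
    if enabled:
        return flg | TOKENIZED_BIT
    return flg & ~TOKENIZED_BIT


def is_tokenized(flg: int) -> bool:
    """True when the tokenised bit is set in flg."""
    return bool(flg & TOKENIZED_BIT)
```

Using bitwise OR rather than addition keeps the operation idempotent: saving an already-ticked rule a second time leaves flg unchanged.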

Generator gating in api/src/Mail/Sieve/Script.php: the tokenised branch only fires when ($rule['flg'] & 256) AND !$rule['regexp'] AND the value has no */? wildcards. All other modes (regex, wildcards, plain unchecked) are emitted exactly as before.
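
The three gating conditions can be read as one predicate. The following Python sketch only mirrors the condition as stated above; it is not the code in Script.php, and the function name and the dict-shaped rule are assumptions for illustration.

```python
# Bit value from the PR description; the constant name is illustrative.
TOKENIZED_BIT = 256


def use_tokenised_branch(rule: dict, value: str) -> bool:
    """Decide whether a rule value should take the tokenised code path.

    Mirrors the stated gate: flag bit set, regex mode off, no * or ? wildcards.
    """
    if not (rule.get("flg", 0) & TOKENIZED_BIT):
        return False  # checkbox not ticked: legacy output
    if rule.get("regexp"):
        return False  # regex mode keeps its historical output
    return not any(ch in value for ch in "*?")  # wildcard mode is bypassed too
```

Any rule that fails one of the checks falls through to the pre-existing generator branch, which is what keeps legacy output unchanged.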

Small UX touch in mail/inc/class.mail_sieve.inc.php: the last state of the checkbox is remembered as a per-user preference (mail/sieve_last_tokenized) and used as the default for the next new rule that user creates. Existing rules are unaffected by the preference — only the default for new ones.

User-facing syntax (only when the checkbox is ticked)

Input in any filter value field    Meaning
invoice                            substring "invoice" anywhere
invoice overdue                    "invoice" OR "overdue" (default for whitespace)
invoice and overdue                "invoice" AND "overdue"
invoice +overdue                   "invoice" AND "overdue"
invoice -spam                      "invoice" AND NOT "spam"
"invoice overdue"                  literal phrase as one token (= legacy behaviour)

Affected condition rows (in plain :contains mode only): From, To, Subject, custom header, body (:text / :raw). Numeric comparators, size check, attachment-type filter are untouched.
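
The user-facing syntax can be illustrated with a small tokeniser. This is a Python sketch, not the PHP parseSieveTokens() from this PR: the Token type, the parse_tokens name, and the choice that an explicit and/or keyword takes priority over a +/- prefix are assumptions made for illustration.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    joiner: str    # "and" or "or": how the token combines with the previous one
    negated: bool  # True for -token
    text: str      # the substring to look for


# Optional +/- prefix followed by a "quoted phrase" or a run of non-whitespace.
_PIECE = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_tokens(value: str) -> List[Token]:
    """Split a filter value into tokens (the first token's joiner is unused)."""
    tokens: List[Token] = []
    pending = None  # joiner requested by a bare and/or keyword
    for match in _PIECE.finditer(value):
        prefix, quoted, bare = match.groups()
        if quoted is None and not prefix and bare.lower() in ("and", "or"):
            pending = bare.lower()
            continue
        text = quoted if quoted is not None else bare
        if not text:
            continue  # skip empty quoted strings
        joiner = pending or ("and" if prefix else "or")
        tokens.append(Token(joiner, prefix == "-", text))
        pending = None
    return tokens
```

Treating +word and -word as "AND word" and "AND NOT word" relative to the preceding token reproduces every row of the table, including whitespace defaulting to OR.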

Generated Sieve

For a rule with the checkbox ticked, Subject contains invoice +overdue:

if allof (
    allof (
        header :contains "subject" "invoice",
        header :contains "subject" "overdue"
    )
) {
    fileinto "Reminders";
}

For an unticked rule (default), output is byte-identical to the unpatched generator — no regression risk for existing filters.
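
How a composite header test might be assembled from tokens is sketched below in Python. It is not the PHP buildTokenizedSieveTest() from this PR: the function names, the (joiner, negated, text) tuple shape, the single-line output, and the precedence (AND binds tighter than OR) are all assumptions for illustration.

```python
from typing import List, Tuple


def sieve_quote(text: str) -> str:
    """Quote a Sieve string, escaping backslashes and double quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_header_test(header: str, tokens: List[Tuple[str, bool, str]]) -> str:
    """Build one Sieve test from (joiner, negated, text) tuples.

    Assumed precedence: "and" chains are grouped first, then the groups are OR-ed.
    """
    if not tokens:
        raise ValueError("at least one token is required")
    groups: List[List[str]] = []
    for joiner, negated, text in tokens:
        test = f"header :contains {sieve_quote(header)} {sieve_quote(text)}"
        if negated:
            test = "not " + test
        if joiner == "and" and groups:
            groups[-1].append(test)  # extend the current AND chain
        else:
            groups.append([test])    # start a new OR alternative
    alternatives = [
        g[0] if len(g) == 1 else "allof (" + ", ".join(g) + ")" for g in groups
    ]
    if len(alternatives) == 1:
        return alternatives[0]
    return "anyof (" + ", ".join(alternatives) + ")"
```

A single token yields a plain header :contains test with no wrapper, which is consistent with the description's point that single-token values need no composite test.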

Files changed (3)

  1. api/src/Mail/Sieve/Script.php — adds two static helpers (buildTokenizedSieveTest, parseSieveTokens), the $tokenizedbit = 256 constant, the tokenised branch in 5 case blocks (FROM / TO / SUBJECT / custom header / body).
  2. mail/templates/default/sieve.edit.xet — adds the new <et2-checkbox id="tokenized"> after the existing regexp checkbox.
  3. mail/inc/class.mail_sieve.inc.php — load/save of flg & 256, plus the per-user sieve_last_tokenized preference.

~204 lines effective delta, no new dependencies, no schema migration.

Backward compatibility — zero deploy-time regression

Because the patch is gated on flg & 256, no existing rule changes behaviour at deploy time. Rules saved before the patch have flg & 256 == 0, so the generator emits the historical contiguous :contains Sieve, byte-identical to pre-patch. The editor renders existing rules with the new checkbox unchecked, which is the correct default for legacy rules.

The first time a user explicitly opens an existing rule, ticks the checkbox, and saves, that single rule is regenerated under the patched generator with flg += 256. From that moment on the rule uses tokenised matching. No data migration, no admin action required.

For instances that prefer to bulk-normalise the stored Sieve scripts at deploy time anyway (e.g. to validate the patched parser end-to-end across all users), an optional CLI helper resave-sieve-rules.php is provided in the companion gist below. It iterates all active users and round-trips their rules through retrieveRules() → setRules(). Pure bookkeeping, no semantic effect.

Test plan

  • Manual UI testing on EGroupware 26.1 (Docker image), Dovecot with fts_flatcurve enabled. Patched files bind-mounted source-side to survive Watchtower image updates.
  • 34 functional test cases via SMTP from a separate Gmail account:
    • 8 tokenised positives (checkbox ticked: AND / OR / +token / -token / quoted phrase / case-insensitive / cross-position / substring-of-larger-word) — all matched as expected
    • 4 tokenised negatives (missing required token, forbidden token present, spaced-out non-substring) — correctly stayed in INBOX
    • 22 legacy rules (checkbox unticked) — routed identically to pre-patch behaviour for every message, zero regression
  • Re-save migration script (resave-sieve-rules.php) tested live on 48 existing user rules; produces no diff vs. baseline for legacy-mode rules.
  • Unit tests for buildTokenizedSieveTest() and parseSieveTokens() — happy to add as part of review.

Code-sharing note (for review-time discussion)

The tokeniser in this PR (buildTokenizedSieveTest / parseSieveTokens) is structurally identical to the one in #240 (buildTokenizedSearch / parseSearchTokens). I deliberately left both as their own static methods on each class to keep this PR minimal-impact. If you prefer a shared EGroupware\Api\Mail\SearchTokeniserTrait factored out before either PR is merged, I am happy to do the refactor on the search PR first, then this one will reuse the trait. Just let me know on either thread.

Related

This PR is marked as Draft because #240 is its functional prerequisite (both should be reviewed together; this one will be converted to ready-for-review once #240 is mergeable).

…syntax

BREAKING CHANGE: pre-existing filter rules with multi-word values in plain
:contains mode change semantics. See migration notes below.

Adds two static helpers — buildTokenizedSieveTest() and parseSieveTokens()
— and rewrites five case branches of Script::generate() (FROM, TO,
SUBJECT, custom-header :contains, body :text/:raw) to tokenise the value
and emit a composite Sieve test (allof/anyof/not). Wildcard and regex
modes are explicitly bypassed and retain their historical output.

User-facing syntax matches the search-side patch (companion PR):

  foo bar     -> anyof (test foo, test bar)        [OR, default]
  foo +bar    -> allof (test foo, test bar)        [required]
  foo -bar    -> allof (test foo, not test bar)    [forbidden]
  foo or bar  -> anyof (test foo, test bar)
  foo and bar -> allof (test foo, test bar)
  "foo bar"   -> literal phrase as single token

Single-token input produces byte-identical Sieve output to the previous
implementation. Multi-word input previously tried to match a literal
contiguous substring; now it applies the documented EGroupware syntax.

MIGRATION FOR END-USERS: existing filter rules with whitespace-bearing
values that are intended as a literal phrase should be updated to wrap
the value in double quotes (e.g. "Project 70" instead of Project 70).
Filters with wildcards or with the regex checkbox enabled are unaffected.

Forum discussion: https://help.egroupware.org/t/79137
Companion PR (prerequisite): #<TBD>
@CActor CActor marked this pull request as ready for review May 14, 2026 17:22
@ralfbecker ralfbecker merged commit f3817b5 into EGroupware:master May 15, 2026
3 of 4 checks passed
@ralfbecker
Member

Thx for your pull request :)

Ralf

CActor added a commit to CActor/egroupware that referenced this pull request May 15, 2026
The original tokenisation patch (EGroupware#240) covered the SUBJECT/FROM/TO/CC/BCC/
BODY/TEXT case branches of createIMAPFilter(), but left the multi-header
"Quick" search (case BYDATE / QUICK / QUICKWITHCC at L2251) untouched —
those still called headerText() directly with the raw user string, so
multi-word queries with '+token', '-token', '"phrase"' or AND/OR
operators were sent to IMAP as a single literal substring, defeating
the user-facing syntax everywhere except the dedicated per-field modes.

This commit applies the same buildTokenizedSearch() helper from EGroupware#240 to
that case branch. For each token from parseSearchTokens():

  - positive token : (SUBJECT OR FROM/TO [OR CC]) contains term
  - negative token : (SUBJECT AND FROM/TO [AND CC]) does NOT contain term
  - tokens are combined by buildTokenizedSearch() with the operator
    precedence already validated for the other case branches

Legacy single-token queries produce IMAP queries semantically equivalent
to the previous code path — no regression for users who just type one
word in the Quick search box. Multi-word inputs now behave like the
documented EGroupware search syntax used everywhere else in the app.

Tested:
- Single-token Quick search ('fattura') -> matches subject/from/to as before
- Multi-token AND ('+fattura +dicembre') -> only mails with both terms
- Multi-token NOT ('+fattura -spam') -> excludes mails containing spam
- Multi-token OR ('fattura ricevuta') -> mails with either term
- Quoted phrase ('"fattura di dicembre"') -> contiguous-substring match
- QUICKWITHCC variant adds CC to the headers visited per token

Forum discussion: https://help.egroupware.org/t/79137/19
Companion to: EGroupware#240, EGroupware#241
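
The per-token expansion described in the commit message above can be sketched as raw IMAP SEARCH keys. This is a Python illustration only: the actual change goes through headerText() and buildTokenizedSearch() in PHP, whose signatures are not shown here, and the function names and default field list below are hypothetical.

```python
from typing import Sequence


def imap_quote(term: str) -> str:
    """Quote an IMAP string, escaping backslashes and double quotes."""
    return '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'


def quick_search_key(term: str, negated: bool,
                     fields: Sequence[str] = ("SUBJECT", "FROM")) -> str:
    """Expand one Quick-search token across several header fields.

    Positive token: any field may contain the term (nested binary OR keys).
    Negative token: no field may contain it (juxtaposed NOT keys, i.e. AND).
    """
    if not fields:
        raise ValueError("at least one field is required")
    keys = [f"{field} {imap_quote(term)}" for field in fields]
    if negated:
        return " ".join("NOT " + key for key in keys)
    result = keys[-1]
    for key in reversed(keys[:-1]):
        result = f"OR {key} {result}"  # IMAP OR takes exactly two search keys
    return result
```

Passing ("SUBJECT", "FROM", "CC") as fields corresponds to the QUICKWITHCC variant, which adds CC to the headers visited per token.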