xcif parser now wired to the iotbx.cif.reader #1149

Open
olegsobolev wants to merge 25 commits into cctbx:master from olegsobolev:cif-reader

@olegsobolev
Contributor

xcif — a fast C++ CIF parser — is now wired into iotbx.cif.reader with a single global switch in iotbx/cif/__init__.py (the module-level DEFAULT_ENGINE constant) plus a per-call engine= kwarg on the reader. DEFAULT_ENGINE stays "ucif" in this PR; callers who want to opt in today can pass engine="xcif" explicitly, and a one-line follow-up will flip the default after bake-in.

The full test suite, including the Phenix tests, passes with the switch flipped.

Benchmark on a 34 MB mmCIF (4v9d.cif): ~6× speedup end-to-end through iotbx.cif.reader, ~21× for the raw parser alone.

olegsobolev and others added 21 commits April 15, 2026 12:49
…trtod_l

  using a static C locale — thread-safe, zero overhead
  - Debug prints removed (tst_tokenizer.cpp): 3 leftover cout lines gone
  - Missing tests added (tst_numeric.cpp): as_int at INT_MAX/INT_MIN, overflow
   above/below both bounds, as_double_with_su with exponent+SU ("1.5e+3(2)" →
  (1500.0, 0.2))
  - Python test added (tst_bindings.py): column_as_flex_int with a . cell
  raises ValueError
  - nullptr style (xcif_ext.cpp): two p(0) initializers changed to p(nullptr)
  tst_numeric.cpp (3 new tests in §5b):
  - test_double_long_value_nan — values ≥32 chars hit the buffer limit and return NaN
  - test_su_missing_close_paren — "1.5(" without closing ) returns (1.5, 0.0) gracefully
  - test_su_nonnumeric_in_su_content — "1.5(3a5)" accumulates digits, skips 'a' → su=3.5

  tst_tokenizer.cpp (3 new tests):
  - test_unterminated_single_quote_returns_rest — graceful degradation, no crash
  - test_unterminated_double_quote_returns_rest — same for double quotes
  - test_null_byte_truncates_unquoted_token — '\0' stops unquoted value at "val"

  tst_error_handling.cpp (2 new tests in §5):
  - test_unterminated_quoted_string_graceful — parse succeeds, tag is accessible
  - test_unclosed_semicolon_field_graceful — parse succeeds, tag is accessible

  tst_api_compat.py (new exercise_loop_category):
  - Empty list, single tag (with/without dot), multi-tag dot-category, multi-tag underscore-category edge
  cases
  Root cause: locale_t, newlocale(), and strtod_l() are POSIX extensions — they exist on macOS and Linux
  but not on Windows/MSVC. MSVC provides equivalent but differently-named APIs: _locale_t,
  _create_locale(), and _strtod_l().

  Fix in numeric.cpp: Replaced the single POSIX-only xcif_c_locale() function with a platform-selected
  xcif_strtod_c() wrapper using #ifdef _WIN32:
  Two related changes, both needed for parsing the cctbx monomer
  library (mon_lib_list.cif and friends start with a CIF 1.1
  `global_` block).

  - Tokenizer: the exact 7-char token `global_` (case-insensitive)
    now produces TOKEN_BLOCK_HEADER instead of TOKEN_VALUE. Parser
    preserves the full `global_` name (no `data_` prefix strip) and
    backs the Block name_ string_view with static storage so it
    survives Document copy/move.

  - parse()/parse_file() take bool strict=true; xcif_ext exposes it
    as a kwarg. With strict=false, pair items or loops appearing
    before the first data_ block attach to an implicit `global_`
    block, matching ucif's non-strict behaviour.

  Tests: new xcif/regression/cpp/tst_non_strict_mode.cpp and
  xcif/regression/tst_non_strict_mode.py. tst_error_handling and
  all other strict-mode assertions unchanged.
  example.cif was enriched in df37a0d ("Adjust mmCIF example to
  conform dictionary") — gaining extra loops and tags — but the
  tokenize/parse demo tests were left asserting the old counts.
  whitespace or EOF

  CIF 1.1 §2.2.7.3 specifies that the closing delimiter of a quoted
  string (' for single, " for double) is only treated as a close when
  followed by whitespace or end-of-input. A delimiter followed by any
  other character is part of the string content.

  xcif's tokenizer previously terminated on the first delimiter
  unconditionally, breaking real-world values like the cctbx monomer
  library's

    'N-METHYL-PYRIDOXAL-5'-PHOSPHATE     '    (mon_lib_list.cif:1688)

  where the inner `'` precedes `-` (non-whitespace). Fix: when
  encountering the delimiter, peek at the next char and include the
  delimiter in the string unless that next char is whitespace or EOF.

  Tests: new xcif/regression/cpp/tst_quoted_string_delimiter.cpp and
  xcif/regression/tst_quoted_string_delimiter.py cover plain strings,
  embedded delimiters (single and double quotes), multiple embedded
  apostrophes, and close-by-tab / close-by-EOL / close-by-EOF.
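The closing-delimiter rule described in this commit can be sketched in a few lines of Python. This is only an illustration of the rule, not xcif's C++ tokenizer: `read_quoted` and `WHITESPACE` are hypothetical names, `ValueError` stands in for xcif's `CifError`, and raising on an unterminated string follows the later strict-mode commit in this PR rather than the behaviour at this point in the history.

```python
WHITESPACE = " \t\r\n"


def read_quoted(text, start):
    """Read a quoted string that opens at text[start].

    Returns (content, index_after_closing_quote). A quote character only
    closes the string when it is followed by whitespace or end-of-input.
    """
    quote = text[start]
    if quote not in "'\"":
        raise ValueError("expected a quote character at index %d" % start)
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch in "\r\n":
            break  # a quoted string cannot continue onto the next line
        if ch == quote:
            at_end = (i + 1 == len(text))
            if at_end or text[i + 1] in WHITESPACE:
                return text[start + 1:i], i + 1
            # Otherwise the quote is part of the content; keep scanning.
        i += 1
    raise ValueError("unterminated quoted string starting at index %d" % start)
```

With the monomer-library value quoted above, the inner `'` is followed by `-`, so the scan continues and the content keeps the embedded apostrophe.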
  Add engine="ucif"|"xcif" to iotbx.cif.reader and a module-level
  DEFAULT_ENGINE (default "ucif"). Callers can opt in per-instance
  with reader(..., engine="xcif").

  When engine="xcif", the reader calls xcif_ext.parse() and walks
  the resulting Document in Python, driving the existing
  cif_model_builder via its add_data_block / add_data_item /
  add_loop / start_save_frame / end_save_frame callbacks.
  Downstream code still receives mutable iotbx.cif.model.cif
  objects, so mutation sites (restraint_file_merge,
  pdb_interpretation, as_cif_block writers) are unaffected.

  Details:
  - Walker wraps both ValueError and RuntimeError from
    xcif_ext.parse as CifParserError.
  - Heading prefixes ("data_", "save_", literal "global_") are
    restored before calling the builder.
  - error_count() and show_errors() tolerate self.parser=None.

  Test: xcif/regression/tst_iotbx_cif_reader_parity.py parses a
  corpus (7 inline fixtures + 3 real files) with both engines
  and asserts structural equality of the resulting
  iotbx.cif.model. Uses the engine= kwarg directly, so it's safe
  under parallel test runners.
  CIF 1.1 §2.2.7.4 requires the closing delimiter of a semicolon
  text field to appear as the first character of a line. xcif's
  tokenizer previously returned a partial TOKEN_VALUE on EOF and
  let the parse succeed — silently accepting content where the
  user may have mistaken a mid-line `;` for a terminator.

  read_semicolon_field now throws CifError when the input ends
  before a column-1 `;` is found. The existing "graceful"
  regression (tst_error_handling.cpp::test_unclosed_semicolon_field)
  is flipped to assert the raise.

  New regression fixture xcif/regression/cpp/tst_semicolon_field_strict.cpp
  and xcif/regression/tst_semicolon_field_strict.py cover: EOF
  mid-field, EOF on the opening-`;` line, a trailing mid-line `;`
  mistaken for a terminator (the shape that prompted this fix),
  plus several well-formed positive cases.
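As a rough illustration of the strict rule, the following Python sketch reads a semicolon text field from a list of lines and fails when the input ends before a column-1 `;`. The function name mirrors `read_semicolon_field` from the commit message, but the signature, the line-based input, and the use of `ValueError` in place of `CifError` are assumptions made for this example.

```python
def read_semicolon_field(lines, start):
    """Read a semicolon text field that opens on lines[start].

    `lines` holds the input split into lines without their newlines.
    Returns (content, index_of_line_after_terminator).
    """
    if not lines[start].startswith(";"):
        raise ValueError("line %d does not open a text field" % (start + 1))
    body = [lines[start][1:]]  # text after the opening ';'
    for i in range(start + 1, len(lines)):
        if lines[i].startswith(";"):
            # Only a ';' in column 1 terminates the field.
            return "\n".join(body), i + 1
        body.append(lines[i])  # a mid-line ';' is ordinary content
    raise ValueError(
        "unterminated semicolon text field opened on line %d" % (start + 1))
```

A line such as `some text ;` is therefore kept as content, and a file that ends right after it raises instead of yielding a partial value.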
… fixes):

  xcif: align strict-mode enforcement with ucif

  Four related changes to make xcif behave like ucif when used behind
  iotbx.cif.reader — required so iotbx.cif tests that exercise syntax
  errors pass under engine="xcif".

  - xcif Parser rejects `global_` block headers in strict mode with
    CifError. `global_` is a STAR/DDL2 reserved word, not part of
    strict CIF 1.1; ucif treats it as a strict-only rejection. The
    cctbx monomer library (mon_lib_list.cif and friends) uses
    `global_` and loads with strict=false.

  - xcif Tokenizer raises CifError on unterminated single- or
    double-quoted strings instead of silently returning the partial
    content (see the grammar in the CIF 1.1 syntax spec at
    https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax).
    Two existing regression tests
    (test_unterminated_{single,double}_quote_returns_rest and
    test_unterminated_quoted_string_graceful) are flipped to assert
    the raise.

  - iotbx/cif/__init__.py:
    - `_drive_builder_from_xcif` takes a strict= argument and passes
      it to xcif_ext.parse(strict=). reader(..., strict=False) now
      actually relaxes xcif, not just ucif.
    - The walker drops `global_` blocks entirely (matching ucif/cif.g:
      "global blocks are ignored"), so str(model) of a file with a
      leading `global_` plus a data_ block shows only the data_ block,
      same as ucif.
    - `_drive_builder_from_xcif` now returns a list of error messages
      rather than raising, so `reader(raise_if_errors=False)` gets a
      populated error_count() instead of an exception — matching
      ucif's error-accumulation contract.

  - Tests in xcif/regression/tst_non_strict_mode (C++ and Python) that
    previously asserted `global_` works in strict mode are updated to
    assert the opposite (strict rejects, strict=false accepts), plus a
    new test for the strict rejection.
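The error-accumulation contract and the dropping of `global_` blocks can be sketched as below. This is a simplified stand-in, not the real `_drive_builder_from_xcif`: the `parse` callable, its return shape (a list of `(name, items)` tuples), and the one-argument and two-argument builder callbacks are assumptions for illustration. Only the callback names `add_data_block` and `add_data_item` come from the commit messages.

```python
def drive_builder(parse, text, builder):
    """Walk parsed blocks into `builder` and return a list of error messages.

    On a parse failure the builder is never touched, so the caller sees an
    empty model plus a non-empty error list instead of an exception.
    """
    try:
        blocks = parse(text)  # hypothetical: list of (name, [(tag, value), ...])
    except (ValueError, RuntimeError) as exc:
        return [str(exc)]
    for name, items in blocks:
        if name.lower() == "global_":
            continue  # mirror ucif: global blocks are ignored
        builder.add_data_block("data_" + name)  # restore the heading prefix
        for tag, value in items:
            builder.add_data_item(tag, value)
    return []
```

A caller using `raise_if_errors=False` would then derive `error_count()` from the length of the returned list.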
  Replace the invented "§2.2.7.3 / §2.2.7.4" references with the
  actual paragraph numbers from the IUCr CIF 1.1 syntax specification
  (https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax):

  - Paragraph 15 — quoted strings may embed their delimiter provided
    the delimiter is not followed by whitespace; the closing `'`/`"`
    is only recognised when followed by whitespace or end-of-input.
  - Paragraph 17 — a semicolon text field is delimited by `;`
    appearing as the first character of a line.
  xcif's parse_loop raised "loop value count is not a multiple of tag
  count" with only a source:line:col prefix. For large CIF files with
  many similar loops, line number alone is not enough to locate the
  bad loop — ucif's "Wrong number of data items for loop containing
  _TAG" names the first tag, which is greppable.

  Message now includes the first tag of the offending loop plus the
  actual value/tag counts, e.g.:

    loop containing tag '_atom_site.group_PDB' has 57 values, not a
    multiple of its 18 tags

  Existing regression (tst_error_handling.cpp) tightened from a bare
  "multiple" substring check to assert the tag name and both counts
  are present.
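A minimal Python sketch of the check and the new message shape follows. `check_loop_shape` is a hypothetical helper, `ValueError` stands in for xcif's `CifError`, and the `source:line:col` prefix that xcif prepends is left out.

```python
def check_loop_shape(tags, values):
    """Return the number of rows in a loop, or raise if values do not fit.

    The error names the first tag of the loop and both counts, so the
    offending loop can be found by searching the file for the tag.
    """
    if not tags:
        raise ValueError("loop has no tags")
    if len(values) % len(tags) != 0:
        raise ValueError(
            "loop containing tag '%s' has %d values, not a multiple of its "
            "%d tags" % (tags[0], len(values), len(tags)))
    return len(values) // len(tags)
```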
  Three tests in iotbx/cif/tests/tst_model_builder.py
  (test_repeated_loop_1, test_repeated_loop_2, test_repeated_tags)
  and one in iotbx/pdb/tst_pdb.py (exercise_pdb_input_error_handling)
  asserted the exact error strings emitted by ucif's cif_model_builder
  at model-build time. xcif rejects the same malformed inputs at parse
  time with different wording, so these tests break when
  iotbx.cif.reader runs with engine="xcif".

  Replace the strict equality checks with explicit both-wording
  assertions — both messages listed verbatim in an `accepted` tuple,
  the xcif branch matched with endswith(": " + m) to allow for the
  "<source>:line:col: " prefix xcif error translators add.

  Both wordings are preserved in the source so a future reader can
  see exactly what each parser says, and either parser's wording can
  change deliberately with a loud test failure rather than silently
  regressing.
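The both-wording assertion described above might look roughly like this. The helper name `matches_accepted` and the `example.cif:12:1: ` prefix are made up for the example, and the two message strings are borrowed from the loop-count commit earlier in this PR, since the exact strings used in `tst_model_builder.py` are not shown here.

```python
def matches_accepted(message, accepted):
    """Return True if `message` matches either parser's wording.

    `accepted` is a (ucif_wording, xcif_wording) tuple. The ucif wording
    must match exactly; the xcif wording may carry a
    "<source>:line:col: " prefix, hence the endswith check.
    """
    ucif_wording, xcif_wording = accepted
    return message == ucif_wording or message.endswith(": " + xcif_wording)


accepted = (
    "Wrong number of data items for loop containing _atom_site.group_PDB",
    "loop containing tag '_atom_site.group_PDB' has 57 values, "
    "not a multiple of its 18 tags",
)
```

Keeping both strings in the tuple means a deliberate wording change in either parser fails the test loudly.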
  SHELX and several other tools append a Ctrl-Z / SUB byte (0x1A) to
  .hkl and .cif files they write. CIF 1.1's character set doesn't
  include it, but ucif silently treats it as end-of-file, so real
  files parse cleanly. xcif was tokenising it as a stray unquoted
  value, which pushed the enclosing loop's value count one past a
  multiple of its tag count and triggered a spurious

    loop containing tag '_refln_index_h' has N values,
    not a multiple of its M tags

  error on otherwise well-formed SHELX-style output.

  Tokenizer::next() now checks for 0x1A after skip_whitespace_and_comments,
  consumes the rest of the buffer, and returns TOKEN_EOF. Any content
  after 0x1A is ignored — matching DOS-era semantics and what ucif
  does in practice.
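The effect on a loop's value count can be shown with a toy whitespace tokenizer. This sketch ignores quoting and comments entirely and is not how `Tokenizer::next()` is written; `split_values` and `SUB` are hypothetical names. Like the commit describes, it only looks for 0x1A at the position where a new token would start.

```python
SUB = "\x1a"  # Ctrl-Z, appended by SHELX and some other tools
_WS = " \t\r\n"


def split_values(text):
    """Split `text` into whitespace-separated values, stopping at 0x1A."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        while i < n and text[i] in _WS:
            i += 1  # skip whitespace between tokens
        if i == n or text[i] == SUB:
            break  # treat 0x1A as end-of-file; ignore everything after it
        j = i
        while j < n and text[j] not in _WS:
            j += 1
        tokens.append(text[i:j])
        i = j
    return tokens
```

Without the `SUB` check, the trailing byte would come back as a fifth value in a four-column row, which is the off-by-one that produced the spurious loop error.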
- Add "[skip cache]" to the commit message to tell
  GitHub Actions to not use the build cache in the
  quick tests jobs
- A new cache will be created with the new build
Member

@bkpoon bkpoon left a comment

Code Review

Verdict: Ready to merge as opt-in — a few items should be resolved before the follow-up PR that flips DEFAULT_ENGINE to "xcif".

Because DEFAULT_ENGINE = "ucif" is preserved, merging this PR today ships zero behavior change to the ~120 existing iotbx.cif.reader(...) call sites. The items below mostly apply to the default flip, not to this merge.

Strengths

  • Default preserved (iotbx/cif/__init__.py:41); opt-in via engine="xcif" is the right shipping path for a dual-parser change.
  • Walker docstring on _drive_builder_from_xcif (iotbx/cif/__init__.py:48-81) is load-bearing — explains strict/non-strict semantics, the data_/save_ prefix round-trip to match the builder's [find('_')+1:] logic, and the ordering limitation.
  • Real parity test at xcif/regression/tst_iotbx_cif_reader_parity.py drives engine="xcif" against ucif on 7 inline fixtures plus file fixtures, comparing block names, pair items, loop tag sets and column content, and save-frame structure.
  • Non-trivial correctness fixes bundled in:
    • CIF 1.1-compliant embedded-delimiter handling (xcif/tokenizer.cpp:118-142) — fixes the N-METHYL-PYRIDOXAL-5'-PHOSPHATE monomer-library case.
    • Unterminated-quote / semicolon now throws (previously returned partial values silently).
    • global_ recognition for mon_lib files; DOS 0x1A EOF compat for SHELX output.
    • Locale fix via strtod_l / _strtod_l behind a C-locale handle (xcif/numeric.cpp:12-27), thread-safe static init.
    • Integer parsing via strtol with explicit overflow check (xcif/numeric.cpp:63-80); old hand-rolled loop silently overflowed at INT_MAX. Paired with INT_MAX/INT_MIN tests in tst_numeric.cpp.
    • string_view bounds now assertable in debug (xcif/include/xcif/string_view.h).
  • GIL released during parse/parse_file in xcif/xcif_ext.cpp:83-100; Python walker runs after heavy C++ work with GIL reacquired.

Issues

Critical

None.

Important (should be resolved before flipping the default)

  1. Builder contract fragility with save frames. iotbx/cif/__init__.py:102-103 synthesises the save-frame heading as "save_" + sf.name. This round-trips only because iotbx/cif/builders.py:68-72 strips it back with save_frame_heading[save_frame_heading.find('_')+1:]. Third-party builder objects passed via reader(builder=...) that string-compare the full save_<name> token will silently diverge between engines. At minimum, document the builder contract on the builder= parameter.

  2. Pair/loop source order not preserved by xcif. The walker explicitly emits all pairs before all loops within a block (iotbx/cif/__init__.py:87-94) because xcif::Block stores pairs_ and loops_ in separate vectors. ucif preserves source order. The parity test's _compare_block_like (xcif/regression/tst_iotbx_cif_reader_parity.py:159-178) compares via dict(blk_u._items), so order is discarded and CI will not catch this. block.show() output will change once the default flips, which makes file diffs noisy and can break snapshot tests downstream.

    • Fix (minimum): tighten the parity test to assert list-order equivalence. Then either store original order in xcif and honour it in the walker, or flag this as a known behavioral difference in release notes.
  3. raise_if_errors=False semantics diverge. On parse failure, _drive_builder_from_xcif returns before the walker touches the builder at all (iotbx/cif/__init__.py:67-74) — xcif yields an empty self.model(). ucif accumulates errors while partially populating the model. Callers that only check error_count() == 0 are unaffected, but callers trying to salvage a partial model from a damaged file behave radically differently. Should at least be documented in the engine= docstring.

  4. No docstring on reader.engine. The reader class has no class docstring at all; DEFAULT_ENGINE has a one-line comment but doesn't document runtime-override semantics. Nothing in sphinx/iotbx/iotbx.cif.rst is updated either. For a switch that affects ~120 call sites, this is inadequate.

  5. Zero-copy parse_file path is not used. When file_path= is given, reader reads the file into a Python string (iotbx/cif/__init__.py:127-130) then calls xcif_ext.parse. xcif has a parse_file entry point that memory-maps and parses directly. This is plausibly why the headline speedup is 6× through iotbx.cif.reader but 21× for the raw parser on the same 34 MB file. Suggest dispatching to parse_file when file_path is given and engine is xcif; binary-detection only reads the first few KB so it remains cheap.

  6. Engine validation happens after I/O. iotbx/cif/__init__.py:134-140 validates engine only after the file has been read and binary-detected. A typo'd engine forces needless I/O before ValueError. Easy to move the check to the top of __init__.

  7. DEFAULT_ENGINE thread-safety undocumented. Reader reads DEFAULT_ENGINE at construction time (line 137), so a concurrent test mutating it during another test's parse races. Unlikely given Phenix's spawn/fork model but worth noting in the module docstring. Alternatively, freeze it to "ucif" and expose only the per-call kwarg until the flip — making the flip literally a one-line commit as described.

Minor

  8. _VALID_ENGINES = ("ucif", "xcif"): fine for 2 items; a dict mapping engine → walker factory scales better when a third option lands.
  9. Walker catches (ValueError, RuntimeError) at iotbx/cif/__init__.py:68. If a std::exception subclass ever escapes without a custom translator, it would bypass _xcif_errors and the raise_if_errors=False contract. Defensively broadening the except with a class filter is cheap insurance.
  10. xcif error messages are prefixed with "<source>:line:col: ", whereas ucif's messages are plain. Test updates already accommodate both wordings (iotbx/cif/tests/tst_model_builder.py:34-52, iotbx/pdb/tst_pdb.py:718-728); worth calling out in release notes for anyone grepping logs for error patterns.
  11. .gitignore adds CLAUDE.md / */CLAUDE.md. Tooling-specific patterns like this belong in ~/.gitignore_global or .git/info/exclude rather than in the shared project .gitignore. Trivial nit.
  12. Two-dot diff shows mmtbx/ligands/rdkit_utils.py and mmtbx/validation/validate_ligands.py as deletions; this is a rebase artifact (branch is behind master by 3077f4bc11 and f37c1b9f65). Not a real issue, but rebasing onto master would clean the diff view.
  13. tags = list(xloop.tags) in the walker is a minor allocation, not worth fixing.
  14. xcif/regression/example.cif was modified to add _chem_comp / _entity / _atom_type stubs with matching tst_example_tokenize.py golden-count updates. Verify no other cctbx modules reference this fixture.

Recommendations

  • Keep the merge two-step. Ship this PR as opt-in, then flip the default in a separate one-line PR with a bake-in window. Do not squash — makes any regression cleanly attributable to the flip and trivially revertible.
  • Add a CI matrix job that runs the cctbx test suite under DEFAULT_ENGINE="xcif". The "whole test suite passes" claim is currently author-local and will decay with each new PR that lands. An env-var-honoring path in iotbx/cif/__init__.py is enough.
  • Strengthen the parity test to fail on source-order mismatches (see item 2).
  • Document semantic differences in the engine= docstring: partial-model-on-error, error-wording prefixes, pair/loop source order, file-read path performance characteristics.
  • Consider extracting the walker to iotbx/cif/_xcif_walker.py. It's ~40 lines of non-trivial Python glue in a module that already carries a lot.
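For the CI-matrix recommendation, an env-var-honoring default could be as small as the sketch below. The variable name `IOTBX_CIF_ENGINE` and the helper `default_engine_from_env` are invented for this example and do not exist in iotbx; only `_VALID_ENGINES` and the `"ucif"` fallback come from the PR.

```python
import os

_VALID_ENGINES = ("ucif", "xcif")


def default_engine_from_env(environ=os.environ, var="IOTBX_CIF_ENGINE"):
    """Pick the default engine, letting a CI job override it via the environment.

    Falls back to "ucif" when the (hypothetical) variable is unset, and
    rejects unknown values early rather than at first parse.
    """
    value = environ.get(var, "ucif")
    if value not in _VALID_ENGINES:
        raise ValueError(
            "%s=%r is not one of %r" % (var, value, _VALID_ENGINES))
    return value
```

A matrix job would then export the variable once and run the unmodified suite, while the module sets `DEFAULT_ENGINE = default_engine_from_env()` at import time.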

Assessment

Merging this PR today is low-risk because the default path is untouched. Items 1–4 and the CI-matrix recommendation are the blocking work for the subsequent default-flip PR; 5 is a sizeable perf win left on the table; 6–7 are small-effort hardening.


Review assisted by Claude Code.

Block now records a (kind, index) log of entries in the order they
appeared in the source. The iotbx.cif walker iterates this log so
that str(model) output through engine="xcif" matches the ucif
engine when a block interleaves pair items, loops, and save frames.

The parity regression fixture now interleaves pairs and loops and
asserts list-order equality of block._set (the OrderedSet that
drives block.show() / str(model) serialization), catching
pair/loop interleave mismatches that a separate _items vs .loops
comparison would miss.
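The (kind, index) log can be pictured with a small Python model of a block. The real `xcif::Block` is C++ and its member names differ; `Block`, `add_pair`, `add_loop`, and `walk` here are illustrative only.

```python
class Block(object):
    """Toy block that keeps pairs and loops in separate lists, plus an
    ordered log recording where each entry appeared in the source."""

    def __init__(self):
        self.pairs = []          # [(tag, value), ...]
        self.loops = []          # [(tags, rows), ...]
        self.source_order = []   # [("pair" | "loop", index), ...]

    def add_pair(self, tag, value):
        self.pairs.append((tag, value))
        self.source_order.append(("pair", len(self.pairs) - 1))

    def add_loop(self, tags, rows):
        self.loops.append((tags, rows))
        self.source_order.append(("loop", len(self.loops) - 1))


def walk(block):
    """Yield (kind, first_tag) in source order by replaying the log."""
    for kind, index in block.source_order:
        if kind == "pair":
            yield kind, block.pairs[index][0]
        else:
            yield kind, block.loops[index][0][0]
```

Iterating `pairs` and then `loops` would lose the interleaving; replaying `source_order` preserves it at the cost of one small tuple per entry.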
Move the engine check to the top of reader.__init__. A typo'd engine
name now raises ValueError immediately instead of after smart_open
has already read the file (or raised IOError on a missing path).

Adds tst_iotbx_cif_reader_engine.py to xcif/regression with two
exercises: engine validation short-circuits before any file open,
and both engines parse a trivial fixture.
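A stripped-down sketch of the validate-before-I/O ordering is shown below. `Reader` and its `opener` parameter are simplified stand-ins for `iotbx.cif.reader` and `smart_open`, introduced here so the ordering can be demonstrated without cctbx installed.

```python
_VALID_ENGINES = ("ucif", "xcif")
DEFAULT_ENGINE = "ucif"


class Reader(object):
    """Minimal stand-in that checks `engine` before touching the file."""

    def __init__(self, file_path, engine=None, opener=open):
        if engine is None:
            engine = DEFAULT_ENGINE  # read at construction time
        if engine not in _VALID_ENGINES:
            # Fail fast: no file has been opened or read at this point.
            raise ValueError(
                "engine must be one of %r, got %r" % (_VALID_ENGINES, engine))
        self.engine = engine
        with opener(file_path) as handle:
            self.text = handle.read()
```

Passing a nonexistent path together with a misspelled engine name then raises `ValueError` rather than an I/O error, which is the property the new regression test relies on.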
When the reader is given a plain (uncompressed) file_path and
engine="xcif", dispatch to xcif_ext.parse_file (memory-mapped in C++)
instead of reading the file into a Python string and calling
xcif_ext.parse. Avoids a whole-file str() allocation on the hot path.

Binary detection runs via detect_binary_file.from_initial_block on
the first 1000 bytes, preserving the "Binary file detected" abort.

Compressed files (.gz/.Z/.bz2) and file_object= callers continue to
use the existing read-into-string path; parse_file cannot mmap a
compressed stream or an open file object.

Walker logic factored out of _drive_builder_from_xcif into a shared
_walk_xcif_doc helper; both parse/parse_file wrappers reuse it.

Tests verify file_path + xcif invokes parse_file (not parse) and
that the resulting model matches the input_string path.
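The dispatch condition can be summarised as a small decision function. This is a sketch of the gating logic only: `choose_entry_point` is a hypothetical helper, and detecting compression by file suffix is an assumption, since the commit message lists the extensions but not how the reader recognises them.

```python
COMPRESSED_SUFFIXES = (".gz", ".Z", ".bz2")


def choose_entry_point(engine, file_path=None, file_object=None):
    """Return which parser entry point a reader would use.

    The memory-mapped parse_file path is only taken for engine="xcif"
    with a plain, uncompressed file_path and no open file object.
    """
    if engine != "xcif":
        return "ucif"
    plain_path = (
        file_path is not None
        and file_object is None
        and not file_path.endswith(COMPRESSED_SUFFIXES)
    )
    if plain_path:
        return "xcif.parse_file"
    return "xcif.parse"  # read-into-string fallback
```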
Adds a class docstring to reader covering:
- what ucif and xcif are, and why ucif is the default
- engine= behavioral divergences: error-message prefixes,
  partial-model-on-error (ucif multi-error + partial state, xcif
  single-error + empty model), pair/loop source-order guarantee,
  and the xcif file_path fast-path / compression fallback
- the builder= contract re: the "data_" / "save_" prefix —
  third-party builders comparing full heading tokens must handle
  the prefix, since the default cif_model_builder strips it
- raise_if_errors / strict semantics

Expands the DEFAULT_ENGINE module-level comment to match. No
behavior change.
@olegsobolev
Contributor Author

olegsobolev commented Apr 17, 2026

Thanks for the thorough review — I've addressed all items 1–7 plus a few of the minor ones. Summary, with commit references:

  • Item 2 (source order) — Fixed in bf2e49e. Block now records an ordered (kind, index) log; the iotbx.cif walker iterates it. Parity test tightened to compare block._set as a list, so a regression is caught in CI.
  • Item 5 (zero-copy parse_file) — Fixed in 2a8df9d. engine="xcif" with a plain uncompressed file_path now dispatches to xcif_ext.parse_file (mmap). Compressed files (.gz/.Z/.bz2) and file_object= callers keep the read-into-string path. Binary detection runs via detect_binary_file.from_initial_block on the first 1000 bytes.
  • Item 6 (engine validation) — Fixed in b1528af. Validation moved to the top of reader.__init__; a typo'd engine raises ValueError before any file I/O.
  • Items 1, 4, 7, 10 (docstrings & semantics) — Done in fbe030e. Class docstring on reader covers the ucif/xcif divergences (error-message prefix, partial-model-on-error, source order, fast-path). The builder= doc pins the data_/save_ prefix contract. Module comment clarifies DEFAULT_ENGINE is read at construction time.
  • Item 12 (rebase artifact) — Rebased onto current master; the mmtbx deletion noise is gone.

On item 3 (partial-model-on-error) — I went with "document loudly" rather than unify. The unification route needs a parallel non-throwing C++ API (parse_lenient/parse_file_lenient plus Document::parse_errors_) and would be ~100 lines of new public surface. No current cctbx caller uses raise_if_errors=False to salvage partial state — failures are either raised or error-count-checked and surfaced, not mined. The divergence is now called out explicitly in the engine= docstring so anyone hitting it knows why. Happy to add the lenient variants in a follow-up if you disagree.

Deferred (happy to take them in follow-ups or here if you prefer):

  • Items 8, 9, 11, 14 — all minor.
  • CI matrix job running the suite under DEFAULT_ENGINE="xcif". Best done as its own PR since it touches .github/ infra rather than the library.

Ready for another pass.
Edit: Going to merge as it is now.

Member

@bkpoon bkpoon left a comment

Follow-up Review (delta)

Verified the four new commits against the Important items from my previous review. Scope: only what changed since f5099ea.

Item-by-item verification

  • Item 1 — builder contract re: data_/save_ prefix (fbe030ed): ✅ Docstring under builder= now explicitly states that headings are passed with the prefix and that the default strips-before-first-underscore.
  • Item 2 — pair/loop source order (bf2e49ee): ✅ xcif::Block::source_order_ records insertion order; walker iterates it. The parity test fixture now interleaves pair-loop-pair-loop-pair and _compare_block_like asserts list(blk_u._set) == list(blk_x._set) before value comparison — the pre-fix bug would fail the new test.
  • Item 3 — raise_if_errors=False semantics (doc-only decision): ✅ with one caveat worth noting. The docstring explicitly directs salvage-semantics callers to engine="ucif", which makes the decision defensible. One in-tree caller actually does mine post-parse state: mmtbx/chemical_components/cif_parser.py:259 iterates cm.items() after cif.reader(filename, strict=False, raise_if_errors=False).model(). Unaffected as long as DEFAULT_ENGINE="ucif"; worth remembering when the default flips so that this caller is either (a) migrated to a lenient-xcif variant, or (b) pinned to engine="ucif". The only two other raise_if_errors=False callers in the repo are a unit test and cctbx/web/iotbx_cif_validate.py:28, so the exposure is narrow.
  • Item 4 — reader docstring (fbe030ed): ✅ Class now carries a full docstring covering all kwargs + engine divergences.
  • Item 5 — zero-copy parse_file (2a8df9da): ✅ Fast path gated on engine=="xcif" AND file_object is None AND uncompressed plain path, with binary detection preserved. Walker factored into shared _walk_xcif_doc so both parse and parse_file paths funnel through identical source-order / error-translation logic. Tests monkey-patch xcif_ext.parse / xcif_ext.parse_file to verify dispatch.
  • Item 6 — validate engine= before I/O (b1528af0): ✅ Check moved to right after the file-source assertion. New regression (tst_iotbx_cif_reader_engine.py::exercise_engine_validated_before_io) uses a guaranteed-nonexistent path to prove the check short-circuits before smart_open.
  • Item 7 — DEFAULT_ENGINE thread-safety: ⚠️ Read-at-construction is now documented in the module comment; an explicit "mutating concurrently is a data race" caveat is not. Low risk in practice (GIL + Phenix's fork-based multiprocessing); acceptable as is.

Minor observations (non-blocking)

  • Fast-path error messages for missing files now come from xcif rather than smart_open's Sorry("Cannot find..."). Callers grepping logs for the Sorry wording would see a different string under engine="xcif". Not covered by the docstring's error-prefix note.
  • Per-block list allocations in the walker (pair_tags, pair_values, loops, save_frames) — O(items-per-block), dwarfed by the parse_file win.

Verdict

Ready to merge. All Important items are addressed in substance; item 7 is partially addressed but low-risk for an opt-in engine. The new tests would catch regressions of items 2, 5, and 6. No new bugs introduced by the four follow-up commits.

The mmtbx/chemical_components/cif_parser.py caller noted above and a CI matrix job running the suite under DEFAULT_ENGINE="xcif" are the natural follow-up items alongside flipping the default.


Delta review assisted by Claude Code.

@bkpoon
Member

bkpoon commented Apr 17, 2026

LGTM. One other question: when the switch is made to xcif, does the run function in mmtbx/chemical_components.py need any updates?
