xcif parser now wired to the iotbx.cif.reader #1149
xcif parser now wired to the iotbx.cif.reader #1149olegsobolev wants to merge 25 commits intocctbx:masterfrom
Conversation
…trtod_l
using a static C locale — thread-safe, zero overhead
- Debug prints removed (tst_tokenizer.cpp): 3 leftover cout lines gone
- Missing tests added (tst_numeric.cpp): as_int at INT_MAX/INT_MIN, overflow
above/below both bounds, as_double_with_su with exponent+SU ("1.5e+3(2)" →
(1500.0, 0.2))
- Python test added (tst_bindings.py): column_as_flex_int with a . cell
raises ValueError
- nullptr style (xcif_ext.cpp): two p(0) initializers changed to p(nullptr)
tst_numeric.cpp (3 new tests in §5b):
- test_double_long_value_nan — values ≥32 chars hit the buffer limit and return NaN
- test_su_missing_close_paren — "1.5(" without closing ) returns (1.5, 0.0) gracefully
- test_su_nonnumeric_in_su_content — "1.5(3a5)" accumulates digits, skips 'a' → su=3.5
tst_tokenizer.cpp (3 new tests):
- test_unterminated_single_quote_returns_rest — graceful degradation, no crash
- test_unterminated_double_quote_returns_rest — same for double quotes
- test_null_byte_truncates_unquoted_token — '\0' stops unquoted value at "val"
tst_error_handling.cpp (2 new tests in §5):
- test_unterminated_quoted_string_graceful — parse succeeds, tag is accessible
- test_unclosed_semicolon_field_graceful — parse succeeds, tag is accessible
tst_api_compat.py (new exercise_loop_category):
- Empty list, single tag (with/without dot), multi-tag dot-category, multi-tag underscore-category edge
cases
Root cause: locale_t, newlocale(), and strtod_l() are POSIX extensions — they exist on macOS and Linux but not on Windows/MSVC. MSVC provides equivalent but differently-named APIs: _locale_t, _create_locale(), and _strtod_l(). Fix in numeric.cpp: Replaced the single POSIX-only xcif_c_locale() function with a platform-selected xcif_strtod_c() wrapper using #ifdef _WIN32:
Two related changes, both needed for parsing the cctbx monomer
library (mon_lib_list.cif and friends start with a CIF 1.1
`global_` block).
- Tokenizer: the exact 7-char token `global_` (case-insensitive)
now produces TOKEN_BLOCK_HEADER instead of TOKEN_VALUE. Parser
preserves the full `global_` name (no `data_` prefix strip) and
backs the Block name_ string_view with static storage so it
survives Document copy/move.
- parse()/parse_file() take bool strict=true; xcif_ext exposes it
as a kwarg. With strict=false, pair items or loops appearing
before the first data_ block attach to an implicit `global_`
block, matching ucif's non-strict behaviour.
Tests: new xcif/regression/cpp/tst_non_strict_mode.cpp and
xcif/regression/tst_non_strict_mode.py. tst_error_handling and
all other strict-mode assertions unchanged.
example.cif was enriched in df37a0d ("Adjust mmCIF example to conform dictionary") — gaining extra loops and tags — but the tokenize/parse demo tests were left asserting the old counts.
whitespace or EOF
CIF 1.1 §2.2.7.3 specifies that the closing delimiter of a quoted
string (' for single, " for double) is only treated as a close when
followed by whitespace or end-of-input. A delimiter followed by any
other character is part of the string content.
xcif's tokenizer previously terminated on the first delimiter
unconditionally, breaking real-world values like the cctbx monomer
library's
'N-METHYL-PYRIDOXAL-5'-PHOSPHATE ' (mon_lib_list.cif:1688)
where the inner `'` precedes `-` (non-whitespace). Fix: when
encountering the delimiter, peek at the next char and include the
delimiter in the string unless that next char is whitespace or EOF.
Tests: new xcif/regression/cpp/tst_quoted_string_delimiter.cpp and
xcif/regression/tst_quoted_string_delimiter.py cover plain strings,
embedded delimiters (single and double quotes), multiple embedded
apostrophes, and close-by-tab / close-by-EOL / close-by-EOF.
Add engine="ucif"|"xcif" to iotbx.cif.reader and a module-level
DEFAULT_ENGINE (default "ucif"). Callers can opt in per-instance
with reader(..., engine="xcif").
When engine="xcif", the reader calls xcif_ext.parse() and walks
the resulting Document in Python, driving the existing
cif_model_builder via its add_data_block / add_data_item /
add_loop / start_save_frame / end_save_frame callbacks.
Downstream code still receives mutable iotbx.cif.model.cif
objects, so mutation sites (restraint_file_merge,
pdb_interpretation, as_cif_block writers) are unaffected.
Details:
- Walker wraps both ValueError and RuntimeError from
xcif_ext.parse as CifParserError.
- Heading prefixes ("data_", "save_", literal "global_") are
restored before calling the builder.
- error_count() and show_errors() tolerate self.parser=None.
Test: xcif/regression/tst_iotbx_cif_reader_parity.py parses a
corpus (7 inline fixtures + 3 real files) with both engines
and asserts structural equality of the resulting
iotbx.cif.model. Uses the engine= kwarg directly, so it's safe
under parallel test runners.
CIF 1.1 §2.2.7.4 requires the closing delimiter of a semicolon text field to appear as the first character of a line. xcif's tokenizer previously returned a partial TOKEN_VALUE on EOF and let the parse succeed — silently accepting content where the user may have mistaken a mid-line `;` for a terminator. read_semicolon_field now throws CifError when the input ends before a column-1 `;` is found. The existing "graceful" regression (tst_error_handling.cpp::test_unclosed_semicolon_field) is flipped to assert the raise. New regression fixture xcif/regression/cpp/tst_semicolon_field_strict.cpp and xcif/regression/tst_semicolon_field_strict.py cover: EOF mid-field, EOF on the opening-`;` line, a trailing mid-line `;` mistaken for a terminator (the shape that prompted this fix), plus several well-formed positive cases.
… fixes):
xcif: align strict-mode enforcement with ucif
Four related changes to make xcif behave like ucif when used behind
iotbx.cif.reader — required so iotbx.cif tests that exercise syntax
errors pass under engine="xcif".
- xcif Parser rejects `global_` block headers in strict mode with
CifError. `global_` is a STAR/DDL2 reserved word, not part of
strict CIF 1.1; ucif treats it as a strict-only rejection. The
cctbx monomer library (mon_lib_list.cif and friends) uses
`global_` and loads with strict=false.
- xcif Tokenizer raises CifError on unterminated single- or
double-quoted strings instead of silently returning the partial
content (see the grammar in the CIF 1.1 syntax spec at
https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax).
Two existing regression tests
(test_unterminated_{single,double}_quote_returns_rest and
test_unterminated_quoted_string_graceful) are flipped to assert
the raise.
- iotbx/cif/__init__.py:
- `_drive_builder_from_xcif` takes a strict= argument and passes
it to xcif_ext.parse(strict=). reader(..., strict=False) now
actually relaxes xcif, not just ucif.
- The walker drops `global_` blocks entirely (matching ucif/cif.g:
"global blocks are ignored"), so str(model) of a file with a
leading `global_` plus a data_ block shows only the data_ block,
same as ucif.
- `_drive_builder_from_xcif` now returns a list of error messages
rather than raising, so `reader(raise_if_errors=False)` gets a
populated error_count() instead of an exception — matching
ucif's error-accumulation contract.
- Tests in xcif/regression/tst_non_strict_mode (C++ and Python) that
previously asserted `global_` works in strict mode are updated to
assert the opposite (strict rejects, strict=false accepts), plus a
new test for the strict rejection.
Replace the invented "§2.2.7.3 / §2.2.7.4" references with the actual paragraph numbers from the IUCr CIF 1.1 syntax specification (https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax): - Paragraph 15 — quoted strings may embed their delimiter provided the delimiter is not followed by whitespace; the closing `'`/`"` is only recognised when followed by whitespace or end-of-input. - Paragraph 17 — a semicolon text field is delimited by `;` appearing as the first character of a line.
xcif's parse_loop raised "loop value count is not a multiple of tag
count" with only a source:line:col prefix. For large CIF files with
many similar loops, line number alone is not enough to locate the
bad loop — ucif's "Wrong number of data items for loop containing
_TAG" names the first tag, which is greppable.
Message now includes the first tag of the offending loop plus the
actual value/tag counts, e.g.:
loop containing tag '_atom_site.group_PDB' has 57 values, not a
multiple of its 18 tags
Existing regression (tst_error_handling.cpp) tightened from a bare
"multiple" substring check to assert the tag name and both counts
are present.
Three tests in iotbx/cif/tests/tst_model_builder.py
(test_repeated_loop_1, test_repeated_loop_2, test_repeated_tags)
and one in iotbx/pdb/tst_pdb.py (exercise_pdb_input_error_handling)
asserted the exact error strings emitted by ucif's cif_model_builder
at model-build time. xcif rejects the same malformed inputs at parse
time with different wording, so these tests break when
iotbx.cif.reader runs with engine="xcif".
Replace the strict equality checks with explicit both-wording
assertions — both messages listed verbatim in an `accepted` tuple,
the xcif branch matched with endswith(": " + m) to allow for the
"<source>:line:col: " prefix xcif error translators add.
Both wordings are preserved in the source so a future reader can
see exactly what each parser says, and either parser's wording can
change deliberately with a loud test failure rather than silently
regressing.
SHELX and several other tools append a Ctrl-Z / SUB byte (0x1A) to
.hkl and .cif files they write. CIF 1.1's character set doesn't
include it, but ucif silently treats it as end-of-file, so real
files parse cleanly. xcif was tokenising it as a stray unquoted
value, which pushed the enclosing loop's value count one past a
multiple of its tag count and triggered a spurious
loop containing tag '_refln_index_h' has N values,
not a multiple of its M tags
error on otherwise well-formed SHELX-style output.
Tokenizer::next() now checks for 0x1A after skip_whitespace_and_comments,
consumes the rest of the buffer, and returns TOKEN_EOF. Any content
after 0x1A is ignored — matching DOS-era semantics and what ucif
does in practice.
- Add "[skip cache]" to the commit message to tell GitHub Actions to not use the build cache in the quick tests jobs - A new cache will be created with the new build
bkpoon
left a comment
There was a problem hiding this comment.
Code Review
Verdict: Ready to merge as opt-in — a few items should be resolved before the follow-up PR that flips DEFAULT_ENGINE to "xcif".
Because DEFAULT_ENGINE = "ucif" is preserved, merging this PR today ships zero behavior change to the ~120 existing iotbx.cif.reader(...) call sites. The items below are mostly scoped at the default-flip, not this merge.
Strengths
- Default preserved (
iotbx/cif/__init__.py:41); opt-in viaengine="xcif"is the right shipping path for a dual-parser change. - Walker docstring on
_drive_builder_from_xcif(iotbx/cif/__init__.py:48-81) is load-bearing — explains strict/non-strict semantics, thedata_/save_prefix round-trip to match the builder's[find('_')+1:]logic, and the ordering limitation. - Real parity test at
xcif/regression/tst_iotbx_cif_reader_parity.pydrivesengine="xcif"against ucif on 7 inline fixtures plus file fixtures, comparing block names, pair items, loop tag sets and column content, and save-frame structure. - Non-trivial correctness fixes bundled in:
- CIF 1.1-compliant embedded-delimiter handling (
xcif/tokenizer.cpp:118-142) — fixes theN-METHYL-PYRIDOXAL-5'-PHOSPHATEmonomer-library case. - Unterminated-quote / semicolon now throws (previously returned partial values silently).
global_recognition for mon_lib files; DOS0x1AEOF compat for SHELX output.- Locale fix via
strtod_l/_strtod_lbehind a C-locale handle (xcif/numeric.cpp:12-27), thread-safe static init. - Integer parsing via
strtolwith explicit overflow check (xcif/numeric.cpp:63-80); old hand-rolled loop silently overflowed atINT_MAX. Paired withINT_MAX/INT_MINtests intst_numeric.cpp. string_viewbounds now assertable in debug (xcif/include/xcif/string_view.h).
- CIF 1.1-compliant embedded-delimiter handling (
- GIL released during
parse/parse_fileinxcif/xcif_ext.cpp:83-100; Python walker runs after heavy C++ work with GIL reacquired.
Issues
Critical
None.
Important (should be resolved before flipping the default)
-
Builder contract fragility with save frames.
iotbx/cif/__init__.py:102-103synthesises the save-frame heading as"save_" + sf.name. This round-trips only becauseiotbx/cif/builders.py:68-72strips it back withsave_frame_heading[save_frame_heading.find('_')+1:]. Third-partybuilderobjects passed viareader(builder=...)that string-compare the fullsave_<name>token will silently diverge between engines. At minimum, document the builder contract on thebuilder=parameter. -
Pair/loop source order not preserved by xcif. The walker explicitly emits all pairs before all loops within a block (
iotbx/cif/__init__.py:87-94) becausexcif::Blockstorespairs_andloops_in separate vectors. ucif preserves source order. The parity test's_compare_block_like(xcif/regression/tst_iotbx_cif_reader_parity.py:159-178) compares viadict(blk_u._items), so order is discarded and CI will not catch this.block.show()output will change once the default flips, which makes file diffs noisy and can break snapshot tests downstream.- Fix (minimum): tighten the parity test to assert list-order equivalence. Then either store original order in xcif and honour it in the walker, or flag this as a known behavioral difference in release notes.
-
raise_if_errors=Falsesemantics diverge. On parse failure,_drive_builder_from_xcifreturns before the walker touches the builder at all (iotbx/cif/__init__.py:67-74) — xcif yields an emptyself.model(). ucif accumulates errors while partially populating the model. Callers that only checkerror_count() == 0are unaffected, but callers trying to salvage a partial model from a damaged file behave radically differently. Should at least be documented in theengine=docstring. -
No docstring on
reader.engine. Thereaderclass has no class docstring at all;DEFAULT_ENGINEhas a one-line comment but doesn't document runtime-override semantics. Nothing insphinx/iotbx/iotbx.cif.rstis updated either. For a switch that affects ~120 call sites, this is inadequate. -
Zero-copy
parse_filepath is not used. Whenfile_path=is given, reader reads the file into a Python string (iotbx/cif/__init__.py:127-130) then callsxcif_ext.parse. xcif has aparse_fileentry point that memory-maps and parses directly. This is plausibly why the headline speedup is 6× throughiotbx.cif.readerbut 21× for the raw parser on the same 34 MB file. Suggest dispatching toparse_filewhenfile_pathis given and engine isxcif; binary-detection only reads the first few KB so it remains cheap. -
Engine validation happens after I/O.
iotbx/cif/__init__.py:134-140validatesengineonly after the file has been read and binary-detected. A typo'd engine forces needless I/O beforeValueError. Easy to move the check to the top of__init__. -
DEFAULT_ENGINEthread-safety undocumented. Reader readsDEFAULT_ENGINEat construction time (line 137), so a concurrent test mutating it during another test's parse races. Unlikely given Phenix's spawn/fork model but worth noting in the module docstring. Alternatively, freeze it to"ucif"and expose only the per-call kwarg until the flip — making the flip literally a one-line commit as described.
Minor
_VALID_ENGINES = ("ucif", "xcif"): fine for 2 items; adictmapping engine → walker factory scales better when a third option lands.- Walker catches
(ValueError, RuntimeError)atiotbx/cif/__init__.py:68. If astd::exceptionsubclass ever escapes without a custom translator, it would bypass_xcif_errorsand theraise_if_errors=Falsecontract. Defensively broadening the except with a class filter is cheap insurance. - xcif error messages prefixed
"<source>:line:col: "vs ucif's plain messages. Test updates already accommodate both wordings (iotbx/cif/tests/tst_model_builder.py:34-52,iotbx/pdb/tst_pdb.py:718-728) — worth calling out in release notes for anyone grepping logs for error patterns. .gitignoreaddsCLAUDE.md/*/CLAUDE.md. Tooling-specific patterns like this are better in~/.gitignore_globalor.git/info/excluderather than the shared project.gitignore. Trivial nit.- Two-dot diff shows
mmtbx/ligands/rdkit_utils.pyandmmtbx/validation/validate_ligands.pyas deletions; this is a rebase artifact (branch is behind master by3077f4bc11andf37c1b9f65). Not a real issue, but rebasing onto master would clean the diff view. tags = list(xloop.tags)at walker — minor allocation, not worth fixing.xcif/regression/example.cifwas modified to add_chem_comp/_entity/_atom_typestubs with matchingtst_example_tokenize.pygolden-count updates. Verify no other cctbx modules reference this fixture.
Recommendations
- Keep the merge two-step. Ship this PR as opt-in, then flip the default in a separate one-line PR with a bake-in window. Do not squash — makes any regression cleanly attributable to the flip and trivially revertible.
- Add a CI matrix job that runs the cctbx test suite under
DEFAULT_ENGINE="xcif". The "whole test suite passes" claim is currently author-local and will decay with each new PR that lands. An env-var-honoring path iniotbx/cif/__init__.pyis enough. - Strengthen the parity test to fail on source-order mismatches (see item 2).
- Document semantic differences in the
engine=docstring: partial-model-on-error, error-wording prefixes, pair/loop source order, file-read path performance characteristics. - Consider extracting the walker to
iotbx/cif/_xcif_walker.py. It's ~40 lines of non-trivial Python glue in a module that already carries a lot.
Assessment
Merging this PR today is low-risk because the default path is untouched. Items 1–4 and the CI-matrix recommendation are the blocking work for the subsequent default-flip PR; 5 is a sizeable perf win left on the table; 6–7 are small-effort hardening.
Review assisted by Claude Code.
Block now records a (kind, index) log of entries in the order they appeared in the source. The iotbx.cif walker iterates this log so that str(model) output through engine="xcif" matches the ucif engine when a block interleaves pair items, loops, and save frames. The parity regression fixture now interleaves pairs and loops and asserts list-order equality of block._set (the OrderedSet that drives block.show() / str(model) serialization), catching pair/loop interleave mismatches that a separate _items vs .loops comparison would miss.
Move the engine check to the top of reader.__init__. A typo'd engine name now raises ValueError immediately instead of after smart_open has already read the file (or raised IOError on a missing path). Adds tst_iotbx_cif_reader_engine.py to xcif/regression with two exercises: engine validation short-circuits before any file open, and both engines parse a trivial fixture.
When the reader is given a plain (uncompressed) file_path and engine="xcif", dispatch to xcif_ext.parse_file (memory-mapped in C++) instead of reading the file into a Python string and calling xcif_ext.parse. Avoids a whole-file str() allocation on the hot path. Binary detection runs via detect_binary_file.from_initial_block on the first 1000 bytes, preserving the "Binary file detected" abort. Compressed files (.gz/.Z/.bz2) and file_object= callers continue to use the existing read-into-string path; parse_file cannot mmap a compressed stream or an open file object. Walker logic factored out of _drive_builder_from_xcif into a shared _walk_xcif_doc helper; both parse/parse_file wrappers reuse it. Tests verify file_path + xcif invokes parse_file (not parse) and that the resulting model matches the input_string path.
Adds a class docstring to reader covering: - what ucif and xcif are, and why ucif is the default - engine= behavioral divergences: error-message prefixes, partial-model-on-error (ucif multi-error + partial state, xcif single-error + empty model), pair/loop source-order guarantee, and the xcif file_path fast-path / compression fallback - the builder= contract re: the "data_" / "save_" prefix — third-party builders comparing full heading tokens must handle the prefix, since the default cif_model_builder strips it - raise_if_errors / strict semantics Expands the DEFAULT_ENGINE module-level comment to match. No behavior change.
|
Thanks for the thorough review — I've addressed all items 1–7 plus a few of the minor ones. Summary, with commit references:
On item 3 (partial-model-on-error) — I went with "document loudly" rather than unify. The unification route needs a parallel non-throwing C++ API ( Deferred (happy to take them in follow-ups or here if you prefer):
Ready for another pass. |
bkpoon
left a comment
There was a problem hiding this comment.
Follow-up Review (delta)
Verified the four new commits against the Important items from my previous review. Scope: only what changed since f5099ea.
Item-by-item verification
- Item 1 — builder contract re:
data_/save_prefix (fbe030ed): ✅ Docstring underbuilder=now explicitly states that headings are passed with the prefix and that the default strips-before-first-underscore. - Item 2 — pair/loop source order (
bf2e49ee): ✅xcif::Block::source_order_records insertion order; walker iterates it. The parity test fixture now interleaves pair-loop-pair-loop-pair and_compare_block_likeassertslist(blk_u._set) == list(blk_x._set)before value comparison — the pre-fix bug would fail the new test. - Item 3 —
raise_if_errors=Falsesemantics (doc-only decision): ✅ with one caveat worth noting. The docstring explicitly directs salvage-semantics callers toengine="ucif", which makes the decision defensible. One in-tree caller actually does mine post-parse state:mmtbx/chemical_components/cif_parser.py:259iteratescm.items()aftercif.reader(filename, strict=False, raise_if_errors=False).model(). Unaffected as long asDEFAULT_ENGINE="ucif"; worth remembering when the default flips so that this caller is either (a) migrated to a lenient-xcif variant, or (b) pinned toengine="ucif". The only two otherraise_if_errors=Falsecallers in the repo are a unit test andcctbx/web/iotbx_cif_validate.py:28, so the exposure is narrow. - Item 4 —
readerdocstring (fbe030ed): ✅ Class now carries a full docstring covering all kwargs + engine divergences. - Item 5 — zero-copy
parse_file(2a8df9da): ✅ Fast path gated onengine=="xcif"ANDfile_object is NoneAND uncompressed plain path, with binary detection preserved. Walker factored into shared_walk_xcif_docso bothparseandparse_filepaths funnel through identical source-order / error-translation logic. Tests monkey-patchxcif_ext.parse/xcif_ext.parse_fileto verify dispatch. - Item 6 — validate
engine=before I/O (b1528af0): ✅ Check moved to right after the file-source assertion. New regression (tst_iotbx_cif_reader_engine.py::exercise_engine_validated_before_io) uses a guaranteed-nonexistent path to prove the check short-circuits beforesmart_open. - Item 7 —
DEFAULT_ENGINEthread-safety:⚠️ Read-at-construction is now documented in the module comment; an explicit "mutating concurrently is a data race" caveat is not. Low risk in practice (GIL + Phenix's fork-based multiprocessing); acceptable as is.
Minor observations (non-blocking)
- Fast-path error messages for missing files now come from xcif rather than
smart_open'sSorry("Cannot find..."). Callers grepping logs for theSorrywording would see a different string underengine="xcif". Not covered by the docstring's error-prefix note. - Per-block list allocations in the walker (
pair_tags,pair_values,loops,save_frames) — O(items-per-block), dwarfed by theparse_filewin.
Verdict
Ready to merge. All Important items are addressed in substance; item 7 is partially addressed but low-risk for an opt-in engine. The new tests would catch regressions of items 2, 5, and 6. No new bugs introduced by the four follow-up commits.
The mmtbx/chemical_components/cif_parser.py caller noted above and a CI matrix job running the suite under DEFAULT_ENGINE="xcif" are the natural follow-up items alongside flipping the default.
Delta review assisted by Claude Code.
|
LGTM. One other comment is that when the switch is made to |
xcif — a fast C++ CIF parser — is now wired into
iotbx.cif.readerwith a single global switch iniotbx/cif/__init__.py(the module-levelDEFAULT_ENGINEconstant) plus a per-callengine=kwarg on the reader.DEFAULT_ENGINEstays"ucif"in this PR; callers who want to opt in today can passengine="xcif"explicitly, and a one-line follow-up will flip the default after bake-in.Whole set of tests including Phenix ones are passing with the switch flipped.
Benchmark on a 34 MB mmCIF (
4v9d.cif): ~6× speedup end-to-end throughiotbx.cif.reader, ~21× for the raw parser alone.