v1.0: rewrite parsing backend on pypdfium2 by codereverser · Pull Request #124 · codereverser/casparser

codereverser · 2026-05-18T11:22:52Z

Summary

v1.0 is a full rewrite of the CAS parsing layer:

Backend: pdfminer.six + PyMuPDF → pypdfium2 (Apache-2.0 / BSD-3). casparser is now pure MIT end-to-end — no transitive GPL/AGPL obligations.
Per-issuer parsers: CAMS/KFin DETAILED, CAMS/KFin SUMMARY, NSDL, and CDSL each have a dedicated parser tuned to its template family, replacing the single regex-on-text pipeline.
Drops the mupdf / fast extras and the --force-pdfminer CLI flag (force_pdfminer= kwarg kept as a no-op with DeprecationWarning).
Minimum Python: 3.11.

Why

v0.8 had three NSDL/CDSL regex bugs corrupting holdings (misplaced-UCC-as-folio, space-merged folio+units, dropped NSDL HDFC subaccount on CDSL multi-account statements), plus a MutualFund.fix_float validator miss on Optional[Decimal] aliased fields.
New CAMS/KFin 2026 templates broke the existing SUMMARY and MF-Holdings regexes (12 folios returning 0; zero-balance schemes dropped).
PyMuPDF's GPL/AGPL footprint complicated downstream packaging.

What's in the box

New parsers under casparser/parsers/:
- pageobj.py — shared page-object atom extractor (NSDL/CDSL)
- extract.py — char/line extractor (CAMS/KFin)
- cams_detailed.py, cams_summary.py, nsdl.py, cdsl.py
- detect.py — file-type sniffer; wraps PdfiumError into CASParseError / IncorrectPasswordError
- _classify.py, _isin.py, _investor.py — shared helpers
All v0.8 fields populated: investor_info, folio.PAN/KYC/PANKYC, scheme.isin/amfi/type, scheme.nominees, scheme.valuation.cost.
investor_info is now required on both CASData and NSDLCASData (matches the contract of "every CAS contains an investor block").
New "Fund House" AMC suffix recognised (Zerodha).
CDSL multi-account statements (PAYTM + NEXTBILLION + FINWIZARD + HDFC + ZERODHA on one PDF) parse correctly; DIRECT (non-ARN) distribution-mode rows populate PnL/return.
ISIN/AMFI enrichment via casparser-isin with a direct-ISIN fallback path for templates where multi-line registrar rendering mangles the RTA token.

Bug fixes that landed alongside the rewrite

CAMS SUMMARY valuation.date was mis-parsing to date(201, 1, 1) — column boundary + Pydantic coercion fix.
KFin SUMMARY 2026 zero-balance schemes (HMTOGT, HPREG) no longer dropped.
CAMS SUMMARY 2026 (with new ISIN column) parses again.

Breaking changes

casparser.types.CASData.investor_info: Optional[InvestorInfo] → InvestorInfo (parser raises CASParseError if it can't find the block).
casparser.types.NSDLCASData.investor_info: same change.
casparser.types.NSDLCASData.file_type: Optional[FileType] = None → FileType.
ProcessedCASData and PartialCASData removed from casparser.types (they were internal to the old pipeline).
casparser.process package removed; surviving helpers moved to casparser.parsers._classify and casparser.parsers._isin.
--force-pdfminer / force_pdfminer= is a no-op (emits DeprecationWarning).
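
For callers wondering what "kept as a no-op with DeprecationWarning" means in practice, here is a minimal sketch of that kind of shim. The function name `read_cas_pdf_sketch`, the `None` sentinel, and the placeholder return value are assumptions for illustration; this is not the code in the PR.

```python
import warnings


def read_cas_pdf_sketch(filename, password, force_pdfminer=None):
    """Hypothetical stand-in for the public entry point; parsing is elided."""
    if force_pdfminer is not None:
        # The kwarg is accepted so old call sites keep working, but it is ignored.
        warnings.warn(
            "force_pdfminer is ignored: the pdfminer backend was removed in v1.0",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at this shim
        )
    return {"filename": filename}  # placeholder result, not real CAS data
```

Old call sites that still pass `force_pdfminer=True` keep running and only see a warning; dropping the argument silences it.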

Testing

24/24 unit + integration tests pass (tests/test_pypdfium.py, tests/test_helpers.py, tests/test_gains.py, tests/casparser/test_cli.py).
13/13 production samples (CAMS × 5, KFin × 5, NSDL × 1, CDSL × 2) parse cleanly end-to-end with populated investor_info, ISIN/AMFI/type, PAN/KYC, valuation.cost, nominees.

Test plan

CI green on Python 3.11 / 3.12 / 3.13
casparser CLI on a sample CAMS/KFin/NSDL/CDSL PDF
pip install -U casparser (or uv sync) installs without pulling pdfminer.six / PyMuPDF
CHANGELOG.md and README.md changes reflect the new state

codecov · 2026-05-18T12:07:04Z

Codecov Report

❌ Patch coverage is 96.63592% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.99%. Comparing base (ed208ee) to head (220edcf).

Files with missing lines	Patch %	Lines
casparser/parsers/cams_detailed.py	95.61%	13 Missing ⚠️
casparser/parsers/nsdl.py	97.23%	11 Missing ⚠️
casparser/parsers/pageobj.py	95.09%	11 Missing ⚠️
casparser/parsers/cams_summary.py	96.52%	7 Missing ⚠️
casparser/parsers/extract.py	95.91%	7 Missing ⚠️
casparser/parsers/cdsl.py	98.93%	3 Missing ⚠️
casparser/types.py	86.96%	3 Missing ⚠️
casparser/parsers/__init__.py	96.00%	2 Missing ⚠️
casparser/parsers/_investor.py	96.93%	2 Missing ⚠️
casparser/parsers/_isin.py	88.89%	2 Missing ⚠️
... and 1 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #124      +/-   ##
==========================================
+ Coverage   88.30%   96.99%   +8.70%     
==========================================
  Files          18       19       +1     
  Lines        1469     2391     +922     
==========================================
+ Hits         1297     2319    +1022     
+ Misses        172       72     -100


Rebuilds the parsing layer for v1.0 on top of pypdfium2 (Apache-2.0 / BSD-3) so casparser ships pure MIT end-to-end; the prior pdfminer.six + PyMuPDF dependencies are dropped along with the entire `casparser/process/` regex-tokenisation pipeline they fed.

Engine (casparser/parsers/extract.py, pageobj.py)
=================================================

`extract.py` walks PDF page objects (one atom per text-show op), maps glyphs to their parent atom via PDFium's `FPDFText_GetTextObject`, deduplicates same-font overlapping atoms, then emits `Char`/`Line`/`Page` shaped output that downstream parsers consume.

Atom-level dedup replaces all per-character overlay heuristics: when two atoms share a font, overlap in x by >=50% of the narrower atom's width, and sit 0.05-3.0pt apart in y, we drop the one further from the row's median baseline. That handles the date-twin artefact (same date column rendered twice with a small y-offset, glyphs interleaving by x to produce garbage like `2020 -> 22002200`) without the multi-stage sub-cluster filters earlier prototypes used.

`pageobj.py` exposes the atoms + their column/block grouping that the NSDL/CDSL parsers operate on directly. The same Atom primitive backs the investor extractor.

Per-issuer parsers
==================

- `cams_detailed.py` / `cams_summary.py` consume the Line stream for CAMS + KFin DETAILED and SUMMARY templates.
- `nsdl.py` reads the page-2 account roster, walks per-account holdings sections (equities + mutual funds + corporate bonds in both summary and detailed forms). Section-aware routing handles the case where multiple holding types share the same 18-cell detailed table header by tracking `cur_section` from the preceding marker block. The page-2 roster accepts both the 4-cell (broker + DP/Client joined) and 5-cell (broker, then DP/Client) variants.
- `cdsl.py` mirrors NSDL's structure for the CDSL CAS template.

Types
=====

- Adds `Bond` model with optional coupon_rate / coupon_frequency / maturity_date / face_value / market_price; required fields are isin, num_bonds, value. Surfaces on `DematAccount.bonds`.
- `investor_info` is now required on `CASData` and `NSDLCASData`.

Performance
===========

The dispatcher opens the PDF document exactly once per `read_cas_pdf` call and threads the handle through detect / parser / investor extractor via an `_doc=` kwarg. NSDL/CDSL additionally share the extracted atoms between the holdings parser and the investor extractor.
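
The atom-level dedup rule described under "Engine" can be sketched as below. This is an illustrative reconstruction from the three stated conditions only; the `Atom` shape, the function names, and the tie-breaking choice are assumptions, not the code in `extract.py`.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Atom:
    """Hypothetical atom shape: one text-show op with its font and box."""
    text: str
    font: str
    x0: float
    x1: float
    y: float  # baseline, in points


def is_twin(a: Atom, b: Atom) -> bool:
    """Same font, x-overlap >= 50% of the narrower atom, 0.05-3.0pt apart in y."""
    if a.font != b.font:
        return False
    overlap = min(a.x1, b.x1) - max(a.x0, b.x0)
    narrower = min(a.x1 - a.x0, b.x1 - b.x0)
    if narrower <= 0 or overlap < 0.5 * narrower:
        return False
    return 0.05 <= abs(a.y - b.y) <= 3.0


def dedup_row(atoms: list[Atom]) -> list[Atom]:
    """Drop, from each twin pair, the atom further from the row's median baseline."""
    base = median(a.y for a in atoms)
    dropped: set[int] = set()
    for i, a in enumerate(atoms):
        for j in range(i + 1, len(atoms)):
            if i in dropped or j in dropped:
                continue
            b = atoms[j]
            if is_twin(a, b):
                # On an exact tie the later atom is dropped (an arbitrary choice here).
                dropped.add(i if abs(a.y - base) > abs(b.y - base) else j)
    return [a for k, a in enumerate(atoms) if k not in dropped]
```

Deciding at the atom level means the twin is removed before glyphs are sorted by x, which is where the `22002200`-style interleaving would otherwise appear.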

Replaces the v0.8 pdfminer / PyMuPDF test files with a per-issuer e2e suite plus a focused unit-test layer.

Layout
======

- `tests/conftest.py`: module-scoped fixtures for each fixture PDF (CAMS / KFin / NSDL / CDSL detailed + summary). Each fixture loader skips its dependent tests when the corresponding env var isn't set, so contributors without the encrypted bundle can still run the unit-test portion.
- `tests/_assertions.py`: invariant helpers shared across the e2e suite. Designed to lock in correctness without encoding the real rupee figures from private fixtures.
- `tests/test_cams.py`, `test_kfin.py`, `test_nsdl.py`, `test_cdsl.py`: per-issuer e2e tests.
- `tests/test_errors.py`: error-path + back-compat shim tests.
- `tests/test_demat_units.py`: NSDL/CDSL parser unit tests using synthetic Block/Cell fixtures (no real ISINs, names, or IDs).
- `tests/test_helpers.py`, `tests/test_gains.py`, `tests/test_gains_e2e.py`: existing helper / gains coverage, retained.

Arithmetic invariants
=====================

The e2e tests verify parsing correctness without depending on specific rupee amounts:

- **CAMS / KFin DETAILED**: scheme.close * scheme.valuation.nav == scheme.valuation.value and scheme.open + sum(txn.units) == scheme.close.
- **NSDL / CDSL**: sum(eq.value + mf.value + bd.value) == account.balance per account; mf.balance * mf.nav == mf.value; bond.num_bonds * bond.face_value == bond.value (summary form); bond.num_bonds * bond.market_price == bond.value (detailed form).

These catch column-swap, decimal-parse, anchor-drift, and missed-transaction bugs without encoding portfolio totals in the repo.

Removed
=======

- `tests/test_pdfminer.py`, `tests/test_mupdf.py`, `tests/test_process.py`: backend-specific suites for the v0.8 stack.
- `tests/test_pypdfium.py`, `tests/base.py`: the intermediate single-file test suite is superseded by the per-issuer split.
- `tests/pytest.ini`: empty file masked the pyproject.toml pytest config.
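
A rough sketch of how the two CAMS / KFin DETAILED invariants could be checked without hard-coding any portfolio figure. The helper names and the tolerances are assumptions made for this example; the real helpers live in `tests/_assertions.py` and may look quite different.

```python
from decimal import Decimal


def close_enough(a: Decimal, b: Decimal, tol: Decimal) -> bool:
    """Compare two Decimals within an absolute tolerance."""
    return abs(a - b) <= tol


def check_scheme(open_units, close_units, nav, value, txn_units):
    """Return a list of violated invariants for one scheme (empty list = OK)."""
    problems = []
    # close * nav == value (tolerance of one paisa is an assumption)
    if not close_enough(close_units * nav, value, Decimal("0.01")):
        problems.append("close * nav != value")
    # open + sum(txn.units) == close (tolerance of 0.001 units is an assumption)
    total = open_units + sum(txn_units, Decimal(0))
    if not close_enough(total, close_units, Decimal("0.001")):
        problems.append("open + sum(units) != close")
    return problems
```

Because both checks are relations between parsed fields, a swapped NAV/value column or a dropped transaction fails the check even though no rupee amount from a private fixture appears in the test.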

- `pyproject.toml`: bumps version to 1.0.0, drops pdfminer.six (AGPL-3.0+) and PyMuPDF (GPL-3.0+ / commercial) from runtime deps, replaces with pypdfium2 (Apache-2.0 / BSD-3). Loosens remaining version bounds where compatible (click <10, rich <16, pypdfium2 <7, pydantic <3, etc.) and refreshes the dev-group upper bounds (pytest <10, pytest-cov <8, ipython, coverage).
- Python floor lifts to 3.11 (3.10 EOL anyway).
- `uv.lock`: regenerated against the new dep set.
- `.github/workflows/run-pytest.yml`: switches CI to Python 3.12, decrypts `tests/files.enc` for the encrypted fixture bundle, and exposes the per-fixture env-var matrix to pytest. PyPI publish workflow updated to stop referencing the removed backends.
- `licenses/AGPL-3.0+.txt` + `licenses/GPL-3.0+.txt`: removed; no longer required to redistribute since the GPL/AGPL deps are gone.
- `README.md`: documents the v1.0 backend swap and refreshes external links. `CHANGELOG.md` gets a 1.0.0 section.

v0.9.0 shipped a PyMuPDF-1.25 compatibility fix on top of v0.8's existing backend; v1.0 has already replaced that backend with pypdfium2, so the v0.9 parser patches don't apply. The merge keeps v1.0's parser layer and folds in the v0.9 metadata changes that are still relevant:

- `casparser-isin>=2026.5.1` (DB format v2 with sebi_category / last_seen / ISIN-first lookup priority): adopted.
- `pdfminer.six` and the `mupdf`/`fast` PyMuPDF extras stay removed (1.0.0's pure-pypdfium2 stack).
- `MutualFund.fix_float` aliased-field bug fix (v0.9 patched it on the v0.8 model; v1.0's model already carries the same fix).
- CI matrix: adopt v0.9's `[3.11, 3.12, 3.13]` Python matrix; drop `--all-extras` from `uv sync` (no extras to install any more).
- CHANGELOG keeps the 1.0.0 entry on top and a condensed 0.9.0 entry below it for historical record.

Files v0.9 modified that v1.0 had already deleted are kept deleted:

- casparser/parsers/mupdf.py
- casparser/process/{__init__,cas_detailed,cas_summary,cdsl_statement,nsdl_statement,regex,utils}.py

Tests: 151/151 with private fixtures, 87/87 + 64 skipped without.

GitHub deprecation notice: actions running on Node.js 20 will be forced to Node.js 24 from 2026-06-02. Bump every referenced action to its current Node-24-native major:

- actions/checkout v4 → v6
- actions/setup-python v5 → v6
- astral-sh/setup-uv v5 → v8
- codecov/codecov-action v5 → v6

These are all backward-compatible at the workflow-input level; no input changes required.

The Finance (No. 2) Act 2024 split the FY2024-25 equity-LTCG regime on 23-Jul-2024 (rate 10% -> 12.5%, exemption 1L -> 1.25L). The AY 2025-26 Schedule 112A CSV template added a column 1b, "Share/Unit Transferred", flagging which side of that date each transfer (sale) falls on (BE before / AE on-or-after); casparser's 112A export was missing it.

- `GainEntry112A` gains a `transferred` field (BE/AE from sale date).
- `generate_112a` keys the after-31-Jan-2018-acquired consolidation on the transfer flag, so a fund sold both before and on/after 23-Jul-2024 within FY2024-25 produces one row per side (the utility taxes the two sides at different rates) instead of one ambiguous merged row.
- `generate_112a_csv_data` inserts the `Share/Unit Transferred(1b)` header + value between columns 1a and 2, but only for FY2024-25 and later (`_fy_needs_transfer_col`); older FYs keep the 14-column layout their ITR utility expects.

Columns 2-14 were verified unchanged against the official 112A_115AD_CSV_Instructions.pdf (col 7 = max(8, 9), col 9 = min(FMV, sale) for grandfathered lots, 31-Jan-2018 grandfathering intact).

Tests: 6 new unit tests in TestSchedule112A covering the transfer flag, the FY gate, the cross-cutoff consolidation split, and CSV header/column placement for both new and legacy FYs.
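
The BE/AE flag and the per-side consolidation described above can be illustrated with a small sketch. The function names, the integer FY argument, and the tuple-based sale records are assumptions for this example and do not mirror the signatures of `generate_112a` or `_fy_needs_transfer_col`.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal

CUTOFF = date(2024, 7, 23)  # regime split date stated in the commit message


def transfer_flag(sale_date: date) -> str:
    """BE = sold before 23-Jul-2024, AE = sold on or after it."""
    return "BE" if sale_date < CUTOFF else "AE"


def needs_transfer_col(fy_start_year: int) -> bool:
    """Column 1b exists only for FY2024-25 and later (start year 2024 onwards)."""
    return fy_start_year >= 2024


def consolidate(sales):
    """Sum sale values per (scheme, flag) so each side of the cutoff gets its own row.

    `sales` is an iterable of (scheme_key, sale_date, sale_value) tuples;
    the scheme key here is a placeholder, not a real ISIN.
    """
    rows = defaultdict(Decimal)
    for scheme, sold_on, value in sales:
        rows[(scheme, transfer_flag(sold_on))] += value
    return dict(rows)
```

Keying the consolidation on `(scheme, flag)` rather than on the scheme alone is what keeps a fund sold on both sides of the cutoff from collapsing into a single row.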

The Cost Inflation Index table had FY2024-25 as 365; the CBDT-notified value is 363 (Notification 44/2024). The wrong value slightly mis-indexed debt-fund LTCG cost of acquisition for FY2024-25 sales. Also adds FY2025-26 = 376 (Notification 70/2025, applicable AY 2026-27 onward). All earlier years (FY2001-02..FY2023-24) verified correct against the CBDT table — unchanged. Test asserts the three most recent notified values.
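
To show the size of the mis-indexing, here is a small sketch using the usual indexation formula, indexed cost = cost × CII(sale year) / CII(purchase year). The table below holds only the two values named above plus the base year of the current series (FY2001-02 = 100); the dictionary and function names are hypothetical and unrelated to casparser's internal table.

```python
from decimal import Decimal

# Partial table for illustration only; keys are financial years.
CII = {
    "FY2001-02": 100,  # base year of the current CII series
    "FY2024-25": 363,  # corrected value (the table previously had 365)
    "FY2025-26": 376,
}


def indexed_cost(cost: Decimal, fy_purchase: str, fy_sale: str, table=CII) -> Decimal:
    """Scale the acquisition cost by the ratio of sale-year to purchase-year CII."""
    return cost * table[fy_sale] / table[fy_purchase]


correct = indexed_cost(Decimal("1000"), "FY2001-02", "FY2024-25")
wrong = indexed_cost(Decimal("1000"), "FY2001-02", "FY2024-25", {**CII, "FY2024-25": 365})
overstatement = wrong - correct
```

For a ₹1,000 cost acquired in the base year, the old value indexes to ₹3,650 instead of ₹3,630. The relative error is 365/363 − 1, about 0.55%, whatever the purchase year.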

A third CDSL "MUTUAL FUND UNITS HELD" row variant exists in the wild: a distribution-mode column (ARN/DIRECT) is present but there is NO separate "invested / total cost" column, so the row carries only three value columns: units | NAV | current-value.

`_parse_mf_holdings_row` assumed >= 4 numerics whenever a distribution-mode column was detected, assigning the third numeric to `invested` and defaulting `value` to 0. That zeroed the holding's current value (and, for single-MF folio accounts, the whole MF-Folios balance).

Fix: when only three numerics survive, treat the third as the current value (a holdings statement always prints current value; the cost column is the optional one) and leave `total_cost` None. The >= 4 numerics path is unchanged.

Found via a batch run over a third-party corpus of 193 CAS PDFs: 161 valid CAS files parsed with zero crashes; this was the one material correctness gap surfaced by the sum(holdings)==balance invariant. Two new regression tests cover the full and reduced distribution-mode templates.
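
The column-assignment rule from the fix can be sketched as follows. This covers only rows where a distribution-mode column was detected; the function name, the dict output, and the assumption that the fourth numeric is the current value in the full template are illustrative and not taken from `_parse_mf_holdings_row`.

```python
from decimal import Decimal


def assign_mf_numerics(numerics: list[Decimal]) -> dict:
    """Map the numeric cells of a distribution-mode row onto holding fields."""
    if len(numerics) >= 4:
        # Full template (assumed order): units | NAV | invested | current value
        units, nav, invested, value = numerics[:4]
        return {"units": units, "nav": nav, "total_cost": invested, "value": value}
    if len(numerics) == 3:
        # Reduced template: units | NAV | current value, with no cost column
        units, nav, value = numerics
        return {"units": units, "nav": nav, "total_cost": None, "value": value}
    raise ValueError(f"expected at least 3 numeric cells, got {len(numerics)}")
```

Leaving `total_cost` as `None` rather than 0 keeps "cost not printed" distinguishable from "cost is zero" for downstream consumers.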

CDSL occasionally prints sub-unit balances without the leading zero (e.g. a 0.196-unit balance renders as .196). The shared NSDL/CDSL NUMERIC_RE required at least one digit (or comma) before the decimal point, so those cells were classified as text and the row layout silently shifted, producing a Σholdings ≠ balance mismatch on affected accounts.

Add a leading-dot alternative to the regex in both casparser/parsers/cdsl.py and casparser/parsers/nsdl.py and extend the existing _looks_numeric unit tests to cover .196 / -.5 / 0.196 (positive cases) plus naked '.' / '-' (still rejected).

Verified on a private CDSL fixture: the affected account's Σholdings invariant moves from fail to pass with no other change.
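
A minimal sketch of a numeric-cell pattern with a leading-dot alternative. The exact NUMERIC_RE in the parsers is not shown in this PR text, so the pattern and the `looks_numeric` helper below are assumptions that merely satisfy the cases listed above.

```python
import re

# First alternative: digits (with optional thousands commas) and an optional fraction.
# Second alternative: a bare fraction such as ".196" with no leading zero.
NUMERIC_SKETCH_RE = re.compile(r"^[-+]?(?:\d[\d,]*(?:\.\d+)?|\.\d+)$")


def looks_numeric(cell: str) -> bool:
    """True if the cell text is a plain number, including leading-dot decimals."""
    return bool(NUMERIC_SKETCH_RE.match(cell.strip()))
```

Requiring at least one digit after the dot in the second alternative is what keeps a naked `.` rejected, and the sign alone never matches because both alternatives need a digit.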

…ismatch

Some KFin templates print '... - Reversed' rows (notably the Franklin wound-up debt schemes' 'Payment - Units Extinguished-Reversed' entries) with cosmetic parentheses around the units value even though the semantic sign is the opposite of the original. The parser's _decimal() helper treats parens as negative, so these rows landed with the wrong sign and the open + Σunits == close invariant broke on the parent scheme by exactly twice the reversed-row units.

Add a post-parse running-balance validator to cams_detailed.parse: for each scheme, walk the transactions and check whether prev_balance + units == balance (within tolerance). When the parsed sign disagrees but prev_balance - units == balance, flip units (and the matching amount) and reclassify via get_transaction_type so the type label tracks the corrected sign. Rows without units (STT/Stamp/TDS/MISC) or without a parsed balance are skipped; rows whose printed balance matches neither sign are left untouched.

This is a generic safety net for cosmetic-parens sign mis-parses on any AMC / template, not just the Franklin case. Covered by four new TestBalanceSignFix unit tests in tests/test_helpers.py plus the private KFin batch (open+Σunits invariant moves to 100% pass).
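
The running-balance validator can be sketched roughly as below, assuming transactions are plain dicts with `units`, `amount`, and `balance` keys and a tolerance of 0.001 units. Those shapes, the tolerance, and the handling of rows with units but no balance are assumptions; the reclassification step via `get_transaction_type` is omitted.

```python
from decimal import Decimal

TOL = Decimal("0.001")  # assumed tolerance, in units


def fix_signs(open_balance: Decimal, txns: list[dict]) -> list[dict]:
    """Flip units/amount in place when only the opposite sign matches the printed balance."""
    prev = open_balance
    for txn in txns:
        units, balance = txn.get("units"), txn.get("balance")
        if units is None:
            continue  # STT / stamp / TDS style rows carry no units
        if balance is None:
            prev += units  # nothing to validate against; trust the parsed sign
            continue
        matches_as_parsed = abs(prev + units - balance) <= TOL
        matches_flipped = abs(prev - units - balance) <= TOL
        if not matches_as_parsed and matches_flipped:
            txn["units"] = -units
            if txn.get("amount") is not None:
                txn["amount"] = -txn["amount"]
        # Rows matching neither sign are left untouched; the printed balance
        # still becomes the reference for the next row.
        prev = balance
    return txns
```

Using the printed balance as the next row's reference, rather than a recomputed one, keeps a single unfixable row from cascading into false flips further down the scheme.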

Add a 'Supported inputs' section before Installation that lists the four recognised issuers (CAMS / KFintech / NSDL / CDSL) and their statement variants, then a 'Known unsupported inputs' subsection covering the three classes that surfaced during the v1.0 batch testing:

- Re-printed PDFs (Microsoft Print to PDF, browser save-as-PDF, macOS print preview, etc.) - the watermark gets flattened into a bitmap and the original generator metadata is wiped, so the detector can no longer prove what it's looking at.
- MF Central statements - different template / generator, out of scope for v1.0.
- Third-party-reformatted statements - same reason as re-prints.

Users who hit these flows should keep the original issuer-delivered PDF alongside any redistributed copy and feed the original to casparser.

…osals

In FIFOUnits.sell, when a buy lot is partially consumed by a sale, the lot was re-queued onto the FIFO deque with the full original purchase_tax, not the unallocated remainder. The next disposal from that lot would then run the proportional allocation against the full original stamp again, re-claiming a slice of stamp that had already been allocated to an earlier disposal.

Worked example: 300-unit lot with ₹1.25 stamp paid, consumed across three 100-unit disposals.

  Disposal 1: round(1.25 × 100/300, 2) = ₹0.42 → lot re-queued (200, 1.25)
  Disposal 2: round(1.25 × 100/200, 2) = ₹0.62 → lot re-queued (100, 1.25)
  Disposal 3: round(1.25 × 100/100, 2) = ₹1.25
  Total claimed: ₹2.29 vs ₹1.25 paid (~83% over-claim)

Re-queue with (purchase_tax - stamp_duty) so the lot carries only the unallocated remainder. The rounding residual gets absorbed into the final partial consumption, so the invariant sum(stamp_claimed) == stamp_paid holds exactly across any number of splits.

Section 48 only permits deducting actual stamp paid as a transfer-related expense, so the prior behaviour over-stated the deduction on Schedule 112A. The magnitude is small on a per-lot basis (MF stamp is capped at ~₹1.25 per SIP since Jul 2020), but compounds across hundreds of disposals per FY and grows worse with split depth.

Surfaced by a cross-engine comparison with folioman's FIFO implementation, which has carried the correct proportional-with-reduction logic all along.

Add tests/test_gains.py::TestGainsClass::test_stamp_duty_split_lot_does_not_exceed_paid to lock the invariant.
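
The worked example can be replayed with a small single-lot simulation. This is a sketch under stated assumptions: the function name, the `carry_remainder` switch, and the use of `Decimal` with half-even rounding are choices made here to reproduce the figures above, and the real `FIFOUnits.sell` handles multiple lots and other fields.

```python
from collections import deque
from decimal import ROUND_HALF_EVEN, Decimal

PAISA = Decimal("0.01")


def stamp_claims(lot_units, lot_stamp, sales, carry_remainder=True):
    """Return the stamp duty claimed by each sale drawn from a single buy lot.

    carry_remainder=False replays the old behaviour (re-queue with the full stamp);
    carry_remainder=True re-queues with only the unallocated remainder.
    """
    lots = deque([(Decimal(lot_units), Decimal(lot_stamp))])
    claims = []
    for sale in sales:
        units, stamp = lots.popleft()
        sold = Decimal(sale)
        # Proportional share of the stamp currently carried by the lot.
        share = (stamp * sold / units).quantize(PAISA, rounding=ROUND_HALF_EVEN)
        claims.append(share)
        if sold < units:
            carried = stamp - share if carry_remainder else stamp
            lots.appendleft((units - sold, carried))
    return claims


old = stamp_claims(300, "1.25", [100, 100, 100], carry_remainder=False)
new = stamp_claims(300, "1.25", [100, 100, 100], carry_remainder=True)
```

With the remainder carried, the three claims come to ₹0.42, ₹0.42 and ₹0.41, which sum to the ₹1.25 actually paid. The last disposal takes whatever is left, so rounding drift cannot accumulate.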

Some CAS generators insert a U+00AD soft hyphen at the point where a long token wraps across two display lines. For a 12-char ISIN this produced e.g. INF179K01<shy>WN9, which matched neither the anchored ISIN regex nor its leftover fragment, so the holding row was silently dropped (CDSL issue #127).

Handle it at the cell-join layer in pageobj._cells_from_block_atoms, the single chokepoint all NSDL/CDSL holdings parsing flows through. New _join_column_atoms() helper:

- a fragment ENDING with a soft hyphen is a continuation marker: the next fragment is spliced on with no separator and the hyphen dropped (reconstructs INF179K01<shy> + WN9 -> INF179K01WN9);
- any remaining embedded soft hyphen is stripped, so a single-atom INF179K01<shy>WN9 normalises the same way;
- cells without a soft hyphen are newline-joined exactly as before.

Add TestSoftHyphen (5 cases): embedded single-atom, two-atom split, chained 3-fragment split, no-op for normal multi-line cells, and an end-to-end check through _cells_from_block_atoms asserting the reconstructed string matches INF_ISIN_RE.

Scope: reconstruction works when the wrapped fragments fall in the same row block (the normal-scale case). ISINs split across blocks by a non-standard page scale (e.g. a statement re-printed/re-saved at a larger scale) are not covered.

Full suite stays green (171 passed); the private 193-PDF batch is unchanged (all arithmetic invariants still 100% pass).
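
The three join rules can be sketched as a short helper. The function below is an illustrative reconstruction that takes a list of fragment strings; the real `_join_column_atoms()` operates on atoms, and the ISIN pattern used in the check is a stand-in, not the project's `INF_ISIN_RE`.

```python
import re

SOFT_HYPHEN = "\u00ad"
ISIN_SKETCH_RE = re.compile(r"^INF[A-Z0-9]{9}$")  # hypothetical stand-in pattern


def join_column_atoms(fragments: list[str]) -> str:
    """Join one cell's fragments, splicing across trailing soft hyphens."""
    parts: list[str] = []
    glue_next = False
    for fragment in fragments:
        ends_with_shy = fragment.endswith(SOFT_HYPHEN)
        clean = fragment.replace(SOFT_HYPHEN, "")  # also strips embedded soft hyphens
        if glue_next and parts:
            parts[-1] += clean  # continuation: no separator
        else:
            parts.append(clean)
        glue_next = ends_with_shy
    return "\n".join(parts)  # unaffected cells are newline-joined as before
```

For example, `join_column_atoms(["INF179K01\u00ad", "WN9"])` yields `INF179K01WN9`, while an ordinary two-line cell keeps its newline.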

main carried two CDSL fixes to casparser/process/cdsl_statement.py (soft-hyphen ISIN reconstruction #129, and an IDCW terminator regex refactor). v1.0 removed the entire casparser/process/ package and reimplemented CDSL parsing structurally under casparser/parsers/, so the only conflict was a modify/delete on that file.

Resolved by keeping the deletion; both fixes are already covered in v1.0:

- the soft-hyphen ISIN reconstruction is implemented for the new structural parser in casparser/parsers/pageobj.py (837f5e6);
- v1.0's CDSL parser is immune to the IDCW premature-termination bug by design (mode-based section state, no content-token terminators).

- Drop rotated-watermark text at extraction (extract.py + pageobj.py): fixes garbage RTA / polluted scheme names on all four issuers.
- CAMS detailed: reconstruct wrapped "(Non Demat)" headers, giving correct RTA + ISIN and no phantom schemes from continuation lines.
- CAMS/KFin summary: stop the last scheme swallowing the Total row and trailing disclaimers.
- Tests: scheme-name / RTA invariants + soft-hyphen reconstruction.

Verified on a 217-file private corpus: RTA garbage 78->0, all arithmetic invariants 100%, repo suite 173 passed.

- read_cas_pdf now closes the PdfDocument it opens (try/finally), so pdfium no longer warns "objects still open" at teardown.
- CDSL MF holdings: splice a wrapped "<digits>/<digits>" folio tail back onto the head (e.g. "910121125" | "82/0" -> 91012112582/0) instead of truncating the folio and swallowing the tail as the mode column.

Regression tests added for both. Full suite green (175); batch invariants 100%.
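
The folio-tail splice can be illustrated with a small sketch over a row's leading cells. The function name, the list-of-strings row shape, and the "all-digit head followed by a digits/digits tail" test are assumptions for this example, not the logic in `cdsl.py`.

```python
import re

FOLIO_TAIL_RE = re.compile(r"^\d+/\d+$")  # e.g. "82/0"


def splice_folio_tail(cells: list[str]) -> list[str]:
    """Merge a wrapped '<digits>/<digits>' tail back onto an all-digit folio head."""
    if len(cells) >= 2 and cells[0].isdigit() and FOLIO_TAIL_RE.match(cells[1]):
        return [cells[0] + cells[1], *cells[2:]]
    return list(cells)
```

With the example from the commit message, `splice_folio_tail(["910121125", "82/0", "DIRECT"])` returns `["91012112582/0", "DIRECT"]`, so the mode column stays in its own cell.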

codereverser added 3 commits May 20, 2026 11:58

codereverser force-pushed the feature/v1.0 branch from cd21b91 to 01e87c1 Compare May 20, 2026 02:02

codereverser added 2 commits May 22, 2026 15:34

codereverser force-pushed the feature/v1.0 branch from b41c14a to 09773ac Compare May 22, 2026 05:41

codereverser changed the title ~~v1.0: rewrite parsing backend on pypdfium2; full feature parity with v0.8~~ v1.0: rewrite parsing backend on pypdfium2 May 22, 2026

codereverser added 11 commits May 28, 2026 11:10

codereverser wants to merge 16 commits into main from feature/v1.0
