v1.0: rewrite parsing backend on pypdfium2 #124

Open: codereverser wants to merge 16 commits into main from feature/v1.0

@codereverser codereverser commented May 18, 2026

Summary

v1.0 is a full rewrite of the CAS parsing layer:

  • Backend: pdfminer.six + PyMuPDF → pypdfium2 (Apache-2.0 / BSD-3). casparser is now pure MIT end-to-end — no transitive GPL/AGPL obligations.
  • Per-issuer parsers: CAMS/KFin DETAILED, CAMS/KFin SUMMARY, NSDL, and CDSL each have a dedicated parser tuned to its template family, replacing the single regex-on-text pipeline.
  • Drops the mupdf / fast extras and the --force-pdfminer CLI flag (force_pdfminer= kwarg kept as a no-op with DeprecationWarning).
  • Minimum Python: 3.11.

Why

  • v0.8 had three NSDL/CDSL regex bugs corrupting holdings (misplaced-UCC-as-folio, space-merged folio+units, dropped NSDL HDFC subaccount on CDSL multi-account statements), plus a MutualFund.fix_float validator miss on Optional[Decimal] aliased fields.
  • New CAMS/KFin 2026 templates broke the existing SUMMARY and MF-Holdings regexes (12 folios returning 0; zero-balance schemes dropped).
  • PyMuPDF's GPL/AGPL footprint complicated downstream packaging.

What's in the box

  • New parsers under casparser/parsers/:
    • pageobj.py — shared page-object atom extractor (NSDL/CDSL)
    • extract.py — char/line extractor (CAMS/KFin)
    • cams_detailed.py, cams_summary.py, nsdl.py, cdsl.py
    • detect.py — file-type sniffer; wraps PdfiumError into CASParseError / IncorrectPasswordError
    • _classify.py, _isin.py, _investor.py — shared helpers
  • All v0.8 fields populated: investor_info, folio.PAN/KYC/PANKYC, scheme.isin/amfi/type, scheme.nominees, scheme.valuation.cost.
  • investor_info is now required on both CASData and NSDLCASData (matches the contract of "every CAS contains an investor block").
  • New "Fund House" AMC suffix recognised (Zerodha).
  • CDSL multi-account statements (PAYTM + NEXTBILLION + FINWIZARD + HDFC + ZERODHA on one PDF) parse correctly; DIRECT (non-ARN) distribution-mode rows populate PnL/return.
  • ISIN/AMFI enrichment via casparser-isin with a direct-ISIN fallback path for templates where multi-line registrar rendering mangles the RTA token.

Bug fixes that landed alongside the rewrite

  • CAMS SUMMARY valuation.date was mis-parsing to date(201, 1, 1) — column boundary + Pydantic coercion fix.
  • KFin SUMMARY 2026 zero-balance schemes (HMTOGT, HPREG) no longer dropped.
  • CAMS SUMMARY 2026 (with new ISIN column) parses again.

Breaking changes

  • casparser.types.CASData.investor_info: Optional[InvestorInfo] → InvestorInfo (parser raises CASParseError if it can't find the block).
  • casparser.types.NSDLCASData.investor_info: same change.
  • casparser.types.NSDLCASData.file_type: Optional[FileType] = None → FileType.
  • ProcessedCASData and PartialCASData removed from casparser.types (they were internal to the old pipeline).
  • casparser.process package removed; surviving helpers moved to casparser.parsers._classify and casparser.parsers._isin.
  • --force-pdfminer / force_pdfminer= is a no-op (emits DeprecationWarning).
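
The no-op-with-warning behaviour of `force_pdfminer=` follows a common back-compat shim pattern. The sketch below is only an illustration of that pattern: `read_cas_pdf_sketch`, its `None` sentinel default, and its placeholder return value are assumptions, not casparser's actual signature or implementation.

```python
import warnings


def read_cas_pdf_sketch(path, password, force_pdfminer=None):
    """Hypothetical stand-in for a reader that keeps a removed kwarg alive."""
    if force_pdfminer is not None:
        # The kwarg is accepted so old call sites keep working, but it no
        # longer selects a backend; callers are told to stop passing it.
        warnings.warn(
            "force_pdfminer is ignored and will be removed in a future release",
            DeprecationWarning,
            stacklevel=2,
        )
    # Placeholder result; a real reader would parse the PDF here.
    return {"path": path}
```

Using a `None` sentinel rather than a `False` default lets the shim warn on any explicit use of the argument, including `force_pdfminer=False`.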

Testing

  • 24/24 unit + integration tests pass (tests/test_pypdfium.py, tests/test_helpers.py, tests/test_gains.py, tests/casparser/test_cli.py).
  • 13/13 production samples (CAMS × 5, KFin × 5, NSDL × 1, CDSL × 2) parse cleanly end-to-end with populated investor_info, ISIN/AMFI/type, PAN/KYC, valuation.cost, nominees.

Test plan

  • CI green on Python 3.11 / 3.12 / 3.13
  • casparser CLI on a sample CAMS/KFin/NSDL/CDSL PDF
  • pip install -U casparser (or uv sync) installs without pulling pdfminer.six / PyMuPDF
  • CHANGELOG.md and README.md changes reflect the new state

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 96.63592% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.99%. Comparing base (ed208ee) to head (220edcf).

Files with missing lines               Patch %   Lines
casparser/parsers/cams_detailed.py     95.61%    13 Missing ⚠️
casparser/parsers/nsdl.py              97.23%    11 Missing ⚠️
casparser/parsers/pageobj.py           95.09%    11 Missing ⚠️
casparser/parsers/cams_summary.py      96.52%    7 Missing ⚠️
casparser/parsers/extract.py           95.91%    7 Missing ⚠️
casparser/parsers/cdsl.py              98.93%    3 Missing ⚠️
casparser/types.py                     86.96%    3 Missing ⚠️
casparser/parsers/__init__.py          96.00%    2 Missing ⚠️
casparser/parsers/_investor.py         96.93%    2 Missing ⚠️
casparser/parsers/_isin.py             88.89%    2 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #124      +/-   ##
==========================================
+ Coverage   88.30%   96.99%   +8.70%     
==========================================
  Files          18       19       +1     
  Lines        1469     2391     +922     
==========================================
+ Hits         1297     2319    +1022     
+ Misses        172       72     -100     


Rebuilds the parsing layer for v1.0 on top of pypdfium2 (Apache-2.0 /
BSD-3) so casparser ships pure MIT end-to-end; the prior pdfminer.six
+ PyMuPDF dependencies are dropped along with the entire
`casparser/process/` regex-tokenisation pipeline they fed.

Engine (casparser/parsers/extract.py, pageobj.py)
=================================================

`extract.py` walks PDF page objects (one atom per text-show op),
maps glyphs to their parent atom via PDFium's
`FPDFText_GetTextObject`, deduplicates same-font overlapping atoms,
then emits `Char`/`Line`/`Page` shaped output that downstream
parsers consume. Atom-level dedup replaces all per-character
overlay heuristics: when two atoms share a font, x-overlap by
>=50% of the narrower atom's width, and sit 0.05-3.0pt apart in
y, we drop the one further from the row's median baseline. That
handles the date-twin artefact (same date column rendered twice
with a small y-offset, glyphs interleaving by x to produce garbage
like `2020 -> 22002200`) without the multi-stage sub-cluster
filters earlier prototypes used.
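
The atom-level dedup rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in `extract.py`: the `Atom` fields, the helper names, and the assumption that atoms arrive already grouped into one visual row are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Atom:
    text: str
    font: str
    x0: float  # left edge, in points
    x1: float  # right edge, in points
    y: float   # baseline, in points


def x_overlap_ratio(a: Atom, b: Atom) -> float:
    """Horizontal overlap as a fraction of the narrower atom's width."""
    overlap = min(a.x1, b.x1) - max(a.x0, b.x0)
    narrower = min(a.x1 - a.x0, b.x1 - b.x0)
    if overlap <= 0 or narrower <= 0:
        return 0.0
    return overlap / narrower


def dedup_row(atoms: list[Atom]) -> list[Atom]:
    """Drop the twin that sits further from the row's median baseline."""
    baseline = median(a.y for a in atoms)
    dropped: set[int] = set()
    for i, a in enumerate(atoms):
        for j in range(i + 1, len(atoms)):
            b = atoms[j]
            if i in dropped or j in dropped or a.font != b.font:
                continue
            if not 0.05 <= abs(a.y - b.y) <= 3.0:
                continue
            if x_overlap_ratio(a, b) < 0.5:
                continue
            # Keep whichever twin is closer to the baseline (ties keep `a`).
            loser = i if abs(a.y - baseline) > abs(b.y - baseline) else j
            dropped.add(loser)
    return [a for k, a in enumerate(atoms) if k not in dropped]
```

Because whole atoms are compared, a duplicated date column disappears as a unit before glyphs are ordered by x, which is what prevents interleaved output.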

`pageobj.py` exposes the atoms + their column/block grouping that
the NSDL/CDSL parsers operate on directly. The same Atom primitive
backs the investor extractor.

Per-issuer parsers
==================

- `cams_detailed.py` / `cams_summary.py` consume the Line stream
  for CAMS + KFin DETAILED and SUMMARY templates.
- `nsdl.py` reads the page-2 account roster, walks per-account
  holdings sections (equities + mutual funds + corporate bonds in
  both summary and detailed forms). Section-aware routing handles
  the case where multiple holding types share the same 18-cell
  detailed table header by tracking `cur_section` from the
  preceding marker block. The page-2 roster accepts both the
  4-cell (broker + DP/Client joined) and 5-cell (broker, then
  DP/Client) variants.
- `cdsl.py` mirrors NSDL's structure for the CDSL CAS template.
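
The section-aware routing idea (track `cur_section` from the preceding marker block, then route rows that share a table header) can be sketched as below. The marker strings, section names, and the list-of-cells block shape are illustrative assumptions rather than the structures `nsdl.py` actually uses.

```python
# Hypothetical marker text -> section name mapping.
SECTION_MARKERS = {
    "EQUITIES": "equities",
    "MUTUAL FUNDS": "mutual_funds",
    "CORPORATE BONDS": "bonds",
}


def route_rows(blocks: list[list[str]]) -> dict[str, list[list[str]]]:
    """Assign each table row to the section named by the last marker seen."""
    routed: dict[str, list[list[str]]] = {
        name: [] for name in SECTION_MARKERS.values()
    }
    cur_section = None
    for cells in blocks:
        marker = SECTION_MARKERS.get(" ".join(cells).strip().upper())
        if marker is not None:
            # A marker block switches the routing target; it holds no data.
            cur_section = marker
            continue
        if cur_section is not None:
            routed[cur_section].append(cells)
    return routed
```

Rows seen before any marker are ignored in this sketch; a real parser would more likely treat that as a layout error.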

Types
=====

- Adds `Bond` model with optional coupon_rate / coupon_frequency /
  maturity_date / face_value / market_price; required fields are
  isin, num_bonds, value. Surfaces on `DematAccount.bonds`.
- `investor_info` is now required on `CASData` and `NSDLCASData`.

Performance
===========

The dispatcher opens the PDF document exactly once per
`read_cas_pdf` call and threads the handle through detect /
parser / investor extractor via an `_doc=` kwarg. NSDL/CDSL
additionally share the extracted atoms between the holdings
parser and the investor extractor.
Replaces the v0.8 pdfminer / PyMuPDF test files with a per-issuer
e2e suite plus a focused unit-test layer.

Layout
======

- `tests/conftest.py` — module-scoped fixtures for each fixture PDF
  (CAMS / KFin / NSDL / CDSL detailed + summary). Each fixture
  loader skips its dependent tests when the corresponding env var
  isn't set, so contributors without the encrypted bundle can still
  run the unit-test portion.
- `tests/_assertions.py` — invariant helpers shared across the e2e
  suite. Designed to lock in correctness without encoding the real
  rupee figures from private fixtures.
- `tests/test_cams.py`, `test_kfin.py`, `test_nsdl.py`,
  `test_cdsl.py` — per-issuer e2e tests.
- `tests/test_errors.py` — error-path + back-compat shim tests.
- `tests/test_demat_units.py` — NSDL/CDSL parser unit tests using
  synthetic Block/Cell fixtures (no real ISINs, names, or IDs).
- `tests/test_helpers.py`, `tests/test_gains.py`,
  `tests/test_gains_e2e.py` — existing helper / gains coverage,
  retained.

Arithmetic invariants
=====================

The e2e tests verify parsing correctness without depending on
specific rupee amounts:

- **CAMS / KFin DETAILED**:
  scheme.close * scheme.valuation.nav == scheme.valuation.value
  and scheme.open + sum(txn.units) == scheme.close.
- **NSDL / CDSL**:
  sum(eq.value + mf.value + bd.value) == account.balance per
  account; mf.balance * mf.nav == mf.value;
  bond.num_bonds * bond.face_value == bond.value (summary form);
  bond.num_bonds * bond.market_price == bond.value (detailed
  form).

These catch column-swap, decimal-parse, anchor-drift, and missed-
transaction bugs without encoding portfolio totals in the repo.
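
As a rough illustration of the CAMS / KFin DETAILED invariants, the check below compares the two identities with a small tolerance. The function name, argument shapes, and tolerance values are assumptions made for this sketch; the helpers in `tests/_assertions.py` may be organised differently.

```python
from decimal import Decimal


def scheme_invariant_failures(open_units, close_units, nav, value, txn_units,
                              unit_tol=Decimal("0.001"),
                              value_tol=Decimal("0.01")):
    """Return the names of the invariants that do not hold for one scheme."""
    failures = []
    # open + sum(txn.units) == close
    if abs(open_units + sum(txn_units, Decimal(0)) - close_units) > unit_tol:
        failures.append("units")
    # close * nav == value
    if abs(close_units * nav - value) > value_tol:
        failures.append("value")
    return failures
```

Neither check needs to know what the correct rupee value is; each only asserts that the parsed fields agree with one another.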

Removed
=======

- `tests/test_pdfminer.py`, `tests/test_mupdf.py`,
  `tests/test_process.py` — backend-specific suites for the
  v0.8 stack.
- `tests/test_pypdfium.py`, `tests/base.py` — the intermediate
  single-file test suite is superseded by the per-issuer split.
- `tests/pytest.ini` — empty file masked the pyproject.toml
  pytest config.
- `pyproject.toml`: bumps version to 1.0.0, drops pdfminer.six
  (AGPL-3.0+) and PyMuPDF (GPL-3.0+ / commercial) from runtime
  deps, replaces with pypdfium2 (Apache-2.0 / BSD-3). Loosens
  remaining version bounds where compatible (click <10, rich <16,
  pypdfium2 <7, pydantic <3, etc.) and refreshes the dev-group
  upper bounds (pytest <10, pytest-cov <8, ipython, coverage).
- Python floor lifts to 3.11 (3.10 reaches end-of-life in October 2026).
- `uv.lock`: regenerated against the new dep set.
- `.github/workflows/run-pytest.yml`: switches CI to Python 3.12,
  decrypts `tests/files.enc` for the encrypted fixture bundle, and
  exposes the per-fixture env-var matrix to pytest. PyPI publish
  workflow updated to remove references to the dropped backends.
- `licenses/AGPL-3.0+.txt` + `licenses/GPL-3.0+.txt`: removed —
  no longer required to redistribute since the GPL/AGPL deps are
  gone.
- `README.md`: documents the v1.0 backend swap and refreshes
  external links. `CHANGELOG.md` gets a 1.0.0 section.
v0.9.0 shipped a PyMuPDF-1.25 compatibility fix on top of v0.8's
existing backend; v1.0 has already replaced that backend with
pypdfium2, so the v0.9 parser patches don't apply. The merge keeps
v1.0's parser layer and folds in the v0.9 metadata changes that
are still relevant:

- `casparser-isin>=2026.5.1` (DB format v2 with sebi_category /
  last_seen / ISIN-first lookup priority) — adopted.
- `pdfminer.six` and the `mupdf`/`fast` PyMuPDF extras stay
  removed (1.0.0's pure-pypdfium2 stack).
- `MutualFund.fix_float` aliased-field bug fix (v0.9 patched it
  on the v0.8 model; v1.0's model already carries the same fix).
- CI matrix: adopt v0.9's `[3.11, 3.12, 3.13]` Python matrix; drop
  `--all-extras` from `uv sync` (no extras to install any more).
- CHANGELOG keeps the 1.0.0 entry on top and a condensed 0.9.0
  entry below it for historical record.

Files v0.9 modified that v1.0 had already deleted are kept deleted:
- casparser/parsers/mupdf.py
- casparser/process/{__init__,cas_detailed,cas_summary,cdsl_statement,
  nsdl_statement,regex,utils}.py

Tests: 151/151 with private fixtures, 87/87 + 64 skipped without.
GitHub deprecation notice — actions running on Node.js 20 will be
forced to Node.js 24 from 2026-06-02. Bump every referenced action
to its current Node-24-native major:

  actions/checkout            v4 → v6
  actions/setup-python        v5 → v6
  astral-sh/setup-uv          v5 → v8
  codecov/codecov-action      v5 → v6

These are all backward-compatible at the workflow-input level; no
input changes required.
codereverser changed the title from "v1.0: rewrite parsing backend on pypdfium2; full feature parity with v0.8" to "v1.0: rewrite parsing backend on pypdfium2" on May 22, 2026
The Finance (No. 2) Act 2024 split the FY2024-25 equity-LTCG regime
on 23-Jul-2024 (rate 10% -> 12.5%, exemption 1L -> 1.25L). The
AY 2025-26 Schedule 112A CSV template added a column 1b,
"Share/Unit Transferred", flagging which side of that date each
transfer (sale) falls on (BE before / AE on-or-after) — casparser's
112A export was missing it.

- `GainEntry112A` gains a `transferred` field (BE/AE from sale date).
- `generate_112a` keys the after-31-Jan-2018-acquired consolidation
  on the transfer flag, so a fund sold both before and on/after
  23-Jul-2024 within FY2024-25 produces one row per side (the
  utility taxes the two sides at different rates) instead of one
  ambiguous merged row.
- `generate_112a_csv_data` inserts the `Share/Unit Transferred(1b)`
  header + value between columns 1a and 2, but only for FY2024-25
  and later (`_fy_needs_transfer_col`); older FYs keep the 14-column
  layout their ITR utility expects.
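
The transfer flag and the FY gate can be sketched as below. This is an illustration only: the `"FY2024-25"` string format, the tuple shape passed to `consolidate`, and the helper names are assumptions, and the sketch ignores the after-31-Jan-2018 acquisition condition that the real consolidation also applies.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal

# Date on which the FY2024-25 equity-LTCG regime splits.
CUTOFF = date(2024, 7, 23)


def transfer_flag(sale_date: date) -> str:
    """BE for sales before the cutoff, AE for sales on or after it."""
    return "BE" if sale_date < CUTOFF else "AE"


def fy_needs_transfer_col(fy: str) -> bool:
    """True for FY2024-25 and later; assumes labels shaped like 'FY2024-25'."""
    return int(fy[2:6]) >= 2024


def consolidate(sales):
    """Sum gains per (isin, flag) so each side of the cutoff gets its own row."""
    totals = defaultdict(Decimal)
    for isin, sale_date, gain in sales:
        totals[(isin, transfer_flag(sale_date))] += gain
    return dict(totals)
```

Keying on the flag as well as the ISIN is what keeps a fund sold on both sides of the cutoff from collapsing into a single row.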

Columns 2-14 were verified unchanged against the official
112A_115AD_CSV_Instructions.pdf (col 7 = max(8, 9), col 9 =
min(FMV, sale) for grandfathered lots, 31-Jan-2018 grandfathering
intact).

Tests: 6 new unit tests in TestSchedule112A covering the transfer
flag, the FY gate, the cross-cutoff consolidation split, and CSV
header/column placement for both new and legacy FYs.
The Cost Inflation Index table had FY2024-25 as 365; the CBDT-notified
value is 363 (Notification 44/2024). The wrong value slightly
mis-indexed debt-fund LTCG cost of acquisition for FY2024-25 sales.

Also adds FY2025-26 = 376 (Notification 70/2025, applicable AY
2026-27 onward). All earlier years (FY2001-02..FY2023-24) verified
correct against the CBDT table — unchanged.

Test asserts the three most recent notified values.
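
To show how a wrong index value propagates, here is a minimal sketch of indexed cost of acquisition. The table holds only the two values named above plus the FY2001-02 base year of 100; the function name, the FY label format, and rounding to two decimals are assumptions for illustration.

```python
from decimal import Decimal

# Partial Cost Inflation Index table: base year plus the two values above.
CII = {"FY2001-02": 100, "FY2024-25": 363, "FY2025-26": 376}


def indexed_cost(cost: Decimal, purchase_fy: str, sale_fy: str,
                 table=CII) -> Decimal:
    """cost * CII(sale year) / CII(purchase year), rounded to two decimals."""
    return (cost * table[sale_fy] / table[purchase_fy]).quantize(Decimal("0.01"))
```

With the corrected table, ₹1,000 acquired in FY2001-02 and sold in FY2024-25 indexes to ₹3,630.00; with 365 in place of 363 it would have come out as ₹3,650.00.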
A third CDSL "MUTUAL FUND UNITS HELD" row variant exists in the wild:
a distribution-mode column (ARN/DIRECT) is present but there is NO
separate "invested / total cost" column, so the row carries only
three value columns — units | NAV | current-value.

`_parse_mf_holdings_row` assumed >= 4 numerics whenever a
distribution-mode column was detected, assigning the third numeric
to `invested` and defaulting `value` to 0. That zeroed the holding's
current value (and, for single-MF folio accounts, the whole
MF-Folios balance).

Fix: when only three numerics survive, treat the third as the
current value (a holdings statement always prints current value;
the cost column is the optional one) and leave `total_cost` None.
The >= 4 numerics path is unchanged.
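
The corrected column mapping can be sketched as follows. `MFRow`, `map_numerics`, and the units | NAV | cost | value ordering on the four-column path are assumptions for illustration; only the three-numerics rule is taken directly from the description above.

```python
from decimal import Decimal
from typing import NamedTuple, Optional


class MFRow(NamedTuple):
    units: Decimal
    nav: Decimal
    total_cost: Optional[Decimal]
    value: Decimal


def map_numerics(numerics: list[Decimal]) -> MFRow:
    """Map the numeric cells of a distribution-mode row onto named fields."""
    if len(numerics) >= 4:
        # Full template: units | NAV | invested cost | current value.
        units, nav, cost, value = numerics[:4]
        return MFRow(units, nav, cost, value)
    if len(numerics) == 3:
        # Reduced template: the cost column is the optional one, so the
        # third numeric is the current value and total_cost stays None.
        units, nav, value = numerics
        return MFRow(units, nav, None, value)
    raise ValueError(f"expected at least 3 numeric cells, got {len(numerics)}")
```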

Found via a batch run over a third-party corpus of 193 CAS PDFs:
161 valid CAS files parsed with zero crashes; this was the one
material correctness gap surfaced by the sum(holdings)==balance
invariant. Two new regression tests cover the full and reduced
distribution-mode templates.
CDSL occasionally prints sub-unit balances without the leading zero
(e.g. a 0.196-unit balance renders as .196). The shared NSDL/CDSL
NUMERIC_RE required at least one digit (or comma) before the
decimal point, so those cells were classified as text and the row
layout silently shifted - producing a Σholdings ≠ balance mismatch
on affected accounts.

Add a leading-dot alternative to the regex in both casparser/parsers/cdsl.py
and casparser/parsers/nsdl.py and extend the existing _looks_numeric
unit tests to cover .196 / -.5 / 0.196 (positive cases) plus naked
'.' / '-' (still rejected).
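
A pattern with a leading-dot alternative might look like the sketch below. The exact expression in `cdsl.py` / `nsdl.py` is not shown in this PR text, so this regex and the `looks_numeric` wrapper are assumptions that only reproduce the accepted and rejected cases listed above.

```python
import re

# Either digits/commas with an optional fraction, or a bare ".digits" form.
NUMERIC_RE = re.compile(r"^-?(?:[\d,]+(?:\.\d+)?|\.\d+)$")


def looks_numeric(cell: str) -> bool:
    """True when the cell text reads as a (possibly sub-unit) number."""
    return bool(NUMERIC_RE.match(cell.strip()))
```

The second alternative requires at least one digit after the dot, which is why a naked `.` is still classified as text.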

Verified on a private CDSL fixture: the affected account's
Σholdings invariant moves from fail to pass with no other change.
…ismatch

Some KFin templates print '... - Reversed' rows (notably the
Franklin wound-up debt schemes' 'Payment - Units Extinguished-Reversed'
entries) with cosmetic parentheses around the units value even
though the semantic sign is the opposite of the original. The
parser's _decimal() helper treats parens as negative, so these rows
landed with the wrong sign and the open + Σunits == close invariant
broke on the parent scheme by exactly twice the reversed-row units.

Add a post-parse running-balance validator to cams_detailed.parse:
for each scheme, walk the transactions and check whether
prev_balance + units == balance (within tolerance). When the
parsed sign disagrees but prev_balance - units == balance, flip
units (and the matching amount) and reclassify via
get_transaction_type so the type label tracks the corrected sign.
Rows without units (STT/Stamp/TDS/MISC) or without a parsed
balance are skipped; rows whose printed balance matches neither
sign are left untouched.
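
A minimal sketch of the running-balance check, under stated assumptions: the `Txn` shape, the tolerance, and the function name are hypothetical, and the reclassification step via `get_transaction_type` is omitted.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class Txn:
    units: Optional[Decimal]
    amount: Optional[Decimal]
    balance: Optional[Decimal]


def fix_balance_signs(open_units: Decimal, txns: list[Txn],
                      tol: Decimal = Decimal("0.001")) -> int:
    """Flip units/amount where only the opposite sign matches the balance."""
    prev = open_units
    flipped = 0
    for t in txns:
        if t.units is None or t.balance is None:
            continue  # tax/misc rows and rows without a printed balance
        if abs(prev + t.units - t.balance) > tol:
            if abs(prev - t.units - t.balance) <= tol:
                t.units = -t.units
                if t.amount is not None:
                    t.amount = -t.amount
                flipped += 1
            # Otherwise neither sign matches; leave the row untouched.
        prev = t.balance
    return flipped
```

The printed balance, not the recomputed one, becomes the next row's starting point, so one unexplained row does not cascade into later flips.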

This is a generic safety net for cosmetic-parens sign mis-parses
on any AMC / template, not just the Franklin case. Covered by
four new TestBalanceSignFix unit tests in tests/test_helpers.py
plus the private KFin batch (open+Σunits invariant moves to 100%
pass).
Add a 'Supported inputs' section before Installation that lists
the four recognised issuers (CAMS / KFintech / NSDL / CDSL) and
their statement variants, then a 'Known unsupported inputs'
subsection covering the three classes that surfaced during the
v1.0 batch testing:

- Re-printed PDFs (Microsoft Print to PDF, browser save-as-PDF,
  macOS print preview, etc.) - the watermark gets flattened into
  a bitmap and the original generator metadata is wiped, so the
  detector can no longer prove what it's looking at.
- MF Central statements - different template / generator,
  out of scope for v1.0.
- Third-party-reformatted statements - same reason as re-prints.

Users who hit these flows should keep the original issuer-
delivered PDF alongside any redistributed copy and feed the
original to casparser.
…osals

In FIFOUnits.sell, when a buy lot is partially consumed by a sale,
the lot was re-queued onto the FIFO deque with the full original
purchase_tax — not the unallocated remainder. The next disposal
from that lot would then run the proportional allocation against
the full original stamp again, re-claiming a slice of stamp that
had already been allocated to an earlier disposal.

Worked example: 300-unit lot with ₹1.25 stamp paid, consumed
across three 100-unit disposals.

  Disposal 1: round(1.25 × 100/300, 2) = ₹0.42  → lot re-queued (200, 1.25)
  Disposal 2: round(1.25 × 100/200, 2) = ₹0.62  → lot re-queued (100, 1.25)
  Disposal 3: round(1.25 × 100/100, 2) = ₹1.25
  Total claimed: ₹2.29 vs ₹1.25 paid  (83% over-claim)

Re-queue with (purchase_tax - stamp_duty) so the lot carries only
the unallocated remainder. The rounding residual gets absorbed
into the final partial consumption, so the invariant
sum(stamp_claimed) == stamp_paid holds exactly across any number
of splits.
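
The fixed allocation can be sketched with a plain deque of `(units, stamp)` tuples. This is a simplified stand-in for `FIFOUnits.sell`, not its real code; the tuple shape and the choice of half-even rounding (which matches the ₹0.62 in the worked example) are assumptions.

```python
from collections import deque
from decimal import ROUND_HALF_EVEN, Decimal

PAISA = Decimal("0.01")


def sell(lots: deque, units: Decimal) -> Decimal:
    """Consume `units` from the FIFO lots; return the stamp duty allocated."""
    claimed = Decimal(0)
    remaining = units
    while remaining > 0:
        lot_units, lot_stamp = lots.popleft()
        used = min(lot_units, remaining)
        stamp = (lot_stamp * used / lot_units).quantize(
            PAISA, rounding=ROUND_HALF_EVEN
        )
        claimed += stamp
        if used < lot_units:
            # Re-queue with only the unallocated stamp, not the original.
            lots.appendleft((lot_units - used, lot_stamp - stamp))
        remaining -= used
    return claimed
```

Run against the 300-unit lot above, three 100-unit disposals allocate ₹0.42, ₹0.42 and ₹0.41, which sums to the ₹1.25 actually paid.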

Section 48 only permits deducting actual stamp paid as a
transfer-related expense, so the prior behaviour over-stated the
deduction on Schedule 112A. The magnitude is small on a per-lot
basis (MF stamp is capped at ~₹1.25 per SIP since Jul 2020), but
compounds across hundreds of disposals per FY and grows worse
with split depth.

Surfaced by a cross-engine comparison with folioman's FIFO
implementation, which has carried the correct
proportional-with-reduction logic all along.

Add tests/test_gains.py::TestGainsClass::test_stamp_duty_split_lot_does_not_exceed_paid
to lock the invariant.
Some CAS generators insert a U+00AD soft hyphen at the point where a
long token wraps across two display lines. For a 12-char ISIN this
produced e.g. INF179K01<shy>WN9, which matched neither the anchored
ISIN regex nor its leftover fragment, so the holding row was silently
dropped (CDSL issue #127).

Handle it at the cell-join layer in pageobj._cells_from_block_atoms,
the single chokepoint all NSDL/CDSL holdings parsing flows through.
New _join_column_atoms() helper:

- a fragment ENDING with a soft hyphen is a continuation marker: the
  next fragment is spliced on with no separator and the hyphen dropped
  (reconstructs INF179K01<shy> + WN9 -> INF179K01WN9);
- any remaining embedded soft hyphen is stripped, so a single-atom
  INF179K01<shy>WN9 normalises the same way;
- cells without a soft hyphen are newline-joined exactly as before.
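
The three joining rules can be sketched as below. The function mirrors the behaviour described for `_join_column_atoms()` but is written from that description alone, so its name, signature, and list-of-strings input are assumptions.

```python
SOFT_HYPHEN = "\u00ad"


def join_column_fragments(fragments: list[str]) -> str:
    """Join a cell's text fragments, splicing across soft-hyphen wraps."""
    lines: list[str] = []
    splice_next = False
    for fragment in fragments:
        continues = fragment.endswith(SOFT_HYPHEN)
        # Strip every soft hyphen, trailing or embedded.
        text = fragment.replace(SOFT_HYPHEN, "")
        if splice_next and lines:
            lines[-1] += text  # continuation: no separator
        else:
            lines.append(text)
        splice_next = continues
    return "\n".join(lines)
```

Cells that never contain a soft hyphen take the `else` branch every time, so they are newline-joined exactly as before.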

Add TestSoftHyphen (5 cases): embedded single-atom, two-atom split,
chained 3-fragment split, no-op for normal multi-line cells, and an
end-to-end check through _cells_from_block_atoms asserting the
reconstructed string matches INF_ISIN_RE.

Scope: reconstruction works when the wrapped fragments fall in the
same row block (the normal-scale case). ISINs split across blocks by
a non-standard page scale (e.g. a statement re-printed/re-saved at a
larger scale) are not covered.

Full suite stays green (171 passed); the private 193-PDF batch is
unchanged (all arithmetic invariants still 100% pass).
main carried two CDSL fixes to casparser/process/cdsl_statement.py
(soft-hyphen ISIN reconstruction #129, and an IDCW terminator regex
refactor). v1.0 removed the entire casparser/process/ package and
reimplemented CDSL parsing structurally under casparser/parsers/, so
the only conflict was a modify/delete on that file.

Resolved by keeping the deletion — both fixes are already covered in
v1.0:
- the soft-hyphen ISIN reconstruction is implemented for the new
  structural parser in casparser/parsers/pageobj.py (837f5e6);
- v1.0's CDSL parser is immune to the IDCW premature-termination bug
  by design (mode-based section state, no content-token terminators).
- Drop rotated-watermark text at extraction (extract.py + pageobj.py):
  fixes garbage RTA / polluted scheme names on all four issuers.
- CAMS detailed: reconstruct wrapped "(Non Demat)" headers — correct
  RTA + ISIN, no phantom schemes from continuation lines.
- CAMS/KFin summary: stop the last scheme swallowing the Total row and
  trailing disclaimers.
- Tests: scheme-name / RTA invariants + soft-hyphen reconstruction.

Verified on a 217-file private corpus: RTA garbage 78->0, all
arithmetic invariants 100%, repo suite 173 passed.
- read_cas_pdf now closes the PdfDocument it opens (try/finally), so
  pdfium no longer warns "objects still open" at teardown.
- CDSL MF holdings: splice a wrapped "<digits>/<digits>" folio tail back
  onto the head (e.g. "910121125" | "82/0" -> 91012112582/0) instead of
  truncating the folio and swallowing the tail as the mode column.

Regression tests added for both. Full suite green (175); batch
invariants 100%.
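
The folio-tail splice can be illustrated with a short sketch. The cell layout, the all-digits test on the head, and the names below are assumptions; only the `<digits>/<digits>` tail shape and the example values come from the note above.

```python
import re

# A wrapped folio tail such as "82/0".
FOLIO_TAIL_RE = re.compile(r"^\d+/\d+$")


def splice_folio_tail(cells: list[str]) -> list[str]:
    """Rejoin a folio whose tail wrapped into the following cell."""
    if (
        len(cells) >= 2
        and cells[0].isdigit()
        and FOLIO_TAIL_RE.match(cells[1])
    ):
        return [cells[0] + cells[1]] + cells[2:]
    return list(cells)
```

For example, `splice_folio_tail(["910121125", "82/0", "DIRECT"])` yields `["91012112582/0", "DIRECT"]`, leaving the mode column in its proper position.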