v1.0: rewrite parsing backend on pypdfium2 #124
Open
codereverser wants to merge 16 commits into
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #124 +/- ##
==========================================
+ Coverage 88.30% 96.99% +8.70%
==========================================
Files 18 19 +1
Lines 1469 2391 +922
==========================================
+ Hits 1297 2319 +1022
+ Misses 172 72 -100
Rebuilds the parsing layer for v1.0 on top of pypdfium2 (Apache-2.0 / BSD-3) so casparser ships pure MIT end-to-end; the prior pdfminer.six + PyMuPDF dependencies are dropped along with the entire `casparser/process/` regex-tokenisation pipeline they fed.

Engine (casparser/parsers/extract.py, pageobj.py)
=================================================

`extract.py` walks PDF page objects (one atom per text-show op), maps glyphs to their parent atom via PDFium's `FPDFText_GetTextObject`, deduplicates same-font overlapping atoms, then emits `Char`/`Line`/`Page` shaped output that downstream parsers consume.

Atom-level dedup replaces all per-character overlay heuristics: when two atoms share a font, x-overlap by >=50% of the narrower atom's width, and sit 0.05-3.0pt apart in y, we drop the one further from the row's median baseline. That handles the date-twin artefact (same date column rendered twice with a small y-offset, glyphs interleaving by x to produce garbage like `2020 -> 22002200`) without the multi-stage sub-cluster filters earlier prototypes used.

`pageobj.py` exposes the atoms + their column/block grouping that the NSDL/CDSL parsers operate on directly. The same Atom primitive backs the investor extractor.

Per-issuer parsers
==================

- `cams_detailed.py` / `cams_summary.py` consume the Line stream for CAMS + KFin DETAILED and SUMMARY templates.
- `nsdl.py` reads the page-2 account roster, walks per-account holdings sections (equities + mutual funds + corporate bonds in both summary and detailed forms). Section-aware routing handles the case where multiple holding types share the same 18-cell detailed table header by tracking `cur_section` from the preceding marker block. The page-2 roster accepts both the 4-cell (broker + DP/Client joined) and 5-cell (broker, then DP/Client) variants.
- `cdsl.py` mirrors NSDL's structure for the CDSL CAS template.
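The atom-level dedup rule described for `extract.py` can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Atom` dataclass, `is_twin` and `dedup_row` are hypothetical names, and only the three thresholds (same font, >=50% x-overlap of the narrower atom, 0.05-3.0pt y-gap) come from the description above.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Atom:
    """Hypothetical stand-in for one text-show op; not casparser's real type."""
    text: str
    font: str
    x0: float
    x1: float
    y: float  # baseline of the atom


def is_twin(a: Atom, b: Atom) -> bool:
    """Same font, x-overlap >= 50% of the narrower width, 0.05-3.0pt apart in y."""
    if a.font != b.font:
        return False
    overlap = min(a.x1, b.x1) - max(a.x0, b.x0)
    narrower = min(a.x1 - a.x0, b.x1 - b.x0)
    if narrower <= 0 or overlap < 0.5 * narrower:
        return False
    return 0.05 <= abs(a.y - b.y) <= 3.0


def dedup_row(atoms: list[Atom]) -> list[Atom]:
    """From each twin pair, drop the atom further from the row's median baseline."""
    if not atoms:
        return []
    base = median(a.y for a in atoms)
    dropped: set[int] = set()
    for i, a in enumerate(atoms):
        for j in range(i + 1, len(atoms)):
            if i in dropped or j in dropped:
                continue
            if is_twin(a, atoms[j]):
                # On a tie the later atom is dropped (an arbitrary choice here).
                loser = i if abs(a.y - base) > abs(atoms[j].y - base) else j
                dropped.add(loser)
    return [a for k, a in enumerate(atoms) if k not in dropped]
```

In this sketch a date rendered twice with a 1.2pt offset collapses to the copy sitting on the row baseline, while neighbouring columns are left alone because they do not overlap in x.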
Types
=====

- Adds `Bond` model with optional coupon_rate / coupon_frequency / maturity_date / face_value / market_price; required fields are isin, num_bonds, value. Surfaces on `DematAccount.bonds`.
- `investor_info` is now required on `CASData` and `NSDLCASData`.

Performance
===========

The dispatcher opens the PDF document exactly once per `read_cas_pdf` call and threads the handle through detect / parser / investor extractor via an `_doc=` kwarg. NSDL/CDSL additionally share the extracted atoms between the holdings parser and the investor extractor.
Replaces the v0.8 pdfminer / PyMuPDF test files with a per-issuer e2e suite plus a focused unit-test layer.

Layout
======

- `tests/conftest.py`: module-scoped fixtures for each fixture PDF (CAMS / KFin / NSDL / CDSL detailed + summary). Each fixture loader skips its dependent tests when the corresponding env var isn't set, so contributors without the encrypted bundle can still run the unit-test portion.
- `tests/_assertions.py`: invariant helpers shared across the e2e suite. Designed to lock in correctness without encoding the real rupee figures from private fixtures.
- `tests/test_cams.py`, `test_kfin.py`, `test_nsdl.py`, `test_cdsl.py`: per-issuer e2e tests.
- `tests/test_errors.py`: error-path + back-compat shim tests.
- `tests/test_demat_units.py`: NSDL/CDSL parser unit tests using synthetic Block/Cell fixtures (no real ISINs, names, or IDs).
- `tests/test_helpers.py`, `tests/test_gains.py`, `tests/test_gains_e2e.py`: existing helper / gains coverage, retained.

Arithmetic invariants
=====================

The e2e tests verify parsing correctness without depending on specific rupee amounts:

- **CAMS / KFin DETAILED**: scheme.close * scheme.valuation.nav == scheme.valuation.value and scheme.open + sum(txn.units) == scheme.close.
- **NSDL / CDSL**: sum(eq.value + mf.value + bd.value) == account.balance per account; mf.balance * mf.nav == mf.value; bond.num_bonds * bond.face_value == bond.value (summary form); bond.num_bonds * bond.market_price == bond.value (detailed form).

These catch column-swap, decimal-parse, anchor-drift, and missed-transaction bugs without encoding portfolio totals in the repo.

Removed
=======

- `tests/test_pdfminer.py`, `tests/test_mupdf.py`, `tests/test_process.py`: backend-specific suites for the v0.8 stack.
- `tests/test_pypdfium.py`, `tests/base.py`: the intermediate single-file test suite is superseded by the per-issuer split.
- `tests/pytest.ini`: empty file masked the pyproject.toml pytest config.
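As an illustration of the CAMS / KFin DETAILED invariants listed above, a helper along these lines would flag a column swap or a missed transaction without knowing any real portfolio figure. The `Scheme` dataclass, the field names, `check_scheme` and the 0.01 tolerance are all assumptions made for this sketch; the real helpers live in `tests/_assertions.py` and may look quite different.

```python
from dataclasses import dataclass, field
from decimal import Decimal

TOL = Decimal("0.01")  # assumed rounding tolerance, not taken from the PR


@dataclass
class Scheme:
    """Hypothetical flattened view of one parsed scheme."""
    open_units: Decimal
    close_units: Decimal
    nav: Decimal
    value: Decimal
    txn_units: list[Decimal] = field(default_factory=list)


def check_scheme(s: Scheme) -> list[str]:
    """Return the names of the invariants that fail for this scheme."""
    errors = []
    if abs(s.close_units * s.nav - s.value) > TOL:
        errors.append("close * nav != value")
    running = s.open_units + sum(s.txn_units, Decimal(0))
    if abs(running - s.close_units) > TOL:
        errors.append("open + sum(units) != close")
    return errors
```

A parse that swaps the NAV and value columns fails the first check, and one that drops a transaction fails the second, which is the kind of bug the invariants are meant to surface.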
- `pyproject.toml`: bumps version to 1.0.0, drops pdfminer.six (AGPL-3.0+) and PyMuPDF (GPL-3.0+ / commercial) from runtime deps, replaces with pypdfium2 (Apache-2.0 / BSD-3). Loosens remaining version bounds where compatible (click <10, rich <16, pypdfium2 <7, pydantic <3, etc.) and refreshes the dev-group upper bounds (pytest <10, pytest-cov <8, ipython, coverage).
- Python floor lifts to 3.11 (3.10 EOL anyway).
- `uv.lock`: regenerated against the new dep set.
- `.github/workflows/run-pytest.yml`: switches CI to Python 3.12, decrypts `tests/files.enc` for the encrypted fixture bundle, and exposes the per-fixture env-var matrix to pytest. PyPI publish workflow updated to drop the removed backends.
- `licenses/AGPL-3.0+.txt` + `licenses/GPL-3.0+.txt`: removed, since they are no longer required for redistribution now that the GPL/AGPL deps are gone.
- `README.md`: documents the v1.0 backend swap and refreshes external links. `CHANGELOG.md` gets a 1.0.0 section.
cd21b91 to
01e87c1
v0.9.0 shipped a PyMuPDF-1.25 compatibility fix on top of v0.8's
existing backend; v1.0 has already replaced that backend with
pypdfium2, so the v0.9 parser patches don't apply. The merge keeps
v1.0's parser layer and folds in the v0.9 metadata changes that
are still relevant:
- `casparser-isin>=2026.5.1` (DB format v2 with sebi_category /
last_seen / ISIN-first lookup priority) — adopted.
- `pdfminer.six` and the `mupdf`/`fast` PyMuPDF extras stay
removed (1.0.0's pure-pypdfium2 stack).
- `MutualFund.fix_float` aliased-field bug fix (v0.9 patched it
on the v0.8 model; v1.0's model already carries the same fix).
- CI matrix: adopt v0.9's `[3.11, 3.12, 3.13]` Python matrix; drop
`--all-extras` from `uv sync` (no extras to install any more).
- CHANGELOG keeps the 1.0.0 entry on top and a condensed 0.9.0
entry below it for historical record.
Files v0.9 modified that v1.0 had already deleted are kept deleted:
- casparser/parsers/mupdf.py
- casparser/process/{__init__,cas_detailed,cas_summary,cdsl_statement,
nsdl_statement,regex,utils}.py
Tests: 151/151 with private fixtures, 87/87 + 64 skipped without.
GitHub deprecation notice: actions running on Node.js 20 will be forced to Node.js 24 from 2026-06-02. Bump every referenced action to its current Node-24-native major:

- actions/checkout v4 → v6
- actions/setup-python v5 → v6
- astral-sh/setup-uv v5 → v8
- codecov/codecov-action v5 → v6

These are all backward-compatible at the workflow-input level; no input changes required.
b41c14a to
09773ac
The Finance (No. 2) Act 2024 split the FY2024-25 equity-LTCG regime on 23-Jul-2024 (rate 10% -> 12.5%, exemption 1L -> 1.25L). The AY 2025-26 Schedule 112A CSV template added a column 1b, "Share/Unit Transferred", flagging which side of that date each transfer (sale) falls on (BE before / AE on-or-after). casparser's 112A export was missing it.

- `GainEntry112A` gains a `transferred` field (BE/AE from sale date).
- `generate_112a` keys the after-31-Jan-2018-acquired consolidation on the transfer flag, so a fund sold both before and on/after 23-Jul-2024 within FY2024-25 produces one row per side (the utility taxes the two sides at different rates) instead of one ambiguous merged row.
- `generate_112a_csv_data` inserts the `Share/Unit Transferred(1b)` header + value between columns 1a and 2, but only for FY2024-25 and later (`_fy_needs_transfer_col`); older FYs keep the 14-column layout their ITR utility expects.

Columns 2-14 were verified unchanged against the official 112A_115AD_CSV_Instructions.pdf (col 7 = max(8, 9), col 9 = min(FMV, sale) for grandfathered lots, 31-Jan-2018 grandfathering intact).

Tests: 6 new unit tests in TestSchedule112A covering the transfer flag, the FY gate, the cross-cutoff consolidation split, and CSV header/column placement for both new and legacy FYs.
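The transfer flag, the FY gate and the per-side consolidation described above can be sketched as below. The function names, the `(isin, sale_date, gain)` tuple shape and the integer FY argument are assumptions for illustration only; the PR's own `_fy_needs_transfer_col` and `generate_112a` may take different inputs.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal

CUTOFF = date(2024, 7, 23)  # regime split date quoted in the text above


def transfer_flag(sale_date: date) -> str:
    """'BE' for a sale before 23-Jul-2024, 'AE' for one on or after it."""
    return "BE" if sale_date < CUTOFF else "AE"


def needs_transfer_col(fy_start_year: int) -> bool:
    """Column 1b applies to FY2024-25 (start year 2024) and later."""
    return fy_start_year >= 2024


def consolidate(
    sales: list[tuple[str, date, Decimal]],
) -> dict[tuple[str, str], Decimal]:
    """Sum gains per (ISIN, transfer flag) so each side of the cutoff gets a row."""
    totals: dict[tuple[str, str], Decimal] = defaultdict(Decimal)
    for isin, sale_date, gain in sales:
        totals[(isin, transfer_flag(sale_date))] += gain
    return dict(totals)
```

Keying on the flag as well as the ISIN is what keeps a fund sold on both sides of the cutoff from collapsing into a single merged row.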
The Cost Inflation Index table had FY2024-25 as 365; the CBDT-notified value is 363 (Notification 44/2024). The wrong value slightly mis-indexed debt-fund LTCG cost of acquisition for FY2024-25 sales.

Also adds FY2025-26 = 376 (Notification 70/2025, applicable AY 2026-27 onward).

All earlier years (FY2001-02..FY2023-24) verified correct against the CBDT table and left unchanged. Test asserts the three most recent notified values.
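To see why the two-point difference matters, recall that the indexed cost of acquisition is the cost multiplied by the ratio of the sale-year CII to the purchase-year CII. The sketch below is illustrative only: `indexed_cost` is a hypothetical helper, the table holds just the two values quoted above, and the purchase-year index is passed in by the caller rather than looked up.

```python
from decimal import Decimal

# Only the two values quoted in the text above; earlier years are omitted.
CII = {"FY2024-25": 363, "FY2025-26": 376}


def indexed_cost(cost: Decimal, purchase_cii: int, sale_fy: str) -> Decimal:
    """Indexed cost = cost * CII(sale year) / CII(purchase year), to 2 places."""
    return (cost * CII[sale_fy] / purchase_cii).quantize(Decimal("0.01"))
```

With the erroneous 365 in the table, every FY2024-25 sale would have had its indexed cost overstated by a factor of 365/363, and the indexed gain understated accordingly.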
A third CDSL "MUTUAL FUND UNITS HELD" row variant exists in the wild: a distribution-mode column (ARN/DIRECT) is present but there is NO separate "invested / total cost" column, so the row carries only three value columns: units | NAV | current-value.

`_parse_mf_holdings_row` assumed >= 4 numerics whenever a distribution-mode column was detected, assigning the third numeric to `invested` and defaulting `value` to 0. That zeroed the holding's current value (and, for single-MF folio accounts, the whole MF-Folios balance).

Fix: when only three numerics survive, treat the third as the current value (a holdings statement always prints current value; the cost column is the optional one) and leave `total_cost` None. The >= 4 numerics path is unchanged.

Found via a batch run over a third-party corpus of 193 CAS PDFs: 161 valid CAS files parsed with zero crashes; this was the one material correctness gap surfaced by the sum(holdings)==balance invariant. Two new regression tests cover the full and reduced distribution-mode templates.
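The column assignment described above might look like this in isolation. It is a sketch under assumptions: `assign_mf_values` is a hypothetical name, the input is the already-extracted list of numeric cells, and the four-column order (units, NAV, cost, value) is inferred from the description rather than confirmed against the real `_parse_mf_holdings_row`.

```python
from decimal import Decimal
from typing import Optional


def assign_mf_values(numerics: list[Decimal]) -> dict[str, Optional[Decimal]]:
    """Map the numeric cells of a distribution-mode row to named fields."""
    if len(numerics) >= 4:
        # Full template (assumed order): units | NAV | total cost | current value.
        units, nav, cost, value = numerics[:4]
        return {"units": units, "nav": nav, "total_cost": cost, "value": value}
    if len(numerics) == 3:
        # Reduced template: the cost column is absent, so the third
        # numeric is the current value and total_cost stays unset.
        units, nav, value = numerics
        return {"units": units, "nav": nav, "total_cost": None, "value": value}
    raise ValueError(f"expected at least 3 numeric cells, got {len(numerics)}")
```

The key design point is that the optional column is the cost, not the value, so a short row should never leave `value` at zero.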
CDSL occasionally prints sub-unit balances without the leading zero (e.g. a 0.196-unit balance renders as .196). The shared NSDL/CDSL NUMERIC_RE required at least one digit (or comma) before the decimal point, so those cells were classified as text and the row layout silently shifted, producing a Σholdings ≠ balance mismatch on affected accounts.

Add a leading-dot alternative to the regex in both casparser/parsers/cdsl.py and casparser/parsers/nsdl.py and extend the existing _looks_numeric unit tests to cover .196 / -.5 / 0.196 (positive cases) plus naked '.' / '-' (still rejected).

Verified on a private CDSL fixture: the affected account's Σholdings invariant moves from fail to pass with no other change.
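A regex with a leading-dot alternative could be written as below. This is a sketch of the idea, not the pattern shipped in `cdsl.py` / `nsdl.py`: the exact `NUMERIC_RE` is not shown in this PR text, so the pattern and the `looks_numeric` wrapper here are assumptions.

```python
import re

# Illustrative pattern: an optional sign, then either a grouped number
# with an optional fraction, or a bare fraction with no leading zero.
NUMERIC_RE = re.compile(
    r"""^-?(?:
        \d[\d,]*(?:\.\d+)?   # 1,234 or 1,234.56 or 0.196
      | \.\d+                # leading-dot fraction such as .196
    )$""",
    re.VERBOSE,
)


def looks_numeric(cell: str) -> bool:
    """True when the whole cell, ignoring surrounding whitespace, is a number."""
    return bool(NUMERIC_RE.match(cell.strip()))
```

Anchoring both ends keeps a naked `.` or `-` out, since each alternative still requires at least one digit.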
…ismatch

Some KFin templates print '... - Reversed' rows (notably the Franklin wound-up debt schemes' 'Payment - Units Extinguished-Reversed' entries) with cosmetic parentheses around the units value even though the semantic sign is the opposite of the original. The parser's _decimal() helper treats parens as negative, so these rows landed with the wrong sign and the open + Σunits == close invariant broke on the parent scheme by exactly twice the reversed-row units.

Add a post-parse running-balance validator to cams_detailed.parse: for each scheme, walk the transactions and check whether prev_balance + units == balance (within tolerance). When the parsed sign disagrees but prev_balance - units == balance, flip units (and the matching amount) and reclassify via get_transaction_type so the type label tracks the corrected sign. Rows without units (STT/Stamp/TDS/MISC) or without a parsed balance are skipped; rows whose printed balance matches neither sign are left untouched.

This is a generic safety net for cosmetic-parens sign mis-parses on any AMC / template, not just the Franklin case. Covered by four new TestBalanceSignFix unit tests in tests/test_helpers.py plus the private KFin batch (open+Σunits invariant moves to 100% pass).
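The running-balance validator described above reduces to a single pass over each scheme's transactions. The sketch below uses a hypothetical `Txn` dataclass and `fix_signs` function, assumes a 0.001-unit tolerance, and leaves out the `get_transaction_type` reclassification step, so it shows only the sign logic.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

TOL = Decimal("0.001")  # assumed tolerance, not taken from the PR


@dataclass
class Txn:
    """Hypothetical minimal transaction row."""
    units: Optional[Decimal]
    amount: Optional[Decimal]
    balance: Optional[Decimal]


def fix_signs(open_units: Decimal, txns: list[Txn]) -> int:
    """Flip units/amount where only the opposite sign reconciles; return flips."""
    prev = open_units
    flipped = 0
    for t in txns:
        if t.units is None or t.balance is None:
            continue  # tax/charge style rows carry no units or running balance
        as_parsed = abs(prev + t.units - t.balance) <= TOL
        opposite = abs(prev - t.units - t.balance) <= TOL
        if not as_parsed and opposite:
            t.units = -t.units
            if t.amount is not None:
                t.amount = -t.amount
            flipped += 1
        # Rows matching neither sign are left as parsed; the printed
        # balance is still used as the reference for the next row.
        prev = t.balance
    return flipped
```

Because the printed balance is the arbiter, the check stays independent of the row description and applies to any template that wraps units in cosmetic parentheses.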
Add a 'Supported inputs' section before Installation that lists the four recognised issuers (CAMS / KFintech / NSDL / CDSL) and their statement variants, then a 'Known unsupported inputs' subsection covering the three classes that surfaced during the v1.0 batch testing:

- Re-printed PDFs (Microsoft Print to PDF, browser save-as-PDF, macOS print preview, etc.): the watermark gets flattened into a bitmap and the original generator metadata is wiped, so the detector can no longer prove what it's looking at.
- MF Central statements: different template / generator, out of scope for v1.0.
- Third-party-reformatted statements: same reason as re-prints.

Users who hit these flows should keep the original issuer-delivered PDF alongside any redistributed copy and feed the original to casparser.
…osals

In FIFOUnits.sell, when a buy lot is partially consumed by a sale, the lot was re-queued onto the FIFO deque with the full original purchase_tax, not the unallocated remainder. The next disposal from that lot would then run the proportional allocation against the full original stamp again, re-claiming a slice of stamp that had already been allocated to an earlier disposal.

Worked example: 300-unit lot with ₹1.25 stamp paid, consumed across three 100-unit disposals.

  Disposal 1: round(1.25 × 100/300, 2) = ₹0.42 → lot re-queued (200, 1.25)
  Disposal 2: round(1.25 × 100/200, 2) = ₹0.62 → lot re-queued (100, 1.25)
  Disposal 3: round(1.25 × 100/100, 2) = ₹1.25
  Total claimed: ₹2.29 vs ₹1.25 paid (83% over-claim)

Re-queue with (purchase_tax - stamp_duty) so the lot carries only the unallocated remainder. The rounding residual gets absorbed into the final partial consumption, so the invariant sum(stamp_claimed) == stamp_paid holds exactly across any number of splits.

Section 48 only permits deducting actual stamp paid as a transfer-related expense, so the prior behaviour over-stated the deduction on Schedule 112A. The magnitude is small on a per-lot basis (MF stamp is capped at ~₹1.25 per SIP since Jul 2020), but compounds across hundreds of disposals per FY and grows worse with split depth.

Surfaced by a cross-engine comparison with folioman's FIFO implementation, which has carried the correct proportional-with-reduction logic all along.

Add tests/test_gains.py::TestGainsClass::test_stamp_duty_split_lot_does_not_exceed_paid to lock the invariant.
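The worked example above can be reproduced with a small stand-alone allocator. This is a sketch, not `FIFOUnits.sell`: `allocate_stamp` is a hypothetical function that handles a single buy lot, assumes the sales never exceed the lot, and uses `Decimal` with half-even rounding in place of `round(..., 2)`. The `carry_remainder` switch toggles between the old and the fixed re-queue behaviour.

```python
from collections import deque
from decimal import Decimal, ROUND_HALF_EVEN

CENT = Decimal("0.01")


def allocate_stamp(
    lot_units: Decimal,
    lot_stamp: Decimal,
    sales: list[Decimal],
    carry_remainder: bool = True,
) -> list[Decimal]:
    """Consume one buy lot across successive sales; return stamp claimed per sale."""
    lots = deque([(lot_units, lot_stamp)])
    claimed = []
    for sale in sales:
        units, stamp = lots.popleft()
        if sale > units:
            raise ValueError("sale exceeds the remaining lot (out of scope here)")
        share = (stamp * sale / units).quantize(CENT, rounding=ROUND_HALF_EVEN)
        claimed.append(share)
        if sale < units:
            # Fixed behaviour re-queues only the unallocated stamp;
            # the old behaviour re-queued the full original amount.
            remaining = stamp - share if carry_remainder else stamp
            lots.appendleft((units - sale, remaining))
    return claimed
```

With `carry_remainder=False` the three 100-unit disposals claim 0.42, 0.62 and 1.25; with the default they claim 0.42, 0.42 and 0.41, which sums to the 1.25 actually paid.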
Some CAS generators insert a U+00AD soft hyphen at the point where a long token wraps across two display lines. For a 12-char ISIN this produced e.g. INF179K01<shy>WN9, which matched neither the anchored ISIN regex nor its leftover fragment, so the holding row was silently dropped (CDSL issue #127).

Handle it at the cell-join layer in pageobj._cells_from_block_atoms, the single chokepoint all NSDL/CDSL holdings parsing flows through. New _join_column_atoms() helper:

- a fragment ENDING with a soft hyphen is a continuation marker: the next fragment is spliced on with no separator and the hyphen dropped (reconstructs INF179K01<shy> + WN9 -> INF179K01WN9);
- any remaining embedded soft hyphen is stripped, so a single-atom INF179K01<shy>WN9 normalises the same way;
- cells without a soft hyphen are newline-joined exactly as before.

Add TestSoftHyphen (5 cases): embedded single-atom, two-atom split, chained 3-fragment split, no-op for normal multi-line cells, and an end-to-end check through _cells_from_block_atoms asserting the reconstructed string matches INF_ISIN_RE.

Scope: reconstruction works when the wrapped fragments fall in the same row block (the normal-scale case). ISINs split across blocks by a non-standard page scale (e.g. a statement re-printed/re-saved at a larger scale) are not covered.

Full suite stays green (171 passed); the private 193-PDF batch is unchanged (all arithmetic invariants still 100% pass).
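The three joining rules above can be sketched on plain strings. The real `_join_column_atoms()` works on atoms inside `pageobj.py`; the function below is a hypothetical string-only analogue written for illustration, with its name and signature chosen here.

```python
SHY = "\u00ad"  # U+00AD soft hyphen


def join_column_fragments(fragments: list[str]) -> str:
    """Join a cell's text fragments, splicing across trailing soft hyphens."""
    parts: list[str] = []
    splice_next = False
    for frag in fragments:
        continues = frag.endswith(SHY)  # continuation marker for the next fragment
        clean = frag.replace(SHY, "")   # also strips any embedded soft hyphen
        if splice_next and parts:
            parts[-1] += clean          # no separator between wrapped halves
        else:
            parts.append(clean)
        splice_next = continues
    return "\n".join(parts)
```

Fragments with no soft hyphen fall through to the plain newline join, so ordinary multi-line cells come out unchanged.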
main carried two CDSL fixes to casparser/process/cdsl_statement.py (soft-hyphen ISIN reconstruction #129, and an IDCW terminator regex refactor). v1.0 removed the entire casparser/process/ package and reimplemented CDSL parsing structurally under casparser/parsers/, so the only conflict was a modify/delete on that file.

Resolved by keeping the deletion, since both fixes are already covered in v1.0:

- the soft-hyphen ISIN reconstruction is implemented for the new structural parser in casparser/parsers/pageobj.py (837f5e6);
- v1.0's CDSL parser is immune to the IDCW premature-termination bug by design (mode-based section state, no content-token terminators).
- Drop rotated-watermark text at extraction (extract.py + pageobj.py): fixes garbage RTA / polluted scheme names on all four issuers.
- CAMS detailed: reconstruct wrapped "(Non Demat)" headers, giving the correct RTA + ISIN and no phantom schemes from continuation lines.
- CAMS/KFin summary: stop the last scheme swallowing the Total row and trailing disclaimers.
- Tests: scheme-name / RTA invariants + soft-hyphen reconstruction.

Verified on a 217-file private corpus: RTA garbage 78->0, all arithmetic invariants 100%, repo suite 173 passed.
- read_cas_pdf now closes the PdfDocument it opens (try/finally), so pdfium no longer warns "objects still open" at teardown.
- CDSL MF holdings: splice a wrapped "<digits>/<digits>" folio tail back onto the head (e.g. "910121125" | "82/0" -> 91012112582/0) instead of truncating the folio and swallowing the tail as the mode column.

Regression tests added for both. Full suite green (175); batch invariants 100%.
Summary
v1.0 is a full rewrite of the CAS parsing layer:
- `mupdf`/`fast` extras and the `--force-pdfminer` CLI flag (`force_pdfminer=` kwarg kept as a no-op with `DeprecationWarning`).

Why

- `MutualFund.fix_float` validator miss on `Optional[Decimal]` aliased fields.

What's in the box

- `casparser/parsers/`:
  - `pageobj.py`: shared page-object atom extractor (NSDL/CDSL)
  - `extract.py`: char/line extractor (CAMS/KFin)
  - `cams_detailed.py`, `cams_summary.py`, `nsdl.py`, `cdsl.py`
  - `detect.py`: file-type sniffer; wraps `PdfiumError` into `CASParseError` / `IncorrectPasswordError`
  - `_classify.py`, `_isin.py`, `_investor.py`: shared helpers
- `investor_info`, `folio.PAN/KYC/PANKYC`, `scheme.isin/amfi/type`, `scheme.nominees`, `scheme.valuation.cost`.
- `investor_info` is now required on both `CASData` and `NSDLCASData` (matches the contract of "every CAS contains an investor block").
- `DIRECT` (non-ARN) distribution-mode rows populate PnL/return.
- `casparser-isin` with a direct-ISIN fallback path for templates where multi-line registrar rendering mangles the RTA token.

Bug fixes that landed alongside the rewrite

- `valuation.date` was mis-parsing to `date(201, 1, 1)`: column boundary + Pydantic coercion fix.

Breaking changes

- `casparser.types.CASData.investor_info`: `Optional[InvestorInfo]` → `InvestorInfo` (parser raises `CASParseError` if it can't find the block).
- `casparser.types.NSDLCASData.investor_info`: same change.
- `casparser.types.NSDLCASData.file_type`: `Optional[FileType] = None` → `FileType`.
- `ProcessedCASData` and `PartialCASData` removed from `casparser.types` (they were internal to the old pipeline).
- `casparser.process` package removed; surviving helpers moved to `casparser.parsers._classify` and `casparser.parsers._isin`.
- `--force-pdfminer` / `force_pdfminer=` is a no-op (emits `DeprecationWarning`).

Testing

- `tests/test_pypdfium.py`, `tests/test_helpers.py`, `tests/test_gains.py`, `tests/casparser/test_cli.py`).

Test plan

- `casparser` CLI on a sample CAMS/KFin/NSDL/CDSL PDF
- `pip install -U casparser` (or `uv sync`) installs without pulling pdfminer.six / PyMuPDF