16 commits
fca1baf
feat(parsers): atom-based pypdfium2 backend + bonds + section routing
codereverser May 20, 2026
eea7847
test: pypdfium2-based test suite with arithmetic invariants
codereverser May 20, 2026
01e87c1
chore: bump to 1.0.0, drop GPL/AGPL deps, refresh CI + lockfile
codereverser May 20, 2026
0ea82f3
Merge origin/main (v0.9.0) into feature/v1.0
codereverser May 22, 2026
09773ac
ci: bump GitHub Actions to Node.js 24-native releases
codereverser May 22, 2026
ff00cd9
feat(gains): Schedule 112A column 1b for the 23-Jul-2024 LTCG split
codereverser May 28, 2026
226c7d0
fix(gains): correct FY2024-25 CII to 363 + add FY2025-26 (376)
codereverser May 28, 2026
dc6b4ba
fix(cdsl): parse current value on reduced MF-holdings rows
codereverser May 28, 2026
f960976
Fix NUMERIC_RE to accept leading-dot decimals in NSDL/CDSL parsers
codereverser May 28, 2026
4060488
Cross-check transaction units against running balance, flip sign on m…
codereverser May 28, 2026
84cdbdc
docs(readme): document supported inputs + known unsupported variants
codereverser May 28, 2026
e3a5f71
Fix stamp-duty over-allocation when a purchase lot splits across disp…
codereverser May 28, 2026
837f5e6
Reconstruct soft-hyphen-wrapped ISINs in NSDL/CDSL holdings
codereverser May 29, 2026
c30a4b6
Merge origin/main into feature/v1.0
codereverser May 29, 2026
789865d
Fix CAMS/KFin RTA & ISIN extraction and SUMMARY footer bleed
codereverser Jun 1, 2026
220edcf
Fix pypdfium document leak and wrapped CDSL MF folios
codereverser Jun 1, 2026
8 changes: 4 additions & 4 deletions .github/workflows/pypi-publish.yml
@@ -12,13 +12,13 @@ jobs:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@v6
with:
python-version: '3.11'
python-version: '3.12'
- name: Install uv
uses: astral-sh/setup-uv@v5
uses: astral-sh/setup-uv@v8.1.0
- name: Build
run: uv build
- name: Publish
12 changes: 7 additions & 5 deletions .github/workflows/run-pytest.yml
@@ -18,15 +18,15 @@ jobs:
python-version: ['3.11', '3.12', '3.13']

steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
uses: actions/setup-python@v6
with:
python-version: ${{ matrix.python-version }}
- name: Install uv
uses: astral-sh/setup-uv@v5
uses: astral-sh/setup-uv@v8.1.0
- name: Install dependencies
run: uv sync --all-extras --dev
run: uv sync --dev
- name: Extract test files
run: ./.github/scripts/extract_files.sh
env:
@@ -44,8 +44,10 @@ jobs:
KFINTECH_CAS_FILE_NEW: ${{ secrets.KFINTECH_CAS_FILE_NEW }}
KFINTECH_CAS_PASSWORD: ${{ secrets.KFINTECH_CAS_PASSWORD }}
NSDL_CAS_FILE_1: ${{ secrets.NSDL_CAS_FILE_1 }}
CDSL_CAS_FILE_1: ${{ secrets.CDSL_CAS_FILE_1 }}
CDSL_CAS_PASSWORD: ${{ secrets.CDSL_CAS_PASSWORD }}
- name: Upload coverage report to codecov
uses: codecov/codecov-action@v5
uses: codecov/codecov-action@v6
with:
files: ./coverage.xml
token: ${{ secrets.CODECOV_TOKEN }}
1 change: 1 addition & 0 deletions .gitignore
@@ -133,6 +133,7 @@ dmypy.json
tests/files/**
tests/files.tar
tests/files.tar.bz2
tests/samples/**
.DS_Store

casparser.code-workspace
134 changes: 119 additions & 15 deletions CHANGELOG.md
@@ -1,7 +1,122 @@
# Changelog

## 1.0.0

Major release. The parsing backend was rewritten from scratch on
[pypdfium2](https://github.com/pypdfium2-team/pypdfium2) (Apache-2.0 /
BSD-3) and the four supported CAS issuers now each have a dedicated
parser tuned to their template family.

### Breaking changes

- **pdfminer.six and PyMuPDF backends removed.** `casparser.read_cas_pdf`
no longer dispatches between them. The `mupdf` / `fast` extras in
`pyproject.toml` are gone. The `--force-pdfminer` CLI flag and the
`force_pdfminer=` kwarg on `read_cas_pdf` are kept as no-ops; the
kwarg emits a `DeprecationWarning` and is otherwise ignored.
- **License simplified to pure MIT.** With the GPL/AGPL-licensed
PyMuPDF dependency gone, the `licenses/` directory of GPL/AGPL
copies has been removed. pypdfium2 is dual Apache-2.0 / BSD-3 and
doesn't impose any copyleft obligation on users of casparser.
- **Minimum Python is now 3.11.** 3.9 / 3.10 classifiers dropped from
`pyproject.toml`.
- **`CASData.investor_info` is now `Optional[InvestorInfo]`** (matches
the `NSDLCASData.investor_info` shape that already existed). It is
populated on every supported issuer, but consumers should still
guard against the `None` case for unfamiliar templates.
- **Internal `casparser.process` package removed.** The two helpers
downstream code still imports from it are now at
`casparser.parsers._classify` (`get_parsed_scheme_name`,
`get_transaction_type`) and `casparser.parsers._isin` (`isin_search`).
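
For downstream code adapting to these changes, the following is a minimal sketch of the two patterns involved: a keyword argument kept as a warning-emitting no-op, and a guard for an optional `investor_info`. `read_cas_pdf_sketch` and `investor_email` are hypothetical names used for illustration only; they are not casparser's implementation, and the `None` default for `force_pdfminer` is an assumption.

```python
import warnings
from types import SimpleNamespace
from typing import Optional


def read_cas_pdf_sketch(filename, password, force_pdfminer=None):
    """Hypothetical stand-in for casparser.read_cas_pdf (not the real code)."""
    if force_pdfminer is not None:
        # The flag no longer selects a backend; warn and carry on.
        warnings.warn(
            "force_pdfminer is ignored since 1.0.0",
            DeprecationWarning,
            stacklevel=2,
        )
    # Stub result: a statement whose investor_info could not be populated.
    return SimpleNamespace(investor_info=None)


def investor_email(data) -> Optional[str]:
    """Return the investor email, tolerating investor_info being None."""
    info = data.investor_info
    return info.email if info is not None else None
```

Callers that still pass `force_pdfminer=True` keep working under this pattern and only see a warning, while `investor_email` returns `None` instead of raising `AttributeError` on an unfamiliar template.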

### New

- **First-class NSDL and CDSL parsers.** Drops the regex-on-text
approach the 0.8 NSDL/CDSL code used; the new parsers consume
structured `Block`/`Cell` records directly from `pypdfium2`. Several
bugs the v0.8 NSDL/CDSL code shipped with are no longer in scope
(misplaced-UCC-as-folio on NSDL MF Holdings, space-merged
folio+units cells on CDSL, the silently-dropped NSDL HDFC
subaccount on CDSL multi-account statements, `Optional[Decimal]`
comma-strip miss in the `MutualFund` validator).
- **CAMS / KFin 2026 templates supported** out of the box. The newer
CAMS SUMMARY template added an ISIN column the v0.8 regex didn't
match; v1.0 parses all rows. The newer KFin SUMMARY template emits
zero-balance schemes with single-space-separated trio cells that
the v0.8 regex required `\t\t` between; v1.0 picks them up too.
- **AMC-header detection extended** to include the `Fund House`
suffix. v0.8's regex only matched `Mutual Fund` / `MF` suffixes,
so schemes from a few newer AMCs whose names end in `Fund House`
ended up bucketed under the previous AMC.
- **ISIN / AMFI enrichment has a direct-ISIN fallback** path via
`MFISINDb.direct_isin_lookup` for the case where multi-line
`Registrar:` rendering corrupts the RTA token.
- **Schedule 112A column 1b** ("Share/Unit Transferred") is emitted
for FY2024-25 onward, per the AY 2025-26 ITR utility template. The
Finance (No. 2) Act 2024 split the equity-LTCG regime on
23-Jul-2024; the 112A CSV now flags each transfer `BE`/`AE` against
that date and splits an after-31-Jan-2018-acquired fund into
separate rows when it was sold on both sides of the cutoff. Older
FYs keep the 14-column layout their utility expects.
- **Cost Inflation Index extended to FY2025-26 (376)** and the
FY2024-25 value corrected from `365` to the CBDT-notified `363`
(the wrong value slightly mis-indexed debt-fund LTCG cost of
acquisition for FY2024-25 sales).
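
The `BE`/`AE` flagging for column 1b can be sketched roughly as below. This is an illustration, not the code in this release: `transfer_flag` and `split_by_cutoff` are hypothetical names, and treating a sale dated exactly 23-Jul-2024 as `AE` is an assumption about the boundary.

```python
from datetime import date

# Cutoff introduced by the Finance (No. 2) Act 2024 for the equity-LTCG split.
LTCG_SPLIT_DATE = date(2024, 7, 23)


def transfer_flag(sale_date: date) -> str:
    """Return 'BE' before the cutoff, 'AE' on or after it (assumed boundary)."""
    return "BE" if sale_date < LTCG_SPLIT_DATE else "AE"


def split_by_cutoff(sale_dates):
    """Group one fund's sale dates into the rows a 112A export would need."""
    groups = {"BE": [], "AE": []}
    for sale_date in sale_dates:
        groups[transfer_flag(sale_date)].append(sale_date)
    # Drop empty groups: a fund sold on only one side yields a single row.
    return {flag: dates for flag, dates in groups.items() if dates}
```

A fund sold on both 10-Jul-2024 and 05-Aug-2024 produces two groups under this sketch, which corresponds to the two separate rows described above.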

### Fixed

- **CAMS SUMMARY `valuation.date` no longer mis-parses to year 201**
(was a column-boundary bug — the NAVDate column treated as
right-aligned with a 42pt width clipped the trailing year digit,
then Pydantic mis-coerced the `01-Jan-201` string).
- **CDSL multi-account statements** (5+ demat accounts on one PDF) are
now parsed correctly. Earlier the page-3+ scan only kicked in from
page 8, dropping holdings sections that landed on pages 4-7.
- **CDSL MF holdings** rows with `DIRECT` (or any non-`ARN-XXXX`
distribution-mode token) now correctly populate `pnl` and `return_`.
- **Leading-dot decimals** (`.196`, `-.5`) are now recognised as
numeric by the NSDL / CDSL cell classifier. CDSL occasionally
drops the leading zero on sub-unit balances; under the old regex
those cells were mis-bucketed as text, shifting the row layout
and producing a silent `Σholdings ≠ balance` mismatch.
- **KFin `... - Reversed` transaction rows** (e.g. the Franklin
wound-up debt schemes' `Payment - Units Extinguished-Reversed`
entries) had their units cell printed with cosmetic parentheses
even though the semantic sign is the opposite of the original.
A new running-balance cross-check in the CAMS / KFin DETAILED
parser flips the sign (and matching amount) on any single-row
mis-parse and reclassifies the transaction type via the
positive-units branch of `get_transaction_type`. Acts as a
self-validating safety net for the entire transaction stream,
not just the Franklin case.
- **Stamp-duty allocation on split lots.** When a single purchase
lot is consumed across N disposals the `FIFOUnits.sell` re-queue
carried the *full* original stamp duty on every remaining slice,
so the proportional allocation re-claimed the same paid stamp
on each disposal — total stamp claimed grew unboundedly with
split depth. (Worked example: 300-unit lot with ₹1.25 stamp,
split 100/100/100 → total claimed ₹2.29 vs ₹1.25 paid, an 83%
over-claim that further widens for deeper splits.) The lot is
now re-queued with the *unallocated remainder* (`purchase_tax -
stamp_duty`), so the invariant `Σ(stamp_claimed) == stamp_paid`
holds exactly across any number of partial consumptions. Section
48 only permits deducting the stamp actually paid as a
transfer expense, so the prior behaviour over-stated the
deduction on Schedule 112A.
- **Soft-hyphen-wrapped tokens in NSDL / CDSL holdings.** Some CAS
generators insert a `U+00AD` soft hyphen at the point where a long
token (notably a 12-char ISIN) wraps across two display lines, e.g.
`INF179K01<soft-hyphen>WN9`. The block/cell extractor now treats a trailing
soft hyphen as a continuation marker (splicing the next fragment on
with no separator) and strips any embedded soft hyphens, so the
token is reconstructed intact. Previously the wrapped ISIN matched
neither the anchored ISIN regex nor its leftover, and the holding
was silently dropped. (Reconstruction works when the wrapped
fragments fall in the same row block; ISINs split across blocks by
a non-standard page scale are not covered.)
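
The stamp-duty arithmetic in the split-lot fix above can be reproduced with a small standalone model. This is a sketch under stated assumptions, not the `FIFOUnits.sell` code: `claim_stamp` and its `carry_remainder` switch are hypothetical, and exact fractions are used where the library may round to paise.

```python
from fractions import Fraction


def claim_stamp(lot_units, lot_stamp, disposals, carry_remainder):
    """Allocate one lot's stamp duty across successive partial disposals.

    carry_remainder=False models the old behaviour (the full stamp stays on
    the re-queued slice); True re-queues only the unallocated remainder.
    """
    units = Fraction(lot_units)
    stamp = Fraction(lot_stamp)
    claimed = []
    for sold in disposals:
        # Proportional share of whatever stamp the remaining slice carries.
        share = stamp * Fraction(sold) / units
        claimed.append(share)
        units -= sold
        if carry_remainder:
            stamp -= share
    return claimed


old = claim_stamp(300, "1.25", [100, 100, 100], carry_remainder=False)
new = claim_stamp(300, "1.25", [100, 100, 100], carry_remainder=True)
print(float(sum(old)), float(sum(new)))
```

With the 300-unit lot split 100/100/100, the old model claims 55/24 (about ₹2.29) in total, while the remainder-carrying model claims exactly ₹1.25, so the claimed total matches the stamp paid.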

## 0.9.0 - 2026-05-22
- Add support for CDSL sttements
- Add support for CDSL statements
- Drop support for Python 3.9 and 3.10; minimum supported version is now 3.11
- Support PyMuPDF >= 1.25 (1.27.x tested). Older `<1.25` pin removed.
- Bump `casparser-isin` to `>= 2026.5.1` (new DB format v2 with
@@ -11,20 +126,9 @@
field (Python attribute `return_`) also gets the comma-stripping
treatment; previously NSDL MF folio rows with a return value of
1 lakh or more would fail Decimal validation.
- Parser robustness fixes for PyMuPDF 1.25+ text extraction quirks:
- Re-emit visual rows as separate blocks for CAMS/KFINTECH so the
table header / folio header no longer get merged when the new
block grouping collapses them into a single PyMuPDF block.
- Recover the registrar value (e.g. `KFINTECH`) when it wraps to the
next line.
- Recover the advisor value when the scheme name wraps before the
advisor closing paren.
- Pull ISIN/Advisor onto the scheme line when long scheme names wrap.
- Tax transactions (`*** Stamp Duty ***`, STT, TDS) no longer absorb
spurious units when an adjacent column wraps onto the same row.
- NSDL holdings: widen the y-band tolerance, drop the strict
multiline `$` anchoring, and accept tab-separated wrapped names so
the regexes match consistently across Python 3.11–3.14.
- Parser robustness fixes for PyMuPDF 1.25+ text extraction quirks
(all superseded in 1.0.0 by the pypdfium2 rewrite, kept here for
the historical record).

## 0.8.1 - 2025-09-21
- NSDL parser bug fixes
66 changes: 51 additions & 15 deletions README.md
@@ -6,26 +6,56 @@
[![codecov](https://codecov.io/gh/codereverser/casparser/branch/main/graph/badge.svg?token=DYZ7TXWRGI)](https://codecov.io/gh/codereverser/casparser)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/casparser)

Parse Consolidated Account Statement (CAS) PDF files generated from CAMS/KFINTECH
Parse Consolidated Account Statement (CAS) PDF files generated from
CAMS, KFintech, NSDL, and CDSL.

`casparser` also includes a command line tool with the following analysis tools
- `summary`- print portfolio summary
- (**BETA**) `gains` - Print capital gains report (summary and detailed)
- with option to generate csv files for ITR in schedule 112A format


## Supported inputs

`casparser` parses **original** CAS PDFs delivered by the four
recognised issuers:

| Issuer | Variant(s) | Source |
|-----------|-------------------------|-------------------------------------|
| CAMS | Detailed, Summary | `mailback.camsonline.com` request |
| KFintech | Detailed, Summary | `mfs.kfintech.com` request |
| NSDL | Demat consolidated | NSDL CAS email (monthly) |
| CDSL | Demat consolidated | CDSL CAS email (monthly) |

### Known unsupported inputs

- **Re-printed PDFs.** If you "print to PDF" an existing CAS
(Microsoft Print to PDF, "Save as PDF" via a browser print
dialog, macOS print preview → save, etc.) the watermark gets
flattened from selectable text into a bitmap and the original
generator metadata is wiped. The visual appearance is
identical but `casparser` can no longer prove what it's
looking at, and will reject the file. Re-request the
statement from the issuer directly and parse the original.
- **MF Central statements.** MF Central's CAS uses a different
template / generator and is not in scope for v1.0.
- **Third-party-reformatted statements** (broker portals that
re-render CAS data, Excel/CSV exports converted back to PDF,
etc.) — same reason as re-prints.

If you need to support one of these flows for downstream
tooling, the recommended path is to keep the original
issuer-delivered PDF alongside any redistributed copy and feed
the original to `casparser`.


## Installation
```bash
pip install -U casparser
```

### with faster PyMuPDF parser
```bash
pip install -U 'casparser[fast]'
```

**Note:** Enabling this dependency could result in licensing changes. Check the
[License](#license) section for more details
Since v1.0 the parser is built on [pypdfium2](https://github.com/pypdfium2-team/pypdfium2)
(Apache-2.0 / BSD-3) — no optional PDF backends, no GPL/AGPL dependencies.


## Usage
@@ -50,7 +80,7 @@ csv_str = casparser.read_cas_pdf("/path/to/cas/file.pdf", "password", output="cs
"from": "YYYY-MMM-DD",
"to": "YYYY-MMM-DD"
},
"file_type": "CAMS/KARVY/UNKNOWN",
"file_type": "CAMS/KFINTECH/NSDL/CDSL/UNKNOWN",
"cas_type": "DETAILED/SUMMARY",
"investor_info": {
"email": "string",
@@ -122,6 +152,9 @@ Notes:
- `MISC`
- `dividend_rate` is applicable only for `DIVIDEND_PAYOUT` and
`DIVIDEND_REINVESTMENT` transactions.
- NSDL and CDSL statements return a different top-level shape with
`accounts[].equities[]` and `accounts[].mutual_funds[]` instead of
`folios[].schemes[]`. See `casparser.types.NSDLCASData` for details.
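
Code that consumes both output shapes may want to branch on the top-level key. The sketch below assumes the parsed result is available as a plain `dict` carrying the `accounts` / `folios` keys named above; `count_holdings` is a hypothetical helper, not part of casparser.

```python
def count_holdings(data: dict) -> int:
    """Count holdings in either the demat (NSDL/CDSL) or RTA (CAMS/KFintech) shape."""
    if "accounts" in data:
        # Demat shape: accounts[].equities[] and accounts[].mutual_funds[]
        return sum(
            len(account.get("equities", [])) + len(account.get("mutual_funds", []))
            for account in data["accounts"]
        )
    # RTA shape: folios[].schemes[]
    return sum(len(folio.get("schemes", [])) for folio in data.get("folios", []))
```

Checking for `accounts` first keeps the branch independent of the exact `file_type` string, which may be a reasonable choice if new issuers are added later.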

### CLI

@@ -143,8 +176,6 @@ Usage: casparser [-o output_file.json|output_file.csv] [-p password] [-s] [-a] C
--gains-112a ask|FY2020-21 Generate Capital Gains Report - 112A format for
a given financial year - Use 'ask' for a prompt
from available options (BETA)
--force-pdfminer Force PDFMiner parser even if MuPDF is
detected

--version Show the version and exit.
-h, --help Show this message and exit.
@@ -199,11 +230,16 @@ failing scheme name(s).

## License

CASParser is distributed under MIT license by default. However enabling the optional dependency
`mupdf/fast` would imply the use of [PyMuPDF](https://github.com/pymupdf/PyMuPDF) /
[MuPDF](https://mupdf.com/license.html) and hence the licenses GNU GPL v3 and GNU Affero GPL v3
would apply. Copies of all licenses have been included in this repository. - _IANAL_
CASParser is distributed under the MIT license. Up to v0.8 the optional
`mupdf` / `fast` extra pulled in [PyMuPDF](https://github.com/pymupdf/PyMuPDF) /
[MuPDF](https://mupdf.com/license.html), which would have caused GNU GPL v3
and GNU Affero GPL v3 to apply transitively. v1.0 dropped that extra
(the PyMuPDF and pdfminer.six backends are gone; the parser now runs on
[pypdfium2](https://github.com/pypdfium2-team/pypdfium2), which is dual
Apache-2.0 / BSD-3), so casparser is now pure MIT end-to-end.

## Resources
1. [CAS from CAMS](https://www.camsonline.com/Investors/Statements/Consolidated-Account-Statement)
2. [CAS from Karvy/Kfintech](https://mfs.kfintech.com/investor/General/ConsolidatedAccountStatement)
3. [NSDL Consolidated Account Statement](https://nsdlcas.nsdl.com/)
4. [CDSL Consolidated Account Statement](https://www.cdslindia.com/Investors/Cas.html)
2 changes: 1 addition & 1 deletion casparser/__init__.py
@@ -9,4 +9,4 @@
"CapitalGainsReport",
]

__version__ = "0.9.0"
__version__ = "1.0.0"