A structured dataset of Indian Lok Sabha parliamentary question-and-answer records for NLP research and transparency.
Part of the OpenSansad initiative — a personal project to make Sansad's (Indian Parliament's) workings more accessible and transparent through open data and open-source tooling. Data sourced from Digital Sansad.
| | 18th Lok Sabha | 17th Lok Sabha | 16th Lok Sabha |
|---|---|---|---|
| Period | 2024–2026 | 2019–2024 | 2014–2019 |
| Sessions | 2–7 | 1–12, 14–15 | 5–17 |
| Questions | 30,220 | 60,548 | 60,040 |
| Text extracted | 30,219 | 60,529 | 59,990 |
Total: 150,800+ records covering both starred (oral) and unstarred (written) parliamentary questions.
```python
from datasets import load_dataset

ds = load_dataset("opensansad/lok-sabha-qa")

# Filter by ministry
health = ds["train"].filter(lambda x: "HEALTH" in (x["ministry"] or ""))

# Filter by Lok Sabha and session
lok18_s4 = ds["train"].filter(lambda x: x["lok_no"] == 18 and x["session_no"] == 4)

# Starred questions only
starred = ds["train"].filter(lambda x: x["type"] == "STARRED")
```

This repo contains the full data pipeline — self-contained, no external dependencies:
1. Curate → Fetch metadata from sansad.in API
2. Download → Download source files (PDF / DOCX / DOC)
3. Extract → Extract text via Docling (+ EasyOCR fallback for scanned PDFs, + LibreOffice pre-conversion for .doc)
4. Build → Assemble into Parquet dataset
5. Publish → Push to HuggingFace Hub
6. Source Issues → (optional) Aggregate upstream data issues into SOURCE_ISSUES.md
```shell
uv sync
```

Optional system dependency: LibreOffice is required for legacy `.doc` files (binary OLE2). The pipeline auto-detects `soffice` on `PATH` or in the standard install locations — install via `brew install --cask libreoffice` (macOS) or your distro's package manager. `.docx` and `.pdf` work without it.
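The auto-detection described above amounts to checking `PATH` first and then a few well-known install locations. A minimal sketch, assuming illustrative candidate paths and helper name (not the pipeline's actual code):

```python
import shutil
from pathlib import Path

# Illustrative install locations; the real pipeline may check others.
CANDIDATE_PATHS = [
    Path("/Applications/LibreOffice.app/Contents/MacOS/soffice"),  # macOS
    Path("/usr/bin/soffice"),  # common Linux location
]

def find_soffice():
    """Return a usable soffice binary path, or None if LibreOffice is absent."""
    found = shutil.which("soffice")  # 1) anywhere on PATH
    if found:
        return found
    for candidate in CANDIDATE_PATHS:  # 2) standard install locations
        if candidate.is_file():
            return str(candidate)
    return None
```

If no binary is found, only `.doc` inputs are affected; `.docx` and `.pdf` extraction proceeds without LibreOffice.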
```shell
# 1. Curate metadata for a Lok Sabha
uv run python -m lok_sabha_dataset.pipeline.curate --lok 18

# 2. Download PDFs
uv run python -m lok_sabha_dataset.pipeline.download --lok 18

# 3. Extract text from PDFs (two-pass for speed)
uv run python -m lok_sabha_dataset.pipeline.extract run --lok 18 --engine docling
uv run python -m lok_sabha_dataset.pipeline.extract run --lok 18 --engine easyocr --retry-low-confidence

# 4. Build parquet (auto-discovers all loks in data/)
uv run python -m lok_sabha_dataset.build

# 5. Publish to HuggingFace
uv run python -m lok_sabha_dataset.publish --push

# 6. (optional) Refresh the public-facing source-issues report
uv run python -m lok_sabha_dataset.source_issues
```

To build only a subset of the data, pass `--lok` and `--sessions`:

```shell
uv run python -m lok_sabha_dataset.build --lok 18 --sessions 6-7
```

Output is written to `output/lok_sabha_qa.parquet`.
Override the data directory via environment variable:

```shell
export LOKSABHA_DATA_DIR=/path/to/data
```

Default is `data/` in the repo root.
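In code, that resolution is an environment lookup with a fallback; a minimal sketch, with the helper name chosen for illustration:

```python
import os
from pathlib import Path

def data_dir() -> Path:
    """Resolve the data directory: LOKSABHA_DATA_DIR wins, else data/ in the repo root."""
    override = os.environ.get("LOKSABHA_DATA_DIR")
    return Path(override) if override else Path("data")
```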
Integration tests cover the full pipeline against live sansad.in endpoints and a small fixture corpus:
| File | What it covers |
|---|---|
| `tests/test_curate.py` | curate pipeline against the live sansad.in API (LS16-S5 + LS18-S2, scoped to ~5 records per session via `--page-size 5 --max-pages 1`) |
| `tests/test_download.py` | download pipeline against real sansad.in URLs — successful PDF + DOCX downloads, server-error and missing-protocol failures, no-URL skips, and idempotency |
| `tests/test_extract_doc.py` | extract pipeline across `.pdf`/`.docx`/`.doc` source formats and the `--retry-low-confidence --engine easyocr` retry path with strict golden-file comparison |
| `tests/test_splitter.py` | Q/A splitter unit tests |
Run the full suite:
```shell
uv run --extra dev pytest tests/ -v
```

After making changes that legitimately alter extracted text, regenerate the golden references:
```shell
uv run python tests/update_golden.py
```

The curate and download tests hit live sansad.in endpoints — failures generally signal real upstream changes worth investigating.
Upstream data issues observed at sansad.in (broken downloads, NIL-only documents, mismatched files, etc.) are tracked in `SOURCE_ISSUES.md` with the full machine-readable list at `data/source_issues.jsonl`. Hand-curated entries live in `data/source_issues_manual.jsonl` — append a JSON line and re-run step 6 to publish.
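Appending a hand-curated entry is a one-line JSON write. The field names below are illustrative: mirror whatever the existing lines in data/source_issues_manual.jsonl actually use.

```python
import json
from pathlib import Path

# Illustrative entry; copy the field names from existing lines in the file.
entry = {
    "lok_no": 18,
    "session_no": 4,
    "issue": "broken_download",
    "note": "Server returns HTTP 500 for the source PDF.",
}

path = Path("data/source_issues_manual.jsonl")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("a", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

Then re-run step 6 (`python -m lok_sabha_dataset.source_issues`) to regenerate the published report.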
- lok-sabha-rag — RAG application built on this dataset for querying parliamentary proceedings
- opensansad/lok-sabha-qa — The published dataset on HuggingFace
