
OpenSansad — Lok Sabha Q&A Dataset

License: CC BY 4.0 · HuggingFace Dataset · 150,800+ records

A structured dataset of Indian Lok Sabha parliamentary question-and-answer records for NLP research and transparency.

View on HuggingFace

Part of the OpenSansad initiative — a personal project to make Sansad's (Indian Parliament's) workings more accessible and transparent through open data and open-source tooling. Data sourced from Digital Sansad.

Dataset at a Glance

|                | 18th Lok Sabha | 17th Lok Sabha | 16th Lok Sabha |
|----------------|----------------|----------------|----------------|
| Period         | 2024–2026      | 2019–2024      | 2014–2019      |
| Sessions       | 2–7            | 1–12, 14–15    | 5–17           |
| Questions      | 30,220         | 60,548         | 60,040         |
| Text extracted | 30,219         | 60,529         | 59,990         |

Total: 150,800+ records covering both starred (oral) and unstarred (written) parliamentary questions.

Quick Start

```python
from datasets import load_dataset

ds = load_dataset("opensansad/lok-sabha-qa")

# Filter by ministry
health = ds["train"].filter(lambda x: "HEALTH" in (x["ministry"] or ""))

# Filter by Lok Sabha and session
lok18_s4 = ds["train"].filter(lambda x: x["lok_no"] == 18 and x["session_no"] == 4)

# Starred questions only
starred = ds["train"].filter(lambda x: x["type"] == "STARRED")
```
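The filter predicates above are plain Python callables, so you can prototype them on toy dicts before touching the full dataset. A minimal sketch (field names follow the examples above; the record values here are made up for illustration):

```python
from collections import Counter

# Toy records mirroring the fields used above ("ministry", "type",
# "lok_no"); the values are illustrative, not real dataset rows.
records = [
    {"ministry": "HEALTH AND FAMILY WELFARE", "type": "STARRED", "lok_no": 18},
    {"ministry": "FINANCE", "type": "UNSTARRED", "lok_no": 18},
    {"ministry": "HEALTH AND FAMILY WELFARE", "type": "UNSTARRED", "lok_no": 17},
]

# The same predicates as above, applied to plain dicts
health = [r for r in records if "HEALTH" in (r["ministry"] or "")]
per_ministry = Counter(r["ministry"] for r in records)

print(len(health))              # → 2
print(per_ministry["FINANCE"])  # → 1
```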

Pipeline

This repo contains the full, self-contained data pipeline:

Ingestion Pipeline

```
1. Curate        →  Fetch metadata from sansad.in API
2. Download      →  Download source files (PDF / DOCX / DOC)
3. Extract       →  Extract text via Docling (+ EasyOCR fallback for scanned PDFs,
                    + LibreOffice pre-conversion for .doc)
4. Build         →  Assemble into Parquet dataset
5. Publish       →  Push to HuggingFace Hub
6. Source Issues →  (optional) Aggregate upstream data issues into SOURCE_ISSUES.md
```
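The step ordering can be sketched as a simple function chain. Everything below is illustrative stubs, not the real `lok_sabha_dataset` API; the actual entry points are the `python -m lok_sabha_dataset.…` commands shown later:

```python
# Hypothetical sketch of the step ordering above; function names and
# return shapes are illustrative, not the real module API.
def curate(lok):        # 1. fetch question metadata from the sansad.in API
    return [{"lok": lok, "qno": 1}]

def download(meta):     # 2. fetch PDF / DOCX / DOC source files
    return [f"q{m['qno']}.pdf" for m in meta]

def extract(files):     # 3. Docling, with OCR / LibreOffice fallbacks
    return [{"file": f, "text": ""} for f in files]

def build(texts):       # 4. assemble the Parquet dataset
    return "output/lok_sabha_qa.parquet"

def publish(path):      # 5. push to the HuggingFace Hub
    return path

def run_pipeline(lok: int) -> str:
    return publish(build(extract(download(curate(lok)))))

print(run_pipeline(18))  # → output/lok_sabha_qa.parquet
```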

Setup

```shell
uv sync
```

Optional system dependency: LibreOffice is required for legacy `.doc` files (binary OLE2). The pipeline auto-detects `soffice` on PATH or in the standard install locations; install via `brew install --cask libreoffice` (macOS) or your distro's package manager. `.docx` and `.pdf` work without it.

Run the pipeline

```shell
# 1. Curate metadata for a Lok Sabha
uv run python -m lok_sabha_dataset.pipeline.curate --lok 18

# 2. Download PDFs
uv run python -m lok_sabha_dataset.pipeline.download --lok 18

# 3. Extract text from PDFs (two-pass for speed)
uv run python -m lok_sabha_dataset.pipeline.extract run --lok 18 --engine docling
uv run python -m lok_sabha_dataset.pipeline.extract run --lok 18 --engine easyocr --retry-low-confidence

# 4. Build parquet (auto-discovers all loks in data/)
uv run python -m lok_sabha_dataset.build

# 5. Publish to HuggingFace
uv run python -m lok_sabha_dataset.publish --push

# 6. (optional) Refresh the public-facing source-issues report
uv run python -m lok_sabha_dataset.source_issues
```

Build specific sessions

```shell
uv run python -m lok_sabha_dataset.build --lok 18 --sessions 6-7
```

Output is written to `output/lok_sabha_qa.parquet`.

Configuration

Override the data directory via environment variable:

```shell
export LOKSABHA_DATA_DIR=/path/to/data
```

Default is data/ in the repo root.
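The env-var-with-fallback pattern is a one-liner. A sketch of the override logic described above (assumed behavior, not the pipeline's actual code):

```python
import os
from pathlib import Path

def data_dir(repo_root: Path = Path(".")) -> Path:
    """Return LOKSABHA_DATA_DIR if set, else <repo_root>/data."""
    override = os.environ.get("LOKSABHA_DATA_DIR")
    return Path(override) if override else repo_root / "data"
```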

Testing

Integration tests cover the full pipeline against live sansad.in endpoints and a small fixture corpus:

| File | What it covers |
|------|----------------|
| `tests/test_curate.py` | curate pipeline against the live sansad.in API (LS16-S5 + LS18-S2, scoped to ~5 records per session via `--page-size 5 --max-pages 1`) |
| `tests/test_download.py` | download pipeline against real sansad.in URLs: successful PDF + DOCX downloads, server-error and missing-protocol failures, no-URL skips, and idempotency |
| `tests/test_extract_doc.py` | extract pipeline across `.pdf`/`.docx`/`.doc` source formats and the `--retry-low-confidence --engine easyocr` retry path with strict golden-file comparison |
| `tests/test_splitter.py` | Q/A splitter unit tests |

Run the full suite:

```shell
uv run --extra dev pytest tests/ -v
```

After making changes that legitimately alter extracted text, regenerate the golden references:

```shell
uv run python tests/update_golden.py
```

The curate and download tests hit live sansad.in endpoints — failures generally signal real upstream changes worth investigating.

Source Issues

Upstream data issues observed at sansad.in (broken downloads, NIL-only documents, mismatched files, etc.) are tracked in `SOURCE_ISSUES.md`, with the full machine-readable list at `data/source_issues.jsonl`. Hand-curated entries live in `data/source_issues_manual.jsonl`; append a JSON line and re-run step 6 to publish.
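Appending an entry is a one-liner with the standard library. The field names below are hypothetical, for illustration only; mirror the shape of an existing line in `data/source_issues_manual.jsonl` instead:

```python
import json
from pathlib import Path

# Hypothetical entry; copy the fields of existing lines in the file.
entry = {"lok": 18, "session": 4, "issue": "broken_download",
         "note": "PDF link returns HTTP 500"}

path = Path("data/source_issues_manual.jsonl")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("a", encoding="utf-8") as f:
    f.write(json.dumps(entry) + "\n")  # one JSON object per line
```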
