
OpenSansad — Lok Sabha Q&A Dataset

License: CC BY 4.0 · HuggingFace Dataset · 150,800+ records

A structured dataset of Indian Lok Sabha parliamentary question-and-answer records for NLP research and transparency.

View on HuggingFace

Part of the OpenSansad initiative — a personal project to make Sansad's (Indian Parliament's) workings more accessible and transparent through open data and open-source tooling. Data sourced from Digital Sansad.

Dataset at a Glance

|                | 18th Lok Sabha | 17th Lok Sabha | 16th Lok Sabha |
|----------------|----------------|----------------|----------------|
| Period         | 2024–2026      | 2019–2024      | 2014–2019      |
| Sessions       | 2–7            | 1–12, 14–15    | 5–17           |
| Questions      | 30,220         | 60,548         | 60,040         |
| Text extracted | 30,219         | 60,529         | 59,990         |

Total: 150,800+ records covering both starred (oral) and unstarred (written) parliamentary questions.

Quick Start

```python
from datasets import load_dataset

ds = load_dataset("opensansad/lok-sabha-qa")

# Filter by ministry
health = ds["train"].filter(lambda x: "HEALTH" in (x["ministry"] or ""))

# Filter by Lok Sabha and session
lok18_s4 = ds["train"].filter(lambda x: x["lok_no"] == 18 and x["session_no"] == 4)

# Starred questions only
starred = ds["train"].filter(lambda x: x["type"] == "STARRED")
```
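The filter predicates above are plain Python callables, so you can prototype them on toy dicts before touching the full dataset. A minimal sketch (field names follow the examples above; the record values here are made up for illustration):

```python
from collections import Counter

# Toy records mirroring the fields used above ("ministry", "type",
# "lok_no"); the values are illustrative, not real dataset rows.
records = [
    {"ministry": "HEALTH AND FAMILY WELFARE", "type": "STARRED", "lok_no": 18},
    {"ministry": "FINANCE", "type": "UNSTARRED", "lok_no": 18},
    {"ministry": "HEALTH AND FAMILY WELFARE", "type": "UNSTARRED", "lok_no": 17},
]

# The same predicates as above, applied to plain dicts
health = [r for r in records if "HEALTH" in (r["ministry"] or "")]
per_ministry = Counter(r["ministry"] for r in records)

print(len(health))              # → 2
print(per_ministry["FINANCE"])  # → 1
```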

Pipeline

This repo contains the full, self-contained data pipeline:

Ingestion Pipeline

```
1. Curate        →  Fetch metadata from sansad.in API
2. Download      →  Download source files (PDF / DOCX / DOC)
3. Extract       →  Extract text via Docling (+ EasyOCR fallback for scanned PDFs,
                    + LibreOffice pre-conversion for .doc)
4. Build         →  Assemble into Parquet dataset
5. Publish       →  Push to HuggingFace Hub
6. Source Issues →  (optional) Aggregate upstream data issues into SOURCE_ISSUES.md
```
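The step ordering can be sketched as a simple function chain. Everything below is illustrative stubs, not the real `lok_sabha_dataset` API; the actual entry points are the `python -m lok_sabha_dataset.…` commands shown later:

```python
# Hypothetical sketch of the step ordering above; function names and
# return shapes are illustrative, not the real module API.
def curate(lok):        # 1. fetch question metadata from the sansad.in API
    return [{"lok": lok, "qno": 1}]

def download(meta):     # 2. fetch PDF / DOCX / DOC source files
    return [f"q{m['qno']}.pdf" for m in meta]

def extract(files):     # 3. Docling, with OCR / LibreOffice fallbacks
    return [{"file": f, "text": ""} for f in files]

def build(texts):       # 4. assemble the Parquet dataset
    return "output/lok_sabha_qa.parquet"

def publish(path):      # 5. push to the HuggingFace Hub
    return path

def run_pipeline(lok: int) -> str:
    return publish(build(extract(download(curate(lok)))))

print(run_pipeline(18))  # → output/lok_sabha_qa.parquet
```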

Setup

```shell
uv sync
```

Optional system dependency: LibreOffice is required for legacy `.doc` files (binary OLE2). The pipeline auto-detects `soffice` on PATH or in the standard install locations; install via `brew install --cask libreoffice` (macOS) or your distro's package manager. `.docx` and `.pdf` work without it.

Run the pipeline

```shell
# 1. Curate metadata for a Lok Sabha
uv run python -m lok_sabha_dataset.pipeline.curate --lok 18

# 2. Download PDFs
uv run python -m lok_sabha_dataset.pipeline.download --lok 18

# 3. Extract text from PDFs (two-pass for speed)
uv run python -m lok_sabha_dataset.pipeline.extract run --lok 18 --engine docling
uv run python -m lok_sabha_dataset.pipeline.extract run --lok 18 --engine easyocr --retry-low-confidence

# 4. Build parquet (auto-discovers all loks in data/)
uv run python -m lok_sabha_dataset.build

# 5. Publish to HuggingFace
uv run python -m lok_sabha_dataset.publish --push

# 6. (optional) Refresh the public-facing source-issues report
uv run python -m lok_sabha_dataset.source_issues
```

Build specific sessions

```shell
uv run python -m lok_sabha_dataset.build --lok 18 --sessions 6-7
```

Output is written to `output/lok_sabha_qa.parquet`.

Configuration

Override the data directory via environment variable:

```shell
export LOKSABHA_DATA_DIR=/path/to/data
```

Default is data/ in the repo root.
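The env-var-with-fallback pattern is a one-liner. A sketch of the override logic described above (assumed behavior, not the pipeline's actual code):

```python
import os
from pathlib import Path

def data_dir(repo_root: Path = Path(".")) -> Path:
    """Return LOKSABHA_DATA_DIR if set, else <repo_root>/data."""
    override = os.environ.get("LOKSABHA_DATA_DIR")
    return Path(override) if override else repo_root / "data"
```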

Testing

Integration tests cover the full pipeline against live sansad.in endpoints and a small fixture corpus:

| File | What it covers |
|------|----------------|
| `tests/test_curate.py` | curate pipeline against the live sansad.in API (LS16-S5 + LS18-S2, scoped to ~5 records per session via `--page-size 5 --max-pages 1`) |
| `tests/test_download.py` | download pipeline against real sansad.in URLs: successful PDF + DOCX downloads, server-error and missing-protocol failures, no-URL skips, and idempotency |
| `tests/test_extract_doc.py` | extract pipeline across `.pdf`/`.docx`/`.doc` source formats and the `--retry-low-confidence --engine easyocr` retry path with strict golden-file comparison |
| `tests/test_splitter.py` | Q/A splitter unit tests |

Run the full suite:

```shell
uv run --extra dev pytest tests/ -v
```

After making changes that legitimately alter extracted text, regenerate the golden references:

```shell
uv run python tests/update_golden.py
```

The curate and download tests hit live sansad.in endpoints — failures generally signal real upstream changes worth investigating.

Source Issues

Upstream data issues observed at sansad.in (broken downloads, NIL-only documents, mismatched files, etc.) are tracked in `SOURCE_ISSUES.md`, with the full machine-readable list at `data/source_issues.jsonl`. Hand-curated entries live in `data/source_issues_manual.jsonl`; append a JSON line and re-run step 6 to publish.
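Appending an entry is a one-liner with the standard library. The field names below are hypothetical, for illustration only; mirror the shape of an existing line in `data/source_issues_manual.jsonl` instead:

```python
import json
from pathlib import Path

# Hypothetical entry; copy the fields of existing lines in the file.
entry = {"lok": 18, "session": 4, "issue": "broken_download",
         "note": "PDF link returns HTTP 500"}

path = Path("data/source_issues_manual.jsonl")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("a", encoding="utf-8") as f:
    f.write(json.dumps(entry) + "\n")  # one JSON object per line
```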
