feat: add remote storage loaders and writers (S3, GCS, Azure Blob, SQL) #1102

Open
CoronRing wants to merge 5 commits into feature-branch-rag from guan/1090/remote_store

@CoronRing
Contributor

Summary

  • Adds S3Loader / S3Writer, GCSLoader / GCSWriter, AzureBlobLoader / AzureBlobWriter, and SQLLoader / SQLWriter under railtracks.loaders and railtracks.writers
  • All providers use optional extras (railtracks[aws], railtracks[gcp], railtracks[azure-blob], railtracks[sql]) so the core package stays lean
  • All loaders/writers expose sync and async interfaces (load/aload, write/awrite)
  • SQL classes include a context-manager (with SQLLoader(...) as l) and explicit close() for engine lifecycle management
  • SQL identifier arguments validated against a strict allowlist at construction time to prevent injection
  • Full unit test coverage across all 8 classes (127 tests, all passing)
  • Comprehensive developer docs under docs/integrations/storage/ with pip + uv install tabs, security callouts, and provider-specific auth guidance
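
The context-manager plus explicit close() lifecycle mentioned for the SQL classes can be sketched as below. This is a minimal illustration, not the PR's implementation: `ExampleSQLLoader` is a hypothetical class, and the stdlib `sqlite3` connection is an assumed stand-in for whatever engine the real SQLLoader manages.

```python
import sqlite3


class ExampleSQLLoader:
    """Hypothetical loader; sqlite3 stands in for the real engine."""

    def __init__(self, database: str) -> None:
        self._conn = sqlite3.connect(database)
        self.closed = False

    def close(self) -> None:
        # Idempotent: calling close() twice is harmless.
        if not self.closed:
            self._conn.close()
            self.closed = True

    def __enter__(self) -> "ExampleSQLLoader":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Release the connection even if the with-body raised.
        self.close()


with ExampleSQLLoader(":memory:") as loader:
    inside = loader.closed  # still open inside the block
after = loader.closed  # closed once the block exits
```

Offering both forms lets short scripts use `with`, while long-lived objects can call close() explicitly at shutdown.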

Security hardening

  • SQL table/column names validated at __init__ time — raises ValueError on any metacharacter ([A-Za-z_][A-Za-z0-9_$]* allowlist, supports schema.table)
  • Helpful ValueError when content_column is missing from query results (was a bare KeyError)
  • __repr__ on all classes exposes only non-sensitive fields (bucket/container name); credentials never appear in repr
  • UserWarning emitted when prefix is passed to SQLLoader.load() or SQLWriter.write() (unsupported, silently ignored before)
  • All ImportError messages include both pip install and uv add forms
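
For reviewers, the identifier allowlist described above might look roughly like this. It is a sketch under stated assumptions: the function name `validate_identifier` is hypothetical, and the choice to allow at most one dot (`table` or `schema.table`) is an assumption; only the character pattern itself comes from this description.

```python
import re

# Character pattern quoted in the PR description.
_IDENT_PART = r"[A-Za-z_][A-Za-z0-9_$]*"
# Assumption: either "table" or "schema.table", never more than one dot.
_IDENT_RE = re.compile(rf"{_IDENT_PART}(?:\.{_IDENT_PART})?")


def validate_identifier(name: str) -> str:
    """Return name unchanged if it matches the allowlist, else raise ValueError."""
    if not isinstance(name, str) or not _IDENT_RE.fullmatch(name):
        raise ValueError(f"Invalid SQL identifier: {name!r}")
    return name
```

Using `fullmatch` rather than `match` matters here: a prefix match would accept `docs; DROP TABLE users` because the string starts with a valid identifier.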

Limitations documented

  • CTE (WITH …) queries not supported as table_or_query; workaround shown in docs
  • aload/awrite are thread-backed (asyncio.to_thread) not true-async; noted in docs with guidance for high-concurrency cases
  • SQLWriter.write() is all-or-nothing (single transaction); partial-failure pattern documented
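
To make the thread-backed async limitation concrete, here is a minimal sketch of the pattern. `ExampleLoader` is a hypothetical stand-in, not one of the classes in this PR; the only detail taken from the description is that the async method delegates to the sync one via `asyncio.to_thread`.

```python
import asyncio


class ExampleLoader:
    """Hypothetical stand-in for a loader with a blocking load()."""

    def load(self, key: str) -> str:
        # A real loader would do blocking network or database I/O here.
        return f"contents of {key}"

    async def aload(self, key: str) -> str:
        # Run the blocking call in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.load, key)


async def main() -> list:
    loader = ExampleLoader()
    # Each in-flight call occupies one thread from the default executor.
    return await asyncio.gather(loader.aload("a.txt"), loader.aload("b.txt"))


results = asyncio.run(main())
```

Because every concurrent call holds a thread, throughput is bounded by the executor's pool size rather than by the event loop, which is why high-concurrency cases need separate guidance.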

Test plan

  • 127 unit tests passing across all 4 providers × loader + writer
  • SQL tests use real in-memory SQLite (no mocks for correctness)
  • Cloud tests (S3/GCS/Azure) use provider SDK mocks
  • Async variants covered for all classes
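
As an illustration of testing against real in-memory SQLite, the sketch below checks the all-or-nothing write behaviour listed under Limitations. It uses only the stdlib `sqlite3` module; the `write_rows` helper and the `docs` table are hypothetical and are not taken from the PR's test suite.

```python
import sqlite3


def write_rows(conn: sqlite3.Connection, rows: list) -> None:
    # "with conn" commits on success and rolls back if an exception escapes.
    with conn:
        conn.executemany("INSERT INTO docs (content) VALUES (?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (content TEXT NOT NULL)")

# A valid batch is committed as a whole.
write_rows(conn, [("first",), ("second",)])

try:
    # The second row violates NOT NULL, so the whole batch is rolled back.
    write_rows(conn, [("third",), (None,)])
except sqlite3.IntegrityError:
    pass

count = conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
conn.close()
```

Only the two rows from the first batch remain, so `count` is 2: the valid row `"third"` was discarded together with the failing one.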

🤖 Generated with Claude Code

@CoronRing CoronRing changed the base branch from main to feature-branch-rag May 19, 2026 18:32
@CoronRing CoronRing force-pushed the guan/1090/remote_store branch from 4cdece1 to b354919 Compare May 19, 2026 18:36
Comment thread uv.lock
Comment thread pyproject.toml
Comment thread .gitignore Outdated
@CoronRing CoronRing force-pushed the guan/1090/remote_store branch 2 times, most recently from 5a4c7fe to 57be8ec Compare May 21, 2026 23:37
CoronRing and others added 4 commits May 21, 2026 16:44
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…iter

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@CoronRing CoronRing force-pushed the guan/1090/remote_store branch from 57be8ec to e0e6f9e Compare May 21, 2026 23:45
Contributor

@Pooria90 Pooria90 left a comment

Good work @CoronRing

I went over the code. A few substantial changes need to be applied:

  1. We need to move the loaders and writers modules to packages/railtracks/src/railtracks/retrieval/loaders, since they build on Amir's work on loaders. I recommend keeping all classes for a given provider in a single module: for example, merging loaders/s3.py and writers/s3.py into one s3.py.
    Alternatively, we could put them under a loaders/cloud folder. We can discuss the structure.
  2. The modules return Chunk, which lives in our old vector_stores module. That module is now deprecated and will be removed from the framework. The new type used by our retrieval module is Document, defined in packages/railtracks/src/railtracks/retrieval/models.py. Please refer to Amir's loaders for examples.
  3. Please adjust the base classes, data models, and docs accordingly.

3 participants