125 changes: 125 additions & 0 deletions docs/integrations/storage/azure_blob.md
@@ -0,0 +1,125 @@
# Azure Blob Storage

`AzureBlobLoader` fetches blobs from an Azure Blob Storage container and returns them as
[`Chunk`](../../rag/vector_stores/vector_store_info.md) objects containing
UTF-8 decoded content plus source metadata (`source`, `account_url`, `container`,
`blob_name`).

## Installation

=== "pip"

    ```bash
    pip install "railtracks[azure-blob]"
    ```

=== "uv"

    ```bash
    uv add "railtracks[azure-blob]"
    ```

## Authentication

Authentication defaults to **`DefaultAzureCredential`**, which tries the following
sources in order and uses the first one that yields a token:

1. Environment variables (`AZURE_CLIENT_ID`, `AZURE_TENANT_ID`, `AZURE_CLIENT_SECRET`)
2. Workload identity (Kubernetes)
3. Managed identity (Azure-hosted compute)
4. Azure CLI (`az login`)
5. Azure PowerShell, then Azure Developer CLI (`azd auth login`)

The exact chain depends on the installed `azure-identity` version.

Pass an explicit `credential` to override this default.
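
To check whether the environment-variable source (item 1 in the list above) would apply on a given machine, a small standard-library check can help. This is only a sketch: `missing_azure_env_vars` is a hypothetical helper, not part of Railtracks or the Azure SDK, and it covers only the three client-secret variables named above.

```python
import os

# The three variables named in item 1 of the credential list.
REQUIRED_VARS = ("AZURE_CLIENT_ID", "AZURE_TENANT_ID", "AZURE_CLIENT_SECRET")


def missing_azure_env_vars(environ=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; pass a dict to test without
    touching the real process environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_azure_env_vars()
    if missing:
        print("Environment credentials incomplete; missing:", ", ".join(missing))
    else:
        print("Environment credentials are fully configured.")
```

An empty result only means the variables are present; it does not verify that the service principal is valid or has access to the container.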

!!! tip "Prefer managed identity over connection strings"
    Managed identity is the recommended authentication method for Azure-hosted
    workloads: there are no secrets to store, and Azure rotates the underlying
    credentials automatically. Avoid embedding connection strings, storage
    account keys, or SAS tokens in source code; keep them in Azure Key Vault
    or environment variables instead.

## Basic usage

```python
--8<-- "docs/scripts/storage_loaders.py:azure_basic"
```

## Load by prefix

```python
--8<-- "docs/scripts/storage_loaders.py:azure_prefix"
```

## Load specific blobs

```python
--8<-- "docs/scripts/storage_loaders.py:azure_load_keys"
```

## Async usage

```python
--8<-- "docs/scripts/storage_loaders.py:azure_async"
```

!!! note "Async is thread-backed"
    `aload()` and `aload_keys()` run the synchronous `azure-storage-blob`
    client on a worker thread via `asyncio.to_thread()`. This is sufficient
    for most workloads; for very high concurrency, consider the native async
    Azure SDK client (`azure.storage.blob.aio`).
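
The thread-backed pattern can be illustrated with the standard library alone. In this minimal sketch, `blocking_load` is a hypothetical stand-in for a synchronous SDK call and `aload` here is not the Railtracks implementation; only `asyncio.to_thread()` is the real mechanism named in the note.

```python
import asyncio
import time


def blocking_load(prefix):
    """Hypothetical stand-in for a synchronous, network-bound SDK call."""
    time.sleep(0.05)  # simulate latency
    return [f"{prefix}doc-{i}.txt" for i in range(2)]


async def aload(prefix):
    # Run the blocking function on a worker thread so the event loop
    # can keep servicing other coroutines while it waits.
    return await asyncio.to_thread(blocking_load, prefix)


if __name__ == "__main__":
    print(asyncio.run(aload("reports/")))
```

Because each call occupies one thread from the default executor, throughput is bounded by the pool size, which is why a native async client can be preferable at very high concurrency.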

## Override credentials

**SAS token**

```python
--8<-- "docs/scripts/storage_loaders.py:azure_sas"
```

**System-assigned or user-assigned managed identity**

```python
--8<-- "docs/scripts/storage_loaders.py:azure_managed_identity"
```

## Chunk metadata

Each returned `Chunk` carries:

| Key | Value |
|---|---|
| `source` | Full blob URL: `https://<account>.blob.core.windows.net/<container>/<blob>` |
| `account_url` | Storage account URL |
| `container` | Container name |
| `blob_name` | Blob name (path within the container) |
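
The `source` value is the other three fields joined into one URL. The sketch below shows that relationship with a hypothetical helper, `blob_source_url`; percent-encoding the blob name is an assumption of this example, since the page does not say how the loader encodes special characters.

```python
from urllib.parse import quote


def blob_source_url(account_url, container, blob_name):
    """Join account URL, container, and blob name into a blob URL.

    Illustrative helper only. "/" is left unescaped so virtual
    directories in the blob name stay readable.
    """
    base = account_url.rstrip("/")
    return f"{base}/{container}/{quote(blob_name, safe='/')}"


if __name__ == "__main__":
    print(
        blob_source_url(
            "https://myaccount.blob.core.windows.net",
            "docs",
            "reports/2024/q1.txt",
        )
    )
```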

## Full RAG pipeline example

```python
--8<-- "docs/scripts/storage_loaders.py:pipeline_azure_to_rag"
```

---

## Writing to Azure Blob Storage

`AzureBlobWriter` uploads text content to a blob container. If a blob with the
same name already exists, it is overwritten.

### Basic write

```python
--8<-- "docs/scripts/storage_writers.py:azure_write_basic"
```

### SAS token credential

```python
--8<-- "docs/scripts/storage_writers.py:azure_write_sas"
```

### Async write

```python
--8<-- "docs/scripts/storage_writers.py:azure_write_async"
```
110 changes: 110 additions & 0 deletions docs/integrations/storage/gcs.md
@@ -0,0 +1,110 @@
# Google Cloud Storage

`GCSLoader` fetches objects from a GCS bucket and returns them as
[`Chunk`](../../rag/vector_stores/vector_store_info.md) objects containing
UTF-8 decoded content plus source metadata (`source`, `bucket`, `name`).

## Installation

=== "pip"

    ```bash
    pip install "railtracks[gcp]"
    ```

=== "uv"

    ```bash
    uv add "railtracks[gcp]"
    ```

## Authentication

Authentication uses **Application Default Credentials (ADC)** by default. ADC checks the following sources in order:

1. `GOOGLE_APPLICATION_CREDENTIALS` environment variable (path to a service-account JSON)
2. `gcloud auth application-default login` (developer workstation)
3. Workload Identity / attached service account (GCE, GKE, Cloud Run, Cloud Functions …)

Pass explicit `credentials` to override ADC.
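
When it is unclear which ADC source a machine will use, a rough local check can narrow it down. The sketch below is an approximation under stated assumptions: `likely_adc_source` is a hypothetical helper, it only inspects the environment variable and the default Linux/macOS location of the file written by `gcloud auth application-default login`, and it does not contact Google or validate any credential.

```python
import os
from pathlib import Path


def likely_adc_source(environ=None, home=None):
    """Guess which ADC source would be tried first on this machine.

    Hypothetical helper for illustration; `environ` and `home` can be
    supplied to test without touching the real environment.
    """
    env = os.environ if environ is None else environ
    home_dir = Path.home() if home is None else Path(home)

    # Source 1: an explicitly configured credentials file.
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return "GOOGLE_APPLICATION_CREDENTIALS"

    # Source 2: user credentials stored by the gcloud CLI. This is the
    # default path on Linux and macOS; Windows stores the file elsewhere.
    well_known = home_dir / ".config" / "gcloud" / "application_default_credentials.json"
    if well_known.is_file():
        return "gcloud application-default login"

    # Source 3: only available when running on GCP-hosted compute.
    return "attached service account"


if __name__ == "__main__":
    print(likely_adc_source())
```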

!!! tip "Prefer Workload Identity over service-account key files"
    Service-account JSON key files are long-lived credentials that must be
    rotated manually. On GCP-hosted compute, Workload Identity or an attached
    service account is more secure and needs no key management.

## Basic usage

```python
--8<-- "docs/scripts/storage_loaders.py:gcs_basic"
```

## Load by prefix

```python
--8<-- "docs/scripts/storage_loaders.py:gcs_prefix"
```

## Load specific objects

```python
--8<-- "docs/scripts/storage_loaders.py:gcs_load_keys"
```

## Async usage

```python
--8<-- "docs/scripts/storage_loaders.py:gcs_async"
```

!!! note "Async is thread-backed"
    `aload()` and `aload_keys()` run the synchronous `google-cloud-storage`
    client on a worker thread via `asyncio.to_thread()`. This is sufficient
    for most workloads.

## Override credentials (service account key file)

```python
--8<-- "docs/scripts/storage_loaders.py:gcs_service_account"
```

## Chunk metadata

Each returned `Chunk` carries:

| Key | Value |
|---|---|
| `source` | `gs://<bucket>/<name>` |
| `bucket` | GCS bucket name |
| `name` | Object name (path within the bucket) |
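
Going the other way, a `gs://<bucket>/<name>` value can be split back into its `bucket` and `name` parts. This is a minimal sketch using a hypothetical helper, `split_gs_uri`; it uses plain string splitting rather than `urllib.parse` so that object names containing `?` or `#` are kept intact.

```python
def split_gs_uri(uri):
    """Split 'gs://<bucket>/<name>' into (bucket, name).

    Illustrative helper only; raises ValueError for anything that is
    not a gs:// URI with a bucket.
    """
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    # Everything up to the first "/" is the bucket; the rest is the object name.
    bucket, _, name = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket in: {uri!r}")
    return bucket, name


if __name__ == "__main__":
    print(split_gs_uri("gs://my-bucket/reports/2024/q1.txt"))
```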

## Full RAG pipeline example

```python
--8<-- "docs/scripts/storage_loaders.py:pipeline_gcs_to_rag"
```

---

## Writing to GCS

`GCSWriter` uploads text content to a GCS bucket. If an object with the same
name already exists, it is overwritten.

### Basic write

```python
--8<-- "docs/scripts/storage_writers.py:gcs_write_basic"
```

### Service account credentials

```python
--8<-- "docs/scripts/storage_writers.py:gcs_write_service_account"
```

### Async write

```python
--8<-- "docs/scripts/storage_writers.py:gcs_write_async"
```
136 changes: 136 additions & 0 deletions docs/integrations/storage/overview.md
@@ -0,0 +1,136 @@
# Cloud Storage & Database Loaders / Writers

Railtracks ships first-class **loaders** and **writers** for popular cloud
storage providers and relational databases.

- **Loaders** fetch documents and return them as
  [`Chunk`](../../rag/vector_stores/vector_store_info.md) objects, so remote
  data can go straight into a vector store or agent without glue code.
- **Writers** persist `Chunk` objects (or raw text) back to the same providers,
  so AI-generated content can be saved back to storage.

## Supported providers

| Provider | Loader | Writer | Install extra |
|---|---|---|---|
| AWS S3 | `S3Loader` | `S3Writer` | `railtracks[aws]` |
| Azure Blob Storage | `AzureBlobLoader` | `AzureBlobWriter` | `railtracks[azure-blob]` |
| Google Cloud Storage | `GCSLoader` | `GCSWriter` | `railtracks[gcp]` |
| SQL (PostgreSQL, Supabase, MySQL, SQLite …) | `SQLLoader` | `SQLWriter` | `railtracks[sql]` |

Install any combination:

=== "pip"

    ```bash
    pip install "railtracks[aws,gcp,azure-blob,sql]"
    ```

=== "uv"

    ```bash
    uv add "railtracks[aws,gcp,azure-blob,sql]"
    ```

## Loading — quick examples

=== "AWS S3"

    ```python
    --8<-- "docs/scripts/storage_loaders.py:s3_basic"
    ```

=== "Azure Blob"

    ```python
    --8<-- "docs/scripts/storage_loaders.py:azure_basic"
    ```

=== "Google Cloud Storage"

    ```python
    --8<-- "docs/scripts/storage_loaders.py:gcs_basic"
    ```

=== "SQL / Database"

    ```python
    --8<-- "docs/scripts/storage_loaders.py:sql_basic_postgres"
    ```

## Writing — quick examples

=== "AWS S3"

    ```python
    --8<-- "docs/scripts/storage_writers.py:s3_write_basic"
    ```

=== "Azure Blob"

    ```python
    --8<-- "docs/scripts/storage_writers.py:azure_write_basic"
    ```

=== "Google Cloud Storage"

    ```python
    --8<-- "docs/scripts/storage_writers.py:gcs_write_basic"
    ```

=== "SQL / Database"

    ```python
    --8<-- "docs/scripts/storage_writers.py:sql_write_basic"
    ```

## Feeding chunks into a RAG pipeline

All loaders return the same `Chunk` type that `ChromaVectorStore.upsert()` accepts,
so a full load → index → retrieve → answer pipeline needs no conversion step:

```python
--8<-- "docs/scripts/storage_loaders.py:pipeline_s3_to_rag"
```

## Load → Generate → Write back

Writers make it easy to persist AI-generated content alongside source data:

```python
--8<-- "docs/scripts/storage_writers.py:pipeline_generate_and_write"
```

## Async support

Loaders expose the async variants `aload` and `aload_keys`, and writers expose
`awrite` and `awrite_key`; all are safe to use in `async` agent pipelines:

```python
chunks = await loader.aload(prefix="reports/2024/")
uris = await writer.awrite(chunks, prefix="summaries/")
```

The async methods delegate to `asyncio.to_thread()`, so they are non-blocking
from the caller's perspective while the underlying SDK call runs on a thread-pool
thread.
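
Because each async call only awaits a worker thread, several loads can be started together with `asyncio.gather()`. The sketch below uses `FakeLoader`, a hypothetical stand-in with the same `load`/`aload` method names, so it runs without any cloud SDK; it is not the Railtracks implementation.

```python
import asyncio
import time


class FakeLoader:
    """Hypothetical stand-in; a real loader would call a cloud SDK."""

    def load(self, prefix):
        time.sleep(0.05)  # simulate a blocking network call
        return [f"{prefix}a.txt", f"{prefix}b.txt"]

    async def aload(self, prefix):
        # Delegate the blocking call to a worker thread.
        return await asyncio.to_thread(self.load, prefix)


async def load_prefixes(loader, prefixes):
    # gather() returns results in the same order as its arguments,
    # so they can be zipped back onto the prefixes.
    results = await asyncio.gather(*(loader.aload(p) for p in prefixes))
    return dict(zip(prefixes, results))


if __name__ == "__main__":
    loaded = asyncio.run(load_prefixes(FakeLoader(), ["reports/2023/", "reports/2024/"]))
    for prefix, names in loaded.items():
        print(prefix, names)
```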

## Key derivation for writers

When writing `Chunk` objects, the storage key (S3 key, GCS object name, blob
name, SQL id) is derived in this order:

1. Return value of `key_fn(chunk)` — if `key_fn` is provided
2. `chunk.id` — if set
3. `chunk.document` — if set
4. A freshly generated UUID4 — as a last resort

Pass `key_fn` to take full control of the naming scheme:

```python
writer = S3Writer("my-bucket", key_fn=lambda c: f"docs/{c.id}.txt")
```
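
The fallback order above can be written out as a short function. This is a sketch of the documented order, not the library's code: `FakeChunk` and `derive_key` are hypothetical names, and treating an empty string as "not set" is an assumption of this example.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeChunk:
    """Stand-in with only the fields the list refers to; not the real Chunk."""

    content: str
    id: Optional[str] = None
    document: Optional[str] = None


def derive_key(chunk, key_fn=None):
    """Pick a storage key using the documented fallback order."""
    if key_fn is not None:
        return key_fn(chunk)  # 1. caller-supplied naming scheme
    if chunk.id:
        return chunk.id  # 2. chunk id
    if chunk.document:
        return chunk.document  # 3. source document name
    return str(uuid.uuid4())  # 4. random UUID4 as a last resort


if __name__ == "__main__":
    chunk = FakeChunk("hello", id="abc", document="notes.txt")
    print(derive_key(chunk, key_fn=lambda c: f"docs/{c.id}.txt"))
    print(derive_key(chunk))
```

Note that the UUID fallback produces a new key on every call, so writing the same id-less chunk twice would create two objects rather than overwrite one.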

!!! tip "Next steps"
    - [AWS S3](s3.md) · [Azure Blob Storage](azure_blob.md) · [Google Cloud Storage](gcs.md) · [SQL](sql.md)
    - [Cloud Storage Loaders Tutorial](../../tutorials/walkthroughs/storage_loaders_tutorial.md)