Multi-tenant webhook ingestion and job orchestration platform built with Rust.
Production webhook systems fail in predictable ways:
- duplicate deliveries create duplicate jobs
- direct publish to MQ causes DB/MQ inconsistency
- missing idempotency creates non-deterministic behavior
- weak observability makes incidents hard to debug
- multi-tenant secret management is often unsafe or ad hoc
- admin endpoints without RBAC are a security risk
Relayforge was built to solve these issues with production-style architecture, not a demo-only setup.
- Accepts webhooks via
POST /webhooks/{source} - Verifies signatures for:
generic(X-Signature)github(X-Hub-Signature-256)stripe(Stripe-Signature)
- Normalizes provider payloads into a shared
DomainEvent - Persists inbox + jobs in Postgres with idempotent upsert
- Uses Outbox Pattern for reliable RabbitMQ publishing
- Runs worker publisher/consumer with retry + dead-letter flow
- Exposes health/readiness:
GET /healthzGET /readyz
- Exposes observability:
GET /metrics(Prometheus)- structured logs with correlation IDs
- Exposes docs:
GET /openapi.jsonGET /docs/(Swagger UI)
- Protects admin routes (
/admin/*) with Bearer auth + RBAC (viewer/operator/admin)
flowchart LR
A["Webhook Provider"] --> B["API Axum"]
B --> C["Connector verify/parse/normalize"]
C --> D[("webhook_inbox")]
D --> E[("jobs")]
E --> F[("outbox_messages")]
F --> G["RabbitMQ"]
G --> H["Worker Consumer"]
H --> E
B --> I["/metrics"]
B --> J["/openapi.json + /docs/"]
| Technology | Where It Is Used | Why It Was Chosen |
|---|---|---|
tokio |
API and worker runtime | Mature async ecosystem in Rust |
axum |
HTTP routes/handlers | Clean integration with Tower middleware |
hyper |
Underlying HTTP engine | High-performance and stable transport layer |
tower, tower-http |
request-id, tracing, limits, panic catch | Composable middleware and hardening controls |
serde, serde_json |
payload parsing and response serialization | Standard Rust JSON tooling |
sqlx + Postgres |
persistence and migrations | type-safe queries + operationally reliable DB |
lapin + RabbitMQ |
publisher/consumer messaging | explicit control over publish/consume semantics |
tracing, tracing-subscriber |
structured logs and trace context | fast debugging with correlation IDs |
prometheus |
metrics registry/export | standard monitoring integration |
hmac, sha2, subtle |
signature verification | secure HMAC checks + constant-time compare |
testcontainers + reqwest |
integration and E2E tests | realistic tests against real dependencies |
k6 |
load testing | reproducible performance proof for resume |
- Signature verification (
Stripe-Signature) is required. - Timestamp freshness is enforced with tolerance window:
WEBHOOKS__STRIPE_SIGNATURE_TOLERANCE_SECONDS
- Event-level idempotency store is persisted in DB:
- table:
stripe_event_idempotency - uniqueness:
(tenant_id, event_id)
- table:
- Replay behavior:
- stale timestamp ->
401 - same event replay for same tenant ->
409
- stale timestamp ->
- Per-tenant token bucket rate limiter (in-memory):
HARDENING__TENANT_RATE_LIMIT_PER_SECONDHARDENING__TENANT_RATE_LIMIT_BURST
- Per-tenant daily quota (DB-backed atomic upsert):
HARDENING__TENANT_DAILY_QUOTA- table:
tenant_daily_quota_usage
- Response on limit/quota hit:
429
- Global ingress rate limiting on webhook routes
- Request body size limit (
413on oversized payload) - Provider payload schema validation (
serde+ strict parsing) - Tenant/provider secrets loaded from DB first, env as fallback
- Admin API key validation with hashed keys + RBAC
Implemented metrics:
http_requests_total{route,method,status}http_request_duration_seconds(histogram)jobs_created_total{source}jobs_processed_total{status}job_processing_duration_seconds(histogram)job_retries_totaldlq_messages_totaloutbox_pending_count(gauge)
Distributed tracing:
- OpenTelemetry (OTLP) pipeline via
tracing-opentelemetry - W3C propagation headers:
traceparent,tracestate,baggage - end-to-end context propagation across API -> outbox -> worker retries
- key env vars:
OTEL__ENABLED,OTEL__EXPORTER_OTLP_ENDPOINT,OTEL__EXPORTER_OTLP_TIMEOUT_MS
Local observability stack:
- Prometheus config:
observability/prometheus/prometheus.yml - Grafana provisioning and dashboard:
observability/grafana/ - Alert rules:
observability/prometheus/alerts.yml - SLO definitions:
observability/slo.md - Operational runbook:
observability/runbook.md - Start with:
docker compose up -d prometheus grafana alertmanager otel-collectorReproducible load script:
./scripts/perf/run_ingest_perf.shLatest recorded run (2026-02-14, RATE=300, DURATION=20s):
| Metric | Value |
|---|---|
| Throughput (req/s) | 164.50 |
| p95 latency (ms) | 2373.05 |
| p99 latency (ms) | 2392.12 |
| Error rate (%) | 0.00 |
| API peak CPU (%) | 71.00 |
| API peak RAM (MiB) | 48.28 |
Hot-path benchmark:
ITERATIONS=100000 ./scripts/perf/run_hot_path_bench.sh| Case | Payload bytes | Ops/s |
|---|---|---|
| hmac_signature_hex | 256 | 855,855.71 |
| payload_hash_hex | 256 | 1,878,018.03 |
| hmac_signature_hex | 1024 | 471,231.42 |
| payload_hash_hex | 1024 | 502,538.55 |
| hmac_signature_hex | 4096 | 136,971.16 |
| payload_hash_hex | 4096 | 143,859.95 |
webhook_inbox: raw body, headers, payload hash, acceptance metadatajobs: dedup key, status machine, attempts, last_error, first/last inbox refsoutbox_messages: reliable publish state with retry scheduling fieldstenant_provider_secrets: per-tenant/per-provider secretsadmin_api_keys: hashed keys with rolestripe_event_idempotency: anti-replay idempotency storetenant_daily_quota_usage: per-tenant daily usage counter
Run all API tests:
cargo test -p apiImportant coverage:
hardening_abuse: burst abuse (1000req) and oversized payload (413)level9_multi_tenant_connectors: tenant secret resolution + event normalizationadmin_rbac: role-based authorization checks for admin namespacedocs_openapi: OpenAPI/Swagger endpoints availabilitytenant_protection: Stripe anti-replay + per-tenant rate/quota enforcementworker/level5_e2e: retry, dead-letter, reprocess flow with real Postgres/RabbitMQ
docker compose up -d
cp .env.example .env
cargo run -p api --bin apiThen open:
http://127.0.0.1:8080/healthzhttp://127.0.0.1:8080/readyzhttp://127.0.0.1:8080/docs/http://127.0.0.1:8080/metrics
| Challenge | Implementation |
|---|---|
| Duplicate webhooks creating duplicate jobs | transactional upsert + unique dedup key + inbox linkage |
| DB/MQ consistency risk | Outbox pattern instead of direct publish |
| Stripe replay risks | freshness window + DB-backed event idempotency |
| Multi-tenant secret complexity | DB source of truth + env fallback |
| Abuse traffic and noisy tenants | ingress limiter + per-tenant limiter + daily quota |
| Admin endpoint exposure | Bearer auth + role hierarchy + hashed DB keys |
| Low observability in incidents | structured logs + correlation IDs + Prometheus + Grafana |