chore: improvements to clickhouse and data masking#2985
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
@claude review |
| <Callout type="warning"> | ||
|
|
||
| Server-side masking is a centralized safety net, not a replacement for client-side masking when sensitive data must never leave the application boundary. | ||
| In the self-hosted ingestion pipeline, events are written to the event blob storage bucket before the Worker calls the masking callback. | ||
| The callback masks data before it is processed into ClickHouse and downstream Langfuse views. | ||
|
|
||
| </Callout> |
There was a problem hiding this comment.
Blob storage data fate after masking is unspecified
The new warning correctly flags that events land in blob storage before the masking callback runs, but it doesn't say whether that unmasked blob-storage copy is ever replaced or deleted after the Worker processes the masked version. Looking at the sequence diagram (SDK → Web → S3 (unmasked) → Worker → Callback → ClickHouse), there is no "update S3" or "delete from S3" step. If unmasked data persists in blob storage indefinitely, users deploying this feature for GDPR, HIPAA, or PCI DSS compliance may believe their storage layer is protected when only ClickHouse and downstream views receive masked data. A single sentence clarifying the blob-storage retention behavior (e.g., whether it is retained as-is, or whether client-side masking is required to prevent PII reaching blob storage at all) would prevent a compliance blind-spot for operators who read this warning and draw the wrong conclusion.
Prompt To Fix With AI
This is a comment left during a code review.
Path: content/self-hosting/security/data-masking.mdx
Line: 55-61
Comment:
**Blob storage data fate after masking is unspecified**
The new warning correctly flags that events land in blob storage *before* the masking callback runs, but it doesn't say whether that unmasked blob-storage copy is ever replaced or deleted after the Worker processes the masked version. Looking at the sequence diagram (SDK → Web → S3 (unmasked) → Worker → Callback → ClickHouse), there is no "update S3" or "delete from S3" step. If unmasked data persists in blob storage indefinitely, users deploying this feature for GDPR, HIPAA, or PCI DSS compliance may believe their storage layer is protected when only ClickHouse and downstream views receive masked data. A single sentence clarifying the blob-storage retention behavior (e.g., whether it is retained as-is, or whether client-side masking is required to prevent PII reaching blob storage at all) would prevent a compliance blind-spot for operators who read this warning and draw the wrong conclusion.
How can I resolve this? If you propose a fix, please make it concise.| | `CLICKHOUSE_DB` | `default` | Name of the ClickHouse database to use. | | ||
| | `CLICKHOUSE_CLUSTER_ENABLED` | `true` | Whether to run ClickHouse commands `ON CLUSTER`. Set to `false` for single-container setups. | | ||
| | `LANGFUSE_AUTO_CLICKHOUSE_MIGRATION_DISABLED` | `false` | Whether to disable automatic ClickHouse migrations. | | ||
| | `CLICKHOUSE_READ_ONLY_URL` | | Optional read-only endpoint used for public-API reads and selected UI/filter read queries. Falls back to `CLICKHOUSE_URL` when unset. Reuses `CLICKHOUSE_USER`, `CLICKHOUSE_PASSWORD`, and `CLICKHOUSE_DB`. Only useful on [compute-compute separated](https://clickhouse.com/docs/cloud/reference/warehouses) clusters. See [Scaling](/self-hosting/configuration/scaling#clickhouse-read-only-url). | |
There was a problem hiding this comment.
🔴 The CLICKHOUSE_READ_ONLY_URL description is narrowed here (lines 37 and 120) to 'public-API reads and selected UI/filter read queries', but two sibling docs still carry the original broader wording and now contradict this page: content/self-hosting/configuration/index.mdx:25 (master env-var index) still says 'UI and public-API read queries on traces, observations, scores, and sessions', and content/self-hosting/configuration/scaling.mdx:77 — the page this row links to as 'See [Scaling]' — still says 'route UI and public-API read queries to the given endpoint'. Please update those two locations to match the new narrower wording so users following the link don't get conflicting answers about what the env var actually routes.
Extended reasoning...
The bug
This PR intentionally narrows the description of CLICKHOUSE_READ_ONLY_URL in content/self-hosting/deployment/infrastructure/clickhouse.mdx in two places:
- Line 37 (env-var table): 'UI and public-API read queries on traces, observations, scores, and sessions' → 'public-API reads and selected UI/filter read queries'
- Line 120 (Cloud/BYOC narrative): 'read-heavy UI and public-API traffic' → 'public-API reads and selected UI/filter read traffic'
But the same variable is described in two other docs that were not touched and now disagree with the new wording.
Where the contradictions live
content/self-hosting/configuration/index.mdx:25— the master self-hosting env-var index table — still says:Optional read-only endpoint used for UI and public-API read queries on traces, observations, scores, and sessions.
content/self-hosting/configuration/scaling.mdx:72-77— the deep-dive page that this PR's new row explicitly links to as the canonical reference viaSee [Scaling](/self-hosting/configuration/scaling#clickhouse-read-only-url)— still says:This keeps dashboard and public-API read traffic from contending with ingestion inserts… Set
CLICKHOUSE_READ_ONLY_URLand Langfuse will route UI and public-API read queries to the given endpoint while writes, migrations, and ingestion continue to useCLICKHOUSE_URL.
Step-by-step proof a reader gets contradictory information
- A self-hosting operator lands on
/self-hosting/deployment/infrastructure/clickhouse.mdxand reads the new row at line 37: only public-API reads and selected UI/filter read queries are routed. - They click the in-row link
See [Scaling](/self-hosting/configuration/scaling#clickhouse-read-only-url)to learn more. - The target section on scaling.mdx claims all UI and public-API read queries are routed — a strictly broader scope than what the page they came from just promised.
- If they cross-check the central env-var index at
/self-hosting/configuration, they see yet a third variant of the same description (broader scope plus a specific entity list 'traces, observations, scores, and sessions' that the new wording dropped).
Impact
Documentation-only inconsistency, no runtime behavior change. But it is directly caused by this PR (the prior state of the docs was self-consistent at the broader wording), and the inconsistency is one hop away from the edited row via a link the PR itself places. A user trying to decide whether to provision a separate ClickHouse read compute group gets three different answers about what reads will be served from it.
Fix
Update the two sibling occurrences to the narrowed wording introduced here:
content/self-hosting/configuration/index.mdx:25— replace the env-var row description with wording consistent with the new clickhouse.mdx text.content/self-hosting/configuration/scaling.mdx:72-77— adjust 'dashboard and public-API read traffic' and 'route UI and public-API read queries' to reflect that only the public API plus selected UI/filter reads are routed.
Decide whether the dropped entity list ('traces, observations, scores, and sessions') should be retained anywhere, or whether 'selected UI/filter read queries' is sufficiently specific in the new framing.
Greptile Summary
This PR improves two self-hosting documentation pages: it adds a "Direct ClickHouse Access for Custom Tools" section to
clickhouse.mdx, and refinesdata-masking.mdxwith a new blob-storage timing warning, a corrected error-handling table (fail-open is the actual default, not fail-closed), and fixed container references (Worker vs Web) throughout the troubleshooting section.clickhouse.mdx: Adds guidance for users querying ClickHouse directly, clarifies theCLICKHOUSE_READ_ONLY_URLscope, and aligns the Cloud/BYOC recommendation paragraph with the updated description.data-masking.mdx: Fixes the inverted error-handling table (default is fail-open/false, not fail-closed), corrects "Langfuse Web container" → "Langfuse Worker container" in troubleshooting, adds a warning that events are written to blob storage before the masking callback runs, and clarifies that only invalid JSON (not structural schema mismatch) triggers error-handling behavior.Confidence Score: 4/5
Safe to merge for the ClickHouse page; the data-masking page would benefit from one additional sentence clarifying blob-storage retention before operators rely on it for compliance decisions.
Both files are documentation-only changes. The ClickHouse additions are clear and accurate. The data-masking corrections (table column order, container name fixes) are genuine improvements. The one gap is that the new blob-storage warning in data-masking.mdx omits whether unmasked blob-storage data is ever cleaned up — this matters for operators reading this page to evaluate GDPR/HIPAA compliance posture, and a misleading silence there could lead to incorrect architectural decisions.
content/self-hosting/security/data-masking.mdx — the blob-storage data retention clause in the new warning callout (lines 55–61).
Security Review
data-masking.mdxnow discloses that events are written to blob storage before the masking callback runs, meaning blob storage retains unmasked PII. The documentation does not clarify whether this copy is ever replaced or deleted. Operators deploying server-side masking for GDPR/HIPAA/PCI DSS compliance may incorrectly assume their entire storage layer is protected.Sequence Diagram
sequenceDiagram participant SDK as Langfuse SDK participant Web as Langfuse Web participant Bucket as Blob Storage (S3) participant Worker as Langfuse Worker participant Callback as Masking Callback participant CH as ClickHouse SDK->>Web: Send trace event (may contain PII) Web->>Bucket: Store UNMASKED event Note over Bucket: ⚠️ Unmasked data persists here Web->>Worker: Forward S3 reference (via Redis) Worker->>Bucket: Fetch unmasked event Worker->>Callback: POST OpenTelemetry object alt Callback succeeds Callback->>Worker: Return masked object Worker->>CH: Write MASKED data else Fail open (default) Worker->>CH: Write UNMASKED data, log warning else "Fail closed (FAIL_CLOSED=true)" Worker->>Worker: Drop event, log warning endPrompt To Fix All With AI
Reviews (1): Last reviewed commit: "improvements to clickhouse and data mask..." | Re-trigger Greptile