diff --git a/src/content/changelog/r2-sql/2026-05-14-joins-subqueries-multi-table-queries.mdx b/src/content/changelog/r2-sql/2026-05-14-joins-subqueries-multi-table-queries.mdx new file mode 100644 index 000000000000000..3c808a6c8c1125a --- /dev/null +++ b/src/content/changelog/r2-sql/2026-05-14-joins-subqueries-multi-table-queries.mdx @@ -0,0 +1,71 @@ +--- +title: R2 SQL now supports JOINs, subqueries, and multi-table queries +description: Join multiple Iceberg tables, use subqueries, and write multi-table CTEs in R2 SQL. +products: + - r2-sql +date: 2026-05-15 +--- + +[R2 SQL](/r2-sql/) is Cloudflare's serverless, distributed SQL engine for querying [Apache Iceberg](https://iceberg.apache.org/) tables stored in [R2 Data Catalog](/r2/data-catalog/). R2 SQL runs directly on Cloudflare's global network with no infrastructure to manage, so you can analyze data in R2 without exporting it to an external warehouse. + +R2 SQL now supports joining multiple Iceberg tables in a single query. You can combine tables with JOINs, filter with subqueries, and define multi-table CTEs to build complex analytical queries. 
+ +## New capabilities + +- **JOINs** — `INNER JOIN`, `LEFT JOIN`, `RIGHT JOIN`, `FULL OUTER JOIN`, `CROSS JOIN`, and implicit joins (comma-separated `FROM` with conditions in `WHERE`) +- **Subqueries** — `IN` / `NOT IN`, `EXISTS` / `NOT EXISTS`, scalar subqueries in `SELECT` / `WHERE` / `HAVING`, and derived tables (subqueries in `FROM`) +- **Multi-table CTEs** — `WITH` clauses can reference different tables and include JOINs +- **Self-joins** — join a table with itself using different aliases +- **Multi-way joins** — join three or more tables in a single query + +## Examples + +### Two-table JOIN with aggregation + +```sql +SELECT z.domain, z.plan, COUNT(*) AS request_count +FROM my_namespace.zones z +INNER JOIN my_namespace.http_requests h ON z.zone_id = h.zone_id +WHERE z.plan = 'enterprise' +GROUP BY z.domain, z.plan +ORDER BY request_count DESC +LIMIT 20 +``` + +### `EXISTS` subquery + +```sql +SELECT z.domain, z.plan +FROM my_namespace.zones z +WHERE EXISTS ( + SELECT 1 FROM my_namespace.firewall_events f + WHERE f.zone_id = z.zone_id AND f.action = 'block' +) +ORDER BY z.domain +LIMIT 20 +``` + +### Multi-table CTE with JOIN + +```sql +WITH top_zones AS ( + SELECT zone_id, COUNT(*) AS req_count + FROM my_namespace.http_requests + GROUP BY zone_id + ORDER BY req_count DESC + LIMIT 50 +), +zone_threats AS ( + SELECT zone_id, COUNT(*) AS threat_count + FROM my_namespace.firewall_events + WHERE risk_score > 0.5 + GROUP BY zone_id +) +SELECT tz.zone_id, tz.req_count, COALESCE(zt.threat_count, 0) AS threat_count +FROM top_zones tz +LEFT JOIN zone_threats zt ON tz.zone_id = zt.zone_id +ORDER BY tz.req_count DESC +LIMIT 20 +``` + +For the full syntax reference, refer to the [SQL reference](/r2-sql/sql-reference/). For performance guidance with joins, refer to [Limitations and best practices](/r2-sql/reference/limitations-best-practices/). 
diff --git a/src/content/docs/r2-sql/reference/limitations-best-practices.mdx b/src/content/docs/r2-sql/reference/limitations-best-practices.mdx index e1a4e22065350e2..ca3746a34a661dd 100644 --- a/src/content/docs/r2-sql/reference/limitations-best-practices.mdx +++ b/src/content/docs/r2-sql/reference/limitations-best-practices.mdx @@ -28,9 +28,14 @@ This page summarizes supported features, limitations, and best practices. | 33 aggregate functions | Yes | Basic, approximate, statistical, bitwise, boolean, positional | | Approximate aggregates | Yes | `approx_distinct`, `approx_median`, `approx_percentile_cont`, `approx_top_k` | | Struct / Array / Map column types | Yes | Bracket notation, `get_field()`, array functions, map functions | -| CTEs (`WITH ... AS`) | Yes | Single-table only. No JOINs or cross-table references within CTEs. | -| JOINs | No | Single-table only | -| Subqueries | No | | +| CTEs (`WITH ... AS`) | Yes | Can reference different tables and include JOINs | +| JOINs (INNER, LEFT, RIGHT, FULL OUTER, CROSS) | Yes | All standard join types | +| Implicit joins (comma FROM) | Yes | | +| Subqueries (`IN`, `NOT IN`) | Yes | `NOT IN` not supported on nullable columns — use `NOT EXISTS` instead | +| Subqueries (`EXISTS`, `NOT EXISTS`) | Yes | semi-join and anti-join patterns | +| Scalar subqueries | Yes | In SELECT, WHERE, HAVING | +| Derived tables (FROM subqueries) | Yes | Can be nested and joined. `LATERAL` derived tables not supported. | +| Self-joins | Yes | Same table with different aliases | | Window functions (`OVER`) | No | | | `SELECT DISTINCT` | No | Use `approx_distinct` | | `OFFSET` | No | | @@ -46,9 +51,6 @@ For the full SQL syntax, refer to the [SQL reference](/r2-sql/sql-reference/). 
| Feature | Error | | :---------------------------------------------------------------------------- | :------------------------------------------------------- | -| JOINs (any type) | `unsupported feature: JOIN operations are not supported` | -| Multi-table CTEs (JOINs or cross-table references within `WITH`) | Single-table CTEs are supported | -| Subqueries (FROM, WHERE, scalar) | `unsupported feature: subqueries` | | `SELECT DISTINCT` | `unsupported feature: SELECT DISTINCT is not supported` | | `OFFSET` | `unsupported feature: OFFSET clause is not supported` | | `UNION` / `INTERSECT` / `EXCEPT` | Set operations not supported | @@ -57,6 +59,8 @@ For the full SQL syntax, refer to the [SQL reference](/r2-sql/sql-reference/). | `CREATE` / `DROP` / `ALTER` | `only read-only queries are allowed` | | `UNNEST` / `PIVOT` / `UNPIVOT` | Not supported | | Wildcard modifiers (`ILIKE`, `EXCLUDE`, `EXCEPT`, `REPLACE`, `RENAME` on `*`) | Not supported | +| Nested (parenthesized) joins | Not supported | +| `LATERAL` derived tables | Not supported | | `LATERAL VIEW` / `QUALIFY` | Not supported | --- @@ -70,9 +74,7 @@ For the full SQL syntax, refer to the [SQL reference](/r2-sql/sql-reference/). | `MEDIAN` | Use [`approx_median`](/r2-sql/sql-reference/aggregate-functions/#approx_median) | | `ARRAY_AGG` | No alternative (unsupported for memory safety) | | `STRING_AGG` | No alternative (unsupported for memory safety) | -| Scalar subqueries (`SELECT ... WHERE x = (SELECT ...)`) | Not supported | -| `EXISTS (SELECT ...)` | Not supported | -| `IN (SELECT ...)` | Use `IN (value1, value2, ...)` with a literal list | +| `NOT IN` subquery on nullable columns | Use `NOT EXISTS` with a correlated subquery instead | --- @@ -80,7 +82,7 @@ For the full SQL syntax, refer to the [SQL reference](/r2-sql/sql-reference/). 
| Constraint | Details | | :----------------------------------- | :---------------------------------------------------------------------------------------------------- | -| Single table per query | Queries must reference exactly one table. No JOINs, no subqueries. CTEs may reference a single table. | +| Multi-table queries | JOINs, subqueries (IN, EXISTS, scalar, derived tables), and multi-table CTEs are supported. Performance depends on intermediate result size; use WHERE filters to manage join selectivity. | | Partitioned and unpartitioned tables | Both partitioned and unpartitioned Iceberg tables are supported. | | Parquet format only | No CSV, JSON, or other formats. | | Read-only | R2 SQL is a query engine, not a database. No writes. | @@ -106,3 +108,7 @@ For the full SQL syntax, refer to the [SQL reference](/r2-sql/sql-reference/). 4. Use approximate aggregation functions (`approx_distinct`, `approx_median`, `approx_percentile_cont`) instead of exact alternatives on large datasets. 5. Enable compaction in R2 Data Catalog to reduce the number of files scanned per query. 6. Use `EXPLAIN` to inspect the execution plan and verify predicate pushdown. +7. Use `WHERE` filters with multi-way joins to reduce intermediate result sizes. Joining three or more large tables without filters can exceed resource limits. +8. Join large fact tables through dimension tables rather than directly joining two large fact tables. For example, join `http_requests` to `firewall_events` through a shared `zones` dimension rather than cross-joining both fact tables. +9. Be cautious with `COUNT(DISTINCT)` across multi-way joins. This combination can produce very large intermediate results. Consider using `approx_distinct()` or breaking the query into smaller steps. +10. Use explicit `JOIN` syntax instead of implicit joins (comma-separated `FROM`) for readability and to ensure the optimizer can choose optimal join ordering. 
diff --git a/src/content/docs/r2-sql/sql-reference/index.mdx b/src/content/docs/r2-sql/sql-reference/index.mdx index 7537a8b1cbff95d..70227b97d4db20a 100644 --- a/src/content/docs/r2-sql/sql-reference/index.mdx +++ b/src/content/docs/r2-sql/sql-reference/index.mdx @@ -25,6 +25,7 @@ R2 SQL is Cloudflare's serverless, distributed, analytics query engine for query ```sql SELECT column_list | expression | aggregation_function FROM namespace_name.table_name +[JOIN namespace_name.table_name ON condition] [WHERE conditions] [GROUP BY column_list] [HAVING conditions] @@ -98,7 +99,7 @@ SELECT region, total_amount * 1.1 AS total_with_tax FROM my_namespace.sales_data ## Common table expressions (CTEs) -CTEs let you define named temporary result sets using `WITH` that you can reference in the main query. All CTEs must reference the same single table. +CTEs let you define named temporary result sets using `WITH` that you can reference in the main query. CTEs can reference different tables and can include JOINs. A CTE can also be joined with other CTEs or regular tables in the main query. ### Syntax @@ -113,7 +114,7 @@ SELECT ... FROM cte_name ### Chained CTEs -A CTE can reference a previously defined CTE. All CTEs in the chain must derive from the same underlying table. +A CTE can reference a previously defined CTE. ```sql WITH filtered AS ( @@ -134,9 +135,44 @@ WHERE order_count > 100 ORDER BY avg_amount DESC ``` -:::note -CTEs must reference a single table. Multi-table CTEs, JOINs within CTEs, and cross-table references are not supported. 
-::: +### CTE joined with another table + +```sql +WITH enterprise_zones AS ( + SELECT zone_id, domain, plan + FROM my_namespace.zones + WHERE plan = 'enterprise' +) +SELECT ez.domain, f.action, COUNT(*) AS cnt +FROM enterprise_zones ez +INNER JOIN my_namespace.firewall_events f ON ez.zone_id = f.zone_id +GROUP BY ez.domain, f.action +ORDER BY cnt DESC +LIMIT 20 +``` + +### Two CTEs joined together + +```sql +WITH top_zones AS ( + SELECT zone_id, COUNT(*) AS req_count + FROM my_namespace.http_requests + GROUP BY zone_id + ORDER BY req_count DESC + LIMIT 50 +), +zone_threats AS ( + SELECT zone_id, COUNT(*) AS threat_count + FROM my_namespace.firewall_events + WHERE risk_score > 0.5 + GROUP BY zone_id +) +SELECT tz.zone_id, tz.req_count, COALESCE(zt.threat_count, 0) AS threat_count +FROM top_zones tz +LEFT JOIN zone_threats zt ON tz.zone_id = zt.zone_id +ORDER BY tz.req_count DESC +LIMIT 20 +``` --- @@ -148,7 +184,233 @@ CTEs must reference a single table. Multi-table CTEs, JOINs within CTEs, and cro SELECT * FROM namespace_name.table_name ``` -R2 SQL queries reference exactly one table, specified as `namespace_name.table_name`. +R2 SQL queries can reference one or more tables. Tables are specified as `namespace_name.table_name`. Multiple tables can be combined using JOINs or comma-separated syntax. Refer to the [JOIN clause](#join-clause) section for details. + +--- + +## JOIN clause + +R2 SQL supports joining multiple Iceberg tables in a single query. All join types use standard SQL syntax. + +### Supported join types + +| Join type | Syntax | Description | +| :--------------- | :---------------------------------- | :----------------------------------------------------------------- | +| Inner join | `INNER JOIN ... ON` | Returns rows that match in both tables | +| Left outer join | `LEFT JOIN ... ON` | Returns all rows from the left table, NULLs for non-matching right | +| Right outer join | `RIGHT JOIN ... 
ON` | Returns all rows from the right table, NULLs for non-matching left | +| Full outer join | `FULL OUTER JOIN ... ON` | Returns all rows from both tables, NULLs where no match | +| Cross join | `CROSS JOIN` | Cartesian product of both tables | +| Implicit join | `FROM t1, t2 WHERE t1.id = t2.id` | Comma-separated tables with join condition in `WHERE` | + +### Syntax + +```sql +-- Explicit JOIN +SELECT columns +FROM namespace.table1 alias1 +[INNER | LEFT | RIGHT | FULL OUTER | CROSS] JOIN namespace.table2 alias2 + ON alias1.column = alias2.column +[WHERE conditions] + +-- Implicit join +SELECT columns +FROM namespace.table1 alias1, namespace.table2 alias2 +WHERE alias1.column = alias2.column +``` + +### Multi-way joins + +You can join three or more tables in a single query: + +```sql +SELECT z.domain, h.method, f.action, COUNT(*) AS cnt +FROM my_namespace.zones z +INNER JOIN my_namespace.http_requests h ON z.zone_id = h.zone_id +INNER JOIN my_namespace.firewall_events f ON z.zone_id = f.zone_id +WHERE h.status_code >= 400 +GROUP BY z.domain, h.method, f.action +ORDER BY cnt DESC +LIMIT 20 +``` + +### Self-joins + +A table can be joined with itself using different aliases: + +```sql +SELECT f1.source_ip, f1.zone_id AS zone1, f2.zone_id AS zone2 +FROM my_namespace.firewall_events f1 +INNER JOIN my_namespace.firewall_events f2 + ON f1.source_ip = f2.source_ip + AND f1.zone_id < f2.zone_id +WHERE f1.action = 'block' +LIMIT 20 +``` + +### Join conditions + +- Join conditions use the `ON` clause with equality (`=`) or expression-based predicates. +- Functions are supported in join predicates (for example, `ON LOWER(a.col) = LOWER(b.col)`). +- Multiple conditions can be combined with `AND`. + +:::note +Nested (parenthesized) joins are not supported. Write multi-way joins as a flat sequence of `JOIN` clauses instead of grouping them with parentheses. 
+ +```sql +-- Not supported +SELECT * FROM (t1 JOIN t2 ON t1.id = t2.id) JOIN t3 ON t2.id = t3.id + +-- Supported +SELECT * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t2.id = t3.id +``` +::: + +### Best practices for joins + +- Include `WHERE` filters to reduce intermediate result sizes, especially for multi-way joins. +- Join large fact tables through a shared dimension table rather than directly cross-joining two large tables. +- Use `LIMIT` to cap result sizes. + +--- + +## Subqueries + +R2 SQL supports subqueries in multiple positions within a query. + +### Subqueries in FROM (derived tables) + +A subquery in the `FROM` clause creates a derived table that can be referenced in the outer query: + +```sql +SELECT sub.domain, sub.total_requests +FROM ( + SELECT z.domain, COUNT(*) AS total_requests + FROM my_namespace.zones z + INNER JOIN my_namespace.http_requests h ON z.zone_id = h.zone_id + GROUP BY z.domain +) sub +WHERE sub.total_requests > 1000 +ORDER BY sub.total_requests DESC +LIMIT 20 +``` + +:::note +`LATERAL` derived tables are not supported. Subqueries in `FROM` cannot reference columns from other tables in the same `FROM` clause. 
+::: + +Derived tables can be joined with other derived tables or regular tables: + +```sql +SELECT req.domain, req.total_reqs, fw.total_events +FROM ( + SELECT z.zone_id, z.domain, COUNT(*) AS total_reqs + FROM my_namespace.zones z + INNER JOIN my_namespace.http_requests h ON z.zone_id = h.zone_id + GROUP BY z.zone_id, z.domain +) req +INNER JOIN ( + SELECT zone_id, COUNT(*) AS total_events + FROM my_namespace.firewall_events + GROUP BY zone_id +) fw ON req.zone_id = fw.zone_id +ORDER BY fw.total_events DESC +LIMIT 20 +``` + +### `IN` / `NOT IN` subqueries + +Filter rows based on whether a value exists in the result of a subquery: + +```sql +-- Find requests from enterprise zones +SELECT method, status_code, COUNT(*) AS cnt +FROM my_namespace.http_requests +WHERE zone_id IN ( + SELECT zone_id FROM my_namespace.zones WHERE plan = 'enterprise' +) +GROUP BY method, status_code +ORDER BY cnt DESC +LIMIT 20 +``` + +```sql +-- NOT IN example +SELECT zone_id, COUNT(*) AS cnt +FROM my_namespace.http_requests +WHERE zone_id NOT IN ( + SELECT zone_id FROM my_namespace.firewall_events WHERE action = 'block' +) +GROUP BY zone_id +LIMIT 10 +``` + +:::caution +`NOT IN` subqueries are not supported on nullable columns. If the subquery column can contain `NULL` values, use `NOT EXISTS` instead. `SELECT DISTINCT` is also not supported inside subqueries; omit the `DISTINCT` keyword or use `NOT EXISTS`. 
+ +```sql +-- Instead of NOT IN on a nullable column: +SELECT z.domain +FROM my_namespace.zones z +WHERE NOT EXISTS ( + SELECT 1 FROM my_namespace.firewall_events f + WHERE f.zone_id = z.zone_id +) +LIMIT 20 +``` +::: + +### `EXISTS` / `NOT EXISTS` subqueries + +Test for the existence of rows matching a correlated condition: + +```sql +-- Find zones with blocked firewall events (semi-join) +SELECT z.domain, z.plan +FROM my_namespace.zones z +WHERE EXISTS ( + SELECT 1 FROM my_namespace.firewall_events f + WHERE f.zone_id = z.zone_id AND f.action = 'block' +) +ORDER BY z.domain +LIMIT 20 +``` + +```sql +-- Find zones with NO firewall events (anti-join) +SELECT z.domain, z.plan +FROM my_namespace.zones z +WHERE NOT EXISTS ( + SELECT 1 FROM my_namespace.firewall_events f + WHERE f.zone_id = z.zone_id +) +ORDER BY z.domain +LIMIT 20 +``` + +### Scalar subqueries + +A subquery that returns a single value can be used in `SELECT`, `WHERE`, or `HAVING`: + +```sql +-- In SELECT (constant value per row) +SELECT z.domain, z.plan, + (SELECT COUNT(*) FROM my_namespace.zones) AS total_zones +FROM my_namespace.zones z +WHERE z.plan = 'enterprise' +LIMIT 10 +``` + +```sql +-- In WHERE (comparison) +SELECT z.domain, z.plan, z.requests_30d +FROM my_namespace.zones z +WHERE z.requests_30d > ( + SELECT AVG(requests_30d) FROM my_namespace.zones +) +ORDER BY z.requests_30d DESC +LIMIT 20 +``` --- diff --git a/src/content/docs/r2-sql/troubleshooting.mdx b/src/content/docs/r2-sql/troubleshooting.mdx index d09211d48a655c6..2cccfdc0c9228a3 100644 --- a/src/content/docs/r2-sql/troubleshooting.mdx +++ b/src/content/docs/r2-sql/troubleshooting.mdx @@ -38,45 +38,90 @@ WHERE status = 200 AND timestamp BETWEEN '2025-09-24T01:00:00Z' AND '2025-09-25T ## FROM clause issues -### Multiple tables +### Join performance issues -
- **Error**: `unsupported feature: JOIN operations are not supported` -
+**Symptom**: Query returns 502 Bad Gateway or times out. -**Problem**: R2 SQL queries reference exactly one table. JOINs and multiple tables are not supported. +**Problem**: Multi-way joins across large tables can exceed resource limits, especially with `COUNT(DISTINCT)` or other memory-intensive aggregations. ```sql --- Invalid - Multiple tables not supported -SELECT a.*, b.* FROM my_namespace.table1 a, my_namespace.table2 b WHERE a.id = b.id -SELECT * FROM my_namespace.events JOIN my_namespace.users ON events.user_id = users.id - --- Valid - Separate queries -SELECT * FROM my_namespace.table1 WHERE id IN ('id1', 'id2', 'id3') LIMIT 100 --- Then query the second table separately in your application -SELECT * FROM my_namespace.table2 WHERE id IN ('id1', 'id2', 'id3') LIMIT 100 +-- May time out: directly joining two large fact tables +SELECT COUNT(DISTINCT h.ray_id), COUNT(DISTINCT f.event_id) +FROM my_namespace.http_requests h +INNER JOIN my_namespace.firewall_events f ON h.zone_id = f.zone_id ``` **Solution**: -- Denormalize your data by including necessary fields in a single table. -- Perform multiple queries and join data in your application. +- Add `WHERE` filters to reduce intermediate result sizes. +- Join through dimension tables instead of directly joining fact tables. +- Use `approx_distinct()` instead of `COUNT(DISTINCT)` for approximate counts. +- Break complex multi-way joins into smaller queries using CTEs or sequential queries. + +```sql +-- Better: filter both sides and use approx_distinct +SELECT z.plan, + approx_distinct(h.ray_id) AS unique_requests +FROM my_namespace.zones z +INNER JOIN my_namespace.http_requests h ON z.zone_id = h.zone_id +WHERE z.plan = 'enterprise' + AND h.status_code >= 400 +GROUP BY z.plan +``` -### Subqueries +### `NOT IN` on nullable columns -
**Error**: `unsupported feature: subqueries`
+**Symptom**: `NOT IN` subquery returns unexpected results or errors. -**Problem**: Subqueries in `FROM`, `WHERE`, and scalar positions are not supported. +**Problem**: `NOT IN` subqueries are not supported when the subquery column can contain `NULL` values. ```sql --- Invalid - Subqueries not supported -SELECT * FROM (SELECT user_id FROM my_namespace.events WHERE status = 200) +-- Fails: nullable_col may contain NULLs +SELECT zone_id +FROM my_namespace.http_requests +WHERE zone_id NOT IN ( + SELECT nullable_col FROM my_namespace.other_table +) +LIMIT 20 +``` --- Valid - Use direct query with appropriate filters -SELECT user_id FROM my_namespace.events WHERE status = 200 LIMIT 100 +**Solution**: Use `NOT EXISTS` with a correlated subquery instead. + +```sql +-- Works: NOT EXISTS handles NULLs correctly +SELECT h.zone_id +FROM my_namespace.http_requests h +WHERE NOT EXISTS ( + SELECT 1 FROM my_namespace.other_table o + WHERE o.nullable_col = h.zone_id +) +LIMIT 20 ``` -**Solution**: Flatten your query logic or use multiple sequential queries. +### Correlated subquery performance + +**Symptom**: `EXISTS` or `NOT EXISTS` subquery runs slowly. + +**Problem**: Correlated subqueries with complex conditions can be slow because the inner query is evaluated for each row of the outer query. + +```sql +-- Slower: multiple filter conditions in correlated subquery +SELECT z.domain +FROM my_namespace.zones z +WHERE EXISTS ( + SELECT 1 FROM my_namespace.firewall_events f + WHERE f.zone_id = z.zone_id + AND f.risk_score > 0.9 + AND f.colo = 'SJC' +) +LIMIT 20 +``` + +**Solution**: + +- Simplify correlated conditions where possible. +- Consider rewriting as a `JOIN` with `GROUP BY` instead of `EXISTS`. +- Use an `IN` subquery with pre-aggregated results instead of `EXISTS`. ---