diff --git a/CONTRIBUTING-DATA-SOURCE.AGENT.md b/CONTRIBUTING-DATA-SOURCE.AGENT.md new file mode 100644 index 000000000..6654e8950 --- /dev/null +++ b/CONTRIBUTING-DATA-SOURCE.AGENT.md @@ -0,0 +1,742 @@ +# Adding a data source or check type — agent reference + +Dense reference for coding agents implementing a new Soda Core data source or +check type. Pair with [`CONTRIBUTING-DATA-SOURCE.md`](CONTRIBUTING-DATA-SOURCE.md) +for context. This document is intentionally written so a coding agent can +drive the implementation end-to-end with a human reviewing — the human guide +gives the *why*, this one gives the *what* (every override, every signature, +every failure mode). + +Every symbol mentioned here is real and resolvable in this repository as of +writing. + +> **Source of truth ordering:** when this guide and the actual code disagree, +> the code wins. Re-read the base classes; this document may have drifted. + +## Contents + +- §0 Operating contract +- §1 Base classes (read these before coding) +- §2 Adding a data source — step by step +- §3 Adding a check type — step by step +- §4 SQL-AST builders cheat sheet +- §5 Plugin discovery — how it actually works +- §6 Verification +- §7 Common failure modes +- §8 What you do **not** need to do +- §9 Pointers to canonical examples + +## Runtime flow (orientation) + +``` +$ soda contract verify -ds ds.yml -c contract.yml + │ + ▼ + cli.py ──► load_plugins() + │ + │ iterates entry_points() under soda.plugins.* + │ each ep.load() imports the target class + │ importing a *DataSourceImpl class triggers + │ DataSourceImpl.__init_subclass__(model_class=...) + │ which registers it keyed by model_class.get_class_type() + ▼ + parse YAML (data source + contract) + │ + ▼ + DataSourceImpl.from_yaml_source(yaml) + │ reads `type:` from YAML, looks up registered impl class, + │ instantiates {Name}DataSourceImpl(data_source_model=...) + ▼ + {Name}DataSourceImpl.__init__ + ├── _create_sql_dialect() → {Name}SqlDialect() + └── _create_data_source_connection() → {Name}DataSourceConnection(...) + │ + ▼ + ContractVerification.execute() + │ sql_ast builders + dialect.build_*_sql(...) → SQL string + │ connection.execute_query(sql) → rows + cursor.description + │ native types canonicalised via dialect's type-name dicts + ▼ + CheckImpl.evaluate(measurement_values) → CheckResult +``` + +--- + +## 0. Operating contract + +You will: + +1. Read the base classes listed in §1 before writing anything. +2. Copy an existing implementation (DuckDB = minimal, Postgres = full featured) + and rename, **then** adapt. Do not generate from scratch. +3. Use existing implementations as templates. Postgres = full featured, + DuckDB = minimal, Snowflake/Databricks = warehouse semantics. +4. Run the verification commands in §6 before claiming completion. Do not + declare a task done unless tests actually pass. + +If you are blocked on a single override (e.g. how this dialect spells regex), +**read another dialect implementation** before guessing. There are a dozen +existing dialect packages in this monorepo to compare against. + +--- + +## 1. Base classes (read these before coding) + +| Class | File | What you must know | +|---|---|---| +| `DataSourceImpl` | `soda-core/src/soda_core/common/data_source_impl.py` | Two abstract methods. Auto-registers via `__init_subclass__(model_class=...)`. | +| `DataSourceConnection` | `soda-core/src/soda_core/common/data_source_connection.py` | Abstract `_create_connection()`. Default `execute_query` uses cursor.fetchall + cursor.description. | +| `SqlDialect` | `soda-core/src/soda_core/common/sql_dialect.py` | ~150 hooks. Most have sane defaults. The required overrides are the type-name dicts and (if the SQL is non-standard) regex/cast/sample. | +| `DataSourceBase` | `soda-core/src/soda_core/model/data_source/data_source.py` | Pydantic `BaseModel` + `ConfigDict(frozen=True, extra="forbid")`. `type: Literal["{name}"]` is what the plugin loader matches against. | +| `DataSourceConnectionProperties` | `soda-core/src/soda_core/model/data_source/data_source_connection_properties.py` | Base for your YAML connection schema. | +| `CheckParser`, `CheckImpl`, `CheckYamlParser`, `CheckYaml` | `contracts/impl/contract_yaml.py` + `contract_verification_impl.py` | Two register-by-classmethod pairs. | +| `MetricImpl`, `AggregationMetricImpl`, `DerivedMetricImpl` | `contract_verification_impl.py` (grep `^class MetricImpl` to locate) | `AggregationMetricImpl.sql_expression()` returns a `SqlExpression` from `sql_ast`. | +| `Plugin` (Protocol) | `soda-core/src/soda_core/plugins.py` | `setup_cli(parser)` + `load()` classmethods. Discovered via `entry_points()` under `soda.plugins.*`. | +| `DataSourceTestHelper` | `soda-tests/src/helpers/data_source_test_helper.py` | Override `_create_data_source_yaml_str` (required) + database/schema name methods (often). | + +--- + +## 2. Adding a data source — step by step + +### 2.0. Research the engine first + +Each question below maps to a specific override. Resolve them before +scaffolding; unresolved, they surface as failures deep in the +cross-data-source suite. The closest reference adapter (§9) answers most of +them. + +| Question | Drives | +|---|---| +| Identifier quoting char? (`"` standard; backtick for Databricks) | `quote_default()` / `quote_for_ddl()` | +| Does it upper/lower/preserve **unquoted** identifiers? | `default_casify()`, `metadata_casify()` | +| Pagination — `LIMIT`/`OFFSET`, `FETCH FIRST n ROWS ONLY`, `TOP n`? Any required clause order? | `_build_limit_sql()`, `_build_offset_sql()` | +| Does table/column metadata live in `INFORMATION_SCHEMA` or a vendor catalog? | `build_columns_metadata_query_str()` / `build_all_columns_metadata_query_str()` (or the per-field accessors they call) | +| Does `SELECT 1` work without a `FROM`? | `build_select_sql()` | +| Date/datetime/time literal format? | `literal_date()`, `literal_datetime()`, `literal_time()` | +| Native `BOOLEAN`? If not, substitute (`BIT`/`SMALLINT`/`TINYINT`)? | type-name dicts, `literal_boolean()` | +| Regex operator/function and flag syntax? | `_build_regex_like_sql()` | +| `TABLESAMPLE` (or equivalent) support, and which `SamplerType`s? | `supports_sampler()`, `_build_sample_sql()` | +| RANDOM/RAND function name — per-row or per-statement? | `_build_random_sql()` | +| `DROP TABLE`/`DROP VIEW IF EXISTS`? `CASCADE`? | drop-DDL overrides | +| `CREATE SCHEMA IF NOT EXISTS`? | `create_schema_if_not_exists_sql()` | +| Namespace depth — 2-part (`schema.table`) or 3-part (`db.schema.table`)? | `get_database_prefix_index()`, `get_schema_prefix_index()` | +| Which PEP 249 driver, and is `cursor.description` named or plain tuples? | `_create_connection()`, `_execute_query_get_result_row_column_name()` | +| Does `cursor.rowcount` return a real count after SELECT/UPDATE? | `execute_update()` return value | +| Does the engine treat `\` as an escape char inside string literals? | `escape_string()` | +| Native `TIMESTAMP WITH TIME ZONE`? Stored naive or UTC? | `literal_datetime()` / tz handling | +| Bare `DECIMAL` / `VARCHAR` / `NUMERIC` defaults — surprisingly narrow? | type-name dicts + default precision/length | +| Separate `datetime_precision` catalog column? | `column_data_type_datetime_precision()` | +| Default port? | connection model | + +### 2.1. Package skeleton + +Copy `soda-duckdb/` (smallest reference) and rename `duckdb` → `{name}` in +file/dir names and module paths. Final structure: + +``` +soda-{name}/ +├── pyproject.toml +├── docker-compose.yml # only if applicable +├── src/soda_{name}/ +│ ├── __init__.py +│ ├── common/data_sources/ +│ │ ├── __init__.py +│ │ ├── {name}_data_source.py +│ │ └── {name}_data_source_connection.py +│ └── test_helpers/ +│ ├── __init__.py +│ └── {name}_data_source_test_helper.py +└── tests/ + ├── conftest.py + └── data_sources/ + └── test_{name}.py +``` + +### 2.2. Connection model (pydantic) + +In `{name}_data_source_connection.py`: + +```python +class {Name}ConnectionProperties(DataSourceConnectionProperties, ABC): + field_mapping: ClassVar[Dict[str, str]] = { + # canonical name in YAML -> driver kwarg name + # only needed if names differ; example: postgres maps "database" -> "dbname" + } + +class {Name}ConnectionPropertiesBase({Name}ConnectionProperties, ABC): + host: str = Field(..., description="...") + port: int = Field(..., ge=1, le=65535) + # ... your fields with pydantic Field(..., description=...) + +# One concrete subclass per auth method — most real engines have 2-3: +class {Name}ConnectionString({Name}ConnectionProperties): + connection_string: SecretStr = Field(..., description="...") + +class {Name}ConnectionPassword({Name}ConnectionPropertiesBase): + password: SecretStr = Field(..., description="...") + +class {Name}ConnectionPasswordFile({Name}ConnectionPropertiesBase): + password_file: str = Field(..., description="path to file holding the password") + +class {Name}DataSource(DataSourceBase, ABC): + type: Literal["{name}"] = Field("{name}") + connection_properties: {Name}ConnectionProperties = Field( + ..., alias="connection", description="..." + ) + + @field_validator("connection_properties", mode="before") + def infer_connection_type(cls, value): + # discriminate between auth variants based on present keys + if "connection_string" in value: + return {Name}ConnectionString(**value) + if "password_file" in value: + return {Name}ConnectionPasswordFile(**value) + if "password" in value: + return {Name}ConnectionPassword(**value) + raise ValueError("Unknown connection structure") +``` + +See `soda-postgres/src/soda_postgres/common/data_sources/postgres_data_source_connection.py` +for the canonical three-variant pattern. + +The `type` literal is the load-bearing match — the user's YAML `type:` field +is looked up against the registry keyed by this `Literal[...]` default. The +entry-point group's suffix should match by convention, but doesn't have to +(see §5). + +### 2.3. `DataSourceConnection` subclass + +```python +class {Name}DataSourceConnection(DataSourceConnection): + def __init__(self, name: str, connection_properties: DataSourceConnectionProperties): + super().__init__(name, connection_properties) + + def _create_connection(self, config: {Name}ConnectionProperties): + connection_kwargs = config.to_connection_kwargs() + return {your_dbapi}.connect(**connection_kwargs) +``` + +Override `execute_query` / `execute_update` only if the driver requires +specific error handling (Postgres rolls back the transaction on failure; +see `PostgresDataSourceConnection`). + +If your driver's cursor description rows don't have a `.name` attribute +(e.g. they're plain tuples), override `_execute_query_get_result_row_column_name`. +DuckDB demonstrates wrapping the cursor in a namedtuple to get DB-API +compatibility. + +### 2.4. `SqlDialect` subclass + +```python +class {Name}SqlDialect(SqlDialect, sqlglot_dialect="{sqlglot_name}"): + SODA_DATA_TYPE_SYNONYMS = ( + # tuples of SodaDataTypeName members that should be considered equivalent + # for schema-check purposes in this dialect + ) +``` + +The `sqlglot_dialect=` kwarg is **required** by `__init_subclass__`. Pick the +matching dialect from sqlglot, or `"postgres"` as a generic fallback. + +`SodaDataTypeName` is the canonical Soda type-name enum defined in +`soda_core/common/metadata_types.py`. `python -c "from soda_core.common.metadata_types +import SodaDataTypeName; print([m.name for m in SodaDataTypeName])"` enumerates +the full set — your two required type-name dicts must round-trip every member +your engine can produce. + +Work through the overrides in dependency order: identifier quoting + casing +first (everything else builds on it), then pagination, then the type-name +dicts, then literal formatting, then metadata queries, then regex / sampling / +string ops. Add a unit test after each group rather than at the very end. + +**Required overrides:** + +| Method | Returns | Purpose | +|---|---|---| +| `get_data_source_data_type_name_by_soda_data_type_names()` | `dict[SodaDataTypeName, str]` | For DDL: how to spell each Soda type in this dialect | +| `get_soda_data_type_name_by_data_source_data_type_names()` | `dict[str, SodaDataTypeName]` | For schema checks: how to canonicalize what the DB returns | +| `get_database_prefix_index()` | `int \| None` | Index into `dataset_prefix` for database name (`None` if no concept) | +| `get_schema_prefix_index()` | `int \| None` | Index into `dataset_prefix` for schema name (`None` if flat namespace) | + +Postgres uses `[database, schema]` (database=0, schema=1). DuckDB returns +`None` for database, `0` for schema. + +**Frequently overridden:** + +- `_build_regex_like_sql(matches: REGEX_LIKE) -> str` — every dialect spells + this differently. Postgres: `expr ~ 'pattern'`. DuckDB: + `REGEXP_MATCHES(expr, 'pattern')`. Snowflake: `REGEXP_LIKE(expr, 'pattern')`. +- `_build_cast_sql(cast: CAST) -> str` — Postgres uses `expr::type`, others + use `CAST(expr AS type)` (the default). +- `create_schema_if_not_exists_sql(prefixes, add_semicolon)` — for engines + with non-standard schema DDL. +- `escape_string(value)` / `escape_regex(value)` — quoting/escaping rules. + Engines that treat `\` as an escape character inside string literals + (e.g. Snowflake) must double backslashes **and** single quotes, or an + `INSERT ... VALUES` containing `'C:\path'` silently corrupts. Postgres and + DuckDB don't, so the default (double single-quote only) is correct there. +- `quote_default(identifier)` / `quote_for_ddl(identifier)` — identifier + quoting (default: double-quote). +- `default_casify(identifier)` / `metadata_casify(identifier)` — case + normalisation. Postgres lowercases unquoted identifiers; Snowflake uppercases. +- `_get_data_type_name_synonyms()` — returns `list[list[str]]` of equivalent + native type names (e.g. `["varchar", "character varying"]`). +- `supports_data_type_character_maximum_length()`, `supports_data_type_numeric_precision()`, + `supports_data_type_numeric_scale()`, `supports_data_type_datetime_precision()` — + default `True`; override to `False` if the dialect doesn't expose that info. +- `default_numeric_precision()` / `default_numeric_scale()` — returned when + the engine doesn't report explicit precision/scale. +- `supports_materialized_views()` — default `False`. +- `supports_sampler(SamplerType)` / `_build_sample_sql(...)` — for + `TABLESAMPLE`-style sampling. + +**Metadata queries** — defaults use `INFORMATION_SCHEMA`. If your engine +doesn't have one, override `build_columns_metadata_query_str(...)` and +`build_all_columns_metadata_query_str(...)`. Postgres builds queries against +`pg_catalog.pg_class` etc. via the `sql_ast` builder DSL — see lines 248-431 +of `postgres_data_source.py` for a full example. + +If the engine doesn't support bulk metadata queries (e.g. only `DESCRIBE TABLE` +per table), override `bulk_columns_metadata_available` on your `DataSourceImpl` +to return `False`. + +### 2.5. `DataSourceImpl` subclass + +```python +class {Name}DataSourceImpl(DataSourceImpl, model_class={Name}DataSource): + def _create_sql_dialect(self) -> SqlDialect: + return {Name}SqlDialect() + + def _create_data_source_connection(self) -> DataSourceConnection: + return {Name}DataSourceConnection( + name=self.data_source_model.name, + connection_properties=self.data_source_model.connection_properties, + ) +``` + +The `model_class=` kwarg in the class declaration triggers +`__init_subclass__` registration in `DataSourceImpl`. Without it, your data +source will not be discoverable. + +If your engine needs custom metadata queries, also override +`create_metadata_tables_query()` to return a subclass of `MetadataTablesQuery` +(see `PostgresMetadataTablesQuery`). + +### 2.6. `pyproject.toml` + +```toml +[project] +name = "soda-{name}" +version = "" +description = "Soda {Name} V4" +requires-python = ">=3.10" +dependencies = [ + "soda-core==", + "{your-dbapi-driver}>={X.Y}", +] + +[project.entry-points."soda.plugins.data_source.{name}"] +{Name}DataSourceImpl = "soda_{name}.common.data_sources.{name}_data_source:{Name}DataSourceImpl" + +[tool.uv.sources] +soda-core = { workspace = true } + +[build-system] +requires = ["setuptools>=45", "wheel"] +build-backend = "setuptools.build_meta" + +[tool.setuptools] +package-dir = {"" = "src"} +``` + +The entry-point group `soda.plugins.data_source.{name}` is what makes your +package discoverable. By convention the `{name}` suffix matches the +`Literal[...]` `type` field on your `DataSourceBase` subclass, but the loader +does **not** read the entry-point name to match — `ep.load()` is just an import +trigger that fires `DataSourceImpl.__init_subclass__(model_class=...)`, which +registers the impl class keyed by `model_class.get_class_type()` (i.e. the +`Literal[...]` default). The actual lookup at YAML parse time uses the `type:` +field in user YAML against that registry. See `DataSourceImpl.from_yaml_source` +and `__init_subclass__` in `data_source_impl.py`. + +Keep the entry-point suffix, the `Literal[...]` value, and the `type:` field +in YAML aligned for sanity — but the load-bearing match is YAML-`type:` ↔ +`Literal[...]`. + +### 2.7. Workspace registration + +Add to the repo root `pyproject.toml`: + +- `[tool.uv.workspace] members` — add `"soda-{name}"` +- `[tool.uv.sources]` — add `soda-{name} = { workspace = true }` + +Add to the repo root `pytest.ini`: + +- `testpaths` — add `soda-{name}/tests` +- `pythonpath` — add `soda-{name}/tests` + +### 2.8. Test helper + +In `src/soda_{name}/test_helpers/{name}_data_source_test_helper.py`: + +```python +from helpers.data_source_test_helper import DataSourceTestHelper + +class {Name}DataSourceTestHelper(DataSourceTestHelper): + def _create_database_name(self) -> Optional[str]: + return os.getenv("{NAME}_DATABASE", "soda_test") + + def _create_data_source_yaml_str(self) -> str: + return f""" + type: {name} + name: {self.name} + connection: + host: {os.getenv("{NAME}_HOST", "localhost")} + user: {os.getenv("{NAME}_USERNAME", "soda_test")} + password: {os.getenv("{NAME}_PASSWORD")} + port: {int(os.getenv("{NAME}_PORT", "..."))} + database: {self.dataset_prefix[0]} + """ +``` + +Then register in `soda-tests/src/helpers/data_source_test_helper.py`'s +`DataSourceTestHelper.create()` factory — add an `elif` branch for your +`{name}`. + +If your data source has no schema concept, override `_create_schema_name()` +to return `None`. If it has unusual case rules, override `_adjust_schema_name()`. + +### 2.9. Conftest + +In `soda-{name}/tests/conftest.py`: + +```python +from helpers.test_fixtures import * # noqa: F401 +from soda_core.common.logging_configuration import configure_logging + +def pytest_sessionstart(session) -> None: + configure_logging(verbose=True) +``` + +This wires in the standard fixtures (`data_source_test_helper`, etc.). + +--- + +## 3. Adding a check type — step by step + +### 3.1. Two `*_yaml.py` classes + +`{type}_check_yaml.py`: + +```python +class {Type}CheckYamlParser(CheckYamlParser): + def get_check_type_names(self) -> list[str]: + return ["{type}"] + + def parse_check_yaml( + self, check_type_name: str, check_yaml_object: YamlObject, column_yaml: Optional[ColumnYaml] + ) -> Optional[CheckYaml]: + return {Type}CheckYaml(type_name=check_type_name, check_yaml_object=check_yaml_object) + + +class {Type}CheckYaml(ThresholdCheckYaml): # or CheckYaml / MissingAncValidityCheckYaml + def __init__(self, type_name: str, check_yaml_object: YamlObject): + super().__init__(type_name=type_name, check_yaml_object=check_yaml_object) + # read additional fields: + # self.foo = check_yaml_object.read_string("foo") + # self.bar = check_yaml_object.read_number_opt("bar") +``` + +Pick the right base: + +- `CheckYaml` — bare check, no threshold (e.g. schema check). **Do not** use + if you have a `threshold:` block — pick `ThresholdCheckYaml` instead. +- `ThresholdCheckYaml` — has a `threshold:` block (most metric-style checks). + **Do not** use for checks that have no user-tunable threshold; you'll get + spurious validation errors. +- `MissingAncValidityCheckYaml` — adds `missing_*` and `valid_*` fields. + **Do not** use unless you actually parse those fields; otherwise pick + `ThresholdCheckYaml`. (Yes — `Anc` is a typo carried forward from current + code. The corresponding impl base is correctly spelled + `MissingAndValidityCheckImpl`. The names will be unified in a separate PR.) + +Use `check_yaml_object.read_string` / `read_string_opt` / `read_number_opt` / +`read_bool_opt` / `read_object_opt` for typed parsing with location-aware +errors. `read_*_opt` returns `None` if the key is absent; +non-`opt` variants log an error and return `None`. + +### 3.2. Two `*_check.py` classes + +```python +class {Type}CheckParser(CheckParser): + def get_check_type_names(self) -> list[str]: + return ["{type}"] + + def parse_check( + self, contract_impl: ContractImpl, column_impl: Optional[ColumnImpl], check_yaml: {Type}CheckYaml, + ) -> Optional[CheckImpl]: + return {Type}CheckImpl(contract_impl, column_impl, check_yaml) + + +class {Type}CheckImpl(CheckImpl): + def __init__(self, contract_impl, column_impl, check_yaml): + super().__init__(contract_impl=contract_impl, column_impl=column_impl, check_yaml=check_yaml) + # threshold: only if you extended ThresholdCheckYaml + self.threshold = ThresholdImpl.create( + threshold_yaml=check_yaml.threshold, + default_threshold=ThresholdImpl(type=ThresholdType.SINGLE_COMPARATOR, must_be_greater_than=0), + ) + + def setup_metrics(self, contract_impl, column_impl, check_yaml) -> None: + # called once per check; resolve metrics that should be measured + self.{your}_metric = self._resolve_metric( + {Your}MetricImpl(contract_impl=contract_impl, check_impl=self), + ) + + def evaluate(self, measurement_values: MeasurementValues) -> CheckResult: + value = measurement_values.get_value(self.{your}_metric) + outcome = self.evaluate_threshold(value) # or your custom logic + return CheckResult( + check=self._build_check_info(), + outcome=outcome, + threshold_value=value, + diagnostic_metric_values={"...": value}, + ) +``` + +### 3.3. Metric + +For SQL aggregation (counted in the same query as the rest of the contract): + +```python +class {Your}MetricImpl(AggregationMetricImpl): + def __init__(self, contract_impl, check_impl=None, ...): + super().__init__( + contract_impl=contract_impl, + metric_type="{your_metric}", + check_filter=check_impl.check_yaml.filter if check_impl else None, + missing_and_validity=None, + ) + + def sql_expression(self) -> SqlExpression: + # use the sql_ast builders: SUM, COUNT, CASE_WHEN, COLUMN, LITERAL, ... + return COUNT(STAR()) + + def convert_db_value(self, value: any) -> any: + return int(value) if value is not None else 0 +``` + +For checks that need their own query (e.g. freshness picks `MAX(updated_at)` +in a separate query and timestamps it against `NOW()`), subclass `MetricImpl` +directly and emit the query via the contract's query builder. See +`freshness_check.py` for the pattern. + +### 3.4. Register + +Either: + +**(a)** if you ship inside an existing package, add to that package's plugin's +`load()` classmethod: + +```python +CheckYaml.register({Type}CheckYamlParser()) +CheckImpl.register({Type}CheckParser()) +``` + +See `CoreCheckTypesPlugin.register_check_types()` for the canonical pattern. + +**(b)** if you ship in a new package, declare a `Plugin`: + +```python +from soda_core.plugins import Plugin + +class {Pkg}CheckTypesPlugin(Plugin): + @classmethod + def setup_cli(cls, root_parser): pass + + @classmethod + def load(cls): + from soda_core.contracts.impl.contract_yaml import CheckYaml + from soda_core.contracts.impl.contract_verification_impl import CheckImpl + CheckYaml.register({Type}CheckYamlParser()) + CheckImpl.register({Type}CheckParser()) + return cls() +``` + +Entry point: + +```toml +[project.entry-points."soda.plugins.check_types.{pkg}"] +{Pkg}CheckTypesPlugin = "soda_{pkg}.check_types:{Pkg}CheckTypesPlugin" +``` + +The plugin loader (`soda_core.plugins.load_plugins`) discovers all entry +points under `soda.plugins.*`, so any group prefix works; the `check_types.` +segment is convention. + +--- + +## 4. SQL-AST builders cheat sheet + +Use these in `sql_expression()` and SQL-building dialect methods. +All importable from `soda_core.common.sql_ast`: + +| Category | Builders | +|---|---| +| Selection | `SELECT`, `FROM`, `WHERE`, `AND`, `OR`, `NOT`, `JOIN`, `LEFT_INNER_JOIN`, `ORDER_BY_ASC`, `LIMIT`, `OFFSET` | +| Columns / values | `COLUMN`, `STAR`, `LITERAL`, `RAW_SQL`, `CAST` | +| Aggregation | `COUNT`, `SUM`, `AVERAGE`, `MIN`, `MAX`, `DISTINCT` | +| Conditions | `EQ`, `GT`, `LT`, `GTE`, `LTE`, `IN`, `IS_NULL`, `IS_NOT_NULL`, `LIKE`, `NOT_LIKE`, `REGEX_LIKE`, `EXISTS` | +| Functions | `LOWER`, `LENGTH`, `COALESCE`, `CASE_WHEN`, `CONCAT`, `CONCAT_WS` | +| DDL/DML | `CREATE_TABLE`, `CREATE_TABLE_COLUMN`, `CREATE_VIEW`, `CREATE_MATERIALIZED_VIEW`, `DROP_TABLE`, `INSERT_INTO`, `VALUES`, `VALUES_ROW` | +| Strings | `SqlExpressionStr` (raw string fallback when no builder fits) | + +Render via the dialect's `build_select_sql(elements)` / +`build_expression_sql(expr)` / `build_create_table_sql(...)`. Don't hand-write +SQL strings unless absolutely necessary — they bypass dialect rules. + +--- + +## 5. Plugin discovery — how it actually works + +`soda_core.plugins.load_plugins()` is called from +`cli.py`'s entry point. It calls `entry_points()` and iterates every group +starting with `soda.plugins.`. For each entry, it `ep.load()`s the class. If +the class implements the `Plugin` Protocol (has `setup_cli` + `load` +classmethods), it gets called. + +For data sources, the implementation class itself is **not** a `Plugin` — the +`isinstance(plugin_cls, Plugin)` check fails and `setup_cli`/`load` are +skipped. Registration happens via `DataSourceImpl.__init_subclass__` at +**import time**, triggered by `ep.load()` evaluating the entry-point string. +That's why your entry-point target must be the `*DataSourceImpl` class +itself, not a wrapper. + +For check types, you **need** the `Plugin` class because the registration +calls (`CheckYaml.register(...)`, `CheckImpl.register(...)`) are imperative, +not metaclass-driven. Without a `load()` classmethod, the parsers never +register and your check type silently disappears. + +--- + +## 6. Verification + +Before claiming the work is done, all of these must pass: + +```bash +# 1. Workspace install resolves +uv sync --all-packages --group dev + +# 2. Pre-commit clean +uv run pre-commit run --all-files + +# 3. Plugin discovery sees your data source +uv run python -c " +from soda_core.plugins import load_plugins +from importlib.metadata import entry_points +load_plugins() +print([ep.name for ep in entry_points(group='soda.plugins.data_source.{name}')]) +" +# expected output: ['{Name}DataSourceImpl'] + +# 4. YAML parse + connect smoke test +TEST_DATASOURCE={name} uv run pytest soda-{name}/tests/data_sources/test_{name}.py -v + +# 5. Cross-data-source suite (this is the bar) +TEST_DATASOURCE={name} uv run pytest soda-tests/tests -x + +# 6. No leftover NotImplementedError +grep -rn NotImplementedError soda-{name}/src +``` + +Tests must pass, not be skipped. If you skip a test, document why in the test +itself with `pytest.skip("reason: ...")` and surface the list in your PR. + +For check types, the equivalent verification is: + +```bash +uv run python -c " +from soda_core.plugins import load_plugins; load_plugins() +from soda_core.contracts.impl.contract_verification_impl import CheckImpl +print('{type}' in CheckImpl.get_check_type_names()) +" +# expected: True +``` + +Plus pytest covering the new check type running against at least one data +source. + +--- + +## 7. Common failure modes + +| Symptom | Likely cause | +|---|---| +| `Data source type 'X' not available. Make sure to install the required plugin` | Nothing registered the type. Causes, in rough order: (1) the package isn't installed at all (no entry point to load); (2) the entry point exists but `ep.load()` raised silently during import — run `python -c "import soda_{name}.common.data_sources.{name}_data_source"` to surface the real error; (3) `model_class=` was omitted from the `DataSourceImpl` subclass declaration, so `__init_subclass__` never fired and the registry stayed empty; (4) the `Literal[...]` default on `DataSourceBase` doesn't match the `type:` value in the user's YAML (typo on either side). The entry-point name is never read — don't chase it. | +| Schema check passes locally but fails in cross-DS suite | Type-name mappings are incomplete. Run a `SELECT * FROM information_schema.columns` against your test schema and ensure every type the suite creates round-trips through your two type dicts. | +| `Unknown check type 'X'` after registering | The package's `Plugin.load()` classmethod isn't being called — ensure your entry-point group starts with `soda.plugins.` and the class implements both `setup_cli` and `load` classmethods. | +| Regex / LIKE / sample tests fail with SQL syntax error | Override `_build_regex_like_sql`, `_build_like_sql`, or `_build_sample_sql`. Defaults are Postgres-flavoured. | +| Identifier case mismatches (e.g. `MY_TABLE` vs `my_table`) | Override `default_casify` and `metadata_casify`. Snowflake uppercases by default; most others lowercase. | +| Failing-rows queries return nothing | Override `escape_string` if your engine uses non-standard string escaping. | +| `Cannot determine schema name from prefixes` | `get_schema_prefix_index()` is wrong, or your `_create_dataset_prefix()` in the test helper returns the wrong list shape for this dialect. | +| Pre-commit fails with import errors | Missing `__init__.py` in a new directory, or import paths don't match the package structure. | + +### 7.1. Base-class assumptions that don't always hold + +Assumptions the base classes make that a new engine may violate. Verify each +against the target driver and dialect: + +1. **`cursor.rowcount` may be `-1`, non-int, or expensive** after a SELECT. + Some drivers report `-1`; do not depend on it. +2. **Rows may not be plain tuples.** Some drivers return non-tuple rows that + must be normalised on read. +3. **`INFORMATION_SCHEMA` is not universal.** Hive-metastore engines + (Databricks) and vendor catalogs need custom + `build_columns_metadata_query_str()` / `build_all_columns_metadata_query_str()`. +4. **`RANDOM()` semantics vary.** Some engines evaluate it once per statement + rather than per row, and need a per-row construct instead. +5. **A tz-aware `TIMESTAMP` may be stored tz-naive.** Convert to UTC in the + dialect before insert rather than patching individual tests. +6. **Precision may be silently normalised** by the engine (e.g. + `TIMESTAMP(3)` → `TIMESTAMP(6)`). Assert against the *actual* fetched + metadata, not the requested DDL. +7. **Identifier length limits vary** — long generated test names can overflow + engines with short identifier caps. +8. **CTE-wrapping arbitrary user SQL is not always safe.** Some engines reject + `ORDER BY` inside a CTE or forbid `WITH` inside parenthesised subqueries. + Fall back to row-by-row streaming. +9. **Metadata discovery must exclude internal `__soda_*` tables** — the + framework creates them (failed-rows storage, etc.) and they must not leak + into schema or discovery results. +10. **A declared capability may silently degrade.** `supports_materialized_views()` + returning `True` does not guarantee real MV semantics — verify behaviour + rather than trusting the flag. + +--- + +## 8. What you do **not** need to do + +- Do not edit `soda-core/` itself unless you found a real upstream bug. +- Do not register your data source in any central registry (entry points are + the only registration). +- Do not write SQL by string concatenation — use `sql_ast` builders. +- Do not subclass `DataSourceTestHelper` for things that aren't actually + data-source-specific. The base class handles 95% of test fixture setup. +- Do not duplicate metric logic that already exists in + `contracts/impl/check_types/` — extend or reuse. + +--- + +## 9. Pointers to canonical examples + +| You need to look at... | Read | +|---|---| +| Full-featured DS with custom metadata SQL | `soda-postgres/src/soda_postgres/common/data_sources/postgres_data_source.py` | +| Minimal DS with no schema concept | `soda-duckdb/src/soda_duckdb/common/data_sources/duckdb_data_source.py` | +| DS with multi-warehouse semantics | `soda-snowflake/`, `soda-databricks/` | +| Aggregation check (simplest) | `soda-core/src/soda_core/contracts/impl/check_types/row_count_check.py` | +| Check with own query | `soda-core/src/soda_core/contracts/impl/check_types/freshness_check.py` | +| Check with missing/validity semantics | `soda-core/src/soda_core/contracts/impl/check_types/missing_check.py` | +| Plugin registration of check types | `soda-core/src/soda_core/contracts/impl/check_types/check_types.py` | +| Test helper with credentials | `soda-postgres/src/soda_postgres/test_helpers/postgres_data_source_test_helper.py` | +| Cross-data-source test factory | `soda-tests/src/helpers/data_source_test_helper.py` line 65 onward | diff --git a/CONTRIBUTING-DATA-SOURCE.md b/CONTRIBUTING-DATA-SOURCE.md new file mode 100644 index 000000000..6366861ff --- /dev/null +++ b/CONTRIBUTING-DATA-SOURCE.md @@ -0,0 +1,223 @@ +# Contribute support for a data source + +Thanks for considering contributing to Soda Core's library of supported data sources! + +This is a guide for **humans** who want to add a new data source. It walks through the moving parts, what you must implement, and how to test the result. If you'd rather have a denser, machine-oriented reference — every override, every signature — read [`CONTRIBUTING-DATA-SOURCE.AGENT.md`](CONTRIBUTING-DATA-SOURCE.AGENT.md). Coding agents (Claude, Cursor, …) should read both: this one for the *why*, the agent doc for the *what*. + +There is no separate Soda SDK package. Extension support is built into Soda Core via Python entry points, a small set of base classes, and a handful of fully working reference packages. The existing data source packages **are** the SDK — when in doubt, copy from `soda-postgres/` (full featured) or `soda-duckdb/` (minimal). + +--- + +## What you need to provide + +- A **working data source** that reviewers can connect to: + - Self-hostable engines (Postgres, MySQL, …): ship a `docker-compose.yml` with the package. + - Cloud-only engines (BigQuery, Snowflake, Redshift, …): provide test credentials or a service account that CI can use. +- A **Python DB-API 2.0 (PEP 249) compatible driver** as a dependency. Soda Core calls into this driver to open connections and execute SQL. +- A **`soda-{name}/` package** that implements the three classes described below and registers itself through a Python entry point. + +--- + +## Implementation basics + +### Before you start: research your engine + +A new adapter is mostly dialect-specific overrides, each tied to a fact about +the target engine. Establish these before scaffolding: + +- identifier quoting and case-folding +- pagination syntax (`LIMIT`/`OFFSET`, `FETCH FIRST`, `TOP`) +- where table/column metadata lives (`INFORMATION_SCHEMA` or a vendor catalog) +- date/time literal format +- native `BOOLEAN`, or its substitute +- regex syntax +- sampling support +- namespace depth (2-part vs 3-part) +- the PEP 249 driver and the shape of its `cursor.description` + +Each maps to a specific override; the agent reference's [research checklist](CONTRIBUTING-DATA-SOURCE.AGENT.md#20-research-the-engine-first) maps every question to the method it drives. + +### Package layout + +A new data source package mirrors the existing `soda-{name}/` siblings: + +``` +soda-{name}/ +├── pyproject.toml # dependencies + entry point +├── docker-compose.yml # optional, for self-hostable engines +├── src/ +│ └── soda_{name}/ +│ ├── __init__.py +│ ├── common/ +│ │ └── data_sources/ +│ │ ├── {name}_data_source.py # impl + dialect +│ │ └── {name}_data_source_connection.py # connection + pydantic models +│ └── test_helpers/ +│ └── {name}_data_source_test_helper.py +└── tests/ + ├── conftest.py + └── data_sources/ + └── test_{name}.py # data-source-specific smoke tests +``` + +The cleanest way to start is to copy `soda-duckdb/` (smallest) or `soda-postgres/` (full-featured) and rename everything from the original prefix to your `{name}`. + +### The three classes + +A data source is wired up by implementing three classes plus a pydantic connection model. + +| Class | Base | Purpose | +|---|---|---| +| `{Name}DataSourceConnection` | `DataSourceConnection` | Wraps your DB-API driver. Implements `_create_connection()` and any retry / cursor quirks. | +| `{Name}SqlDialect` | `SqlDialect` | Tells Soda how to render SQL for your engine — type names, regex, casing, metadata queries. | +| `{Name}DataSourceImpl` | `DataSourceImpl` | The entry-point class. Wires the connection and dialect together. Two abstract methods. | + +Plus a pydantic schema — `{Name}DataSource(DataSourceBase)` and one or more `{Name}ConnectionProperties` variants (typically one per authentication method) — that defines the YAML users will write in their `data-source.yml`. Look at `PostgresConnectionPropertiesBase` and the three concrete variants (`PostgresConnectionString`, `PostgresConnectionPassword`, `PostgresConnectionPasswordFile`) for the canonical multi-auth pattern. + +The `type: Literal["{name}"]` field on your `DataSourceBase` subclass is the value users put in YAML *and* the suffix of your entry-point group. They must match. + +### Required overrides on `SqlDialect` + +`SqlDialect` has ~150 hooks but most defaults are sane. In practice you must override: + +- `get_data_source_data_type_name_by_soda_data_type_names()` — Soda canonical type names → native DDL type names (used when creating tables). +- `get_soda_data_type_name_by_data_source_data_type_names()` — native type names → Soda canonical type names (used by schema checks). +- `get_database_prefix_index()` / `get_schema_prefix_index()` — how `dataset_prefix` is interpreted for this engine. Postgres uses `[database, schema]` (`0`, `1`). DuckDB has no database concept, so returns `None` for database, `0` for schema. +- `_build_regex_like_sql(...)` — every dialect spells regex differently. Postgres: `expr ~ 'pattern'`. DuckDB: `REGEXP_MATCHES(expr, 'pattern')`. Snowflake: `REGEXP_LIKE(expr, 'pattern')`. + +You must also pass the matching sqlglot dialect via the subclass keyword: + +```python +class {Name}SqlDialect(SqlDialect, sqlglot_dialect="{sqlglot_name}"): + ... +``` + +### Optional overrides — frequently needed + +- `_build_cast_sql(...)` — Postgres uses `expr::type`; everyone else uses `CAST(expr AS type)` (the default). +- `escape_string(...)` / `escape_regex(...)` — quoting and escaping rules. +- `quote_default(...)` / `quote_for_ddl(...)` — identifier quoting (default: double-quote). +- `default_casify(...)` / `metadata_casify(...)` — case normalisation. Postgres lowercases unquoted identifiers; Snowflake uppercases. +- `_get_data_type_name_synonyms()` — list of equivalent native type names (e.g. `["varchar", "character varying"]`). +- `supports_data_type_*()` and `default_numeric_precision()` / `default_numeric_scale()` — for engines that don't expose precision/scale via metadata. +- `build_columns_metadata_query_str(...)` / `build_all_columns_metadata_query_str(...)` — defaults query `INFORMATION_SCHEMA`. Override if your engine doesn't have one. See `postgres_data_source.py` for an example that builds queries with the `sql_ast` DSL against `pg_catalog`. + +### Optional overrides — infrequent + +- `create_schema_if_not_exists_sql(...)` — for engines with non-standard schema DDL. +- `supports_materialized_views()`, `supports_sampler(...)`, `_build_sample_sql(...)` — only if you support these features. +- `bulk_columns_metadata_available` on `DataSourceImpl` — set to `False` if your engine can only fetch column metadata one table at a time (e.g. `DESCRIBE TABLE` per table). +- `create_metadata_tables_query()` — override on `DataSourceImpl` if you need a custom subclass of `MetadataTablesQuery`. + +### Connection class + +```python +class {Name}DataSourceConnection(DataSourceConnection): + def __init__(self, name, connection_properties): + super().__init__(name, connection_properties) + + def _create_connection(self, config): + return {your_dbapi}.connect(**config.to_connection_kwargs()) +``` + +Override `execute_query` / `execute_update` only if the driver needs specific error handling. The Postgres connection rolls back the transaction on failure — see `PostgresDataSourceConnection.execute_query`. + +If your driver's cursor description rows aren't named tuples, override `_execute_query_get_result_row_column_name`. DuckDB wraps the cursor to get DB-API compatibility — read its code if your driver doesn't behave like Postgres'. + +### Further considerations + +- **Schemas.** Some engines don't have schemas (DuckDB), some require them to be prefixed in every query (Snowflake), and some can have a default schema set on the connection. Make sure your `*_prefix_index` overrides reflect this. +- **Case sensitivity.** Postgres folds unquoted identifiers to lowercase. Snowflake folds to uppercase. Quoted identifiers are preserved. Reflect this in `default_casify` and `metadata_casify`. +- **Identifier quoting.** Default is `"double quotes"`. BigQuery uses backticks. Override `quote_default` if needed. +- **Regex.** Even engines that all "support regex" express it differently. Don't assume the default works. + +--- + +## Wire it up + +In `soda-{name}/pyproject.toml`: + +```toml +[project] +name = "soda-{name}" +version = "" +requires-python = ">=3.10" +dependencies = [ + "soda-core==", + "{your-dbapi-driver}>=X.Y", +] + +[project.entry-points."soda.plugins.data_source.{name}"] +{Name}DataSourceImpl = "soda_{name}.common.data_sources.{name}_data_source:{Name}DataSourceImpl" + +[tool.uv.sources] +soda-core = { workspace = true } + +[build-system] +requires = ["setuptools>=45", "wheel"] +build-backend = "setuptools.build_meta" + +[tool.setuptools] +package-dir = {"" = "src"} +``` + +The entry-point group `soda.plugins.data_source.{name}` is what makes your package discoverable — there is no central registry. Loading an entry point triggers `DataSourceImpl.__init_subclass__(model_class=...)`, which registers the impl class keyed by the `Literal[...]` default on your `DataSourceBase` subclass. At runtime, Soda matches the `type:` field in the user's YAML against that registry. By convention the entry-point `{name}` suffix matches the `Literal[...]` value, but the load-bearing match is YAML `type:` ↔ `Literal[...]` — not the entry-point name. Keep all three aligned for sanity. + +Then add `soda-{name}` to the workspace `pyproject.toml` (`[tool.uv.workspace] members`) and your `tests/` path to `pytest.ini` (`testpaths` and `pythonpath`). + +--- + +## Test the data source + +There are two test layers, and a serious contribution exercises both. + +### Data-source-specific tests + +Put anything genuinely unique to your engine — connection edge cases, dialect quirks, dependency behaviour — under `soda-{name}/tests/data_sources/test_{name}.py`. Most data-source-specific test files are short. + +### Cross-data-source test suite + +The shared suite under `soda-tests/tests/` runs **the same tests** against every registered data source. This is where coverage actually comes from. To plug your data source in: + +1. Implement a `{Name}DataSourceTestHelper(DataSourceTestHelper)` under `src/soda_{name}/test_helpers/`. At minimum override `_create_data_source_yaml_str()` to render a valid YAML config. Override `_create_database_name()` / `_create_schema_name()` if your engine has special naming rules. +2. Register your helper inside `DataSourceTestHelper.create()` in `soda-tests/src/helpers/data_source_test_helper.py` — add an `elif` branch for your `{name}`. +3. Add your `tests/` path to `testpaths` and `pythonpath` in `pytest.ini`. + +To run the full suite against your engine: + +```bash +TEST_DATASOURCE={name} uv run pytest soda-tests/tests +``` + +Credentials live in a `.env` file (kept out of git) — see `.env_example`. + +--- + +## Development setup + +```bash +# from repo root +uv sync --all-packages --group dev + +# format + lint +uv run pre-commit run --all-files + +# unit tests (don't need a database) +uv run pytest soda-tests/tests/unit + +# your data source tests +uv run pytest soda-{name}/tests +``` + +--- + +## Submitting + +1. Open a draft PR early. Soda maintainers can advise on tricky dialect issues before you sink time into them. +2. Include in the PR description: + - which DB version(s) you tested against, + - any features you deliberately did not implement, and + - any tests you skipped, with reasons. +3. Make sure the cross-data-source suite passes: `TEST_DATASOURCE={name} uv run pytest soda-tests/tests`. + +If you get stuck, the [agent reference](CONTRIBUTING-DATA-SOURCE.AGENT.md) catalogues every override, every common failure mode, and the verification commands maintainers will run on your PR. It's intentionally written so a coding agent can drive the implementation end-to-end with a human reviewing. diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 000000000..dfae7dd84 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,175 @@ +# Contributing to Soda Core + +Thanks for taking the time to contribute. Soda Core is an open-source data quality and data contract verification engine, and we welcome bug reports, fixes, documentation improvements, and new extensions. + +This guide covers everything you need to know to get a local development environment running, the conventions we follow, and how to submit a change. It is written for humans, but it is also intended to be readable by coding agents (Claude Code, Cursor, …) that may drive parts of a contribution end-to-end. + +--- + +## Where to ask questions + +- **Bug reports** and **feature requests**: open a [GitHub issue](https://github.com/sodadata/soda-core/issues). +- **General questions** about using Soda Core: join the [Soda Community Slack](https://soda-community.slack.com/). +- **Security issues**: do **not** open a public issue. Email security@soda.io. + +Before filing a bug, check existing issues and the [Soda documentation](https://docs.soda.io/soda-v4/) — your question may already be answered. + +--- + +## What you can contribute + +There are several ways to contribute to Soda Core. In rough order from smallest to largest: + +- **Bug fixes** and **documentation improvements** — always welcome. Open a PR directly. +- **New built-in check types** that are broadly useful (e.g. a new statistical check, a new validity rule). Discuss in an issue first so we can confirm the scope is open-source. +- **New data sources** — adding a new database to the supported list. This is the most common contribution from outside the core team. See [`CONTRIBUTING-DATA-SOURCE.md`](CONTRIBUTING-DATA-SOURCE.md) for the full guide. +- **External extension packages** — Soda Core has a public plugin system based on Python entry points. You can ship your own data source or check-type plugin as a separate PyPI package without merging anything upstream. The same base classes documented in [`CONTRIBUTING-DATA-SOURCE.md`](CONTRIBUTING-DATA-SOURCE.md) apply. + +For anything beyond a small bug fix, please open an issue first to align on the approach before writing code. + +--- + +## How extensions work + +Soda Core has two extension surfaces: + +1. **Data sources** — teach Soda how to connect to a database, render its SQL dialect, and query its metadata. Every contract verification runs through a `DataSourceImpl` + `SqlDialect` + `DataSourceConnection` triple. +2. **Check types** — add a new kind of check that contract authors can write in YAML. + +Both surfaces use the same mechanism: a Python entry point under the `soda.plugins.*` group, discovered at runtime by `soda_core.plugins.load_plugins()`. An extension package can ship one, the other, or both. + +For a step-by-step guide to building either, see [`CONTRIBUTING-DATA-SOURCE.md`](CONTRIBUTING-DATA-SOURCE.md) (the human guide) and [`CONTRIBUTING-DATA-SOURCE.AGENT.md`](CONTRIBUTING-DATA-SOURCE.AGENT.md) (the dense reference). Both files live in this repository. + +--- + +## Development setup + +### Requirements + +- Python 3.10 or newer. +- [UV](https://docs.astral.sh/uv/) (recommended) or pip 21.0+. + +### Install + +```bash +# clone the repo, then from the repo root: +uv sync --all-packages --group dev +``` + +This installs every package in the workspace (`soda-core`, `soda-tests`, all data source packages) in editable mode with their development dependencies. + +Pip equivalent (if you don't have UV). Unlike `uv sync --all-packages`, this only installs the packages you list — add `-e soda-{name}` for every driver you need: + +```bash +python -m venv .venv && source .venv/bin/activate +pip install -e soda-core -e soda-tests \ + -e soda-postgres # add -e soda-snowflake / soda-bigquery / ... as needed +pip install pytest pre-commit pydantic python-dotenv freezegun +``` + +### Tests + +```bash +# unit tests — fast, no database needed +uv run pytest soda-tests/tests/unit + +# data-source-specific tests (Postgres ships with a docker-compose.yml) +uv run pytest soda-postgres/tests + +# cross-data-source suite against a specific engine +TEST_DATASOURCE=postgres uv run pytest soda-tests/tests +``` + +Credentials for cloud-only data sources go in a `.env` file at the repo root — see `.env_example`. The file is gitignored. + +### What not to commit + +- **No credentials anywhere in tracked files.** `.env` is gitignored — keep it that way. Service-account JSON, API tokens, and passwords belong in `.env` (locally) or in CI secrets (in CI), never in code, YAML, or markdown. +- **No real credentials in `docker-compose.yml`.** Use placeholder values that are obviously placeholders (`POSTGRES_PASSWORD: postgres`). Reviewers will reject anything that looks like a real secret. +- **No personal absolute paths** (e.g. `/Users/you/...`) in committed code or docs. Use repo-relative paths. + +### Matching the soda-core version + +A new data source package must pin `soda-core` to the current workspace version. Find it with: + +```bash +grep -m1 '^version' soda-core/pyproject.toml +``` + +Use that exact string in your `pyproject.toml` dependencies and in your own `version` field. + +### Pre-commit checks + +We use `pre-commit` to enforce formatting and basic hygiene. Install the hook once: + +```bash +uv run pre-commit install +``` + +…and run it before pushing: + +```bash +uv run pre-commit run --all-files +``` + +--- + +## Code style + +Enforced automatically by pre-commit: + +- **black**, line length 120 +- **isort**, black profile +- **autoflake** for unused imports +- standard hygiene checks (trailing whitespace, YAML/JSON/TOML syntax, no debug statements) + +Beyond formatting: + +- Write SQL via the `sql_ast` builders in `soda_core.common.sql_ast`, not by string concatenation. The builders go through dialect rules; raw strings bypass them. +- Use frozen pydantic models for YAML-facing data shapes — see `soda_core/model/`. Top-level model classes (e.g. `DataSourceBase`) use `extra="forbid"`; connection-properties subclasses use `extra="allow"` so unknown driver kwargs flow through unchanged. +- Type-hint public functions. The codebase is gradually moving toward stricter typing. +- Keep comments for *why*, not *what*. Code reviewers will push back on commentary that restates the code. + +--- + +## Submitting a pull request + +1. **Fork** the repo and create a topic branch. +2. **Open a draft PR early** if the change is more than a few lines. It costs nothing and lets maintainers flag dead ends before you sink time into them. +3. **Write a clear PR description**: + - what the change does and why, + - which tests cover it, + - anything you deliberately did *not* do (and why), + - any tests you skipped, with reasons. +4. **Run the checks locally**: + ```bash + uv run pre-commit run --all-files + uv run pytest soda-tests/tests/unit + # plus any data-source or integration tests relevant to your change + ``` +5. **Keep the PR focused**. One logical change per PR. Unrelated cleanups go in a separate PR. + +We squash-merge most PRs. Your commit history within the PR doesn't need to be pristine, but the final squashed commit message should be. + +--- + +## Versioning and releases + +Soda Core uses semantic versioning, managed via `tbump`. Releases are cut by Soda maintainers — contributors do not need to update version numbers in their PRs. + +--- + +## License + +Soda Core is licensed under the Apache License 2.0. By submitting a contribution, you agree that your work will be released under the same license. See [`LICENSE`](LICENSE) for the full text. + +--- + +## A note for coding agents + +If you are a coding agent (Claude Code, Cursor, etc.) driving a contribution: + +- **Read the base classes before writing code.** Headers and signatures are documented in [`CONTRIBUTING-DATA-SOURCE.AGENT.md`](CONTRIBUTING-DATA-SOURCE.AGENT.md), but the actual code in `soda-core/src/soda_core/common/` is the source of truth. When the docs and the code disagree, the code wins. +- **Start from a working reference.** Copy `soda-duckdb/` (minimal) or `soda-postgres/` (full-featured) and adapt, rather than generating files from scratch. +- **Verify before claiming done.** Run `uv run pre-commit run --all-files` and the relevant pytest paths. Do not declare a task complete unless the commands actually pass. +- **Do not edit `soda-core/` itself** unless you found a real upstream bug. Extensions live in their own packages.