The objective is to provision a scalable, real-time data analytics pipeline on Google Cloud Platform using Infrastructure as Code (IaC). This project automates the deployment of a Change Data Capture (CDC) system that streams changes from a Cloud SQL for MySQL source to a Cloud Pub/Sub topic; these change events are then ingested directly into a BigQuery table. The pipeline uses a Debezium Server running in a Docker container on a Google Compute Engine (GCE) VM, with the entire infrastructure managed by Terraform.
This project implements a modern, serverless-first CDC architecture for streaming database changes to a data warehouse. Instead of relying on an intermediate message broker (like Kafka) or complex ETL services (like Dataflow), this solution combines the open-source Debezium project with Google Cloud's managed services. A Debezium Server, running on a GCE instance, reads the change log from Cloud SQL and streams the events to a Pub/Sub topic. A BigQuery subscription then writes these events directly into a BigQuery table, making it ideal for real-time analytics and data warehousing with minimal operational overhead. This approach significantly simplifies the data pipeline, reduces latency, and lowers total cost of ownership.
The Terraform configuration must provision and configure the following GCP services to work in concert:
| Component | Service | Requirement |
|---|---|---|
| Data Source | Cloud SQL for MySQL | A Cloud SQL instance, accessible only via a private network (no public IP), to act as the transactional database source. Must have an rdsadmin user for management and a dedicated datastream user for CDC. |
| CDC Engine Host | GCE VM | A Google Compute Engine (GCE) instance to host the Debezium Server in a Docker container. This instance must be located within the private network. |
| CDC Engine | Debezium Server (Docker) | A Debezium Server running as a Docker container on the GCE VM. It captures change logs from Cloud SQL and publishes them as messages to a Cloud Pub/Sub topic. |
| Messaging & Ingestion | Cloud Pub/Sub | A Pub/Sub topic to receive CDC events from Debezium. It decouples the source from the sink and provides a scalable, durable buffer. |
| Data Sink / Warehouse | BigQuery & BigQuery Subscription | A BigQuery dataset and table to store the raw CDC events. A BigQuery subscription connects the Pub/Sub topic directly to the BigQuery table for serverless ingestion. |
| Networking | VPC, Subnets, PSC, Cloud NAT | Secure private networking using a custom VPC. A Private Service Connect (PSC) endpoint provides a stable internal IP for the Cloud SQL instance. A Cloud NAT gateway is required for the GCE instance to download software packages and the Debezium Docker image. |
| Provisioning Helper | Google Cloud Storage | A temporary GCS bucket is created and deleted during the terraform apply and destroy processes. It is used to copy Debezium configuration files from the local machine to the GCE VM. |
| Security & Permissions | IAM (Identity and Access Management) | A set of fine-grained IAM roles and permissions to ensure that each service (GCE, Pub/Sub) has only the access it needs to function, following the principle of least privilege. |
- Debezium-based CDC: Utilizes the open-source Debezium engine, providing flexibility and reducing vendor lock-in.
- Serverless Sink with Pub/Sub and BigQuery: CDC events are streamed to Pub/Sub and ingested into BigQuery via a direct subscription, creating a highly scalable, low-maintenance, and cost-effective data sink. This eliminates the need for a separate data processing layer like Dataflow.
- Decoupled Architecture: Pub/Sub acts as a message bus, decoupling the Debezium producer from the BigQuery consumer. This allows for greater resilience and flexibility.
- Secure by Default: All components operate within a private VPC, with no public IPs for the database or the main GCE VM, enhancing security. Cloud SQL is accessed via a secure PSC endpoint.
- Infrastructure as Code (IaC) Management: The entire infrastructure is defined and managed using Terraform, enabling consistent, repeatable, and version-controlled deployments.
- Automated Provisioning: The solution includes scripts and Terraform configurations to automate the entire setup process, from network creation to Debezium Server configuration.
- The entire infrastructure must be defined in HashiCorp Configuration Language (HCL) for Terraform.
- The code must be modular and reusable, with separate modules for networking, data source, GCE host, and the data sink.
- A GCS bucket must be used as the backend for storing the Terraform state file (`terraform.tfstate`), ensuring state is managed remotely and securely.
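As a minimal sketch, the remote-state requirement could be satisfied with a backend block like the following (the bucket name and prefix are placeholders, not values from this project):

```hcl
# backend.tf -- remote Terraform state in GCS
# NOTE: bucket name and prefix are illustrative; the bucket must exist
# before running `terraform init`.
terraform {
  backend "gcs" {
    bucket = "my-tf-state-bucket"
    prefix = "cdc-pipeline/state"
  }
}
```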
- Principle of Least Privilege: All service accounts must be granted only the IAM permissions necessary for their function.
- The `datastream` database user should have the minimal required permissions for replication (`REPLICATION SLAVE`, `SELECT`, `REPLICATION CLIENT`).
- The GCE instance's service account requires `pubsub.publisher` rights to send messages.
- The Pub/Sub service agent requires `bigquery.dataEditor` rights to write to the sink table.
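The two IAM grants could be sketched roughly as follows. The resource references (`google_service_account.debezium`, `google_bigquery_dataset.debezium_sink`, `data.google_project.this`) are assumed names, not taken from this project's code; the Pub/Sub service agent follows the `service-{PROJECT_NUMBER}@gcp-sa-pubsub.iam.gserviceaccount.com` pattern:

```hcl
# Allow the GCE VM's service account to publish CDC events to Pub/Sub
resource "google_project_iam_member" "debezium_publisher" {
  project = var.project_id
  role    = "roles/pubsub.publisher"
  member  = "serviceAccount:${google_service_account.debezium.email}"
}

# Allow the Pub/Sub service agent to write into the BigQuery sink dataset
resource "google_bigquery_dataset_iam_member" "pubsub_writer" {
  project    = var.project_id
  dataset_id = google_bigquery_dataset.debezium_sink.dataset_id
  role       = "roles/bigquery.dataEditor"
  member     = "serviceAccount:service-${data.google_project.this.number}@gcp-sa-pubsub.iam.gserviceaccount.com"
}
```

Scoping the `bigquery.dataEditor` grant to the dataset rather than the project keeps the binding aligned with the least-privilege requirement above.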
- Private Networking: All communication between GCP services must occur over the private network. The Cloud SQL instance must have its public IP disabled.
- Firewall Rules: VPC firewall rules must be configured to:
  - Allow all internal traffic within the VPC's private subnet (`10.1.0.0/16`).
  - Allow SSH traffic (`tcp:22`) only from Google's Identity-Aware Proxy (IAP) service range (`35.235.240.0/20`), ensuring no direct external SSH access.
- Secrets Management: Database credentials for the `rdsadmin` and `datastream` users should be generated at runtime by Terraform and passed securely.
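The two firewall rules above can be sketched as follows (rule names and the network reference `google_compute_network.vpc` are illustrative):

```hcl
# Allow all traffic between hosts inside the private subnet
resource "google_compute_firewall" "allow_internal" {
  name          = "allow-internal"
  network       = google_compute_network.vpc.id
  source_ranges = ["10.1.0.0/16"]

  allow {
    protocol = "all"
  }
}

# Allow SSH only from Google's IAP TCP-forwarding range,
# so the VM is reachable via `gcloud compute ssh --tunnel-through-iap`
resource "google_compute_firewall" "allow_iap_ssh" {
  name          = "allow-iap-ssh"
  network       = google_compute_network.vpc.id
  source_ranges = ["35.235.240.0/20"]

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }
}
```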
- The Terraform project must be highly configurable through variables (`.tfvars`).
- Key parameters such as `project_id`, `region`, instance names, and database settings must be externalized from the core logic.
- Performance and cost-related parameters (e.g., Cloud SQL `db_tier`, GCE `machine_type`) must be configurable.
- Core Input Variables: The root module must accept the following core variables to define the environment:
  - `project_id`: The target GCP project ID.
  - `region`: The GCP region for resource deployment.
  - `zone`: The GCP zone for the GCE VM.
  - `db_name`: The name of the database to be monitored for changes.
  - `debezium_server_name`: A logical name for the Debezium server, used as a prefix for Pub/Sub topics.
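The core inputs above would translate to a `variables.tf` along these lines (the defaults shown are illustrative, not project values):

```hcl
# variables.tf -- core inputs for the root module
variable "project_id" {
  description = "The target GCP project ID."
  type        = string
}

variable "region" {
  description = "The GCP region for resource deployment."
  type        = string
  default     = "us-central1" # illustrative default
}

variable "zone" {
  description = "The GCP zone for the GCE VM."
  type        = string
  default     = "us-central1-a" # illustrative default
}

variable "db_name" {
  description = "The name of the database to be monitored for changes."
  type        = string
}

variable "debezium_server_name" {
  description = "Logical name for the Debezium server; used as a prefix for Pub/Sub topics."
  type        = string
}
```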
- All resources should follow a consistent naming convention to ensure clarity and manageability.
- Format: Resource names are typically composed of a prefix and a resource-specific name.
Note: This is a functional requirement that is not yet implemented in the current Terraform code.
- Basic monitoring and alerting must be configured to ensure pipeline reliability.
- GCE VM Health: Alerts should be created if the GCE instance's CPU utilization exceeds a defined threshold or if the instance becomes unhealthy.
- Pub/Sub & BigQuery: A Cloud Monitoring alert should be configured for the BigQuery subscription to detect a growing backlog of unacknowledged messages or a high number of messages being sent to the Dead Letter Queue (DLQ).
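Since alerting is noted elsewhere as not yet implemented, the backlog alert could be sketched roughly as follows (display names, the 1000-message threshold, and the 5-minute window are all illustrative assumptions):

```hcl
# Alert when the BigQuery subscription accumulates unacked messages
resource "google_monitoring_alert_policy" "bq_subscription_backlog" {
  display_name = "BigQuery subscription backlog" # illustrative
  combiner     = "OR"

  conditions {
    display_name = "Unacked messages above threshold"
    condition_threshold {
      filter          = "resource.type = \"pubsub_subscription\" AND metric.type = \"pubsub.googleapis.com/subscription/num_undelivered_messages\""
      comparison      = "COMPARISON_GT"
      threshold_value = 1000
      duration        = "300s"

      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_MAX"
      }
    }
  }
}
```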
- MySQL Version: `MYSQL_8_0`
- Tier: Must be a configurable variable (e.g., `db-n1-standard-2`).
- CDC Configuration: The instance must be configured to support CDC through `backup_configuration.binary_log_enabled` and appropriate `database_flags` (e.g., `binlog_row_image = "full"`).
- PSC Enabled: The instance must be configured to allow Private Service Connect, and a PSC endpoint is created to provide a stable internal IP.
- Database Users: `root`, `rdsadmin`, and `datastream` users must be created with unique, randomly generated passwords.
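A sketch of the Cloud SQL instance satisfying these requirements might look like the following (the instance name and `var.db_tier` are assumed names; note that Cloud SQL's PSC support has edition and availability constraints worth checking against the provider documentation):

```hcl
resource "google_sql_database_instance" "source" {
  name             = "cdc-mysql-source" # illustrative
  database_version = "MYSQL_8_0"
  region           = var.region

  settings {
    tier = var.db_tier # e.g. "db-n1-standard-2"

    ip_configuration {
      ipv4_enabled = false # no public IP
      psc_config {
        psc_enabled               = true
        allowed_consumer_projects = [var.project_id]
      }
    }

    backup_configuration {
      enabled            = true
      binary_log_enabled = true # required for Debezium CDC
    }

    database_flags {
      name  = "binlog_row_image"
      value = "full"
    }
  }
}
```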
- Startup Script: A comprehensive startup script automates the installation of:
- MySQL Client: For database interaction.
- Docker Engine: To run the Debezium server.
- Debezium Configuration: Configuration files from the local `debezium-server/config` directory are uploaded to a temporary GCS bucket and then copied to `/opt/debezium-server` on the VM during provisioning.
- Debezium Image: The `debezium/server:3.0.0.Final` Docker image is pulled on startup.
- Debezium Persistence: To ensure Debezium can resume from its last position after a restart, its state must be persisted on the GCE VM.
  - `debezium.source.offset.storage.file.filename`: CDC offsets are stored in a local file at `data/offset.dat` inside the container.
  - `debezium.source.schema.history.internal.file.filename`: Schema history is stored in a local file at `data/schema_history.dat` inside the container.
  - These container paths must be mounted as volumes to a persistent directory on the GCE host (e.g., `$HOME/data`).
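The volume mounts could be wired into the VM's startup script roughly as follows. Host paths are illustrative; the Debezium Server image conventionally reads configuration from `/debezium/config` and writes state under `/debezium/data`:

```hcl
# Fragment of the GCE instance definition (a sketch, not the project's
# actual startup script).
resource "google_compute_instance" "debezium_host" {
  # ... machine_type, boot_disk, network_interface omitted ...

  metadata_startup_script = <<-EOT
    mkdir -p /opt/debezium-server/data
    docker run -d --name debezium-server \
      --restart unless-stopped \
      -v /opt/debezium-server/config:/debezium/config \
      -v /opt/debezium-server/data:/debezium/data \
      debezium/server:3.0.0.Final
  EOT
}
```

Mounting the host's `data` directory over the container's `data` path is what lets `offset.dat` and `schema_history.dat` survive container restarts.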
- Topic Creation: A Pub/Sub topic is created to receive all CDC events for a specific table, following the naming convention `{topic-prefix}.{db-name}.{table-name}`.
- Dead Letter Queue (DLQ): A DLQ topic and subscription are created to capture messages that fail to be written to BigQuery after a set number of retries, allowing for manual inspection and reprocessing.
- Dataset and Table: A BigQuery dataset (`debezium_sink`) and a table (`cdc_events`) are created to store the incoming data.
- Table Schema: The table schema is predefined to match the flattened JSON structure of the Debezium CDC events after the event-flattening SMT is applied. The BigQuery subscription is configured with `use_table_schema = true`, meaning it writes data by matching the JSON fields from the Pub/Sub message directly to the columns of the BigQuery table.
- Partitioning: To optimize query performance and manage costs, the table is partitioned by time. It is crucial to use a `TIMESTAMP` or `DATETIME` column from the source table as the partitioning key. In this project, the table is partitioned by the `trans_datetime` column (Day).
- BigQuery Subscription: A `google_pubsub_subscription` is configured with a `bigquery_config` block to write messages directly from the Pub/Sub topic to the target table.
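Putting the subscription, the BigQuery sink, and the DLQ together might look like this sketch (the subscription name, the topic references `google_pubsub_topic.cdc_events` / `google_pubsub_topic.dlq`, and the retry count are assumptions):

```hcl
resource "google_pubsub_subscription" "bq_sink" {
  name  = "${var.debezium_server_name}-bq-sink" # illustrative
  topic = google_pubsub_topic.cdc_events.id

  # Write messages straight into BigQuery, matching JSON fields
  # to the table's columns.
  bigquery_config {
    table               = "${var.project_id}.debezium_sink.cdc_events"
    use_table_schema    = true
    drop_unknown_fields = true
  }

  # Route messages that repeatedly fail delivery to the DLQ topic.
  dead_letter_policy {
    dead_letter_topic     = google_pubsub_topic.dlq.id
    max_delivery_attempts = 5
  }
}
```

With `use_table_schema = true`, any field in the flattened Debezium payload without a matching column would be rejected unless `drop_unknown_fields` is set, which is why the two options are shown together here.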
- `google_sql_database_instance`
- `google_sql_user`
- `google_compute_instance`
- `google_storage_bucket`
- `google_project_iam_member`
- `google_compute_network`
- `google_compute_subnetwork`
- `google_compute_firewall`
- `google_compute_forwarding_rule` (for PSC)
- `google_compute_router`
- `google_compute_router_nat`
- `google_pubsub_topic`
- `google_pubsub_subscription`
- `google_bigquery_dataset`
- `google_bigquery_table`
- `google_bigquery_dataset_iam_member`
- `google_monitoring_alert_policy` (Note: Not yet implemented)
- A complete set of Terraform files (`.tf`) to provision the entire pipeline.
- A `variables.tf` file defining all configurable parameters.
- An example `terraform.tfvars.example` file showing users how to configure the project with sensible defaults.
- A `README.md` file with detailed instructions on how to initialize, plan, and apply the Terraform configuration.
- Required Outputs: The root Terraform module must output the following values after a successful `apply` for operational purposes:
  - Cloud SQL Instance Name & Private IP
  - Cloud SQL PSC Endpoint IP
  - GCE VM Instance Name & Private IP
  - Admin User Name (`rdsadmin`) & Password (sensitive)
  - Debezium User Name (`datastream`) & Password (sensitive)
  - Pub/Sub Topic Name
  - BigQuery Dataset ID & Table ID
  - BigQuery Subscription Name
  - Dead Letter Queue (DLQ) Topic & Subscription Names
- Acceptance Criteria: A successful deployment is defined by the following conditions being met:
  - Users can connect to the Cloud SQL instance via Cloud SQL Studio using the `rdsadmin` credentials and successfully execute SQL commands.
  - The Debezium Server Docker container is running on the GCE VM, confirmed via `docker ps`.
  - An `INSERT` statement executed in Cloud SQL results in a corresponding record appearing in the BigQuery `cdc_events` table within minutes.
  - The columns in the BigQuery record contain the flattened data from the Debezium JSON payload, matching the source table's structure.
- Custom Debezium connector development.
- CI/CD automation for deploying the Terraform infrastructure.
- Complex data transformations (the pipeline is designed for raw ingestion).
- Automated schema evolution handling in BigQuery.
- Creation of applications that consume data from the BigQuery table.