Zingage-Inspired ETL & Analytics System

A production-minded ETL pipeline that demonstrates data engineering maturity with idempotent merges, field ownership enforcement, derived metrics, anomaly detection, and freshness tracking.

📋 Assignment Answers: See docs/DOCUMENTATION.md for complete answers to all assignment questions (Steps 1-4), including analytical queries with explanations, schema design rationale, and assumptions.

🎯 Overview

This system implements a full ETL pipeline that:

  • Loads CSV data into staging tables
  • Applies idempotent merge with hash-based change detection
  • Enforces per-field source-of-truth ownership (EMR vs. Zingage)
  • Handles late-arriving data & backfills safely
  • Materializes derived analytics without mutating source truth
  • Flags anomalies instead of corrupting data
  • Tracks freshness & time-decay SLAs
  • Executes via orchestrated job graph

πŸ—οΈ Architecture

CSV Source β†’ Staging DB β†’ Deterministic Merge Jobs β†’ Warehouse Tables β†’ Derived Metrics β†’ Exports/Dashboards

Core Principles

  • Staging → Merge pattern: Never transform data in place
  • Idempotency: Hash-based change detection, update-on-change only
  • Ownership enforcement: EMR fields never overwritten by derived fields
  • Race safety: Single-writer merge jobs
  • Time-decay freshness: Actor-aware time budgets
  • Anomaly logging: Log issues instead of corrupting data
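The idempotency principle above rests on hash-based change detection: a row is only written when its content hash differs from the stored one. A minimal sketch (the real logic lives in src/lib/hash.ts; the serialization scheme and function names here are illustrative assumptions, not the project's actual API):

```typescript
import { createHash } from "crypto";

// Serialize the row's fields in a stable (sorted) key order, then hash.
// Sorting makes the hash independent of the order fields arrive in.
function rowHash(row: Record<string, unknown>): string {
  const stable = Object.keys(row)
    .sort()
    .map((k) => `${k}=${String(row[k] ?? "")}`)
    .join("|");
  return createHash("sha256").update(stable).digest("hex");
}

// A merge job writes only when the hash differs, so re-running the
// same input produces no new updates.
function needsUpdate(existingHash: string | null, incoming: Record<string, unknown>): boolean {
  return existingHash !== rowHash(incoming);
}
```

Re-running the pipeline with identical input then leaves every stored hash unchanged, which is what the idempotency test verifies.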

πŸ“ Project Structure

/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ etl/
β”‚   β”‚   β”œβ”€β”€ jobs/              # Individual ETL jobs
β”‚   β”‚   β”‚   β”œβ”€β”€ ingest_csv_staging.ts
β”‚   β”‚   β”‚   β”œβ”€β”€ merge_caregivers.ts
β”‚   β”‚   β”‚   β”œβ”€β”€ merge_visits.ts
β”‚   β”‚   β”‚   β”œβ”€β”€ compute_daily_hours.ts
β”‚   β”‚   β”‚   β”œβ”€β”€ anomaly_checks.ts
β”‚   β”‚   β”‚   └── freshness_status.ts
β”‚   β”‚   └── orchestrator.ts     # ETL DAG orchestration
β”‚   └── lib/
β”‚       β”œβ”€β”€ db.ts              # Database connection
β”‚       β”œβ”€β”€ logger.ts          # Logging utility
β”‚       β”œβ”€β”€ hash.ts            # Hash generation for change detection
β”‚       └── types.ts           # TypeScript type definitions
β”œβ”€β”€ migrations/                # Knex database migrations
β”œβ”€β”€ tests/                     # Test scripts
β”œβ”€β”€ data/                      # CSV input files (create this directory)
β”œβ”€β”€ docker/                    # Docker configuration
└── sql/                       # SQL initialization scripts

🚀 Quick Start

Prerequisites

  • Docker and Docker Compose
  • Node.js 20+ (for local development)
  • npm or yarn

Complete Setup

  1. Clone and install dependencies:

    npm install
  2. Start Docker services:

    docker-compose up -d

    This starts:

    • PostgreSQL (port 5432)
    • pgAdmin (port 5050)
    • ETL container (for running jobs)

    Wait for services to be healthy (~10-15 seconds). Verify with:

    docker-compose ps
  3. Run database migrations:

    npm run migrate

    This creates all necessary tables: staging, warehouse, derived, and observability tables.

  4. Ingest CSV data:

    Place your CSV files in the root directory or data/ directory:

    • caregiver_data_20250415_sanitized.csv
    • carelog_data_20250415_sanitized.csv
  5. Run the full ETL pipeline:

    npm run etl:full

    This executes the complete ETL DAG:

    1. CSV ingestion to staging
    2. Duplicate detection (caregivers & visits)
    3. Merge caregivers
    4. Merge visits (first pass)
    5. Resolve orphaned visits
    6. Re-merge visits (with resolutions)
    7. Compute daily hours
    8. Anomaly checks
    9. Freshness status update

Frontend Setup (Optional)

  1. Navigate to frontend directory:

    cd frontend/frontend
  2. Install dependencies:

    npm install
  3. Start frontend dev server:

    npm run dev

    Frontend runs on http://localhost:5173

Backend API Server (Required for Frontend)

  1. From project root, start backend server:

    npm run start:server

    Backend runs on http://localhost:4000

  2. Configure frontend (if needed):

    • Create frontend/frontend/.env file:
      VITE_API_BASE_URL=http://localhost:4000
      

📊 ETL Job DAG

The pipeline executes the following jobs in order:

  1. ingest_csv_staging: Loads CSV files into staging tables
  2. detect_duplicate_caregivers: Detects and removes duplicate caregivers (exact + semantic)
  3. detect_duplicate_visits: Detects and removes duplicate visits (exact + overlapping)
  4. merge_caregivers: Idempotent merge of caregivers with hash-based change detection
  5. merge_visits (first pass): Idempotent merge of visits with derived fields (duration, flags)
  6. resolve_orphaned_visits: Resolves orphaned visits using multiple strategies
  7. merge_visits (re-merge): Re-merges resolved visits with resolution metadata
  8. compute_daily_hours: Aggregates visit durations per caregiver per date
  9. anomaly_checks: Detects and logs data quality issues
  10. freshness_status: Updates freshness metadata and SLA status
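The job sequence above can be sketched as a simple sequential runner. This is an illustrative sketch only; the project's actual orchestrator.ts may differ in shape and error handling:

```typescript
// A job in the DAG: a name plus an async unit of work.
interface Job {
  name: string;
  run: () => Promise<void>;
}

// Run jobs strictly in order; a failing job rejects and halts
// everything downstream, so partial runs never skip dependencies.
async function runPipeline(
  jobs: Job[],
  log: (msg: string) => void = console.log
): Promise<void> {
  for (const job of jobs) {
    log(`starting ${job.name}`);
    await job.run();
    log(`finished ${job.name}`);
  }
}
```

Running jobs single-file like this is also what gives the "single-writer merge jobs" race-safety guarantee mentioned under Core Principles.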

🔧 Usage

Individual Jobs: npm run etl:ingest, etl:merge-caregivers, etl:merge-visits, etl:daily-hours, etl:anomalies, etl:freshness
Testing: npm run test:etl, npm run test:idempotency
Database: PostgreSQL localhost:5432, pgAdmin http://localhost:5050

For complete usage guide, see docs/SETUP.md.

📈 Data Model

Staging: caregivers_staging, visits_staging (temporary)
Warehouse: caregivers, visits (source of truth)
Derived: caregiver_daily_hours (aggregated metrics)
Observability: anomaly_log, freshness_metadata, visit_duplicates, caregiver_duplicates, orphaned_visit_resolutions, unresolved_anomalies

For complete schema details, see docs/DOCUMENTATION.md section "Database Schema Reference".
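The observability tables above are populated by check functions that record problems instead of mutating warehouse rows. A hypothetical sketch of one such check (table and field names here are assumptions, not the actual anomaly_log schema):

```typescript
// One row destined for the anomaly log; the source record is left untouched.
interface AnomalyEntry {
  table_name: string;
  record_id: string;
  check_name: string;
  detail: string;
  detected_at: Date;
}

// Flag implausible visit durations (negative, or longer than 24 hours)
// by returning a log entry rather than "fixing" the value in place.
function checkDuration(recordId: string, durationMin: number | null): AnomalyEntry | null {
  if (durationMin === null || (durationMin >= 0 && durationMin <= 24 * 60)) return null;
  return {
    table_name: "visits",
    record_id: recordId,
    check_name: "implausible_duration",
    detail: `duration_minutes=${durationMin}`,
    detected_at: new Date(),
  };
}
```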

🔒 Field Ownership

| Field Category             | Owner       |
| -------------------------- | ----------- |
| Caregiver demographics     | EMR         |
| Raw clock times            | EMR         |
| Derived durations & flags  | Zingage ETL |
| Aggregations               | Zingage ETL |

Rule: EMR fields are never overwritten by derived fields. Only new derived fields are added.
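The rule can be enforced mechanically at merge time. A minimal sketch, assuming a per-field ownership map (the map contents and function names are illustrative, not the project's actual schema):

```typescript
type Owner = "emr" | "zingage";

// Illustrative ownership map: who is allowed to write each field.
const FIELD_OWNER: Record<string, Owner> = {
  first_name: "emr",
  clock_in_actual_datetime: "emr",
  duration_minutes: "zingage",
  is_late_start: "zingage",
};

// Apply only the fields the writing source actually owns; a derived
// (Zingage) job can never clobber an EMR-owned field.
function applyPatch(
  current: Record<string, unknown>,
  patch: Record<string, unknown>,
  source: Owner
): Record<string, unknown> {
  const next = { ...current };
  for (const [field, value] of Object.entries(patch)) {
    if ((FIELD_OWNER[field] ?? "zingage") === source) {
      next[field] = value;
    }
  }
  return next;
}
```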

⏱️ Freshness SLAs

| SLA Type      | Requirement                       |
| ------------- | --------------------------------- |
| App UX        | ≤ 5–15 min of staleness is acceptable |
| Ops dashboard | Hourly budget                     |
| Payroll       | Strict daily boundary             |

Freshness status is tracked in freshness_metadata:

  • fresh: Data is within acceptable SLA
  • stale: Data exceeds hourly budget
  • critical: Data exceeds daily boundary

🧪 Testing Strategy

The testing approach is designed for gradual verification:

  1. Test Full Pipeline (test:etl)

    • Runs complete ETL pipeline
    • Verifies record counts
    • Shows sample queries
  2. Test Idempotency (test:idempotency)

    • Runs pipeline twice with same data
    • Verifies no duplicates created
    • Tests hash-based change detection
  3. Manual Testing

    • Run individual jobs
    • Query database directly
    • Check anomaly logs
    • Monitor freshness status

πŸ“ CSV Format

CSV files should be placed in the root directory or data/ folder:

  • caregiver_data_20250415_sanitized.csv (or any file matching caregiver_data*.csv)
  • carelog_data_20250415_sanitized.csv (or any file matching carelog_data*.csv)

Expected CSV Structure

Caregivers CSV should include:

  • caregiver_id, email, first_name, last_name, phone_number, status, etc.

Visits CSV should include:

  • carelog_id, caregiver_id, parent_id, start_datetime, end_datetime, clock_in_actual_datetime, clock_out_actual_datetime, etc.

Timestamps should use the format YYYY-MM-DD HH:MM:SS (ISO 8601 with a space in place of the T separator).

Note: The system automatically detects CSV files in the root directory or data/ folder.

πŸ” Derived Metrics

Visit Completion: Scheduled time exists, clock-in/out both exist, duration > 5 minutes
Reliability: Missing clock-out, late start (>10 min), no-show
Documentation Quality: Comment ≥ 100 chars = detailed
Outliers: Negative or >24h duration, short duration (<5 min)
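The thresholds above can be computed from a visit's timestamps in one pass. A sketch assuming the CSV field names listed earlier; the actual merge_visits derivation may differ:

```typescript
interface Visit {
  start_datetime: Date;
  clock_in_actual_datetime: Date | null;
  clock_out_actual_datetime: Date | null;
  comment: string;
}

// Derive duration and quality flags without touching the raw EMR fields.
function deriveFlags(v: Visit) {
  const durationMin =
    v.clock_in_actual_datetime && v.clock_out_actual_datetime
      ? (v.clock_out_actual_datetime.getTime() - v.clock_in_actual_datetime.getTime()) / 60000
      : null;
  return {
    duration_minutes: durationMin,
    // Complete: both clocks present and duration > 5 minutes.
    is_complete: durationMin !== null && durationMin > 5,
    // Reliability: clocked in but never clocked out.
    missing_clock_out: v.clock_in_actual_datetime !== null && v.clock_out_actual_datetime === null,
    // Reliability: started more than 10 minutes after the scheduled start.
    is_late_start:
      v.clock_in_actual_datetime !== null &&
      (v.clock_in_actual_datetime.getTime() - v.start_datetime.getTime()) / 60000 > 10,
    // Outliers: negative, > 24h, or < 5 min duration.
    is_outlier: durationMin !== null && (durationMin < 0 || durationMin > 24 * 60 || durationMin < 5),
    // Documentation quality: comments of 100+ characters count as detailed.
    is_detailed_comment: v.comment.length >= 100,
  };
}
```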

For complete analytical queries, see docs/DOCUMENTATION.md section "Analytical Queries Explained".

πŸ› Troubleshooting

For troubleshooting help, see docs/SETUP.md section "Troubleshooting".

📦 Environment Variables

Create a .env file (use .env.example as template):

DB_HOST=localhost
DB_PORT=5432
DB_NAME=zingage_etl
DB_USER=postgres
DB_PASSWORD=postgres
CSV_DATA_DIR=./data
LOG_LEVEL=info

📚 Documentation (docs/)

  • docs/DOCUMENTATION.md: Complete answers to all assignment questions (Steps 1-4), analytical queries with explanations, schema reference, anomaly detection & resolution
  • docs/PIPELINE.md: Complete pipeline architecture, merge mechanics, design decisions
  • docs/SETUP.md: Setup instructions, usage guide, troubleshooting

🔄 Next Steps

  1. ✅ Frontend Integration: Frontend built and connected to backend API
  2. ✅ API Layer: REST API created for SQL execution and ETL job triggering
  3. Scheduling: Add cron/job scheduler for automated runs
  4. Monitoring: Add Prometheus metrics and alerting
  5. Backfill Support: Enhanced handling of historical data loads

📄 License

MIT
