A production-minded ETL pipeline that demonstrates data engineering maturity with idempotent merges, field ownership enforcement, derived metrics, anomaly detection, and freshness tracking.
**Assignment Answers:** See `docs/DOCUMENTATION.md` for complete answers to all assignment questions (Steps 1-4), including analytical queries with explanations, schema design rationale, and assumptions.
This system implements a full ETL pipeline that:
- Loads CSV data into staging tables
- Applies idempotent merge with hash-based change detection
- Enforces per-field source-of-truth ownership (EMR vs. Zingage)
- Handles late-arriving data & backfills safely
- Materializes derived analytics without mutating source truth
- Flags anomalies instead of corrupting data
- Tracks freshness & time-decay SLAs
- Executes via orchestrated job graph
CSV Source → Staging DB → Deterministic Merge Jobs → Warehouse Tables → Derived Metrics → Exports/Dashboards

- Staging → Merge pattern: never transform in place
- Idempotency: Hash-based change detection, update-on-change only
- Ownership enforcement: EMR fields never overwritten by derived fields
- Race safety: Single-writer merge jobs
- Time-decay freshness: Actor-aware time budgets
- Anomaly logging: Log issues instead of corrupting data
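The hash-based, update-on-change merge principle above can be sketched as follows. This is a minimal illustration, not the project's actual `src/lib/hash.ts`; the row shape and field names are assumptions.

```typescript
import { createHash } from "node:crypto";

// Hypothetical row shape; the real schema may differ.
interface CaregiverRow {
  caregiver_id: string;
  first_name: string;
  last_name: string;
  status: string;
}

// Deterministic hash over the owned fields: the same input always yields
// the same hash, so a re-run produces identical hashes and the merge
// skips the write entirely — this is what makes the merge idempotent.
function rowHash(row: CaregiverRow): string {
  const canonical = JSON.stringify([
    row.caregiver_id,
    row.first_name,
    row.last_name,
    row.status,
  ]);
  return createHash("sha256").update(canonical).digest("hex");
}

// Update-on-change only: compare the incoming hash with the stored one.
function needsUpdate(storedHash: string | null, incoming: CaregiverRow): boolean {
  return storedHash !== rowHash(incoming);
}
```

Running the pipeline twice with identical CSVs then results in zero warehouse updates on the second pass, since every stored hash matches its incoming row.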
```
/
├── src/
│   ├── etl/
│   │   ├── jobs/                  # Individual ETL jobs
│   │   │   ├── ingest_csv_staging.ts
│   │   │   ├── merge_caregivers.ts
│   │   │   ├── merge_visits.ts
│   │   │   ├── compute_daily_hours.ts
│   │   │   ├── anomaly_checks.ts
│   │   │   └── freshness_status.ts
│   │   └── orchestrator.ts        # ETL DAG orchestration
│   └── lib/
│       ├── db.ts                  # Database connection
│       ├── logger.ts              # Logging utility
│       ├── hash.ts                # Hash generation for change detection
│       └── types.ts               # TypeScript type definitions
├── migrations/                    # Knex database migrations
├── tests/                         # Test scripts
├── data/                          # CSV input files (create this directory)
├── docker/                        # Docker configuration
└── sql/                           # SQL initialization scripts
```
- Docker and Docker Compose
- Node.js 20+ (for local development)
- npm or yarn
1. Clone and install dependencies:

   `npm install`

2. Start Docker services:

   `docker-compose up -d`

   This starts:
   - PostgreSQL (port 5432)
   - pgAdmin (port 5050)
   - ETL container (for running jobs)

   Wait for services to become healthy (~10-15 seconds), then verify with:

   `docker-compose ps`

3. Run database migrations:

   `npm run migrate`

   This creates all necessary tables: staging, warehouse, derived, and observability tables.

4. Ingest CSV data. Place your CSV files in the root directory or the `data/` directory:

   - `caregiver_data_20250415_sanitized.csv`
   - `carelog_data_20250415_sanitized.csv`

5. Run the full ETL pipeline:

   `npm run etl:full`
This executes the complete ETL DAG:
- CSV ingestion to staging
- Duplicate detection (caregivers & visits)
- Merge caregivers
- Merge visits (first pass)
- Resolve orphaned visits
- Re-merge visits (with resolutions)
- Compute daily hours
- Anomaly checks
- Freshness status update
1. Navigate to the frontend directory:

   `cd frontend/frontend`

2. Install dependencies:

   `npm install`

3. Start the frontend dev server:

   `npm run dev`

   The frontend runs on `http://localhost:5173`.

4. From the project root, start the backend server:

   `npm run start:server`

   The backend runs on `http://localhost:4000`.

5. Configure the frontend (if needed) by creating a `frontend/frontend/.env` file:

   `VITE_API_BASE_URL=http://localhost:4000`
The pipeline executes the following jobs in order:
- ingest_csv_staging: Loads CSV files into staging tables
- detect_duplicate_caregivers: Detects and removes duplicate caregivers (exact + semantic)
- detect_duplicate_visits: Detects and removes duplicate visits (exact + overlapping)
- merge_caregivers: Idempotent merge of caregivers with hash-based change detection
- merge_visits (first pass): Idempotent merge of visits with derived fields (duration, flags)
- resolve_orphaned_visits: Resolves orphaned visits using multiple strategies
- merge_visits (re-merge): Re-merges resolved visits with resolution metadata
- compute_daily_hours: Aggregates visit durations per caregiver per date
- anomaly_checks: Detects and logs data quality issues
- freshness_status: Updates freshness metadata and SLA status
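The ordered job list above amounts to a linear DAG executed one job at a time. A minimal sketch of that execution loop is shown below; the `Job` type and `runPipeline` helper are illustrative assumptions, not the actual `orchestrator.ts` API.

```typescript
// Hypothetical job interface; each job is idempotent, so a failed
// pipeline can simply be re-run from the start without duplicating data.
type Job = { name: string; run: () => Promise<void> };

// Execute jobs strictly in order (single-writer: no two merge jobs
// touch the warehouse concurrently), returning the completed job names.
async function runPipeline(jobs: Job[]): Promise<string[]> {
  const completed: string[] = [];
  for (const job of jobs) {
    await job.run();
    completed.push(job.name);
  }
  return completed;
}
```

Because every job is safe to re-run, retry logic can stay trivial: on failure, rerun the whole pipeline rather than tracking partial state.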
- **Individual jobs:** `npm run etl:ingest`, `etl:merge-caregivers`, `etl:merge-visits`, `etl:daily-hours`, `etl:anomalies`, `etl:freshness`
- **Testing:** `npm run test:etl`, `npm run test:idempotency`
- **Database:** PostgreSQL at `localhost:5432`, pgAdmin at `http://localhost:5050`
For complete usage guide, see docs/SETUP.md.
- **Staging:** `caregivers_staging`, `visits_staging` (temporary)
- **Warehouse:** `caregivers`, `visits` (source of truth)
- **Derived:** `caregiver_daily_hours` (aggregated metrics)
- **Observability:** `anomaly_log`, `freshness_metadata`, `visit_duplicates`, `caregiver_duplicates`, `orphaned_visit_resolutions`, `unresolved_anomalies`
For complete schema details, see docs/DOCUMENTATION.md section "Database Schema Reference".
| Field Category | Owner |
|---|---|
| Caregiver demographics | EMR |
| Raw clock times | EMR |
| Derived durations & flags | Zingage ETL |
| Aggregations | Zingage ETL |
Rule: EMR fields are never overwritten by derived fields. Only new derived fields are added.
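The ownership rule above can be enforced mechanically at merge time by filtering patches against a set of EMR-owned columns. The sketch below is illustrative: the field lists and function names are assumptions, not the project's actual configuration.

```typescript
// Assumed EMR-owned columns; the real list would come from the
// ownership table above (demographics + raw clock times).
const EMR_OWNED = new Set([
  "first_name",
  "last_name",
  "clock_in_actual_datetime",
  "clock_out_actual_datetime",
]);

// Apply a derived-field patch to an existing warehouse row, dropping
// any key the ETL does not own so EMR truth is never overwritten.
function applyDerivedPatch(
  existing: Record<string, unknown>,
  patch: Record<string, unknown>
): Record<string, unknown> {
  const safe: Record<string, unknown> = { ...existing };
  for (const [key, value] of Object.entries(patch)) {
    if (!EMR_OWNED.has(key)) safe[key] = value;
  }
  return safe;
}
```

With this guard in the merge path, a buggy downstream job can at worst write wrong derived values; it can never clobber source-of-truth EMR data.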
| SLA Type | Requirement |
|---|---|
| App UX | ≤ 5–15 min staleness acceptable |
| Ops dashboard | Hourly freshness budget |
| Payroll | Strict daily boundary |
Freshness status is tracked in freshness_metadata:
- `fresh`: data is within the acceptable SLA
- `stale`: data exceeds the hourly budget
- `critical`: data exceeds the daily boundary
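A time-decay classification like this reduces to comparing a row's age against the hourly and daily budgets. The sketch below is a minimal illustration with assumed thresholds (1 hour, 24 hours), not the actual `freshness_status.ts` logic.

```typescript
type Freshness = "fresh" | "stale" | "critical";

// Classify a record's freshness from its last-updated timestamp.
// Thresholds are illustrative stand-ins for the SLA budgets above.
function classifyFreshness(lastUpdated: Date, now: Date): Freshness {
  const HOUR_MS = 60 * 60 * 1000;
  const ageMs = now.getTime() - lastUpdated.getTime();
  if (ageMs > 24 * HOUR_MS) return "critical"; // past the daily boundary
  if (ageMs > HOUR_MS) return "stale";         // past the hourly budget
  return "fresh";                              // within SLA
}
```

A job can run this per table and persist the result into `freshness_metadata`, so dashboards read a precomputed status instead of recomputing age on every query.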
The testing approach is designed for gradual verification:
1. **Test full pipeline** (`test:etl`):
   - Runs the complete ETL pipeline
   - Verifies record counts
   - Shows sample queries

2. **Test idempotency** (`test:idempotency`):
   - Runs the pipeline twice with the same data
   - Verifies no duplicates are created
   - Tests hash-based change detection

3. **Manual testing:**
   - Run individual jobs
   - Query the database directly
   - Check anomaly logs
   - Monitor freshness status
CSV files should be placed in the root directory or the `data/` folder:

- `caregiver_data_20250415_sanitized.csv` (or any file matching `caregiver_data*.csv`)
- `carelog_data_20250415_sanitized.csv` (or any file matching `carelog_data*.csv`)
The caregivers CSV should include: `caregiver_id`, `email`, `first_name`, `last_name`, `phone_number`, `status`, etc.

The visits CSV should include: `carelog_id`, `caregiver_id`, `parent_id`, `start_datetime`, `end_datetime`, `clock_in_actual_datetime`, `clock_out_actual_datetime`, etc.
Timestamps should use the `YYYY-MM-DD HH:MM:SS` format (ISO 8601 with a space separator).
Note: the system automatically detects CSV files in the root directory or the `data/` folder.
- **Visit completion:** scheduled time exists, clock-in/out both exist, duration > 5 minutes
- **Reliability:** missing clock-out, late start (> 10 min), no-show
- **Documentation quality:** comments ≥ 100 characters count as detailed
- **Outliers:** negative or > 24 h duration, short duration (< 5 min)
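The outlier and reliability rules above can be expressed as pure flag functions, which keeps anomaly detection testable and lets the pipeline log flags instead of mutating data. This sketch is illustrative; the actual `anomaly_checks.ts` may use different names and thresholds.

```typescript
// Hypothetical minimal visit shape for anomaly checks.
interface VisitTimes {
  clockIn: Date | null;
  clockOut: Date | null;
}

// Return a list of anomaly flags for one visit; an empty list means clean.
// Flags are logged to anomaly_log — the visit row itself is never altered.
function visitAnomalies(v: VisitTimes): string[] {
  const flags: string[] = [];
  if (!v.clockOut) flags.push("missing_clock_out");
  if (v.clockIn && v.clockOut) {
    const minutes = (v.clockOut.getTime() - v.clockIn.getTime()) / 60_000;
    if (minutes < 0) flags.push("negative_duration");
    else if (minutes < 5) flags.push("short_duration");
    if (minutes > 24 * 60) flags.push("over_24h_duration");
  }
  return flags;
}
```

Because the function only reports flags, a bad clock value surfaces in the anomaly log for review rather than silently corrupting derived hours.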
For complete analytical queries, see docs/DOCUMENTATION.md section "Analytical Queries Explained".
For troubleshooting help, see docs/SETUP.md section "Troubleshooting".
Create a `.env` file (use `.env.example` as a template):

```
DB_HOST=localhost
DB_PORT=5432
DB_NAME=zingage_etl
DB_USER=postgres
DB_PASSWORD=postgres
CSV_DATA_DIR=./data
LOG_LEVEL=info
```

- `docs/DOCUMENTATION.md`: complete answers to all assignment questions (Steps 1-4), analytical queries with explanations, schema reference, anomaly detection & resolution
- `docs/PIPELINE.md`: complete pipeline architecture, merge mechanics, design decisions
- `docs/SETUP.md`: setup instructions, usage guide, troubleshooting
- ✅ Frontend Integration: frontend built and connected to the backend API
- ✅ API Layer: REST API created for SQL execution and ETL job triggering
- Scheduling: Add cron/job scheduler for automated runs
- Monitoring: Add Prometheus metrics and alerting
- Backfill Support: Enhanced handling of historical data loads
MIT