datadriven-io/data-engineering-interview-questions

Data Engineering Interview Questions

1418 tagged practice problems for data engineering interviews. SQL, Python, schema design, pipeline architecture. Each problem links to a runnable browser sandbox.

SQL · Python · Schema design · Pipeline architecture · Companion repos


| Section | Count | Browse |
| --- | --- | --- |
| SQL | 854 | datadriven.io/sql-interview-questions |
| Python | 388 | datadriven.io/python-interview-questions |
| Schema design | 56 | datadriven.io/data-modeling-interview-questions |
| Pipeline architecture | 120 | datadriven.io/data-pipeline-interview-questions |
| Total | 1418 | |

Every problem runs in a browser sandbox with the schema preloaded. No local setup. Each question is tagged with difficulty, what it tests, and the common trap.

SQL (854 problems)

Topics: joins, aggregation, window functions, filtering, dates, conditional aggregation, CTEs, performance reasoning. Topic browser at datadriven.io/sql-interview-questions.

Top problems to know cold

| Problem | Difficulty | Tests | Trap |
| --- | --- | --- | --- |
| 10 Lowest Uptime Services | Easy | TOP N with ties | LIMIT 10 drops tied rows |
| 2FA Confirmation Rate | Easy | Conditional aggregation | Divide by zero |
| 2nd Most Common Content Type | Easy | Tie breaking | LIMIT 1 OFFSET 1 ignores ties |
| 30 Day Page View Counts | Easy | Date filtering | Timezone boundaries |
| 7 Day Onboarding Conversion | Medium | Funnel analysis | Anchoring on the wrong event |
| 7 Check Rolling Average | Medium | Rolling window | ROWS vs RANGE when days are missing |
| Active Users by Month | Hard | Cohort logic | Double counting users active in multiple months |
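
The "LIMIT drops tied rows" trap from the table can be reproduced in a few lines. This is a minimal sketch, assuming a hypothetical `services` table with invented uptime values (not the sandbox schema) and a SQLite build with window function support (3.25 or later); it contrasts a plain `LIMIT` with a `RANK()` filter.

```python
import sqlite3

# Hypothetical table and data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE services (name TEXT, uptime_pct REAL)")
conn.executemany(
    "INSERT INTO services VALUES (?, ?)",
    [("api", 97.0), ("auth", 98.5), ("search", 98.5), ("billing", 99.9)],
)

# Naive: LIMIT cuts at exactly N rows, so one of the two tied
# services (auth, search) is dropped arbitrarily.
limit_rows = conn.execute(
    "SELECT name FROM services ORDER BY uptime_pct LIMIT 2"
).fetchall()

# Tie-aware: rank first, then filter on the rank so every row
# tied at the cutoff is kept.
ranked_rows = conn.execute(
    """
    SELECT name
    FROM (
        SELECT name, RANK() OVER (ORDER BY uptime_pct) AS rnk
        FROM services
    )
    WHERE rnk <= 2
    ORDER BY name
    """
).fetchall()

print(limit_rows)   # two rows; which tied service survives is not guaranteed
print(ranked_rows)  # api, auth, and search
```

Whether an interviewer wants `RANK()` or `DENSE_RANK()` depends on how "top N" is defined in the question, so it is worth asking before writing the query.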

Window functions drill

Window functions appear in most senior DE SQL screens. Timed practice at datadriven.io/sql-window-functions-practice.
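
One frame detail that comes up often in window function questions is the difference between `ROWS` and `RANGE` when dates have gaps. The sketch below assumes a hypothetical `checks` table keyed by an integer day number, with invented values, and a SQLite build that supports `RANGE` with numeric offsets (3.28 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checks (day_no INTEGER, latency_ms REAL)")
# Hypothetical data: days 4 through 9 have no rows at all.
conn.executemany(
    "INSERT INTO checks VALUES (?, ?)",
    [(1, 10.0), (2, 20.0), (3, 30.0), (10, 40.0)],
)

rows = conn.execute(
    """
    SELECT
        day_no,
        -- ROWS counts physical rows: the previous 6 rows, whatever their dates.
        AVG(latency_ms) OVER (
            ORDER BY day_no ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS avg_last_7_rows,
        -- RANGE counts by value: only rows whose day_no is within 6 of the current one.
        AVG(latency_ms) OVER (
            ORDER BY day_no RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS avg_last_7_days
    FROM checks
    ORDER BY day_no
    """
).fetchall()

by_day = {day: (by_rows, by_range) for day, by_rows, by_range in rows}
print(by_day[3])   # (20.0, 20.0): no gap yet, both frames agree
print(by_day[10])  # (25.0, 40.0): ROWS reaches back across the gap, RANGE does not
```

Neither frame is wrong in general. The point is to state which one the question is asking for.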

Python (388 problems)

DE Python is data manipulation, not LeetCode. Common patterns: chunking, sessionization, hash partitioning, interval merging, dedup with tie breaking, streaming aggregation, retries with backoff, schema evolution. Browse at datadriven.io/python-interview-questions.
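
As an illustration of one of these patterns, here is a minimal sessionization sketch. The function name `sessionize`, the integer-seconds timestamps, and the 30 minute default gap are assumptions made for this example, not taken from any specific problem in the set.

```python
from __future__ import annotations

from typing import Iterable


def sessionize(timestamps: Iterable[int], gap_seconds: int = 1800) -> list[list[int]]:
    """Group event timestamps (in seconds) into sessions.

    A new session starts whenever the gap since the previous event
    exceeds gap_seconds. Input order does not matter; events are sorted first.
    """
    sessions: list[list[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap_seconds:
            sessions[-1].append(ts)   # close enough: extend the current session
        else:
            sessions.append([ts])     # first event, or gap too large: new session
    return sessions


print(sessionize([0, 600, 4000, 4100, 9000]))
# [[0, 600], [4000, 4100], [9000]]
```

Sorting up front keeps the sketch short. A streaming variant would have to decide how to handle late or out-of-order events instead.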

Top problems

| Problem | Difficulty | Pattern |
| --- | --- | --- |
| Batch Records | Easy | Chunking iterables |
| Column Sum | Easy | Dict aggregation |
| Activity Time Ledger | Medium | Interval merging |
| Batch Partitioner | Medium | Hash bucketing |
| Batch With Metadata | Medium | Stateful iteration |
| Caesar Shift Check | Hard | String transforms |
| Character Occurrence Map | Hard | Counting tradeoffs |
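
Three of the patterns in the table (chunking, interval merging, hash bucketing) can be sketched as follows. The function names and signatures are hypothetical and are not the reference solutions for the listed problems. The interval merge assumes closed intervals, so touching intervals are merged.

```python
from __future__ import annotations

import hashlib
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of up to `size` items without materializing the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def merge_intervals(intervals: Iterable[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping or touching (start, end) intervals."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the last merged interval: extend its end if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def bucket_for(key: str, num_buckets: int) -> int:
    """Map a key to a bucket with a hash that is stable across processes.

    The built-in hash() is salted per process for strings, so a
    cryptographic digest is used here instead.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


print(list(chunked(range(5), 2)))                            # [[0, 1], [2, 3], [4]]
print(merge_intervals([(1, 3), (2, 6), (8, 10), (10, 12)]))  # [(1, 6), (8, 12)]
print(bucket_for("user-1", 8) == bucket_for("user-1", 8))    # True
```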

Schema design (56 problems)

Senior loops are won here. Interviewers reward picking the right grain for fact tables, defending an SCD type, and validating the schema with sample queries. Browse at datadriven.io/data-modeling-interview-questions.
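
One way to validate a schema with a sample query is to load a few rows and check that a join neither duplicates nor drops facts. The sketch below assumes a hypothetical SCD type 2 dimension (`dim_customer_address`) and fact table (`fact_order`) with invented data, and uses a half-open validity window (`valid_from` inclusive, `valid_to` exclusive).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables and rows, invented for illustration only.
conn.executescript(
    """
    CREATE TABLE dim_customer_address (
        customer_id INTEGER,
        city        TEXT,
        valid_from  TEXT,  -- inclusive, ISO date
        valid_to    TEXT   -- exclusive; '9999-12-31' marks the current row
    );
    INSERT INTO dim_customer_address VALUES
        (1, 'Austin', '2023-01-01', '2024-03-01'),
        (1, 'Denver', '2024-03-01', '9999-12-31');

    -- Fact grain: one row per order.
    CREATE TABLE fact_order (order_id INTEGER, customer_id INTEGER, order_date TEXT);
    INSERT INTO fact_order VALUES
        (100, 1, '2023-06-15'),
        (101, 1, '2024-03-01'),
        (102, 1, '2024-07-04');
    """
)

# Point-in-time join: each order picks the address version valid on its date.
rows = conn.execute(
    """
    SELECT f.order_id, d.city
    FROM fact_order AS f
    JOIN dim_customer_address AS d
      ON d.customer_id = f.customer_id
     AND f.order_date >= d.valid_from
     AND f.order_date <  d.valid_to
    ORDER BY f.order_id
    """
).fetchall()

print(rows)  # [(100, 'Austin'), (101, 'Denver'), (102, 'Denver')]
```

The result has exactly one row per order, which is the grain check. With an inclusive `valid_to`, order 101 would match both address versions and the join would fan out.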

Top problems

| Problem | Tests |
| --- | --- |
| A/B Experiment Assignment Schema | SCD type 2, sticky bucketing |
| Customer Address History | Effective dates, history preservation |
| Insurance Claims Lifecycle | State machine modeling |
| Clickstream and Session Schema | Sessionization, late events |
| E Commerce Supply Chain Tracking | Multi entity tracking |
| Loan Management Schema | Bridge tables, party roles |
| Cloud File Storage Metadata Schema | Recursive hierarchies |
| Financial Trading Warehouse | Time series, late arriving facts |
| Content Engagement Data Model | Fact table grain |
| B2B Invoicing Data Model | Many to many with attributes |
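
Recursive hierarchies, as in the file storage problem, are commonly modeled as an adjacency list and queried with a recursive CTE. This is a rough sketch, assuming a hypothetical `node` table with invented rows, not the schema used in the sandbox.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical adjacency-list table: each row points at its parent folder.
conn.executescript(
    """
    CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    INSERT INTO node VALUES
        (1, NULL, 'root'),
        (2, 1,    'docs'),
        (3, 2,    'report.pdf'),
        (4, 1,    'photos');
    """
)

# Walk from the root down, building the full path of every node.
paths = conn.execute(
    """
    WITH RECURSIVE tree(id, path) AS (
        SELECT id, name FROM node WHERE parent_id IS NULL
        UNION ALL
        SELECT n.id, tree.path || '/' || n.name
        FROM node AS n
        JOIN tree ON n.parent_id = tree.id
    )
    SELECT path FROM tree ORDER BY path
    """
).fetchall()

print([p[0] for p in paths])
# ['root', 'root/docs', 'root/docs/report.pdf', 'root/photos']
```

An adjacency list keeps writes cheap but makes every subtree read recursive. Alternatives such as a closure table or materialized path trade that the other way.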

Pipeline architecture (120 problems)

End to end design questions. Use the eight beat framework on every one. Browse at datadriven.io/data-pipeline-interview-questions.

Top case studies

| Case study | Domain |
| --- | --- |
| Card Transaction Streaming Pipeline | Real time, exactly once |
| Cellular Connectivity and App Log Data Warehouse | High cardinality |
| AWS Pipeline Auto Scaling for Variable Volume | Cost optimization |
| Connected Vehicle Telemetry Pipeline | High volume IoT |
| Capital Markets Intraday Risk Pipeline | Regulatory lineage |
| Database Replication and Schema Normalization Pipeline | CDC |
| Cost Optimized Clickstream Data Lake | Storage tradeoffs |
| Databricks Pipeline with Spark Performance Optimization | Spark internals |
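
"Exactly once" in a streaming design is often achieved in practice as at-least-once delivery plus an idempotent sink. The sketch below illustrates that idea with a hypothetical in-memory `IdempotentSink` class. The names and the dict-based state are assumptions for illustration; a real system would commit the seen ids and the balance update atomically in durable storage.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class IdempotentSink:
    """Applies each transaction at most once, keyed by its id (hypothetical sketch)."""

    balances: dict[str, int] = field(default_factory=dict)
    seen_ids: set[str] = field(default_factory=set)

    def apply(self, txn_id: str, card: str, amount_cents: int) -> bool:
        """Return True if the transaction was applied, False if it was a replay."""
        if txn_id in self.seen_ids:
            return False  # redelivered message: skip, state is unchanged
        self.seen_ids.add(txn_id)
        self.balances[card] = self.balances.get(card, 0) + amount_cents
        return True


sink = IdempotentSink()
results = [
    sink.apply("t1", "card-a", 500),
    sink.apply("t2", "card-a", 250),
    sink.apply("t1", "card-a", 500),  # duplicate delivery after a retry
]
print(results)        # [True, True, False]
print(sink.balances)  # {'card-a': 750}
```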

How many problems to be ready

About 100 medium and 25 hard, distributed across the four sections. Past that, returns diminish. Below that, gaps remain.

Companion repos

Contributing

Open an issue with: question text, schema, expected output, what it tests, the common trap. Submissions are reviewed and added with attribution.

License

CC BY-SA 4.0. Sandboxes hosted at datadriven.io.
