Russian version: README.ru.md.
GEBODIKBD is an educational project: a minimal relational DBMS with a client–server architecture, a custom SQL pipeline, and disk-backed storage (heap pages + buffer pool). The main goal is to make the full query path tangible — from a SQL string to page reads/writes and index usage.
This project is intentionally simplified and not intended for production use. The focus is on core DBMS concepts and a readable implementation.
The initial motivation came from a university databases course (HSE University): it provided the theoretical foundation and classic DBMS building blocks, but I wanted to understand how those concepts work in practice — at the level of code, data formats, and real engineering trade-offs.
Over time, the learning goal evolved into a small research-style effort: build a minimal yet coherent relational database “analogue” where the full processing pipeline is visible end-to-end — from parsing and planning down to page-level storage and indexing — and observe how each subsystem impacts architecture and performance.
This project focuses on:
- The SQL pipeline: lexer → parser → semantic analysis → planning → optimization → execution
- The storage engine: heap pages, tuple identifiers (TID), serialization, buffer pool, and eviction strategies
- Indexes and plan selection (sequential scan vs index scan)
- Engineering practice: modular design, APIs, unit/e2e tests, and Java networking/I/O
- Java 17+
- Gradle Wrapper (included)
./run-server.shYou can configure the server via environment variables:
PORT=15432 DATA_DIR=data BUFFER_POOL=10 ./run-server.sh./run-cli.shYou can configure the client via environment variables:
HOST=127.0.0.1 PORT=15432 ./run-cli.sh- Client–server: TCP framed protocol, JSON responses, CLI client
- SQL dialect:
CREATE TABLE,INSERT,SELECT,CREATE INDEX - Expressions in
SELECTandWHERE: arithmetic, comparisons,AND/OR, aliases viaAS - Query compilation:
LexerImpl,ParserImpl,SemanticAnalyzerImpl,PlannerImpl,OptimizerImpl - Execution: Volcano-style executors (scan/filter/project/insert/create)
- Disk storage: 8KB heap pages, page file manager, tuple serializer
- Buffer pool: page cache, dirty pages, LRU/Clock eviction strategies
- System catalog: table/column/type/index metadata stored under
data/ - Indexes:
-
HASH— persistent, used for point lookups (=) -
BTREE(B+Tree) — range scans supported; currently rebuilt on startup (see limitations)
-
- Test suite: lexer/parser/semantic, storage, indexes, end-to-end over TCP
-
BTREEpersistence (on-disk format + recovery without rebuild) -
DELETE/UPDATE -
DROP TABLE/DROP INDEX - Transactions and concurrency (remove global server lock, add an isolation model)
- Better optimizer (statistics, cost-based planning)
- SQL extensions (JOIN, ORDER BY, LIMIT)
Full list: docs/ROADMAP.md.
High-level component diagram:
flowchart LR
subgraph Client["Client"]
CLI["CLI client\nru.open.cu.student.client.CliMain"]
end
subgraph Network["Network"]
FP["FramedProtocol\n(length-prefixed)"]
end
subgraph Server["Server"]
SRV["DbServer\nru.open.cu.student.server.DbServer"]
SES["DbSession\n(1 connection)"]
PIPE["SqlPipeline"]
QEE["QueryExecutionEngine"]
end
subgraph Compiler["Query compilation"]
LEX["LexerImpl"]
PAR["ParserImpl"]
SEM["SemanticAnalyzerImpl"]
PLN["PlannerImpl"]
OPT["OptimizerImpl\nRuleBasedScanStrategy"]
EXF["ExecutorFactoryImpl"]
end
subgraph Storage["Storage engine"]
SE["StorageEngine"]
CAT["CatalogManager"]
IDX["IndexManager\n(HASH / BTREE)"]
BPM["Buffer pools\n(LRU / Clock)"]
PFM["PageFileManager\n(HeapPageFileManager)"]
DISK["Disk files\n*.dat, indexes/*"]
end
CLI --> FP --> SRV --> SES --> PIPE
PIPE --> LEX --> PAR --> SEM --> PLN --> OPT --> EXF --> QEE
QEE --> SE
SE --> CAT
SE --> IDX
SE --> BPM --> PFM --> DISK
IDX --> DISK
Details and rationale: docs/ARCHITECTURE.md.
Minimal scenario:
CREATE TABLE users (id INT64, name VARCHAR);
INSERT INTO users VALUES (1, 'Alice');
INSERT INTO users VALUES (2, 'Bob');
SELECT * FROM users WHERE id = 2;
CREATE INDEX idx_users_id ON users(id) USING HASH;
SELECT * FROM users WHERE id = 2;More scenarios: docs/EXAMPLES.md.
docs/README.md— documentation indexdocs/ARCHITECTURE.md— architecture, diagrams, key decisionsdocs/PIPELINE.md— SQL pipeline (lexer/parser/semantic/plans/execution)docs/SQL_DIALECT.md— supported SQL dialect and limitationsdocs/STORAGE_ENGINE.md— heap pages, buffer pool, on-disk layoutdocs/INDEXES.md— HASH/BTREE indexes and plan selection rulesdocs/PROTOCOL.md— TCP/JSON protocoldocs/TECHNICAL_ASSIGNMENT.md— sample technical assignment / checklistdocs/DEVELOPMENT.md— build, run, configuration
Russian documentation is available under docs/ru/.
- Concurrency is simplified: queries are serialized via a global server-level lock
- No transactions and no isolation model
BTREEindexes are rebuilt by scanning the table on startup (persistence is a TODO)
./gradlew build
./gradlew testSee LICENSE.
See CONTRIBUTING.md.