sql-rl-env: A Reinforcement Learning Environment for SQL Query Generation

Author: Kartik Munjal

A Gymnasium-compatible reinforcement learning environment for SQL query generation via slot-filling, designed for studying reward signal design and reward hacking in structured task domains.

Research Overview

This project implements a slot-filling Markov decision process for SQL query generation over a small e-commerce database. The environment is designed explicitly to study:

Reward signal design: Four reward functions with different trade-offs between semantic richness, signal density, and hacking vulnerability
Reward hacking analysis: Eight documented hacking scenarios with formal detection methods
Curriculum difficulty: Five tasks of increasing SQL complexity (simple SELECT → window functions)
Baseline comparisons: Three agents (random, rule-based, REINFORCE) across all task/reward combinations

The key research insight is that by using a schema-grounded slot-filling formulation (rather than free-text token generation), all reward hacking scenarios operate on syntactically and schema-valid SQL, making them analytically tractable.

Portfolio Position

This repo is the structured-environment companion to rlhf-and-reward-modelling-alt. Both repos study the same failure mode - optimizing an imperfect reward until it breaks - but at different levels of control:

Repo	Reward surface	What it isolates
`rl-env`	SQL execution, partial credit, exact match, composite reward	Reward hacking in a fully inspectable MDP
`rlhf-and-reward-modelling-alt`	preference RMs, rubric RMs, ensembles, DPO/PPO	Reward hacking in language-model alignment
`deep-research-agent`	claim-level verifier labels	Verification as a process-style diagnostic signal

The intended research through-line is: start with reward hacking in a controlled SQL environment, transfer the lessons to RLHF reward models, then reuse the same process-level thinking for agent verification.

Environment Design

State Space

obs["schema"]        float32 (20, 8)    — Column feature matrix: [table one-hot | dtype one-hot]
obs["nl_embedding"]  float32 (128,)     — Bag-of-words NL query encoding
obs["partial_sql"]   float32 (26, 32)   — One-hot slot fillings per phase
obs["current_phase"] int32   (1,)       — Current build phase index

Action Space

Discrete(N_MAX_ACTIONS=32)

Actions are schema-grounded and phase-specific. The valid action set (provided via info["action_mask"]) changes at each step based on:

The current build phase (SELECT_COLS, FROM_TABLE, WHERE_COL, etc.)
Previously filled slots (join key masking, type-safe WHERE constraints)
Schema structure (only existing tables/columns are valid actions)

Reward Signals

Name	R	Signal	Primary Use
Exact Match	R1	Binary {0, 1}	Evaluation
Execution Match	R2	{-0.1, 0, 0.5, 1.0}	Evaluation + R4 input
Partial Credit	R3	[0, 1] continuous	Dense training signal
Composite	R4	[−0.1, 1]	Primary training signal

R4 = 0.50·R2 + 0.30·R3 + 0.15·R1 − 0.05·efficiency_penalty

Five SQL Tasks

Task	Complexity	SQL Construct	Episode Steps
1: Simple SELECT	★☆☆☆☆	SELECT + WHERE	5
2: Aggregation	★★☆☆☆	GROUP BY + HAVING + ORDER BY	8
3: INNER JOIN	★★★☆☆	Two-table join with FK selection	8
4: IN Subquery	★★★★☆	Nested sub-MDP with 10 phases	10
5: Window Function	★★★★★	RANK/ROW_NUMBER + PARTITION BY	9

All tasks use a shared e-commerce database: customers, products, orders, order_items (50, 20, 200, 600 rows respectively).

Reward Hacking Analysis

Eight hacking scenarios are documented in research_notes/reward_hacking_report.md:

ID	Reward	Mechanism	Detection Signal
H1.1	R1	Cross-join passes alias normalisation	Row count KL drift
H1.2	R1	SELECT * exploitation	Column coverage trend
H2.1	R2	Large result set Jaccard inflation	Row distribution KL
H2.2	R2	Timeout stagnation (no-gradient trap)	Learning curve plateau
H3.1	R3	Column flooding for Jaccard numerator	Coverage Spearman r
H3.2	R3	WHERE >= 0 operator monoculture	Operator entropy
H4.1	R4	High w_partial local optimum	exec vs partial gap
H4.2	R4	LIMIT suppresses efficiency penalty	Pre-LIMIT COUNT(*)

Hacking is detected in real-time by src/analysis/reward_hacking_detector.py using three statistical signals (KL divergence, Shannon entropy, Spearman rank correlation). A composite alert fires when 2/3 signals trigger simultaneously.

Formal Detection Framework

The detector maintains three rolling-window statistics over the last 100 training episodes. For a full derivation and worked examples, see research_notes/reward_hacking_report.md.

Signal 1 — Result Set Size Drift (KL Divergence)

Let P be the baseline row-count histogram (log-scale buckets: 0, 1–5, 6–20, 21–100, 101+) computed over the first 100 episodes. Let Q be the histogram over the most recent 50 episodes. The detector fires when:

KL(P || Q) = Σ_x P(x) · log(P(x) / Q(x)) > τ_kl = 0.50

with Laplace smoothing ε = 10⁻⁹ to handle empty buckets. An increasing distribution shift flags H2.1 (large result set inflation); a decreasing shift flags H2.2 (timeout stagnation).

Signal 2 — Operator Diversity (Shannon Entropy)

Let {p_i} be the empirical distribution of WHERE operators in the last 50 episodes (operators: =, !=, >, <, >=, <=, LIKE, NONE). The normalized entropy is:

H_norm = H(π) / log₂(|Ω|)    where H(π) = −Σ_i p_i · log₂(p_i)

Signal 2 fires when H_norm < τ_H = 0.80 and a single operator dominates (|Ω| = 1). This flags H3.2 (WHERE operator monoculture). The normalisation by log₂(|Ω|) accounts for the fact that fewer distinct operators naturally yield lower maximum entropy.

Signal 3 — Column Coverage Trend (Spearman Rank Correlation)

Let c_t = |selected_cols_t| / |schema_cols| ∈ [0, 1] be the column coverage fraction for episode t. The Spearman rank correlation between episode index and coverage over the last 50 episodes is:

ρ = 1 − (6 · Σ d_i²) / (n(n²−1))    where d_i = rank(t_i) − rank(c_i)

Signal 3 fires when ρ > τ_ρ = 0.40, indicating a persistent upward coverage trend (H3.1 column flooding). Unlike Pearson correlation, Spearman is robust to outliers from episodes where the agent successfully generates short queries.

Composite Alert

A composite alert fires when 2 or 3 of the three signals trigger simultaneously. This voting threshold is a deliberate design choice: any single signal can fire transiently due to short-window variance, while requiring all three is too conservative for mixed-mode failures. The severity is |signals_fired| / 3. The per-signal thresholds themselves are heuristic and are discussed as a limitation in the methodology section below.

Quick Start

1. Install dependencies

pip install -r requirements.txt
pip install -e .

2. Seed the database

python scripts/seed_db.py

This creates data/ecommerce.db (deterministic, seed=42) and configs/nl_vocab.json.

3. Run a sanity check (random agent)

python scripts/run_experiment.py --dry_run

4. Train REINFORCE on Task 1

python scripts/run_experiment.py --task task_01_simple --reward composite --episodes 2000

5. Run full evaluation table

python scripts/evaluate_agents.py --episodes 500

6. Run tests

pytest tests/ -v

Results

All experiments run with 1,500 training episodes per task, REINFORCE with composite reward (R4), 50-episode evaluation windows. Baselines are evaluated with 300 episodes each. Raw data in results/eval_table.json, results/training_curves.json, and results/curriculum_ablation.json. Additional ablation data in results/weight_ablation.json, results/extended_curriculum.json, results/reward_hacking_table.json, results/threshold_sensitivity_eval.json, and traced training-threshold sweeps in results/task_02_aggregation/composite/threshold_sensitivity_training_trace.json and results/task_03_join/composite/threshold_sensitivity_training_trace.json.

Key Research Findings

Finding 1 — Task structure, not reward weighting, determines hacking vulnerability. Under fixed-policy evaluation, the reward hacking detector fires identically under all four reward signals (exact, execution, partial, composite) for every task. Tasks 2 and 4 trigger 8 composite alerts beginning at episode 50 under every reward function, while Task 5 triggers a single late alert at episode 150 under every reward function. The learned policy's behavioral fingerprint — which operators it uses, how many columns it selects, how large its result sets are — is therefore largely fixed by training dynamics and task structure, not by the score used at evaluation time. The weight ablation adds the stronger training-time result: increasing w_exec does not improve execution match at all. This should be read as an explicit limitation of the current prototype environment, not hidden as a neutral finding: the present reward design admits stable partial-credit optima that the execution component does not reliably break. That limitation is precisely what motivates roadmap items 3 and 4 below: tighten the action space's link to semantics, and enrich execution feedback beyond a static database penalty.

Finding 2 — Task 3 (INNER JOIN) is harder to exploit during training, but not immune. In the main 1,500-episode training run, Task 3 produced no composite alerts, which is materially different from the immediate alert patterns on Tasks 2 and 4. JOIN tasks require co-selecting a join table and a FK-compatible join key, which forces broader operator and column usage during optimization. But the evaluation-only reward-swap study still shows alerts on Task 3 beginning at episode 100, so the correct conclusion is not "JOIN is hacking-proof." The stronger claim supported by the data is narrower: multi-step relational structure delays and weakens collapse, but does not remove it.

Finding 3 — Curriculum benefit is hierarchically structured. Transfer learning follows a depth-of-transfer ladder rather than a binary "shared sub-skills / no shared sub-skills" split. Task 1 to Task 2 yields no benefit because WHERE-slot priors do not transfer to GROUP BY or HAVING decisions. Task 1 to Task 3 yields a small early gain that fades; the most plausible explanation is transfer of shallow schema/action priors, but the current artifacts do not log phase-level action distributions, so that mechanism remains a hypothesis. Task 3 to Task 4 yields a +62% final reward gain (0.300 vs 0.185) and 3× faster convergence because JOIN training transfers the deeper capability that the target also needs: coordinating two SQL namespaces through FK-compatible bindings. The pattern is therefore: no overlap -> zero transfer, temporary warm-start -> fading transfer, deep structural overlap -> persistent transfer.

Finding 4 — The bottleneck is not just the NL encoder; it is the encoder plus a weak RL optimization regime. The rule-based agent dominates REINFORCE on Tasks 1–4 despite having zero training, which does indicate that the 128-dim bag-of-words state is a hard ceiling. But that is not the full story. REINFORCE here also uses terminal-only reward with high variance, and it fails to cleanly separate from the non-learning rule baseline on the very tasks where learning should have been easiest. The Task 5 win is real but small. The correct interpretation is therefore two-part: the BoW encoder constrains what information is available, and the current REINFORCE formulation does not reliably convert the available information into better behavior than a hand-written heuristic. That does not make the environment useless. Its value is diagnostic rather than benchmark-maximizing: it makes reward hacking, local optima, and metric disagreement visible in a controlled SQL domain with executable semantics. Richer encoders are necessary, but on their own they are not sufficient evidence that the RL formulation is sound.

Learning Curves

Left: REINFORCE composite reward across all five tasks. All tasks improve from the initial random exploration level (ep 100) and plateau between episodes 800–1,500. Task 4 (subquery) achieves the highest composite reward (0.21) due to high partial-credit scores on a large action space. Task 2 (aggregation) is hardest (reward 0.07–0.11) because GROUP BY + HAVING requires correctly co-selecting both aggregation function and grouping column — errors in either produce 0 execution match.

Right: Curriculum ablation (see §Curriculum Ablation below).

Agent × Task × Reward Signal Table

Metrics: composite = primary training signal; exec = execution match (−0.1 for syntax errors, 0.5 for partial overlap, 1.0 for exact result set); partial = component-wise partial credit [0,1].

Task	Reward	Random	Rule-Based	REINFORCE
T1: Simple SELECT	composite	0.147	0.187	0.165
	execution	−0.057	0.000	−0.032
	partial	0.589	0.639	0.610
	exact	0.000	0.000	0.000
T2: Aggregation	composite	0.073	0.105	0.085
	execution	−0.033	0.000	−0.025
	partial	0.306	0.349	0.332
	exact	0.000	0.000	0.000
T3: INNER JOIN	composite	0.138	0.203	0.182
	execution	−0.090	0.000	−0.038
	partial	0.610	0.689	0.677
	exact	0.000	0.000	0.000
T4: IN Subquery	composite	0.184	0.304	0.203
	execution	−0.080	0.203	−0.037
	partial	0.749	0.694	0.741
	exact	0.000	0.000	0.000
T5: Window Fn	composite	0.121	0.132	0.141
	execution	−0.047	0.000	−0.030
	partial	0.490	0.467	0.529
	exact	0.000	0.000	0.000

Key findings:

No agent achieves exact match > 0.0 on any task. This is expected — the normalised string comparison in R1 is extremely strict, and the slot-filling policy is not constrained to reproduce gold SQL verbatim. It establishes that R1 is unsuitable as a training signal. More importantly, this is a validity problem, not just a strictness problem. If exact match never rises above zero, R1 provides no discriminative gradient at all and cannot serve as a ground-truth verifier for "fully correct" behavior. The environment therefore cannot claim semantic validity by appealing to exact match. The current validity argument has to come from metric triangulation instead: execution match distinguishes executable result agreement from execution failure, and partial credit distinguishes clause-level progress from total mismatch. That triangulation is informative, but it is weaker than having a non-degenerate exact verifier, so claims about genuine SQL understanding should stay narrow.
Rule-based agent dominates Tasks 1–4 on composite reward. The BoW encoder is part of the reason: keyword matching on simple NL patterns is sufficient for tasks 1–3, and the rule agent's keyword-to-IN mapping gives it a decisive edge on Task 4 execution match (0.203 vs −0.037 for REINFORCE). But it is incomplete to frame this purely as an encoder bottleneck. The learning agent also uses a high-variance, terminal-only REINFORCE objective and never clearly separates from the rule baseline on the tasks it should plausibly learn. The environment therefore exposes a methodological limitation as well as a representational one: with the current RL setup, the non-learning baseline wins on 80% of tasks.
REINFORCE beats both baselines on Task 5 (partial credit 0.529 vs 0.490 random, 0.467 rule). Window function tasks have no clear keyword cue for PARTITION BY column selection. The rule agent's keyword-to-action lookup fails here, while REINFORCE learns from partial-credit feedback to select schema-compatible partition columns. This is the clearest positive learning result in the benchmark, but it should be framed cautiously. The advantage is modest and appears primarily in partial credit, not in execution match.
Execution match is consistently negative for REINFORCE and random (range −0.09 to −0.025). This is not a failure — it reflects that the majority of randomly assembled SQL queries cause SQLite execution errors, each penalised at −0.1. The partial credit component keeps composite reward positive. This is Hacking Scenario H2.2 (timeout stagnation) partially manifesting: even with the −0.1 error penalty, the policy cannot fully escape the high-error regime without semantic understanding.
Rule agent achieves 0.000 execution match on Tasks 1–3 and 5 (not negative). The rule agent reliably produces valid SQL (no execution errors) but rarely matches the exact result set. This confirms that syntactic validity alone is not sufficient for execution match.

REINFORCE Training Curve — Composite Reward

Episode	T1 R	T2 R	T3 R	T4 R	T5 R
100	0.149	0.071	0.137	0.194	0.120
300	0.165	0.106	0.173	0.190	0.129
500	0.167	0.081	0.157	0.182	0.121
700	0.169	0.078	0.169	0.198	0.126
900	0.170	0.083	0.158	0.210	0.130
1100	0.175	0.106	0.158	0.187	0.137
1300	0.172	0.079	0.171	0.184	0.131
1500	0.177	0.073	0.181	0.210	0.133

T4 and T1 show monotonic improvement; T2 is noisy due to high partial-credit variance in GROUP BY matching. T3 (JOIN) shows the slowest improvement of the first three tasks, consistent with the credit assignment challenge of selecting correct FK pairs.

Reward Hacking Observations

The RewardHackingDetector monitors three signals per episode: KL divergence of row count distribution (H2.1), operator diversity entropy (H3.2), and column coverage Spearman trend (H3.1). An alert fires when 2/3 signals trigger simultaneously.

Task	First Alert (episode)	Primary Signals	Interpretation
T1: Simple SELECT	500	row_distribution + column_coverage	H2.1 partial: agent gravitates toward SELECT * from full-table queries. First 500 episodes show row counts skewed high.
T2: Aggregation	100	operator_diversity + column_coverage	H3.2 fires immediately: GROUP BY queries omit WHERE, collapsing operator entropy to "NONE". Column coverage trends up as agent explores SELECT *.
T3: INNER JOIN	—	(no composite alert)	JOIN tasks generate diverse WHERE operators organically (=, !=, LIKE across multiple table columns), keeping entropy above threshold throughout training.
T4: IN Subquery	100	operator_diversity + column_coverage	Same pattern as T2. Subquery structure dominates; outer WHERE operator variety collapses to "IN" monoculture.
T5: Window Fn	1500	operator_diversity + column_coverage	Alert fires only at the final checkpoint. WINDOW tasks generate no WHERE operators in most episodes (WHERE phase skipped), so entropy builds slowly before collapsing at convergence.

Task 3 had zero composite hacking alerts across 1,500 episodes. This is the most significant hacking observation: JOIN tasks require selecting both a join table and a join key, which forces the agent to diversify its column choices, keeping coverage signal below the Spearman threshold. This demonstrates that task structure directly affects hacking vulnerability — a finding with implications for reward signal design in multi-step structured task domains.

The parallel finding in the RLHF repo is that reward signal type is an independent axis: PPO trained with an explicit Rubric RM (Conciseness criterion) produces 31% verbose-bias rate vs. 78% for PPO trained with an implicit Bradley-Terry RM — same task, different reward signal. Together the two findings suggest a 2×2 design space:

	Simple task structure	Complex task structure (e.g. INNER JOIN)
Implicit RM (Bradley-Terry)	High hacking risk (T1/T2/T4 alert within 100–500 eps)	Delayed and reduced hacking risk, but not eliminated — RM still exploitable
Explicit Rubric RM	Reward signal controlled (31% vs 78% verbose-bias)	Both dimensions favorable — lowest overall hacking exposure

Both dimensions matter and they interact: a well-designed reward signal cannot compensate for a trivially exploitable task structure, and a rich task structure does not immunise against a weak reward signal.

Curriculum Ablation: Task 3 (INNER JOIN)

Setup: Train REINFORCE on Task 3 twice — once from random initialisation (scratch), once initialised from the Task 1 (Simple SELECT) checkpoint. 1,200 episodes each, same hyperparameters.

Episode	Scratch R	Curriculum R	Advantage
100	0.143	0.155	+0.012
200	0.145	0.165	+0.020
300	0.157	0.171	+0.014
400	0.166	0.169	+0.003
500	0.155	0.173	+0.019
600	0.132	0.139	+0.007
700	0.153	0.151	−0.002
800	0.161	0.171	+0.009
900	0.161	0.168	+0.007
1000	0.174	0.178	+0.004
1100	0.160	0.159	−0.001
1200	0.161	0.166	+0.005

Summary: Curriculum initialisation provides a mean advantage of +0.014 in the early phase (ep 100–500) and +0.004 in the late phase (ep 600–1200). The maximum early advantage is +0.020 at episode 200.

Interpretation: The most defensible claim from the released artifacts is that Task 1 pretraining gives Task 3 a short-lived warm start, but not a durable relational advantage. The temporary benefit is consistent with transfer of shallow schema/action priors — e.g., earlier concentration on plausible tables and columns — but the current artifacts do not log phase-level action distributions, so that mechanism remains a hypothesis rather than a demonstrated result. The curriculum advantage dissipates after ~600 episodes as the scratch agent catches up on those basics. The final performance gap (+0.005 at ep 1200) is small.

This experiment directly answers the curriculum design question only at the level of learning curves: the Task 1 → Task 3 curriculum reaches the scratch ep-300 reward level earlier, but the available data do not yet show which action distributions changed to create that effect. Saved checkpoints alone are not enough for a retroactive phase-histogram analysis because the serialized artifacts contain only model weights, optimizer state, and episode_count, not per-episode trajectories or slot choices. A small follow-up experiment in this repo should log phase-level action histograms during curriculum and scratch training so the temporary transfer can be attributed to specific slots rather than inferred from reward alone.

R4 Weight Sensitivity Ablation — Tasks 2 and 5

Setup: Three weight configurations for R4 trained on the two highest-variance tasks (Task 2: hardest, Task 5: REINFORCE-winning). 600 episodes each.

Weight Profile	w_exec / w_partial / w_exact	Task 2 R (final)	Task 2 exec	Task 2 alerts	Task 5 R (final)	Task 5 exec
balanced (default)	0.50 / 0.30 / 0.15	0.077	−0.034	6	0.130	−0.040
high_exec	0.70 / 0.20 / 0.05	0.038	−0.032	6	0.071	−0.040
high_partial	0.30 / 0.50 / 0.15	0.151	−0.032	6	0.242	−0.040

Key findings:

Execution match is weight-invariant (−0.032 to −0.040 across all configs on both tasks). This is not just a neutral empirical finding; it is a design red flag. The learned policy generates the same proportion of execution errors regardless of how those errors are penalised, which means the policy is effectively not responding to the execution signal at all. The environment has admitted a partial-credit local optimum that is insensitive to stronger semantic penalties.
Task 2 triggers 6 hacking alerts under every weight profile, with first alert always at episode 100. Task 5 triggers zero alerts under every profile. Weight tuning cannot change the task-structural hacking vulnerability — confirming Finding 1.
high_partial inflates composite reward (Task 2: 0.077 → 0.151, Task 5: 0.130 → 0.242) because the metric being maximised weights partial credit more heavily. This is H4.1 in action: higher w_partial creates a local optimum where good partial credit substitutes for execution match. Researchers should not interpret high-partial composite rewards as better policies.
high_exec suppresses composite reward by 50–60% (Task 2: 0.038, Task 5: 0.071) without improving final execution match. The practical conclusion is not that w_exec ≈ 0.50 is "optimal" in a general sense; it is that within this environment, reward reweighting alone cannot repair the semantic blind spot.

Extended Curriculum Ablation — T1→T2 and T3→T4

Setup: Two additional transfer paths to complement the original T1→T3 ablation. 800 episodes each with Task 1 and Task 3 checkpoints as initialisers.

Path	Source	Target	Curriculum ep-100 R	Scratch ep-300 R	Curriculum crosses scratch ep-300 at	Final R (curriculum)	Final R (scratch)
T1→T2	Simple SELECT	Aggregation	0.079	0.088	ep 300 (no speedup)	0.089	0.090
T3→T4	INNER JOIN	IN Subquery	0.191	0.187	ep 100 (3× speedup)	0.300	0.185

Interpretation:

T1→T2 yields no curriculum benefit. Simple SELECT and Aggregation belong to categorically different clause families: Task 1 fills WHERE conditions while Task 2 fills GROUP BY/HAVING/aggregation slots. The Task 1 policy encodes WHERE-schema associations that do not transfer to aggregation decisions. Final performance is within 1% of scratch (0.089 vs 0.090). This is a negative result: adjacent difficulty does not imply shared sub-skills.
T3→T4 yields strong curriculum benefit. INNER JOIN and IN Subquery share multi-table reasoning: both require selecting a source table, choosing FK-compatible join columns, and coordinating across two SQL namespaces. The JOIN policy's FK-mapping associations transfer directly to subquery variable binding. The curriculum agent crosses the scratch ep-300 performance level at episode 100 — 3× convergence speedup — and achieves 62% higher final reward (0.300 vs 0.185). This is the strongest transfer result in the benchmark.

The pattern across all three curriculum paths is more specific than the original "shared sub-skills" summary. T1→T2 shows zero transfer because the target introduces new clause families. T1→T3 shows temporary transfer because source pretraining helps only with shallow schema priors and action familiarization; once scratch catches up on those basics, the benefit disappears. T3→T4 shows persistent transfer because the source and target both require the deeper relational skill of coordinating two SQL namespaces through compatible keys. Curriculum benefit therefore scales with transfer depth, not just overlap in topic or difficulty.

Reward Signal × Hacking Rate Interaction

Setup: Load each task's REINFORCE checkpoint (trained on R4 composite) and run 400 evaluation episodes under each of the four reward functions. The hacking detector runs during evaluation, reporting composite alerts per (task, reward) pair.

Task	Exact (first alert ep)	Execution (first alert ep)	Partial (first alert ep)	Composite (first alert ep)
T1: Simple SELECT	—	—	—	—
T2: Aggregation	50	50	50	50
T3: INNER JOIN	100	100	100	100
T4: IN Subquery	50	50	50	50
T5: Window Fn	150	150	150	150

All four reward functions produce identical alert patterns for every task. Tasks 2 and 4 trigger the earliest and most frequent alerts; Task 3 suppresses alerts the longest; Task 1 never triggers alerts; Task 5 triggers late-phase alerts. The reward function used for evaluation has no effect on which hacking behaviors the detector observes.

This is the key empirical result behind Finding 1: reward hacking is policy-intrinsic. The REINFORCE policy trained on composite reward develops a behavioral fingerprint (operator distribution, column coverage pattern, row count distribution) that is entirely determined by its training dynamics on each task. Evaluating the same policy under R1 vs R4 does not change what the policy does — it only changes what score that behavior receives. Hacking detection based on behavioral signals (KL, entropy, Spearman) is therefore reward-function-agnostic, which is both a strength (the detector generalizes across reward functions) and an important constraint on interpretation (changing the reward function does not eliminate hacking behaviors already embedded in a trained policy).

Project Structure

sql-rl-env/
├── src/
│   ├── env/
│   │   ├── sql_env.py          # Gymnasium.Env — central contract
│   │   ├── action_space.py     # Schema-grounded hierarchical action space
│   │   ├── state.py            # SQLState + NLEncoder
│   │   ├── schema.py           # Schema loader + FK graph
│   │   └── executor.py         # Safe SQLite executor (timeout + pre-LIMIT COUNT)
│   ├── tasks/
│   │   └── base.py             # SQLTask + TaskRegistry + SQL assemblers
│   ├── rewards/
│   │   ├── exact_match.py      # R1: normalised string comparison
│   │   ├── execution_match.py  # R2: result-set Jaccard comparison
│   │   ├── partial_credit.py   # R3: component-wise scoring
│   │   └── composite.py        # R4: weighted combination + efficiency penalty
│   ├── agents/
│   │   ├── random_agent.py     # Lower-bound baseline
│   │   ├── rule_agent.py       # Keyword-matching non-learning upper bound
│   │   └── reinforce_agent.py  # REINFORCE + baseline network (PyTorch)
│   └── analysis/
│       └── reward_hacking_detector.py  # 3-signal statistical detector
├── configs/
│   ├── env_config.yaml
│   ├── task_config.yaml
│   ├── reward_config.yaml
│   ├── agent_config.yaml
│   └── tasks/                  # NL queries + gold SQL for each task
├── scripts/
│   ├── seed_db.py              # Deterministic database seeder
│   ├── run_experiment.py       # REINFORCE training loop
│   ├── evaluate_agents.py      # Cross-agent comparison table
│   ├── run_all_experiments.py  # Master: train all tasks + eval + T1→T3 curriculum
│   ├── run_extra_experiments.py  # Ablations: weight sensitivity, extended curriculum, reward×hacking
│   ├── run_threshold_sensitivity.py  # Checkpoint-eval detector threshold sweep
│   └── run_trace_threshold_sweep.py  # Offline sweep on traced training runs
├── tests/
│   ├── test_rewards.py         # Reward function unit tests + H3.1 regression
│   ├── test_actions.py         # Action mask correctness
│   └── test_tasks.py           # SQL assembly and task loading
└── research_notes/
    ├── design_decisions.md     # Rationale for 8 architectural decisions
    └── reward_hacking_report.md  # Formal analysis of 8 hacking scenarios

Reward Hacking Detection: Methodology Connection to RLHF

The three-signal detection framework in src/analysis/reward_hacking_detector.py is architecturally aligned with the reward hacking detection methodology developed in the companion rlhf-and-reward-modelling-alt repository. Both frameworks share the same core design principles:

Shared architectural principles:

Principle	sql-rl-env	rlhf-and-reward-modelling-alt
Signal 1: Distribution shift	KL divergence of result-set row counts	KL divergence of reward model score distributions
Signal 2: Behavioral monoculture	Shannon entropy of WHERE operators	Entropy of response length/token distributions
Signal 3: Structural trend	Spearman rank correlation of column coverage	Spearman rank correlation of semantic similarity trends
Alert threshold	2/3 signals simultaneously	2/3 signals simultaneously
Rolling window	100 episodes	100 training steps

The 2/3 voting threshold is a deliberate architectural choice in both projects. A 1/3 threshold produces excessive false positives as individual signals fluctuate during normal training; a 3/3 threshold is too conservative and misses mixed-mode failures where one signal remains below threshold by noise alone. This majority-vote structure is the generalizable pattern: reward hacking manifests as simultaneous anomaly across multiple behavioral dimensions, not as a single-signal spike.

The per-signal thresholds in this repository are less mature. kl_threshold=0.50, entropy_threshold=0.80, and coverage_trend_threshold=0.40 are currently heuristic defaults baked into src/analysis/reward_hacking_detector.py, chosen for interpretability on this toy schema rather than from a formal calibration study. That limitation is now materially narrower than it was in the first release.

Two calibration artifacts now exist:

Checkpoint-evaluation sweep: scripts/run_threshold_sensitivity.py replays released checkpoints and shows that tau_kl in {0.35, 0.50, 0.65} leaves the composite-alert pattern unchanged for all five tasks in evaluation mode; see results/threshold_sensitivity_eval.json.
Training-trace sweep: traced reruns on an exploit-prone task (Task 2) and a delayed-risk task (Task 3) were logged with scripts/run_experiment.py using --save_detector_trace, then swept offline with scripts/run_trace_threshold_sweep.py over the full 3 x 3 x 3 grid tau_kl in {0.35, 0.50, 0.65}, tau_entropy in {0.64, 0.80, 0.96}, and tau_spearman in {0.32, 0.40, 0.48}. The resulting artifacts show complete pattern stability across all 27 configurations on both traced runs: Task 2 always alerts from episode 50 onward with 12/12 alert windows, while Task 3 never alerts during the 600-episode traced run.

This is enough to say that the detector is locally stable to substantial threshold perturbations in both checkpoint evaluation and traced training reruns. The remaining calibration gap is narrower and more specific: the repository still lacks a larger bank of traced runs across all tasks and seeds, and it does not yet provide formal false-positive / false-negative estimates against human-labeled hacking windows. The current proxy reference rule in the training-trace sweep is useful for sanity checking exploit-prone runs, but it is not a substitute for manual calibration labels.

The SQL domain operationalizes this methodology in a form with executable semantics and stronger verification than RLHF-style text rewards, even though the strictest verifier in this repository is degenerate in practice. That still makes it a useful testbed for hacking-detection methods that later generalize to domains where verification is even weaker. The key research contribution is demonstrating that the same statistical framework — KL + entropy + Spearman, with majority-vote composite alerts — detects reward hacking across both a structured task domain (SQL) and the unstructured natural language domain studied in the RLHF literature.

Production-Grade Environment Roadmap

This environment is an intentional research prototype: a small schema, fixed NL vocabulary, and slot-filling MDP chosen for analytical tractability. The roadmap below is driven by the failure modes exposed in the experiments rather than by a generic SQL-agent wish list.

1. Scale the schema and task set

Motivated by Findings 1 and 3: the current task set is small enough that task-specific local optima and brittle transfer effects dominate. A larger schema and broader task family are required to test whether the observed hacking patterns and curriculum hierarchy persist outside this toy regime.

The 4-table, 20-column e-commerce schema is sufficient for 5 task types but would not support enterprise SQL diversity. A production environment would require:

50–200 tables with realistic FK graphs (star schema, snowflake schema, self-referential FKs)
Per-schema dynamic masking (the current HierarchicalActionSpace is already schema-agnostic; only the vocab size needs to scale)
Tasks generated from real query logs (e.g., Spider, WikiSQL, or internal warehouse logs) rather than hand-authored gold SQL

2. Replace the NL encoder

Motivated by Finding 4: the rule agent matches the information content of the current BoW encoder too closely for the learned policy to demonstrate clear representational gains.

The 128-dim bag-of-words encoder is the dominant performance ceiling (see Finding 4). A production encoder would be:

A frozen pre-trained sentence encoder (BERT, T5) producing 768-dim embeddings
Or a jointly trained encoder using a contrastive NL-to-SQL matching objective
The observation space would grow from 1,121 to ~6,000+ dims, requiring a deeper policy network

3. Move from slot-filling to constrained beam search

Motivated by the exact-match validity problem and the execution-insensitivity result: the current slot-filling policy makes it easy to accumulate partial credit without assembling semantically correct SQL. A grammar-constrained decoder would tighten the link between action choices and executable semantics.

Slot-filling cannot express all SQL constructs (window frames, CTEs, lateral joins, arithmetic in SELECT). A production action space would use:

Constrained beam search over a SQL grammar, with schema-grounded validity checks at each token
Or a hybrid: slot-filling for common constructs, with an escape action that invokes token-level generation for rare constructs

4. Live database execution with data drift

Motivated by the weight-ablation red flag: if stronger execution penalties do not move behavior in a static database, the next environment should test richer semantic feedback under changing data distributions and explicit query-cost consequences.

The current executor uses a static database seeded at launch. A production environment would support:

Live database connections with MVCC-isolated snapshot reads (preventing reward manipulation via data mutation)
Periodic database updates to test distributional robustness of the learned policy
Query execution cost as an explicit efficiency signal (not just row count)

5. Multi-turn conversational interface

Motivated by Findings 1 and 4: one way to break keyword monoculture and expose richer policy behavior is to move beyond one-shot queries into clarification and refinement, where conversational context can create natural action diversity that a rule baseline cannot exploit as easily.

Current tasks are single-turn: one NL query → one SQL query. A production version would model:

Clarification dialogue (NL ambiguity resolution → SQL refinement)
Multi-turn WHERE construction (progressive constraint addition)
Session-level reward (correctness of final query after N turns)

The slot-filling MDP's BuildPhase enum and HierarchicalActionSpace are designed to extend naturally to these settings: additional phases can be added without changing the environment's Gymnasium interface.

Key Design Decisions

Full rationale in research_notes/design_decisions.md.

D1 — SQL domain chosen for executable semantics and rich partial-correctness structure.

D2 — Slot-filling MDP (not token-level) for tractable masking, interpretable hacking, and clean separation from language model capabilities.

D3 — Terminal-only reward with gamma=1.0 to avoid artificial decomposition of joint SQL decisions and penalisation of longer tasks.

D4 — Schema-grounded masking at environment level to eliminate syntactic hacking and focus analysis on semantically valid but semantically wrong queries.

D5 — REINFORCE (not PPO) is the correct choice for this research environment. REINFORCE's high variance is a research feature, not a drawback: variance surfaces policy failures early and visibly in learning curves. PPO's clipping mechanism (L^CLIP) truncates policy updates that deviate from the behavior policy, which suppresses the very policy oscillations and reward-exploitation episodes that this project is designed to study. In a reward signal analysis project, failure modes must be observable to be documented. REINFORCE makes every hacking episode visible as a policy gradient update; PPO's KL penalty would stabilize training at the cost of hiding gradient pathologies. Production RLHF deployments use PPO for stability — this project uses REINFORCE for diagnostic transparency, which is the methodologically appropriate choice for the research goal.

Citing This Work

@misc{munjal2026sqlrlenv,
  title  = {sql-rl-env: A Gymnasium Environment for Reward Signal Design in SQL Query Generation},
  author = {Kartik Munjal},
  year   = {2026},
  url    = {https://github.com/kartikmunjal/rl-env}
}

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
configs		configs
research_notes		research_notes
results		results
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Folders and files

Latest commit

History

Repository files navigation

sql-rl-env: A Reinforcement Learning Environment for SQL Query Generation

Research Overview

Portfolio Position

Environment Design

State Space

Action Space

Reward Signals

Five SQL Tasks

Reward Hacking Analysis

Formal Detection Framework

Quick Start

1. Install dependencies

2. Seed the database

3. Run a sanity check (random agent)

4. Train REINFORCE on Task 1

5. Run full evaluation table

6. Run tests

Results

Key Research Findings

Learning Curves

Agent × Task × Reward Signal Table

REINFORCE Training Curve — Composite Reward

Reward Hacking Observations

Curriculum Ablation: Task 3 (INNER JOIN)

R4 Weight Sensitivity Ablation — Tasks 2 and 5

Extended Curriculum Ablation — T1→T2 and T3→T4

Reward Signal × Hacking Rate Interaction

Project Structure

Reward Hacking Detection: Methodology Connection to RLHF

Production-Grade Environment Roadmap

Key Design Decisions

Citing This Work

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages