| title | AI Research Dashboard |
|---|---|
| emoji | 🤖 |
| colorFrom | blue |
| colorTo | purple |
| sdk | docker |
| sdk_version | 0.2.4 |
| python_version | 3.11 |
| app_file | app.py |
| pinned | false |
AI research environment that simulates the end-to-end scientific discovery process, enabling agents to analyze papers, generate hypotheses, design experiments, and validate results collaboratively
- Architecture
- Prerequisites
- Installation
- Run
- The Simulation Loop Architecture
- Tasks & Graders
- Agent Actions
- Reward Function
- Training Pipeline
- Project Structure
- Configuration
- License
- Author
- Hackathon
## Architecture

The system wraps a traditional MDP benchmark in a full-stack serverless application: a React dashboard driving a Python backend built on FastAPI.
```
┌──────────────────────────────────────────────────────────┐
│                    Web User Interface                    │
│       React + Vite + Zustand + Recharts + Tailwind       │
│       (User drives Auto-Pilot or manual execution)       │
└──────────────┬────────────────────────────┬──────────────┘
               │ HTTP POST /api/agent       │ HTTP POST /step
               ▼                            ▼
┌──────────────────────────────────────────────────────────┐
│              FastAPI Backend (server/app.py)             │
│                                                          │
│   ┌──────────────────┐      ┌──────────────────────┐     │
│   │  HF Serverless   │      │  ResearchEnvironment │     │
│   │  Inference API   │      │   (environment.py)   │     │
│   └────────┬─────────┘      └──────────┬───────────┘     │
└────────────┼───────────────────────────┼─────────────────┘
             │ Qwen2.5-72B-Instruct      │ Graders &
             │ LLM Inference API         │ Tasks
             ▼                           ▼
```
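For example, driving a single environment step over the HTTP API might look like the sketch below. The request and response field names here are assumptions for illustration, not the verbatim schema.

```python
# Hypothetical client sketch: drive one environment step over HTTP.
# The payload and response fields are illustrative assumptions, not the exact schema.
import requests

BASE_URL = "http://localhost:7860"

resp = requests.post(
    f"{BASE_URL}/step",
    json={
        "action_type": "read_paper",  # one of the agent actions listed below
        "payload": {},                # action-specific arguments, if any
    },
    timeout=30,
)
resp.raise_for_status()
step = resp.json()
print(step)  # expected to contain the observation, reward, and done flag
```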
## Prerequisites

- Python 3.11+
- Node.js 18+
- Hugging Face account with an access token (`HF_TOKEN`)
## Installation

Install the Python dependencies:

```bash
pip install -r requirements.txt
```

Install the dashboard dependencies:

```bash
cd dashboard
npm install
```

## Run

Start the backend server (FastAPI):

```bash
python -m uvicorn server.app:app --host 0.0.0.0 --port 7860
```

Start the frontend dev server (Vite):

```bash
cd dashboard
npm run dev
```

The project relies on a native Dockerfile for HF Spaces:
```bash
# Build the React UI
cd dashboard && npm run build && cd ..

# The Docker build serves the static dist folder natively
docker build -t ai-research-environment .
docker run -p 7860:7860 ai-research-environment
```

## The Simulation Loop Architecture

```mermaid
graph TD
    A[RL/LLM Agent] -->|Selects Next Action| B(OpenEnv API Layer)
    B --> C{Research Environment}
    C -->|Executes Task Step| D[State Update]
    D --> E[Agent Activity Log]
    E --> F[Reward Grader]
    F -->|Calculates Reward & Updates History| G[History Tab & Charts]
    G -->|Returns Observation & Reward| A
```
## Tasks & Graders

The environment ships with deterministic, multi-factor graders that score the agent against predefined, structured tasks and check the logical consistency of its trajectory.
| Task | Difficulty | Domain | Challenge |
|---|---|---|---|
| `image_classification` | 🟢 Easy | Computer Vision | Clear signal, minimal noise |
| `nlp_sentiment` | 🟡 Medium | NLP | Noisy results, misleading papers |
| `tabular_prediction` | 🔴 Hard | Healthcare ML | Conflicting evidence, budget limit |
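As a rough sketch of what a task bundles together (the real definitions live in tasks.py; the field names and values below are inferred from the grading formulas and should be treated as assumptions):

```python
# Illustrative task sketch; field names are inferred from the grading formulas
# (baseline, optimal, ground_truth_keywords), not copied from tasks.py.
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: str
    difficulty: str                  # "easy" | "medium" | "hard"
    domain: str
    baseline_accuracy: float         # starting point the agent must beat
    optimal_accuracy: float          # best achievable accuracy for the task
    ground_truth_keywords: list[str] = field(default_factory=list)

nlp_sentiment = Task(
    task_id="nlp_sentiment",
    difficulty="medium",
    domain="NLP",
    baseline_accuracy=0.70,          # stand-in value
    optimal_accuracy=0.92,           # stand-in value
    ground_truth_keywords=["transformer", "fine-tuning", "attention"],
)
```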
## Reward Function

The reward function enforces strict alignment with the scientific method: instead of sparse binary signals, it uses a dense, continuous weighted evaluation. Each component is scored independently in graders.py and combined via a difficulty-scaled weighted sum.
```python
# graders.py - grade_episode()
score = w[0]*h + w[1]*e + w[2]*i + w[3]*r + w[4]*f + w[5]*t

# No final_answer submitted -> score penalized by 40%
if not state_dict.get("final_answer"):
    score *= 0.6
```

```python
# graders.py - weights dict (h, e, i, r, f, t)
weights = {
    "easy":   (0.25, 0.15, 0.30, 0.10, 0.10, 0.10),
    "medium": (0.20, 0.20, 0.25, 0.10, 0.15, 0.10),
    "hard":   (0.15, 0.25, 0.20, 0.10, 0.20, 0.10),
}
# Order: hypothesis · experiment · improvement · reasoning · final_answer · trajectory
```

| # | Component | Easy | Medium | Hard | Scoring Formula |
|---|---|---|---|---|---|
| `h` | Hypothesis Quality | 0.25 | 0.20 | 0.15 | Keyword overlap between `current_hypothesis` and task `ground_truth_keywords` |
| `e` | Experiment Quality | 0.15 | 0.20 | 0.25 | 0.6 × diversity + 0.4 × found_optimal − 0.2 × repetition_penalty |
| `i` | Improvement Score | 0.30 | 0.25 | 0.20 | (best_accuracy − baseline) / (optimal − baseline), capped at 1.0 |
| `r` | Reasoning Quality | 0.10 | 0.10 | 0.10 | Sequence score: read→hypothesis (+0.3), design→run (+0.3), analyze (+0.2), refine (+0.2) |
| `f` | Final Answer Quality | 0.10 | 0.15 | 0.20 | 0.5 × keyword_overlap + 0.5 × Jaccard similarity vs. ground truth |
| `t` | Trajectory Learning | 0.10 | 0.10 | 0.10 | Fraction of consecutive experiments showing accuracy improvement |
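To make the combination concrete, here is a self-contained sketch of the weighted sum; the component values are stand-ins, while the weights and the 0.6 penalty mirror the snippets above.

```python
# Self-contained sketch of the difficulty-scaled weighted sum shown above.
# The component values (h, e, i, r, f, t) are stand-ins; in graders.py each
# is computed from the episode state before being combined.
weights = {
    "easy":   (0.25, 0.15, 0.30, 0.10, 0.10, 0.10),
    "medium": (0.20, 0.20, 0.25, 0.10, 0.15, 0.10),
    "hard":   (0.15, 0.25, 0.20, 0.10, 0.20, 0.10),
}

def grade(components: tuple[float, ...], difficulty: str, has_final_answer: bool) -> float:
    """Combine (h, e, i, r, f, t) scores into a single episode grade."""
    w = weights[difficulty]
    score = sum(wi * ci for wi, ci in zip(w, components))
    if not has_final_answer:
        score *= 0.6  # 40% penalty for never submitting a final answer
    return score

# Example: a medium task where the agent did well everywhere but never concluded.
print(grade((0.8, 0.7, 0.9, 1.0, 0.0, 0.6), "medium", has_final_answer=False))
```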
On top of episode grading, individual actions are shaped step by step:

| Action | Reward Signal | Notes |
|---|---|---|
| `read_paper` | +0.05 × n_papers | Diminished to +0.01 on redundant re-read |
| `propose_hypothesis` | +0.05 + 0.20 × quality + 0.05 bonus | Bonus if papers were read first |
| `design_experiment` | +0.03 | Per valid method_id:dataset_id design |
| `run_experiment` (new best) | +0.02 + min(0.30, improvement) | improvement = accuracy − baseline |
| `run_experiment` (no improvement) | −0.01 | Fails to beat current best |
| `run_experiment` (duplicate) | −0.05 | Exact same method+dataset combo |
| `analyze_results` | +0.05 + 0.05 trend_bonus | Trend bonus if last > previous accuracy |
| `refine_hypothesis` | +0.03 + 0.10 × quality_delta | −0.02 if quality regresses |
| `final_answer` | +0.10 + 0.50 × final_score | −0.10 if no experiments were run |
| Repeated action type | −0.03 | Applied on any consecutive duplicate action type |
| Invalid action | −0.10 | Unknown `action_type` submitted |
| Max steps without `final_answer` | −0.20 | Episode forcibly terminated |
Characteristics:
- Dense and incremental (not sparse/binary)
- Penalizes invalid/redundant actions
- Rewards information gathering and refinement
- Difficulty-dependent weight distribution
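As an illustration of how a few of these shaping rules might compose inside a single step, consider the sketch below; the constants match the table, but the state fields and control flow are hypothetical simplifications.

```python
# Illustrative sketch of a few shaping rules from the table above; the state
# fields and helper layout are hypothetical, the constants match the table.
def shaped_reward(action_type: str, state: dict) -> float:
    if action_type == "run_experiment":
        if state["is_duplicate"]:
            return -0.05                          # exact same method+dataset combo
        acc = state["last_accuracy"]
        if acc > state["best_accuracy"]:          # new best result
            improvement = acc - state["baseline_accuracy"]
            return 0.02 + min(0.30, improvement)
        return -0.01                              # ran, but failed to beat the best
    if action_type == "analyze_results":
        trend = state["last_accuracy"] > state["prev_accuracy"]
        return 0.05 + (0.05 if trend else 0.0)    # trend bonus
    # ...remaining action types elided; unknown types are penalized with -0.10
    return -0.10
```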
## Agent Actions

Agents have a predefined set of tools that mimic real-world machine-learning research workflows:
| Action | Description |
|---|---|
| `read_paper` | Read paper summaries for domain knowledge |
| `propose_hypothesis` | Form an initial hypothesis |
| `design_experiment` | Specify a method + dataset combination |
| `run_experiment` | Execute a designed experiment |
| `analyze_results` | Get a structured analysis of results |
| `refine_hypothesis` | Update the hypothesis based on evidence |
| `final_answer` | Submit a conclusion (ends the episode) |
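A step request carries one of these action types plus its arguments. A hypothetical pair of payloads (the field names are assumptions; see models.py for the authoritative dataclasses):

```python
# Hypothetical action payloads; see models.py for the real Action definition.
design = {
    "action_type": "design_experiment",
    "payload": {"method_id": "resnet50", "dataset_id": "cifar10"},
}
conclude = {
    "action_type": "final_answer",
    "payload": {"answer": "Fine-tuned CNNs beat the baseline on the held-out set."},
}
```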
## Training Pipeline

This environment slots directly into reinforcement-learning workflows. Because the step API is OpenEnv compliant, it maps cleanly onto standard gym.Env-style wrappers: a PPO loop can consume the JSON step output, accumulate the reward, and update the policy network without any new environment logic.
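As a sketch of what such a rollout looks like against the HTTP API (not shipped code: the /reset endpoint and JSON field names are assumptions):

```python
# Minimal sketch of a gym-style rollout over the HTTP API.
# Not shipped code: the endpoints and JSON fields are assumptions.
import requests

BASE_URL = "http://localhost:7860"

def rollout(policy, max_steps: int = 50) -> float:
    """Run one episode, returning the cumulative reward for the policy update."""
    obs = requests.post(f"{BASE_URL}/reset", timeout=30).json()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)  # your PPO policy maps observation -> action dict
        step = requests.post(f"{BASE_URL}/step", json=action, timeout=30).json()
        obs = step["observation"]
        total_reward += step["reward"]
        if step.get("done"):
            break
    return total_reward
```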
## Project Structure

```
├── models.py          # Action, Observation, State dataclasses
├── tasks.py           # Task definitions (easy, medium, hard)
├── graders.py         # Deterministic multi-factor graders
├── environment.py     # Core environment (reset/step/state loop)
├── inference.py       # Baseline automated execution logic
├── server/
│   ├── __init__.py
│   └── app.py         # FastAPI HTTP serverless integration
├── dashboard/         # React + Vite UI
│   ├── src/
│   │   ├── components/
│   │   ├── store/     # Zustand state management
│   │   ├── hooks/
│   │   └── types/     # Front-end typings
│   ├── index.html
│   └── package.json
├── openenv.yaml       # OpenEnv manifest
├── Dockerfile         # Container definition
├── requirements.txt   # Python dependencies
└── README.md          # This file
```
## Configuration

For LLM proxy execution, the server must be given a Hugging Face token. Add the following to your Space's repository secrets, or to a .env file for local development:
```bash
HF_TOKEN=hf_xxxxxxxxxxxxxxxxx
```

## License

This project is licensed under the MIT License.
## Author

Created by Team One Way.
| Name | Role |
|---|---|
| Jiya Jahnavi | Co-Developer |
| Aditya Kumar Singh | Lead Developer |
| Rishabh Yadav | Co-Developer |
## Hackathon

Developed for the Meta Python OpenENV Hackathon 2026.