- Overview
- Demo & Results
- System Architecture
- Action Classes & Key Mapping
- Dataset Pipeline
- Model Architecture & Training
- Real-Time Inference Pipeline
- Performance Metrics
- Project Structure
- Installation & Setup
- How to Run
- Technical Details
- Requirements
A production-grade, real-time AI system that enables touchless game control through full-body gesture recognition. The system uses a pipeline combining YOLO person tracking, MediaPipe pose estimation, and an LSTM sequence model trained on tens of thousands of labeled action sequences — allowing a player to control a game character using nothing but their body movements.
No controller. No keyboard. Your body IS the controller.
| Feature | Detail |
|---|---|
| 🎯 Test Accuracy | 99.69% on held-out test set |
| ⚡ Inference | Sub-100ms end-to-end real-time latency |
| 🦴 Pose Points | 33 full-body skeletal landmarks (MediaPipe) |
| 🎮 Action Classes | 5 — Jump, Kick, Punch, MoveForward, MoveBackward |
| 📊 Dataset Size | 40,865 labeled sequences of shape (30, 132) |
| 🧠 Model | Stacked LSTM + BatchNorm + Dense layers |
| 🔁 Smoothing | Majority voting over last 5 predictions |
Confusion Matrix (rows = actual, columns = predicted):

| | Jump | Kick | MoveBackward | MoveForward | Punch |
|---|---|---|---|---|---|
| Jump | 1482 | 0 | 0 | 4 | 0 |
| Kick | 1 | 2138 | 4 | 2 | 0 |
| MoveBackward | 0 | 0 | 1724 | 10 | 0 |
| MoveForward | 0 | 0 | 1 | 1940 | 0 |
| Punch | 0 | 1 | 2 | 0 | 864 |

Classification Report:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| Jump | 1.00 | 1.00 | 1.00 | 1486 |
| Kick | 1.00 | 1.00 | 1.00 | 2145 |
| MoveBackward | 1.00 | 0.99 | 1.00 | 1734 |
| MoveForward | 0.99 | 1.00 | 1.00 | 1941 |
| Punch | 1.00 | 1.00 | 1.00 | 867 |
| accuracy | | | 1.00 | 8173 |
| macro avg | 1.00 | 1.00 | 1.00 | 8173 |
| weighted avg | 1.00 | 1.00 | 1.00 | 8173 |
╔══════════════════════════════════════════════════════════════════════╗
║ REAL-TIME ACTION RECOGNITION PIPELINE ║
╠══════════════════════════════════════════════════════════════════════╣
║ ║
║ 📷 Webcam Feed ║
║ │ ║
║ ▼ ║
║ 🔍 YOLO Person Tracking ──► Bounding Box + Track ID ║
║ │ ║
║ ▼ ║
║ 🦴 MediaPipe Pose Estimation ──► 33 Skeletal Landmarks ║
║ │ (x, y, z, visibility × 33 = 132) ║
║ ▼ ║
║ ⚖️ Hip Centralization ──► Normalize relative to hip center ║
║ │ ║
║ ▼ ║
║ 🪟 Sliding Window ──► 30-frame temporal sequence ║
║ │ ║
║ ▼ ║
║ 🧠 LSTM Deep Learning ──► 5-class action prediction ║
║ │ ║
║ ▼ ║
║ 🗳️ Majority Voting ──► Smoothed prediction (k=5 frames) ║
║ │ ║
║ ▼ ║
║ ⌨️ Virtual Keyboard ──► Game Control (↑ → ← C V) ║
║ ║
╚══════════════════════════════════════════════════════════════════════╝
| 🕹️ Body Action | ⌨️ Key Triggered | 🎯 Game Use |
|---|---|---|
| 🦘 Jump | ↑ (Up Arrow) | Character jump |
| 👊 Punch | C | Attack / punch move |
| 🦵 Kick | V | Kick attack |
| 🏃 MoveForward | → (Right Arrow) | Move character right |
| ⬅️ MoveBackward | ← (Left Arrow) | Move character left |
The system includes an ActionMemory manager that:
- Tracks the previously executed action
- Mode 1 (Block Repeat): Prevents re-triggering the same action until a new one is detected — avoids button spamming
- Mode 2 (Allow Repeat): Executes every valid prediction continuously — suited for movement actions
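The two modes can be sketched as follows; the class and method names (`ActionMemory`, `should_execute`) are assumptions for illustration, not the project's actual API:

```python
# Hypothetical sketch of the ActionMemory behavior described above.
class ActionMemory:
    def __init__(self, block_repeat=True):
        self.block_repeat = block_repeat  # True = Mode 1, False = Mode 2
        self.previous = None              # previously executed action

    def should_execute(self, action):
        # Mode 2 (Allow Repeat): every valid prediction fires.
        if not self.block_repeat:
            self.previous = action
            return True
        # Mode 1 (Block Repeat): only fire when the action changes.
        if action == self.previous:
            return False
        self.previous = action
        return True
```

In practice Mode 1 suits one-shot actions like Jump or Punch, while Mode 2 suits held movement actions.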
NTU RGB+D Action Recognition Dataset
Academic access required — submit request via official portal.
# Action code → label mapping used:
action_map = {
    "A024": "Kick",          # Kick (NTU class 24)
    "A051": "Kick",          # Kick (NTU class 51)
    "A050": "Punch",         # Punch
    "A026": "Jump",          # Jump up
    "A027": "Jump",          # Jump down
    "A059": "MoveForward",   # Walking forward
    "A060": "MoveBackward",  # Walking backward
}

- Total source videos: 2,880
- Videos extracted (selected actions): 336 → 800+ after multi-session accumulation
For each video, the pipeline:
- Runs YOLO tracking to isolate the target person (user manually selects Track ID)
- Crops the person region per frame
- Runs MediaPipe Pose on the crop → extracts 33 landmarks × 4 values (x, y, z, visibility) = 132 features
- Saves extracted pose sequences to `mediapipe_pose.csv` incrementally
Robust tracking features:
- IoU-based re-identification across frames
- Predicted bounding box interpolation during occlusions
- Expansion of search area for lost tracks
- Full-frame fallback detection (up to 30 frames without detection)
- Skip tracking (Space = irrelevant action, ESC = skip video)
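The IoU-based re-identification step boils down to an overlap ratio between two bounding boxes. A minimal sketch, assuming the `(x1, y1, x2, y2)` corner format:

```python
# Intersection-over-Union between two axis-aligned boxes (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0
```

A detection in the next frame is matched to the track whose last box has the highest IoU above some threshold.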
# Per-video processing:
# 1. Extract pose joint array: shape (frames, 132)
# 2. Hip centralization → subtract hip midpoint from all joints (x, y, z)
# 3. Sliding window with step=1 → shape (N_windows, 30, 132)

# Final saved dataset:
sequences: shape (40865, 30, 132)
labels:    shape (40865,)   # action class strings
video_ids: shape (40865,)   # source video filename

Final Dataset Stats:
| Split | Samples |
|---|---|
| Train (64%) | ~26,154 |
| Validation (16%) | ~6,538 |
| Test (20%) | 8,173 |
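The sliding-window step (window = 30, step = 1) can be sketched as below; `make_windows` is a hypothetical helper, not the notebook's exact code:

```python
import numpy as np

def make_windows(pose_array, window=30, step=1):
    """Slice a (frames, 132) pose array into (n_windows, window, 132)."""
    n_frames, n_feats = pose_array.shape
    if n_frames < window:
        return np.empty((0, window, n_feats))
    starts = range(0, n_frames - window + 1, step)
    return np.stack([pose_array[s:s + window] for s in starts])
```

With step=1 a 40-frame video yields 11 overlapping windows, which is how a few hundred source videos expand into tens of thousands of training sequences.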
Input Shape: (30, 132) → 30 time steps × 132 features
│
┌───────────────▼───────────────┐
│ LSTM(128, return_sequences) │ ← Temporal pattern capture
│ BatchNormalization │
│ Dropout(0.3) │
├───────────────────────────────┤
│ LSTM(64) │ ← Sequence summary
│ BatchNormalization │
│ Dropout(0.3) │
├───────────────────────────────┤
│ Dense(128, ReLU) │
│ BatchNormalization │
│ Dropout(0.3) │
├───────────────────────────────┤
│ Dense(64, ReLU) │
│ BatchNormalization │
│ Dropout(0.3) │
├───────────────────────────────┤
│ Dense(5, Softmax) │ ← 5-class output
└───────────────────────────────┘
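The diagram maps directly onto a Keras `Sequential` model. A sketch following the stated layer sizes; the builder function name is an assumption:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (LSTM, BatchNormalization, Dense,
                                     Dropout, Input)

def build_model(num_classes=5):
    return Sequential([
        Input(shape=(30, 132)),            # 30 time steps × 132 features
        LSTM(128, return_sequences=True),  # temporal pattern capture
        BatchNormalization(),
        Dropout(0.3),
        LSTM(64),                          # sequence summary
        BatchNormalization(),
        Dropout(0.3),
        Dense(128, activation="relu"),
        BatchNormalization(),
        Dropout(0.3),
        Dense(64, activation="relu"),
        BatchNormalization(),
        Dropout(0.3),
        Dense(num_classes, activation="softmax"),  # 5-class output
    ])
```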
| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Loss Function | Sparse Categorical Crossentropy |
| Epochs | 100 (max) |
| Early Stopping Patience | 10 epochs |
| Batch Size | 32 |
| Best Epoch | 36 |
| Total Epochs Run | 46 (early stopped) |
| Epoch | Train Acc | Val Acc | Val Loss |
|---|---|---|---|
| 1 | 62.85% | 75.55% | 0.6471 |
| 4 | 89.78% | 91.60% | 0.2390 |
| 8 | 94.76% | 97.49% | 0.0858 |
| 17 | 97.56% | 98.69% | 0.0419 |
| 25 | 98.25% | 99.31% | 0.0267 |
| 36 | 99.05% | 99.53% | 0.0150 ← Best |
| 46 | 99.13% | 98.65% | 0.0453 (Early Stop) |
- Camera Init — Opens webcam at 1280×720
- Person Selection — YOLO detects and tracks all people; user types Track ID + Enter
- Pose Loop — For every frame of the selected person:
- Crop bounding box
- Extract 33 MediaPipe landmarks (132 values)
- Append to rolling deque (max=60 frames)
- Prediction — When deque ≥ 30 frames:
- Take last 30 frames
- Apply hip centralization
- Run through LSTM model
- Apply confidence threshold (≥ 0.5)
- Smoothing — Majority vote over last 5 predictions
- Key Press — Map smoothed action → virtual keyboard press via `pynput`
- Memory Check — ActionMemory decides whether to execute or block
┌────────────────────────────────┐
│ Prediction: Jump │
│ Confidence: 0.97 │
│ Previous: Kick │
│ Actions Exec: 42 │
│ │
│ [Tracking box + pose skeleton]│
└────────────────────────────────┘
╔══════════════════════════════════════════════════════╗
║ 📊 Model Performance Summary ║
╠══════════════════════════════════════════════════════╣
║ Test Accuracy ████████████████████ 99.69% ║
║ Train Accuracy (best) ████████████████████ 99.05% ║
║ Val Accuracy (best) ████████████████████ 99.53% ║
║ Test Loss 0.009 (extremely low) ║
║ Inference Latency <100ms (real-time) ║
╚══════════════════════════════════════════════════════╝
Per-Class Performance:
| Action | Precision | Recall | F1-Score | Test Support |
|---|---|---|---|---|
| Jump | 1.00 | 1.00 | 1.00 | 1,486 |
| Kick | 1.00 | 1.00 | 1.00 | 2,145 |
| MoveBackward | 1.00 | 0.99 | 1.00 | 1,734 |
| MoveForward | 0.99 | 1.00 | 1.00 | 1,941 |
| Punch | 1.00 | 1.00 | 1.00 | 867 |
| Overall | 1.00 | 1.00 | 1.00 | 8,173 |
Real_Time_Game_Control/
│
├── 📂 code/
│ ├── dataset_Preparation.ipynb # Full data pipeline
│ ├── lstm_training.ipynb # Model training & evaluation
│ └── real_time_model.ipynb # Live inference + game control
│
├── 📂 Dataset/
│ ├── nturgb+d_rgb/ # Raw NTU RGB+D videos (not tracked)
│ ├── video/ # Filtered action videos
│ ├── mediapipe_pose.csv # Extracted pose data
│ ├── yolo_pose.csv # YOLO-extracted pose data
│ ├── skipped_videos.csv # Skipped video log
│ └── mediapipe_dataset_132.npz # Final training dataset
│
├── 📂 Models/
│ ├── best_lstm_model.keras # Best saved model (Keras format)
│ ├── best_lstm_model.h5 # Best saved model (H5 format)
│ ├── label_encoder.pkl # Sklearn LabelEncoder
│ ├── class_info.pkl # Class names & mappings
│ ├── yolo11n-pose.pt # YOLO pose model (nano)
│ └── yolo11s-pose.pt # YOLO pose model (small)
│
└── README.md
- Python 3.10+
- CUDA-compatible GPU (recommended for training)
- Webcam (for real-time inference)
git clone https://github.com/uqasha524/real-time-action-recognition.git
cd real-time-action-recognition

python -m venv venv
# Windows
venv\Scripts\activate
# Linux/macOS
source venv/bin/activate

pip install tensorflow
pip install torch torchvision
pip install ultralytics # YOLO
pip install mediapipe
pip install opencv-python
pip install scikit-learn
pip install pandas numpy
pip install matplotlib seaborn
pip install pynput # Virtual keyboard control
pip install jupyter

from ultralytics import YOLO
# Auto-downloads on first use
model = YOLO('yolo11n-pose.pt') # Nano (faster)
model = YOLO('yolo11s-pose.pt')  # Small (more accurate)

Request access to the NTU RGB+D Dataset via the official portal.
Submit an academic request form — approval typically takes 1-3 days.
If you already have the trained model files in /Models/:
jupyter notebook code/real_time_model.ipynb

- Run all cells
- A webcam window will open
- Type the Track ID of the person you want to track → Press Enter
- Start performing gestures — game keys will be triggered automatically!
Step 1: Dataset Preparation
jupyter notebook code/dataset_Preparation.ipynb

- Update `source_dir` to your NTU RGB+D dataset path
- Run the extraction cell — manually select person Track ID for each video
- Press Space to skip irrelevant videos, ESC to skip during tracking
- Wait for `mediapipe_pose.csv` and `mediapipe_dataset_132.npz` to be generated
Step 2: Train the Model
jupyter notebook code/lstm_training.ipynb

- Run all cells
- Training takes ~30-60 minutes on GPU (46 epochs ran, best at epoch 36)
- Model is saved automatically to `/Models/`
Step 3: Real-Time Inference
jupyter notebook code/real_time_model.ipynb

All joint coordinates are normalized relative to the hip midpoint before training and inference:
hip_x = (frame[LEFT_HIP*4] + frame[RIGHT_HIP*4]) / 2
hip_y = (frame[LEFT_HIP*4+1] + frame[RIGHT_HIP*4+1]) / 2
hip_z = (frame[LEFT_HIP*4+2] + frame[RIGHT_HIP*4+2]) / 2
# Subtract hip from all joints (x, y, z only — visibility unchanged)
frame[0::4] -= hip_x # All X coordinates
frame[1::4] -= hip_y # All Y coordinates
frame[2::4] -= hip_z  # All Z coordinates

This makes the model position-invariant — the same action is recognized regardless of where the person stands in the frame.
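The same centralization can be applied to a whole 30-frame window at once. A vectorized NumPy sketch, using MediaPipe's hip landmark indices (23 and 24):

```python
import numpy as np

LEFT_HIP, RIGHT_HIP = 23, 24  # MediaPipe Pose landmark indices

def centralize(window):
    """Subtract the hip midpoint from all joints in a (frames, 132) window."""
    w = window.copy()
    for axis in range(3):  # x, y, z only; visibility (offset 3) untouched
        hip_mid = (w[:, LEFT_HIP * 4 + axis] + w[:, RIGHT_HIP * 4 + axis]) / 2
        w[:, axis::4] -= hip_mid[:, None]  # broadcast over all 33 joints
    return w
```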
# Window size: 30 frames
# Step size: 1 (maximum overlap for dense training data)
# Each window: shape (30, 132)

# For real-time:
from collections import deque
pose_sequence = deque(maxlen=60)  # Rolling buffer
# Predict on the last 30 frames, every frame

from statistics import mean

def smooth_predictions(predictions, k=5):
    recent = predictions[-k:]                      # last k (label, confidence) pairs
    labels = [p[0] for p in recent]
    majority = max(set(labels), key=labels.count)  # Majority vote
    conf = mean(p[1] for p in recent if p[0] == majority)
    return majority, conf

CONF_THRESHOLD = 0.5
# Only actions with softmax probability >= 0.5 are considered valid
# Others are labeled "Low Confidence" and not mapped to any key

# Filename format: ActionName_SeqNumber.avi
# e.g.: Jump_701.avi → label = "Jump"
# MoveBackward_123.avi → label = "MoveBackward"
import os

def extract_action_label(filename):
    parts = os.path.splitext(filename)[0].split("_")
    if parts[-1].isdigit():
        return "_".join(parts[:-1])  # Remove trailing sequence number
    return "_".join(parts)

tensorflow>=2.12
torch>=2.0
ultralytics>=8.0
mediapipe>=0.10
opencv-python>=4.8
scikit-learn>=1.3
pandas>=2.0
numpy>=1.24
matplotlib>=3.7
seaborn>=0.12
pynput>=1.7
jupyter>=1.0
- Add more action classes (Duck, Roll, Sprint, Block)
- Replace LSTM with Transformer-based architecture (e.g., TimeSformer)
- Deploy as standalone executable (no Jupyter required)
- Add multi-player support (track multiple IDs simultaneously)
- Train on custom game-specific gesture sets
- Export to ONNX for faster CPU inference
This project is for academic and research purposes. The NTU RGB+D dataset is subject to its own license terms from NTU Singapore.