- Overview
- Demo & Results
- System Architecture
- Action Classes & Key Mapping
- Dataset Pipeline
- Model Architecture & Training
- Real-Time Inference Pipeline
- Performance Metrics
- Project Structure
- Installation & Setup
- How to Run
- Technical Details
- Requirements
A production-grade, real-time AI system that enables touchless game control through full-body gesture recognition. The system uses a pipeline combining YOLO person tracking, MediaPipe pose estimation, and an LSTM sequence model trained on tens of thousands of labeled action sequences — allowing a player to control a game character using nothing but their body movements.
No controller. No keyboard. Your body IS the controller.
| Feature | Detail |
|---|---|
| 🎯 Test Accuracy | 99.69% on held-out test set |
| ⚡ Inference | Sub-100ms end-to-end real-time latency |
| 🦴 Pose Points | 33 full-body skeletal landmarks (MediaPipe) |
| 🎮 Action Classes | 5 — Jump, Kick, Punch, MoveForward, MoveBackward |
| 📊 Dataset Size | 40,865 labeled sequences of shape (30, 132) |
| 🧠 Model | Stacked LSTM + BatchNorm + Dense layers |
| 🔁 Smoothing | Majority voting over last 5 predictions |
Confusion Matrix (rows = actual, columns = predicted):

| | Jump | Kick | MoveBackward | MoveForward | Punch |
|---|---|---|---|---|---|
| Jump | 1482 | 0 | 0 | 4 | 0 |
| Kick | 1 | 2138 | 4 | 2 | 0 |
| MoveBackward | 0 | 0 | 1724 | 10 | 0 |
| MoveForward | 0 | 0 | 1 | 1940 | 0 |
| Punch | 0 | 1 | 2 | 0 | 864 |

Classification Report:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| Jump | 1.00 | 1.00 | 1.00 | 1486 |
| Kick | 1.00 | 1.00 | 1.00 | 2145 |
| MoveBackward | 1.00 | 0.99 | 1.00 | 1734 |
| MoveForward | 0.99 | 1.00 | 1.00 | 1941 |
| Punch | 1.00 | 1.00 | 1.00 | 867 |
| accuracy | | | 1.00 | 8173 |
| macro avg | 1.00 | 1.00 | 1.00 | 8173 |
| weighted avg | 1.00 | 1.00 | 1.00 | 8173 |
╔══════════════════════════════════════════════════════════════════════╗
║ REAL-TIME ACTION RECOGNITION PIPELINE ║
╠══════════════════════════════════════════════════════════════════════╣
║ ║
║ 📷 Webcam Feed ║
║ │ ║
║ ▼ ║
║ 🔍 YOLO Person Tracking ──► Bounding Box + Track ID ║
║ │ ║
║ ▼ ║
║ 🦴 MediaPipe Pose Estimation ──► 33 Skeletal Landmarks ║
║ │ (x, y, z, visibility × 33 = 132) ║
║ ▼ ║
║ ⚖️ Hip Centralization ──► Normalize relative to hip center ║
║ │ ║
║ ▼ ║
║ 🪟 Sliding Window ──► 30-frame temporal sequence ║
║ │ ║
║ ▼ ║
║ 🧠 LSTM Deep Learning ──► 5-class action prediction ║
║ │ ║
║ ▼ ║
║ 🗳️ Majority Voting ──► Smoothed prediction (k=5 frames) ║
║ │ ║
║ ▼ ║
║ ⌨️ Virtual Keyboard ──► Game Control (↑ → ← C V) ║
║ ║
╚══════════════════════════════════════════════════════════════════════╝
| 🕹️ Body Action | ⌨️ Key Triggered | 🎯 Game Use |
|---|---|---|
| 🦘 Jump | ↑ (Up Arrow) | Character jump |
| 👊 Punch | C | Attack / punch move |
| 🦵 Kick | V | Kick attack |
| 🏃 MoveForward | → (Right Arrow) | Move character right |
| ⬅️ MoveBackward | ← (Left Arrow) | Move character left |
The system includes an ActionMemory manager that:
- Tracks the previously executed action
- Mode 1 (Block Repeat): Prevents re-triggering the same action until a new one is detected — avoids button spamming
- Mode 2 (Allow Repeat): Executes every valid prediction continuously — suited for movement actions
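The two modes can be sketched as follows; the class and method names (`ActionMemory`, `should_execute`) are assumptions for illustration, not the project's actual API:

```python
# Hypothetical sketch of the ActionMemory behavior described above.
class ActionMemory:
    def __init__(self, block_repeat=True):
        self.block_repeat = block_repeat  # True = Mode 1, False = Mode 2
        self.previous = None              # previously executed action

    def should_execute(self, action):
        # Mode 2 (Allow Repeat): every valid prediction fires.
        if not self.block_repeat:
            self.previous = action
            return True
        # Mode 1 (Block Repeat): only fire when the action changes.
        if action == self.previous:
            return False
        self.previous = action
        return True
```

In practice Mode 1 suits one-shot actions like Jump or Punch, while Mode 2 suits held movement actions.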
NTU RGB+D Action Recognition Dataset
Academic access required — submit request via official portal.
# Action code → label mapping used:
action_map = {
    "A024": "Kick",          # Kick (NTU class 24)
    "A051": "Kick",          # Kick (NTU class 51)
    "A050": "Punch",         # Punch
    "A026": "Jump",          # Jump up
    "A027": "Jump",          # Jump down
    "A059": "MoveForward",   # Walking forward
    "A060": "MoveBackward",  # Walking backward
}

- Total source videos: 2,880
- Videos extracted (selected actions): 336 → 800+ after multi-session accumulation
For each video, the pipeline:
- Runs YOLO tracking to isolate the target person (user manually selects Track ID)
- Crops the person region per frame
- Runs MediaPipe Pose on the crop → extracts 33 landmarks × 4 values (x, y, z, visibility) = 132 features
- Saves extracted pose sequences to `mediapipe_pose.csv` incrementally
Robust tracking features:
- IoU-based re-identification across frames
- Predicted bounding box interpolation during occlusions
- Expansion of search area for lost tracks
- Full-frame fallback detection (up to 30 frames without detection)
- Skip tracking (Space = irrelevant action, ESC = skip video)
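The IoU-based re-identification step boils down to an overlap ratio between two bounding boxes. A minimal sketch, assuming the `(x1, y1, x2, y2)` corner format:

```python
# Intersection-over-Union between two axis-aligned boxes (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0
```

A detection in the next frame is matched to the track whose last box has the highest IoU above some threshold.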
# Per-video processing:
# 1. Extract pose joint array: shape (frames, 132)
# 2. Hip centralization → subtract hip midpoint from all joints (x, y, z)
# 3. Sliding window with step=1 → shape (N_windows, 30, 132)

# Final saved dataset:
sequences: shape (40865, 30, 132)
labels:    shape (40865,)   # action class strings
video_ids: shape (40865,)   # source video filename

Final Dataset Stats:
| Split | Samples |
|---|---|
| Train (64%) | ~26,154 |
| Validation (16%) | ~6,538 |
| Test (20%) | 8,173 |
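The sliding-window step (window = 30, step = 1) can be sketched as below; `make_windows` is a hypothetical helper, not the notebook's exact code:

```python
import numpy as np

def make_windows(pose_array, window=30, step=1):
    """Slice a (frames, 132) pose array into (n_windows, window, 132)."""
    n_frames, n_feats = pose_array.shape
    if n_frames < window:
        return np.empty((0, window, n_feats))
    starts = range(0, n_frames - window + 1, step)
    return np.stack([pose_array[s:s + window] for s in starts])
```

With step=1 a 40-frame video yields 11 overlapping windows, which is how a few hundred source videos expand into tens of thousands of training sequences.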
Input Shape: (30, 132) → 30 time steps × 132 features
│
┌───────────────▼───────────────┐
│ LSTM(128, return_sequences) │ ← Temporal pattern capture
│ BatchNormalization │
│ Dropout(0.3) │
├───────────────────────────────┤
│ LSTM(64) │ ← Sequence summary
│ BatchNormalization │
│ Dropout(0.3) │
├───────────────────────────────┤
│ Dense(128, ReLU) │
│ BatchNormalization │
│ Dropout(0.3) │
├───────────────────────────────┤
│ Dense(64, ReLU) │
│ BatchNormalization │
│ Dropout(0.3) │
├───────────────────────────────┤
│ Dense(5, Softmax) │ ← 5-class output
└───────────────────────────────┘
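The diagram maps directly onto a Keras `Sequential` model. A sketch following the stated layer sizes; the builder function name is an assumption:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (LSTM, BatchNormalization, Dense,
                                     Dropout, Input)

def build_model(num_classes=5):
    return Sequential([
        Input(shape=(30, 132)),            # 30 time steps × 132 features
        LSTM(128, return_sequences=True),  # temporal pattern capture
        BatchNormalization(),
        Dropout(0.3),
        LSTM(64),                          # sequence summary
        BatchNormalization(),
        Dropout(0.3),
        Dense(128, activation="relu"),
        BatchNormalization(),
        Dropout(0.3),
        Dense(64, activation="relu"),
        BatchNormalization(),
        Dropout(0.3),
        Dense(num_classes, activation="softmax"),  # 5-class output
    ])
```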
| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Loss Function | Sparse Categorical Crossentropy |
| Epochs | 100 (max) |
| Early Stopping Patience | 10 epochs |
| Batch Size | 32 |
| Best Epoch | 36 |
| Total Epochs Run | 46 (early stopped) |
| Epoch | Train Acc | Val Acc | Val Loss |
|---|---|---|---|
| 1 | 62.85% | 75.55% | 0.6471 |
| 4 | 89.78% | 91.60% | 0.2390 |
| 8 | 94.76% | 97.49% | 0.0858 |
| 17 | 97.56% | 98.69% | 0.0419 |
| 25 | 98.25% | 99.31% | 0.0267 |
| 36 | 99.05% | 99.53% | 0.0150 ← Best |
| 46 | 99.13% | 98.65% | 0.0453 (Early Stop) |
- Camera Init — Opens webcam at 1280×720
- Person Selection — YOLO detects and tracks all people; user types Track ID + Enter
- Pose Loop — For every frame of the selected person:
- Crop bounding box
- Extract 33 MediaPipe landmarks (132 values)
- Append to rolling deque (max=60 frames)
- Prediction — When deque ≥ 30 frames:
- Take last 30 frames
- Apply hip centralization
- Run through LSTM model
- Apply confidence threshold (≥ 0.5)
- Smoothing — Majority vote over last 5 predictions
- Key Press — Map smoothed action → virtual keyboard press via `pynput`
- Memory Check — ActionMemory decides whether to execute or block
┌────────────────────────────────┐
│ Prediction: Jump │
│ Confidence: 0.97 │
│ Previous: Kick │
│ Actions Exec: 42 │
│ │
│ [Tracking box + pose skeleton]│
└────────────────────────────────┘
╔══════════════════════════════════════════════════════╗
║ 📊 Model Performance Summary ║
╠══════════════════════════════════════════════════════╣
║ Test Accuracy ████████████████████ 99.69% ║
║ Train Accuracy (best) ████████████████████ 99.05% ║
║ Val Accuracy (best) ████████████████████ 99.53% ║
║ Test Loss 0.009 (extremely low) ║
║ Inference Latency <100ms (real-time) ║
╚══════════════════════════════════════════════════════╝
Per-Class Performance:
| Action | Precision | Recall | F1-Score | Test Support |
|---|---|---|---|---|
| Jump | 1.00 | 1.00 | 1.00 | 1,486 |
| Kick | 1.00 | 1.00 | 1.00 | 2,145 |
| MoveBackward | 1.00 | 0.99 | 1.00 | 1,734 |
| MoveForward | 0.99 | 1.00 | 1.00 | 1,941 |
| Punch | 1.00 | 1.00 | 1.00 | 867 |
| Overall | 1.00 | 1.00 | 1.00 | 8,173 |
Real_Time_Game_Control/
│
├── 📂 code/
│ ├── dataset_Preparation.ipynb # Full data pipeline
│ ├── lstm_training.ipynb # Model training & evaluation
│ └── real_time_model.ipynb # Live inference + game control
│
├── 📂 Dataset/
│ ├── nturgb+d_rgb/ # Raw NTU RGB+D videos (not tracked)
│ ├── video/ # Filtered action videos
│ ├── mediapipe_pose.csv # Extracted pose data
│ ├── yolo_pose.csv # YOLO-extracted pose data
│ ├── skipped_videos.csv # Skipped video log
│ └── mediapipe_dataset_132.npz # Final training dataset
│
├── 📂 Models/
│ ├── best_lstm_model.keras # Best saved model (Keras format)
│ ├── best_lstm_model.h5 # Best saved model (H5 format)
│ ├── label_encoder.pkl # Sklearn LabelEncoder
│ ├── class_info.pkl # Class names & mappings
│ ├── yolo11n-pose.pt # YOLO pose model (nano)
│ └── yolo11s-pose.pt # YOLO pose model (small)
│
└── README.md
- Python 3.10+
- CUDA-compatible GPU (recommended for training)
- Webcam (for real-time inference)
git clone https://github.com/uqasha524/real-time-action-recognition.git
cd real-time-action-recognition

python -m venv venv
# Windows
venv\Scripts\activate
# Linux/macOS
source venv/bin/activate

pip install tensorflow
pip install torch torchvision
pip install ultralytics # YOLO
pip install mediapipe
pip install opencv-python
pip install scikit-learn
pip install pandas numpy
pip install matplotlib seaborn
pip install pynput # Virtual keyboard control
pip install jupyter

from ultralytics import YOLO
# Auto-downloads on first use
model = YOLO('yolo11n-pose.pt') # Nano (faster)
model = YOLO('yolo11s-pose.pt')  # Small (more accurate)

Request access to the NTU RGB+D Dataset via the official portal.
Submit an academic request form — approval typically takes 1-3 days.
If you already have the trained model files in /Models/:
jupyter notebook code/real_time_model.ipynb

- Run all cells
- A webcam window will open
- Type the Track ID of the person you want to track → Press Enter
- Start performing gestures — game keys will be triggered automatically!
Step 1: Dataset Preparation
jupyter notebook code/dataset_Preparation.ipynb

- Update `source_dir` to your NTU RGB+D dataset path
- Run the extraction cell — manually select person Track ID for each video
- Press Space to skip irrelevant videos, ESC to skip during tracking
- Wait for `mediapipe_pose.csv` and `mediapipe_dataset_132.npz` to be generated
Step 2: Train the Model
jupyter notebook code/lstm_training.ipynb

- Run all cells
- Training takes ~30-60 minutes on GPU (46 epochs ran, best at epoch 36)
- Model is saved automatically to `/Models/`
Step 3: Real-Time Inference
jupyter notebook code/real_time_model.ipynb

All joint coordinates are normalized relative to the hip midpoint before training and inference:
hip_x = (frame[LEFT_HIP*4] + frame[RIGHT_HIP*4]) / 2
hip_y = (frame[LEFT_HIP*4+1] + frame[RIGHT_HIP*4+1]) / 2
hip_z = (frame[LEFT_HIP*4+2] + frame[RIGHT_HIP*4+2]) / 2
# Subtract hip from all joints (x, y, z only — visibility unchanged)
frame[0::4] -= hip_x # All X coordinates
frame[1::4] -= hip_y # All Y coordinates
frame[2::4] -= hip_z  # All Z coordinates

This makes the model position-invariant — the same action is recognized regardless of where the person stands in the frame.
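The same centralization can be applied to a whole 30-frame window at once. A vectorized NumPy sketch, using MediaPipe's hip landmark indices (23 and 24):

```python
import numpy as np

LEFT_HIP, RIGHT_HIP = 23, 24  # MediaPipe Pose landmark indices

def centralize(window):
    """Subtract the hip midpoint from all joints in a (frames, 132) window."""
    w = window.copy()
    for axis in range(3):  # x, y, z only; visibility (offset 3) untouched
        hip_mid = (w[:, LEFT_HIP * 4 + axis] + w[:, RIGHT_HIP * 4 + axis]) / 2
        w[:, axis::4] -= hip_mid[:, None]  # broadcast over all 33 joints
    return w
```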
# Window size: 30 frames
# Step size: 1 (maximum overlap for dense training data)
# Each window: shape (30, 132)

# For real-time:
from collections import deque
pose_sequence = deque(maxlen=60)  # Rolling buffer
# Predict on the last 30 frames, every frame

from statistics import mean

def smooth_predictions(predictions, k=5):
    recent = predictions[-k:]                      # last k (label, confidence) pairs
    labels = [p[0] for p in recent]
    majority = max(set(labels), key=labels.count)  # Majority vote
    conf = mean(p[1] for p in recent if p[0] == majority)
    return majority, conf

CONF_THRESHOLD = 0.5
# Only actions with softmax probability >= 0.5 are considered valid
# Others are labeled "Low Confidence" and not mapped to any key

# Filename format: ActionName_SeqNumber.avi
# e.g.: Jump_701.avi → label = "Jump"
# MoveBackward_123.avi → label = "MoveBackward"
import os

def extract_action_label(filename):
    parts = os.path.splitext(filename)[0].split("_")
    if parts[-1].isdigit():
        return "_".join(parts[:-1])  # Remove trailing sequence number
    return "_".join(parts)

tensorflow>=2.12
torch>=2.0
ultralytics>=8.0
mediapipe>=0.10
opencv-python>=4.8
scikit-learn>=1.3
pandas>=2.0
numpy>=1.24
matplotlib>=3.7
seaborn>=0.12
pynput>=1.7
jupyter>=1.0
- Add more action classes (Duck, Roll, Sprint, Block)
- Replace LSTM with Transformer-based architecture (e.g., TimeSformer)
- Deploy as standalone executable (no Jupyter required)
- Add multi-player support (track multiple IDs simultaneously)
- Train on custom game-specific gesture sets
- Export to ONNX for faster CPU inference
This project is for academic and research purposes. The NTU RGB+D dataset is subject to its own license terms from NTU Singapore.