Active Speaker Detection 🎤👀

A full pipeline to detect and highlight active speakers in videos using YOLO for face detection and TalkNet for speaker detection.

📁 Clone the Repository

git clone https://github.com/MjdMahasneh/active-speaker-detection.git
cd active-speaker-detection

⚙️ Setup

1. Create Conda Environment

conda create --name active_speaker python=3.9 -y
conda activate active_speaker

2. Install Dependencies

💡 For GPU support, install PyTorch matching your CUDA version. Check pytorch.org for installation instructions.

Example for CUDA 11.8:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Then install the rest of the dependencies:

pip install -r requirements.txt
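
To confirm that the install picked up GPU support, a quick check along these lines can help. This is a minimal sketch and not part of the repository; `describe_torch_device` is a hypothetical helper name.

```python
def describe_torch_device() -> str:
    """Report which device PyTorch would use, or that it is missing."""
    try:
        import torch  # imported lazily so the check also works before install
    except ImportError:
        return "torch not installed"
    # True only when a CUDA-enabled build and a working driver are present
    return "cuda" if torch.cuda.is_available() else "cpu"


print(describe_torch_device())
```

If this prints `cpu` on a machine with a GPU, the installed PyTorch wheel probably does not match the local CUDA version.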

▶️ Run

To run the pipeline, either edit the configuration in ./config/args.py and run main.py, or pass the options directly as command-line arguments:

python main.py --videoName video --videoFolder workdir

Note: the input video can be in .mp4 or .avi format.
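
To process several videos in one go, the command above can be wrapped in a small Python script. The sketch below is not part of the repository: `build_command`, `find_video_names`, and `run_all` are hypothetical helpers, and it assumes the input videos sit directly inside the folder passed as `--videoFolder`, named `<videoName>.mp4` or `<videoName>.avi`.

```python
import subprocess
import sys
from pathlib import Path

VIDEO_SUFFIXES = {".mp4", ".avi"}  # formats named in this README


def build_command(video_name: str, video_folder: str) -> list:
    """Build the main.py invocation using the two flags shown above."""
    return [sys.executable, "main.py",
            "--videoName", video_name,
            "--videoFolder", video_folder]


def find_video_names(video_folder: str) -> list:
    """Collect unique video names (file stems) found in the folder (assumed layout)."""
    folder = Path(video_folder)
    return sorted({p.stem for p in folder.iterdir()
                   if p.is_file() and p.suffix.lower() in VIDEO_SUFFIXES})


def run_all(video_folder: str) -> None:
    """Run the pipeline once per video; stops on the first failure."""
    for name in find_video_names(video_folder):
        subprocess.run(build_command(name, video_folder), check=True)


# Usage (run from the repository root):
# run_all("workdir")
```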

🗂️ Output Structure

workdir/
└── video/
    ├── pyavi/                # extracted audio + output video
    ├── pyframes/             # all video frames
    ├── pycrop/               # cropped face clips
    ├── pywork/               # pickle files and internals
    └── speaker_summary.json  # summary of speaker activity
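
For scripting against these outputs, the layout above can be expressed as paths. This is a sketch only: `output_paths`, `missing_outputs`, and `load_summary` are hypothetical helpers, and since the schema of `speaker_summary.json` is not documented here, the loader returns the parsed JSON without assuming any structure.

```python
import json
from pathlib import Path


def output_paths(video_folder: str, video_name: str) -> dict:
    """Map each output from the tree above to its expected location."""
    root = Path(video_folder) / video_name
    return {
        "pyavi": root / "pyavi",        # extracted audio + output video
        "pyframes": root / "pyframes",  # all video frames
        "pycrop": root / "pycrop",      # cropped face clips
        "pywork": root / "pywork",      # pickle files and internals
        "summary": root / "speaker_summary.json",
    }


def missing_outputs(video_folder: str, video_name: str) -> list:
    """List the outputs that do not exist yet, e.g. after a failed run."""
    paths = output_paths(video_folder, video_name)
    return [key for key, path in paths.items() if not path.exists()]


def load_summary(video_folder: str, video_name: str):
    """Parse speaker_summary.json; no particular schema is assumed."""
    path = output_paths(video_folder, video_name)["summary"]
    return json.loads(path.read_text(encoding="utf-8"))
```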

🧠 Models

YOLOv11n-Face: face detection
TalkNet: audio-visual active speaker detection

🛠️ Components

Scene detection via PySceneDetect
Face detection via YOLO
Face tracking via IoU + interpolation
Speech classification via TalkNet
Visualization with speaking durations
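
The face tracking step relies on two standard operations: intersection over union (IoU) to match boxes between frames, and linear interpolation to fill frames where detection missed. The sketch below illustrates both in plain Python. It is not the repository's code: the function names, the `(x1, y1, x2, y2)` box format, and the shortcut for gap-free tracks are assumptions made for illustration.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def interpolate_track(frames, boxes) -> dict:
    """Return {frame: box} for every frame between the first and last detection.

    `frames` must be strictly increasing frame indices, one per box.
    """
    if not frames:
        return {}
    # No gaps: every frame already has a detection, so skip interpolation.
    if frames[-1] - frames[0] + 1 == len(frames):
        return dict(zip(frames, boxes))
    track = {}
    for f0, b0, f1, b1 in zip(frames, boxes, frames[1:], boxes[1:]):
        for f in range(f0, f1):
            t = (f - f0) / (f1 - f0)  # 0 at f0, approaching 1 at f1
            track[f] = tuple(a + t * (b - a) for a, b in zip(b0, b1))
    track[frames[-1]] = tuple(float(v) for v in boxes[-1])
    return track
```

A tracker would typically attach a new detection to an existing track when `iou` with the track's latest box exceeds some threshold; the threshold value is a tuning choice not stated in this README.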

🔧 Improvements Made

Added --minSpeechLen to filter out short/non-speech segments.
Added --ignoreMultiSpeakers to ignore segments with multiple speakers.
Skipped interpolation when face detections have no frame gaps, improving efficiency for continuous tracks.
Applied weighted averaging across multi-duration inputs instead of repeating inference.
Added get_speaker_track_indices() to isolate actual speaker tracks.
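
To make the idea behind --minSpeechLen concrete, here is a minimal sketch of filtering short speech segments from per-frame speaker scores. It is an illustration, not the repository's implementation: the function names, the score threshold of 0, and measuring the minimum length in frames are all assumptions.

```python
def speech_segments(scores, threshold=0.0) -> list:
    """Find runs of consecutive frames whose score exceeds the threshold.

    Returns (start, end) frame index pairs with `end` exclusive.
    """
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold and start is None:
            start = i                      # a speaking run begins
        elif score <= threshold and start is not None:
            segments.append((start, i))    # the run ended on the previous frame
            start = None
    if start is not None:                  # run continues to the last frame
        segments.append((start, len(scores)))
    return segments


def filter_short_segments(segments, min_len) -> list:
    """Keep only segments lasting at least `min_len` frames (assumed unit)."""
    return [(s, e) for s, e in segments if e - s >= min_len]


scores = [-1.0, 0.5, 0.7, -0.2, 0.9, -1.0, 1.0, 1.0, 1.0]
kept = filter_short_segments(speech_segments(scores), min_len=2)
# kept == [(1, 3), (6, 9)]; the one-frame blip at index 4 is dropped
```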

Acknowledgements

This project builds on the great work from:

TalkNet-ASD for active speaker detection.
YOLO-Face for face detection.
