A full pipeline to detect and highlight active speakers in videos using YOLO for face detection and TalkNet for speaker detection.

    git clone https://github.com/MjdMahasneh/active-speaker-detection.git
    cd active-speaker-detection
    conda create --name active_speaker python=3.9 -y
    conda activate active_speaker

💡 For GPU support, install PyTorch matching your CUDA version. Check pytorch.org for installation instructions.

Example for CUDA 11.8:

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Then install the rest of the dependencies:

    pip install -r requirements.txt

To run the pipeline, modify configurations in ./config/args.py and then run main.py. Alternatively, you can run the script directly with command line arguments.

    python main.py --videoName video --videoFolder workdir

Note: video can be in .mp4 or .avi formats.
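To process several videos in one go, the command above can be built programmatically. The following is a minimal sketch, not part of the repository: `build_command` is a hypothetical helper, and it assumes that `--videoName` expects the file name without its extension.

```python
from pathlib import Path


def build_command(video_path, video_folder="workdir"):
    """Return the main.py invocation for one video as an argument list.

    Assumption: --videoName takes the file name without its extension.
    """
    name = Path(video_path).stem
    return ["python", "main.py", "--videoName", name, "--videoFolder", video_folder]


# Print the commands for a hypothetical pair of videos.
for video in ["workdir/interview.mp4", "workdir/panel.avi"]:
    print(" ".join(build_command(video)))
```

Each returned list can be passed to `subprocess.run(..., check=True)` to execute the runs one after another.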

    workdir/
    └── video/
        ├── pyavi/                  # extracted audio + output video
        ├── pyframes/               # all video frames
        ├── pycrop/                 # cropped face clips
        ├── pywork/                 # pickle files and internals
        └── speaker_summary.json    # summary of speaker activity

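The layout above can be resolved in code when post-processing results. This is a small sketch under stated assumptions: `output_paths` and `load_summary` are hypothetical helpers, and no particular schema is assumed for `speaker_summary.json`, which is simply parsed as JSON.

```python
import json
from pathlib import Path


def output_paths(video_folder, video_name):
    """Map each output listed in the tree to its expected location."""
    root = Path(video_folder) / video_name
    return {
        "pyavi": root / "pyavi",        # extracted audio + output video
        "pyframes": root / "pyframes",  # all video frames
        "pycrop": root / "pycrop",      # cropped face clips
        "pywork": root / "pywork",      # pickle files and internals
        "summary": root / "speaker_summary.json",
    }


def load_summary(video_folder, video_name):
    """Return the parsed summary, or None if the pipeline has not produced it."""
    path = output_paths(video_folder, video_name)["summary"]
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


print(output_paths("workdir", "video")["summary"])
```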
- YOLOv11n-Face: face detection
- TalkNet: audio-visual active speaker detection
- Scene detection via PySceneDetect
- Face detection via YOLO
- Face tracking via IOU + interpolation
- Speech classification via TalkNet
- Visualization with speaking durations
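To illustrate the tracking step, here is a minimal sketch of the two ingredients it names: an IoU score for matching a detection to an existing track, and linear interpolation to fill frames where a face was missed. The function names and the `(x1, y1, x2, y2)` box format are assumptions for illustration and may differ from the repository's code.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def interpolate_track(frames, boxes):
    """Fill missing frames of one track by linear interpolation.

    frames: sorted frame indices where the face was detected.
    boxes:  one (x1, y1, x2, y2) box per entry in frames.
    """
    out_frames = [frames[0]]
    out_boxes = [tuple(float(v) for v in boxes[0])]
    for i in range(len(frames) - 1):
        f0, f1 = frames[i], frames[i + 1]
        b0, b1 = boxes[i], boxes[i + 1]
        for f in range(f0 + 1, f1 + 1):
            t = (f - f0) / (f1 - f0)  # position between the two detections
            out_frames.append(f)
            out_boxes.append(tuple(a + t * (b - a) for a, b in zip(b0, b1)))
    return out_frames, out_boxes


# A face detected at frames 0 and 2; frame 1 is filled in.
print(interpolate_track([0, 2], [(0, 0, 10, 10), (4, 4, 14, 14)]))
```

A detection is typically appended to a track when its IoU with the track's last box exceeds a chosen threshold; the threshold value is left out here because the source does not state one.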
- Added `--minSpeechLen` to filter out short/non-speech segments.
- Added `--ignoreMultiSpeakers` to ignore segments with multiple speakers.
- Skipped interpolation when face detections have no frame gaps, improving efficiency for continuous tracks.
- Applied weighted averaging across multi-duration inputs instead of repeating inference.
- Added `get_speaker_track_indices()` to isolate actual speaker tracks.
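Three of the changes above can be sketched in a few lines. This is an illustration only, not the repository's implementation: all function names are hypothetical, the minimum length is expressed in frames (the actual unit of `--minSpeechLen` is not stated here), and the weights are placeholder values.

```python
def has_frame_gaps(frames):
    """True if a sorted list of unique frame indices skips any frame."""
    return frames[-1] - frames[0] + 1 != len(frames)


def filter_short_segments(speaking, min_len):
    """Clear runs of True that are shorter than min_len frames."""
    result = list(speaking)
    start = None
    # A trailing False acts as a sentinel that closes the final run.
    for i, flag in enumerate(list(speaking) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start < min_len:
                for j in range(start, i):
                    result[j] = False
            start = None
    return result


def weighted_average(score_sets, weights):
    """Combine per-frame scores from several runs into one weighted mean."""
    total = sum(weights)
    n_frames = len(score_sets[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, score_sets)) / total
        for i in range(n_frames)
    ]


# Interpolation is only needed when a track actually has gaps.
print(has_frame_gaps([3, 4, 5]), has_frame_gaps([3, 5]))

# Runs shorter than 3 frames are dropped.
print(filter_short_segments([True, True, False, True, True, True, False, True], 3))

# Two score sets (e.g. from two input durations) with placeholder weights 3 and 1.
print(weighted_average([[1.0, 0.0], [0.0, 2.0]], [3, 1]))
```

Weighting one score set per input duration gives the same kind of emphasis as running inference several times on a favored duration, without paying for the repeated forward passes.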
This project builds on the great work from:
- TalkNet-ASD for active speaker detection.
- YOLO-Face for face detection.