Maskify: PII Detection and Redaction in Audio

Problem Statement

Banks, utility companies, and telecommunications providers routinely record customer support calls. These recordings often contain sensitive customer data such as bank routing numbers, account numbers, Social Security numbers (SSNs), and other personally identifiable information (PII). Storing such data unprotected raises serious privacy concerns and increases the risk of data breaches.

Solution

To address this issue, we developed Maskify, an automated system that detects and redacts sensitive entities in audio recordings. Maskify removes private information from both the transcript and the audio, helping organizations comply with privacy regulations and protect customer data.

Pipeline Overview

The Maskify pipeline follows these steps:

Audio Input: Accepts audio files (.wav, .mp3, etc.) from customer support calls.
Audio Preprocessing: Converts audio to 16 kHz mono WAV format for consistency.
Automatic Speech Recognition: Uses Faster-Whisper to transcribe audio and generate word-level timestamps.
Text Preprocessing & Annotation: Cleans the transcript and annotates it using tools such as Doccano.
PII Detection: Applies NER models (DeBERTa or Mistral) to identify PII entities in the text.
PII Marking & Redaction: Marks PII words and phrases, then mutes the corresponding audio segments based on their timestamps.
Final Output: Produces a redacted transcript (Text/JSON) and a redacted audio file (with PII muted).
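
The marking and muting step can be illustrated with a small sketch. It assumes the transcription step yields words with start and end times in seconds, and that the detector has already flagged which words are PII. The names `Word`, `pii_intervals`, and `mute`, and the `pad` parameter, are hypothetical and are not taken from the Maskify code; samples are held in a plain Python list to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio


def pii_intervals(words: Sequence[Word], pii_flags: Sequence[bool],
                  pad: float = 0.05) -> List[Tuple[float, float]]:
    """Merge flagged words into (start, end) intervals, padded by `pad` seconds."""
    intervals: List[Tuple[float, float]] = []
    for word, is_pii in zip(words, pii_flags):
        if not is_pii:
            continue
        start, end = max(0.0, word.start - pad), word.end + pad
        if intervals and start <= intervals[-1][1]:
            # Touching or overlapping the previous interval: extend it.
            prev_start, prev_end = intervals[-1]
            intervals[-1] = (prev_start, max(prev_end, end))
        else:
            intervals.append((start, end))
    return intervals


def mute(samples: Sequence[int], sample_rate: int,
         intervals: Sequence[Tuple[float, float]]) -> List[int]:
    """Return a copy of `samples` with every interval replaced by silence."""
    out = list(samples)
    for start, end in intervals:
        lo = max(0, int(start * sample_rate))
        hi = min(len(out), int(end * sample_rate))
        for i in range(lo, hi):
            out[i] = 0
    return out
```

Merging adjacent words before muting avoids short audible gaps between the digits of a multi-word entity, and the padding gives some tolerance for imprecise word boundaries.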

Tech Stack

Faster-Whisper: Audio transcription
DeBERTa & Mistral: Named Entity Recognition (NER)
Python, Jupyter Notebook: Backend and experimentation
React, Vite: Interactive web frontend
CSV/JSONL/XLSX: Data storage and annotation

Full-Stack Web App

Frontend:
- Vite: Lightning-fast frontend tooling
- React: Component-based UI
- Material UI (MUI): Pre-styled UI components
- Axios: For REST API communication
Backend:
- Flask (Python): Lightweight web framework
- Tempfile + Subprocess: File handling and audio conversion
- PIIDetector Class: Core logic for transcription, entity recognition, redaction, and audio editing
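
As an illustration of the Tempfile + Subprocess item, the sketch below converts an uploaded file to 16 kHz mono WAV. It assumes the conversion is done by calling the `ffmpeg` command-line tool, which this README does not confirm; the function names `build_ffmpeg_cmd` and `convert_to_wav16k` are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def build_ffmpeg_cmd(src: PathLike, dst: PathLike) -> List[str]:
    """Build an ffmpeg command that writes `src` as 16 kHz mono audio to `dst`."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it exists
        "-i", str(src),  # input file
        "-ac", "1",      # one audio channel (mono)
        "-ar", "16000",  # 16 kHz sample rate
        str(dst),
    ]


def convert_to_wav16k(src: PathLike) -> Path:
    """Convert `src` into a temporary WAV file and return its path.

    Assumes ffmpeg is installed and on PATH. The caller is responsible
    for deleting the returned file once it is no longer needed.
    """
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    tmp.close()
    dst = Path(tmp.name)
    try:
        subprocess.run(build_ffmpeg_cmd(src, dst), check=True, capture_output=True)
    except Exception:
        dst.unlink(missing_ok=True)  # do not leave a partial file behind
        raise
    return dst
```

Passing the command as a list rather than a shell string means file names containing spaces or shell metacharacters are not interpreted by a shell.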

Development & Data Tools

Python & Jupyter Notebooks: For prototyping and testing
CSV / JSON: For storing transcripts, labels, and results
Files are converted to a standard format automatically

2. Model Selection

Users can select:
- DeBERTa (precise span-based detection)
- Unsloth (flexible prompt-based output)
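
One way to wire a user's model choice to a detector is a small lookup table. This is a sketch only: the two detector functions are placeholders that return no entities, and the names `DETECTORS` and `run_detection` are hypothetical rather than part of the Maskify backend.

```python
from typing import Callable, Dict, List, Tuple

Entity = Tuple[str, str]  # (label, value)


def detect_with_deberta(text: str) -> List[Entity]:
    """Placeholder for a span-based token-classification detector."""
    return []


def detect_with_unsloth(text: str) -> List[Entity]:
    """Placeholder for a prompt-based detector."""
    return []


# Maps the model name chosen by the user to the function that handles it.
DETECTORS: Dict[str, Callable[[str], List[Entity]]] = {
    "deberta": detect_with_deberta,
    "unsloth": detect_with_unsloth,
}


def run_detection(model_name: str, text: str) -> List[Entity]:
    """Run the detector selected by `model_name`; reject unknown names."""
    key = model_name.strip().lower()
    if key not in DETECTORS:
        raise ValueError(f"Unknown model {model_name!r}; choose one of {sorted(DETECTORS)}")
    return DETECTORS[key](text)
```

Having both detectors return the same `(label, value)` shape lets the redaction step stay independent of which model was chosen.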

4. PII Detection

DeBERTa: Detects PII spans with confidence scores
Unsloth: Extracts PII in "LABEL: VALUE" format via prompting
Original transcript
Detected entities (with labels)
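
The "LABEL: VALUE" output of the prompt-based model has to be parsed before it can be used for redaction. The sketch below shows one possible way to do that and to locate each value in the transcript. It assumes one entity per line and exact string matches; the function names and the regular expression are illustrative, not Maskify's actual parser.

```python
import re
from typing import List, Tuple

# A label made of letters, underscores, or spaces, then a colon, then the value.
LINE_RE = re.compile(r"^\s*([A-Za-z_ ]+?)\s*:\s*(.+?)\s*$")


def parse_label_value(output: str) -> List[Tuple[str, str]]:
    """Parse model output lines of the form 'LABEL: VALUE' into (label, value) pairs."""
    entities = []
    for line in output.splitlines():
        match = LINE_RE.match(line)
        if match:  # lines that do not fit the pattern are ignored
            label = match.group(1).upper().replace(" ", "_")
            entities.append((label, match.group(2)))
    return entities


def find_spans(transcript: str,
               entities: List[Tuple[str, str]]) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) character spans for every exact occurrence of each value."""
    spans = []
    for label, value in entities:
        pos = transcript.find(value)
        while pos != -1:
            spans.append((pos, pos + len(value), label))
            pos = transcript.find(value, pos + len(value))
    return sorted(spans)
```

Exact matching misses values that the model reformats (for example, digits spoken as words), so a production system would likely need normalization or fuzzy matching on top of this.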

🔍 PII Types Detected

Addresses
Credit Card Numbers
Social Security Numbers (SSNs)
Bank Account Numbers
Bank Routing Numbers
Phone Numbers
Names

🔐 Security and Ethics

All files are processed temporarily and deleted after the results are served
No personal data is stored
Redacted audio and text preserve privacy even if the files are shared
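
Temporary processing of this kind can be sketched with Python's `tempfile.TemporaryDirectory`, which removes the directory and everything in it when the `with` block exits. The function `process_upload` and its `handler` callback are hypothetical names for illustration; the actual backend may structure this differently.

```python
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def process_upload(data: bytes, filename: str, handler: Callable[[Path], T]) -> T:
    """Write an upload to a temporary directory, process it, then delete it.

    `handler` receives the path of the temporary file and must return
    everything the caller needs (for example, redacted bytes), because
    the file no longer exists once this function returns.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Keep only the final path component so the upload cannot escape workdir.
        src = Path(workdir) / Path(filename).name
        src.write_bytes(data)
        result = handler(src)
    # The directory and its contents are removed here, even if handler raised.
    return result
```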
