Arfa-Ahsan/PII_Detection_and_Redaction_audio_Project

Maskify: PII Detection and Redaction in Audio

Problem Statement

Banks, utility companies, and telecommunication providers routinely record customer support calls. These recordings often contain sensitive customer data such as bank routing numbers, account numbers, Social Security Numbers (SSN), and other personally identifiable information (PII). Storing such data without protection raises serious privacy concerns and risks of data breaches.

Solution

To address this issue, we developed Maskify, an automated system that detects and redacts sensitive entities in audio recordings. Maskify removes private information from both the transcript and the audio, helping organizations comply with privacy regulations and protect customer data.

Pipeline Overview

The Maskify pipeline follows these steps:

  1. Audio Input: Accepts audio files (.wav, .mp3, etc.) from customer support calls.
  2. Audio Preprocessing: Converts audio to 16 kHz mono WAV format for consistency.
  3. Automatic Speech Recognition: Uses Faster-Whisper to transcribe audio and generate word-level timestamps.
  4. Text Preprocessing & Annotation: Cleans and annotates the transcript using tools like Doccano.
  5. PII Detection: Applies an NER model (DeBERTa or Mistral) to identify PII entities in the text.
  6. PII Marking & Redaction: Marks the words and phrases identified as PII, then mutes the corresponding audio segments using their word-level timestamps.
  7. Final Output: Produces a redacted transcript (Text/JSON) and a redacted audio file (with PII muted).
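
Step 6 can be illustrated with a minimal sketch, assuming each transcribed word already carries start and end times in seconds and a flag saying whether it was marked as PII. The names `Word`, `pii_intervals`, and `mute_samples` are hypothetical and are not taken from the Maskify code; the small padding around each word is also an assumption, and the audio is modeled as a plain list of PCM samples rather than through any particular audio library.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word with its timestamps (hypothetical structure)."""
    text: str
    start: float  # seconds
    end: float    # seconds
    is_pii: bool = False


def pii_intervals(words, pad=0.05):
    """Return merged (start, end) intervals covering every PII word.

    `pad` widens each interval slightly so word edges are not left audible.
    """
    spans = sorted((max(0.0, w.start - pad), w.end + pad) for w in words if w.is_pii)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mute_samples(samples, sample_rate, intervals):
    """Return a copy of `samples` with every interval replaced by silence."""
    out = list(samples)
    for start, end in intervals:
        lo = max(0, round(start * sample_rate))
        hi = min(len(out), round(end * sample_rate))
        for i in range(lo, hi):
            out[i] = 0
    return out
```

Merging adjacent intervals first means a multi-word entity such as a spoken account number is silenced as one continuous gap instead of several short ones.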

Tech Stack

  • Faster-Whisper: Audio transcription
  • DeBERTa & Mistral: Named Entity Recognition (NER)
  • Python, Jupyter Notebook: Backend and experimentation
  • React, Vite: Interactive web frontend
  • CSV/JSONL/XLSX: Data storage and annotation

Full-Stack Web App

  • Frontend:

    • Vite: Lightning-fast frontend tooling
    • React: Component-based UI
    • Material UI (MUI): Pre-styled UI components
    • Axios: For REST API communication
  • Backend:

    • Flask (Python): Lightweight web framework
    • Tempfile + Subprocess: File handling and audio conversion
    • PIIDetector Class: Core logic for transcription, entity recognition, redaction, and audio editing
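
As a rough illustration of the Tempfile + Subprocess combination, the sketch below converts an uploaded file to 16 kHz mono WAV. It assumes the `ffmpeg` command-line tool is installed and on the PATH; the README does not say which converter the backend actually calls, and the function names `build_ffmpeg_cmd` and `convert_to_wav` are hypothetical.

```python
import os
import subprocess
import tempfile


def build_ffmpeg_cmd(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that writes `src` as mono audio at `sample_rate` Hz."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", src,                # input file (any format ffmpeg can read)
        "-ac", "1",               # one audio channel (mono)
        "-ar", str(sample_rate),  # output sample rate in Hz
        dst,
    ]


def convert_to_wav(src: str) -> str:
    """Convert `src` to a temporary WAV file and return its path.

    The caller is responsible for deleting the returned file.
    Assumes ffmpeg is available; raises CalledProcessError if it fails.
    """
    fd, dst = tempfile.mkstemp(suffix=".wav")
    os.close(fd)  # ffmpeg reopens the path itself
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True, capture_output=True)
    return dst
```

Passing the command as a list rather than a shell string avoids quoting problems with user-supplied file names.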

Development & Data Tools

  • Python & Jupyter Notebooks: For prototyping and testing

  • CSV / JSON: For storing transcripts, labels, and results

  • Files are converted to standard format automatically

2. Model Selection

  • Users can select:
    • DeBERTa (precise span-based detection)
    • Unsloth (flexible prompt-based output)

4. PII Detection

  • DeBERTa: Detects PII spans with confidence scores

  • Unsloth: Extracts PII in "LABEL: VALUE" format via prompting

  • Original transcript

  • Detected entities (with labels)
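
The "LABEL: VALUE" output of the prompt-based model has to be parsed before it can drive redaction. The sketch below shows one way this might be done, assuming the model emits one entity per line; the exact output layout, the helper names `parse_label_value` and `redact_text`, and the `[LABEL]` placeholder style are all assumptions rather than details from the project.

```python
import re

# Matches lines such as "SSN: 123-45-6789" (label, colon, value).
LINE_RE = re.compile(r"^\s*([A-Za-z_ ]+?)\s*:\s*(.+?)\s*$")


def parse_label_value(model_output: str) -> list[tuple[str, str]]:
    """Extract (LABEL, value) pairs, skipping lines that do not fit the pattern."""
    entities = []
    for line in model_output.splitlines():
        match = LINE_RE.match(line)
        if match:
            entities.append((match.group(1).strip().upper(), match.group(2)))
    return entities


def redact_text(transcript: str, entities: list[tuple[str, str]]) -> str:
    """Replace every occurrence of each entity value with a [LABEL] placeholder."""
    redacted = transcript
    for label, value in entities:
        redacted = redacted.replace(value, f"[{label}]")
    return redacted
```

Exact string replacement only works when the model repeats the value verbatim from the transcript; a span-based model such as DeBERTa returns character offsets instead, which avoids that limitation.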

🔍 PII Types Detected

  • Addresses
  • Credit Card Numbers
  • Social Security Numbers (SSNs)
  • Bank Account Numbers
  • Bank Routing Numbers
  • Phone Numbers
  • Names

🔐 Security and Ethics

  • All files are processed in temporary storage and deleted after the results are served
  • No personal data is stored
  • Redacted audio/text ensures data privacy even if files are shared
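
The process-then-delete behavior described above could look roughly like the following, assuming uploads arrive as bytes and the result is returned in memory. `process_temporarily` and its `handler` parameter are hypothetical names for illustration, not the project's actual API.

```python
import tempfile
from pathlib import Path
from typing import Callable


def process_temporarily(upload: bytes, filename: str,
                        handler: Callable[[Path], bytes]) -> bytes:
    """Write an upload to a temporary directory, process it, then delete it.

    `handler` receives the path of the saved file and returns the result
    (for example, redacted audio) as bytes. The directory and everything in
    it are removed when the `with` block exits, even if `handler` raises.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Keep only the base name so a client-supplied path cannot escape workdir.
        path = Path(workdir) / Path(filename).name
        path.write_bytes(upload)
        return handler(path)
```

Returning the result as bytes, rather than as a path inside the temporary directory, is what allows the directory to be removed before the response is sent.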

About

This project detects and redacts Personally Identifiable Information (PII) in English audio datasets. The toolkit transcribes audio, identifies sensitive entities using NLP, and redacts PII from both the text and audio outputs, which makes it suited to privacy-focused dataset creation and secure data handling.
