The Problem

Sending raw customer interactions to an LLM creates a direct path for sensitive data (names, account numbers, Aadhaar details, medical context) to be ingested into model logs, vector databases, and training pipelines. Regulatory exposure is immediate. But naively stripping all personal context also destroys the signal the model needs to perform accurate quality assessment. The challenge was to remove the identifying data while preserving that signal.

Architecture

Three-Tier Scrubbing Pipeline

Rather than a single scrubbing pass, the system applies three sequential tiers with escalating intelligence:

  • Deterministic Tier: Regex and checksum-based masking for structured data: phone numbers, PAN cards, Aadhaar numbers, credit card sequences. High speed, with zero false negatives on known patterns. Implemented in C++ tokenizers to minimize added latency.
  • Probabilistic Tier: A lightweight NER model (fine-tuned RoBERTa / Presidio) identifies soft PII such as names, addresses, and emotional context across 12+ Indian and global languages. It handles regional nuances where names and common nouns overlap (e.g., "Surat" as a city versus the same word used as a common noun in certain dialects).
  • Semantic Tier: LLM-based de-identification replaces PII with synthetic but grammatically coherent placeholders, so "Rahul from Bangalore" becomes "[NAME_1] from [LOCATION_1]". The downstream model keeps the grammatical and contextual structure it needs; the identity is gone.
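
To make the deterministic tier concrete, here is a minimal Python sketch of regex plus checksum masking. It is an illustration only: the production tier described above is written in C++, and the patterns, placeholder labels, and function names (scrub_structured, luhn_valid) are assumptions rather than the actual rules. Card candidates are confirmed with the Luhn checksum before masking. Aadhaar numbers carry a Verhoeff check digit, which this sketch omits for brevity and replaces with a pattern-only match.

```python
import re

# Simplified, hypothetical patterns; production rules would be stricter.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13 to 19 digits
PHONE_RE = re.compile(r"(?<!\d)(?:\+91[ -]?)?[6-9]\d{9}(?!\d)")
AADHAAR_RE = re.compile(r"(?<!\d)[2-9]\d{3}[ -]?\d{4}[ -]?\d{4}(?![ -]?\d)")
PAN_RE = re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def _mask_card(match: re.Match) -> str:
    # Mask only candidates that pass the checksum; leave other digit runs alone.
    digits = re.sub(r"\D", "", match.group())
    return "[CARD]" if luhn_valid(digits) else match.group()


def scrub_structured(text: str) -> str:
    """Mask cards, phone numbers, Aadhaar numbers, and PANs, in that order."""
    text = CARD_RE.sub(_mask_card, text)
    text = PHONE_RE.sub("[PHONE]", text)  # before Aadhaar: "+91" plus 10 digits is 12 digits
    text = AADHAAR_RE.sub("[AADHAAR]", text)
    return PAN_RE.sub("[PAN]", text)


sample = "Call +91 9876543210, card 4111 1111 1111 1111, PAN ABCDE1234F, Aadhaar 2345 6789 0123."
print(scrub_structured(sample))
# Call [PHONE], card [CARD], PAN [PAN], Aadhaar [AADHAAR].
```

The ordering matters in this sketch: a phone number written as "+91" followed by ten digits has the same digit count as an Aadhaar number, so the phone pattern runs first.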

Dual-Direction Guardrails

  • Ingress Guard: Prevents PII from entering the Vector DB, training logs, or model context windows.
  • Egress Guard: A hallucination and safety filter on model outputs blocks the LLM from accidentally regenerating PII or producing toxic content in QA summaries.
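
The two guards can be sketched as a pair of small Python functions. This is a hedged illustration, not the system's implementation: RESIDUAL_PII, PIILeakError, ingress_guard, and egress_guard are hypothetical names, the patterns are deliberately simplified, and the toxicity filter mentioned above is left out. The ingress guard refuses to hand text to storage if anything PII-shaped survives scrubbing; the egress guard withholds model output that repeats a known original value or matches a PII pattern.

```python
import re
from typing import Callable, Iterable

# Hypothetical residual-PII detector; simplified patterns for illustration.
RESIDUAL_PII = re.compile(
    r"(?<!\d)(?:\+91[ -]?)?[6-9]\d{9}(?!\d)"  # Indian mobile number
    r"|\b[A-Z]{5}\d{4}[A-Z]\b"                # PAN
)
BLOCKED = "[BLOCKED: output withheld by egress guard]"


class PIILeakError(Exception):
    """Raised when PII-shaped text would cross the ingress boundary."""


def ingress_guard(text: str, scrub: Callable[[str], str]) -> str:
    """Scrub text, then refuse to pass it on if PII-shaped content remains."""
    clean = scrub(text)
    if RESIDUAL_PII.search(clean):
        raise PIILeakError("residual PII after scrubbing; not storing")
    return clean


def egress_guard(output: str, known_originals: Iterable[str]) -> str:
    """Withhold output that repeats a known original or matches a PII pattern."""
    lowered = output.lower()
    for value in known_originals:
        if value and value.lower() in lowered:
            return BLOCKED
    if RESIDUAL_PII.search(output):
        return BLOCKED
    return output


print(egress_guard("Agent greeted [NAME_1] politely.", ["Rahul"]))
print(egress_guard("Agent greeted Rahul politely.", ["Rahul"]))
```

In this sketch the egress guard fails closed: a suspicious summary is replaced wholesale rather than partially redacted, which is one possible design choice among several.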

Token Vault for Authorized Re-Identification

For senior QA leads with a legitimate need to view original data during dispute resolution, a secure Token Vault maps synthetic placeholders back to source data through audited, role-based access control (RBAC). All re-identification events are logged immutably for compliance review.
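
A minimal in-memory sketch of such a vault follows, written in Python. The class and method names (TokenVault, tokenize, reidentify, verify_audit_chain) and the role name are assumptions made for illustration. The hash-chained log stands in for the immutable audit log: it makes tampering detectable but does not by itself make records immutable, and a real vault would also encrypt stored values and use a durable, append-only store.

```python
import hashlib
import json
import time


class TokenVault:
    """Hypothetical vault: placeholder mapping, role check, hash-chained audit log."""

    def __init__(self, allowed_roles):
        self._allowed = set(allowed_roles)
        self._store = {}     # placeholder -> original value
        self._reverse = {}   # (kind, original) -> placeholder
        self._counters = {}  # kind -> last index issued
        self.audit_log = []

    def tokenize(self, kind: str, original: str) -> str:
        """Return a stable placeholder such as [NAME_1] for the given value."""
        key = (kind, original)
        if key not in self._reverse:
            n = self._counters.get(kind, 0) + 1
            self._counters[kind] = n
            token = f"[{kind}_{n}]"
            self._reverse[key] = token
            self._store[token] = original
        return self._reverse[key]

    def reidentify(self, token: str, user: str, role: str, reason: str) -> str:
        """Log the attempt (granted or denied), then return the original if allowed."""
        granted = role in self._allowed
        self._append_audit({"user": user, "role": role, "token": token,
                            "reason": reason, "granted": granted})
        if not granted:
            raise PermissionError(f"role {role!r} may not re-identify data")
        return self._store[token]

    def _append_audit(self, event: dict) -> None:
        # Each record embeds the previous record's hash, forming a chain.
        prev_hash = self.audit_log[-1]["hash"] if self.audit_log else "0" * 64
        record = {**event, "ts": time.time(), "prev_hash": prev_hash}
        payload = json.dumps(record, sort_keys=True).encode()
        record["hash"] = hashlib.sha256(payload).hexdigest()
        self.audit_log.append(record)

    def verify_audit_chain(self) -> bool:
        """Recompute every hash; any edited or reordered record breaks the chain."""
        prev_hash = "0" * 64
        for record in self.audit_log:
            body = {k: v for k, v in record.items() if k != "hash"}
            payload = json.dumps(body, sort_keys=True).encode()
            if (body["prev_hash"] != prev_hash
                    or hashlib.sha256(payload).hexdigest() != record["hash"]):
                return False
            prev_hash = record["hash"]
        return True


vault = TokenVault(allowed_roles={"senior_qa_lead"})
token = vault.tokenize("NAME", "Rahul")
print(token, vault.reidentify(token, "asha", "senior_qa_lead", "dispute review"))
```

Denied attempts are logged as well as granted ones in this sketch, since a compliance review usually cares about both.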

Automated Red Team Audits

A separate AI agent, functioning as an automated red team, continuously attempts to extract or reconstruct PII from the sanitized outputs. Weekly security posture reports are generated for the CISO without the overhead of manual penetration testing.
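
The simplest check such an audit could run is a leak probe: compare each sanitized output against the original values that were supposed to be removed. The Python sketch below is a deliberately reduced stand-in for the AI agent described above. It uses plain normalized string matching rather than a model, and the names (Finding, probe_for_leaks, posture_report) and the report fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    record_id: str
    leaked_value: str


def normalize(text: str) -> str:
    """Case-fold and keep only letters and digits, so '98765-43210' matches '9876543210'."""
    return "".join(ch for ch in text.casefold() if ch.isalnum())


def probe_for_leaks(sanitized: dict, originals: dict) -> list:
    """Return a Finding for every original value that survives in its sanitized record."""
    findings = []
    for record_id, text in sanitized.items():
        haystack = normalize(text)
        for value in originals.get(record_id, []):
            needle = normalize(value)
            if needle and needle in haystack:  # skip values that normalize to nothing
                findings.append(Finding(record_id, value))
    return findings


def posture_report(findings: list, total_records: int) -> dict:
    """Summarize findings into the kind of figures a weekly report might carry."""
    leaked_records = {f.record_id for f in findings}
    rate = len(leaked_records) / total_records if total_records else 0.0
    return {"records_checked": total_records,
            "records_with_leaks": len(leaked_records),
            "leak_rate": rate}


sanitized = {
    "c1": "[NAME_1] called about card [CARD].",
    "c2": "Customer at 98765-43210 asked for a refund.",
}
originals = {"c1": ["Rahul"], "c2": ["9876543210"]}
findings = probe_for_leaks(sanitized, originals)
print(posture_report(findings, total_records=len(sanitized)))
```

A probe like this only catches verbatim or lightly reformatted leaks; reconstruction attacks that infer identity from context would need the model-driven agent the section describes.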

Results

Metric                           Outcome
Scrubbing Latency Added          <100 ms end-to-end
Regulatory Compliance            100% DPDP Act 2023 + GDPR
PII Exposure Surface Reduction   −98% (no PII in vector DB or logs)
Language Support                 12+ Indian and global languages
Audit Readiness                  Automated weekly security posture report