GenBodha
by GenAI Engineers
GenAI Operations
Chapter 45 of 65

PII Detection Pipeline

PII detection · Presidio · data scrubbing · runtime detection · PII alerting · audit logging · false positive tuning

Learning Path

Step 1

Reading Material

11 sections

Step 2

Knowledge Check

50 questions

Step 3

Hands-on Labs

6 labs


Hands-on Labs

Each objective has a coding lab that opens in VS Code in your browser.

Objective 1

Deploy Presidio and build PII middleware

Goal

You will deploy Microsoft Presidio and build PII detection middleware for all LLM traffic. Deploy the Presidio analyzer and anonymizer services on your vCluster via Helm. Build `PIIDetectionMiddleware`, which intercepts every LiteLLM request, sends the prompt text to the Presidio analyzer, receives the detected PII entities (names, emails, phone numbers, SSNs, credit cards, addresses), and scrubs the detected PII with the Presidio anonymizer, replacing each entity with a placeholder such as `<PERSON>` or `<EMAIL>`. Configure per-entity-type handling: names -> replace with placeholder, emails -> hash, phone numbers -> replace, SSNs -> block the request entirely. Implement a bypass for trusted internal prompts (system prompts that intentionally contain example PII). Track `pii_detected_total{entity_type}`, `pii_scrubbed_total{entity_type}`, and `pii_blocked_total{reason}`.
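The per-entity-type handling above can be sketched as a small dispatch over analyzer results. A minimal Python sketch, assuming entities arrive as dicts shaped like Presidio analyzer output (`entity_type`, `start`, `end`); the actual Presidio calls, the LiteLLM hook, and the Prometheus counters are omitted:

```python
import hashlib

# Hypothetical policy table; entity names follow Presidio's built-in
# recognizers, but the analyzer/anonymizer services are stubbed out here.
POLICIES = {
    "PERSON": "replace",
    "EMAIL_ADDRESS": "hash",
    "PHONE_NUMBER": "replace",
    "US_SSN": "block",
}

def apply_pii_policy(text, entities):
    """Scrub detected PII spans according to per-entity-type policy.

    entities: list of {"entity_type", "start", "end"} dicts.
    Returns (scrubbed_text, action) where action is "scrubbed" or "blocked".
    """
    # Any "block" entity (here: SSNs) rejects the whole request.
    if any(POLICIES.get(e["entity_type"]) == "block" for e in entities):
        return None, "blocked"
    # Replace spans right-to-left so earlier offsets stay valid.
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        policy = POLICIES.get(e["entity_type"])
        if policy == "hash":
            raw = text[e["start"]:e["end"]]
            repl = "<EMAIL:" + hashlib.sha256(raw.encode()).hexdigest()[:8] + ">"
        elif policy == "replace":
            repl = "<" + e["entity_type"] + ">"
        else:
            continue  # unknown entity types pass through untouched
        text = text[:e["start"]] + repl + text[e["end"]:]
    return text, "scrubbed"
```

Replacing right-to-left matters: substituting a placeholder changes the string length, so left-to-right replacement would invalidate the remaining character offsets.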

Objective 2

Build PII alerting and audit logging

Goal

You will build PII detection alerting and audit logging for compliance. Implement PII audit logging: every PII detection event is logged to a PostgreSQL `pii_audit_log` table with `request_id`, `timestamp`, `entity_type`, `detection_confidence`, `action_taken` (scrubbed, blocked, allowed), `source_service`, and `environment`. Never log the actual PII value in the audit log. Build PII alerting: alert when the PII detection rate spikes (suggesting a new code path is leaking PII), when a high-confidence SSN or credit card is detected (an immediate security concern), and when PII is detected in a service that should never handle PII. Build a PII detection dashboard: detection volume by entity type, detection trend over time, top sources of PII, and block rate. Generate a weekly PII compliance report.
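One way to enforce "never log the actual PII value" is to make the log-row constructor structurally incapable of carrying it. A minimal sketch, assuming the `pii_audit_log` column names above; the actual PostgreSQL insert is omitted:

```python
from datetime import datetime, timezone

ALLOWED_ACTIONS = {"scrubbed", "blocked", "allowed"}

def audit_record(request_id, entity_type, confidence, action,
                 source_service, environment):
    """Build one pii_audit_log row. There is deliberately no parameter
    or field for the detected value itself: only detection metadata
    is ever persisted."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError("unknown action_taken: %r" % action)
    return {
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "entity_type": entity_type,
        "detection_confidence": round(float(confidence), 4),
        "action_taken": action,
        "source_service": source_service,
        "environment": environment,
    }
```

Because the function signature has no slot for the raw text, a code review only has to check call sites of this one constructor to verify the compliance property.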

Objective 3

Build PII detection tuning

Goal

You will build workflows for tuning PII detection to reduce false positives while maintaining detection coverage. Implement false positive tracking: operators can flag PII detections as false positives via `POST /api/v1/pii/false-positive/{detection_id}`. Store false positive annotations in PostgreSQL. Build a tuning workflow: analyze false positive patterns (which entity types have the highest FP rate, which content patterns trigger FPs), adjust Presidio confidence thresholds per entity type, and add custom deny lists for known non-PII patterns that trigger false positives. Implement A/B testing for threshold changes: run the new thresholds on 10% of traffic in shadow mode, and compare detection rate and estimated FP rate before full deployment. Track `pii_false_positive_rate{entity_type}` and `pii_tuning_improvement{entity_type}`. Build a tuning effectiveness dashboard.
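The two analytic pieces of this workflow can be sketched with stdlib Python. A minimal sketch, assuming FP annotations arrive as rows from the PostgreSQL table above and that shadow-mode candidates are evaluated against the same confidence scores the live path produced:

```python
def fp_rate_by_entity(annotations):
    """Per-entity-type false positive rate.

    annotations: list of {"entity_type": str, "false_positive": bool}
    rows, as flagged by operators via the false-positive API."""
    totals, fps = {}, {}
    for a in annotations:
        t = a["entity_type"]
        totals[t] = totals.get(t, 0) + 1
        fps[t] = fps.get(t, 0) + (1 if a["false_positive"] else 0)
    return {t: fps[t] / totals[t] for t in totals}

def shadow_compare(scores, live_threshold, candidate_threshold):
    """Shadow-mode check on a traffic sample: count how many detections
    each confidence threshold would keep, without changing live behavior."""
    return {
        "live": sum(s >= live_threshold for s in scores),
        "candidate": sum(s >= candidate_threshold for s in scores),
    }
```

Comparing counts on the same score sample isolates the effect of the threshold change itself, which is the point of running the candidate in shadow mode rather than deploying it directly.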

Objective 4

Build testing and validation for runtime PII detection

Goal

You will build comprehensive testing and validation for the runtime PII detection system. Implement `RuntimePIIDetectionTester`: define test scenarios that verify all critical paths work correctly under normal conditions, edge cases, and failure conditions. Build integration tests that verify the system integrates correctly with upstream and downstream components. Implement regression testing: maintain a test suite that runs on every configuration change to catch regressions. Build a `POST /api/v1/runtime-pii-detection/test` API that triggers the full test suite and returns results. Run tests as scheduled Argo Workflow CronJobs. Track `test_pass_rate_{system}_total` and `test_duration_seconds`. Build a test results dashboard showing pass rates, flaky tests, and coverage.
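A scenario suite of this kind reduces to a small runner: feed each scenario's text through the detection path and compare the action taken against the expected one. A minimal sketch, with a deliberately naive stand-in detector (the real one would be the Presidio-backed middleware); `run_scenarios` returns the shape the test endpoint might serialize:

```python
def run_scenarios(detect, scenarios):
    """Run a detection test suite and report (pass_rate, failing names).

    detect: callable(text) -> action string ("scrubbed", "allowed", ...)
    scenarios: list of {"name", "text", "expect"} dicts.
    """
    failures = [s["name"] for s in scenarios
                if detect(s["text"]) != s["expect"]]
    return 1 - len(failures) / len(scenarios), failures

# Hypothetical toy detector: flags anything containing "@" as scrubbed.
def toy_detect(text):
    return "scrubbed" if "@" in text else "allowed"

SCENARIOS = [
    {"name": "email", "text": "mail a@b.com", "expect": "scrubbed"},
    {"name": "clean", "text": "hello world", "expect": "allowed"},
    {"name": "ssn", "text": "123-45-6789", "expect": "blocked"},
]
```

Because the runner takes the detector as a parameter, the same suite serves both the scheduled CronJob run against the live pipeline and local regression runs against a candidate configuration.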

Objective 5

Implement performance optimization for runtime PII detection

Goal

You will build performance monitoring and optimization for the runtime PII detection system. Implement `RuntimePIIDetectionOptimizer`: instrument all critical paths with latency histograms, identify bottlenecks using p95/p99 analysis, and implement optimizations. Build capacity analysis: measure maximum throughput under load, identify scaling limits, and document capacity thresholds. Implement performance SLOs: define acceptable latency and throughput targets, track compliance, and alert on degradation. Build performance benchmarking: run standardized benchmarks on every significant change to detect performance regressions. Track `performance_benchmark_result_{system}` and `performance_slo_compliance_{system}`. Create a performance dashboard with trend analysis.
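The p95/p99 analysis and SLO check can be sketched directly. A minimal sketch using the nearest-rank percentile definition over raw latency samples; in production these quantiles would typically come from the Prometheus histograms mentioned above rather than being computed in-process:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile over observed latency samples."""
    ordered = sorted(samples)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]

def slo_compliant(latencies_ms, p95_target_ms, p99_target_ms):
    """True when the sample meets both latency SLO targets."""
    return (percentile(latencies_ms, 95) <= p95_target_ms
            and percentile(latencies_ms, 99) <= p99_target_ms)
```

Checking p95 and p99 together matters for a PII middleware on the request path: a healthy median can hide a tail where analyzer calls time out, and it is exactly that tail the SLO alert should catch.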

Objective 6

Build operational documentation for runtime PII detection

Goal

You will build comprehensive operational documentation and runbooks for the runtime PII detection system. Implement `RuntimePIIDetectionDocGenerator`: auto-generate architecture diagrams from deployed resources, a configuration reference from active configs, and API documentation from FastAPI OpenAPI specs. Build operational runbooks: document common operational tasks (scaling, configuration changes, troubleshooting), emergency procedures (failure recovery, rollback), and maintenance procedures (upgrades, data migrations). Implement documentation freshness: track when the documentation was last updated versus when the system was last changed, and flag stale docs. Store documentation in Git with version tracking. Build a `GET /api/v1/runtime-pii-detection/docs` endpoint serving the current documentation. Track `documentation_freshness_{system}` and `documentation_coverage_{system}`.
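The freshness check is a comparison of two timestamps with a grace window. A minimal sketch, assuming both timestamps come from Git commit history (the doc's last commit versus the system's last change) and a hypothetical seven-day grace period:

```python
from datetime import datetime, timedelta

def doc_is_stale(doc_updated, system_changed, grace=timedelta(days=7)):
    """Flag documentation whose last update lags the last system change
    by more than the grace window.

    doc_updated / system_changed: datetime of the doc's last commit and
    the system's last change, e.g. pulled from Git log timestamps.
    """
    return system_changed - doc_updated > grace
```

Sourcing both sides from Git keeps the metric honest: a doc edited after a deployment-only change stays fresh, while a config change with no matching doc commit trips the flag once the grace window expires.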