Prerequisites

  • Working knowledge of Kubernetes Deployments, Services, and Helm chart installation for deploying new services to a cluster

  • Familiarity with Microsoft Presidio analyzer and anonymizer APIs for PII detection and scrubbing

  • Experience with LiteLLM proxy middleware and request/response interceptors

  • Understanding of Prometheus metrics instrumentation including counters, histograms, and label cardinality management

  • Familiarity with FastAPI middleware patterns and request lifecycle hooks

  • Working knowledge of PostgreSQL for structured audit data storage and querying

  • Experience with Pydantic models for request validation and configuration management

  • Understanding of Alertmanager routing and receiver configuration for severity-based alert delivery

  • Basic knowledge of data protection regulations (GDPR, CCPA, HIPAA) and their requirements for PII handling

  • Completion of Chapter 44 (Key Rotation Operator) or equivalent experience with security automation in GenAI platforms

Learning Goals

  1. Deploy Microsoft Presidio on K8s for runtime PII detection in LLM traffic

    • You will deploy Presidio analyzer and anonymizer services on your vCluster via Helm charts, configure them for low-latency operation, and build middleware that integrates with LiteLLM to scan every outbound prompt.

    • You will configure bypass rules for trusted internal prompts that intentionally contain example PII, ensuring system prompts are not falsely flagged.

    • The middleware sends text to the Presidio analyzer, receives detected PII entities with confidence scores, and applies per-entity-type handling rules.

  2. Build PII scrubbing middleware that redacts sensitive data before sending to LLM providers

    • You will implement the PIIDetectionMiddleware that sits in the LiteLLM request path, intercepts all outbound prompts, and applies Presidio-based scrubbing before the request leaves your infrastructure.

    • You will build configurable scrubbing policies that allow different behaviors per service, use case, and entity type, providing the flexibility that production deployments require.

    • The middleware handles per-entity-type actions (replace, hash, block), tracks detection metrics by entity type, and adds minimal latency to the request path.

  3. Implement PII detection alerting and audit logging for compliance

    • Every PII detection event must be logged without exposing the detected PII values.

    • You will build audit logging to PostgreSQL that records request identifiers, entity types, confidence scores, and actions taken while never storing the actual sensitive data.

    • Alerting rules fire on PII rate spikes, high-confidence detections of critical entity types (SSNs, credit cards), and PII detected in services that should never handle sensitive data.

  4. Create PII detection tuning workflows to reduce false positives

    • You will build a feedback loop where operators flag false positive detections, the system analyzes false positive patterns, and threshold adjustments are A/B tested on a percentage of traffic before full deployment.

    • The tuning workflow tracks false positive rates by entity type, maintains custom deny lists for known non-PII patterns that trigger false positives, and measures the effectiveness of threshold changes against both detection rate and false positive rate.
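The per-entity-type handling described in goals 1 and 2 can be sketched in plain Python. This is a minimal illustration, not the chapter's PIIDetectionMiddleware: the policy table, the `scrub` function, the `RequestBlocked` exception, and the `HASH_KEY` value are all hypothetical names chosen for this example. It assumes findings arrive as dictionaries with `entity_type`, `start`, `end`, and `score` fields (the shape of the Presidio analyzer's JSON results) and that the spans do not overlap.

```python
import hashlib
import hmac

# Hypothetical policy table: entity type -> (minimum score, action).
POLICY = {
    "PERSON": (0.60, "replace"),
    "EMAIL_ADDRESS": (0.50, "hash"),
    "US_SSN": (0.40, "block"),
}
DEFAULT_RULE = (0.80, "replace")   # assumed fallback for unlisted entity types
HASH_KEY = b"example-only-key"     # assumption: a real key comes from a secret store


class RequestBlocked(Exception):
    """Raised when a finding maps to the 'block' action."""


def scrub(text, findings, policy=POLICY):
    """Return (scrubbed_text, applied_actions) for analyzer-style findings.

    Assumes the findings do not overlap.
    """
    applied = []
    # Work right to left so earlier offsets stay valid after each substitution.
    for f in sorted(findings, key=lambda f: f["start"], reverse=True):
        threshold, action = policy.get(f["entity_type"], DEFAULT_RULE)
        if f["score"] < threshold:
            continue  # below the detection threshold: leave the span untouched
        if action == "block":
            raise RequestBlocked(f["entity_type"])
        if action == "hash":
            value = text[f["start"]:f["end"]]
            # Keyed hash: repeated values correlate, the original is not exposed.
            digest = hmac.new(HASH_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:12]
            marker = f"<{f['entity_type']}:{digest}>"
        else:
            marker = f"<{f['entity_type']}>"
        text = text[:f["start"]] + marker + text[f["end"]:]
        applied.append((f["entity_type"], action))
    return text, applied


text = "Contact Jane Doe at jane@example.com"
findings = [
    {"entity_type": "PERSON", "start": 8, "end": 16, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 20, "end": 36, "score": 1.0},
]
scrubbed, applied = scrub(text, findings)
print(scrubbed)
```

Raising on the first blocking finding keeps the sketch short. A production middleware would more likely collect all findings first so that the audit log records every entity type detected in a blocked request.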

Key Terminology

PII (Personally Identifiable Information)
Any data that can identify a specific individual, including names, email addresses, phone numbers, Social Security numbers, credit card numbers, and physical addresses.
Microsoft Presidio
An open-source framework for PII detection (analyzer) and remediation (anonymizer) that uses NLP models and pattern matching to identify sensitive data in unstructured text.
Presidio Analyzer
The Presidio component that scans text and returns a list of detected PII entities with their type, confidence score, and position in the text.
Presidio Anonymizer
The Presidio component that applies remediation actions (replace, hash, redact, mask) to detected PII entities in the original text.
Entity Type
The classification of a detected PII element, such as PERSON, EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, CREDIT_CARD, or LOCATION.
Confidence Score
A float between 0 and 1 assigned by Presidio indicating how certain the analyzer is that a detected span is actually PII of the claimed entity type.
Scrubbing
The process of replacing or removing detected PII from text before it is transmitted to external services, preserving the text structure while eliminating sensitive data.
Placeholder Replacement
A scrubbing strategy that replaces detected PII with a typed marker that names the entity type, such as <PERSON> or <EMAIL_ADDRESS>, maintaining readability while removing sensitive values.
PII Hashing
A scrubbing strategy that replaces detected PII with a one-way hash, allowing correlation of repeated values without exposing the original data.
Request Blocking
The most aggressive PII handling action where the entire LLM request is rejected when high-severity PII (SSN, credit card) is detected.
Bypass Rule
A configuration that exempts specific services or system prompts from PII detection, used for trusted internal prompts that intentionally contain example PII.
False Positive
A detection where Presidio identifies text as PII when it is not actually sensitive data, such as flagging a product name as a person name.
Detection Threshold
The minimum confidence score required for a Presidio detection to trigger a scrubbing action, configurable per entity type.
A/B Testing (Threshold Tuning)
Running new detection thresholds on a percentage of traffic in shadow mode to compare detection and false positive rates before full deployment.
PII Audit Log
A compliance record of every PII detection event stored in PostgreSQL, containing entity type, confidence, and action taken but never the actual PII value.
Deny List
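A curated list of text patterns known to trigger false positive PII detections, used to suppress known non-PII matches without adjusting global thresholds. Presidio's own API names this suppression mechanism an allow list (the allow_list argument of the analyzer's analyze call), whereas a Presidio deny list forces detection of the listed terms.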
Shadow Mode
A detection mode where PII is detected and logged but not scrubbed or blocked, used for evaluating threshold changes without impacting request content.
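Several of the terms above (Detection Threshold, A/B Testing, PII Audit Log, Shadow Mode) fit together in a way that a short sketch may make clearer. The code below is illustrative only: the function names `in_experiment` and `audit_rows`, the row fields, and the action labels are assumptions made for this example, and it uses a single threshold rather than one per entity type. It shows deterministic assignment of a request to a traffic percentage, and audit rows that carry metadata but never the matched text.

```python
import hashlib
from datetime import datetime, timezone


def in_experiment(request_id, percent):
    """Deterministically place a request in the experiment by hashing its ID."""
    bucket = int(hashlib.sha256(request_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < percent


def audit_rows(request_id, findings, threshold, mode):
    """Build audit rows from finding metadata; the matched text is never read."""
    now = datetime.now(timezone.utc).isoformat()
    rows = []
    for f in findings:
        if f["score"] < threshold:
            continue  # below the threshold under test: no detection event
        rows.append({
            "request_id": request_id,
            "entity_type": f["entity_type"],
            "confidence": f["score"],
            # Shadow mode records the detection but leaves the request untouched.
            "action": "log_only" if mode == "shadow" else "scrub",
            "detected_at": now,
        })
    return rows


findings = [
    {"entity_type": "PERSON", "start": 0, "end": 8, "score": 0.55},
    {"entity_type": "US_SSN", "start": 20, "end": 31, "score": 0.90},
]
live = audit_rows("req-123", findings, threshold=0.60, mode="enforce")
shadow = audit_rows("req-123", findings, threshold=0.50, mode="shadow")
print(len(live), len(shadow))
```

Comparing the `live` and `shadow` rows for the same request shows which extra detections a lower candidate threshold would introduce, without changing what is sent to the LLM provider.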
