Core responsibilities this discipline prepares you for.
1. Design CI/CD pipelines for LLM application deployment
- Build ArgoCD GitOps workflows with Helm-based deployments and environment promotion
- Implement canary and blue-green rollout strategies with automated quality-based rollback
- Wire complete CI/CD pipelines that trigger rollbacks when evaluation metrics degrade
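The rollback trigger in the last bullet can be sketched as a small decision function. This is a minimal illustration, not any particular tool's API: the metric names and the `should_rollback` helper are hypothetical, and in a real pipeline the scores would come from an evaluation job run against the canary.

```python
def should_rollback(baseline_scores, canary_scores, max_degradation=0.05):
    """Return True if any canary eval metric degrades past the tolerance.

    baseline_scores / canary_scores: dicts of metric name -> score in [0, 1].
    A metric missing from the canary run counts as a full failure (score 0).
    """
    for metric, baseline in baseline_scores.items():
        canary = canary_scores.get(metric, 0.0)
        if baseline - canary > max_degradation:
            return True  # degradation exceeds tolerance: abort the rollout
    return False


# Example: faithfulness dropped 0.10 on the canary, beyond the 0.05 budget.
decision = should_rollback(
    {"faithfulness": 0.90, "answer_relevancy": 0.85},
    {"faithfulness": 0.80, "answer_relevancy": 0.86},
)
```

A CD controller (e.g. an Argo Rollouts analysis step) would call logic like this between traffic-shift increments and promote only when it returns `False`.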
2. Monitor LLM systems in production — latency, errors, costs, quality
- Instrument with OpenTelemetry and Langfuse v3 for OTEL-native distributed tracing
- Build Grafana dashboards with Logfire for Python application monitoring and alerting
- Set up monitoring stacks that detect anomalies, fire alerts, and enable trace-based root cause analysis
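One common anomaly-detection building block for a stack like this is a rolling z-score over request latencies. The sketch below is a simplified stand-in for what Prometheus/Grafana alert rules would do; the window and threshold values are illustrative assumptions.

```python
from statistics import mean, stdev


def detect_anomalies(latencies_ms, window=20, threshold=3.0):
    """Flag indices whose latency exceeds the rolling mean by `threshold` stdevs.

    A pure-Python stand-in for a monitoring-backend alert rule: compare each
    sample against the statistics of the preceding `window` samples.
    """
    anomalies = []
    for i in range(window, len(latencies_ms)):
        baseline = latencies_ms[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and (latencies_ms[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Steady ~100 ms traffic, then one 500 ms spike at index 20.
samples = [100, 102] * 10 + [500]
spikes = detect_anomalies(samples)
```

In production the same comparison would fire an alert and attach the trace ID of the slow request for root cause analysis.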
3. Manage LLM gateway operations — key rotation, failover, quota management
- Operate LiteLLM gateway: API key lifecycle management, provider health monitoring, per-team quotas
- Handle zero-downtime model version switching with traffic draining and validation
- Simulate provider outages and quota exhaustion to validate failover and degradation behavior
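Failover under outage or quota exhaustion reduces to a provider-selection policy. This is a minimal sketch of that policy, not LiteLLM's actual routing code; the `pick_provider` helper and the record fields are hypothetical.

```python
def pick_provider(providers):
    """Return the name of the first healthy provider with quota, by priority.

    providers: list of dicts with keys "name", "priority" (lower wins),
    "healthy" (bool), and "quota_remaining" (tokens or requests left).
    Returns None when every provider is down or exhausted, signalling the
    caller to degrade gracefully (serve a cached answer, queue, or 503).
    """
    for p in sorted(providers, key=lambda p: p["priority"]):
        if p["healthy"] and p["quota_remaining"] > 0:
            return p["name"]
    return None


fleet = [
    {"name": "primary", "priority": 1, "healthy": False, "quota_remaining": 100},
    {"name": "secondary", "priority": 2, "healthy": True, "quota_remaining": 50},
]
chosen = pick_provider(fleet)
```

An outage simulation flips `healthy` or zeroes `quota_remaining` and asserts that traffic lands on the expected fallback.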
4. Implement FinOps practices — cost attribution, budgets, and optimization
- Track token costs by team, feature, and model with Prometheus-based budget alerting
- Implement cost optimization through semantic caching, model tiering, and prompt compression
- Build FinOps dashboards that demonstrate measurable cost reduction across optimization strategies
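Cost attribution by team starts with aggregating token usage against a price table. The rates and record shape below are made-up placeholders for illustration; real per-1K-token prices vary by provider and change over time.

```python
# Hypothetical per-1K-token prices; substitute your provider's current rates.
PRICE_PER_1K = {"model-large": 0.005, "model-small": 0.0006}


def cost_by_team(usage_records):
    """Aggregate token spend per team from raw usage records.

    usage_records: iterable of dicts with "team", "model", and "tokens" keys,
    as a gateway or tracing backend would export them.
    """
    totals = {}
    for rec in usage_records:
        cost = rec["tokens"] / 1000 * PRICE_PER_1K[rec["model"]]
        totals[rec["team"]] = totals.get(rec["team"], 0.0) + cost
    return totals


usage = [
    {"team": "search", "model": "model-large", "tokens": 100_000},
    {"team": "search", "model": "model-small", "tokens": 500_000},
    {"team": "support", "model": "model-large", "tokens": 20_000},
]
spend = cost_by_team(usage)
```

The same aggregation, exported as Prometheus metrics labelled by team and model, is what the budget-alerting rules in the first bullet would fire on.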
5. Build continuous evaluation pipelines for production LLM quality
- Run RAGAS and DeepEval evaluation pipelines alongside production traffic as shadow evaluators
- Set up Langfuse-based quality tracking with automated quality gates and threshold alerting
- Detect quality degradation in real time and trigger automated alerts when scores drop below baselines
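A quality gate with threshold alerting can be reduced to comparing a rolling mean of recent eval scores against a baseline minus a tolerance. The `quality_gate` function and its defaults are illustrative assumptions, not a RAGAS or DeepEval API.

```python
def quality_gate(scores, baseline, tolerance=0.05, window=5):
    """Return True (alert) when the rolling mean of the last `window` eval
    scores drops below baseline - tolerance.

    scores: chronological list of per-sample eval scores in [0, 1], e.g. from
    a shadow evaluator running alongside production traffic.
    """
    if len(scores) < window:
        return False  # not enough samples yet to judge
    recent = sum(scores[-window:]) / window
    return recent < baseline - tolerance


# Healthy stream, then a sustained drop below the 0.90 baseline.
healthy = quality_gate([0.9, 0.9, 0.9, 0.9, 0.9], baseline=0.9)
degraded = quality_gate([0.9, 0.9, 0.7, 0.7, 0.7, 0.7, 0.7], baseline=0.9)
```

Windowing over several samples, rather than alerting on a single bad score, is what keeps the gate from paging on one-off evaluation noise.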
6. Detect and respond to prompt attacks and safety incidents in production
- Monitor NeMo Guardrails operationally for prompt injection and jailbreak detection patterns
- Classify incident severity and execute structured response workflows with containment procedures
- Simulate attack scenarios end-to-end: detection, triage, remediation, and post-incident analysis
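Severity classification is typically an ordered rule table: the first matching rule wins. The rules and event fields below are hypothetical examples of such a policy, not a NeMo Guardrails interface.

```python
# Ordered most-severe-first; the first matching rule decides the severity.
SEVERITY_RULES = [
    ("critical", lambda e: e["attack_type"] == "data_exfiltration"),
    ("high",     lambda e: e["attack_type"] == "jailbreak" and e["succeeded"]),
    ("medium",   lambda e: e["attack_type"] == "jailbreak"),
    ("low",      lambda e: True),  # default: log and review, no page
]


def classify_incident(event):
    """Map a detection event to a severity level via the ordered rule table.

    event: dict with "attack_type" and "succeeded" keys, as a guardrail
    detector might emit them.
    """
    for severity, rule in SEVERITY_RULES:
        if rule(event):
            return severity


sev = classify_incident({"attack_type": "jailbreak", "succeeded": True})
```

Each severity level then maps to a response workflow: critical triggers immediate containment (key revocation, traffic block), while low-severity events feed post-incident analysis.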
7. Manage data quality for RAG systems — freshness, drift, accuracy
- Monitor embedding drift and retrieval accuracy with continuous RAGAS evaluation
- Set up automated reindexing triggers and stale content detection pipelines
- Build monitoring for live RAG systems that detects quality degradation and triggers reindexing workflows
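One simple embedding-drift signal is the cosine similarity between the centroid of a baseline embedding sample and the centroid of the current corpus. The `needs_reindex` helper and its 0.95 threshold are illustrative assumptions; production monitors usually combine this with retrieval-accuracy evals.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def needs_reindex(baseline_embs, current_embs, min_similarity=0.95):
    """Flag reindexing when the corpus centroid drifts from the baseline."""
    return cosine(centroid(baseline_embs), centroid(current_embs)) < min_similarity


drifted = needs_reindex([[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]])
```

When this fires, the reindexing trigger from the second bullet re-embeds the affected collection and re-runs the retrieval evals before swapping the index in.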
8. Implement capacity planning — predict demand and right-size deployments
- Forecast token demand using historical usage patterns and run load tests for LLM services
- Model SLA capacity requirements and configure KEDA-based autoscaling policies
- Run load tests that predict capacity requirements and validate SLA compliance under variable traffic
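The forecasting-plus-sizing loop can be illustrated with a naive linear-trend forecast feeding a replica calculation. Both helpers and the 20% headroom factor are hypothetical simplifications; real planning would use seasonality-aware models and measured per-replica throughput from load tests.

```python
import math


def forecast_tokens(history, horizon=1):
    """Naive linear trend: extrapolate the average step change `horizon` steps.

    history: chronological token counts per period (e.g. tokens per minute).
    """
    if len(history) < 2:
        return history[-1] if history else 0
    step = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + step * horizon


def replicas_needed(tokens_per_min, tokens_per_replica_per_min, headroom=1.2):
    """Size the deployment for forecast demand plus a safety margin.

    tokens_per_replica_per_min should come from load testing one replica
    at its SLA latency target, not from vendor spec sheets.
    """
    return math.ceil(tokens_per_min * headroom / tokens_per_replica_per_min)


demand = forecast_tokens([100_000, 200_000, 300_000])  # next-period estimate
fleet_size = replicas_needed(demand, tokens_per_replica_per_min=120_000)
```

The resulting replica count would seed the min/max bounds of a KEDA autoscaling policy, with the load tests from the last bullet validating that the sized fleet actually holds the SLA under variable traffic.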