Master SRE and platform engineering for GenAI systems. The course covers deploying, monitoring, troubleshooting, scaling, securing, and maintaining production GenAI platforms. Topics include GenAI reliability fundamentals (SLIs, SLOs, and error budgets); environment and pipeline engineering with Helm, Kustomize, and Argo CD; release engineering with canary deployments and model lifecycle management; full-stack observability with distributed tracing and quality drift detection; incident management with chaos engineering and game days; cost engineering and FinOps for LLM spend; data and knowledge operations for vector stores and knowledge graphs; security and compliance operations with guardrails and red teaming; and platform operations capstones for the healthcare, finance, and enterprise domains. Lab infrastructure: each student gets a vCluster (a virtual Kubernetes cluster) hosted on GKE, with full cluster-admin access. All labs run in Kubernetes pods using hosted model SDKs, so no GPU is required.
Python essentials and development environment for agent development
Virtual environments, async programming, type hints, Pydantic, error handling, testing, debugging, logging, project structure
Core LLM concepts: API clients, token economics, caching, and function calling basics
LLM APIs, OpenAI/Anthropic/Gemini clients, prompt caching, token economics, function calling basics
Agent patterns: ReAct, planning, tool execution, sandboxing, web navigation, and the Model Context Protocol (MCP)
ReAct loop, planning patterns, tool execution, sandboxing, web navigation, MCP servers, MCP clients, tool routing
Memory systems, RAG patterns, context optimization, and LangGraph state machines
Short-term memory, long-term memory (RAG), agentic RAG patterns, semantic memory, context optimization, state graphs, conditional edges, checkpointing, human-in-the-loop, streaming, subgraphs
Multi-agent patterns, guardrails, evaluations, and observability
Supervisor pattern, hierarchical pattern, reflector pattern, input guardrails, output guardrails, prompt injection defense, evaluations, benchmarking, tracing, observability
Production deployment: APIs, containers, databases, scaling, CI/CD, and monitoring
FastAPI, Docker, production databases, scaling, CI/CD, monitoring, alerting, model routing, fallbacks, system design
Alternative frameworks, protocols, specialized agents, autonomous workflows, and cutting-edge capabilities
CrewAI/AutoGen, A2A protocols, GraphRAG, local models, vision agents, voice agents, code agents, autonomous workflows, streaming data, agent swarms
Production excellence: trajectory evaluation, safety, cost control, enterprise patterns, and governance
Agent trajectory evaluation, safety boundaries, cost control, enterprise agent patterns, load testing, versioning, fleet dashboards, autonomous agent governance