On This Page
Prerequisites
This chapter builds on Chapter 9: GitOps with Argo CD, where you configured declarative delivery for GenAI artifacts. Before starting, ensure you have completed the following:
- Argo CD fundamentals — deploying ApplicationSets and syncing Git-driven manifests to Kubernetes
- GenAI artifact structure — prompt templates, model configurations, RAG configs, and guardrail policies stored in version-controlled repositories
- Container orchestration — running workloads on Kubernetes with familiarity in namespaces, ConfigMaps, and Secrets
- Python scripting — writing validation and evaluation scripts for automated testing
- Basic CI/CD concepts — understanding pipeline stages, triggers, and artifact promotion across environments
Learning Goals
-
Build Argo Workflow templates for GenAI artifact CI/CD
-
Build Argo Workflow templates for GenAI artifact CI/CDthat encapsulate reusable pipeline logic for prompt files, model configuration manifests, retrieval-augmented generation indexes, and guardrail policy documents, enabling your platform team to ship a self-service CI/CD experience where ML engineers submit pull requests against a GitOps repository and Argo Workflows orchestrates every validation, evaluation, and promotion step without manual intervention.
-
Design WorkflowTemplate resources that parameterize GenAI artifact types. In production GenAI platforms, a single monolithic pipeline cannot handle the diversity of artifacts that flow through the system. A prompt template has fundamentally different validation needs than a model serving configuration or a vector-store index definition. Argo Workflows solves this through its WorkflowTemplate custom resource, which lets you define reusable, parameterized DAGs (directed acyclic graphs) that accept inputs such as artifact type, target environment, evaluation thresholds, and rollback flags. You will learn to structure your templates using a layered architecture: a base template that handles git checkout, artifact discovery, and notification scaffolding; artifact-specific templates that inherit from the base and inject lint, validate, and evaluate steps appropriate to the artifact kind; and promotion templates that handle the final gate between staging and production. Each WorkflowTemplate declares its parameters block, which Argo exposes to both the CLI and the Argo Events sensor that triggers workflows on repository changes. You will learn why the entrypoint field must reference a DAG template rather than a simple steps template when your pipeline requires conditional branching — for example, skipping the expensive model-evaluation stage when only a guardrail policy YAML changed. You will also learn how to version your WorkflowTemplates using Kubernetes labels and annotations so that in-flight workflows continue to reference the template version they were instantiated with, preventing mid-execution template drift that can cause silent correctness bugs in your promotion logic. A well-designed template library typically contains between six and twelve WorkflowTemplates for a mature GenAI platform, covering prompt CI, prompt CD, model-config CI, model-config CD, RAG-index CI, RAG-index CD, guardrail-policy CI, guardrail-policy CD, a shared notification template, and a shared artifact-signing template. Understanding this decomposition is the first step toward a pipeline system that scales with your organization rather than against it.
-
Implement artifact-aware triggering using Argo Events sensors and event sources. A CI/CD pipeline is only as useful as its trigger mechanism. Argo Events provides a clean separation between event ingestion (EventSource resources that listen to GitHub webhooks, S3 bucket notifications, or Kafka topics) and event routing (Sensor resources that filter events and instantiate Workflows or WorkflowTemplates with the correct parameters). You will learn to configure a GitHub EventSource that receives push and pull-request events, paired with a Sensor that inspects the changed file paths to determine which artifact-specific WorkflowTemplate to invoke. For example, if a commit modifies files under the prompts/ directory, the sensor extracts the prompt identifier from the file path, resolves the appropriate WorkflowTemplate reference for prompt CI, and passes the prompt ID, commit SHA, and target branch as parameters. This path-based routing eliminates wasted compute by ensuring that a change to a single guardrail policy file does not trigger the full model-evaluation suite. You will also learn how to configure the Sensor's dependency logic to require multiple events before triggering — a pattern useful when your pipeline should only run after both a prompt change and its corresponding evaluation-dataset update land in the same pull request. The dependencyGroups field in the Sensor spec controls this fan-in behavior. Getting triggering right is critical because misconfigured sensors are the number-one cause of pipeline storms in production Argo installations, where a single merge commit touching fifty files spawns fifty redundant workflows that saturate your cluster's CPU and memory resources.
-
Configure artifact storage, caching, and output-artifact promotion across pipeline stages. Argo Workflows treats artifacts as first-class citizens, allowing each step in a DAG to declare input and output artifacts that are persisted to an artifact repository — typically an S3-compatible object store such as Amazon S3, Google Cloud Storage, or MinIO. You will learn to configure the artifactRepository section in your Workflow defaults so that every pipeline run stores its intermediate outputs (lint reports, validation summaries, evaluation scorecards) in a structured path like s3://genai-pipelines/{workflowName}/{podName}/{artifactName}. This structure enables post-hoc debugging by letting engineers browse the exact artifacts produced by any historical pipeline run. You will also learn to use Argo's built-in artifact garbage collection to prevent storage costs from growing unboundedly: setting artifactGC.strategy to OnWorkflowDeletion ensures that artifacts are cleaned up when the Workflow resource is pruned from Kubernetes, while OnWorkflowCompletion deletes them immediately after the pipeline finishes — useful for ephemeral lint outputs but dangerous for evaluation scorecards you need for audit trails. For large artifacts such as model weights or vector-store snapshots, you will learn to use the archive and raw artifact modes to avoid downloading multi-gigabyte files into the step container when only a metadata sidecar file is needed for the promotion decision.
-
-
Implement pipeline stages: lint, validate, eval, promote
-
Implement pipeline stages for lint, validate, eval, and promotethat enforce correctness at every phase of the GenAI artifact lifecycle, catching schema violations during linting, semantic inconsistencies during validation, quality regressions during evaluation, and policy violations before promotion, so that your production environment only ever receives artifacts that have survived a gauntlet of automated checks.
-
Build a linting stage that validates artifact syntax and schema conformance. The lint stage is the first line of defense and must execute in under thirty seconds to provide fast feedback on pull requests. For prompt artifacts, linting means parsing the prompt template file (typically Jinja2 or Mustache syntax) to verify that all template variables are declared in the accompanying metadata YAML, that no template variable is referenced but undefined, and that the rendered prompt length stays within the model's context window when populated with maximum-length example values from the test fixture. For model configuration artifacts, linting means validating the JSON or YAML against a JSON Schema that declares required fields such as model_name, temperature, max_tokens, stop_sequences, and fallback_model, while also enforcing value constraints — for example, ensuring temperature is a float between 0.0 and 2.0 and that max_tokens does not exceed the model's published limit. For RAG configuration artifacts, linting checks that the declared embedding model exists in your model registry, that the chunk-size and chunk-overlap parameters are integers within acceptable ranges, and that the retrieval top-k value does not exceed the hard limit imposed by your vector database. For guardrail policy artifacts, linting parses the policy DSL (domain-specific language) and verifies that every referenced guardrail function exists in the guardrail registry, that threshold values are valid floats, and that the policy's boolean logic tree is well-formed — no dangling AND or OR operators, no circular references between policy rules. You will implement each linter as a container image with a standardized interface: the container receives the artifact path as an environment variable, writes a JSON report to /tmp/lint-report.json, and exits with code 0 on success or code 1 on failure. Argo captures the JSON report as an output artifact, and downstream steps can parse it to include specific error messages in the pull-request comment posted by your notification template.
-
Construct a validation stage that performs semantic checks beyond syntax. While linting catches structural problems, validation catches logical problems that require domain knowledge. For prompt artifacts, validation means running the prompt through a deterministic test harness that substitutes known input-output pairs and verifies that the prompt, when rendered with test inputs, produces outputs that match expected patterns — not via LLM inference at this stage, but via regex or JSON-schema matching on the prompt's output-formatting instructions. For example, if your prompt instructs the model to respond in JSON with a sentiment field, the validator confirms that the prompt's few-shot examples all contain valid JSON with that field. For model configurations, validation cross-references the declared model_name against your model registry API to confirm the model is deployed and healthy in the target environment, checks that the stop_sequences array does not contain tokens that would truncate valid responses, and verifies that the fallback_model is a different model than the primary to ensure genuine redundancy. For RAG configurations, validation connects to the vector database's health endpoint and confirms that the declared collection exists, that its dimensionality matches the embedding model's output dimension, and that a sample query returns results within acceptable latency bounds. For guardrail policies, validation instantiates the policy engine with the new policy and runs it against a curated test suite of benign and adversarial inputs, verifying that the policy correctly allows benign inputs and blocks adversarial ones with the expected guardrail-violation codes. Each validator container follows the same interface convention as the linter: artifact path in, JSON report out, exit code signals pass or fail. The critical architectural decision here is whether validation stages should have network access to production services. The recommended pattern is to validate against a dedicated staging replica that mirrors production's model deployments and vector stores, ensuring that validation results are meaningful without risking side effects on production traffic.
-
Design an evaluation stage that measures artifact quality against baseline metrics. The evaluation stage is where your pipeline invests real compute to answer the question: does this artifact change make things better or worse? For prompt artifacts, evaluation means running the prompt against a held-out evaluation dataset using your LLM inference service and scoring the responses with automated metrics — BLEU, ROUGE, or custom rubric-based scoring using a judge LLM. The evaluation step compares the new prompt's scores against the currently-deployed prompt's baseline scores, stored in your artifact metadata store, and fails the pipeline if any metric regresses beyond a configurable threshold (typically 2-5% relative degradation). For model configurations, evaluation means deploying the new configuration to a shadow endpoint and routing a sample of production traffic (or replayed traffic from your request log) through both the current and candidate configurations, then comparing latency percentiles, token usage, and output-quality scores. For RAG configurations, evaluation means running a retrieval benchmark suite that measures recall@k, precision@k, and mean reciprocal rank against a labeled relevance dataset, failing if retrieval quality drops below the baseline. For guardrail policies, evaluation means running an expanded adversarial test suite — not just the curated set from validation, but a dynamically generated set produced by a red-team LLM that attempts to bypass the new policy. The evaluation stage records all results as a structured scorecard artifact that the promotion stage consumes. You will learn to configure Argo's retryStrategy on evaluation steps to handle transient LLM API failures without restarting the entire pipeline, using retryPolicy: OnError with a backoff configuration that doubles the wait time between retries up to a maximum of five attempts.
-
Implement a promotion gate that enforces approval policies before deploying artifacts. The promotion stage is the final checkpoint where automation meets governance. You will learn to implement promotion as a two-phase process: an automated gate that programmatically verifies all upstream stages passed and all metric thresholds were met, followed by an optional human-approval gate for high-risk artifacts. Argo Workflows supports human approval through the suspend template type, which pauses the workflow and waits for an external signal — either a manual resume via the Argo UI or a webhook from your internal approval system (such as a Slack bot or a PagerDuty-integrated approval service). You will learn to use Argo's when expressions to make the human-approval gate conditional: production promotions require approval, while staging promotions proceed automatically. The promotion step itself executes a GitOps commit — updating the target environment's artifact manifest in your deployment repository, which is then picked up by Argo CD or Flux to perform the actual deployment. You will also implement an automated rollback trigger that monitors the promoted artifact's production metrics for a configurable bake period (typically 15-30 minutes) and initiates a revert workflow if error rates or latency breach alerting thresholds.
-
-
Create artifact-specific pipelines for prompts, models, and RAG configs
-
Create artifact-specific pipelines for prompts, models, and RAG configsthat account for the unique lifecycle, testing requirements, and deployment semantics of each GenAI artifact type, moving beyond generic CI/CD templates to purpose-built pipelines that treat prompts as versioned software, model configurations as infrastructure declarations, and RAG indexes as data products with their own freshness and quality SLAs.
-
Build a prompt CI/CD pipeline that versions, tests, and deploys prompt templates as first-class software artifacts. Prompts are the most frequently changed artifact in a GenAI system — some organizations deploy prompt updates multiple times per day — yet they are often managed with the least rigor. Your prompt pipeline must treat prompt templates with the same discipline applied to application code. The pipeline begins with a Git-triggered lint stage that parses the prompt's Jinja2 template syntax, validates variable declarations against a schema file co-located with the prompt, and checks the prompt's estimated token count against the target model's context window limit. Next, a unit-test stage renders the prompt with fixtures from a test_cases.yaml file and verifies that each rendered prompt matches expected structural patterns — for instance, confirming that a chain-of-thought prompt always includes the phrase "Let me think step by step" and that a JSON-output prompt always ends with a valid JSON opening brace. The evaluation stage runs the rendered prompts through the target LLM and scores responses against a labeled evaluation dataset using both automated metrics (exact match, F1, ROUGE-L) and a judge-LLM rubric that scores on dimensions like helpfulness, accuracy, and safety. The pipeline compares evaluation scores against the production baseline and computes a confidence interval to distinguish genuine regressions from noise caused by LLM non-determinism — you will learn why setting temperature to 0.0 during evaluation is necessary but not sufficient, since even greedy decoding can produce different outputs across different GPU batching configurations. The promotion stage writes the new prompt version to your prompt registry (a versioned object store or a dedicated prompt management service) and updates the GitOps manifest that your serving infrastructure reads. You will also implement a canary-release pattern where the new prompt serves 5% of traffic initially, with an automated metric monitor that gradually increases the percentage to 100% over a two-hour window if no quality degradation is detected, or rolls back immediately if the canary's error rate exceeds the control group by more than one standard deviation.
-
Construct a model-configuration pipeline that validates serving parameters and orchestrates shadow deployments. Model configurations govern how your serving infrastructure interacts with foundation models — parameters like temperature, top-p, frequency penalties, system prompts, stop sequences, and timeout values. A misconfigured model serving parameter can silently degrade output quality without triggering any hard errors, making automated validation especially critical. Your model-config pipeline begins with a schema-validation stage that enforces type constraints and value ranges for every parameter, cross-referencing the declared model name against your model registry to confirm compatibility. The validation stage deploys the new configuration to an isolated shadow endpoint and routes a replay of recent production requests through it, capturing both the responses and the serving telemetry (latency, token counts, error rates). The evaluation stage compares the shadow endpoint's outputs against the production endpoint's outputs using a semantic similarity metric (cosine similarity of embeddings) and flags any response pairs where similarity drops below 0.85, indicating a meaningful behavioral change. For configurations that intentionally change behavior — such as increasing temperature to make responses more creative — the pipeline accepts an expected_drift parameter that relaxes the similarity threshold and instead evaluates against a creativity-focused rubric. The promotion stage uses a blue-green deployment pattern: the new configuration is deployed to the inactive color, smoke-tested with synthetic requests, and then traffic is switched via a Kubernetes Service update. You will learn to encode the entire blue-green switch as an Argo DAG with explicit rollback edges, so that if the smoke test fails after the traffic switch, the pipeline automatically reverts the Service selector to the previous color within seconds.
-
Design a RAG configuration pipeline that validates retrieval quality and index freshness before promotion. RAG configurations define how your retrieval-augmented generation system chunks documents, generates embeddings, populates vector indexes, and retrieves context at inference time. A pipeline for RAG configs must validate not only the configuration parameters but also the resulting retrieval quality, because a seemingly innocuous change to chunk size or overlap can dramatically alter which passages the retriever surfaces. Your RAG pipeline begins with a parameter-validation stage that checks chunk-size and overlap values against empirically established bounds for your document corpus (typically 256-1024 tokens for chunk size and 10-25% for overlap), validates the embedding model reference against your model registry, and confirms that the target vector-database collection is accessible. The indexing stage applies the new configuration to a representative sample of your document corpus (typically 5-10% sampled by document type to maintain distribution) and builds a temporary vector index. The retrieval-evaluation stage runs your retrieval benchmark suite against the temporary index, measuring recall@10, precision@10, NDCG (normalized discounted cumulative gain), and mean reciprocal rank, then compares scores against the production index's baseline metrics. You will learn to implement a freshness check that verifies the time delta between the most recent document in the index and the current timestamp does not exceed your freshness SLA — typically 24 hours for knowledge-base content and 1 hour for real-time data feeds. The pipeline also measures end-to-end RAG quality by running a set of representative questions through the full RAG pipeline (retrieval plus generation) and scoring the final answers against ground-truth labels, ensuring that retrieval improvements actually translate into generation improvements rather than merely surfacing different but equally irrelevant passages. The promotion stage swaps the collection alias in the vector database to point to the newly built index, a zero-downtime operation that takes effect immediately for all inference requests.
-
Implement a guardrail-policy pipeline that tests safety boundaries with adversarial evaluation suites. Guardrail policies define the safety boundaries of your GenAI system — they determine which inputs are blocked, which outputs are filtered, and what fallback behaviors are triggered when policy violations occur. Because guardrail policies directly affect user safety and regulatory compliance, their pipeline requires the most rigorous evaluation of any artifact type. Your guardrail pipeline begins with a syntax-validation stage that parses the policy DSL, verifies that all referenced guardrail functions exist in the registry, and confirms that the policy's logical structure is valid — no unreachable rules, no conflicting priorities, no rules that would block 100% of traffic (a common misconfiguration). The functional-test stage runs the policy against a curated test suite organized by guardrail category: prompt injection detection, PII (personally identifiable information) leakage prevention, toxic content filtering, off-topic request blocking, and jailbreak resistance. Each test case specifies an input, the expected guardrail verdict (ALLOW, BLOCK, or FLAG), and the expected violation code. The adversarial-evaluation stage goes further by deploying a red-team LLM that dynamically generates novel attack prompts targeting the specific guardrail categories covered by the policy change. This stage measures the policy's true-positive rate (correctly blocked attacks), false-positive rate (incorrectly blocked benign inputs), and bypass rate (attacks that evade detection). The pipeline fails if the bypass rate exceeds a configurable threshold — typically 5% for content-safety guardrails and 1% for PII-leakage guardrails. You will learn to implement a regression-detection mechanism that compares the new policy's test results against the previous version's results, flagging any test case where the verdict changed from BLOCK to ALLOW as a potential safety regression requiring human review.
-
-
Monitor pipeline health with observability metrics
-
Monitor pipeline health with observability metricsthat provide real-time visibility into pipeline execution patterns, stage-level performance, failure rates, and resource consumption, enabling your platform team to proactively identify bottlenecks, detect reliability degradation, and maintain the SLAs that your ML engineering teams depend on for rapid, safe artifact deployment.
-
Instrument Argo Workflows to emit pipeline-level and stage-level metrics to Prometheus. Argo Workflows exposes a metrics endpoint that Prometheus can scrape, but the default metrics cover only basic workflow counts and durations. To build a comprehensive observability layer, you must augment these defaults with custom metrics that capture GenAI-specific dimensions. You will learn to configure Argo's metrics section in the workflow-controller ConfigMap to emit histograms for stage duration bucketed by artifact type (prompt, model-config, RAG-config, guardrail-policy), counters for stage outcomes (pass, fail, error, timeout) bucketed by stage name (lint, validate, eval, promote), and gauges for concurrent workflow count bucketed by priority class. Beyond the controller-level metrics, you will instrument individual pipeline stages to emit custom metrics using the Prometheus pushgateway pattern — each stage container pushes its specific metrics (such as evaluation score, number of lint violations, retrieval recall@10) to a pushgateway that Prometheus scrapes. This gives you artifact-quality metrics alongside infrastructure metrics in a single observability plane. You will learn to define recording rules in Prometheus that pre-aggregate common queries — for example, a recording rule that computes the 7-day rolling average of prompt-evaluation scores per prompt ID, enabling trend detection without expensive ad-hoc queries. You will also learn to configure Prometheus alert rules for critical pipeline health signals: alert when the pipeline failure rate exceeds 20% over a 1-hour window, alert when the median evaluation-stage duration exceeds 10 minutes (indicating LLM API degradation), and alert when the promotion-stage rollback rate exceeds 5% (indicating that artifacts are passing evaluation but failing in production, which signals an evaluation gap). Each alert should include a runbook link in its annotations that guides the on-call engineer through the diagnosis and remediation steps.
-
Build Grafana dashboards that visualize pipeline throughput, latency, and quality trends. Raw metrics in Prometheus are necessary but not sufficient — your platform team and your ML engineering stakeholders need curated dashboards that answer specific operational questions at a glance. You will learn to build a three-tier dashboard hierarchy. The first tier is an executive summary dashboard that shows the total number of artifact deployments per day (broken down by artifact type), the overall pipeline success rate, the mean time from commit to production deployment, and the number of automated rollbacks. This dashboard is designed for engineering leadership and uses large stat panels with threshold-based coloring — green above 95% success rate, yellow between 90-95%, red below 90%. The second tier is a pipeline-operations dashboard that shows per-stage latency distributions (as heatmaps), per-stage failure rates (as time-series graphs), queue depth (number of workflows waiting to be scheduled), and resource utilization (CPU and memory consumed by pipeline pods versus cluster capacity). This dashboard is designed for the platform team and uses dense panel layouts with drill-down links to individual workflow runs. The third tier is an artifact-quality dashboard that plots evaluation metrics over time — prompt ROUGE-L scores, model-config latency percentiles, RAG recall@10, and guardrail bypass rates — with annotations marking each production deployment so engineers can correlate quality trends with specific artifact changes. You will learn to use Grafana's variable system to make each dashboard filterable by artifact type, artifact ID, target environment, and time range, enabling engineers to zoom in from a fleet-wide view to a single prompt's deployment history. You will also implement Grafana alerting as a secondary alerting layer (in addition to Prometheus alerting) for visual anomaly detection — configuring the dashboards to highlight panels with orange borders when metrics deviate from their 30-day normal range by more than two standard deviations.
-
Implement distributed tracing across pipeline stages using OpenTelemetry. Metrics tell you that something is slow or failing; traces tell you why. You will learn to instrument your Argo pipeline stages with OpenTelemetry tracing so that each pipeline execution produces a single trace with spans for every stage, sub-stage, and external service call. The trace begins when the Argo sensor receives the triggering event and ends when the promotion stage completes (or the pipeline fails). Each span carries attributes that encode GenAI-specific context: the artifact type, artifact ID, artifact version, target environment, and the evaluation scores produced by that stage. You will learn to propagate trace context between Argo steps using the OTEL_TRACE_PARENT environment variable, which Argo can inject into each step container via the env section of the step template. For stages that call external services — such as the evaluation stage calling the LLM inference API or the promotion stage calling the vector-database admin API — the OpenTelemetry SDK automatically creates child spans that capture request and response metadata, giving you end-to-end latency visibility across service boundaries. You will configure your traces to be exported to a backend such as Jaeger, Tempo, or Datadog, and you will learn to set up trace-based alerting that triggers when specific span patterns occur — for example, alerting when an evaluation stage's LLM-inference span exceeds 30 seconds, which often indicates that the LLM provider is experiencing degradation before it shows up in their status page. You will also learn to use trace-derived metrics (RED metrics — Rate, Errors, Duration) as a complement to your Prometheus metrics, providing a second source of truth that can help diagnose discrepancies between what Argo reports and what actually happened at the network level. The combination of metrics, dashboards, and traces gives your platform team a complete observability stack that supports both real-time incident response and long-term capacity planning for your GenAI CI/CD infrastructure.
-