Learning Goals

  1. Implement trunk-based development with short-lived feature branches

    • Implement trunk-based development with short-lived feature branches for AI codebases that contain models, training data, and prompt templates alongside application code

    • Design a branching strategy where the main branch remains the single source of truth for production-ready AI pipelines, ensuring that model configuration files, prompt templates, and training data manifests all pass continuous integration checks before merging. Trunk-based development fundamentally changes how AI teams collaborate because long-lived branches create drift in prompt versions, stale model references, and conflicting dataset configurations that compound over days rather than hours. Senior engineers must understand that the cost of branch divergence in an AI codebase is substantially higher than in traditional software because a two-day-old feature branch may reference a model checkpoint that has already been superseded, a prompt template that has been A/B tested and revised, or a JSONL training file that has been deduplicated and reformatted. The trunk-based model forces these conflicts to surface immediately rather than accumulating into merge nightmares that require domain expertise to resolve. When your team maintains short-lived branches with a maximum lifetime of twenty-four hours, every engineer works against the latest prompt versions and model configurations, which eliminates an entire class of "works on my branch" integration failures that plague AI teams running longer iteration cycles.

    • Establish branch naming conventions that encode AI-specific metadata such as feat/prompt-v3-summarization, fix/training-data-dedup, or experiment/gpt4o-temperature-sweep, making it immediately clear from the branch name whether the change affects prompts, training data, model configurations, or application logic. This convention matters because code review expectations differ dramatically across these categories. A change to a prompt template requires review from a domain expert who understands the downstream evaluation metrics, while a training data change requires review from someone who can verify data quality and format compliance. Branch naming conventions feed directly into automation: your CI pipeline can route prompt/* branches through evaluation benchmarks, training-data/* branches through schema validation and deduplication checks, and experiment/* branches through a lighter-weight gate that only requires notebook execution without full integration testing. The discipline of encoding intent in the branch name also forces engineers to decompose large changes into atomic units. Instead of a single branch that updates the prompt, changes the temperature parameter, and adds new training examples simultaneously, engineers create three focused branches that can each be reviewed, tested, and merged independently. This decomposition is not merely organizational preference—it is essential for AI systems where you need the ability to revert a prompt change without also reverting a training data update, because these changes have fundamentally different rollback characteristics and blast radii.
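
The prefix-based CI routing described above can be sketched as a small dispatch table. This is a minimal illustration, not a fixed standard — the prefixes and gate names are assumptions you would adapt to your own pipeline:

```python
# Map branch-name prefixes to the CI gates they must pass.
# Prefixes and gate names are illustrative, not a fixed standard.
ROUTES = {
    "prompt/": ["prompt-eval-benchmark"],
    "training-data/": ["jsonl-schema-validation", "dedup-check"],
    "experiment/": ["notebook-execution"],
}

def gates_for(branch: str) -> list[str]:
    """Return the CI gates a branch must pass, based on its name prefix."""
    for prefix, gates in ROUTES.items():
        if branch.startswith(prefix):
            return gates
    return ["full-ci"]  # unrecognized prefixes get the strictest pipeline

print(gates_for("experiment/gpt4o-temperature-sweep"))  # ['notebook-execution']
```

In a real pipeline this lookup would run in the CI configuration layer (for example, a workflow step that reads the branch name from the CI environment and selects which jobs to trigger).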

    • Configure integration frequency targets where feature branches merge back to main at least once per day, using feature flags to gate incomplete functionality rather than relying on long-lived branches to isolate work in progress. Feature flags are particularly powerful in AI codebases because they allow you to deploy a new prompt version behind a flag, run shadow evaluation against production traffic, and promote or rollback the prompt independently of the deployment cycle. The integration frequency target is not arbitrary—it is derived from the observation that merge conflict complexity in AI projects grows super-linearly with branch age because JSONL files, JSON configuration files, and prompt templates lack the structural redundancy that makes three-way merges tractable in general-purpose programming languages. A prompt template is essentially a block of natural language text where every word matters, and Git's line-based merge algorithm has no semantic understanding of prompt engineering intent. By merging daily, you ensure that at most one day's worth of prompt changes need to be reconciled, which is almost always a trivial manual review rather than a complex semantic merge.
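
A minimal sketch of the flag-gating pattern for prompts follows. The flag name, prompt versions, and in-process flag store are all hypothetical; a production system would back the store with a configuration service so flags can be flipped without a redeploy:

```python
# Hypothetical in-process flag store; a real system would use a config service.
FLAGS = {"summarization-prompt-v4": False}  # new prompt is deployed but dark

PROMPTS = {
    "v3": "Summarize the following document in three sentences: {text}",
    "v4": "You are a precise summarizer. Condense {text} to three sentences.",
}

def active_prompt() -> str:
    # The new version ships behind the flag; flipping it promotes the
    # prompt independently of deployment, and flipping back is an
    # instant rollback.
    if FLAGS.get("summarization-prompt-v4"):
        return PROMPTS["v4"]
    return PROMPTS["v3"]
```

The same selection point is where shadow evaluation hooks in: traffic can be served by v3 while v4's responses are generated and scored offline before the flag is ever enabled.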

    • Implement a branch lifecycle policy that automatically deletes merged branches, tags experiment branches with their evaluation results before archival, and enforces a maximum branch age of forty-eight hours through CI warnings that escalate to blocking checks after seventy-two hours. This lifecycle policy prevents the accumulation of stale experiment branches that reference outdated model endpoints, deleted training data files, or deprecated prompt formats. In AI teams, abandoned experiment branches are particularly dangerous because a new team member may discover a branch named experiment/better-embeddings and attempt to revive it without realizing that the embedding model it references has been decommissioned, the vector database schema has changed, or the evaluation dataset has been rebalanced. Automated cleanup with pre-deletion tagging preserves the experimental record in Git tags while keeping the branch namespace clean and unambiguous.
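
The escalation logic of this lifecycle policy can be sketched as a pure decision function; wiring it to `git for-each-ref` output and to pre-deletion tagging is left to the CI job, and the thresholds simply mirror the policy stated above:

```python
from datetime import datetime, timedelta, timezone

# Thresholds mirror the policy: warn past 48 hours, block past 72.
WARN_AFTER = timedelta(hours=48)
BLOCK_AFTER = timedelta(hours=72)

def branch_action(last_commit: datetime, now: datetime) -> str:
    """Decide the CI response for a branch given its last-commit time."""
    age = now - last_commit
    if age > BLOCK_AFTER:
        return "block"  # required check fails until merged or archived
    if age > WARN_AFTER:
        return "warn"   # CI annotates the open PR with a staleness warning
    return "ok"
```

Keeping the decision logic separate from the Git plumbing makes the policy itself trivially unit-testable.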

  2. Configure branch protection rules with required reviews and status checks on GKE-hosted repositories

    • Configure branch protection rules with required reviews and status checks on GKE-hosted repositories where AI pipelines must pass model validation, prompt evaluation, and infrastructure health gates before code reaches production

    • Define branch protection configurations that require a minimum of two approving reviews for changes to prompt templates and model configuration files, with at least one reviewer drawn from the ML engineering team and one from the platform engineering team. This dual-review requirement exists because AI system changes span two distinct failure domains: the ML domain where a prompt change might degrade response quality, increase hallucination rates, or violate content safety policies, and the infrastructure domain where a model configuration change might exceed GPU memory limits, break autoscaling policies, or create incompatible API request formats. A single reviewer rarely possesses deep expertise in both domains, and the consequences of a missed review in either domain range from degraded user experience to production outages that require emergency model rollbacks. Senior engineers configuring these rules must balance review thoroughness against development velocity—requiring four reviewers for every change creates bottlenecks that incentivize engineers to batch changes into larger, riskier pull requests, while requiring only one reviewer creates blind spots where infrastructure implications of ML changes go unnoticed. The two-reviewer minimum with domain-specific routing represents the empirically validated sweet spot for teams running AI systems on Kubernetes where both model behavior and resource allocation require expert oversight.

    • Implement required status checks that gate merges on passing CI pipelines including model artifact validation, prompt regression testing, JSONL schema compliance, and infrastructure plan verification. These status checks must be configured as required rather than optional, meaning that no branch protection bypass—not even repository administrators—can merge code that fails a prompt regression test or deploys an invalid model configuration. The specific checks you configure depend on your AI pipeline topology, but a robust minimum set includes a prompt evaluation check that runs the changed prompts against a held-out evaluation dataset and fails if any quality metric drops below the established baseline, a schema validation check that verifies all JSONL training data files conform to the expected schema with no malformed records, a model configuration check that validates resource requests against cluster capacity and verifies that referenced model endpoints exist and are healthy, and an infrastructure check that runs a Terraform or Kubernetes dry-run to verify that any manifest changes will apply cleanly to the target cluster. Each check should report its results as a GitHub commit status with a details URL that links to the full evaluation report, enabling reviewers to inspect the specific metrics rather than trusting a binary pass/fail signal.
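
Publishing a check result with a details link might look like the following sketch against GitHub's commit status REST endpoint. The repository coordinates, the `ci/prompt-eval` context name, and the report URL are placeholders; a real job would read the token from its secret store:

```python
import json
import urllib.request

def build_status(passed: bool, report_url: str) -> dict:
    """Assemble the commit-status payload for a prompt evaluation run."""
    return {
        "state": "success" if passed else "failure",
        "context": "ci/prompt-eval",  # must match the required check's name
        "description": "Prompt regression suite",
        "target_url": report_url,     # links reviewers to the full report
    }

def post_status(owner: str, repo: str, sha: str, token: str, payload: dict) -> None:
    """POST the payload to GitHub's commit status API; raises on non-2xx."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
    urllib.request.urlopen(req)
```

Because the `context` string identifies the required check, it must match the name configured in the branch protection rule exactly, or the protection gate will wait forever for a status that never arrives.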

    • Design an emergency bypass mechanism for production incidents where branch protection rules can be temporarily relaxed through a documented, audited process that requires approval from a designated incident commander and automatically reinstates full protection after a configurable timeout. Emergency bypasses are necessary because AI systems in production occasionally require urgent prompt changes—for example, when a prompt is generating harmful content, when a model endpoint has been deprecated by the provider with short notice, or when a training data leak requires immediate redaction. The bypass mechanism must log every override with the identity of the approver, the justification provided, the specific protections that were relaxed, and the timestamp when protections were reinstated. This audit trail is essential for post-incident review and for compliance frameworks that require evidence of change control even during emergency operations. The automatic reinstatement timeout prevents the common failure mode where protections are disabled during an incident and never re-enabled, leaving the repository in an unprotected state until someone notices days or weeks later.

    • Enforce commit signing and linear history requirements on protected branches to ensure that every commit in the production branch can be cryptographically attributed to a specific engineer and that the commit history provides a clean, bisectable timeline for debugging model behavior regressions. Linear history is particularly valuable in AI codebases because model behavior regressions often manifest days after the causal change was merged, and the ability to perform a clean git bisect across a linear history dramatically reduces the time required to identify which specific commit introduced the regression. Merge commits create diamond patterns in the history graph that complicate bisection and make it ambiguous which parent path contains the regressing change. By requiring rebase-and-merge or squash-and-merge strategies on the protected branch, you guarantee that every commit represents an atomic, testable unit of change that can be independently evaluated for its impact on model behavior.

  3. Build a code review workflow with PR templates for prompt changes and model configurations

    • Build a code review workflow with PR templates for prompt changes and model configurations that capture the context reviewers need to evaluate AI-specific changes effectively

    • Create differentiated pull request templates that automatically activate based on the files changed in the PR, with separate templates for prompt engineering changes, model configuration updates, training data modifications, and infrastructure changes. Each template must capture the domain-specific context that reviewers need to make informed approval decisions. A prompt change template should require the author to specify the evaluation dataset used, the baseline metrics before the change, the expected metrics after the change, the actual metrics observed in CI, and a plain-language description of why the prompt was modified and what behavior change is intended. A model configuration template should require the author to document the resource impact of the change—including GPU memory delta, expected latency change, and cost impact—along with the rollback procedure if the new configuration underperforms. A training data template should require documentation of the data source, any filtering or deduplication applied, the schema version, and a sample of representative records that illustrate the content being added. These templates transform code review from a generic "does this code look right" exercise into a structured evaluation that guides reviewers through the specific risk factors relevant to each change type.
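
A prompt-change template along these lines might live under `.github/PULL_REQUEST_TEMPLATE/` (GitHub supports multiple templates in that directory, though selecting one automatically based on the files changed requires a small bot or workflow on top; the field names here are illustrative):

```markdown
## Prompt Change
- **Evaluation dataset:** <!-- name and version -->
- **Baseline metrics:** <!-- before this change -->
- **Expected metrics:** <!-- predicted impact -->
- **Observed CI metrics:** <!-- link to the evaluation run -->
- **Why this change:** <!-- intended behavior difference, in plain language -->
- **Rollback plan:** <!-- version or pointer to revert to -->
```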

    • Establish review assignment rules that route prompt changes to engineers with demonstrated expertise in the specific model and use case being modified, using CODEOWNERS files that map directory paths to review teams. The CODEOWNERS configuration for an AI project differs from a traditional software project because ownership boundaries follow model and use case boundaries rather than service boundaries. A directory like prompts/summarization/ might be owned by the summarization team regardless of which microservice consumes those prompts, while prompts/content-safety/ might require mandatory review from the trust and safety team regardless of who authored the change. This ownership model ensures that prompt changes are reviewed by people who understand the downstream impact on model behavior, not merely people who are familiar with the file format. The CODEOWNERS file should also designate backup reviewers for each path to prevent single-point-of-failure bottlenecks where a critical prompt directory is owned by one engineer who is unavailable, blocking all changes to that prompt set. Review assignment automation should respect timezone distributions so that changes authored during business hours in one timezone are not blocked waiting for a reviewer who is currently asleep in another timezone—this is achieved by specifying team-based ownership rather than individual ownership in the CODEOWNERS file and ensuring each team has members across the timezones where development activity occurs.
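
A CODEOWNERS fragment following this ownership model might look like the following; the team handles are illustrative, and note that when multiple patterns match a path, the last matching rule takes precedence:

```
# Hypothetical CODEOWNERS; team handles are illustrative.
# The last matching pattern wins when paths overlap.
prompts/                  @org/ml-engineering
prompts/summarization/    @org/summarization-team
prompts/content-safety/   @org/trust-and-safety
configs/models/           @org/ml-platform @org/platform-engineering
data/training/            @org/data-quality
```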

    • Implement a review checklist system embedded in the PR template that requires reviewers to explicitly confirm they have evaluated specific risk factors before approving. For prompt changes, the checklist should include items such as confirming that the evaluation metrics meet or exceed the baseline, verifying that the prompt does not introduce new failure modes for edge cases documented in the known-issues registry, checking that the prompt change is backward-compatible with existing conversation histories or chat sessions, and verifying that the prompt does not inadvertently expose system instructions or model metadata. For model configuration changes, the checklist should include items such as confirming that the resource requests are within the cluster's available capacity, verifying that autoscaling parameters are consistent with the expected traffic patterns, and checking that the model version is supported by the inference runtime deployed in production. These checklists serve two purposes: they prevent reviewers from rubber-stamping approvals without engaging with the AI-specific implications of the change, and they create a documented record of what was evaluated during review, which is invaluable during post-incident analysis when you need to understand why a specific change was approved despite introducing a regression.

    • Design a PR metrics dashboard that tracks review cycle time, approval rates, and revision counts segmented by change type to identify bottlenecks and quality gaps in the review process. This dashboard should surface metrics such as the median time from PR creation to first review for each change type, the percentage of PRs that require revision after initial review, the distribution of review comments by category (correctness, style, performance, safety), and the correlation between review thoroughness and post-merge incident rates. These metrics enable engineering leadership to make data-driven decisions about review process improvements—for example, if prompt change PRs consistently require three rounds of revision because the initial evaluation metrics are ambiguous, that signals a need for better CI integration that surfaces clearer evaluation results directly in the PR. If model configuration PRs have a significantly longer time-to-first-review than other change types, that signals a need for additional reviewers with infrastructure expertise or better documentation that enables non-specialists to review configuration changes safely.

  4. Manage merge conflicts in JSONL training data and prompt template files

    • Manage merge conflicts in JSONL training data and prompt template files using strategies purpose-built for AI artifacts where standard three-way merge algorithms produce semantically invalid results

    • Develop custom merge drivers for JSONL files that understand the line-delimited JSON structure and can perform record-level merging rather than relying on Git's default line-based diff algorithm. Standard Git merging treats JSONL files as plain text and applies line-level three-way merge, which frequently produces syntactically invalid JSON when two branches add records near the same location in the file, reorder records for different preprocessing requirements, or modify different fields within the same record. A custom merge driver registered in .gitattributes can parse each line as an independent JSON object, identify records by a designated key field such as a unique identifier or hash, and perform a semantic merge that correctly handles concurrent additions without producing duplicate records or malformed JSON. The merge driver should also detect and flag conflicting modifications to the same record—for example, when one branch updates the completion field of a training example while another branch updates the system_prompt field of the same example—presenting these as meaningful conflicts that require human resolution rather than silently concatenating incompatible changes. This approach transforms JSONL merge conflicts from opaque text-level diffs that require line-by-line reconstruction into structured, record-level decisions that engineers can resolve with confidence.
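
A record-level three-way merge along these lines can be sketched in Python. The `id` key field and the `jsonl` driver name are assumptions; a real driver would wrap this function in a CLI that Git invokes after registration via `.gitattributes` (`*.jsonl merge=jsonl`) and a matching `merge.jsonl.driver` entry in Git config:

```python
import json

def merge_jsonl(base: str, ours: str, theirs: str) -> tuple[list[dict], list[str]]:
    """Three-way merge of JSONL text, keyed on each record's 'id' field.

    Returns (merged records, ids of conflicting records needing human review).
    """
    def load(text: str) -> dict:
        records = {}
        for line in text.splitlines():
            if line.strip():
                record = json.loads(line)
                records[record["id"]] = record
        return records

    b, o, t = load(base), load(ours), load(theirs)
    merged, conflicts = {}, []
    for rid in b.keys() | o.keys() | t.keys():
        ov, tv = o.get(rid), t.get(rid)
        if ov == tv:
            keep = ov             # both sides agree (including both deleted)
        elif tv == b.get(rid):
            keep = ov             # theirs untouched -> take our change
        elif ov == b.get(rid):
            keep = tv             # ours untouched -> take their change
        else:
            conflicts.append(rid) # both modified the same record
            keep = ov             # provisionally keep ours; flag for review
        if keep is not None:
            merged[rid] = keep
    return list(merged.values()), conflicts
```

Concurrent additions of distinct records merge cleanly because each new `id` appears on only one side, while two different edits to the same record surface as a named conflict rather than interleaved, malformed JSON.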

    • Implement prompt template versioning strategies that prevent merge conflicts by treating prompt versions as immutable artifacts rather than mutable files. Instead of editing summarization_prompt.txt in place—which guarantees conflicts when multiple engineers iterate on the same prompt simultaneously—adopt a versioned naming scheme such as summarization_prompt_v3.txt where each new version creates a new file and the active version is referenced by a pointer file or configuration entry. This immutable versioning approach eliminates merge conflicts entirely for prompt content because two engineers creating v4 and v5 independently are creating distinct files that can both exist in the repository simultaneously, with the version pointer being the only potential conflict point—and a single-line pointer conflict is trivially resolvable compared to a multi-paragraph natural language merge conflict where Git's context-free diff algorithm cannot distinguish meaningful prompt engineering choices from incidental text rearrangement. The versioned approach also provides a complete history of every prompt iteration with the ability to instantly revert to any previous version by updating the pointer, which is invaluable for A/B testing and for diagnosing behavior regressions that may be caused by prompt changes deployed days earlier.
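
The pointer-file pattern can be sketched as follows; the `active.json` filename and its schema are illustrative choices, not a standard:

```python
import json
from pathlib import Path

def active_prompt_path(prompt_dir: Path) -> Path:
    """Resolve the active immutable prompt version via a pointer file.

    Expects prompt_dir/active.json to contain e.g.
    {"active": "summarization_prompt_v3.txt"}. Reverting a bad prompt is a
    one-line change to this file, never a natural-language text merge.
    """
    pointer = json.loads((prompt_dir / "active.json").read_text())
    return prompt_dir / pointer["active"]
```

Two engineers creating `summarization_prompt_v4.txt` and `summarization_prompt_v5.txt` in parallel never conflict on content; at worst they conflict on the single pointer line, which is resolved by choosing one version name.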

    • Configure pre-commit hooks that validate AI artifact integrity before commits are created, catching format violations, schema mismatches, and semantic errors at the earliest possible point in the development workflow. Pre-commit hooks for AI codebases should include a JSONL validator that parses every line of every .jsonl file and rejects commits containing malformed JSON records, a prompt template linter that checks for common prompt engineering anti-patterns such as inconsistent delimiter usage or missing system instruction boundaries, a model configuration validator that verifies referenced model names against a registry of available models and checks that hyperparameter values fall within documented valid ranges, and a large file detector that prevents accidental commits of model weights, embedding files, or training datasets that exceed the repository's size policy. These hooks must execute quickly—ideally under five seconds for the full suite—because slow pre-commit hooks incentivize engineers to bypass them with the --no-verify flag, which defeats the purpose of early validation entirely. Achieving sub-five-second execution requires careful optimization: validate only changed files rather than the entire repository, use compiled validators rather than interpreted scripts for schema checking, and cache validation results for unchanged files across consecutive commit attempts.
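
The JSONL validator piece of such a hook suite might look like the sketch below; the required-key schema is an assumption to adapt per dataset, and a pre-commit framework would invoke it with only the staged filenames as arguments so that just the changed files are checked:

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}  # schema is an assumption; adjust per dataset

def validate_jsonl(lines: list[str]) -> list[str]:
    """Return one error string per malformed or schema-violating record."""
    errors = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as e:
            errors.append(f"line {n}: malformed JSON ({e.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {n}: record is not a JSON object")
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            errors.append(f"line {n}: missing keys {sorted(missing)}")
    return errors
```

A nonzero count of errors maps to a nonzero hook exit code, which is what blocks the commit.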

    • Establish conflict resolution protocols that define how AI-specific merge conflicts should be resolved, including escalation paths for conflicts that require domain expertise beyond the merging engineer's competence. Not all merge conflicts are equal in AI codebases—a conflict in a utility function can be resolved by any competent engineer, but a conflict in a prompt template that affects content safety requires resolution by someone with prompt engineering expertise and knowledge of the safety evaluation framework. The conflict resolution protocol should classify conflicts by their domain and severity: training data conflicts that involve record additions can typically be resolved by accepting both additions and running a deduplication pass, prompt template conflicts require review by the prompt owner identified in the CODEOWNERS file, model configuration conflicts require review by the infrastructure team to verify resource compatibility, and any conflict that affects content safety prompts must be escalated to the trust and safety team regardless of the apparent simplicity of the resolution. The protocol should also mandate that all conflict resolutions in prompt templates and training data are followed by a full evaluation run to verify that the resolved version meets quality baselines, because human intuition about which conflict resolution preserves the intended behavior is unreliable when the artifacts being merged are natural language prompts or statistical training examples rather than deterministic code.
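
The "accept both additions, then deduplicate" step for training-data conflicts can be sketched as a canonicalize-and-hash pass; sorting keys before comparison is one reasonable canonicalization choice, assumed here:

```python
import json

def dedup_records(lines: list[str]) -> list[str]:
    """Drop records whose canonicalized JSON is identical, keeping first seen."""
    seen, kept = set(), []
    for line in lines:
        if not line.strip():
            continue
        # Canonicalize so differing key order does not defeat deduplication.
        canonical = json.dumps(json.loads(line), sort_keys=True)
        if canonical not in seen:
            seen.add(canonical)
            kept.append(canonical)
    return kept
```

This removes only exact semantic duplicates; near-duplicates (say, the same example with trivially reworded completions) still require the evaluation run the protocol mandates after any resolution.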

Prerequisites

Students must have completed Chapter 1: Version Control Fundamentals for AI Artifacts and Chapter 2: Repository Structure for ML Projects, which establish core Git operations (clone, commit, push, pull) and the standard directory layout for AI codebases including prompts/, configs/, and data/ directories. Familiarity with basic command-line Git, GitHub or GitLab UI navigation, and YAML syntax for CI configuration files is assumed. Access to a GKE cluster with kubectl configured is required for the branch protection and deployment gate exercises.

Key Terminology

Trunk-Based Development
A branching strategy where all engineers merge small, frequent changes directly into a single main branch, reducing integration risk and enabling continuous delivery of AI model updates and pipeline changes.
Feature Branch
A short-lived branch created from the main trunk to isolate a discrete unit of work—such as a new prompt template or model configuration change—that undergoes review before merging back within one to three days.
Branch Protection Rule
A repository-level policy enforced by platforms like GitHub or GitLab that prevents direct pushes to critical branches, requiring conditions such as passing status checks, approved reviews, and signed commits before any merge is accepted.
Pull Request
A formal request to merge changes from one branch into another that serves as the primary unit of code review, discussion, and automated validation in collaborative AI development workflows.
Code Review
The systematic examination of source code changes by one or more peers to catch defects, enforce team standards, verify prompt engineering quality, and share domain knowledge before changes reach the production branch.
Merge Commit Strategy
A merge approach that creates a dedicated commit joining two branch histories, preserving the complete development timeline of a feature branch including all intermediate commits as a separate lineage in the repository graph.
Rebase
A Git operation that replays commits from a feature branch onto the tip of the target branch, producing a linear commit history without merge commits but rewriting commit hashes in the process, which requires force-pushing if the branch was previously shared.
Git Hook
A script that Git executes automatically at specific lifecycle events—such as pre-commit, pre-push, or post-merge—enabling teams to enforce linting, run model validation checks, or block commits containing secrets before they enter the repository.
Pre-Commit Framework
An open-source tool that manages and executes language-agnostic Git hook plugins, allowing AI teams to configure a declarative **.pre-commit-config.yaml** file that runs formatters, linters, and custom validators on every commit automatically.
JSONL (JSON Lines)
A text format where each line is a valid JSON object, commonly used for AI training datasets and evaluation sets, which creates unique merge conflict challenges because standard line-based diff tools cannot understand the semantic structure of individual records.
Prompt Versioning
The practice of tracking prompt template changes with explicit version identifiers, structured commit messages, and dedicated review processes so that teams can reproduce, compare, and roll back specific prompt behaviors tied to model outputs.
Renovate
An open-source automated dependency update tool that scans repository configuration files, creates pull requests for outdated packages, and supports grouping, scheduling, and auto-merge policies tailored to the specific risk profiles of AI library upgrades.
Dependabot
A GitHub-native service that monitors dependency manifests such as **requirements.txt** and **pyproject.toml**, then opens pull requests when newer versions of packages like PyTorch, Transformers, or LangChain become available.
Status Check
An external validation—typically a CI pipeline job—that reports a pass or fail result on a pull request, which branch protection rules can require before allowing a merge to ensure that model tests, linting, and integration suites have all succeeded.
Squash Merge
A merge strategy that condenses all commits from a feature branch into a single commit on the target branch, producing a clean linear history on main while discarding the granular commit-by-commit development narrative of the source branch.
CODEOWNERS File
A repository configuration file that maps file paths or glob patterns to specific reviewers or teams, automatically assigning review responsibility so that changes to prompt templates, model configs, or training data always route to the appropriate domain experts.
Merge Conflict
A state that occurs when two branches modify the same region of a file in incompatible ways, requiring manual resolution that is particularly error-prone in JSONL datasets and structured configuration files where a single malformed line can silently corrupt training data.
Automated Dependency Update
A workflow pattern where tools like Renovate or Dependabot continuously monitor, propose, and optionally merge package version bumps, reducing the security and compatibility risks of stale AI framework dependencies accumulating unpatched vulnerabilities over time.
