From Governance to Inference: The COMET-to-Skill Builder Pipeline

April 9, 2026 | Reading Time: 12-15 minutes

The Gap Between Governance and Execution

COMET decomposes organizational workflows into discrete tasks and assigns each a delegation level from 1 (fully human) to 5 (fully autonomous AI). The framework excels at classification: it produces structured task specifications with risk scores, RACI assignments, and framework citations. However, once COMET classifies a task at Level 4 or Level 5 -- indicating high AI suitability -- there has been no automated pathway from that classification to a running agent, let alone to an optimized local model that can handle the task at reduced cost. The result is a governance-to-execution gap: well-documented tasks that still require manual agent development, ad hoc prompt engineering, and expensive cloud inference for every invocation.

FORGE, the ARKONA software factory, now addresses this gap with a pipeline that connects COMET output directly to agent construction, training data collection, and local model fine-tuning. The pipeline transforms governance artifacts into production inference capacity.

Pipeline Architecture: Four Stages

The pipeline operates in four discrete stages, each producing artifacts that feed the next.

Stage 1: COMET Task Extraction

COMET produces a structured JSON specification for each task it classifies at Level 4 or Level 5. This specification includes the task name, description, input schema, expected output schema, risk classification, applicable regulatory frameworks (NIST AI RMF, ISO 42001, etc.), and the delegation rationale. The pipeline ingests these specifications directly from COMET's governance database. Only tasks meeting a minimum confidence threshold (currently 0.85) and carrying an acceptable risk score proceed to Stage 2. Tasks involving safety-critical decisions, personally identifiable information requiring human judgment, or novel scenarios without historical precedent are filtered out regardless of their delegation level.
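The Stage 1 gate can be sketched as a simple predicate. This is a minimal illustration, not COMET's actual schema: the field names, the flag names, and the MAX_RISK_SCORE cutoff are assumptions (the text states only the 0.85 confidence threshold and that the risk score must be "acceptable").

```python
from dataclasses import dataclass, field

MIN_CONFIDENCE = 0.85   # threshold stated in the text
MAX_RISK_SCORE = 0.5    # assumed cutoff; the text only says "acceptable"
# Hypothetical flag names for the three exclusion categories.
EXCLUDED_FLAGS = {"safety_critical", "pii_human_judgment", "novel_scenario"}


@dataclass
class TaskSpec:
    name: str
    delegation_level: int   # 1 (fully human) to 5 (fully autonomous AI)
    confidence: float
    risk_score: float
    flags: set = field(default_factory=set)


def qualifies(task: TaskSpec) -> bool:
    """Return True if the task may proceed to Stage 2."""
    if task.delegation_level not in (4, 5):
        return False
    # Exclusions apply regardless of delegation level.
    if task.flags & EXCLUDED_FLAGS:
        return False
    return task.confidence >= MIN_CONFIDENCE and task.risk_score <= MAX_RISK_SCORE


tasks = [
    TaskSpec("invoice-triage", 4, 0.91, 0.2),
    TaskSpec("incident-response", 5, 0.95, 0.3, {"safety_critical"}),
    TaskSpec("contract-summary", 4, 0.80, 0.2),
    TaskSpec("meeting-notes", 3, 0.99, 0.1),
]
eligible = [t.name for t in tasks if qualifies(t)]
```

Checking the exclusion flags before the numeric thresholds keeps the "regardless of delegation level" rule explicit: a flagged task is rejected even when its scores would otherwise pass.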

Stage 2: Agent Construction via the Anthropic Claude Agent SDK

Each qualifying task specification is transformed into a functional agent using the Anthropic Claude Agent SDK (Python). The SDK provides the scaffolding for tool use, memory management, and multi-turn reasoning. The pipeline generates a SKILL.md file for each agent -- a structured specialization document that defines the agent's scope, permitted tools, output format, guardrails, and escalation conditions. The SKILL.md pattern serves as both runtime configuration and documentation: agents read their own SKILL.md at initialization to constrain behavior, and engineers review the same file to audit agent capabilities.

Agent construction follows a deterministic template. Given a COMET task specification, the pipeline:

Generates the SKILL.md from the task's input/output schemas and risk constraints.
Scaffolds the Agent SDK entry point with appropriate tool bindings.
Configures guardrails: token budgets, timeout limits, and output validators derived from the COMET risk score.
Registers the agent with FORGE's agent registry for lifecycle management.
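The first and third steps above might look roughly like the following sketch. The SKILL.md section layout, the spec field names, and the rule that scales guardrails with the risk score are all assumptions made for illustration; the Agent SDK scaffolding and registry calls are omitted rather than guessed at.

```python
def derive_guardrails(risk_score: float) -> dict:
    """Give riskier tasks tighter budgets. The scaling rule is an assumption."""
    clamped = min(max(risk_score, 0.0), 1.0)
    headroom = 1.0 - clamped
    return {
        "max_tokens": int(1000 + 3000 * headroom),
        "timeout_seconds": int(30 + 90 * headroom),
    }


def render_skill_md(spec: dict) -> str:
    """Render a SKILL.md document from a (hypothetical) COMET task spec."""
    guardrails = derive_guardrails(spec["risk_score"])
    lines = [
        f"# Skill: {spec['name']}",
        "",
        "## Scope",
        spec["description"],
        "",
        "## Permitted tools",
        *[f"- {tool}" for tool in spec["tools"]],
        "",
        "## Output format",
        f"JSON object with fields: {', '.join(spec['output_fields'])}",
        "",
        "## Guardrails",
        f"- Token budget: {guardrails['max_tokens']}",
        f"- Timeout: {guardrails['timeout_seconds']} s",
        "",
        "## Escalation",
        *[f"- {cond}" for cond in spec["escalation_conditions"]],
    ]
    return "\n".join(lines) + "\n"


spec = {
    "name": "Invoice triage",
    "description": "Classify incoming invoices by category and urgency.",
    "risk_score": 0.5,
    "tools": ["read_invoice", "lookup_vendor"],
    "output_fields": ["category", "urgency"],
    "escalation_conditions": ["Vendor not found", "Amount above approval limit"],
}
skill_md = render_skill_md(spec)
```

Because the same rendered file is read by the agent at initialization and by reviewers during audits, generating it from a single function keeps the two views from drifting apart.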

At this stage, the agent runs on Claude and handles all task instances via cloud inference. This is intentional: the cloud phase serves as both a production deployment and a data collection mechanism.

Stage 3: Training Data Accumulation and Curation

Every agent invocation produces a structured log: the input context, the agent's chain-of-thought reasoning, tool calls, and the final output. These logs are stored in a normalized format suitable for supervised fine-tuning. The pipeline applies automated quality filters before any log enters the training corpus:

Output validation: The agent's output must pass the same schema validators defined in the COMET specification.
Confidence scoring: Invocations where the agent expressed low confidence or triggered escalation are excluded from the training set but retained for analysis.
Human feedback integration: When human reviewers override or correct agent outputs (captured through COMET's copilot mode), the corrected version replaces the original in the training corpus.
Deduplication: Near-duplicate input/output pairs are collapsed to prevent overfitting on common cases.
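The four filters could be chained as in the sketch below. The log field names and the 0.7 confidence cutoff are assumptions, and the deduplication step is a deliberate simplification: it collapses pairs that match after case and whitespace normalization, whereas a production near-duplicate detector would more likely compare embeddings or MinHash signatures.

```python
import hashlib
import json


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different texts match."""
    return " ".join(text.lower().split())


def curate(logs, validate, min_confidence=0.7):
    """Split logs into (training corpus, logs held back for analysis)."""
    corpus, held_for_analysis, seen = [], [], set()
    for log in logs:
        # Confidence scoring: exclude from training, retain for analysis.
        if log.get("escalated") or log["confidence"] < min_confidence:
            held_for_analysis.append(log)
            continue
        # Human feedback: a reviewer's correction replaces the original output.
        output = log.get("human_correction") or log["output"]
        # Output validation: same validator the task specification defines.
        if not validate(output):
            continue
        # Deduplication on the normalized input/output pair.
        pair = json.dumps([normalize(log["input"]), normalize(output)])
        key = hashlib.sha256(pair.encode()).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        corpus.append({"input": log["input"], "output": output})
    return corpus, held_for_analysis


def is_valid(output: str) -> bool:
    """Stand-in schema check: output must be a JSON object with 'category'."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and "category" in data


logs = [
    {"input": "Classify ticket 1", "output": '{"category": "billing"}', "confidence": 0.95},
    {"input": "classify   ticket 1", "output": '{"category": "billing"}', "confidence": 0.90},
    {"input": "Classify ticket 2", "output": '{"category": "bilng"}', "confidence": 0.90,
     "human_correction": '{"category": "billing"}'},
    {"input": "Classify ticket 3", "output": "{}", "confidence": 0.40},
    {"input": "Classify ticket 4", "output": "not json", "confidence": 0.90},
]
corpus, held = curate(logs, is_valid)
```

In this run the second log is dropped as a duplicate of the first, the third enters the corpus with the reviewer's correction, the fourth is held for analysis, and the fifth fails validation.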

The pipeline requires a minimum of 1,000 validated examples before a task becomes eligible for fine-tuning. For complex tasks with high output variability, the threshold is higher -- typically 2,500 to 5,000 examples -- to ensure adequate coverage of edge cases.

Stage 4: Local Model Fine-Tuning

Once the training data threshold is met, the pipeline initiates fine-tuning on local hardware. The target infrastructure is a dual NVIDIA Tesla P40 configuration (24 GB VRAM each). This hardware constraint dictates the fine-tuning approach and model selection.

Fine-tuning framework: The pipeline uses Axolotl or LLaMA-Factory with QLoRA (Quantized Low-Rank Adaptation). QLoRA enables fine-tuning of models that would otherwise exceed available VRAM by quantizing the frozen base model to 4-bit precision and training only the low-rank adapter weights. This approach comes close to full fine-tuning quality at a fraction of the memory cost. Unsloth is explicitly excluded from the toolchain: it requires CUDA compute capability 7.0 or higher (Volta or newer), and the P40 is Pascal (compute capability 6.1).

Model selection: The pipeline targets US-origin foundation models exclusively. The current candidate list:

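NVIDIA Nemotron 3 Nano (30B MoE, 3.6B active parameters): The primary target for most tasks. The mixture-of-experts architecture means only 3.6B parameters are active per token during inference, which keeps per-token compute low enough for a single P40 while providing the knowledge breadth of a much larger model. All 30B parameters must still reside in memory, so fitting the model in 24 GB depends on quantized weights.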
Llama-Nemotron 8B: A strong general-purpose option for tasks requiring dense reasoning without MoE complexity.
Meta Llama 3.1 8B: Well-established, extensive community fine-tuning ecosystem, strong baseline performance across task types.
Google Gemma 2 9B: Competitive performance at the 9B scale, particularly for structured output generation tasks.
Microsoft Phi-4: Compact but capable, suitable for tasks where inference latency is the primary constraint.

For models exceeding single-GPU VRAM during fine-tuning, DeepSpeed ZeRO-3 partitions optimizer states, gradients, and parameters across both P40s. This enables fine-tuning of the full Nemotron 3 Nano 30B (even with MoE, the full parameter set must be loaded during training) and provides headroom for future model sizes.
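A back-of-the-envelope estimate shows why the 30B model sits near the edge of this hardware. The sketch counts base-model weights only; it ignores quantization metadata, LoRA adapters, optimizer states, activations, and CUDA context, all of which add to the real footprint, and it treats the ZeRO-3 split as perfectly even, which is an idealization.

```python
P40_VRAM_GB = 24      # per card, as stated in the text
NUM_GPUS = 2


def weight_memory_gb(total_params: float, bits_per_param: int) -> float:
    """Memory for the raw weights alone, in decimal gigabytes."""
    return total_params * bits_per_param / 8 / 1e9


nano_fp16 = weight_memory_gb(30e9, 16)   # unquantized half precision
nano_4bit = weight_memory_gb(30e9, 4)    # QLoRA-style 4-bit base model
per_gpu_4bit = nano_4bit / NUM_GPUS      # idealized even ZeRO-3 partition

print(f"fp16 weights:  {nano_fp16:.1f} GB")
print(f"4-bit weights: {nano_4bit:.1f} GB")
print(f"4-bit per GPU: {per_gpu_4bit:.1f} GB of {P40_VRAM_GB} GB")
```

At half precision the weights alone exceed the combined VRAM of both cards, while the 4-bit weights leave room on each card for everything the estimate leaves out.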

Training configuration: The pipeline uses a standardized training recipe: 3-5 epochs over the curated dataset, learning rate of 2e-4 with cosine annealing, LoRA rank of 64, and alpha of 128. Evaluation occurs every 100 steps against a held-out validation set (15% of the training corpus). Training halts automatically if validation loss increases for three consecutive evaluation checkpoints.
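The halting rule can be expressed in a few lines. This sketch takes the text literally, stopping when validation loss has risen at each of the last three evaluation checkpoints; the RECIPE dictionary simply restates the values above under key names chosen for illustration, not the keys of any particular framework.

```python
# Recipe values from the text; key names are illustrative only.
RECIPE = {
    "epochs": (3, 5),
    "learning_rate": 2e-4,
    "lr_schedule": "cosine",
    "lora_rank": 64,
    "lora_alpha": 128,
    "eval_every_steps": 100,
    "validation_fraction": 0.15,
    "patience": 3,
}


def should_stop(val_losses, patience=RECIPE["patience"]):
    """True when loss rose at each of the last `patience` checkpoints."""
    if len(val_losses) < patience + 1:
        return False
    recent = val_losses[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


rising = [1.00, 0.80, 0.70, 0.72, 0.75, 0.80]
noisy = [1.00, 0.80, 0.70, 0.72, 0.71, 0.80]
stop_on_rising = should_stop(rising)
stop_on_noisy = should_stop(noisy)
```

Note that this differs from the more common "no improvement over the best loss" form of patience: under the consecutive-increase rule, the noisy sequence above keeps training even though it never beats its best checkpoint.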

The MuXD Routing Integration

Fine-tuned local models do not replace Claude -- they handle the routine, well-understood instances of each task. MuXD, the hybrid LLM router, manages the routing decision for every incoming request. The routing logic considers three factors:

Task familiarity: If the input falls within the distribution of the fine-tuned model's training data (measured by embedding similarity to the training corpus), route to the local model.
Confidence threshold: The local model's output confidence must exceed a task-specific threshold (derived from the COMET risk score). Below-threshold responses are escalated to Claude.
Novelty detection: Inputs that trigger the novelty detector (out-of-distribution detection via Mahalanobis distance) are routed directly to Claude, bypassing the local model entirely.
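The three factors can be combined as in the following sketch. It is not MuXD's implementation: the function names, the threshold values, the use of the training-set mean embedding as the familiarity reference, and the order of the checks are all assumptions. The Mahalanobis distance is computed against a precomputed inverse covariance matrix of the training embeddings.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mahalanobis(x, mean, inv_cov):
    """sqrt(d^T * inv_cov * d) with d = x - mean."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    n = len(d)
    quad = sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(quad)


def route(embedding, train_mean, inv_cov, run_local,
          min_similarity=0.8, max_distance=3.0, min_confidence=0.75):
    """Return (target, reason), where target is 'local' or 'claude'."""
    # Novelty detection: out-of-distribution inputs bypass the local model.
    if mahalanobis(embedding, train_mean, inv_cov) > max_distance:
        return "claude", "novel input"
    # Task familiarity: similarity to the training distribution.
    if cosine_similarity(embedding, train_mean) < min_similarity:
        return "claude", "unfamiliar input"
    # Confidence threshold: only now is the local model actually run.
    _output, confidence = run_local()
    if confidence < min_confidence:
        return "claude", "low local confidence"
    return "local", "in-distribution and confident"


train_mean = [1.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]   # stand-in inverse covariance
confident = lambda: ("ok", 0.9)
unsure = lambda: ("ok", 0.5)

routine = route([1.0, 0.1], train_mean, identity, confident)
novel = route([5.0, 5.0], train_mean, identity, confident)
unfamiliar = route([0.1, 1.0], train_mean, identity, confident)
escalated = route([1.0, 0.1], train_mean, identity, unsure)
```

Running the two embedding checks before invoking the local model means novel inputs cost no local compute at all, which matches the "bypassing the local model entirely" behavior described above.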

This creates a tiered inference architecture: the local fine-tuned model handles 60-80% of routine invocations at near-zero marginal cost, while Claude handles edge cases, novel scenarios, and high-risk decisions where the governance framework demands maximum capability. MuXD continuously logs routing decisions and outcomes, feeding data back into Stage 3 for ongoing model improvement.

The Feedback Loop

The pipeline is not a one-shot process. It operates as a continuous feedback loop:

COMET classifies new tasks or reclassifies existing ones as organizational workflows evolve.
New agents are constructed automatically for newly qualifying tasks.
Agent logs from both Claude and local model inference accumulate in the training corpus.
Periodic retraining incorporates new examples, corrected outputs, and distribution shifts.
MuXD routing thresholds are recalibrated based on the retrained model's performance characteristics.

Over time, the local model's coverage expands: tasks that initially required 100% cloud inference gradually shift to 80% local / 20% cloud, then 90/10, as the fine-tuned model encounters and learns from more diverse inputs. The governance layer (COMET) retains authority over which tasks are eligible for this progression and can revoke local model authority if performance degrades or risk conditions change.

Why This Matters

The COMET-to-Skill Builder pipeline closes a fundamental gap in AI-governed organizations. Without it, governance produces documentation; with it, governance produces running infrastructure. The pipeline ensures that every AI-suitable task identified by COMET follows a deterministic path to optimized execution: first as a governed cloud agent, then as a cost-optimized local model, always with routing intelligence that preserves quality on edge cases.

The economic impact is significant. Cloud inference costs for high-volume tasks (thousands of invocations per day) can dominate operational budgets. Shifting 70% of that volume to local inference -- on hardware already provisioned and amortized -- cuts blended per-task inference cost by roughly 70% when local marginal cost is near zero, and the reduction approaches an order of magnitude as the local share nears 90%, while maintaining the quality and governance guarantees that COMET established at the outset.
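The size of the saving depends on two inputs: the share of traffic served locally and the local marginal cost. A small helper makes the arithmetic explicit; the cost figures are placeholders in arbitrary units, and treating local marginal cost as zero is an assumption that ignores power and maintenance.

```python
def blended_cost(cloud_cost: float, local_cost: float, local_share: float) -> float:
    """Average per-task cost when `local_share` of traffic is served locally."""
    return local_share * local_cost + (1.0 - local_share) * cloud_cost


def reduction_factor(cloud_cost: float, local_cost: float, local_share: float) -> float:
    """How many times cheaper the blended cost is than all-cloud inference."""
    return cloud_cost / blended_cost(cloud_cost, local_cost, local_share)


CLOUD = 1.0   # placeholder unit cost per cloud invocation
LOCAL = 0.0   # assumed near-zero marginal cost on amortized hardware

at_70 = reduction_factor(CLOUD, LOCAL, 0.70)
at_90 = reduction_factor(CLOUD, LOCAL, 0.90)
print(f"70% local: {at_70:.1f}x cheaper")
print(f"90% local: {at_90:.1f}x cheaper")
```

Under these assumptions a 70% local share leaves about 30% of the original spend (roughly 3.3x cheaper), and a tenfold reduction requires a local share of about 90%, the level the feedback loop is expected to reach over time.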

The pipeline also addresses a practical engineering concern: agent proliferation. As COMET decomposes more workflows, the number of agents grows. Without a systematic approach to specialization and cost optimization, each new agent adds linearly to cloud inference costs. The pipeline flattens that cost curve: early invocations are expensive (cloud inference + data collection), but marginal cost drops sharply once the fine-tuned model is deployed.

Current Status and Next Steps

As of April 2026, all four stages are operational within FORGE's Skill Builder view. The UI provides a complete management interface across four tabs: Overview (pipeline KPIs and funnel visualization), Skills Catalog (COMET task import with 21 seed templates and live API integration), Training Data (per-skill collection progress with manual quality labeling), and Fine-Tuning (QLoRA job launcher with configurable parameters, active job monitoring, and a model registry for deployed adapters).

Stage 1 (COMET extraction) imports Level 4-5 tasks either from COMET's live API or from seed templates covering all seven FORGE agent roles. Stage 2 (Agent SDK scaffolding) generates SKILL.md-pattern agents with prompt templates derived from COMET task specifications. Stage 3 (data accumulation) is fully automated: the agent execution engine logs input/output pairs whenever a completed task matches an active or collecting skill, with training data capped at 5,000 entries per skill. Stage 4 (fine-tuning) generates Axolotl or LLaMA-Factory training scripts with QLoRA 4-bit quantization, launches them on the dual P40 GPUs via DeepSpeed ZeRO-3, and polls training logs for progress. Graduated models are registered with MuXD's router for automatic local inference routing.

Upcoming work includes automated model selection (choosing the optimal base model for each task type based on benchmark performance), adapter merging for tasks with overlapping skill requirements, and integration with FORGE's CI/CD pipeline for automated model deployment and rollback.
