ZDDC/training-data/README.md
ZDDC ea385b5366 Initial commit
ZDDC — Zero Day Document Control. A file-naming convention plus five
single-file HTML tools (archive, transmittal, classifier, mdedit,
landing) and an optional Go HTTP server (zddc-server) with ACL and a
virtual archive index. Self-contained, offline-capable, dependency-free.

See README.md for an overview, AGENTS.md and ARCHITECTURE.md for the
build/release/architecture detail, bootstrap/README.md for the
two-level deployment install pattern, and zddc/README.md for the
HTTP server.
2026-04-27 11:05:47 -05:00

167 lines
4.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# ZDDC Training Data Pipeline
Adaptive LoRA fine-tuning for **Qwen3-Coder-Next** (`vllm/RedHatAI/Qwen3-Coder-Next-NVFP4`).
Training data is collected reactively: when Qwen struggles on a task and a stronger model (Sonnet/Opus) is consulted, that interaction is captured as a training example. Over time, domain-specific LoRA adapters are trained to patch Qwen's weak spots.
---
## Directory Layout
```
training-data/
├── raw/
│ └── interactions.jsonl # All captured weak-spot interactions
├── processed/
│ ├── all.jsonl # Deduplicated, combined dataset
│ ├── multi-domain.jsonl # All domains merged
│ └── <domain>.jsonl # Per-domain splits (auto-generated)
├── validation/
│ ├── train.jsonl # 80% split
│ ├── val.jsonl # 10% split
│ └── test.jsonl # 10% split (never used during training)
├── adapters/
│ └── <domain>-lora-v1/ # Trained LoRA adapter
│ └── <domain>-lora-v1-merged/ # Merged standalone model
├── snapshots/
│ └── v<date>/ # Versioned dataset snapshots
├── collect-interaction.js # Capture a weak-spot interaction
├── process.sh # Cluster raw data into domain splits
├── validate.sh # Check data quality before training
├── train.sh # Train a LoRA adapter
└── deploy.sh # Merge adapter into standalone model
```
---
## Workflow
### Step 1 — Collect a weak-spot interaction
When Qwen gets stuck or you ask it to consult Sonnet/Opus:
```bash
node collect-interaction.js \
--query "How do I parse ZDDC filenames?" \
--qwen "[Qwen's suboptimal answer]" \
--expert "[Sonnet's correct answer]"
```
Optionally specify domain explicitly (otherwise auto-detected):
```bash
node collect-interaction.js \
--query "..." \
--qwen "..." \
--expert "..." \
--domain zddc-naming
```
Raw interaction is appended to `raw/interactions.jsonl`.
### Step 2 — Process (after ~50 new interactions)
```bash
bash process.sh
```
Deduplicates, clusters by domain, creates train/val/test splits.
### Step 3 — Validate
```bash
bash validate.sh
```
Checks JSONL validity, domain balance, and split sizes.
### Step 4 — Train
```bash
bash train.sh # train multi-domain adapter
bash train.sh zddc-naming # train domain-specific adapter
```
Outputs LoRA adapter to `adapters/<domain>-lora-v1/`.
### Step 5 — Deploy (optional)
```bash
bash deploy.sh # merge multi-domain adapter
bash deploy.sh zddc-naming # merge specific adapter
```
Merges the LoRA weights into the base model and saves a standalone model.
---
## Training Data Format
Each line in a `.jsonl` file is one training example:
```json
{
"messages": [
{"role": "user", "content": "Query that exposed weakness"},
{"role": "assistant", "content": "Qwen's original response"},
{"role": "user", "content": "consult Sonnet"},
{"role": "assistant", "content": "Expert's correct response"}
],
"metadata": {
"domain": "zddc-naming",
"adapter": "lora-v1-zddc_naming",
"timestamp": "2025-10-31T14:30:00.000Z",
"interaction_id": "int_1735648200000_abc123",
"source": "manual-expert-consultation"
}
}
```
---
## Auto-Detected Domains
| Domain | Trigger keywords |
|--------|----------------|
| `zddc-naming` | zddc, trackingnumber, revision, status code |
| `html-architecture` | html, spa, single-file, es module, vanilla js |
| `build-system` | build.sh, dist/, template.html |
| `coding-debugging` | debug, error, fix, console |
| `reasoning-architecture` | reason, analyze, architecture, design |
| `general-coding` | (default) |
---
## LoRA Configuration
| Parameter | Value | Notes |
|-----------|-------|-------|
| Base model | `Qwen/Qwen2.5-7B-Instruct` | Replace with Qwen3 when available on HF |
| Rank | 64 | Increase to 128 if underfitting |
| Alpha | 64 | 1:1 with rank |
| Target modules | q_proj, v_proj, k_proj, o_proj | All attention projections |
| Dropout | 0.05 | Light regularisation |
| Learning rate | 1e-4 | Cosine decay with 10% warmup |
| Epochs | 3 | Monitor val loss to catch overfitting |
| Batch size | 8 effective | 4 per-device × 2 gradient accumulation |
| Precision | bfloat16 | Requires Ampere GPU or newer |
---
## Hardware Requirements
| Setup | Min VRAM | Method | Notes |
|-------|----------|--------|-------|
| Qwen-7B LoRA | 24 GB | LoRA bf16 | Recommended |
| Qwen-7B QLoRA | 16 GB | QLoRA 4-bit | Add `--load_in_4bit` flag |
| Qwen-14B LoRA | 48 GB | LoRA bf16 | Better quality |
Your system has 96 GB VRAM — full LoRA on Qwen-14B is feasible.
---
## When to Retrain
- Every **50100 new interactions** collected
- When a new domain accumulates **200+ examples**
- After a major project phase where Qwen struggled repeatedly