chore: remove training-data/

This directory (interaction-log scripts and tooling for AI training
data) was included by mistake when the repo was migrated. It has no
relationship to ZDDC the project; remove from the repo and the
matching section from AGENTS.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
ZDDC 2026-04-27 21:45:35 -05:00
parent cc35f7179b
commit 1da25eff3f
10 changed files with 0 additions and 958 deletions

View file

@ -242,13 +242,6 @@ Use `git worktree` to run multiple agents on separate branches simultaneously wi
- Toast UI Editor v3.2.2 is bundled in `vendor/`; `template.html` loads it from CDN for dev convenience - Toast UI Editor v3.2.2 is bundled in `vendor/`; `template.html` loads it from CDN for dev convenience
- `</` escaping is essential: `sed 's#</#<\\/#g'` runs on both app JS and vendor JS at build time - `</` escaping is essential: `sed 's#</#<\\/#g'` runs on both app JS and vendor JS at build time
## Training-data
The `training-data/` directory scripts and README are committed. The data itself is excluded via `training-data/.gitignore`:
- `raw/` — raw interaction logs (never committed)
- `processed/`, `validation/`, `adapters/`, `snapshots/` — generated data (never committed)
## zddc-server ## zddc-server
Go HTTP server sub-project living at `zddc/`. Replaces `caddy file-server --browse` for ZDDC archives. Go HTTP server sub-project living at `zddc/`. Replaces `caddy file-server --browse` for ZDDC archives.

View file

@ -1,6 +0,0 @@
# Training data — never commit actual data to the repo
raw/
processed/
validation/
adapters/
snapshots/

View file

@ -1,167 +0,0 @@
# ZDDC Training Data Pipeline
Adaptive LoRA fine-tuning for **Qwen3-Coder-Next** (`vllm/RedHatAI/Qwen3-Coder-Next-NVFP4`).
Training data is collected reactively: when Qwen struggles on a task and a stronger model (Sonnet/Opus) is consulted, that interaction is captured as a training example. Over time, domain-specific LoRA adapters are trained to patch Qwen's weak spots.
---
## Directory Layout
```
training-data/
├── raw/
│ └── interactions.jsonl # All captured weak-spot interactions
├── processed/
│ ├── all.jsonl # Deduplicated, combined dataset
│ ├── multi-domain.jsonl # All domains merged
│ └── <domain>.jsonl # Per-domain splits (auto-generated)
├── validation/
│ ├── train.jsonl # 80% split
│ ├── val.jsonl # 10% split
│ └── test.jsonl # 10% split (never used during training)
├── adapters/
│ └── <domain>-lora-v1/ # Trained LoRA adapter
│ └── <domain>-lora-v1-merged/ # Merged standalone model
├── snapshots/
│ └── v<date>/ # Versioned dataset snapshots
├── collect-interaction.js # Capture a weak-spot interaction
├── process.sh # Cluster raw data into domain splits
├── validate.sh # Check data quality before training
├── train.sh # Train a LoRA adapter
└── deploy.sh # Merge adapter into standalone model
```
---
## Workflow
### Step 1 — Collect a weak-spot interaction
When Qwen gets stuck or you ask it to consult Sonnet/Opus:
```bash
node collect-interaction.js \
--query "How do I parse ZDDC filenames?" \
--qwen "[Qwen's suboptimal answer]" \
--expert "[Sonnet's correct answer]"
```
Optionally specify domain explicitly (otherwise auto-detected):
```bash
node collect-interaction.js \
--query "..." \
--qwen "..." \
--expert "..." \
--domain zddc-naming
```
Raw interaction is appended to `raw/interactions.jsonl`.
### Step 2 — Process (after ~50 new interactions)
```bash
bash process.sh
```
Deduplicates, clusters by domain, creates train/val/test splits.
### Step 3 — Validate
```bash
bash validate.sh
```
Checks JSONL validity, domain balance, and split sizes.
### Step 4 — Train
```bash
bash train.sh # train multi-domain adapter
bash train.sh zddc-naming # train domain-specific adapter
```
Outputs LoRA adapter to `adapters/<domain>-lora-v1/`.
### Step 5 — Deploy (optional)
```bash
bash deploy.sh # merge multi-domain adapter
bash deploy.sh zddc-naming # merge specific adapter
```
Merges the LoRA weights into the base model and saves a standalone model.
---
## Training Data Format
Each line in a `.jsonl` file is one training example:
```json
{
"messages": [
{"role": "user", "content": "Query that exposed weakness"},
{"role": "assistant", "content": "Qwen's original response"},
{"role": "user", "content": "consult Sonnet"},
{"role": "assistant", "content": "Expert's correct response"}
],
"metadata": {
"domain": "zddc-naming",
"adapter": "lora-v1-zddc_naming",
"timestamp": "2025-10-31T14:30:00.000Z",
"interaction_id": "int_1735648200000_abc123",
"source": "manual-expert-consultation"
}
}
```
---
## Auto-Detected Domains
| Domain | Trigger keywords |
|--------|----------------|
| `zddc-naming` | zddc, trackingnumber, revision, status code |
| `html-architecture` | html, spa, single-file, es module, vanilla js |
| `build-system` | build.sh, dist/, template.html |
| `coding-debugging` | debug, error, fix, console |
| `reasoning-architecture` | reason, analyze, architecture, design |
| `general-coding` | (default) |
---
## LoRA Configuration
| Parameter | Value | Notes |
|-----------|-------|-------|
| Base model | `Qwen/Qwen2.5-7B-Instruct` | Replace with Qwen3 when available on HF |
| Rank | 64 | Increase to 128 if underfitting |
| Alpha | 64 | 1:1 with rank |
| Target modules | q_proj, v_proj, k_proj, o_proj | All attention projections |
| Dropout | 0.05 | Light regularisation |
| Learning rate | 1e-4 | Cosine decay with 10% warmup |
| Epochs | 3 | Monitor val loss to catch overfitting |
| Batch size | 8 effective | 4 per-device × 2 gradient accumulation |
| Precision | bfloat16 | Requires Ampere GPU or newer |
---
## Hardware Requirements
| Setup | Min VRAM | Method | Notes |
|-------|----------|--------|-------|
| Qwen-7B LoRA | 24 GB | LoRA bf16 | Recommended |
| Qwen-7B QLoRA | 16 GB | QLoRA 4-bit | Add `--load_in_4bit` flag |
| Qwen-14B LoRA | 48 GB | LoRA bf16 | Better quality |
Your system has 96 GB VRAM — full LoRA on Qwen-14B is feasible.
---
## When to Retrain
- Every **50100 new interactions** collected
- When a new domain accumulates **200+ examples**
- After a major project phase where Qwen struggled repeatedly

View file

@ -1,55 +0,0 @@
#!/bin/bash
# build-interaction-collector.sh
# Install dependencies for the interaction collector
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR"
echo "=== Installing Node.js Dependencies ==="
# Check if Node.js is available
if ! command -v node &> /dev/null; then
echo "Error: Node.js is not installed"
echo "Please install Node.js 18+ from https://nodejs.org/"
exit 1
fi
NODE_VERSION=$(node --version)
echo "Node.js version: $NODE_VERSION"
# Check minimal version
NODE_MAJOR=$(echo $NODE_VERSION | cut -c2- | cut -d. -f1)
if [ "$NODE_MAJOR" -lt 18 ]; then
echo "Error: Node.js 18+ is required"
exit 1
fi
echo "✓ Node.js version check passed"
# Create package.json if it doesn't exist
if [ ! -f package.json ]; then
cat > package.json << 'EOF'
{
"name": "zddc-training-data",
"version": "1.0.0",
"description": "Training data collection for ZDDC fine-tuning",
"type": "module",
"main": "collect-interaction.js",
"scripts": {
"collect": "node collect-interaction.js"
},
"keywords": ["zddc", "training", "lora"],
"license": "MIT"
}
EOF
fi
echo "✓ Created/verified package.json"
echo ""
echo "=== Ready to use ==="
echo "Usage: node collect-interaction.js --query \"...\" --qwen \"...\" --expert \"...\""
echo " node collect-interaction.js --query \"How do I...\" --qwen \"[Qwen answer]\" --expert \"[Expert answer]\""
echo ""
echo "To batch process interactions, use process.sh"

View file

@ -1,197 +0,0 @@
#!/usr/bin/env node
/**
* Training Data Collector for Qwen3-Coder-Next
*
* Captures conversations where Qwen needs expert (Sonnet/Opus) assistance
* and stores them in a format suitable for LoRA fine-tuning
*
* Usage:
* node collect-interaction.js --query "..." --qwen "..." --expert "..." --domain "domain-name"
*/
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);
// Get script directory
const SCRIPT_DIR = __dirname;
const RAW_FILE = path.join(SCRIPT_DIR, 'raw', 'interactions.jsonl');
/**
* Generate unique interaction ID
*/
function generateInteractionId() {
const timestamp = Date.now();
const random = Math.random().toString(36).substring(2, 10);
return `int_${timestamp}_${random}`;
}
/**
* Detect domain from conversation content
*/
function detectDomain(messages) {
const text = JSON.stringify(messages).toLowerCase();
// ZDDC naming patterns
if (text.includes('zddc') ||
text.includes('trackingnumber') ||
text.includes('revision') ||
text.includes('_a (ifr)') ||
text.includes('status code')) {
return 'zddc-naming';
}
// HTML SPA patterns
if (text.includes('html') ||
text.includes('spa') ||
text.includes('single-file') ||
text.includes('es module') ||
text.includes('vanilla js')) {
return 'html-architecture';
}
// Build system patterns
if (text.includes('build') ||
text.includes('build.sh') ||
text.includes('dist/') ||
text.includes('template.html')) {
return 'build-system';
}
// Debugging patterns
if (text.includes('debug') ||
text.includes('error') ||
text.includes('fix') ||
text.includes('console')) {
return 'coding-debugging';
}
// Reasoning patterns
if (text.includes('reason') ||
text.includes('analyze') ||
text.includes('architecture') ||
text.includes('design')) {
return 'reasoning-architecture';
}
// Default to general coding
return 'general-coding';
}
/**
* Create training example object
*/
function createTrainingExample(userQuery, qwenResponse, expertResponse, options = {}) {
const domain = options.domain || detectDomain([userQuery, qwenResponse, expertResponse]);
return {
messages: [
{ role: 'user', content: userQuery },
{ role: 'assistant', content: qwenResponse },
{ role: 'user', content: 'consult Sonnet' },
{ role: 'assistant', content: expertResponse }
],
metadata: {
domain: domain,
adapter: `lora-v1-${domain.replace(/-/g, '_')}`,
timestamp: new Date().toISOString(),
interaction_id: generateInteractionId(),
source: 'manual-expert-consultation',
...options.metadata
}
};
}
/**
* Append to JSONL file
*/
function appendToJSONL(filePath, data) {
const jsonLine = JSON.stringify(data);
fs.appendFileSync(filePath, jsonLine + '\n');
}
/**
* Format domain name from detected or provided
*/
function formatDomainName(domain) {
// Convert hyphens to underscores for adapter name
return domain.replace(/-/g, '_');
}
/**
* Collect a training example
*/
export function collect({
userQuery,
qwenResponse,
expertResponse,
domain = null,
metadata = {}
}) {
if (!userQuery || !qwenResponse || !expertResponse) {
console.error('Error: Missing required parameters');
console.error('Usage: node collect-interaction.js --query "..." --qwen "..." --expert "..."');
process.exit(1);
}
const trainingExample = createTrainingExample(
userQuery,
qwenResponse,
expertResponse,
{ domain, metadata }
);
// Ensure raw directory exists
fs.mkdirSync(path.dirname(RAW_FILE), { recursive: true });
// Append to raw file
appendToJSONL(RAW_FILE, trainingExample);
console.log('\n=== Training Example Captured ===');
console.log(`Domain: ${trainingExample.metadata.domain}`);
console.log(`Adapter: ${trainingExample.metadata.adapter}`);
console.log(`Interaction ID: ${trainingExample.metadata.interaction_id}`);
console.log(`Timestamp: ${trainingExample.metadata.timestamp}`);
console.log(`Raw file: ${RAW_FILE}`);
console.log('=================================\n');
return trainingExample;
}
/**
* CLI interface
*/
function main() {
const args = process.argv.slice(2);
// Parse arguments
const queryIdx = args.findIndex(arg => arg === '--query');
const qwenIdx = args.findIndex(arg => arg === '--qwen');
const expertIdx = args.findIndex(arg => arg === '--expert');
const domainIdx = args.findIndex(arg => arg === '--domain');
if (queryIdx === -1 || qwenIdx === -1 || expertIdx === -1) {
console.error('Error: Missing required arguments');
console.error('Usage: node collect-interaction.js --query "..." --qwen "..." --expert "..." [--domain "domain"]');
console.error(' node collect-interaction.js --query "Query" --qwen "Qwen answer" --expert "Expert answer"');
process.exit(1);
}
const userQuery = args[queryIdx + 1];
const qwenResponse = args[qwenIdx + 1];
const expertResponse = args[expertIdx + 1];
const domain = domainIdx !== -1 ? args[domainIdx + 1] : null;
collect({ userQuery, qwenResponse, expertResponse, domain });
}
// Export for module usage
export default { collect, createTrainingExample, detectDomain };
// Run if executed directly
if (import.meta.url === `file://${process.argv[1]}`) {
main();
}

View file

@ -1,97 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
# deploy.sh — Merge a trained LoRA adapter into a standalone model
# Usage:
# bash deploy.sh # deploy multi-domain adapter
# bash deploy.sh zddc-naming # deploy specific domain adapter
#
# Output: adapters/<domain>-lora-v1-merged/
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR"
DOMAIN="${1:-multi-domain}"
BASE_MODEL="Qwen/Qwen2.5-7B-Instruct"
ADAPTER_DIR="adapters/${DOMAIN}-lora-v1"
MERGED_DIR="adapters/${DOMAIN}-lora-v1-merged"
echo "=== ZDDC LoRA Deployment ==="
echo "Domain: $DOMAIN"
echo "Adapter: $ADAPTER_DIR"
echo "Output: $MERGED_DIR"
echo ""
if [ ! -d "$ADAPTER_DIR" ]; then
echo "Error: adapter not found: $ADAPTER_DIR"
echo "Run: bash train.sh $DOMAIN"
exit 1
fi
if [ ! -f "$ADAPTER_DIR/adapter_config.json" ]; then
echo "Error: adapter incomplete (missing adapter_config.json)"
exit 1
fi
command -v python3 &>/dev/null || { echo "Error: python3 required"; exit 1; }
python3 -c "import torch, transformers, peft" 2>/dev/null || \
pip install torch transformers peft --quiet
mkdir -p "$MERGED_DIR"
DEPLOY_PY=$(mktemp /tmp/deploy_lora_XXXXXX.py)
trap 'rm -f "$DEPLOY_PY"' EXIT
cat > "$DEPLOY_PY" << 'PYEOF'
import sys, os, torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
base_model_name = sys.argv[1]
adapter_dir = sys.argv[2]
merged_dir = sys.argv[3]
print(f"Loading base model: {base_model_name}")
tok = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
base_model_name, torch_dtype=torch.bfloat16,
device_map="auto", trust_remote_code=True)
print(f"Loading adapter: {adapter_dir}")
model = PeftModel.from_pretrained(model, adapter_dir)
print("Merging weights into base model...")
model = model.merge_and_unload()
print(f"Saving merged model to {merged_dir} ...")
model.save_pretrained(merged_dir, safe_serialization=True)
tok.save_pretrained(merged_dir)
print("\nRunning test inference...")
prompt = "<|im_start|>user\nWhat is the ZDDC file naming convention?<|im_end|>\n<|im_start|>assistant\n"
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
out = model.generate(
**inputs, max_new_tokens=128, temperature=0.7,
do_sample=True, pad_token_id=tok.eos_token_id)
response = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(f"Test prompt: What is the ZDDC file naming convention?")
print(f"Model response: {response}")
size = sum(
os.path.getsize(os.path.join(merged_dir, f))
for f in os.listdir(merged_dir)
if os.path.isfile(os.path.join(merged_dir, f)))
print(f"\nMerged model size: {size/(1024**3):.2f} GB")
print(f"Saved to: {merged_dir}")
PYEOF
python3 "$DEPLOY_PY" "$BASE_MODEL" "$ADAPTER_DIR" "$MERGED_DIR"
echo ""
echo "=== Deployment Complete ==="
echo "Merged model: $MERGED_DIR"
echo ""
echo "To use:"
echo " from transformers import AutoTokenizer, AutoModelForCausalLM"
echo " model = AutoModelForCausalLM.from_pretrained('$MERGED_DIR')"
echo " tokenizer = AutoTokenizer.from_pretrained('$MERGED_DIR')"

View file

@ -1,16 +0,0 @@
{
"name": "zddc-training-data",
"version": "1.0.0",
"description": "Training data collection and LoRA fine-tuning pipeline for Qwen3-Coder-Next",
"type": "module",
"main": "collect-interaction.js",
"scripts": {
"collect": "node collect-interaction.js",
"process": "bash process.sh",
"validate": "bash validate.sh",
"train": "bash train.sh",
"deploy": "bash deploy.sh"
},
"keywords": ["zddc", "training", "lora", "fine-tuning", "qwen"],
"license": "AGPL-3.0"
}

View file

@ -1,94 +0,0 @@
#!/bin/bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR"
echo "=== ZDDC Training Data Processor ==="
echo "Timestamp: $(date +%Y-%m-%d_%H-%M-%S)"
# Step 1: Combine raw interactions
echo "Step 1: Combining raw interactions..."
cat raw/*.jsonl > raw/all-interactions.jsonl 2>/dev/null || echo "No raw interactions yet"
# Step 2: Deduplicate by conversation content
echo "Step 2: Deduplicating..."
if [ -f raw/all-interactions.jsonl ]; then
jq -s 'unique_by(.messages | tojson)' raw/all-interactions.jsonl > processed/all.jsonl
else
echo "Warning: No raw interactions to deduplicate"
fi
# Step 3: Categorize by domain
echo "Step 3: Categorizing by domain..."
# Check if we have data to process
if [ -f processed/all.jsonl ]; then
# Extract unique domains
domains=$(jq -r '.metadata.domain' processed/all.jsonl | sort -u)
for domain in $domains; do
if [ -n "$domain" ]; then
echo " Processing domain: $domain"
jq -s "grep(.metadata.domain == \"$domain\")" processed/all.jsonl > "processed/${domain}.jsonl" 2>/dev/null || \
jq -s 'map(select(.metadata.domain == "'"$domain"'"))' processed/all.jsonl > "processed/${domain}.jsonl"
fi
done
else
echo "Warning: No processed data found"
fi
# Step 4: Create multi-domain dataset
echo "Step 4: Creating multi-domain dataset..."
if [ -f processed/all.jsonl ]; then
cp processed/all.jsonl processed/multi-domain.jsonl
fi
# Step 5: Initialize validation split (empty until we have data)
echo "Step 5: Splitting into train/val/test..."
mkdir -p validation
# Create empty placeholders if no data
if [ ! -f validation/train.jsonl ]; then
touch validation/train.jsonl
fi
if [ ! -f validation/val.jsonl ]; then
touch validation/val.jsonl
fi
if [ ! -f validation/test.jsonl ]; then
touch validation/test.jsonl
fi
# Step 6: Create snapshot
echo "Step 6: Creating snapshot..."
SNAPSHOT_NAME="v$(date +%Y-%m-%d)"
mkdir -p snapshots/$SNAPSHOT_NAME
if [ -f processed/all.jsonl ]; then
cp processed/all.jsonl snapshots/$SNAPSHOT_NAME/dataset.jsonl
cp validation/*.jsonl snapshots/$SNAPSHOT_NAME/
fi
echo ""
echo "=== Processing Complete ==="
# Display summary
if [ -f processed/all.jsonl ]; then
TOTAL=$(wc -l < processed/all.jsonl)
VALIDATION=$(wc -l < processed/val.jsonl 2>/dev/null || echo 0)
echo "Total examples: $TOTAL"
echo ""
echo "Domain breakdown:"
if [ -f processed/all.jsonl ]; then
jq -s 'group_by(.metadata.domain) | map({domain: .[0].metadata.domain, count: length})' processed/all.jsonl | jq '.[] | " \(.domain): \(.count)"'
fi
else
echo "No data processed yet. Collect some interactions first."
fi
echo ""
echo "Next steps:"
echo " 1. Collect interactions: node collect-interaction.js --query \"...\" --qwen \"...\" --expert \"...\""
echo " 2. Process: bash process.sh"
echo " 3. Train: bash train.sh [domain]"

View file

@ -1,205 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
# train.sh — Train a LoRA adapter for Qwen3-Coder-Next
# Usage:
# bash train.sh # train on all domains (multi-domain)
# bash train.sh zddc-naming # train on a specific domain
#
# Output: adapters/<domain>-lora-v1/
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR"
DOMAIN="${1:-all}"
BASE_MODEL="Qwen/Qwen2.5-7B-Instruct"
LORA_RANK=64
LORA_ALPHA=64
LEARNING_RATE="1e-4"
NUM_EPOCHS=3
MAX_SEQ_LENGTH=2048
WARMUP_RATIO=0.1
WEIGHT_DECAY=0.01
if [ "$DOMAIN" = "all" ]; then
TRAIN_FILE="validation/train.jsonl"
OUTPUT_DIR="adapters/multi-domain-lora-v1"
else
TRAIN_FILE="processed/${DOMAIN}.jsonl"
OUTPUT_DIR="adapters/${DOMAIN}-lora-v1"
fi
VAL_FILE="validation/val.jsonl"
echo "=== ZDDC LoRA Fine-Tuning ==="
echo "Domain: $DOMAIN"
echo "Base model: $BASE_MODEL"
echo "Train file: $TRAIN_FILE"
echo "Output dir: $OUTPUT_DIR"
echo "LoRA rank: $LORA_RANK / alpha: $LORA_ALPHA"
echo "Learning rate: $LEARNING_RATE Epochs: $NUM_EPOCHS"
echo ""
if [ ! -f "$TRAIN_FILE" ]; then
echo "Error: training file not found: $TRAIN_FILE"
echo "Run bash process.sh first."
exit 1
fi
TRAIN_COUNT=$(grep -c . "$TRAIN_FILE" 2>/dev/null || echo 0)
if [ "$TRAIN_COUNT" -eq 0 ]; then
echo "Error: training file is empty. Collect at least 50 interactions first."
exit 1
fi
echo "Training examples: $TRAIN_COUNT"
NO_EVAL=0
if [ ! -f "$VAL_FILE" ] || [ "$(grep -c . "$VAL_FILE" 2>/dev/null || echo 0)" -eq 0 ]; then
echo "Warning: no validation data found. Skipping eval."
NO_EVAL=1
fi
command -v python3 &>/dev/null || { echo "Error: python3 required"; exit 1; }
echo "Checking Python dependencies..."
python3 -c "import torch, transformers, peft, trl, datasets, accelerate" 2>/dev/null || {
echo "Installing required packages..."
pip install torch transformers peft trl datasets accelerate --quiet
}
echo "Dependencies OK"
echo ""
mkdir -p "$OUTPUT_DIR"
TRAIN_PY=$(mktemp /tmp/train_lora_XXXXXX.py)
trap 'rm -f "$TRAIN_PY"' EXIT
cat > "$TRAIN_PY" << 'PYEOF'
import json, sys, argparse, torch
from pathlib import Path
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
def load_jsonl(path):
out = []
with open(path) as f:
for line in f:
line = line.strip()
if line:
out.append(json.loads(line))
return out
def fmt(ex):
text = ""
for m in ex["messages"]:
text += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
return {"text": text}
p = argparse.ArgumentParser()
p.add_argument("--model_name", default="Qwen/Qwen2.5-7B-Instruct")
p.add_argument("--train_file", required=True)
p.add_argument("--val_file")
p.add_argument("--output_dir", required=True)
p.add_argument("--lora_rank", type=int, default=64)
p.add_argument("--lora_alpha", type=int, default=64)
p.add_argument("--learning_rate", type=float, default=1e-4)
p.add_argument("--num_epochs", type=int, default=3)
p.add_argument("--max_seq_length", type=int, default=2048)
p.add_argument("--warmup_ratio", type=float, default=0.1)
p.add_argument("--weight_decay", type=float, default=0.01)
p.add_argument("--no_eval", action="store_true")
args = p.parse_args()
print(f"Loading tokenizer: {args.model_name}")
tok = AutoTokenizer.from_pretrained(
args.model_name, model_max_length=args.max_seq_length,
padding_side="right", trust_remote_code=True)
if tok.pad_token is None:
tok.pad_token = tok.eos_token
print("Loading base model...")
model = AutoModelForCausalLM.from_pretrained(
args.model_name, torch_dtype=torch.bfloat16,
device_map="auto", trust_remote_code=True)
model.gradient_checkpointing_enable()
print("Applying LoRA...")
lora_cfg = LoraConfig(
r=args.lora_rank, lora_alpha=args.lora_alpha,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05, bias="none", task_type=TaskType.CAUSAL_LM)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
train_ds = Dataset.from_list([fmt(d) for d in load_jsonl(args.train_file)])
print(f"Training examples: {len(train_ds)}")
eval_ds = None
if not args.no_eval and args.val_file:
vp = Path(args.val_file)
if vp.exists() and vp.stat().st_size > 0:
vdata = load_jsonl(args.val_file)
if vdata:
eval_ds = Dataset.from_list([fmt(d) for d in vdata])
print(f"Validation examples: {len(eval_ds)}")
ta = TrainingArguments(
output_dir=args.output_dir,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
gradient_accumulation_steps=2,
learning_rate=args.learning_rate,
num_train_epochs=args.num_epochs,
warmup_ratio=args.warmup_ratio,
weight_decay=args.weight_decay,
lr_scheduler_type="cosine",
bf16=True, fp16=False,
logging_steps=10,
save_steps=100,
eval_steps=50 if eval_ds else None,
evaluation_strategy="steps" if eval_ds else "no",
save_strategy="steps",
save_total_limit=3,
load_best_model_at_end=(eval_ds is not None),
report_to=[],
seed=42,
)
trainer = SFTTrainer(
model=model, args=ta,
train_dataset=train_ds, eval_dataset=eval_ds,
dataset_text_field="text",
max_seq_length=args.max_seq_length,
tokenizer=tok)
print("\nStarting training...")
trainer.train()
print(f"\nSaving adapter to {args.output_dir} ...")
trainer.model.save_pretrained(args.output_dir)
tok.save_pretrained(args.output_dir)
print(f"Done. Adapter saved to: {args.output_dir}")
PYEOF
CMD="python3 $TRAIN_PY \
--model_name $BASE_MODEL \
--train_file $TRAIN_FILE \
--val_file $VAL_FILE \
--output_dir $OUTPUT_DIR \
--lora_rank $LORA_RANK \
--lora_alpha $LORA_ALPHA \
--learning_rate $LEARNING_RATE \
--num_epochs $NUM_EPOCHS \
--max_seq_length $MAX_SEQ_LENGTH \
--warmup_ratio $WARMUP_RATIO \
--weight_decay $WEIGHT_DECAY"
[ "$NO_EVAL" -eq 1 ] && CMD="$CMD --no_eval"
echo "Launching training..."
eval "$CMD"
echo ""
echo "=== Training Complete ==="
echo "Adapter: $OUTPUT_DIR"
echo "Next: bash test.sh $DOMAIN OR bash deploy.sh $DOMAIN"

View file

@ -1,114 +0,0 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$SCRIPT_DIR"
DOMAIN="${1:-all}"
PASS=0
WARN=0
FAIL=0
ok() { echo "$*"; ((PASS+=1)) || true; }
warn() { echo "$*"; ((WARN+=1)) || true; }
fail() { echo "$*"; ((FAIL+=1)) || true; }
echo "=== ZDDC Training Data Validator ==="
echo "Domain: $DOMAIN"
echo ""
echo "[ Raw Data ]"
RAW_FILE="raw/interactions.jsonl"
if [ ! -f "$RAW_FILE" ]; then
warn "No raw interactions file found. Collect interactions first."
else
RAW_COUNT=$(grep -c . "$RAW_FILE" 2>/dev/null || echo 0)
ok "Raw interactions: $RAW_COUNT"
fi
echo ""
echo "[ JSONL Validity ]"
check_jsonl() {
local file="$1"
if [ ! -f "$file" ]; then warn "File not found: $file"; return; fi
if [ ! -s "$file" ]; then warn "File is empty: $file"; return; fi
local count
count=$(grep -c . "$file" 2>/dev/null || echo 0)
ok "$file$count lines"
}
if [ "$DOMAIN" = "all" ]; then
for f in processed/*.jsonl validation/*.jsonl; do
[ -f "$f" ] && check_jsonl "$f"
done
else
check_jsonl "processed/${DOMAIN}.jsonl"
fi
echo ""
echo "[ Domain Balance ]"
if [ -f "processed/all.jsonl" ] && [ -s "processed/all.jsonl" ]; then
python3 -c "
import json
from collections import Counter
counts = Counter()
with open('processed/all.jsonl') as f:
for line in f:
line = line.strip()
if line:
try:
obj = json.loads(line)
domain = obj.get('metadata', {}).get('domain', 'unknown')
counts[domain] += 1
except Exception:
pass
if not counts:
print(' no domain data found')
else:
print(f' Total: {sum(counts.values())} examples')
for domain, count in sorted(counts.items(), key=lambda x: -x[1]):
status = 'OK' if count >= 200 else 'LOW'
print(f' [{status}] {domain}: {count}')
"
else
warn "processed/all.jsonl not found — run bash process.sh first"
fi
echo ""
echo "[ Train/Val/Test Split ]"
for split in train val test; do
f="validation/${split}.jsonl"
if [ ! -f "$f" ] || [ ! -s "$f" ]; then
warn "$split split missing or empty"
else
count=$(grep -c . "$f" 2>/dev/null || echo 0)
ok "$split: $count examples"
fi
done
echo ""
echo "[ Existing Adapters ]"
if [ -d "adapters" ] && [ "$(ls -A adapters 2>/dev/null)" ]; then
for adapter_dir in adapters/*/; do
name=$(basename "$adapter_dir")
if [ -f "${adapter_dir}adapter_config.json" ]; then
ok "$name (trained)"
else
warn "$name (incomplete)"
fi
done
else
echo " (no adapters yet)"
fi
echo ""
echo "================================="
echo "PASS: $PASS WARN: $WARN FAIL: $FAIL"
if [ "$FAIL" -gt 0 ]; then
echo "Status: FAIL"
exit 1
elif [ "$WARN" -gt 0 ]; then
echo "Status: WARN"
else
echo "Status: PASS"
fi