chore: remove training-data/
This directory (interaction-log scripts and tooling for AI training data) was included by mistake when the repo was migrated. It has no relationship to ZDDC the project; remove from the repo and the matching section from AGENTS.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
cc35f7179b
commit
1da25eff3f
10 changed files with 0 additions and 958 deletions
|
|
@ -242,13 +242,6 @@ Use `git worktree` to run multiple agents on separate branches simultaneously wi
|
|||
- Toast UI Editor v3.2.2 is bundled in `vendor/`; `template.html` loads it from CDN for dev convenience
|
||||
- `</` escaping is essential: `sed 's#</#<\\/#g'` runs on both app JS and vendor JS at build time
|
||||
|
||||
## Training-data
|
||||
|
||||
The `training-data/` directory scripts and README are committed. The data itself is excluded via `training-data/.gitignore`:
|
||||
|
||||
- `raw/` — raw interaction logs (never committed)
|
||||
- `processed/`, `validation/`, `adapters/`, `snapshots/` — generated data (never committed)
|
||||
|
||||
## zddc-server
|
||||
|
||||
Go HTTP server sub-project living at `zddc/`. Replaces `caddy file-server --browse` for ZDDC archives.
|
||||
|
|
|
|||
6
training-data/.gitignore
vendored
6
training-data/.gitignore
vendored
|
|
@ -1,6 +0,0 @@
|
|||
# Training data — never commit actual data to the repo
|
||||
raw/
|
||||
processed/
|
||||
validation/
|
||||
adapters/
|
||||
snapshots/
|
||||
|
|
@ -1,167 +0,0 @@
|
|||
# ZDDC Training Data Pipeline
|
||||
|
||||
Adaptive LoRA fine-tuning for **Qwen3-Coder-Next** (`vllm/RedHatAI/Qwen3-Coder-Next-NVFP4`).
|
||||
|
||||
Training data is collected reactively: when Qwen struggles on a task and a stronger model (Sonnet/Opus) is consulted, that interaction is captured as a training example. Over time, domain-specific LoRA adapters are trained to patch Qwen's weak spots.
|
||||
|
||||
---
|
||||
|
||||
## Directory Layout
|
||||
|
||||
```
|
||||
training-data/
|
||||
├── raw/
|
||||
│ └── interactions.jsonl # All captured weak-spot interactions
|
||||
├── processed/
|
||||
│ ├── all.jsonl # Deduplicated, combined dataset
|
||||
│ ├── multi-domain.jsonl # All domains merged
|
||||
│ └── <domain>.jsonl # Per-domain splits (auto-generated)
|
||||
├── validation/
|
||||
│ ├── train.jsonl # 80% split
|
||||
│ ├── val.jsonl # 10% split
|
||||
│ └── test.jsonl # 10% split (never used during training)
|
||||
├── adapters/
|
||||
│ └── <domain>-lora-v1/ # Trained LoRA adapter
|
||||
│ └── <domain>-lora-v1-merged/ # Merged standalone model
|
||||
├── snapshots/
|
||||
│ └── v<date>/ # Versioned dataset snapshots
|
||||
├── collect-interaction.js # Capture a weak-spot interaction
|
||||
├── process.sh # Cluster raw data into domain splits
|
||||
├── validate.sh # Check data quality before training
|
||||
├── train.sh # Train a LoRA adapter
|
||||
└── deploy.sh # Merge adapter into standalone model
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Workflow
|
||||
|
||||
### Step 1 — Collect a weak-spot interaction
|
||||
|
||||
When Qwen gets stuck or you ask it to consult Sonnet/Opus:
|
||||
|
||||
```bash
|
||||
node collect-interaction.js \
|
||||
--query "How do I parse ZDDC filenames?" \
|
||||
--qwen "[Qwen's suboptimal answer]" \
|
||||
--expert "[Sonnet's correct answer]"
|
||||
```
|
||||
|
||||
Optionally specify domain explicitly (otherwise auto-detected):
|
||||
|
||||
```bash
|
||||
node collect-interaction.js \
|
||||
--query "..." \
|
||||
--qwen "..." \
|
||||
--expert "..." \
|
||||
--domain zddc-naming
|
||||
```
|
||||
|
||||
Raw interaction is appended to `raw/interactions.jsonl`.
|
||||
|
||||
### Step 2 — Process (after ~50 new interactions)
|
||||
|
||||
```bash
|
||||
bash process.sh
|
||||
```
|
||||
|
||||
Deduplicates, clusters by domain, creates train/val/test splits.
|
||||
|
||||
### Step 3 — Validate
|
||||
|
||||
```bash
|
||||
bash validate.sh
|
||||
```
|
||||
|
||||
Checks JSONL validity, domain balance, and split sizes.
|
||||
|
||||
### Step 4 — Train
|
||||
|
||||
```bash
|
||||
bash train.sh # train multi-domain adapter
|
||||
bash train.sh zddc-naming # train domain-specific adapter
|
||||
```
|
||||
|
||||
Outputs LoRA adapter to `adapters/<domain>-lora-v1/`.
|
||||
|
||||
### Step 5 — Deploy (optional)
|
||||
|
||||
```bash
|
||||
bash deploy.sh # merge multi-domain adapter
|
||||
bash deploy.sh zddc-naming # merge specific adapter
|
||||
```
|
||||
|
||||
Merges the LoRA weights into the base model and saves a standalone model.
|
||||
|
||||
---
|
||||
|
||||
## Training Data Format
|
||||
|
||||
Each line in a `.jsonl` file is one training example:
|
||||
|
||||
```json
|
||||
{
|
||||
"messages": [
|
||||
{"role": "user", "content": "Query that exposed weakness"},
|
||||
{"role": "assistant", "content": "Qwen's original response"},
|
||||
{"role": "user", "content": "consult Sonnet"},
|
||||
{"role": "assistant", "content": "Expert's correct response"}
|
||||
],
|
||||
"metadata": {
|
||||
"domain": "zddc-naming",
|
||||
"adapter": "lora-v1-zddc_naming",
|
||||
"timestamp": "2025-10-31T14:30:00.000Z",
|
||||
"interaction_id": "int_1735648200000_abc123",
|
||||
"source": "manual-expert-consultation"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Auto-Detected Domains
|
||||
|
||||
| Domain | Trigger keywords |
|
||||
|--------|----------------|
|
||||
| `zddc-naming` | zddc, trackingnumber, revision, status code |
|
||||
| `html-architecture` | html, spa, single-file, es module, vanilla js |
|
||||
| `build-system` | build.sh, dist/, template.html |
|
||||
| `coding-debugging` | debug, error, fix, console |
|
||||
| `reasoning-architecture` | reason, analyze, architecture, design |
|
||||
| `general-coding` | (default) |
|
||||
|
||||
---
|
||||
|
||||
## LoRA Configuration
|
||||
|
||||
| Parameter | Value | Notes |
|
||||
|-----------|-------|-------|
|
||||
| Base model | `Qwen/Qwen2.5-7B-Instruct` | Replace with Qwen3 when available on HF |
|
||||
| Rank | 64 | Increase to 128 if underfitting |
|
||||
| Alpha | 64 | 1:1 with rank |
|
||||
| Target modules | q_proj, v_proj, k_proj, o_proj | All attention projections |
|
||||
| Dropout | 0.05 | Light regularisation |
|
||||
| Learning rate | 1e-4 | Cosine decay with 10% warmup |
|
||||
| Epochs | 3 | Monitor val loss to catch overfitting |
|
||||
| Batch size | 8 effective | 4 per-device × 2 gradient accumulation |
|
||||
| Precision | bfloat16 | Requires Ampere GPU or newer |
|
||||
|
||||
---
|
||||
|
||||
## Hardware Requirements
|
||||
|
||||
| Setup | Min VRAM | Method | Notes |
|
||||
|-------|----------|--------|-------|
|
||||
| Qwen-7B LoRA | 24 GB | LoRA bf16 | Recommended |
|
||||
| Qwen-7B QLoRA | 16 GB | QLoRA 4-bit | Add `--load_in_4bit` flag |
|
||||
| Qwen-14B LoRA | 48 GB | LoRA bf16 | Better quality |
|
||||
|
||||
Your system has 96 GB VRAM — full LoRA on Qwen-14B is feasible.
|
||||
|
||||
---
|
||||
|
||||
## When to Retrain
|
||||
|
||||
- Every **50–100 new interactions** collected
|
||||
- When a new domain accumulates **200+ examples**
|
||||
- After a major project phase where Qwen struggled repeatedly
|
||||
|
|
@ -1,55 +0,0 @@
|
|||
#!/bin/bash
|
||||
# build-interaction-collector.sh
|
||||
# Install dependencies for the interaction collector
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
cd "$SCRIPT_DIR"
|
||||
|
||||
echo "=== Installing Node.js Dependencies ==="
|
||||
|
||||
# Check if Node.js is available
|
||||
if ! command -v node &> /dev/null; then
|
||||
echo "Error: Node.js is not installed"
|
||||
echo "Please install Node.js 18+ from https://nodejs.org/"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
NODE_VERSION=$(node --version)
|
||||
echo "Node.js version: $NODE_VERSION"
|
||||
|
||||
# Check minimal version
|
||||
NODE_MAJOR=$(echo $NODE_VERSION | cut -c2- | cut -d. -f1)
|
||||
if [ "$NODE_MAJOR" -lt 18 ]; then
|
||||
echo "Error: Node.js 18+ is required"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "✓ Node.js version check passed"
|
||||
|
||||
# Create package.json if it doesn't exist
|
||||
if [ ! -f package.json ]; then
|
||||
cat > package.json << 'EOF'
|
||||
{
|
||||
"name": "zddc-training-data",
|
||||
"version": "1.0.0",
|
||||
"description": "Training data collection for ZDDC fine-tuning",
|
||||
"type": "module",
|
||||
"main": "collect-interaction.js",
|
||||
"scripts": {
|
||||
"collect": "node collect-interaction.js"
|
||||
},
|
||||
"keywords": ["zddc", "training", "lora"],
|
||||
"license": "MIT"
|
||||
}
|
||||
EOF
|
||||
fi
|
||||
|
||||
echo "✓ Created/verified package.json"
|
||||
echo ""
|
||||
echo "=== Ready to use ==="
|
||||
echo "Usage: node collect-interaction.js --query \"...\" --qwen \"...\" --expert \"...\""
|
||||
echo " node collect-interaction.js --query \"How do I...\" --qwen \"[Qwen answer]\" --expert \"[Expert answer]\""
|
||||
echo ""
|
||||
echo "To batch process interactions, use process.sh"
|
||||
|
|
@ -1,197 +0,0 @@
|
|||
#!/usr/bin/env node
|
||||
/**
|
||||
* Training Data Collector for Qwen3-Coder-Next
|
||||
*
|
||||
* Captures conversations where Qwen needs expert (Sonnet/Opus) assistance
|
||||
* and stores them in a format suitable for LoRA fine-tuning
|
||||
*
|
||||
* Usage:
|
||||
* node collect-interaction.js --query "..." --qwen "..." --expert "..." --domain "domain-name"
|
||||
*/
|
||||
|
||||
import fs from 'fs';
|
||||
import path from 'path';
|
||||
import { fileURLToPath } from 'url';
|
||||
|
||||
const __filename = fileURLToPath(import.meta.url);
|
||||
const __dirname = path.dirname(__filename);
|
||||
|
||||
// Get script directory
|
||||
const SCRIPT_DIR = __dirname;
|
||||
const RAW_FILE = path.join(SCRIPT_DIR, 'raw', 'interactions.jsonl');
|
||||
|
||||
/**
|
||||
* Generate unique interaction ID
|
||||
*/
|
||||
function generateInteractionId() {
|
||||
const timestamp = Date.now();
|
||||
const random = Math.random().toString(36).substring(2, 10);
|
||||
return `int_${timestamp}_${random}`;
|
||||
}
|
||||
|
||||
/**
|
||||
* Detect domain from conversation content
|
||||
*/
|
||||
function detectDomain(messages) {
|
||||
const text = JSON.stringify(messages).toLowerCase();
|
||||
|
||||
// ZDDC naming patterns
|
||||
if (text.includes('zddc') ||
|
||||
text.includes('trackingnumber') ||
|
||||
text.includes('revision') ||
|
||||
text.includes('_a (ifr)') ||
|
||||
text.includes('status code')) {
|
||||
return 'zddc-naming';
|
||||
}
|
||||
|
||||
// HTML SPA patterns
|
||||
if (text.includes('html') ||
|
||||
text.includes('spa') ||
|
||||
text.includes('single-file') ||
|
||||
text.includes('es module') ||
|
||||
text.includes('vanilla js')) {
|
||||
return 'html-architecture';
|
||||
}
|
||||
|
||||
// Build system patterns
|
||||
if (text.includes('build') ||
|
||||
text.includes('build.sh') ||
|
||||
text.includes('dist/') ||
|
||||
text.includes('template.html')) {
|
||||
return 'build-system';
|
||||
}
|
||||
|
||||
// Debugging patterns
|
||||
if (text.includes('debug') ||
|
||||
text.includes('error') ||
|
||||
text.includes('fix') ||
|
||||
text.includes('console')) {
|
||||
return 'coding-debugging';
|
||||
}
|
||||
|
||||
// Reasoning patterns
|
||||
if (text.includes('reason') ||
|
||||
text.includes('analyze') ||
|
||||
text.includes('architecture') ||
|
||||
text.includes('design')) {
|
||||
return 'reasoning-architecture';
|
||||
}
|
||||
|
||||
// Default to general coding
|
||||
return 'general-coding';
|
||||
}
|
||||
|
||||
/**
|
||||
* Create training example object
|
||||
*/
|
||||
function createTrainingExample(userQuery, qwenResponse, expertResponse, options = {}) {
|
||||
const domain = options.domain || detectDomain([userQuery, qwenResponse, expertResponse]);
|
||||
|
||||
return {
|
||||
messages: [
|
||||
{ role: 'user', content: userQuery },
|
||||
{ role: 'assistant', content: qwenResponse },
|
||||
{ role: 'user', content: 'consult Sonnet' },
|
||||
{ role: 'assistant', content: expertResponse }
|
||||
],
|
||||
metadata: {
|
||||
domain: domain,
|
||||
adapter: `lora-v1-${domain.replace(/-/g, '_')}`,
|
||||
timestamp: new Date().toISOString(),
|
||||
interaction_id: generateInteractionId(),
|
||||
source: 'manual-expert-consultation',
|
||||
...options.metadata
|
||||
}
|
||||
};
|
||||
}
|
||||
|
||||
/**
|
||||
* Append to JSONL file
|
||||
*/
|
||||
function appendToJSONL(filePath, data) {
|
||||
const jsonLine = JSON.stringify(data);
|
||||
fs.appendFileSync(filePath, jsonLine + '\n');
|
||||
}
|
||||
|
||||
/**
|
||||
* Format domain name from detected or provided
|
||||
*/
|
||||
function formatDomainName(domain) {
|
||||
// Convert hyphens to underscores for adapter name
|
||||
return domain.replace(/-/g, '_');
|
||||
}
|
||||
|
||||
/**
|
||||
* Collect a training example
|
||||
*/
|
||||
export function collect({
|
||||
userQuery,
|
||||
qwenResponse,
|
||||
expertResponse,
|
||||
domain = null,
|
||||
metadata = {}
|
||||
}) {
|
||||
if (!userQuery || !qwenResponse || !expertResponse) {
|
||||
console.error('Error: Missing required parameters');
|
||||
console.error('Usage: node collect-interaction.js --query "..." --qwen "..." --expert "..."');
|
||||
process.exit(1);
|
||||
}
|
||||
|
||||
const trainingExample = createTrainingExample(
|
||||
userQuery,
|
||||
qwenResponse,
|
||||
expertResponse,
|
||||
{ domain, metadata }
|
||||
);
|
||||
|
||||
// Ensure raw directory exists
|
||||
fs.mkdirSync(path.dirname(RAW_FILE), { recursive: true });
|
||||
|
||||
// Append to raw file
|
||||
appendToJSONL(RAW_FILE, trainingExample);
|
||||
|
||||
console.log('\n=== Training Example Captured ===');
|
||||
console.log(`Domain: ${trainingExample.metadata.domain}`);
|
||||
console.log(`Adapter: ${trainingExample.metadata.adapter}`);
|
||||
console.log(`Interaction ID: ${trainingExample.metadata.interaction_id}`);
|
||||
console.log(`Timestamp: ${trainingExample.metadata.timestamp}`);
|
||||
console.log(`Raw file: ${RAW_FILE}`);
|
||||
console.log('=================================\n');
|
||||
|
||||
return trainingExample;
|
||||
}
|
||||
|
||||
/**
|
||||
* CLI interface
|
||||
*/
|
||||
function main() {
|
||||
const args = process.argv.slice(2);
|
||||
|
||||
// Parse arguments
|
||||
const queryIdx = args.findIndex(arg => arg === '--query');
|
||||
const qwenIdx = args.findIndex(arg => arg === '--qwen');
|
||||
const expertIdx = args.findIndex(arg => arg === '--expert');
|
||||
const domainIdx = args.findIndex(arg => arg === '--domain');
|
||||
|
||||
if (queryIdx === -1 || qwenIdx === -1 || expertIdx === -1) {
|
||||
console.error('Error: Missing required arguments');
|
||||
console.error('Usage: node collect-interaction.js --query "..." --qwen "..." --expert "..." [--domain "domain"]');
|
||||
console.error(' node collect-interaction.js --query "Query" --qwen "Qwen answer" --expert "Expert answer"');
|
||||
process.exit(1);
|
||||
}
|
||||
|
||||
const userQuery = args[queryIdx + 1];
|
||||
const qwenResponse = args[qwenIdx + 1];
|
||||
const expertResponse = args[expertIdx + 1];
|
||||
const domain = domainIdx !== -1 ? args[domainIdx + 1] : null;
|
||||
|
||||
collect({ userQuery, qwenResponse, expertResponse, domain });
|
||||
}
|
||||
|
||||
// Export for module usage
|
||||
export default { collect, createTrainingExample, detectDomain };
|
||||
|
||||
// Run if executed directly
|
||||
if (import.meta.url === `file://${process.argv[1]}`) {
|
||||
main();
|
||||
}
|
||||
|
|
@ -1,97 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
# deploy.sh — Merge a trained LoRA adapter into a standalone model
|
||||
# Usage:
|
||||
# bash deploy.sh # deploy multi-domain adapter
|
||||
# bash deploy.sh zddc-naming # deploy specific domain adapter
|
||||
#
|
||||
# Output: adapters/<domain>-lora-v1-merged/
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
cd "$SCRIPT_DIR"
|
||||
|
||||
DOMAIN="${1:-multi-domain}"
|
||||
BASE_MODEL="Qwen/Qwen2.5-7B-Instruct"
|
||||
ADAPTER_DIR="adapters/${DOMAIN}-lora-v1"
|
||||
MERGED_DIR="adapters/${DOMAIN}-lora-v1-merged"
|
||||
|
||||
echo "=== ZDDC LoRA Deployment ==="
|
||||
echo "Domain: $DOMAIN"
|
||||
echo "Adapter: $ADAPTER_DIR"
|
||||
echo "Output: $MERGED_DIR"
|
||||
echo ""
|
||||
|
||||
if [ ! -d "$ADAPTER_DIR" ]; then
|
||||
echo "Error: adapter not found: $ADAPTER_DIR"
|
||||
echo "Run: bash train.sh $DOMAIN"
|
||||
exit 1
|
||||
fi
|
||||
if [ ! -f "$ADAPTER_DIR/adapter_config.json" ]; then
|
||||
echo "Error: adapter incomplete (missing adapter_config.json)"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
command -v python3 &>/dev/null || { echo "Error: python3 required"; exit 1; }
|
||||
python3 -c "import torch, transformers, peft" 2>/dev/null || \
|
||||
pip install torch transformers peft --quiet
|
||||
|
||||
mkdir -p "$MERGED_DIR"
|
||||
|
||||
DEPLOY_PY=$(mktemp /tmp/deploy_lora_XXXXXX.py)
|
||||
trap 'rm -f "$DEPLOY_PY"' EXIT
|
||||
|
||||
cat > "$DEPLOY_PY" << 'PYEOF'
|
||||
import sys, os, torch
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
from peft import PeftModel
|
||||
|
||||
base_model_name = sys.argv[1]
|
||||
adapter_dir = sys.argv[2]
|
||||
merged_dir = sys.argv[3]
|
||||
|
||||
print(f"Loading base model: {base_model_name}")
|
||||
tok = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
base_model_name, torch_dtype=torch.bfloat16,
|
||||
device_map="auto", trust_remote_code=True)
|
||||
|
||||
print(f"Loading adapter: {adapter_dir}")
|
||||
model = PeftModel.from_pretrained(model, adapter_dir)
|
||||
|
||||
print("Merging weights into base model...")
|
||||
model = model.merge_and_unload()
|
||||
|
||||
print(f"Saving merged model to {merged_dir} ...")
|
||||
model.save_pretrained(merged_dir, safe_serialization=True)
|
||||
tok.save_pretrained(merged_dir)
|
||||
|
||||
print("\nRunning test inference...")
|
||||
prompt = "<|im_start|>user\nWhat is the ZDDC file naming convention?<|im_end|>\n<|im_start|>assistant\n"
|
||||
inputs = tok(prompt, return_tensors="pt").to(model.device)
|
||||
with torch.no_grad():
|
||||
out = model.generate(
|
||||
**inputs, max_new_tokens=128, temperature=0.7,
|
||||
do_sample=True, pad_token_id=tok.eos_token_id)
|
||||
response = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
|
||||
print(f"Test prompt: What is the ZDDC file naming convention?")
|
||||
print(f"Model response: {response}")
|
||||
|
||||
size = sum(
|
||||
os.path.getsize(os.path.join(merged_dir, f))
|
||||
for f in os.listdir(merged_dir)
|
||||
if os.path.isfile(os.path.join(merged_dir, f)))
|
||||
print(f"\nMerged model size: {size/(1024**3):.2f} GB")
|
||||
print(f"Saved to: {merged_dir}")
|
||||
PYEOF
|
||||
|
||||
python3 "$DEPLOY_PY" "$BASE_MODEL" "$ADAPTER_DIR" "$MERGED_DIR"
|
||||
|
||||
echo ""
|
||||
echo "=== Deployment Complete ==="
|
||||
echo "Merged model: $MERGED_DIR"
|
||||
echo ""
|
||||
echo "To use:"
|
||||
echo " from transformers import AutoTokenizer, AutoModelForCausalLM"
|
||||
echo " model = AutoModelForCausalLM.from_pretrained('$MERGED_DIR')"
|
||||
echo " tokenizer = AutoTokenizer.from_pretrained('$MERGED_DIR')"
|
||||
|
|
@ -1,16 +0,0 @@
|
|||
{
|
||||
"name": "zddc-training-data",
|
||||
"version": "1.0.0",
|
||||
"description": "Training data collection and LoRA fine-tuning pipeline for Qwen3-Coder-Next",
|
||||
"type": "module",
|
||||
"main": "collect-interaction.js",
|
||||
"scripts": {
|
||||
"collect": "node collect-interaction.js",
|
||||
"process": "bash process.sh",
|
||||
"validate": "bash validate.sh",
|
||||
"train": "bash train.sh",
|
||||
"deploy": "bash deploy.sh"
|
||||
},
|
||||
"keywords": ["zddc", "training", "lora", "fine-tuning", "qwen"],
|
||||
"license": "AGPL-3.0"
|
||||
}
|
||||
|
|
@ -1,94 +0,0 @@
|
|||
#!/bin/bash
|
||||
set -euo pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
cd "$SCRIPT_DIR"
|
||||
|
||||
echo "=== ZDDC Training Data Processor ==="
|
||||
echo "Timestamp: $(date +%Y-%m-%d_%H-%M-%S)"
|
||||
|
||||
# Step 1: Combine raw interactions
|
||||
echo "Step 1: Combining raw interactions..."
|
||||
cat raw/*.jsonl > raw/all-interactions.jsonl 2>/dev/null || echo "No raw interactions yet"
|
||||
|
||||
# Step 2: Deduplicate by conversation content
|
||||
echo "Step 2: Deduplicating..."
|
||||
if [ -f raw/all-interactions.jsonl ]; then
|
||||
jq -s 'unique_by(.messages | tojson)' raw/all-interactions.jsonl > processed/all.jsonl
|
||||
else
|
||||
echo "Warning: No raw interactions to deduplicate"
|
||||
fi
|
||||
|
||||
# Step 3: Categorize by domain
|
||||
echo "Step 3: Categorizing by domain..."
|
||||
|
||||
# Check if we have data to process
|
||||
if [ -f processed/all.jsonl ]; then
|
||||
# Extract unique domains
|
||||
domains=$(jq -r '.metadata.domain' processed/all.jsonl | sort -u)
|
||||
|
||||
for domain in $domains; do
|
||||
if [ -n "$domain" ]; then
|
||||
echo " Processing domain: $domain"
|
||||
jq -s "grep(.metadata.domain == \"$domain\")" processed/all.jsonl > "processed/${domain}.jsonl" 2>/dev/null || \
|
||||
jq -s 'map(select(.metadata.domain == "'"$domain"'"))' processed/all.jsonl > "processed/${domain}.jsonl"
|
||||
fi
|
||||
done
|
||||
else
|
||||
echo "Warning: No processed data found"
|
||||
fi
|
||||
|
||||
# Step 4: Create multi-domain dataset
|
||||
echo "Step 4: Creating multi-domain dataset..."
|
||||
if [ -f processed/all.jsonl ]; then
|
||||
cp processed/all.jsonl processed/multi-domain.jsonl
|
||||
fi
|
||||
|
||||
# Step 5: Initialize validation split (empty until we have data)
|
||||
echo "Step 5: Splitting into train/val/test..."
|
||||
mkdir -p validation
|
||||
|
||||
# Create empty placeholders if no data
|
||||
if [ ! -f validation/train.jsonl ]; then
|
||||
touch validation/train.jsonl
|
||||
fi
|
||||
if [ ! -f validation/val.jsonl ]; then
|
||||
touch validation/val.jsonl
|
||||
fi
|
||||
if [ ! -f validation/test.jsonl ]; then
|
||||
touch validation/test.jsonl
|
||||
fi
|
||||
|
||||
# Step 6: Create snapshot
|
||||
echo "Step 6: Creating snapshot..."
|
||||
SNAPSHOT_NAME="v$(date +%Y-%m-%d)"
|
||||
mkdir -p snapshots/$SNAPSHOT_NAME
|
||||
|
||||
if [ -f processed/all.jsonl ]; then
|
||||
cp processed/all.jsonl snapshots/$SNAPSHOT_NAME/dataset.jsonl
|
||||
cp validation/*.jsonl snapshots/$SNAPSHOT_NAME/
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo "=== Processing Complete ==="
|
||||
|
||||
# Display summary
|
||||
if [ -f processed/all.jsonl ]; then
|
||||
TOTAL=$(wc -l < processed/all.jsonl)
|
||||
VALIDATION=$(wc -l < processed/val.jsonl 2>/dev/null || echo 0)
|
||||
echo "Total examples: $TOTAL"
|
||||
|
||||
echo ""
|
||||
echo "Domain breakdown:"
|
||||
if [ -f processed/all.jsonl ]; then
|
||||
jq -s 'group_by(.metadata.domain) | map({domain: .[0].metadata.domain, count: length})' processed/all.jsonl | jq '.[] | " \(.domain): \(.count)"'
|
||||
fi
|
||||
else
|
||||
echo "No data processed yet. Collect some interactions first."
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo "Next steps:"
|
||||
echo " 1. Collect interactions: node collect-interaction.js --query \"...\" --qwen \"...\" --expert \"...\""
|
||||
echo " 2. Process: bash process.sh"
|
||||
echo " 3. Train: bash train.sh [domain]"
|
||||
|
|
@ -1,205 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
# train.sh — Train a LoRA adapter for Qwen3-Coder-Next
|
||||
# Usage:
|
||||
# bash train.sh # train on all domains (multi-domain)
|
||||
# bash train.sh zddc-naming # train on a specific domain
|
||||
#
|
||||
# Output: adapters/<domain>-lora-v1/
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
cd "$SCRIPT_DIR"
|
||||
|
||||
DOMAIN="${1:-all}"
|
||||
BASE_MODEL="Qwen/Qwen2.5-7B-Instruct"
|
||||
LORA_RANK=64
|
||||
LORA_ALPHA=64
|
||||
LEARNING_RATE="1e-4"
|
||||
NUM_EPOCHS=3
|
||||
MAX_SEQ_LENGTH=2048
|
||||
WARMUP_RATIO=0.1
|
||||
WEIGHT_DECAY=0.01
|
||||
|
||||
if [ "$DOMAIN" = "all" ]; then
|
||||
TRAIN_FILE="validation/train.jsonl"
|
||||
OUTPUT_DIR="adapters/multi-domain-lora-v1"
|
||||
else
|
||||
TRAIN_FILE="processed/${DOMAIN}.jsonl"
|
||||
OUTPUT_DIR="adapters/${DOMAIN}-lora-v1"
|
||||
fi
|
||||
VAL_FILE="validation/val.jsonl"
|
||||
|
||||
echo "=== ZDDC LoRA Fine-Tuning ==="
|
||||
echo "Domain: $DOMAIN"
|
||||
echo "Base model: $BASE_MODEL"
|
||||
echo "Train file: $TRAIN_FILE"
|
||||
echo "Output dir: $OUTPUT_DIR"
|
||||
echo "LoRA rank: $LORA_RANK / alpha: $LORA_ALPHA"
|
||||
echo "Learning rate: $LEARNING_RATE Epochs: $NUM_EPOCHS"
|
||||
echo ""
|
||||
|
||||
if [ ! -f "$TRAIN_FILE" ]; then
|
||||
echo "Error: training file not found: $TRAIN_FILE"
|
||||
echo "Run bash process.sh first."
|
||||
exit 1
|
||||
fi
|
||||
TRAIN_COUNT=$(grep -c . "$TRAIN_FILE" 2>/dev/null || echo 0)
|
||||
if [ "$TRAIN_COUNT" -eq 0 ]; then
|
||||
echo "Error: training file is empty. Collect at least 50 interactions first."
|
||||
exit 1
|
||||
fi
|
||||
echo "Training examples: $TRAIN_COUNT"
|
||||
|
||||
NO_EVAL=0
|
||||
if [ ! -f "$VAL_FILE" ] || [ "$(grep -c . "$VAL_FILE" 2>/dev/null || echo 0)" -eq 0 ]; then
|
||||
echo "Warning: no validation data found. Skipping eval."
|
||||
NO_EVAL=1
|
||||
fi
|
||||
|
||||
command -v python3 &>/dev/null || { echo "Error: python3 required"; exit 1; }
|
||||
|
||||
echo "Checking Python dependencies..."
|
||||
python3 -c "import torch, transformers, peft, trl, datasets, accelerate" 2>/dev/null || {
|
||||
echo "Installing required packages..."
|
||||
pip install torch transformers peft trl datasets accelerate --quiet
|
||||
}
|
||||
echo "Dependencies OK"
|
||||
echo ""
|
||||
|
||||
mkdir -p "$OUTPUT_DIR"
|
||||
|
||||
TRAIN_PY=$(mktemp /tmp/train_lora_XXXXXX.py)
|
||||
trap 'rm -f "$TRAIN_PY"' EXIT
|
||||
|
||||
cat > "$TRAIN_PY" << 'PYEOF'
|
||||
import json, sys, argparse, torch
|
||||
from pathlib import Path
|
||||
from datasets import Dataset
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
|
||||
from peft import LoraConfig, get_peft_model, TaskType
|
||||
from trl import SFTTrainer
|
||||
|
||||
def load_jsonl(path):
|
||||
out = []
|
||||
with open(path) as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if line:
|
||||
out.append(json.loads(line))
|
||||
return out
|
||||
|
||||
def fmt(ex):
|
||||
text = ""
|
||||
for m in ex["messages"]:
|
||||
text += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
|
||||
return {"text": text}
|
||||
|
||||
p = argparse.ArgumentParser()
|
||||
p.add_argument("--model_name", default="Qwen/Qwen2.5-7B-Instruct")
|
||||
p.add_argument("--train_file", required=True)
|
||||
p.add_argument("--val_file")
|
||||
p.add_argument("--output_dir", required=True)
|
||||
p.add_argument("--lora_rank", type=int, default=64)
|
||||
p.add_argument("--lora_alpha", type=int, default=64)
|
||||
p.add_argument("--learning_rate", type=float, default=1e-4)
|
||||
p.add_argument("--num_epochs", type=int, default=3)
|
||||
p.add_argument("--max_seq_length", type=int, default=2048)
|
||||
p.add_argument("--warmup_ratio", type=float, default=0.1)
|
||||
p.add_argument("--weight_decay", type=float, default=0.01)
|
||||
p.add_argument("--no_eval", action="store_true")
|
||||
args = p.parse_args()
|
||||
|
||||
print(f"Loading tokenizer: {args.model_name}")
|
||||
tok = AutoTokenizer.from_pretrained(
|
||||
args.model_name, model_max_length=args.max_seq_length,
|
||||
padding_side="right", trust_remote_code=True)
|
||||
if tok.pad_token is None:
|
||||
tok.pad_token = tok.eos_token
|
||||
|
||||
print("Loading base model...")
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
args.model_name, torch_dtype=torch.bfloat16,
|
||||
device_map="auto", trust_remote_code=True)
|
||||
model.gradient_checkpointing_enable()
|
||||
|
||||
print("Applying LoRA...")
|
||||
lora_cfg = LoraConfig(
|
||||
r=args.lora_rank, lora_alpha=args.lora_alpha,
|
||||
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
|
||||
lora_dropout=0.05, bias="none", task_type=TaskType.CAUSAL_LM)
|
||||
model = get_peft_model(model, lora_cfg)
|
||||
model.print_trainable_parameters()
|
||||
|
||||
train_ds = Dataset.from_list([fmt(d) for d in load_jsonl(args.train_file)])
|
||||
print(f"Training examples: {len(train_ds)}")
|
||||
|
||||
eval_ds = None
|
||||
if not args.no_eval and args.val_file:
|
||||
vp = Path(args.val_file)
|
||||
if vp.exists() and vp.stat().st_size > 0:
|
||||
vdata = load_jsonl(args.val_file)
|
||||
if vdata:
|
||||
eval_ds = Dataset.from_list([fmt(d) for d in vdata])
|
||||
print(f"Validation examples: {len(eval_ds)}")
|
||||
|
||||
ta = TrainingArguments(
|
||||
output_dir=args.output_dir,
|
||||
per_device_train_batch_size=4,
|
||||
per_device_eval_batch_size=4,
|
||||
gradient_accumulation_steps=2,
|
||||
learning_rate=args.learning_rate,
|
||||
num_train_epochs=args.num_epochs,
|
||||
warmup_ratio=args.warmup_ratio,
|
||||
weight_decay=args.weight_decay,
|
||||
lr_scheduler_type="cosine",
|
||||
bf16=True, fp16=False,
|
||||
logging_steps=10,
|
||||
save_steps=100,
|
||||
eval_steps=50 if eval_ds else None,
|
||||
evaluation_strategy="steps" if eval_ds else "no",
|
||||
save_strategy="steps",
|
||||
save_total_limit=3,
|
||||
load_best_model_at_end=(eval_ds is not None),
|
||||
report_to=[],
|
||||
seed=42,
|
||||
)
|
||||
|
||||
trainer = SFTTrainer(
|
||||
model=model, args=ta,
|
||||
train_dataset=train_ds, eval_dataset=eval_ds,
|
||||
dataset_text_field="text",
|
||||
max_seq_length=args.max_seq_length,
|
||||
tokenizer=tok)
|
||||
|
||||
print("\nStarting training...")
|
||||
trainer.train()
|
||||
|
||||
print(f"\nSaving adapter to {args.output_dir} ...")
|
||||
trainer.model.save_pretrained(args.output_dir)
|
||||
tok.save_pretrained(args.output_dir)
|
||||
print(f"Done. Adapter saved to: {args.output_dir}")
|
||||
PYEOF
|
||||
|
||||
CMD="python3 $TRAIN_PY \
|
||||
--model_name $BASE_MODEL \
|
||||
--train_file $TRAIN_FILE \
|
||||
--val_file $VAL_FILE \
|
||||
--output_dir $OUTPUT_DIR \
|
||||
--lora_rank $LORA_RANK \
|
||||
--lora_alpha $LORA_ALPHA \
|
||||
--learning_rate $LEARNING_RATE \
|
||||
--num_epochs $NUM_EPOCHS \
|
||||
--max_seq_length $MAX_SEQ_LENGTH \
|
||||
--warmup_ratio $WARMUP_RATIO \
|
||||
--weight_decay $WEIGHT_DECAY"
|
||||
|
||||
[ "$NO_EVAL" -eq 1 ] && CMD="$CMD --no_eval"
|
||||
|
||||
echo "Launching training..."
|
||||
eval "$CMD"
|
||||
|
||||
echo ""
|
||||
echo "=== Training Complete ==="
|
||||
echo "Adapter: $OUTPUT_DIR"
|
||||
echo "Next: bash test.sh $DOMAIN OR bash deploy.sh $DOMAIN"
|
||||
|
|
@ -1,114 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
cd "$SCRIPT_DIR"
|
||||
|
||||
DOMAIN="${1:-all}"
|
||||
PASS=0
|
||||
WARN=0
|
||||
FAIL=0
|
||||
|
||||
ok() { echo " ✓ $*"; ((PASS+=1)) || true; }
|
||||
warn() { echo " ⚠ $*"; ((WARN+=1)) || true; }
|
||||
fail() { echo " ✗ $*"; ((FAIL+=1)) || true; }
|
||||
|
||||
echo "=== ZDDC Training Data Validator ==="
|
||||
echo "Domain: $DOMAIN"
|
||||
echo ""
|
||||
|
||||
echo "[ Raw Data ]"
|
||||
RAW_FILE="raw/interactions.jsonl"
|
||||
if [ ! -f "$RAW_FILE" ]; then
|
||||
warn "No raw interactions file found. Collect interactions first."
|
||||
else
|
||||
RAW_COUNT=$(grep -c . "$RAW_FILE" 2>/dev/null || echo 0)
|
||||
ok "Raw interactions: $RAW_COUNT"
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo "[ JSONL Validity ]"
|
||||
check_jsonl() {
|
||||
local file="$1"
|
||||
if [ ! -f "$file" ]; then warn "File not found: $file"; return; fi
|
||||
if [ ! -s "$file" ]; then warn "File is empty: $file"; return; fi
|
||||
local count
|
||||
count=$(grep -c . "$file" 2>/dev/null || echo 0)
|
||||
ok "$file — $count lines"
|
||||
}
|
||||
|
||||
if [ "$DOMAIN" = "all" ]; then
|
||||
for f in processed/*.jsonl validation/*.jsonl; do
|
||||
[ -f "$f" ] && check_jsonl "$f"
|
||||
done
|
||||
else
|
||||
check_jsonl "processed/${DOMAIN}.jsonl"
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo "[ Domain Balance ]"
|
||||
if [ -f "processed/all.jsonl" ] && [ -s "processed/all.jsonl" ]; then
|
||||
python3 -c "
|
||||
import json
|
||||
from collections import Counter
|
||||
counts = Counter()
|
||||
with open('processed/all.jsonl') as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if line:
|
||||
try:
|
||||
obj = json.loads(line)
|
||||
domain = obj.get('metadata', {}).get('domain', 'unknown')
|
||||
counts[domain] += 1
|
||||
except Exception:
|
||||
pass
|
||||
if not counts:
|
||||
print(' no domain data found')
|
||||
else:
|
||||
print(f' Total: {sum(counts.values())} examples')
|
||||
for domain, count in sorted(counts.items(), key=lambda x: -x[1]):
|
||||
status = 'OK' if count >= 200 else 'LOW'
|
||||
print(f' [{status}] {domain}: {count}')
|
||||
"
|
||||
else
|
||||
warn "processed/all.jsonl not found — run bash process.sh first"
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo "[ Train/Val/Test Split ]"
|
||||
for split in train val test; do
|
||||
f="validation/${split}.jsonl"
|
||||
if [ ! -f "$f" ] || [ ! -s "$f" ]; then
|
||||
warn "$split split missing or empty"
|
||||
else
|
||||
count=$(grep -c . "$f" 2>/dev/null || echo 0)
|
||||
ok "$split: $count examples"
|
||||
fi
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo "[ Existing Adapters ]"
|
||||
if [ -d "adapters" ] && [ "$(ls -A adapters 2>/dev/null)" ]; then
|
||||
for adapter_dir in adapters/*/; do
|
||||
name=$(basename "$adapter_dir")
|
||||
if [ -f "${adapter_dir}adapter_config.json" ]; then
|
||||
ok "$name (trained)"
|
||||
else
|
||||
warn "$name (incomplete)"
|
||||
fi
|
||||
done
|
||||
else
|
||||
echo " (no adapters yet)"
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo "================================="
|
||||
echo "PASS: $PASS WARN: $WARN FAIL: $FAIL"
|
||||
if [ "$FAIL" -gt 0 ]; then
|
||||
echo "Status: FAIL"
|
||||
exit 1
|
||||
elif [ "$WARN" -gt 0 ]; then
|
||||
echo "Status: WARN"
|
||||
else
|
||||
echo "Status: PASS"
|
||||
fi
|
||||
Loading…
Reference in a new issue