test: add tests/data/test-archive.sh — synthetic ZDDC fixture builder

Generates a realistic ZDDC archive layout for end-to-end testing of
master + cache + mirror, with zero identifying data and a script
that creates and clears on demand.

Output: ~/zddc-test-data (default; override via TEST_ARCHIVE_DIR),
intentionally OUTSIDE the repo. Defensive .gitignore entries cover
in-repo redirects and the source-reference CSV (~/archive-export*.csv,
which the script never reads at runtime — distributions are baked in
here as constants extracted from a one-time inspection).

Layout mirrors a real archive's shape (project → Archive → party →
Received|Issued → dated transmittal folder → tracking-numbered file)
with synthetic codes throughout — Project-1/2/3, PartyA/B/C, FAC1-4,
lorem-ipsum titles, example.com emails. Disciplines, doc-type codes,
status codes (IFR/IFI/IFA/IFU/RSB), revision letters (A/B/0/0A/0B/C/D),
and tracking-number format are kept as-is — they're public ZDDC
convention vocabularies, not identifying data.

Each file's content is the metadata block:
  Tracking Number: <synthetic>
  Revision: <letter>
  Status: <code>
  Title: <lorem-ipsum>
rendered into the appropriate format per extension. Open any file and
verify it's the right one — md as a table, yaml as keys, html as a
styled table, .zddc as YAML, .zip with three views (md+yaml+html), pdf
rendered via docker.io/pandoc/latex (already-existing 563MB image)
through podman with --userns=keep-id so output is host-user-owned.
Falls back to a hand-rolled minimal valid PDF (Python stdlib only)
when podman or the pandoc image is unavailable.

Subcommands:
  build [--small]   Generate the fixture. --small produces ~12 files,
                    full produces ~550 with every one of the six
                    extensions (md/yaml/pdf/html/zddc/zip) guaranteed
                    in every transmittal.
  clear             rm -rf the fixture. Refuses unless target contains
                    a .zddc — defense against an accidental misconfigured
                    TEST_ARCHIVE_DIR pointing at something important.
  info              File count, total size, by-extension breakdown,
                    top-level layout. No content snippets.

POSIX sh (dash-compatible). Randomness via /dev/urandom (no $RANDOM;
dash doesn't expose it). Per-directory .zddc ACL configs use synthetic
emails from RFC-2606 example.com.

Verified: full fixture builds in ~3min (PDF generation dominates),
contains 144 PDFs all valid 1-page, no real-archive tokens leak
(grep -i for known sentinels from the source CSV returns zero hits).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
ZDDC 2026-05-08 09:03:38 -05:00
parent 55852a9efb
commit e19667b5a2
2 changed files with 524 additions and 0 deletions

8
.gitignore vendored
View file

@ -43,3 +43,11 @@ package-lock.json
zddc-knowledge*.json zddc-knowledge*.json
zddc-knowledge*.md zddc-knowledge*.md
zddc-knowledge*.html zddc-knowledge*.html
# tests/data/test-archive.sh fixture output. Default is ~/zddc-test-data
# (outside the repo); these patterns catch in-repo redirects via
# TEST_ARCHIVE_DIR. Defense in depth — the real-archive CSV reference
# at ~/archive-export*.csv must NEVER end up in the repo.
/zddc-test-data/
/tests/data/output/
/archive-export*.csv

516
tests/data/test-archive.sh Executable file
View file

@ -0,0 +1,516 @@
#!/bin/sh
# test-archive.sh — build/clear a synthetic ZDDC archive for end-to-end
# testing of master + cache + mirror.
#
# The fixture mimics the SHAPE of a real ZDDC archive (project →
# Archive → party → Received|Issued → dated transmittal folder →
# tracking-number-named files) but contains zero identifying data.
# Every file's content is a 4-line metadata block:
#
# Tracking Number: FAC1-EL-CAL-0020
# Revision: A
# Status: IFI
# Title: <synthetic lorem-ipsum phrase>
#
# rendered into the appropriate format per extension. Open any file
# and you can verify it's the right one. Tracking-number / revision /
# status / extension distributions are derived from a real archive
# CSV (~/archive-export*.csv) but the script never reads that CSV
# at runtime — distributions are baked in here as constants.
#
# Output lives at $TEST_ARCHIVE_DIR (default ~/zddc-test-data),
# OUTSIDE the repo. There is also a defensive .gitignore entry
# matching common in-repo paths in case someone redirects.
#
# PDF generation uses docker.io/pandoc/latex via podman with
# --userns=keep-id so output is owned by the host user. If podman
# isn't available, PDF generation falls back to plaintext PDF
# (a hand-rolled minimal valid PDF — opens but no formatting).
set -eu
# ---------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------
TARGET="${TEST_ARCHIVE_DIR:-$HOME/zddc-test-data}"
SMALL=0
PROJECTS_FULL="Project-1 Project-2 Project-3"
PROJECTS_SMALL="Project-1"
PARTIES_FULL="PartyA PartyB PartyC"
PARTIES_SMALL="PartyA"
TRANSMITTALS_PER_PARTY_FULL=6
TRANSMITTALS_PER_PARTY_SMALL=2
# Each transmittal contains at least one of every extension in
# EXTENSIONS_GUARANTEED (6 of them), plus extras up to this total.
FILES_PER_TRANSMITTAL_FULL=10
FILES_PER_TRANSMITTAL_SMALL=6
PANDOC_IMAGE="docker.io/pandoc/latex:latest"
# Status / revision / extension / discipline distributions, derived
# from a 773-row sample of a real archive. Format-preserving — these
# are public ZDDC convention vocabularies, no identifying data.
STATUSES="IFR IFR IFR IFR IFR IFR IFI IFI IFU IFU IFA RSB" # weighted
REVISIONS="A B 0 0A 0B C D"
# The full extension set per the test plan. Each transmittal gets one
# of each (so every fixture exercises every extension), then EXTRAS
# are sampled from the weighted distribution.
EXTENSIONS_GUARANTEED="md yaml pdf html zddc zip"
EXTENSIONS_WEIGHTED="pdf pdf pdf pdf md md yaml html zip zddc"
DISCIPLINES="EL PM CAL CPT TRN INT MEC SPC"
DOC_TYPES="CAL CPT TRN SPC DRW LST RPT MDL"
# Lorem-ipsum-style title fragments. No real-world references.
TITLE_WORDS="lorem ipsum dolor sit amet consectetur adipiscing elit \
sed eiusmod tempor incididunt labore magna aliqua veniam nostrud \
exercitation ullamco laboris nisi commodo duis aute irure dolore"
# Synthetic admin emails for .zddc ACLs. example.com is reserved
# (RFC 2606), guaranteed not to belong to anyone real.
ADMIN_EMAIL="admin@example.com"
USER_EMAILS="alice@example.com bob@example.com carol@example.com"
# ---------------------------------------------------------------------
# Subcommand dispatch
# ---------------------------------------------------------------------
# ---------------------------------------------------------------------
# Random-number helper. dash (POSIX /bin/sh) has no $RANDOM, so we read
# 2 bytes from /dev/urandom each call and decode to a 16-bit unsigned
# int. Fast (no exec, no awk per call) and properly random across runs.
# ---------------------------------------------------------------------
_rand() {
od -An -N2 -tu2 /dev/urandom | tr -d ' \n'
}
usage() {
cat <<EOF
Usage: $0 <subcommand> [--small]
Subcommands:
build [--small] Generate the synthetic archive (small = ~10x fewer files).
clear Remove the archive directory entirely.
info Show what's there (file count, total size, top-level layout).
help Print this message.
Configuration:
TEST_ARCHIVE_DIR Output directory (default: ~/zddc-test-data).
PDF generation:
Uses $PANDOC_IMAGE via podman (with --userns=keep-id so output is
owned by the host user). Falls back to plaintext PDF when podman
is unavailable.
EOF
}
cmd="${1:-help}"
shift 2>/dev/null || true
for arg in "$@"; do
case "$arg" in
--small) SMALL=1 ;;
*) echo "unknown flag: $arg" >&2; exit 2 ;;
esac
done
# ---------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------
# pick_word <list> — random pick from a whitespace-separated list.
pick_word() {
list="$1"
n=$(echo "$list" | wc -w)
idx=$(( $(_rand) % n + 1 ))
echo "$list" | cut -d' ' -f"$idx"
}
random_int() {
awk -v min="$1" -v max="$2" 'BEGIN { srand(); printf "%d\n", min + int(rand() * (max - min + 1)) }'
}
# Pick a date string YYYY-MM-DD between Jan 1 of last year and today.
random_date() {
days_back=$(awk 'BEGIN { srand(); printf "%d\n", int(rand() * 730) }')
date -d "$days_back days ago" +%Y-%m-%d
}
# Build a 3-6-word lorem title.
random_title() {
n=$(random_int 3 6)
out=""
i=0
while [ "$i" -lt "$n" ]; do
w=$(pick_word "$TITLE_WORDS")
# Capitalize first letter for the first word.
if [ "$i" = 0 ]; then
first=$(printf '%s' "$w" | cut -c1 | tr 'a-z' 'A-Z')
rest=$(printf '%s' "$w" | cut -c2-)
w="${first}${rest}"
fi
out="${out}${out:+ }${w}"
i=$((i + 1))
done
printf '%s' "$out"
}
# Synthetic tracking number: <party>-<facility>-<discipline>-<doctype>-NNNN
make_tracking() {
party="$1" # PartyA → A
party_short=$(printf '%s' "$party" | sed 's/^Party//')
facility="FAC$(random_int 1 4)"
discipline=$(pick_word "$DISCIPLINES")
doctype=$(pick_word "$DOC_TYPES")
seq=$(printf '%04d' "$(random_int 1 999)")
printf '%s-%s-%s-%s-%s' "$party_short" "$facility" "$discipline" "$doctype" "$seq"
}
# Render the metadata block in the right format for an extension.
# Args: ext, tracking, rev, status, title, outpath
render_file() {
ext="$1"; tracking="$2"; rev="$3"; status="$4"; title="$5"; out="$6"
case "$ext" in
md)
cat > "$out" <<EOF
# $tracking
| Field | Value |
|---|---|
| Tracking Number | $tracking |
| Revision | $rev |
| Status | $status |
| Title | $title |
This is a synthetic test fixture. Generated by tests/data/test-archive.sh.
EOF
;;
yaml)
cat > "$out" <<EOF
tracking_number: "$tracking"
revision: "$rev"
status: "$status"
title: "$title"
synthetic: true
generated_by: tests/data/test-archive.sh
EOF
;;
html)
cat > "$out" <<EOF
<!doctype html>
<html><head><meta charset="utf-8"><title>$tracking</title></head>
<body style="font-family:sans-serif;padding:2em">
<h1>$tracking</h1>
<table border="1" cellpadding="6" cellspacing="0">
<tr><th>Tracking Number</th><td>$tracking</td></tr>
<tr><th>Revision</th><td>$rev</td></tr>
<tr><th>Status</th><td>$status</td></tr>
<tr><th>Title</th><td>$title</td></tr>
</table>
<p><em>Synthetic test fixture. Generated by tests/data/test-archive.sh.</em></p>
</body></html>
EOF
;;
zddc)
# *.zddc as a data file (not the special config file).
# YAML-shape since .zddc files ARE YAML.
cat > "$out" <<EOF
# Synthetic .zddc data file (not an ACL config).
tracking_number: "$tracking"
revision: "$rev"
status: "$status"
title: "$title"
synthetic: true
EOF
;;
pdf)
render_pdf "$tracking" "$rev" "$status" "$title" "$out"
;;
zip)
render_zip "$tracking" "$rev" "$status" "$title" "$out"
;;
*)
echo "render_file: unknown extension $ext" >&2
exit 1
;;
esac
}
# Render PDF via pandoc/latex if podman is available; fall back to a
# hand-rolled minimal PDF otherwise.
render_pdf() {
tracking="$1"; rev="$2"; status="$3"; title="$4"; out="$5"
if [ "${PDF_BACKEND:-pandoc}" = "minimal" ] || ! command -v podman >/dev/null 2>&1; then
render_pdf_minimal "$tracking" "$rev" "$status" "$title" "$out"
return
fi
# Build a temp .md alongside, render, drop the .md.
tmp_md="${out%.pdf}.tmp.md"
cat > "$tmp_md" <<EOF
# $tracking
| Field | Value |
|---|---|
| Tracking Number | $tracking |
| Revision | $rev |
| Status | $status |
| Title | $title |
Synthetic test fixture. Generated by tests/data/test-archive.sh.
EOF
# Run pandoc in container. Output dir must be writable by the
# in-container UID; --userns=keep-id keeps it as the host user.
dir=$(dirname "$out")
md_base=$(basename "$tmp_md")
pdf_base=$(basename "$out")
if ! podman run --rm --userns=keep-id \
-v "$dir":/data:Z \
"$PANDOC_IMAGE" "/data/$md_base" -o "/data/$pdf_base" >/dev/null 2>&1; then
# Pandoc failed (image missing? network blocked?). Fall back.
rm -f "$tmp_md"
render_pdf_minimal "$tracking" "$rev" "$status" "$title" "$out"
return
fi
rm -f "$tmp_md"
}
# Hand-rolled minimal valid PDF — opens in any reader, displays the
# metadata block. ~600 bytes. Used only when pandoc isn't reachable.
render_pdf_minimal() {
tracking="$1"; rev="$2"; status="$3"; title="$4"; out="$5"
# PDF strings escape (, ), \ — the lorem-ipsum titles never include
# these so a basic substitution is enough for our fixture.
safe() { printf '%s' "$1" | sed 's/[()\\]/_/g'; }
t=$(safe "$tracking"); r=$(safe "$rev"); s=$(safe "$status"); ti=$(safe "$title")
python3 - "$out" "$t" "$r" "$s" "$ti" <<'PY'
import sys
out, t, r, s, ti = sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5]
text = (
f"BT /F1 14 Tf 72 720 Td ({t}) Tj ET\n"
f"BT /F1 12 Tf 72 700 Td (Revision: {r}) Tj ET\n"
f"BT /F1 12 Tf 72 685 Td (Status: {s}) Tj ET\n"
f"BT /F1 12 Tf 72 670 Td (Title: {ti}) Tj ET\n"
f"BT /F1 10 Tf 72 640 Td (Synthetic test fixture - tests/data/test-archive.sh) Tj ET\n"
)
text_b = text.encode("latin-1", errors="replace")
objs = [
b"<</Type/Catalog/Pages 2 0 R>>",
b"<</Type/Pages/Kids[3 0 R]/Count 1>>",
b"<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]/Contents 4 0 R/Resources<</Font<</F1 5 0 R>>>>>>",
b"<</Length " + str(len(text_b)).encode() + b">>stream\n" + text_b + b"endstream",
b"<</Type/Font/Subtype/Type1/BaseFont/Helvetica>>",
]
buf = bytearray(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n")
offsets = []
for i, obj in enumerate(objs, 1):
offsets.append(len(buf))
buf += f"{i} 0 obj\n".encode() + obj + b"\nendobj\n"
xref = len(buf)
buf += f"xref\n0 {len(objs)+1}\n".encode()
buf += b"0000000000 65535 f \n"
for o in offsets:
buf += f"{o:010d} 00000 n \n".encode()
buf += f"trailer <</Size {len(objs)+1}/Root 1 0 R>>\nstartxref\n{xref}\n%%EOF\n".encode()
open(out, "wb").write(buf)
PY
}
# Render a .zip containing a .md, .yaml, and .html with the same
# metadata so unzipping shows three views of the same record. POSIX
# sh has no function-local scope, so nested render_file calls would
# clobber $out — copy to z_out first.
render_zip() {
z_track="$1"; z_rev="$2"; z_status="$3"; z_title="$4"; z_out="$5"
tmpdir=$(mktemp -d)
render_file md "$z_track" "$z_rev" "$z_status" "$z_title" "$tmpdir/$z_track.md"
render_file yaml "$z_track" "$z_rev" "$z_status" "$z_title" "$tmpdir/$z_track.yaml"
render_file html "$z_track" "$z_rev" "$z_status" "$z_title" "$tmpdir/$z_track.html"
(cd "$tmpdir" && zip -q "$z_out" ./*)
rm -rf "$tmpdir"
}
# Write a per-directory .zddc ACL config. Synthetic emails only.
write_zddc_config() {
out="$1"
role="${2:-default}" # default | party | project
case "$role" in
project|party)
cat > "$out" <<EOF
title: "Synthetic ${role} ACL — test fixture"
admins:
- $ADMIN_EMAIL
acl:
permissions:
"$ADMIN_EMAIL": rwcda
"alice@example.com": rwcd
"bob@example.com": rw
"carol@example.com": r
EOF
;;
*)
cat > "$out" <<EOF
title: "Synthetic root ACL — test fixture"
admins:
- $ADMIN_EMAIL
acl:
permissions:
"$ADMIN_EMAIL": rwcda
"*@example.com": r
EOF
;;
esac
}
# ---------------------------------------------------------------------
# build
# ---------------------------------------------------------------------
cmd_build() {
if [ -e "$TARGET" ]; then
echo "$TARGET already exists. Run '$0 clear' first." >&2
exit 1
fi
if [ "$SMALL" = 1 ]; then
projects="$PROJECTS_SMALL"
parties="$PARTIES_SMALL"
per_party=$TRANSMITTALS_PER_PARTY_SMALL
per_trans=$FILES_PER_TRANSMITTAL_SMALL
echo "building SMALL fixture at $TARGET"
else
projects="$PROJECTS_FULL"
parties="$PARTIES_FULL"
per_party=$TRANSMITTALS_PER_PARTY_FULL
per_trans=$FILES_PER_TRANSMITTAL_FULL
echo "building FULL fixture at $TARGET"
fi
# 0777 on the archive dir lets the rootless-podman pandoc container
# write PDF output regardless of UID-namespace mapping. We're in
# $HOME so the parent dir is already access-controlled by user.
mkdir -p "$TARGET"
chmod 0777 "$TARGET"
# Root .zddc — admins + read-only-for-anyone-with-an-example.com-email.
write_zddc_config "$TARGET/.zddc" default
file_count=0
pdf_count=0
for project in $projects; do
proj_dir="$TARGET/$project"
mkdir -p "$proj_dir"
chmod 0777 "$proj_dir"
write_zddc_config "$proj_dir/.zddc" project
for party in $parties; do
party_dir="$proj_dir/Archive/$party"
mkdir -p "$party_dir/Received" "$party_dir/Issued"
chmod 0777 "$party_dir" "$party_dir/Received" "$party_dir/Issued"
write_zddc_config "$party_dir/.zddc" party
i=0
while [ "$i" -lt "$per_party" ]; do
i=$((i + 1))
# Alternate Received / Issued.
if [ $((i % 2)) = 0 ]; then
bucket="Received"
else
bucket="Issued"
fi
# Transmittal envelope: <date>_<tracking> (<status>) - <title>
t_track=$(make_tracking "$party")
t_status=$(pick_word "$STATUSES")
t_title=$(random_title)
t_date=$(random_date)
t_dir="$party_dir/$bucket/${t_date}_${t_track} (${t_status}) - ${t_title}"
mkdir -p "$t_dir"
chmod 0777 "$t_dir"
# Build the per-transmittal extension list: every
# extension in EXTENSIONS_GUARANTEED at least once,
# then weighted-random extras to reach per_trans total.
file_exts="$EXTENSIONS_GUARANTEED"
guaranteed_count=$(echo "$file_exts" | wc -w)
extras=$((per_trans - guaranteed_count))
if [ "$extras" -gt 0 ]; then
k=0
while [ "$k" -lt "$extras" ]; do
file_exts="$file_exts $(pick_word "$EXTENSIONS_WEIGHTED")"
k=$((k + 1))
done
fi
for f_ext in $file_exts; do
f_track=$(make_tracking "$party")
f_rev=$(pick_word "$REVISIONS")
f_status=$(pick_word "$STATUSES")
f_title=$(random_title)
# Filename per ZDDC convention.
f_name="${f_track}_${f_rev} (${f_status}) - ${f_title}.${f_ext}"
f_path="$t_dir/$f_name"
render_file "$f_ext" "$f_track" "$f_rev" "$f_status" "$f_title" "$f_path"
file_count=$((file_count + 1))
if [ "$f_ext" = "pdf" ]; then
pdf_count=$((pdf_count + 1))
fi
done
done
done
done
echo "built: $file_count files ($pdf_count PDFs) at $TARGET"
echo "info: $0 info"
}
# ---------------------------------------------------------------------
# clear
# ---------------------------------------------------------------------
cmd_clear() {
if [ ! -e "$TARGET" ]; then
echo "$TARGET does not exist; nothing to clear"
return 0
fi
# Defense in depth: refuse to rm anything that doesn't look like
# a test-archive directory.
if [ ! -f "$TARGET/.zddc" ]; then
echo "$TARGET does not contain a .zddc — refusing to rm" >&2
echo "(set TEST_ARCHIVE_DIR explicitly if your fixture lives elsewhere)" >&2
exit 1
fi
rm -rf "$TARGET"
echo "cleared $TARGET"
}
# ---------------------------------------------------------------------
# info
# ---------------------------------------------------------------------
cmd_info() {
if [ ! -e "$TARGET" ]; then
echo "$TARGET does not exist (run '$0 build' first)"
return 0
fi
echo "fixture: $TARGET"
files=$(find "$TARGET" -type f | wc -l)
bytes=$(du -sb "$TARGET" 2>/dev/null | awk '{print $1}')
echo "files: $files"
if [ -n "$bytes" ]; then
# Format bytes as KB/MB.
awk -v b="$bytes" 'BEGIN {
if (b < 1024) printf "size: %d B\n", b
else if (b < 1048576) printf "size: %.1f KB\n", b / 1024
else printf "size: %.1f MB\n", b / 1048576
}'
fi
echo "by extension:"
find "$TARGET" -type f -name '*.*' | sed -E 's/.*\.([a-z]+)$/\1/' | sort | uniq -c | sort -rn | head | awk '{printf " %5d %s\n", $1, $2}'
echo "top-level layout:"
find "$TARGET" -maxdepth 3 -mindepth 1 -type d | sed "s|^$TARGET| .|" | head -20
}
case "$cmd" in
build) cmd_build ;;
clear) cmd_clear ;;
info) cmd_info ;;
help|-h|--help) usage ;;
*) echo "unknown subcommand: $cmd" >&2; usage; exit 2 ;;
esac