fix(pandoc): correctness, robustness & doc cleanup of convert tools

Audit-driven cleanup of the standalone pandoc/ CLI tools (no changes to the
server's own zddc/internal/convert engine).

convert:
- DOCX→MD now reads lowercase client/project from zddc.conf (was $CLIENT/
  $PROJECT, always empty)
- ZDDC filename parsing via a shared parse_zddc_filename helper that extracts
  each field with its own backref, so a '|' in the title no longer truncates
  it (was cut -d'|')
- drop duplicate --section-divs and no-op --id-prefix=

convert-diff:
- replace hardcoded "(AR 28088)" in the diff header with the configured
  $project_number (omitted when unset)
- only pass --template when one was found (empty --template= errors out)
- drop the false "Loading ZDDC configuration" log and the sed quote-escape
  that leaked backslashes into custom_header
- remove dead REV_A/REV_B and rev*_date extraction; fix usage typo; pin
  LC_TIME=C on date calls

index.sh:
- relative_path passes paths to python via argv (no -c interpolation) and
  uses realpath --relative-to as the fallback instead of an absolute path
- escape '|' in title/status before emitting the markdown table row

README:
- rewrite the stale server-side section to match the real binary+bubblewrap
  design and flags/defaults (was a non-existent podman/docker/image design)
- fix the invalid zddc.conf example (sourced shell, four real vars) and the
  understated input-format list

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 10:53:26 -05:00 · 2026-06-04 10:53:26 -05:00 · d10cd23076
commit d10cd23076
parent 613092b30e
4 changed files with 132 additions and 98 deletions
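
The '|' truncation described in the commit message can be reproduced outside the scripts. The sketch below contrasts the old join-and-cut split with the per-field extraction of the new `parse_zddc_filename` helper (body copied from the `pandoc/convert` hunk). `old_title` and the sample filename stem are made up for illustration and do not appear in the repository.

```sh
# Old approach: join the four captures with '|', then split again with cut.
# A '|' inside the title becomes an extra delimiter, so field 4 is cut short.
old_title() {
    printf '%s\n' "$1" |
        sed -n 's/^\([^_]*\)_\([^ ]*\) *(\([^)]*\)) *- *\(.*\)$/\1|\2|\3|\4/p' |
        cut -d'|' -f4
}

# New approach (as in the diff): one sed call per field, no join character.
parse_zddc_filename() {
    local stem="$1"
    local sub='s/^\([^_]*\)_\([^ ]*\) *(\([^)]*\)) *- *\(.*\)$'
    # Gate on a full match before extracting any field.
    printf '%s\n' "$stem" | grep -Eq '^[^_]+_[^ ]+ *\([^)]*\) *- *.+$' || return 1
    ZDDC_TRACKING=$(printf '%s\n' "$stem" | sed -n "${sub}/\\1/p")
    ZDDC_REVISION=$(printf '%s\n' "$stem" | sed -n "${sub}/\\2/p")
    ZDDC_STATUS=$(printf '%s\n' "$stem" | sed -n "${sub}/\\3/p")
    ZDDC_TITLE=$(printf '%s\n' "$stem" | sed -n "${sub}/\\4/p")
    return 0
}

stem='TN-001_A (IFR) - Pumps | Valves'   # hypothetical filename stem

printf 'old: [%s]\n' "$(old_title "$stem")"
if parse_zddc_filename "$stem"; then
    printf 'new: [%s]\n' "$ZDDC_TITLE"
fi
# prints:
#   old: [Pumps ]
#   new: [Pumps | Valves]
```

With this stem the old path stops at the first '|' in the title, while the helper returns the title intact.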
--- a/pandoc/README.md
+++ b/pandoc/README.md
@ -4,41 +4,52 @@ A collection of tools for converting Markdown documents to HTML with a professio

 ## Server-side conversion (`zddc-server`)

-zddc-server can offer the same conversions on demand: a `.md` file in any
-served directory becomes downloadable as `.docx`, `.html`, and `.pdf` via the
-`?convert=` query parameter, surfaced as Download buttons in the browse app's
-markdown editor.
+> The shell scripts in this folder are standalone CLI/batch tools. `zddc-server`
+> implements its **own** on-demand conversion (Go package `zddc/internal/convert`)
+> and does **not** call these scripts. It does, however, reuse the same
+> `viewer-template.html` and `custom.css` (embedded at build time). See
+> AGENTS.md → "Server-side document conversion" for the authoritative reference.

-The server shells out to two upstream container images, pulling each on
-first use via `--pull=missing`. No custom image build is required —
-operators just install `podman` (preferred) or `docker`, and the first
-conversion request pulls the image:
+zddc-server can render any served `.md` on demand: requesting the sibling URL
+`<path>/foo.docx` (or `.html` / `.pdf`) returns the converted bytes — no query
+string. A real on-disk file of that name always wins; the virtual conversion
+only fires when the requested file doesn't exist but `foo.md` does. The browse
+app's markdown editor surfaces these as DOCX/HTML/PDF download links (auto-saving
+a dirty buffer first so the output matches what's on screen).

- `docker.io/pandoc/latex:latest` — MD → DOCX and MD → HTML
-  (override: `--convert-pandoc-image=` or `ZDDC_CONVERT_PANDOC_IMAGE`;
-  switch to `docker.io/pandoc/core:latest` for a ~90% size reduction
-  if you don't need pandoc's native LaTeX-PDF path)
- `docker.io/zenika/alpine-chrome:latest` — HTML → PDF
-  (override: `--convert-chromium-image=` or `ZDDC_CONVERT_CHROMIUM_IMAGE`)
+**Architecture.** The Go code does the minimum — it `exec`s `pandoc` and
+`chromium-browser` directly. The sandbox and resource caps live in the runtime
+**image**, where `/usr/local/bin/{pandoc,chromium-browser}` are wrapper scripts
+that run the real binary inside a per-conversion bubblewrap sandbox
+(`--unshare-all`, read-only binds, `--tmpfs /tmp`, `--clearenv`) under cgroup v2
+memory/PID caps. I/O is via stdin/stdout plus a per-call scratch dir. There is no
+container runtime and no image pulling at request time.

 The PDF flow is two-stage: pandoc renders the markdown through
-`viewer-template.html` to standalone HTML, then headless Chromium
-prints that HTML to PDF. This preserves the existing print-media CSS
-authored for the viewer template rather than going through pandoc's
-LaTeX template.
+`viewer-template.html` to standalone HTML, then headless Chromium prints that HTML
+to PDF — preserving the viewer template's print-media CSS rather than going
+through pandoc's LaTeX template.

-If neither podman nor docker is on PATH the endpoint serves 503 with
-a clear "no container runtime" message. Engine choice is overridable
-via `--convert-engine=` or `ZDDC_CONVERT_ENGINE`.
+Converted bytes are cached at `<dir>/.zddc.d/converted/<base>.<ext>` with mtime
+synced to the source, so a fresh cache hit is a stat-and-serve with no `exec`.
+A PUT/DELETE/MOVE on the source `.md` purges the sidecars. Per-project header
+metadata (client/project/contractor/project_number) comes from the `.zddc`
+`convert:` cascade; title/tracking_number/revision/status are derived from the
+filename via `zddc.ParseFilename`.

-Resource limits are per-container and configurable: `--convert-mem-mib`
-(default 512), `--convert-cpus` (default "2"), `--convert-pids`
-(default 100), `--convert-timeout` (default 30s).
+Relevant flags (defaults in parens):

-Each conversion runs in a throw-away container with
-`--rm --network=none --read-only --tmpfs=/tmp --cap-drop=ALL
--security-opt=no-new-privileges` plus a bind-mounted scratch dir
-for I/O (read-only for the template; read-write for the PDF output).
+- `--convert-pandoc-binary` (`pandoc`) / `--convert-chromium-binary`
+  (`chromium-browser`; `chromium` on Debian) — PATH-resolved name or absolute path
+- `--convert-scratch-dir` (`$TMPDIR`) — host scratch root for template + intermediates
+- `--convert-mem-mib` (`1024`) — per-conversion memory cap (cgroup `memory.max`)
+- `--convert-pids` (`256`) — per-conversion PID cap (cgroup `pids.max`)
+- `--convert-timeout` (`60s`) — per-conversion wall clock (Go `context.WithTimeout`)
+
+If `pandoc`/`chromium` aren't on PATH (e.g. running zddc-server outside the runtime
+image) the endpoint serves 503 with a `Retry-After`; the rest of the server keeps
+working. Running against raw pandoc/chromium with no wrapper gives a working but
+**unsandboxed** endpoint — fine for dev iteration.

 ## Features

@ -80,20 +91,18 @@ for I/O (read-only for the template; read-write for the PDF output).
 ```

 ### Configuration (`zddc.conf`)
-Create a `zddc.conf` file in your project directory:
-```ini
-# Project metadata
-title = "Project Documentation"
-author = "Your Organization"
-date = "2024"
-
-# Template settings
-template = "/path/to/viewer-template.html"
-css = "custom-styles.css"
-
-# Output settings
-output_dir = "rendered"
+Create a `zddc.conf` file in your project directory. It is **sourced as shell**,
+so use `var="value"` syntax (no spaces around `=`). Only these four variables are
+read; all are optional and feed the document header via pandoc `--variable`:
+```sh
+contractor="Contractor Name"   # contracting organization (header)
+client="Client Name"           # client org (header, paired with project)
+project="Project Name"         # full project name
+project_number="AR 28088"      # shown in parentheses after the project name
 ```
+The template path is discovered automatically (input dir → script dir →
+symlink target) or set per-run with `-T`; the output directory is set with `-o`.
+They are **not** `zddc.conf` keys.

 ### Directory Structure
 ```
@ -157,8 +166,10 @@ fi

 ## File Types Supported

- **Input**: Markdown (`.md`) files with pandoc extensions
- **Output**: HTML files with embedded CSS and JavaScript
+- **Input**: Markdown (`.md`), DOCX (`.docx`), and HTML (`.html`/`.htm`) files
+  (auto-detected: DOCX→MD, MD→HTML, HTML→MD; override with `-t md|html|docx`).
+  Direct DOCX→HTML is not supported — convert to MD first.
+- **Output**: HTML files with embedded CSS and JavaScript (plus MD and DOCX targets)
 - **Images**: Supports embedded images and diagrams
 - **Tables**: Full table support with print optimization
 - **Code**: Syntax highlighting for code blocks
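
Because `zddc.conf` is sourced as shell, its four variables can be read and turned into pandoc `--variable` arguments roughly as follows. This is a minimal sketch, not the scripts' actual loader: `zddc_variable_args` and `emit_var` are hypothetical names, and the real `load_zddc_config` is not shown in this diff. It assumes the config path contains a slash, so that `.` does not search `PATH`.

```sh
# Print "--variable" and "name=value" on separate lines when value is non-empty.
emit_var() {
    if [ -n "$2" ]; then
        printf '%s\n' "--variable" "$1=$2"
    fi
}

# Hypothetical stand-in for the scripts' config loader. Runs in a subshell so
# the sourced assignments do not leak into the caller.
zddc_variable_args() {
    (
        contractor="" client="" project="" project_number=""
        if [ -f "$1" ]; then
            . "$1"
        fi
        emit_var contractor "$contractor"
        emit_var client "$client"
        emit_var project "$project"
        emit_var project_number "$project_number"
    )
}

# Demo with a throw-away config file; the values are invented.
conf=$(mktemp)
printf '%s\n' 'client="Client Name"' 'project_number="AR 28088"' 'title="ignored"' > "$conf"
zddc_variable_args "$conf"
rm -f "$conf"
# prints:
#   --variable
#   client=Client Name
#   --variable
#   project_number=AR 28088
```

Any other assignment in the file (here `title`) is sourced but never emitted, which matches the README's statement that only the four variables are read.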
--- a/pandoc/convert
+++ b/pandoc/convert
@ -124,6 +124,23 @@ SUCCESSFUL=0
 FAILED=0
 SKIPPED=0

+# Parse a ZDDC filename stem (no extension) into ZDDC_TRACKING / ZDDC_REVISION /
+# ZDDC_STATUS / ZDDC_TITLE. Returns 0 on a full match, 1 otherwise.
+# Each field is extracted with its own sed backref rather than a delimiter-joined
+# string + cut, so a title containing the join character (e.g. '|') can't corrupt
+# the split.
+parse_zddc_filename() {
+    local stem="$1"
+    local sub='s/^\([^_]*\)_\([^ ]*\) *(\([^)]*\)) *- *\(.*\)$'
+    # Gate on a full match before extracting (empty fields are otherwise ambiguous).
+    printf '%s\n' "$stem" | grep -Eq '^[^_]+_[^ ]+ *\([^)]*\) *- *.+$' || return 1
+    ZDDC_TRACKING=$(printf '%s\n' "$stem" | sed -n "${sub}/\\1/p")
+    ZDDC_REVISION=$(printf '%s\n' "$stem" | sed -n "${sub}/\\2/p")
+    ZDDC_STATUS=$(printf '%s\n'  "$stem" | sed -n "${sub}/\\3/p")
+    ZDDC_TITLE=$(printf '%s\n'   "$stem" | sed -n "${sub}/\\4/p")
+    return 0
+}
+
 # Function to convert DOCX to Markdown
 convert_docx_to_md() {
    local INPUT="$1"
@ -137,14 +154,12 @@ convert_docx_to_md() {
    if pandoc -f docx -t gfm --markdown-headings=atx --extract-media="$MEDIA_DIR" --wrap=none --standalone "$INPUT" -o "$TEMP_FILE"; then
        
        # Parse ZDDC filename pattern: trackingNumber_revision (status) - title.extension
-        # Use sed to extract ZDDC components
-        ZDDC_MATCH=$(echo "$FILENAME_NO_EXT" | sed -n 's/^\([^_]*\)_\([^ ]*\) *(\([^)]*\)) *- *\(.*\)$/\1|\2|\3|\4/p')
-        if [ -n "$ZDDC_MATCH" ]; then
-            TRACKING_NUMBER=$(echo "$ZDDC_MATCH" | cut -d'|' -f1)
-            REVISION=$(echo "$ZDDC_MATCH" | cut -d'|' -f2)
-            STATUS=$(echo "$ZDDC_MATCH" | cut -d'|' -f3)
-            TITLE=$(echo "$ZDDC_MATCH" | cut -d'|' -f4)
-            
+        if parse_zddc_filename "$FILENAME_NO_EXT"; then
+            TRACKING_NUMBER="$ZDDC_TRACKING"
+            REVISION="$ZDDC_REVISION"
+            STATUS="$ZDDC_STATUS"
+            TITLE="$ZDDC_TITLE"
+
            echo "  → ZDDC metadata detected:"
            echo "    • Tracking: $TRACKING_NUMBER"
            echo "    • Revision: $REVISION"
@ -154,8 +169,8 @@ convert_docx_to_md() {
            # Create YAML front matter and combine with content
            {
                echo "---"
-                echo "client: \"${CLIENT:-}\""
-                echo "project: \"${PROJECT:-}\""
+                echo "client: \"${client:-}\""
+                echo "project: \"${project:-}\""
                echo "tracking_number: \"$TRACKING_NUMBER\""
                echo "revision: \"$REVISION\""
                echo "status: \"$STATUS\""
@ -293,8 +308,8 @@ convert_md_to_html() {
    ORIGINAL_DIR=$(pwd)
    cd "$INPUT_DIR"
    
-    # Build pandoc command using positional arguments (safe approach, no eval)
-    # Space-separated argument array, to avoid shell injection
+    # Build pandoc command as an argument array (safe form, no eval — each value
+    # is a separate array element so it can't be re-split or injected by the shell).
    PANDOC_ARGS=()
    PANDOC_ARGS+=("--from" "markdown+yaml_metadata_block")
    PANDOC_ARGS+=("--standalone")
@ -315,13 +330,12 @@ convert_md_to_html() {
    
    # Extract ZDDC metadata from filename for template variables
    FILENAME_NO_EXT=$(basename "$INPUT" .md)
-    ZDDC_MATCH=$(echo "$FILENAME_NO_EXT" | sed -n 's/^\([^_]*\)_\([^ ]*\) *(\([^)]*\)) *- *\(.*\)$/\1|\2|\3|\4/p')
-    if [ -n "$ZDDC_MATCH" ]; then
-        TRACKING_NUMBER=$(echo "$ZDDC_MATCH" | cut -d'|' -f1)
-        REVISION=$(echo "$ZDDC_MATCH" | cut -d'|' -f2)
-        STATUS=$(echo "$ZDDC_MATCH" | cut -d'|' -f3)
-        TITLE=$(echo "$ZDDC_MATCH" | cut -d'|' -f4)
-        
+    if parse_zddc_filename "$FILENAME_NO_EXT"; then
+        TRACKING_NUMBER="$ZDDC_TRACKING"
+        REVISION="$ZDDC_REVISION"
+        STATUS="$ZDDC_STATUS"
+        TITLE="$ZDDC_TITLE"
+
        # Pass ZDDC variables to template (each as separate args to avoid injection)
        PANDOC_ARGS+=("--variable" "tracking_number=$TRACKING_NUMBER")
        PANDOC_ARGS+=("--variable" "revision=$REVISION")
@ -357,11 +371,10 @@ convert_md_to_html() {
        PANDOC_ARGS+=("--variable" "no-toc=true")
    fi
    
-    PANDOC_ARGS+=("--section-divs")
-    PANDOC_ARGS+=("--id-prefix=")
+    # (--section-divs already added above)
    PANDOC_ARGS+=("--html-q-tags")
-    
-    # Run pandoc with positional arguments (safe approach)
+
+    # Run pandoc with positional arguments (safe form, no eval)
    # All variables passed as separate arguments to avoid shell injection
    if pandoc "$(basename "$INPUT_ABS")" -o "$OUTPUT_ABS" "${PANDOC_ARGS[@]}"; then
        
--- a/pandoc/convert-diff
+++ b/pandoc/convert-diff
@ -11,7 +11,7 @@ NO_TOC=false
 show_help() {
    echo "Batch Markdown Diff Converter"
    echo "Compares pairs of markdown files and outputs HTML diffs using the same template as convert script"
-    echo "Usage: $0 [-f] [-o outputdir] [-T template] [--no-toc] file1_rev_a.md file1_rev_b.md [file2_rev_a.md file1_rev_b.md ...]"
+    echo "Usage: $0 [-f] [-o outputdir] [-T template] [--no-toc] file1_rev_a.md file1_rev_b.md [file2_rev_a.md file2_rev_b.md ...]"
    echo "  -f: Force overwrite existing output files"
    echo "  -o: Output directory (default: same as first input file)"
    echo "  -T: Template file path (default: viewer-template.html)"
@ -350,11 +350,10 @@ while [ $# -gt 0 ]; do
    fi
    
    # Load ZDDC configuration from first file's directory
+    # (load_zddc_config logs the path itself, but only when a config is found)
    FILE1_DIR=$(dirname "$FILE1")
    load_zddc_config "$FILE1_DIR"
-    
-    echo "  → Loading ZDDC configuration from: $FILE1_DIR/zddc.conf"
-    
+
    # Determine template to use
    TEMPLATE_ABS=""
    if [ -n "$CUSTOM_TEMPLATE" ]; then
@ -423,11 +422,7 @@ while [ $# -gt 0 ]; do
    
    echo "  ✓ Diff generated successfully"
    echo "Stage 2: Adding TOC and styling with pandoc..."
-    
-    # Extract revision info from filenames for metadata
-    REV_A=$(basename "$FILE1" .md | sed 's/.*_\([^_]*\)$/\1/')
-    REV_B=$(basename "$FILE2" .md | sed 's/.*_\([^_]*\)$/\1/')
-    
+
    # Extract metadata from both files (safe - no eval, uses heredoc)
    {
        # Extract YAML frontmatter and parse fields safely
@ -437,7 +432,6 @@ while [ $# -gt 0 ]; do
        rev1_revision=$(grep '^revision:' "$TEMP_METADATA_REV1" | sed 's/^revision: *"\(.*\)"$/\1/' | head -1)
        rev1_status=$(grep '^status:' "$TEMP_METADATA_REV1" | sed 's/^status: *"\(.*\)"$/\1/' | head -1)
        rev1_project=$(grep '^project:' "$TEMP_METADATA_REV1" | sed 's/^project: *"\(.*\)"$/\1/' | head -1)
-        rev1_date=$(grep '^date:' "$TEMP_METADATA_REV1" | sed 's/^date: *"\(.*\)"$/\1/' | head -1)
    }
    {
        awk '/^---$/{if(NR==1){p=1}else{p=0}} p && !/^---$/{print}' "$FILE2" > "$TEMP_METADATA_REV2"
@ -446,7 +440,6 @@ while [ $# -gt 0 ]; do
        rev2_revision=$(grep '^revision:' "$TEMP_METADATA_REV2" | sed 's/^revision: *"\(.*\)"$/\1/' | head -1)
        rev2_status=$(grep '^status:' "$TEMP_METADATA_REV2" | sed 's/^status: *"\(.*\)"$/\1/' | head -1)
        rev2_project=$(grep '^project:' "$TEMP_METADATA_REV2" | sed 's/^project: *"\(.*\)"$/\1/' | head -1)
-        rev2_date=$(grep '^date:' "$TEMP_METADATA_REV2" | sed 's/^date: *"\(.*\)"$/\1/' | head -1)
    }
    
    # Clean up metadata temp files
@ -456,8 +449,9 @@ while [ $# -gt 0 ]; do
    generate_diff_header() {
        local header_html=""
        
-        # Project title (should be same for both)
-        header_html="<div class=\"header-line client-project\">$rev2_project (AR 28088)</div>"
+        # Project title (should be same for both). Append the project number from
+        # zddc.conf when set, e.g. "Project Name (AR 28088)"; omit the parens otherwise.
+        header_html="<div class=\"header-line client-project\">${rev2_project}${project_number:+ ($project_number)}</div>"
        
        # Document title with diff
        if [ "$rev1_title" != "$rev2_title" ]; then
@ -490,7 +484,7 @@ while [ $# -gt 0 ]; do
        
        # Add draft marker if revision contains ~
        if echo "$rev2_revision" | grep -q "~"; then
-            header_html="$header_html<div class=\"header-line metadata-line draft-line\"><span class=\"draft-status\">[DRAFT Generated at $(date '+%B %d, %Y at %I:%M:%S %p %Z')]</span></div>"
+            header_html="$header_html<div class=\"header-line metadata-line draft-line\"><span class=\"draft-status\">[DRAFT Generated at $(LC_TIME=C date '+%B %d, %Y at %I:%M:%S %p %Z')]</span></div>"
        fi
        
        echo "$header_html"
@ -498,23 +492,29 @@ while [ $# -gt 0 ]; do
    
    DIFF_HEADER_HTML=$(generate_diff_header)
    
-    # Generate timestamp for conversion
-    GENERATION_TIME=$(date '+%B %d, %Y at %I:%M:%S %p %Z')
-    
+    # Generate timestamp for conversion (force English locale, matching convert)
+    GENERATION_TIME=$(LC_TIME=C date '+%B %d, %Y at %I:%M:%S %p %Z')
+
    # Set resource path to second file directory for resource resolution
    FILE2_DIR=$(dirname "$FILE2")
-    
-    # Escape HTML for safe shell usage
-    ESCAPED_HEADER_HTML=$(printf '%s' "$DIFF_HEADER_HTML" | sed 's/"/\\"/g')
-    
-    # Build pandoc command as array (not string with eval)
+
+    # Build pandoc command as array (not string with eval). Header HTML is passed
+    # as a single array element below, so no shell escaping is needed — escaping the
+    # quotes here would leak backslashes into the rendered output.
    PANDOC_ARGS=(
        "pandoc" "$TEMP_DIFF" "-o" "$OUTPUT_FILE"
        "--from" "html"
        "--standalone"
-        "--template=$TEMPLATE_ABS"
    )
-    
+
+    # Only pass --template when one was actually found; pandoc errors on an empty
+    # --template= value, so fall back to its default template otherwise.
+    if [ -n "$TEMPLATE_ABS" ]; then
+        PANDOC_ARGS+=("--template=$TEMPLATE_ABS")
+    else
+        echo "  ⚠ Warning: viewer-template.html not found, using pandoc default template"
+    fi
+
    # Add TOC args if not disabled
    if [ "$NO_TOC" != "true" ]; then
        PANDOC_ARGS+=("--toc" "--toc-depth=3")
@ -526,7 +526,7 @@ while [ $# -gt 0 ]; do
        "--metadata" "title=$rev2_title"
        "--metadata" "generation_time=$GENERATION_TIME"
        "--metadata" "diff_mode=true"
-        "--metadata" "custom_header=$ESCAPED_HEADER_HTML"
+        "--metadata" "custom_header=$DIFF_HEADER_HTML"
    )
    
    # Add ZDDC configuration variables from zddc.conf (only once)
@ -548,7 +548,7 @@ while [ $# -gt 0 ]; do
        PANDOC_ARGS+=("--variable" "no-toc=true")
    fi
    
-    PANDOC_ARGS+=("--section-divs" "--id-prefix=" "--html-q-tags")
+    PANDOC_ARGS+=("--section-divs" "--html-q-tags")
    
    # Execute pandoc via array (no eval)
    if "${PANDOC_ARGS[@]}"; then
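
The `${project_number:+ ($project_number)}` expansion used in `generate_diff_header` appends the parenthesized number only when the variable is set and non-empty. A minimal sketch of that behavior, using a hypothetical `project_header_line` function and sample values:

```sh
# $1 = project name, $2 = project number (may be empty or omitted).
# ${2:+ ($2)} expands to " (<number>)" only when $2 is set and non-empty.
project_header_line() {
    printf '<div class="header-line client-project">%s</div>\n' "$1${2:+ ($2)}"
}

project_header_line "Project Name" "AR 28088"
project_header_line "Project Name" ""
# prints:
#   <div class="header-line client-project">Project Name (AR 28088)</div>
#   <div class="header-line client-project">Project Name</div>
```

The `:+` form avoids the empty `()` that plain concatenation would leave behind when `project_number` is missing from `zddc.conf`.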
--- a/pandoc/index.sh
+++ b/pandoc/index.sh
@ -59,15 +59,21 @@ done
 mkdir -p "$OUTPUT_DIR"

 # Function to get relative path from $1 (base dir) to $2 (target path)
-# Uses Python for portability (works on both GNU and BSD systems)
+# Prefers python3 for portability (works on both GNU and BSD systems). Paths are
+# passed as argv, not interpolated into the -c source, so quotes/specials in a
+# path can't break or inject into the Python snippet.
 relative_path() {
    local base_dir="$1"
    local target_path="$2"
-    
+
    if command -v python3 >/dev/null 2>&1; then
-        python3 -c "import os; print(os.path.relpath('$target_path', '$base_dir'))"
+        python3 -c 'import os, sys; print(os.path.relpath(sys.argv[1], sys.argv[2]))' \
+            "$target_path" "$base_dir"
+    elif realpath --relative-to=/ / >/dev/null 2>&1; then
+        # GNU realpath supports --relative-to; keep symlink targets relative.
+        realpath --relative-to="$base_dir" "$target_path"
    else
-        # Fallback: use absolute paths if python3 not available
+        # Last resort: absolute path (still a valid symlink target, just not relative).
        realpath "$target_path"
    fi
 }
@ -265,9 +271,13 @@ EOF
        
        # Create truncated SHA256 for display
        sha256_short="${sha256:0:6}...${sha256: -6}"
-        
+
+        # Escape pipe chars so a title/status containing '|' can't break the table row
+        md_title=$(printf '%s' "$doc_title" | sed 's/|/\\|/g')
+        md_status=$(printf '%s' "$status" | sed 's/|/\\|/g')
+
        # Add to markdown table
-        echo "| $row_counter | $tracking_link | $doc_title | $revision_link | $status | <span class=\"sha256\" title=\"$sha256\">$sha256_short</span> |" >> "$index_md_file"
+        echo "| $row_counter | $tracking_link | $md_title | $revision_link | $md_status | <span class=\"sha256\" title=\"$sha256\">$sha256_short</span> |" >> "$index_md_file"
        
        echo "  $filename -> symlinks created"
    done < <(find "$folder" -maxdepth 1 \( -type f -o -type l \) -print0)
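
The pipe escaping added to `index.sh` can be tried in isolation. The sketch below wraps the same `sed` expression in a hypothetical `md_escape_pipes` function; the title and status values are invented.

```sh
# Replace every '|' with '\|' so the value cannot end a markdown table cell.
md_escape_pipes() {
    printf '%s' "$1" | sed 's/|/\\|/g'
}

doc_title='Pumps | Valves'   # hypothetical title containing the cell delimiter
status='IFR'

printf '| 1 | %s | %s |\n' "$(md_escape_pipes "$doc_title")" "$(md_escape_pipes "$status")"
# prints: | 1 | Pumps \| Valves | IFR |
```

Without the escape, the same row would carry one extra cell and shift the status column to the right.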