diff --git a/pandoc/README.md b/pandoc/README.md index d161e91..3b8eeca 100644 --- a/pandoc/README.md +++ b/pandoc/README.md @@ -4,41 +4,52 @@ A collection of tools for converting Markdown documents to HTML with a professio ## Server-side conversion (`zddc-server`) -zddc-server can offer the same conversions on demand: a `.md` file in any -served directory becomes downloadable as `.docx`, `.html`, and `.pdf` via the -`?convert=` query parameter, surfaced as Download buttons in the browse app's -markdown editor. +> The shell scripts in this folder are standalone CLI/batch tools. `zddc-server` +> implements its **own** on-demand conversion (Go package `zddc/internal/convert`) +> and does **not** call these scripts. It does, however, reuse the same +> `viewer-template.html` and `custom.css` (embedded at build time). See +> AGENTS.md → "Server-side document conversion" for the authoritative reference. -The server shells out to two upstream container images, pulling each on -first use via `--pull=missing`. No custom image build is required — -operators just install `podman` (preferred) or `docker`, and the first -conversion request pulls the image: +zddc-server can render any served `.md` on demand: requesting the sibling URL +`/foo.docx` (or `.html` / `.pdf`) returns the converted bytes — no query +string. A real on-disk file of that name always wins; the virtual conversion +only fires when the requested file doesn't exist but `foo.md` does. The browse +app's markdown editor surfaces these as DOCX/HTML/PDF download links (auto-saving +a dirty buffer first so the output matches what's on screen). 
-- `docker.io/pandoc/latex:latest` — MD → DOCX and MD → HTML - (override: `--convert-pandoc-image=` or `ZDDC_CONVERT_PANDOC_IMAGE`; - switch to `docker.io/pandoc/core:latest` for a ~90% size reduction - if you don't need pandoc's native LaTeX-PDF path) -- `docker.io/zenika/alpine-chrome:latest` — HTML → PDF - (override: `--convert-chromium-image=` or `ZDDC_CONVERT_CHROMIUM_IMAGE`) +**Architecture.** The Go code does the minimum — it `exec`s `pandoc` and +`chromium-browser` directly. The sandbox and resource caps live in the runtime +**image**, where `/usr/local/bin/{pandoc,chromium-browser}` are wrapper scripts +that run the real binary inside a per-conversion bubblewrap sandbox +(`--unshare-all`, read-only binds, `--tmpfs /tmp`, `--clearenv`) under cgroup v2 +memory/PID caps. I/O is via stdin/stdout plus a per-call scratch dir. There is no +container runtime and no image pulling at request time. The PDF flow is two-stage: pandoc renders the markdown through -`viewer-template.html` to standalone HTML, then headless Chromium -prints that HTML to PDF. This preserves the existing print-media CSS -authored for the viewer template rather than going through pandoc's -LaTeX template. +`viewer-template.html` to standalone HTML, then headless Chromium prints that HTML +to PDF — preserving the viewer template's print-media CSS rather than going +through pandoc's LaTeX template. -If neither podman nor docker is on PATH the endpoint serves 503 with -a clear "no container runtime" message. Engine choice is overridable -via `--convert-engine=` or `ZDDC_CONVERT_ENGINE`. +Converted bytes are cached at `/.zddc.d/converted/.` with mtime +synced to the source, so a fresh cache hit is a stat-and-serve with no `exec`. +A PUT/DELETE/MOVE on the source `.md` purges the sidecars. Per-project header +metadata (client/project/contractor/project_number) comes from the `.zddc` +`convert:` cascade; title/tracking_number/revision/status are derived from the +filename via `zddc.ParseFilename`. 
-Resource limits are per-container and configurable: `--convert-mem-mib` -(default 512), `--convert-cpus` (default "2"), `--convert-pids` -(default 100), `--convert-timeout` (default 30s). +Relevant flags (defaults in parens): -Each conversion runs in a throw-away container with -`--rm --network=none --read-only --tmpfs=/tmp --cap-drop=ALL ---security-opt=no-new-privileges` plus a bind-mounted scratch dir -for I/O (read-only for the template; read-write for the PDF output). +- `--convert-pandoc-binary` (`pandoc`) / `--convert-chromium-binary` + (`chromium-browser`; `chromium` on Debian) — PATH-resolved name or absolute path +- `--convert-scratch-dir` (`$TMPDIR`) — host scratch root for template + intermediates +- `--convert-mem-mib` (`1024`) — per-conversion memory cap (cgroup `memory.max`) +- `--convert-pids` (`256`) — per-conversion PID cap (cgroup `pids.max`) +- `--convert-timeout` (`60s`) — per-conversion wall clock (Go `context.WithTimeout`) + +If `pandoc`/`chromium` aren't on PATH (e.g. running zddc-server outside the runtime +image) the endpoint serves 503 with a `Retry-After`; the rest of the server keeps +working. Running against raw pandoc/chromium with no wrapper gives a working but +**unsandboxed** endpoint — fine for dev iteration. ## Features @@ -80,20 +91,18 @@ for I/O (read-only for the template; read-write for the PDF output). ``` ### Configuration (`zddc.conf`) -Create a `zddc.conf` file in your project directory: -```ini -# Project metadata -title = "Project Documentation" -author = "Your Organization" -date = "2024" - -# Template settings -template = "/path/to/viewer-template.html" -css = "custom-styles.css" - -# Output settings -output_dir = "rendered" +Create a `zddc.conf` file in your project directory. It is **sourced as shell**, +so use `var="value"` syntax (no spaces around `=`). 
Only these four variables are +read; all are optional and feed the document header via pandoc `--variable`: +```sh +contractor="Contractor Name" # contracting organization (header) +client="Client Name" # client org (header, paired with project) +project="Project Name" # full project name +project_number="AR 28088" # shown in parentheses after the project name ``` +The template path is discovered automatically (input dir → script dir → +symlink target) or set per-run with `-T`; the output directory is set with `-o`. +They are **not** `zddc.conf` keys. ### Directory Structure ``` @@ -157,8 +166,10 @@ fi ## File Types Supported -- **Input**: Markdown (`.md`) files with pandoc extensions -- **Output**: HTML files with embedded CSS and JavaScript +- **Input**: Markdown (`.md`), DOCX (`.docx`), and HTML (`.html`/`.htm`) files + (auto-detected: DOCX→MD, MD→HTML, HTML→MD; override with `-t md|html|docx`). + Direct DOCX→HTML is not supported — convert to MD first. +- **Output**: HTML files with embedded CSS and JavaScript (plus MD and DOCX targets) - **Images**: Supports embedded images and diagrams - **Tables**: Full table support with print optimization - **Code**: Syntax highlighting for code blocks diff --git a/pandoc/convert b/pandoc/convert index 7083394..932173d 100644 --- a/pandoc/convert +++ b/pandoc/convert @@ -124,6 +124,23 @@ SUCCESSFUL=0 FAILED=0 SKIPPED=0 +# Parse a ZDDC filename stem (no extension) into ZDDC_TRACKING / ZDDC_REVISION / +# ZDDC_STATUS / ZDDC_TITLE. Returns 0 on a full match, 1 otherwise. +# Each field is extracted with its own sed backref rather than a delimiter-joined +# string + cut, so a title containing the join character (e.g. '|') can't corrupt +# the split. +parse_zddc_filename() { + local stem="$1" + local sub='s/^\([^_]*\)_\([^ ]*\) *(\([^)]*\)) *- *\(.*\)$' + # Gate on a full match before extracting (empty fields are otherwise ambiguous). 
+ printf '%s\n' "$stem" | grep -Eq '^[^_]+_[^ ]+ *\([^)]*\) *- *.+$' || return 1 + ZDDC_TRACKING=$(printf '%s\n' "$stem" | sed -n "${sub}/\\1/p") + ZDDC_REVISION=$(printf '%s\n' "$stem" | sed -n "${sub}/\\2/p") + ZDDC_STATUS=$(printf '%s\n' "$stem" | sed -n "${sub}/\\3/p") + ZDDC_TITLE=$(printf '%s\n' "$stem" | sed -n "${sub}/\\4/p") + return 0 +} + # Function to convert DOCX to Markdown convert_docx_to_md() { local INPUT="$1" @@ -137,14 +154,12 @@ convert_docx_to_md() { if pandoc -f docx -t gfm --markdown-headings=atx --extract-media="$MEDIA_DIR" --wrap=none --standalone "$INPUT" -o "$TEMP_FILE"; then # Parse ZDDC filename pattern: trackingNumber_revision (status) - title.extension - # Use sed to extract ZDDC components - ZDDC_MATCH=$(echo "$FILENAME_NO_EXT" | sed -n 's/^\([^_]*\)_\([^ ]*\) *(\([^)]*\)) *- *\(.*\)$/\1|\2|\3|\4/p') - if [ -n "$ZDDC_MATCH" ]; then - TRACKING_NUMBER=$(echo "$ZDDC_MATCH" | cut -d'|' -f1) - REVISION=$(echo "$ZDDC_MATCH" | cut -d'|' -f2) - STATUS=$(echo "$ZDDC_MATCH" | cut -d'|' -f3) - TITLE=$(echo "$ZDDC_MATCH" | cut -d'|' -f4) - + if parse_zddc_filename "$FILENAME_NO_EXT"; then + TRACKING_NUMBER="$ZDDC_TRACKING" + REVISION="$ZDDC_REVISION" + STATUS="$ZDDC_STATUS" + TITLE="$ZDDC_TITLE" + echo " → ZDDC metadata detected:" echo " • Tracking: $TRACKING_NUMBER" echo " • Revision: $REVISION" @@ -154,8 +169,8 @@ convert_docx_to_md() { # Create YAML front matter and combine with content { echo "---" - echo "client: \"${CLIENT:-}\"" - echo "project: \"${PROJECT:-}\"" + echo "client: \"${client:-}\"" + echo "project: \"${project:-}\"" echo "tracking_number: \"$TRACKING_NUMBER\"" echo "revision: \"$REVISION\"" echo "status: \"$STATUS\"" @@ -293,8 +308,8 @@ convert_md_to_html() { ORIGINAL_DIR=$(pwd) cd "$INPUT_DIR" - # Build pandoc command using positional arguments (安全方式,无 eval) - # 以空格分隔的参数数组,避免 shell 注入 + # Build pandoc command as an argument array (safe form, no eval — each value + # is a separate array element so it can't be re-split or 
injected by the shell). PANDOC_ARGS=() PANDOC_ARGS+=("--from" "markdown+yaml_metadata_block") PANDOC_ARGS+=("--standalone") @@ -315,13 +330,12 @@ convert_md_to_html() { # Extract ZDDC metadata from filename for template variables FILENAME_NO_EXT=$(basename "$INPUT" .md) - ZDDC_MATCH=$(echo "$FILENAME_NO_EXT" | sed -n 's/^\([^_]*\)_\([^ ]*\) *(\([^)]*\)) *- *\(.*\)$/\1|\2|\3|\4/p') - if [ -n "$ZDDC_MATCH" ]; then - TRACKING_NUMBER=$(echo "$ZDDC_MATCH" | cut -d'|' -f1) - REVISION=$(echo "$ZDDC_MATCH" | cut -d'|' -f2) - STATUS=$(echo "$ZDDC_MATCH" | cut -d'|' -f3) - TITLE=$(echo "$ZDDC_MATCH" | cut -d'|' -f4) - + if parse_zddc_filename "$FILENAME_NO_EXT"; then + TRACKING_NUMBER="$ZDDC_TRACKING" + REVISION="$ZDDC_REVISION" + STATUS="$ZDDC_STATUS" + TITLE="$ZDDC_TITLE" + # Pass ZDDC variables to template (each as separate args to avoid injection) PANDOC_ARGS+=("--variable" "tracking_number=$TRACKING_NUMBER") PANDOC_ARGS+=("--variable" "revision=$REVISION") @@ -357,11 +371,10 @@ convert_md_to_html() { PANDOC_ARGS+=("--variable" "no-toc=true") fi - PANDOC_ARGS+=("--section-divs") - PANDOC_ARGS+=("--id-prefix=") + # (--section-divs already added above) PANDOC_ARGS+=("--html-q-tags") - - # Run pandoc with positional arguments (安全方式) + + # Run pandoc with positional arguments (safe form, no eval) # All variables passed as separate arguments to avoid shell injection if pandoc "$(basename "$INPUT_ABS")" -o "$OUTPUT_ABS" "${PANDOC_ARGS[@]}"; then diff --git a/pandoc/convert-diff b/pandoc/convert-diff index d4e6b68..98a6aad 100644 --- a/pandoc/convert-diff +++ b/pandoc/convert-diff @@ -11,7 +11,7 @@ NO_TOC=false show_help() { echo "Batch Markdown Diff Converter" echo "Compares pairs of markdown files and outputs HTML diffs using the same template as convert script" - echo "Usage: $0 [-f] [-o outputdir] [-T template] [--no-toc] file1_rev_a.md file1_rev_b.md [file2_rev_a.md file1_rev_b.md ...]" + echo "Usage: $0 [-f] [-o outputdir] [-T template] [--no-toc] file1_rev_a.md 
file1_rev_b.md [file2_rev_a.md file2_rev_b.md ...]" echo " -f: Force overwrite existing output files" echo " -o: Output directory (default: same as first input file)" echo " -T: Template file path (default: viewer-template.html)" @@ -350,11 +350,10 @@ while [ $# -gt 0 ]; do fi # Load ZDDC configuration from first file's directory + # (load_zddc_config logs the path itself, but only when a config is found) FILE1_DIR=$(dirname "$FILE1") load_zddc_config "$FILE1_DIR" - - echo " → Loading ZDDC configuration from: $FILE1_DIR/zddc.conf" - + # Determine template to use TEMPLATE_ABS="" if [ -n "$CUSTOM_TEMPLATE" ]; then @@ -423,11 +422,7 @@ while [ $# -gt 0 ]; do echo " ✓ Diff generated successfully" echo "Stage 2: Adding TOC and styling with pandoc..." - - # Extract revision info from filenames for metadata - REV_A=$(basename "$FILE1" .md | sed 's/.*_\([^_]*\)$/\1/') - REV_B=$(basename "$FILE2" .md | sed 's/.*_\([^_]*\)$/\1/') - + # Extract metadata from both files (safe - no eval, uses heredoc) { # Extract YAML frontmatter and parse fields safely @@ -437,7 +432,6 @@ while [ $# -gt 0 ]; do rev1_revision=$(grep '^revision:' "$TEMP_METADATA_REV1" | sed 's/^revision: *"\(.*\)"$/\1/' | head -1) rev1_status=$(grep '^status:' "$TEMP_METADATA_REV1" | sed 's/^status: *"\(.*\)"$/\1/' | head -1) rev1_project=$(grep '^project:' "$TEMP_METADATA_REV1" | sed 's/^project: *"\(.*\)"$/\1/' | head -1) - rev1_date=$(grep '^date:' "$TEMP_METADATA_REV1" | sed 's/^date: *"\(.*\)"$/\1/' | head -1) } { awk '/^---$/{if(NR==1){p=1}else{p=0}} p && !/^---$/{print}' "$FILE2" > "$TEMP_METADATA_REV2" @@ -446,7 +440,6 @@ while [ $# -gt 0 ]; do rev2_revision=$(grep '^revision:' "$TEMP_METADATA_REV2" | sed 's/^revision: *"\(.*\)"$/\1/' | head -1) rev2_status=$(grep '^status:' "$TEMP_METADATA_REV2" | sed 's/^status: *"\(.*\)"$/\1/' | head -1) rev2_project=$(grep '^project:' "$TEMP_METADATA_REV2" | sed 's/^project: *"\(.*\)"$/\1/' | head -1) - rev2_date=$(grep '^date:' "$TEMP_METADATA_REV2" | sed 's/^date: 
*"\(.*\)"$/\1/' | head -1) } # Clean up metadata temp files @@ -456,8 +449,9 @@ while [ $# -gt 0 ]; do generate_diff_header() { local header_html="" - # Project title (should be same for both) - header_html="
$rev2_project (AR 28088)
" + # Project title (should be same for both). Append the project number from + # zddc.conf when set, e.g. "Project Name (AR 28088)"; omit the parens otherwise. + header_html="
${rev2_project}${project_number:+ ($project_number)}
" # Document title with diff if [ "$rev1_title" != "$rev2_title" ]; then @@ -490,7 +484,7 @@ while [ $# -gt 0 ]; do # Add draft marker if revision contains ~ if echo "$rev2_revision" | grep -q "~"; then - header_html="$header_html
[DRAFT Generated at $(date '+%B %d, %Y at %I:%M:%S %p %Z')]
" + header_html="$header_html
[DRAFT Generated at $(LC_TIME=C date '+%B %d, %Y at %I:%M:%S %p %Z')]
" fi echo "$header_html" @@ -498,23 +492,29 @@ while [ $# -gt 0 ]; do DIFF_HEADER_HTML=$(generate_diff_header) - # Generate timestamp for conversion - GENERATION_TIME=$(date '+%B %d, %Y at %I:%M:%S %p %Z') - + # Generate timestamp for conversion (force English locale, matching convert) + GENERATION_TIME=$(LC_TIME=C date '+%B %d, %Y at %I:%M:%S %p %Z') + # Set resource path to second file directory for resource resolution FILE2_DIR=$(dirname "$FILE2") - - # Escape HTML for safe shell usage - ESCAPED_HEADER_HTML=$(printf '%s' "$DIFF_HEADER_HTML" | sed 's/"/\\"/g') - - # Build pandoc command as array (not string with eval) + + # Build pandoc command as array (not string with eval). Header HTML is passed + # as a single array element below, so no shell escaping is needed — escaping the + # quotes here would leak backslashes into the rendered output. PANDOC_ARGS=( "pandoc" "$TEMP_DIFF" "-o" "$OUTPUT_FILE" "--from" "html" "--standalone" - "--template=$TEMPLATE_ABS" ) - + + # Only pass --template when one was actually found; pandoc errors on an empty + # --template= value, so fall back to its default template otherwise. 
+ if [ -n "$TEMPLATE_ABS" ]; then + PANDOC_ARGS+=("--template=$TEMPLATE_ABS") + else + echo " ⚠ Warning: viewer-template.html not found, using pandoc default template" + fi + # Add TOC args if not disabled if [ "$NO_TOC" != "true" ]; then PANDOC_ARGS+=("--toc" "--toc-depth=3") @@ -526,7 +526,7 @@ while [ $# -gt 0 ]; do "--metadata" "title=$rev2_title" "--metadata" "generation_time=$GENERATION_TIME" "--metadata" "diff_mode=true" - "--metadata" "custom_header=$ESCAPED_HEADER_HTML" + "--metadata" "custom_header=$DIFF_HEADER_HTML" ) # Add ZDDC configuration variables from zddc.conf (only once) @@ -548,7 +548,7 @@ while [ $# -gt 0 ]; do PANDOC_ARGS+=("--variable" "no-toc=true") fi - PANDOC_ARGS+=("--section-divs" "--id-prefix=" "--html-q-tags") + PANDOC_ARGS+=("--section-divs" "--html-q-tags") # Execute pandoc via array (no eval) if "${PANDOC_ARGS[@]}"; then diff --git a/pandoc/index.sh b/pandoc/index.sh index a42e60c..4f31986 100644 --- a/pandoc/index.sh +++ b/pandoc/index.sh @@ -59,15 +59,21 @@ done mkdir -p "$OUTPUT_DIR" # Function to get relative path from $1 (base dir) to $2 (target path) -# Uses Python for portability (works on both GNU and BSD systems) +# Prefers python3 for portability (works on both GNU and BSD systems). Paths are +# passed as argv, not interpolated into the -c source, so quotes/specials in a +# path can't break or inject into the Python snippet. relative_path() { local base_dir="$1" local target_path="$2" - + if command -v python3 >/dev/null 2>&1; then - python3 -c "import os; print(os.path.relpath('$target_path', '$base_dir'))" + python3 -c 'import os, sys; print(os.path.relpath(sys.argv[1], sys.argv[2]))' \ + "$target_path" "$base_dir" + elif realpath --relative-to=/ / >/dev/null 2>&1; then + # GNU realpath supports --relative-to; keep symlink targets relative. 
+ realpath --relative-to="$base_dir" "$target_path" else - # Fallback: use absolute paths if python3 not available + # Last resort: absolute path (still a valid symlink target, just not relative). realpath "$target_path" fi } @@ -265,9 +271,13 @@ EOF # Create truncated SHA256 for display sha256_short="${sha256:0:6}...${sha256: -6}" - + + # Escape pipe chars so a title/status containing '|' can't break the table row + md_title=$(printf '%s' "$doc_title" | sed 's/|/\\|/g') + md_status=$(printf '%s' "$status" | sed 's/|/\\|/g') + # Add to markdown table - echo "| $row_counter | $tracking_link | $doc_title | $revision_link | $status | $sha256_short |" >> "$index_md_file" + echo "| $row_counter | $tracking_link | $md_title | $revision_link | $md_status | $sha256_short |" >> "$index_md_file" echo " $filename -> symlinks created" done < <(find "$folder" -maxdepth 1 \( -type f -o -type l \) -print0)
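
Review note: the per-field extraction introduced in the `pandoc/convert` hunk can be checked in isolation. The sketch below copies `parse_zddc_filename` from that hunk and runs it on two sample stems. Both stems are hypothetical and chosen for illustration (one has a `|` in the title, the case the old `cut -d'|'` split would have truncated); they are not files from the repository, and the indentation is assumed.

```bash
#!/usr/bin/env bash
# Copied from the pandoc/convert hunk: parse a ZDDC filename stem (no extension)
# into ZDDC_TRACKING / ZDDC_REVISION / ZDDC_STATUS / ZDDC_TITLE.
# Returns 0 on a full match, 1 otherwise.
parse_zddc_filename() {
    local stem="$1"
    # BRE: \( \) are capture groups, bare ( ) are the literal parentheses.
    local sub='s/^\([^_]*\)_\([^ ]*\) *(\([^)]*\)) *- *\(.*\)$'
    # Gate on a full match first; on failure the ZDDC_* variables are left untouched.
    printf '%s\n' "$stem" | grep -Eq '^[^_]+_[^ ]+ *\([^)]*\) *- *.+$' || return 1
    # One sed call per field, each printing a single backreference.
    ZDDC_TRACKING=$(printf '%s\n' "$stem" | sed -n "${sub}/\\1/p")
    ZDDC_REVISION=$(printf '%s\n' "$stem" | sed -n "${sub}/\\2/p")
    ZDDC_STATUS=$(printf '%s\n' "$stem" | sed -n "${sub}/\\3/p")
    ZDDC_TITLE=$(printf '%s\n' "$stem" | sed -n "${sub}/\\4/p")
    return 0
}

# Hypothetical stem with no underscore: the grep gate rejects it.
if ! parse_zddc_filename "meeting-notes"; then
    echo "meeting-notes: not a ZDDC filename"
fi

# Hypothetical stem whose title contains the old join character '|'.
if parse_zddc_filename "ABC-001_B (IFR) - Pump | Valve Schedule"; then
    printf 'tracking=%s\n' "$ZDDC_TRACKING"
    printf 'revision=%s\n' "$ZDDC_REVISION"
    printf 'status=%s\n' "$ZDDC_STATUS"
    printf 'title=%s\n' "$ZDDC_TITLE"
fi
```

Because each field comes from its own backreference, no intermediate joined string exists, so no character in the title can be mistaken for a field separator. The cost is four `sed` invocations per filename instead of one, which seems acceptable for a batch CLI tool.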