Commit graph

8 commits

Author SHA1 Message Date
cef7188a77 refactor(convert): wrapper-in-image owns the sandbox; Go just exec's binaries
The bwrap engine + OCI engine that lived in internal/convert/runner.go
both leak isolation policy into Go code. Replaced with a single image-
side wrapper that drop-in-shadows pandoc and chromium-browser on PATH.
zddc-server's only contract with the image is now "exec.Command(name,
args) gets you that tool's behavior" — sandboxing, resource caps, and
namespace setup live entirely in shell scripts shipped by the image.

Architecture:
- zddc/runtime/zddc-cgroup-init runs at container start. cgroup v2's
  "no internal processes" constraint forbids a cgroup from having both
  children and processes; the init script moves PID 1 into a child,
  enables +memory +pids in subtree_control, then exec's zddc-server.
  Best-effort: degrades cleanly to "no resource caps" if cgroupfs
  isn't writable.
- zddc/runtime/zddc-sandbox-exec is the per-call wrapper, symlinked
  from /usr/local/bin/{pandoc,chromium-browser}. Creates a transient
  cgroup v2 (memory.max + pids.max), then bubblewrap-sandboxes the
  real binary at /usr/bin/<name>: --unshare-all, --ro-bind /usr,
  --proc /proc, --tmpfs /tmp, --clearenv. Caller's scratch dir comes
  in via ZDDC_SCRATCH env and is bind-mounted at the SAME path so
  absolute paths round-trip unchanged.

Go simplifications (~250 lines net deletion):
- Runner interface: Run(ctx, binary, stdin, scratchDir, cmd) — no
  ToolSpec, no mount list, no engine concept. Single localRunner
  implementation; bwrapRunner + containerRunner both deleted.
- health.Probe just looks up pandoc + chromium on PATH; Capabilities
  drops engine kinds.
- Convert.go: ToHTML/ToPDF write to a per-call scratch dir under
  TMPDIR and pass absolute paths; the wrapper bind-mounts the dir.
  No more "/tpl" / "/pdf" mount-point indirection.
- Config drops --convert-pandoc-image, --convert-chromium-image,
  --convert-engine, --convert-podman-socket (OCI engine gone) and
  --convert-cpus (CPU caps don't apply in the new model — wall-clock
  + memory + pids is the cap set). Defaults raised to match the new
  caps the user authorized: mem 512→1024 MiB, pids 100→256,
  timeout 30→60 s.

Image:
- zddc/runtime.Containerfile builds the production runtime image
  (alpine + bubblewrap + pandoc + chromium + font-noto). Two
  COPY statements pull in the wrapper scripts; ln -s symlinks the
  shadow names.
- bitnest dev image mirrors this layout under /var/lib/zddc-dev-build/.

Container privilege required:
- Nested bwrap needs the outer container to permit user + mount
  namespace creation + MS_SLAVE on root. The default seccomp +
  AppArmor profiles block all of these. Quadlet adds:
    --cap-add=ALL
    --security-opt=seccomp=unconfined
    --security-opt=apparmor=unconfined
    --security-opt=unmask=ALL
  Helm chart sets the equivalent via securityContext (capabilities.
  add: SYS_ADMIN, seccompProfile.type: Unconfined, appArmorProfile.
  type: Unconfined). Trade-off documented in AGENTS.md: zddc-server
  RCE now has near-root power within the container, but the bind-
  mount layout still bounds blast radius; bwrap is the real boundary
  between zddc-server and untrusted markdown.

Tests: convert_test.go fully rewritten for the new Runner signature.
Drops TestBwrapArgs_* (functionality moved out of Go) and
TestImageTag (no more image refs). All 15 Go test packages green.

Verified live on bitnest: pandoc --version round-trip exits 0
through the wrapper; MD→DOCX produces a valid Word 2007+ file
end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:47:58 -05:00
da4754b6ef feat(convert): bwrap engine as production default
Replaces the always-spawn-an-OCI-container model with a per-call
bubblewrap sandbox. Pandoc and chromium binaries are baked into the
zddc-server runtime image; each conversion runs them under bwrap's
Linux-namespace isolation. No daemon, no socket, no privileged outer
container, no OCI image pull at conversion time.

Why: the OCI engine paid ≈ 350 MB image pulls + 400 MB persistent
storage + ~300 ms per-conversion startup, plus required either an
on-host daemon socket (zddc-RCE → host-RCE in one hop) or nested
container privileges. bwrap gets the same sandbox properties
(--unshare-all, ro-bind /usr, tmpfs /tmp, clearenv, no-network) at
~5 ms per call and zero external dependencies. This is the same
primitive Flatpak uses for every app launch — battle-tested at scale
for "untrusted-input, short-lived, isolated."

Runner abstraction:
- `Runner.Run` signature: image string → ToolSpec{Image, Binary}.
  Both fields populated by entry points; whichever engine is
  installed reads the one it needs.
- `bwrapRunner` (new): assembles bwrap argv via `buildBwrapArgs`
  helper (testable in isolation), spawns bwrap with the binary.
- `containerRunner` (renamed conceptually to "legacy fallback"):
  unchanged behavior, still reachable for hosts that prefer OCI
  containers per conversion.

Probe order in health.Probe: bwrap → podman → docker. First hit wins.
Engine kinds in Capabilities: "bwrap" | "podman" | "docker". The
no-engine error message now lists all three.

Config (cmd/zddc-server):
- new --convert-pandoc-binary  / ZDDC_CONVERT_PANDOC_BINARY  (default "pandoc")
- new --convert-chromium-binary / ZDDC_CONVERT_CHROMIUM_BINARY (default "chromium-browser")
- existing --convert-pandoc-image / --convert-chromium-image kept
  for the OCI engine, doc updated to clarify they only apply there.
- --convert-engine helptext lists bwrap first.

Images:
- New `zddc/runtime.Containerfile` — alpine + bubblewrap + pandoc-cli +
  chromium + font-noto. Documents build/publish workflow.
- helm/zddc-server-prod/values.yaml.example: runtimeImage default
  switched to a placeholder for the new bundled runtime image; bare
  alpine NO LONGER works for /.convert (clearly called out in the
  comment).
- bitnest dev: /var/lib/zddc-dev-build/Containerfile mirrors the
  production runtime image. Quadlet at /etc/containers/systemd/
  zddc.container drops the podman-socket mount (no longer needed)
  and sets ZDDC_CONVERT_ENGINE=bwrap explicitly to avoid silent
  downgrades if a stray podman ends up on PATH.

Tests:
- convert_test.go: fakeRunner / recordingRunner now record ToolSpec.
- New TestToolSpecPopulation pins that both Image and Binary are
  filled by every entry point.
- New TestBwrapArgs_SandboxFlagsPresent / MountTranslation /
  RejectsBadMountSpec lock in the bwrap argv shape — a refactor that
  drops a hardening flag or misroutes a mount fails this loud.

Docs:
- AGENTS.md § "Server-side document conversion" rewritten around
  the bwrap-first model with podman/docker as legacy fallbacks.
- ARCHITECTURE.md convert reference updated.
- internal/convert package doc reflects the two-engine probe order.

Verified end-to-end on bitnest: probe reports
  engine=bwrap pandoc_binary=pandoc chromium_binary=chromium-browser
on startup. All 15 Go test packages green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:42:28 -05:00
ac7553f940 fix(client): plug confused-deputy bind in client mode
A focused security review of phases 1-4 surfaced one MEDIUM finding
(confidence 9/10): in client mode (--upstream set) the cache layer
forwards the configured bearer to upstream on every incoming request
without authenticating the local caller, AND --addr defaulted to
:8443 (all interfaces). Together those mean a CLI user running
`zddc-server --upstream https://master --bearer-file ~/token` on a
laptop on hotel/cafe Wi-Fi exposes an open-proxy confused-deputy:
any attacker on the same L2 connects to https://<laptop-ip>:8443,
accepts the self-signed cert, issues GETs (or PUTs/DELETEs that
queue in the outbox), and the cache laundries each request through
upstream with the engineer's bearer. The full cached subtree leaks.

Two layers of defense in config.Load:

1. Loopback default in client mode. When cfg.Upstream is set and
   neither --addr nor ZDDC_ADDR was passed explicitly, --addr
   downgrades to "127.0.0.1:8443" (vs ":8443" in master mode). CLI
   users on a laptop get safe-by-default. Operators who want a
   non-loopback bind opt in explicitly.

2. Refuse non-loopback bind + bearer-file without acknowledgement.
   When cfg.Upstream is set, BearerFile is non-empty, the chosen
   addr is non-loopback, AND --insecure-direct is not set, the load
   fails with an error that names the bind, the threat (open-proxy
   confused-deputy laundering bearer credentials), and the
   acknowledgement flag. The helm zddc-server-cache/ chart already
   sets ZDDC_INSECURE_DIRECT=1 and relies on Kubernetes-namespaced
   pod networking for the gating, so the chart path is unaffected.
   The guard is bearer-file-conditional because proxy mode without a
   bearer doesn't have a credential to launder, and refusing it
   would needlessly block proxy-without-auth deployments.

Tests in internal/config/config_test.go lock down all four cases:
- --upstream with no explicit --addr → 127.0.0.1:8443
- --upstream + non-loopback --addr + --bearer-file (no IDirect) → refuse
- --upstream + non-loopback --addr + --bearer-file + --insecure-direct → ok
- --upstream + non-loopback --addr + NO bearer → ok (no credential to leak)

Doc updates: zddc/README.md client-mode "Flags" section gets a
WARNING block describing the loopback default + insecure-direct
escape hatch. AGENTS.md ZDDC_UPSTREAM row mentions the addr
downgrade. ARCHITECTURE.md gains a "Confused-deputy guard at
startup" subsection under "Master + proxy/cache/mirror" with the
two-layer defense rationale. helm/zddc-server-cache/values.yaml.example
adds an inline note next to addr: ":8080" explaining why the chart
sets ZDDC_INSECURE_DIRECT=1 and what the consequence is of removing
either side of the gating.

Master mode is unaffected — the client-mode validation block is
gated by `if cfg.Upstream != ""`. All existing tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 10:03:51 -05:00
55852a9efb helm: add zddc-server-cache example chart + ZDDC_NO_AUTH on prod/dev
New chart helm/zddc-server-cache/ deploys zddc-server in client mode
against an upstream master. Mirrors the prod chart's source-build-via-
init-container pattern but with:

- ZDDC_UPSTREAM, ZDDC_MODE, ZDDC_BEARER_FILE, ZDDC_NO_AUTH,
  ZDDC_SKIP_TLS_VERIFY, ZDDC_MIRROR_SUBTREE, ZDDC_MIRROR_MIN_INTERVAL
  wired from values.yaml. Mirror-only env vars conditionally rendered
  (only when mode=mirror) to keep the rendered manifest minimal.
- Bearer token mounted from a separately-created Kubernetes Secret
  (defaultMode 0400) at /etc/zddc/bearer/token. values.yaml.example
  documents the secret-creation flow but contains no token. Secret
  reference can be set to "" to disable bearer auth (only valid for
  upstreams running --no-auth).
- Recreate strategy + replicaCount: 1 (multiple replicas would race
  the cache directory and double the upstream walker traffic).
- TCP-socket probes instead of HTTP — HTTP probes against / would
  fail when both upstream is unreachable AND the cache is empty
  (the cache layer returns 503 + offline header in that state),
  causing crashloops. TCP verifies process liveness without depending
  on upstream reachability or cache contents.
- Mounts a separate cache PVC (operator-provided, like the master's
  data PVC). Sized to the working set you expect to mirror; can be
  much smaller than the master's data volume.

Existing prod and dev charts gain optional ZDDC_NO_AUTH wired from
zddc.env.noAuth (default false → no change to existing rendered
manifests). Useful for trusted-LAN or genuinely-public master
deployments.

Updated docs: helm/README.md gains the cache row in the chart table,
the cache-install quickstart with the secret-creation flow, and the
cache-specific structural notes (Recreate / TCP probes / single-
instance). CLAUDE.md and ARCHITECTURE.md updated to reflect three
charts instead of two.

Verified with helm template rendering: ZDDC_NO_AUTH only renders
when noAuth: true; ZDDC_MIRROR_SUBTREE / ZDDC_MIRROR_MIN_INTERVAL
only render when mode: mirror; bearer volume + ZDDC_BEARER_FILE
only render when bearer.secretName is non-empty.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 08:33:01 -05:00
13ae1498e4 docs(helm): describe dev chart's OverlayFS isolation in README + Chart.yaml
The dev chart's overlay-isolation layer (added in 9765fa2) was not
called out in helm/README.md or zddc-server-dev/Chart.yaml. Readers
comparing the two charts saw "same shape but tracks main" without
learning that the dev chart wraps the data PVC in OverlayFS so its
writes never mutate the underlying store.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 08:33:04 -05:00
9765fa2f5e feat(apps): code-signed URL fetches; dev chart overlays prod data RO
Two interlocking pieces shipped together:

1. Strict Ed25519 signature verification on URL-fetched apps artifacts.
   Every URL the apps cascade resolves must publish a corresponding
   <url>.sig (raw 64-byte Ed25519 signature). The fetcher rejects on
   any failure (sig 404, transport error, wrong key, tampered body)
   and the resolver falls back to the embedded copy.

   The trusted public key is OPERATOR-CONFIGURED via --apps-pubkey /
   ZDDC_APPS_PUBKEY (PEM file path). No baked-in default — same posture
   as TLS certificates. Operators using zddc.varasys.io's canonical
   channels download pubkey.pem from there and configure the local
   path. Operators with their own signing infrastructure pass their
   own public key.

   Build pipeline (./build) gains sign_release_artifacts: walks
   dist/release-output/ after promote and produces an Ed25519 .sig
   alongside every real file. ZDDC_SIGNING_KEY=~/.config/zddc-signing/
   key.pem (mode 0600). Symlinks skip — the .sig at the symlink
   target is what counts.

   Test coverage: parse-PEM round-trip, malformed/wrong-type PEM
   rejection, valid-signature accept, tampered-body reject, wrong-key
   reject, malformed-signature reject, end-to-end fetch+sign+verify,
   fetch-rejects-tampered, fetch-rejects-missing-sig, fetch-rejects-
   wrong-key. Existing fetch tests updated to use signed-fixture
   helpers.

2. Dev Helm chart mounts production data READ-ONLY and layers an
   OverlayFS writable scratch on top. Prod data is the lowerdir;
   dev's writes (form submissions, archive index state, .zddc edits)
   land in upperdir; main container sees the merged read-write view
   at $ZDDC_ROOT. Setup runs in a privileged init container; main
   container runs unprivileged. Solves the dev-replica-on-shared-
   dataset problem at the filesystem layer with no zddc-server code
   change.

Docs: env-var tables in zddc/README.md and AGENTS.md gain a
ZDDC_APPS_PUBKEY row. The Federal-readiness gap analysis "Code-signed
apps: URL fetches" subsection is rewritten as "what's currently in
place" instead of "what would need to be added," with a forward
pointer to per-entry signed_by: (multi-key) and Sigstore as the
federally-acceptable evolution.

The website "Verify your downloads" section + the embedded pubkey
gone — but the website needs separate updates landing in zddc-website
to publish pubkey.pem and add the verify section. Pending in that
repo's commit.

Production binary unchanged at 13.1 MB. All 11 Go test packages green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:59:07 -05:00
6b973906c3 feat(server): refuse to start without root .zddc; default CORS to empty
Two safe-by-default flips, both opt-out via explicit acknowledgement.

1. --insecure / ZDDC_INSECURE=1: zddc-server now refuses to start when
   no <ZDDC_ROOT>/.zddc exists. With no .zddc anywhere in the chain,
   AllowedWithChain falls through to "HasAnyFile=false → allow" and
   the tree is publicly accessible to anonymous callers — almost never
   what an operator wants on a fresh deployment, and previously a
   silent footgun. The flag is the escape hatch for deliberately-
   public archives (no .zddc anywhere by design).

2. ZDDC_CORS_ORIGIN now defaults to empty (CORS disabled) instead of
   the canonical "https://zddc.varasys.io". The embedded-tools install
   path serves tools and data same-origin, so the default never needed
   to permit cross-origin XHRs from a third-party host. Every deployment
   was implicitly trusting zddc.varasys.io to make authenticated XHRs
   on behalf of every logged-in user; if that origin were ever
   compromised, the blast radius extended to every customer server.
   Operators who deliberately use the CDN-bootstrap pattern or self-
   hosted tools at a different host now set the value explicitly.

Helm chart values updated accordingly: prod default is empty; dev
keeps localhost:8000 for tool-iteration workflows. Existing deployments
that depended on the old defaults will need to either set the value
explicitly or pass --insecure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:40:34 -05:00
607121a9ea feat: example helm charts for zddc-server (production + dev)
Two charts under helm/, both compile zddc-server from source via an
init container — no container image registry, no pre-built binary.
The init container clones the repo at a configured git ref, runs
`go build`, and writes the binary into a shared emptyDir; the main
container is alpine + the freshly built static binary.

helm/zddc-server-prod/  Production-shaped:
                        - gitRef pinned to a stable tag in
                          values.yaml.example (zddc-server-v0.0.7).
                        - imagePullPolicy IfNotPresent.
                        - Slower probe cadence (30s liveness, 10s
                          readiness).
                        - ZDDC_LOG_LEVEL=info.
                        - replicaCount: 1 (operators raise as needed
                          when backed by a shared filesystem).

helm/zddc-server-dev/   Dev/soak-shaped:
                        - gitRef defaults to "main" (rebuilt every pod
                          restart). build-time annotation forces
                          recreate on every helm upgrade.
                        - imagePullPolicy Always on the build image
                          so the latest golang:1.24-alpine is pulled.
                        - Faster probe cadence (10s liveness, 5s
                          readiness) — fail-fast in dev.
                        - ZDDC_LOG_LEVEL=debug. NOTE: debug logs every
                          request's full header map (includes auth
                          tokens / cookies) — this chart is for
                          private dev namespaces only.
                        - Strategy: Recreate (single replica racing
                          on different SHAs would be a mess).

Both charts:

- Wire the ZDDC_* env-var contract (ZDDC_ROOT, ZDDC_ADDR,
  ZDDC_TLS_CERT=none, ZDDC_INSECURE_DIRECT=1, ZDDC_EMAIL_HEADER,
  ZDDC_CORS_ORIGIN, ZDDC_LOG_LEVEL, ZDDC_INDEX_PATH).
- Mount a caller-supplied PVC at ZDDC_ROOT (chart does not create the
  PVC; operators provision storage themselves).
- Optional Ingress (ingress.enabled: true). TLS is expected to be
  terminated upstream of the pod; the pod listens on plain HTTP.
- No secrets in values.yaml.example. ACL email lists go in .zddc files
  inside the data volume; image-pull and TLS secrets are referenced by
  name only.

helm/README.md documents the design rationale (why build from source
instead of using a registry image), a quick-start example, and the
explicit list of what the charts do and don't do.

Note: `helm lint` cannot be run in this dev environment (helm isn't
installed). YAML syntax of Chart.yaml and values.yaml.example
verified via `python3 -c "yaml.safe_load(...)"`. Operators should
run `helm lint` and `helm template` before installing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 09:48:02 -05:00