Spectra: A Synthetic Filesystem for Deterministic Benchmarking

Design notes · June 2026

Note on authorship. The prose on this page was produced with AI assistance for raw text generation, but then heavily reviewed, edited, and guided by the author for accuracy and quality.

Abstract. Spectra is a synthetic filesystem simulator for testing traversal, migration, and mirroring pipelines without real disk I/O or cloud APIs. Trees are procedurally generated from configuration (depth, fanout, seed) and exposed through an SDK, an HTTP API, and Go's fs.FS interface. This document summarizes the problem Spectra addresses, its operating modes, its principal design choices, and its known limitations.

Benchmarking a traversal engine on a live filesystem measures the engine, the OS cache, disk latency, and tree shape at once. Spectra isolates the engine by replacing storage with a reproducible synthetic backend.


1. Problem statement

Migration and traversal tools are difficult to evaluate fairly on real hardware. Copying a local folder reports elapsed time but confounds engine logic with I/O variance, background OS activity, and unknown directory skew. Cloud APIs add rate limits and network noise. Generating hundreds of millions of real files for a stress test is impractical on most machines.

What is needed is a filesystem-shaped interface that can produce arbitrarily large trees on demand, reproduce the same tree from a seed, and optionally inject latency or failure without touching physical storage. Spectra was built for that role within the Sylos project; it is also usable as a standalone library (MIT licensed).

2. What Spectra provides

Spectra behaves like a mock filesystem. Folders and files are generated procedurally from JSON configuration: maximum depth, fanout distribution, file size ranges, RNG seed, and optional multi-world probabilities. Nodes carry metadata (path, depth, type, size, timestamps, checksums) and are served through list/get/create/delete operations compatible with migration-style workloads.
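As a rough illustration of what such a configuration might look like when loaded, the sketch below parses and validates a small JSON document. The struct, its field names, and the validation rules are hypothetical and chosen only for this example; the fanout distribution is omitted, and the repository remains the reference for the real schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Config is a hypothetical configuration shape. Field names are
// illustrative and are not taken from Spectra's actual schema.
type Config struct {
	Seed        int64 `json:"seed"`
	MaxDepth    int   `json:"max_depth"`
	MinFileSize int64 `json:"min_file_size"`
	MaxFileSize int64 `json:"max_file_size"`
	// WorldProbabilities maps a world name to the chance that a
	// generated node exists in that world (assumed semantics).
	WorldProbabilities map[string]float64 `json:"world_probabilities"`
}

// ParseConfig decodes raw JSON and rejects obviously inconsistent values.
func ParseConfig(raw []byte) (Config, error) {
	var c Config
	if err := json.Unmarshal(raw, &c); err != nil {
		return Config{}, fmt.Errorf("parse config: %w", err)
	}
	if c.MaxDepth < 1 {
		return Config{}, fmt.Errorf("max_depth must be at least 1, got %d", c.MaxDepth)
	}
	if c.MinFileSize > c.MaxFileSize {
		return Config{}, fmt.Errorf("min_file_size exceeds max_file_size")
	}
	for world, p := range c.WorldProbabilities {
		if p < 0 || p > 1 {
			return Config{}, fmt.Errorf("probability for world %q out of range: %v", world, p)
		}
	}
	return c, nil
}

// exampleConfig is made-up input for the sketch.
const exampleConfig = `{
  "seed": 42,
  "max_depth": 6,
  "min_file_size": 1024,
  "max_file_size": 1048576,
  "world_probabilities": {"primary": 1.0, "s1": 0.7}
}`

func main() {
	cfg, err := ParseConfig([]byte(exampleConfig))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Validating at load time keeps a bad seed or depth from silently producing a different tree than the one a benchmark intended to run against.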

Typical uses include validating BFS traversal logic, comparing engines under identical trees, integration testing against a stable API shape, and projecting separate "worlds" so two logical sources can be exercised through the same machinery (for example via Rclone against fs.FS backends).
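Because worlds are projected as fs.FS values, traversal logic can be written against the standard interface and checked against any implementation. The sketch below runs a directory-level BFS over an fs.FS, using the standard library's fstest.MapFS as a stand-in for a Spectra world. How a world is actually obtained from Spectra is not shown here, and the sampleWorld helper is hypothetical.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// BFS visits directories breadth-first starting at root. Files are
// recorded when their parent directory is processed. It returns the
// visited paths in visit order.
func BFS(fsys fs.FS, root string) ([]string, error) {
	var visited []string
	queue := []string{root}
	for len(queue) > 0 {
		dir := queue[0]
		queue = queue[1:]
		visited = append(visited, dir)

		// fs.ReadDir returns entries sorted by name, which keeps the
		// visit order deterministic.
		entries, err := fs.ReadDir(fsys, dir)
		if err != nil {
			return nil, fmt.Errorf("read %s: %w", dir, err)
		}
		for _, e := range entries {
			p := path.Join(dir, e.Name())
			if e.IsDir() {
				queue = append(queue, p)
			} else {
				visited = append(visited, p)
			}
		}
	}
	return visited, nil
}

// sampleWorld is a hypothetical stand-in for a Spectra world.
func sampleWorld() fs.FS {
	return fstest.MapFS{
		"a/x.txt":   {Data: []byte("x")},
		"a/b/y.txt": {Data: []byte("y")},
		"c/z.txt":   {Data: []byte("z")},
	}
}

func main() {
	paths, err := BFS(sampleWorld(), ".")
	if err != nil {
		panic(err)
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

The same function could be pointed at two different worlds to compare what each one exposes, since nothing in it depends on the backing implementation.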

3. Operating modes

3.1 Persistent mode

Default. Nodes are stored in an embedded BoltDB database with lazy generation: children materialize as they are accessed. A write-ahead buffer batches inserts and updates; an optional sliding-window node cache can reduce read traffic during BFS-style access. Implements Go's fs.FS for tool compatibility.
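The buffering idea can be sketched without a database. In the example below an in-memory map stands in for BoltDB, and the Node, Store, and Buffer types are hypothetical simplifications rather than Spectra's own. It shows the ordering (inserts applied before updates) and one batched child-list rewrite per parent per flush.

```go
package main

import "fmt"

// Node is a simplified record; real nodes carry more metadata.
type Node struct {
	ID       string
	ParentID string
	Size     int64
	ChildIDs []string
}

// Store stands in for the embedded database. commits counts how many
// batches were applied, as a proxy for write transactions.
type Store struct {
	nodes   map[string]*Node
	commits int
}

// Buffer queues inserts and updates separately and applies them in batches.
type Buffer struct {
	store   *Store
	limit   int // flush once this many operations are queued
	inserts []Node
	updates []Node
}

func NewBuffer(s *Store, limit int) *Buffer {
	return &Buffer{store: s, limit: limit}
}

func (b *Buffer) Insert(n Node) { b.inserts = append(b.inserts, n); b.maybeFlush() }
func (b *Buffer) Update(n Node) { b.updates = append(b.updates, n); b.maybeFlush() }

func (b *Buffer) maybeFlush() {
	if len(b.inserts)+len(b.updates) >= b.limit {
		b.Flush()
	}
}

// Flush applies inserts first, then one child-list rewrite per parent,
// then updates, all as a single batch.
func (b *Buffer) Flush() {
	if len(b.inserts)+len(b.updates) == 0 {
		return
	}
	pending := map[string][]string{} // parent ID -> newly inserted child IDs
	for _, n := range b.inserts {
		n := n // copy so the stored pointer does not alias the loop variable
		b.store.nodes[n.ID] = &n
		if n.ParentID != "" {
			pending[n.ParentID] = append(pending[n.ParentID], n.ID)
		}
	}
	for parentID, kids := range pending {
		if p, ok := b.store.nodes[parentID]; ok {
			p.ChildIDs = append(p.ChildIDs, kids...)
		}
	}
	for _, u := range b.updates {
		if n, ok := b.store.nodes[u.ID]; ok {
			n.Size = u.Size
		}
	}
	b.store.commits++
	b.inserts, b.updates = b.inserts[:0], b.updates[:0]
}

func main() {
	store := &Store{nodes: map[string]*Node{}}
	buf := NewBuffer(store, 100)
	buf.Insert(Node{ID: "root"})
	buf.Insert(Node{ID: "a", ParentID: "root"})
	buf.Insert(Node{ID: "b", ParentID: "root"})
	buf.Update(Node{ID: "a", Size: 10}) // targets a node inserted in the same batch
	buf.Flush()
	fmt.Println(store.nodes["root"].ChildIDs, store.nodes["a"].Size, store.commits)
}
```

Applying inserts before updates means an update queued against a node created in the same batch still finds its target, and grouping children by parent keeps the number of parent rewrites proportional to the number of parents touched rather than the number of children added.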

3.2 Ephemeral mode

No database. Children are generated on the fly from the parent's path and depth, so every listing call must supply both explicitly. Suited to large traversal benchmarks where persistence is unnecessary and storage I/O should be avoided entirely. Optional diverging-tree seeding can give each world a different shape from the same config skeleton.
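One way to get stateless, repeatable listings is to derive a per-directory random source from the global seed, the parent path, and the depth. The sketch below does that with the standard library. The function names, the naming scheme for children, and the uniform fanout are assumptions made for brevity and do not reflect Spectra's actual generator.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"math/rand"
)

// Entry is a simplified listing result.
type Entry struct {
	Path  string
	Depth int
	IsDir bool
	Size  int64
}

// dirSeed mixes the global seed, the depth, and the parent path into a
// seed for this directory's random source.
func dirSeed(seed int64, parentPath string, depth int) int64 {
	h := fnv.New64a()
	var buf [16]byte
	binary.LittleEndian.PutUint64(buf[:8], uint64(seed))
	binary.LittleEndian.PutUint64(buf[8:], uint64(depth))
	h.Write(buf[:])
	h.Write([]byte(parentPath))
	return int64(h.Sum64())
}

// ListChildren regenerates the children of parentPath on every call.
// No state is kept between calls; the same inputs give the same output.
func ListChildren(seed int64, parentPath string, depth, maxDepth int) []Entry {
	rng := rand.New(rand.NewSource(dirSeed(seed, parentPath, depth)))
	var out []Entry

	// Folders are only generated above the depth limit.
	if depth < maxDepth {
		for i, n := 0, rng.Intn(4); i < n; i++ {
			out = append(out, Entry{
				Path:  fmt.Sprintf("%s/dir%d", parentPath, i),
				Depth: depth + 1,
				IsDir: true,
			})
		}
	}
	// Every directory gets between one and five files in this sketch.
	for i, n := 0, 1+rng.Intn(5); i < n; i++ {
		out = append(out, Entry{
			Path:  fmt.Sprintf("%s/file%d.bin", parentPath, i),
			Depth: depth + 1,
			Size:  1 + rng.Int63n(1<<20),
		})
	}
	return out
}

func main() {
	for _, e := range ListChildren(42, "/data", 1, 3) {
		fmt.Printf("%-20s dir=%-5v size=%d\n", e.Path, e.IsDir, e.Size)
	}
}
```

Because the seed depends only on the inputs to the call, a traversal can visit directories in any order, or revisit them, and still see the same tree.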

Comparing the Sylos engine against Rclone on ephemeral Spectra runs removes storage bottlenecks, so scheduling, persistence, and coordination overhead show up directly. See the migration engine design note for how those results are interpreted.

4. Design highlights

  • Deterministic node IDs. The root node has the fixed ID root; every other node's ID is the prefix spc: followed by a 32-character hex digest (FNV-128a over path and type). The same path and type yield the same ID across runs and worlds.
  • Heavy-tailed fanout. Folder and file counts are drawn from logarithmic buckets with exponential decay and per-depth scaling. Most directories stay small; occasional "monster" directories stress breadth without uniform random fanout.
  • World existence map. Each node records which logical worlds (primary, s1, s2, etc.) it belongs to, supporting overlap and copy scenarios without duplicating tree generation logic.
  • O(1) child retrieval. Parents store a child_ids array updated in batched flushes, avoiding prefix scans through index buckets during traversal.
  • Write-ahead buffer. Separate insert and update queues flush in a fixed order (inserts before updates), and child_ids updates are batched per parent so that bulk generation does not devolve into O(n²) parent rewrites.
  • Optional 3-generation cache. When enabled, a sliding window over grandparent, parent, and child depths can cut BoltDB reads during BFS by roughly two thirds. Disabled by default to avoid memory cost on small or write-heavy runs.
  • fs.FS projection. Each world can be exposed as a standard Go filesystem, allowing external tools to treat synthetic worlds as independent remotes.
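The deterministic ID scheme in the first item above can be approximated with Go's hash/fnv package, whose 128-bit FNV-1a hash produces 16 bytes, or 32 hex characters. The sketch below is a guess at the general shape only: how path and type are fed into the hash (here, joined by a zero byte) and which path counts as the root are assumptions, so IDs produced this way should not be expected to match Spectra's.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"hash/fnv"
)

// NodeID derives a stable ID from a node's path and type.
// Assumptions: "/" is the root path, and path and type are separated by
// a zero byte so ("ab", "c") and ("a", "bc") cannot hash the same input.
func NodeID(path, nodeType string) string {
	if path == "/" {
		return "root"
	}
	h := fnv.New128a()
	h.Write([]byte(path))
	h.Write([]byte{0})
	h.Write([]byte(nodeType))
	return "spc:" + hex.EncodeToString(h.Sum(nil))
}

func main() {
	fmt.Println(NodeID("/", "folder"))
	fmt.Println(NodeID("/data/report.bin", "file"))
	fmt.Println(NodeID("/data/report.bin", "folder"))
}
```

Since the ID depends only on path and type, two worlds that contain the same path agree on its ID without any shared counter or lookup table.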

5. Limitations and considerations

  • Not a fidelity model of real storage. Latency distributions, ACLs, rename semantics, and backend-specific failure modes are only approximated. Results isolate engine behavior; they do not replace tests on real devices or APIs.
  • BoltDB write serialization. Persistent mode has a single writer. Batching amortizes this, but write-heavy workloads still contend on the database lock.
  • Memory vs cache. The optional node cache bounds depth to three generations, but memory still scales with fanout within that window. Aggressive fanout configs can produce very wide frontiers.
  • JSON serialization cost. A meaningful fraction of CPU time in bulk operations goes to marshaling node records. Acceptable for a test harness; not tuned as a production datastore.
  • Schema evolution. Major version changes (for example, changes involving the deterministic ID format or child_ids) require regenerating databases rather than migrating them in place.
  • Synthetic file content. File payloads are generated deterministically from seeds; throughput tests do not exercise real byte copying unless an integrator adds that layer.
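To make the "Memory vs cache" item above concrete, the sketch below keeps nodes for a sliding window of depths and evicts shallower levels as a traversal descends. The GenCache type and its eviction rule are hypothetical and only illustrate the trade-off. The number of levels is bounded, but each level holds as many nodes as the tree is wide at that depth.

```go
package main

import "fmt"

// Node is a simplified cached record.
type Node struct {
	ID    string
	Depth int
}

// GenCache retains nodes for at most `window` consecutive depths,
// ending at the deepest depth seen so far.
type GenCache struct {
	window  int
	deepest int
	byDepth map[int]map[string]Node
}

func NewGenCache(window int) *GenCache {
	return &GenCache{window: window, byDepth: map[int]map[string]Node{}}
}

// Put stores a node and evicts any level that fell out of the window.
func (c *GenCache) Put(n Node) {
	if n.Depth > c.deepest {
		c.deepest = n.Depth
		for d := range c.byDepth {
			if d <= c.deepest-c.window {
				delete(c.byDepth, d) // deleting during range is safe in Go
			}
		}
	}
	if n.Depth <= c.deepest-c.window {
		return // already outside the window; not worth caching
	}
	level := c.byDepth[n.Depth]
	if level == nil {
		level = map[string]Node{}
		c.byDepth[n.Depth] = level
	}
	level[n.ID] = n
}

// Get looks a node up by depth and ID.
func (c *GenCache) Get(depth int, id string) (Node, bool) {
	n, ok := c.byDepth[depth][id]
	return n, ok
}

// Len reports how many nodes are currently cached across all levels.
func (c *GenCache) Len() int {
	total := 0
	for _, level := range c.byDepth {
		total += len(level)
	}
	return total
}

func main() {
	cache := NewGenCache(3)
	for depth := 0; depth <= 3; depth++ {
		cache.Put(Node{ID: fmt.Sprintf("n%d", depth), Depth: depth})
	}
	_, rootCached := cache.Get(0, "n0")
	fmt.Println("cached nodes:", cache.Len(), "root still cached:", rootCached)
}
```

With one node per level the cache stays tiny, but a single very wide directory inside the window would dominate its memory use, which is why the cache is worth enabling only when read savings outweigh that cost.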

6. Further reading

Source, API reference, configuration fields, and integration notes live in the Spectra repository on Codeberg. Package READMEs and INTEGRATION_GUIDE.md are the authoritative reference for API signatures and migration between major versions.