A Durable Local Task Engine for Dynamically Expanding Workloads

Design notes from the Sylos migration engine · June 2026

Note on authorship. The prose on this page was produced with AI assistance for raw text generation, then heavily reviewed, edited, and guided by the author for accuracy and quality.

Sylos is a breadth-first filesystem migration tool. This note is about the task engine behind it. The same pattern applies anywhere running a task can create more tasks: you need a durable local queue that can grow without blowing up memory or treating the database as the live scheduler on every pull.

Workers do not recurse in-process. Each task handles one item, persists any children, and stops. That child work is picked up in a later round (in Sylos, the next BFS depth level). Scheduling, leasing, and retries live in an in-memory queue; DuckDB is a durable log you refill from and append to, not the frontier itself.

Below: the pull-lease-process-append loop, what went wrong with mutable row queues and read-after-write assumptions, and the three DuckDB patterns that made this practical at scale (Appender bulk writes, append-only status events, and keyset batch pulls).

1. Problem

A task runs, discovers child tasks, and schedules them. At scale this is not a fixed queue size. Generation can outpace consumption for long stretches, especially when one listing expands into hundreds of pending items. If pending work lives only in process memory, the backlog grows until the process fails.

The engine must therefore write results and read the next batch of pending work from the same store, survive restarts, and avoid duplicate traversal. Pure in-memory scheduling is fast but not resumable. Per-node synchronous database round-trips do not reach millions of items. The design target is a local, durable task queue with dynamically expanding work, not a distributed stream processor.

2. Engine loop

Each worker handles one node per task: list immediate children, apply filters, persist results, stop. Children become new tasks for a later round (the next BFS depth level), not on the worker's call stack. Depth is carried in the queue, not in recursion. That matters for skewed trees (one branch with millions of descendants, a sibling with none): a worker must not recurse in-process and pin work that other workers could take.

The runtime loop:

Pull a batch of pending rows from DuckDB (WHERE id >= last_id LIMIT n).
Lease them into an in-memory buffer for workers.
Process each item (list, filter, discover children).
Append new node rows, status events, and child work through a write buffer into the Appender; flush to disk.
Advance last_id and repeat. On crash, restore round cursors and pull from the last known ID.

                 +----------------+
                 |   DuckDB Log   |
                 | Nodes + Events |
                 +--------+-------+
                          ^
                          |
                   Flush / Appender
                          ^
                          |
                 +--------+-------+
                 |   DB Buffer    |
                 +--------+-------+
                          ^
                          |
                 +--------+-------+
                 |    Workers     |
                 +--------+-------+
                          ^
                          |
                 +--------+-------+
                 | In-Memory Lease|
                 |     Queue      |
                 +--------+-------+
                          ^
                          |
                 +--------+-------+
                 | Keyset Pulls   |
                 | (id >= last)  |
                 +----------------+

2.1 DuckDB Appender

Hot-path writes go through DuckDB's Appender API, not row-by-row INSERT or UPDATE statements. Workers push node metadata and status events into an in-memory buffer; the buffer bulk-loads into the Appender and is flushed. Columnar append matches the write pattern: many new rows, few random updates.

The Appender maintains its own internal buffering. An explicit flush after draining the engine buffer defines when data is on disk for resume. That redundancy is intentional: persistence boundaries are predictable during development and after interrupts.

2.2 Append-only status events

Node metadata (src_nodes, dst_nodes) is written once. State changes are new rows in src_status_events / dst_status_events: pending on discovery, traversed or failed on completion. No UPDATE or DELETE on the hot path.

-- Avoid on hot path
UPDATE nodes SET status = 'complete' WHERE id = 123

-- Write pattern
INSERT INTO status_events (node_id, status, ...) VALUES (...)

Current status for review is resolved later by taking the latest event per node. That analytical read is paid at diff/review time, not during traversal.

2.3 Keyset batch pulls

Pending work is fetched in batches. LIMIT/OFFSET is unsuitable at depth: offset 50,000,000 implies scanning past 50M rows on every page, so each pull is O(n) in table size. Keyset pagination on a monotonic ID turns each pull into an O(1) seek to last_id plus a bounded read of batch_size rows. As the database grows, pull cost stays flat instead of scaling with total rows already written.

The engine tracks the starting ID of the next batch (last batched ID + 1):

WHERE id >= :last_id
LIMIT :batch_size

Level advancement uses batch boundaries: if a pull requests n rows but returns fewer than n, or no row exists at last_id + n, the current BFS level is likely exhausted. After the in-memory queue drains, the round advances. The ID set at a given depth is stable for the duration of that round (§3.1), which is what keeps keyset cursors from drifting as status events grow.

2.4 Task retries and leasing

What about task retries? Retries are handled entirely inside the lease queue. The database is not re-queried to discover that a task needs another attempt. A row is pulled from DuckDB into the queue once; from there the queue owns its lifecycle until the task succeeds or is exhausted.

Workers do not take tasks directly from the database. They ask the queue for work. The queue mutex-locks a task for one attempt at a time. On each attempt, whether it succeeds or fails, the worker pushes results into the DB buffer so progress is persisted and resumable if the process crashes mid-retry. If the attempt succeeds, or fails after the configured maximum (for example three attempts), the queue records the final outcome and removes the task. If the attempt fails but retries remain, the queue records the attempt, unlocks the task, and another worker can pick it up.

The queue is the authority on leasing. Workers report attempt completion; they do not decide unilaterally when a task leaves the active set. That keeps retry state out of the hot pull path and consistent with the memory-first scheduling model in §3.1.

3. Why row updates failed

An early model treated the database as a mutable task table: insert, select pending, update status. B-tree stores handle that poorly at volume. Prototyping hit throughput walls around hundreds of thousands of rows on Bolt (single bucket growth), SQLite, and DuckDB when updates dominated.

3.1 Read-after-write lag and sealed rounds

DuckDB with conventional updates also exhibited read-after-write inconsistency on the same connection: a write could be followed within milliseconds by a read that did not see it. That produced silent scheduling bugs when the engine assumed the database was the live source of truth on every pull.

The current design still batches writes through the Appender, so completed work can look pending in DuckDB until the next flush. That lag is acknowledged. It does not corrupt pagination because rounds are sealed at a fixed depth: work at round N does not insert more nodes at depth N, only at depth N+1 when parents finish. During round N, the src_nodes rows at depth N are stable. What grows is src_status_events as nodes move from pending to traversed.

Pulls are anchored on node id at the current round's depth, with id > cursor and a pending status filter applied at read time. The cursor advances to the last enqueued id, not the last row scanned. Once an id is leased into memory, it is not pulled again from the database. If a later flush marks that node successful before the cursor reaches higher ids, it simply drops out of pending results; it is not skipped, because it was already scheduled.

Round completion is driven by in-memory state, not live DB pending counts: the lease buffer is empty, nothing is in progress, and the last pull was partial (keyspace at this depth exhausted). The database is a refill source and resume log during the round, not a real-time scheduler. Before advancing to round N+1, the engine forces an Appender flush so children written at depth N+1 during round N are visible. There is no staging table merged at seal; one flush appends nodes and events together.

Consistency model.

Layer	Role during round N
Lease buffer + in-progress set	Truth for what is live this round
DuckDB	Truth for what exists; frontier refill and resume
Appender flush	Async write path; forced before round advance

Deliberate tradeoff: the DB can show more pending rows than memory during a round. Cursor discipline and in-memory completion prevent double work. The fragility of the earlier in-memory node-cache design was largely cross-round state held only in RAM; this model pushes durability to DuckDB while keeping scheduling authority in memory for the active round.

Append-only events and Appender bulk loads removed UPDATE from the traversal path. Analytical reconstruction of latest status is deferred to review time; scheduling relies on sealed depth, monotonic ids, and flush boundaries instead of immediate read-your-writes on every pull.

4. Memory and coordination

4.1 In-memory node cache (rejected)

A prototype held the current source level and its children in RAM until the destination consumed them. If the source ran ahead, the cache grew faster than the destination could drain it. Gating the source per level capped growth but still required roughly two frontier levels in memory at once. Evicting into the DB buffer shifted growth to another in-memory structure until flush. The approach could be tuned for speed at the cost of RAM; the current design does not rely on it.

4.2 Dual-tree coordinator

Migration compares two trees. Destination work depends on what the source has already discovered at the corresponding level. The source queue may run as far ahead as workers allow. The destination may run at full speed but only through source round minus one: if the source is on round 5, the destination waits on round 4 until the source begins round 6.

Finer-grained gating per subtree is possible. Round-level gating was chosen for simpler invariants and resume behavior at the cost of slightly conservative destination pacing.

Note. The destination queue and coordinator exist for two-tree comparison. Single-graph workloads drop that layer:

Pull → Lease → Process → Append → repeat.

5. Evaluation

Benchmarks use Spectra for reproducible trees without real I/O; see the Spectra design note for the harness. On live storage, throughput is near parity with Rclone because both systems become I/O bound. Under Spectra ephemeral mode, engine overhead (persistence, events, coordination) appears as roughly 10 to 30% additional cost versus a tool that does not carry the same guarantees.

Environment	Observation
Real storage	Near parity with Rclone; disk and API limits dominate.
Spectra ephemeral (no I/O)	~10 to 30% slower than Rclone; persistence and coordination overhead visible.

Throughput numbers are only meaningful alongside what guarantees they cost. Sylos pays for a persistent event log, resumable queue state, and reconstructable review metadata. Rclone's traversal path does not carry equivalent machinery.

6. Further reading

Implementation detail: Codeberg

Migration-Engine/pkg/db: schema, Appender, queries
Migration-Engine/pkg/queue: pull flow, coordinator, resume
Migration-Engine/pkg/migration: orchestration
Spectra (repo)

Package READMEs are the authoritative API reference. This page is a design summary, not a spec.