Skip to main content

Architecture at a glance

The whole shape in one page. Each piece has its own concept page; this is the map.

Thread-per-core

Celeriant is a Rust server where each core owns a shard and runs single-threaded. There is no shared mutable state across cores on the write path, which removes the locking and the concurrency-bug classes that thread-pool databases spend their lives fighting.

Aggregates map to shards by routing_id % num_shards, where routing_id is one of org_id, aggregate_type_id, or aggregate_id, chosen at cluster init. Plain modulus on an id you control: that is deliberate. A hash would scatter aggregates uniformly and rob you of the only mechanism the engine gives you to co-commit. Two aggregates can be written atomically only if they land on the same shard, so you place them there by choosing ids whose modulus matches. Multi-aggregate writes that span shards are rejected outright (ShardRoutingMultipleShards, error 9001). See Consistency boundaries.

The write path

A write to an aggregate is, by default, conditional: it commits only if the aggregate is still at the version you read (optimistic concurrency). On the leader, the batch is written with Direct I/O, fdatasync'd to disk, replicated to the follower, which also fdatasync's, and only then acknowledged. So an acknowledged write is durable on two machines before your call returns. See Durability and safety.

Direct I/O is deliberate: it skips the kernel page cache, which can report a clean fsync and still lose data. The per-write cost is amortized by batching concurrent writes into one fsync and one replication round.

The cluster

Two nodes, a leader and a follower, with no Raft and no Zookeeper. Leadership is an S3 lease acquired by conditional write; failover is a lease handoff. When the follower is down, the leader replicates to S3 instead, so an acknowledged write is never single-homed. See Leader election and S3 leases.

Storage and memory

The storage engine is built for very high stream cardinality. It indexes with bloom filters and falls back to a reverse scan of the log, with hot data in an LRU cache, so memory is bounded by the working set rather than the total number of aggregates. The design holds millions of aggregates and billions of events on a 32 GB box, with the log on NVMe, at the cost of slightly higher latency on the first read of a cold aggregate. See Performance.

Reading

Reads are per-aggregate, ordered, and filtered by offset and event type (Reads and ordering). There is no query language, because Celeriant is the write side; you project the log into a read store and query that (Building a read model). For real-time, a watch streams change notifications and your projection follows the tail.

Where to go next