The Engine
LichenEngine is a purpose-built binary graph database written in Rust, optimised from first principles for 64-bit hardware. Every design decision is made for one specific workload: storing SQL schema as a graph and retrieving it by semantic similarity.
A purpose-built engine, not a general-purpose graph database
LichenEngine does not support a query language, arbitrary property graphs, or generic traversal APIs. Every design decision — the fixed record size, the pointer-based edges, the parallel vector array, the GraphRAG context builder — is made specifically for one workload: store SQL schema structure as a graph, attach vector embeddings to every node, and retrieve relevant subgraphs by semantic similarity. Specialisation is the source of the performance.
64-byte records, aligned to CPU cache lines
Modern CPUs — every x86-64 and ARM64 processor — fetch memory in units called cache lines. A cache line is exactly 64 bytes. When the CPU reads any byte from RAM, the entire 64-byte block containing that byte is loaded into the L1 cache.
By making every record exactly 64 bytes, reading any field of a NodeRecord loads the entire record in a single cache-line fetch. Subsequent reads of other fields in the same record cost zero additional memory operations — the data is already in L1 cache.
The padding: [u8; 5] field at the end of every struct is not wasted space. It is load-bearing: it fills the record to exactly 64 bytes so no record can straddle two cache lines.
Traditional variable-width records
Records straddle cache lines — each read may fetch 2 cache lines
LichenEngine — fixed 64-byte records
1 memory fetch = 1 complete record. Zero wasted bandwidth.
NodeRecord binary layout — #[repr(C)]
Memory-mapped files — no copies, no buffer pool
All four data files are mapped directly into the process's virtual address space at startup using mmap (via the memmap2 crate). Reading a node record is a pointer dereference — not a read() system call, not a copy from kernel to userspace, not a buffer pool lookup.
The OS page cache serves as the buffer pool automatically. Pages accessed recently remain in RAM. Pages not recently accessed are transparently evicted and reloaded on demand. askLenny writes zero buffer management code.
Process virtual address space
engine dereferences pointer
OS page cache (RAM)
warm hit: ~10 cycles · cold miss: load from disk
Disk (.dat files)
nodes.dat · edges.dat · strings.dat · vectors.dat
Pointer-based adjacency lists — O(degree) traversal
Each NodeRecord stores the memory address of its first outgoing and incoming edge directly in the struct. Each EdgeRecord stores the address of the next edge in the chain. This forms a singly-linked list of edges per direction, rooted at the node itself.
When a new edge is inserted, it is prepended to the list in O(1) — the new edge's next_out_ptr is set to the old head, and the node's first_out_edge_ptr is updated to the new edge index.
Traversing all outgoing edges from a node costs one pointer dereference per edge. There is no index lookup, no B-tree traversal, no query planner. For a node with 5 outgoing edges, finding all 5 costs 5 memory reads. The cost scales with the degree of the node — not with the total number of edges in the graph.
LichenEngine (linked list)
O(degree)
~10 cycles per hop
Relational DB (B-tree index)
O(log e)
~5–10 reads per node
Traversing outgoing edges from Node A
Zero-copy serialisation with zerocopy
Reading a 64-byte struct from the mmap'd buffer without copying it requires the struct's in-memory layout to match its on-disk byte representation exactly. LichenEngine uses the zerocopy crate, which verifies at compile time that a struct has a fixed, stable byte layout (#[repr(C)], explicit padding, no uninitialised bytes) and then provides safe casts between the struct and its raw byte representation.
Reading a NodeRecord from the mmap'd buffer produces a reference directly into the mapped memory — no heap allocation, no byte-by-byte parsing, no copy. Writing is a single copy_from_slice of 64 bytes — exactly the bytes that will be stored on disk.
Reading a node
Cast &[u8; 64] slice → &NodeRecord reference. Zero bytes copied. Zero allocations.
Writing a node
Single copy_from_slice of 64 bytes into the mmap'd buffer. No serialisation step.
Vector embeddings and cosine similarity
Every node carries a 1536-dimensional float32 embedding stored in vectors.dat at a fixed stride: node_id × 6144 bytes (1536 floats × 4 bytes each). The vector for any node is located with a single multiplication — no secondary index required.
Cosine similarity measures the angle between two vectors, independent of their magnitudes. A score of 1.0 means semantically identical; 0.0 means unrelated. This is what enables matching net_rev_usd with a query for "revenue" — the embedding captures meaning that literal string search cannot.
cosine similarity formula
similarity(q, n) = (q · n) / (|q| × |n|)
What makes this work for schema retrieval
net_rev_usd+"Net revenue in US dollars after discounts and returns"Matches query: "show me revenue by region"
cust_id+"Unique identifier for each customer account"Matches query: "list all customers"
ord_dt+"Date and time the order was placed"Matches query: "orders in Q3 2024"
GraphRAG — context from the graph, not just the node
Vector similarity finds relevant nodes but not their context. A column node that matches "revenue" is useless to an LLM without knowing which table it belongs to, which database that table is in, and what other columns exist in the same table.
After vector search, LichenEngine performs a breadth-first traversal outward from each matched node, collecting related nodes up to a configurable depth. The entire subgraph — database, tables, columns, types, dialect, primary key flags — is serialised into a structured markdown block and returned in the same HTTP response as the matched node IDs.
The Python layer receives both search results and assembled context in a single round trip. No N+1 problem. No separate context-fetch pass.
--- Anchor Context for [net_rev_usd] --- [Database] (ID: 0, Depth: 0) Label: production_db Engine Dialect: Microsoft SQL Server (T-SQL) [Table] (ID: 5, Depth: 1) Label: orders Description: One row per customer order [Column] (ID: 12, Depth: 2) [PRIMARY KEY] Label: order_id Data Type: INT/INTEGER [Column] (ID: 13, Depth: 2) Label: net_rev_usd Data Type: DECIMAL/FLOAT Description: Net revenue after discounts
What is not built yet
Brute-force vector search
Linear scan over all nodes (O(n × d)). Fast for schemas up to ~100,000 nodes. Approximate nearest-neighbour (HNSW/IVF) is on the roadmap.
Node limit
Current hardcoded limit: 100,000 nodes. The config file specifies 20M — wiring is in progress.
No ACID transactions
Crash-safe header ordering is implemented. Full write-ahead log and MVCC are not yet built.
Fixed embedding dimension
Currently hardcoded to 1536 floats. Making this a runtime config parameter is planned.