<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: xavcochran</title><link>https://news.ycombinator.com/user?id=xavcochran</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 16 Apr 2026 02:05:02 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=xavcochran" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[The Battle of the Storage Engines: RocksDB vs. LMDB]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.helix-db.com/">https://www.helix-db.com/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44262494">https://news.ycombinator.com/item?id=44262494</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 12 Jun 2025 20:00:04 +0000</pubDate><link>https://www.helix-db.com/</link><dc:creator>xavcochran</dc:creator><comments>https://news.ycombinator.com/item?id=44262494</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44262494</guid></item><item><title><![CDATA[New comment by xavcochran in "Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust)"]]></title><description><![CDATA[
<p>The benefit of baking in the dimension and the size of individual elements (the precision) is that the total size is known at compile time, meaning the vector can be allocated on the stack instead of on the heap.</p>
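A minimal Rust sketch of the idea, using a hypothetical `FixedVector` type (not HelixDB's actual code): with the dimension as a const generic, the whole vector is a plain value type with a compile-time-known size, so it lives inline on the stack rather than behind a heap pointer like `Vec<f32>`.

```rust
// Hypothetical illustration: a fixed-dimension vector whose size is
// known at compile time, so it can live on the stack.
#[derive(Clone, Copy)]
struct FixedVector<const DIM: usize> {
    data: [f32; DIM], // inline storage: no heap allocation, no pointer chase
}

impl<const DIM: usize> FixedVector<DIM> {
    fn dot(&self, other: &Self) -> f32 {
        self.data
            .iter()
            .zip(other.data.iter())
            .map(|(a, b)| a * b)
            .sum()
    }
}

fn main() {
    // A Vec<f32> always stores its elements on the heap,
    // plus a (pointer, length, capacity) header on the stack.
    let heap_vec: Vec<f32> = vec![1.0, 2.0, 3.0];
    assert_eq!(heap_vec.len(), 3);

    // The fixed-size version is exactly DIM * 4 bytes, inline.
    let a = FixedVector::<3> { data: [1.0, 2.0, 3.0] };
    let b = FixedVector::<3> { data: [4.0, 5.0, 6.0] };
    assert_eq!(std::mem::size_of::<FixedVector<3>>(), 12);
    assert_eq!(a.dot(&b), 32.0); // 1*4 + 2*5 + 3*6
}
```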
]]></description><pubDate>Thu, 15 May 2025 00:19:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=43990552</link><dc:creator>xavcochran</dc:creator><comments>https://news.ycombinator.com/item?id=43990552</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43990552</guid></item><item><title><![CDATA[New comment by xavcochran in "Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust)"]]></title><description><![CDATA[
<p>We utilize some of LMDB's optimizations, such as the APPEND put flag. We also make use of LMDB's duplicate handling, storing duplicates under a one-to-many key instead of duplicating keys. This means we can get all values for one key in a single call rather than a call per duplicate.<p>For keys we are using UUIDs, but the v6 timestamped UUIDs, so that they are lexicographically ordered by creation time. This means keys are inserted into LMDB using the APPEND flag: LMDB shortcuts to the rightmost leaf of its B-tree (rather than starting at the root) and appends the new record. It can do this because records are ordered by creation time, meaning each new key is guaranteed to be larger (in big-endian byte order) than the previous one.<p>We also store the UUIDs as u128 values for two reasons. The first is that a u128 takes up 16 bytes, whereas a string UUID takes up 36 bytes. This means we store 56% less data, and LMDB has to decode 56% fewer bytes when doing key accesses.<p>For the outgoing/incoming edges of nodes, we store them at fixed sizes, which lets LMDB pack them in, removing the 8-byte header per key-value pair.<p>In the future, we are also going to separate the properties from the stored value, as empty property objects still take up 8 bytes of space. We will also make it so nothing is inserted when the properties are empty.<p>You can see most of this in action in the storage core file:
<a href="https://github.com/HelixDB/helix-db/blob/main/helixdb/src/helix_engine/storage_core/storage_core.rs">https://github.com/HelixDB/helix-db/blob/main/helixdb/src/he...</a></p>
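A small standalone sketch of the key-encoding point above (illustrative helper only, not the linked storage code): encoding a u128 UUID with `to_be_bytes` makes byte-wise comparison agree with integer comparison, so later-created (larger) IDs always sort last, which is the property the APPEND fast path relies on.

```rust
// Hypothetical sketch: store a UUID as a u128 in big-endian byte order
// so keys created later always compare greater under LMDB's byte-wise
// key comparison, enabling the append-to-rightmost-leaf shortcut.
fn encode_key(uuid_as_u128: u128) -> [u8; 16] {
    // 16 bytes, versus 36 bytes for the hyphenated string form
    uuid_as_u128.to_be_bytes()
}

fn main() {
    // Two "timestamped" IDs where the second was created later
    // (higher timestamp bits), so its integer value is larger.
    let earlier = 1u128 << 96;
    let later = 2u128 << 96;

    let (k1, k2) = (encode_key(earlier), encode_key(later));

    // Big-endian encoding preserves integer order under lexicographic
    // (byte-wise) comparison, which is what LMDB uses for keys.
    assert!(k2 > k1);
    assert_eq!(k1.len(), 16);

    // Little-endian would NOT preserve order in general:
    let le_256 = 256u128.to_le_bytes();
    let le_1 = 1u128.to_le_bytes();
    // 1 compares greater than 256 when the LE bytes are compared,
    // because the low byte comes first.
    assert!(le_1 > le_256);
}
```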
]]></description><pubDate>Wed, 14 May 2025 18:07:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=43987493</link><dc:creator>xavcochran</dc:creator><comments>https://news.ycombinator.com/item?id=43987493</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43987493</guid></item><item><title><![CDATA[New comment by xavcochran in "Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust)"]]></title><description><![CDATA[
<p>Looking at your benchmarks, you say inserting 1k edges takes around 500,000 ns/iteration. Is that 500,000 ns per edge insertion, or for all 1k of them?</p>
]]></description><pubDate>Wed, 14 May 2025 17:43:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=43987261</link><dc:creator>xavcochran</dc:creator><comments>https://news.ycombinator.com/item?id=43987261</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43987261</guid></item><item><title><![CDATA[New comment by xavcochran in "Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust)"]]></title><description><![CDATA[
<p>Thank you! Any feedback would be much appreciated.</p>
]]></description><pubDate>Wed, 14 May 2025 13:02:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=43984035</link><dc:creator>xavcochran</dc:creator><comments>https://news.ycombinator.com/item?id=43984035</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43984035</guid></item><item><title><![CDATA[New comment by xavcochran in "Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust)"]]></title><description><![CDATA[
<p>SPALDE*</p>
]]></description><pubDate>Wed, 14 May 2025 12:55:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=43983962</link><dc:creator>xavcochran</dc:creator><comments>https://news.ycombinator.com/item?id=43983962</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43983962</guid></item><item><title><![CDATA[New comment by xavcochran in "Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust)"]]></title><description><![CDATA[
<p>There is also the fact that the more dimensions an embedding has, the more diluted it becomes, so it is unusual to go anywhere near the limits of vector length!</p>
]]></description><pubDate>Wed, 14 May 2025 10:13:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=43982793</link><dc:creator>xavcochran</dc:creator><comments>https://news.ycombinator.com/item?id=43982793</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43982793</guid></item><item><title><![CDATA[New comment by xavcochran in "Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust)"]]></title><description><![CDATA[
<p>Very interesting, will look into this. I know for a fact that you cannot compile the likes of LMDB and RocksDB to WASM, but this looks promising for making our custom storage engine work in the browser. Thanks for this!</p>
]]></description><pubDate>Wed, 14 May 2025 09:59:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=43982711</link><dc:creator>xavcochran</dc:creator><comments>https://news.ycombinator.com/item?id=43982711</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43982711</guid></item><item><title><![CDATA[New comment by xavcochran in "Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust)"]]></title><description><![CDATA[
<p>We will definitely look into it. The SPLADE models look promising!</p>
]]></description><pubDate>Wed, 14 May 2025 08:49:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=43982353</link><dc:creator>xavcochran</dc:creator><comments>https://news.ycombinator.com/item?id=43982353</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43982353</guid></item><item><title><![CDATA[New comment by xavcochran in "Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust)"]]></title><description><![CDATA[
<p>To add to George's reply: for Helix to run in the browser with WASM, the storage engine has to be completely in memory. At the moment we use LMDB, which uses file-based storage, so that doesn't work in the browser. As George said, we plan on building our own storage engine, and as part of that we aim to have an in-memory implementation.</p>
]]></description><pubDate>Wed, 14 May 2025 08:47:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=43982342</link><dc:creator>xavcochran</dc:creator><comments>https://news.ycombinator.com/item?id=43982342</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43982342</guid></item><item><title><![CDATA[New comment by xavcochran in "Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust)"]]></title><description><![CDATA[
<p>Thanks for the question! We chose f64 as the default for now just to cover all cases, and we believed basic vector operations would not be our bottleneck initially. As we optimize our HNSW implementation, we are going to add support for f32 and binary vectors, and drop Vec<f64/f32> in favor of [f64/f32; {num_dimensions}] to avoid unnecessary heap allocation!</p>
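To put rough numbers on that precision trade-off, a sketch (illustrative sizes, not HelixDB code): the same 1024-dimension vector costs 8 KiB at f64, 4 KiB at f32, and only 128 bytes as a binary (one bit per dimension) vector, where similarity reduces to Hamming distance.

```rust
// Hypothetical illustration of vector precision vs. storage cost
// for a fixed 1024-dimension embedding.
const DIMS: usize = 1024;

fn main() {
    assert_eq!(std::mem::size_of::<[f64; DIMS]>(), 8192); // 8 KiB
    assert_eq!(std::mem::size_of::<[f32; DIMS]>(), 4096); // 4 KiB
    assert_eq!(std::mem::size_of::<[u8; DIMS / 8]>(), 128); // 1 bit/dim

    // For binary vectors, similarity is just Hamming distance:
    // popcount over the XOR of the packed bytes.
    let a = [0b1010_1010u8; DIMS / 8];
    let b = [0b1010_1000u8; DIMS / 8];
    let hamming: u32 = a
        .iter()
        .zip(b.iter())
        .map(|(x, y)| (x ^ y).count_ones())
        .sum();
    assert_eq!(hamming, 128); // one differing bit in each of the 128 bytes
}
```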
]]></description><pubDate>Wed, 14 May 2025 08:44:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=43982324</link><dc:creator>xavcochran</dc:creator><comments>https://news.ycombinator.com/item?id=43982324</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43982324</guid></item><item><title><![CDATA[New comment by xavcochran in "Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust)"]]></title><description><![CDATA[
<p>Apart from the fact that Cozo seems to be pretty dead, we use a different storage engine, which makes our reads much faster. Based on their benchmarks, I estimate most of our reads to be around 10x faster. I also think our query language is much simpler and easier to understand than Datalog, which is what they use.</p>
]]></description><pubDate>Wed, 14 May 2025 08:41:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=43982297</link><dc:creator>xavcochran</dc:creator><comments>https://news.ycombinator.com/item?id=43982297</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43982297</guid></item><item><title><![CDATA[New comment by xavcochran in "Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust)"]]></title><description><![CDATA[
<p>Assuming you are using GPUs for model inference, the best way to set it up would be to have the DB and a separate server for sending inference requests. Note that we plan on supporting custom model endpoints on the database side, so you probably won't need the inference server in the future!</p>
]]></description><pubDate>Wed, 14 May 2025 08:24:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=43982204</link><dc:creator>xavcochran</dc:creator><comments>https://news.ycombinator.com/item?id=43982204</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43982204</guid></item><item><title><![CDATA[New comment by xavcochran in "Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust)"]]></title><description><![CDATA[
<p>Thanks for the kind words! At the moment the query language transpilation is quite unstable, but we are in the process of a large remodel which we aim to finish in the next day or so. This will make the query language compilation far more robust, and it will return helpful error messages (like the Rust compiler).
The other thing is that the core traversals are currently single-threaded, so aggregating huge lists of graph items can take a bit of a hit. Note, however, that we are also implementing parallel LMDB iterators with the help of the Meilisearch folks to make aggregation of large results much faster.</p>
]]></description><pubDate>Wed, 14 May 2025 07:57:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=43982060</link><dc:creator>xavcochran</dc:creator><comments>https://news.ycombinator.com/item?id=43982060</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43982060</guid></item></channel></rss>