Hacker News: wolfgarbe

New comment by wolfgarbe in "Show HN: A spell-checker 380x faster than Hunspell, 5x faster than SymSpell"

wolfgarbe — Fri, 06 Feb 2026 17:07:09 +0000

Peter Norvig shows that a maximum edit distance of 2 covers 98.9% of spelling errors. https://impythonist.wordpress.com/2014/03/18/peter-norvigs-2...

That's the reason why the default maximum edit distance of SymSpell is 2.

Now, all 6 of your 6 examples are chosen from the 1.1% margin that is not covered by edit distance 2, each presenting an unusually high number of errors within a single word.

The third-party SymSpell port from Justin Wilaby, which you were using for benchmarking, clearly states that you need to set both maxEditDistance and dictionaryEditDistance to a higher number if you want to correct higher edit distances. You neither did that nor mentioned it. This has nothing to do with accuracy; it is a tradeoff between performance and maximum edit distance that one can make according to the use case at hand.

https://github.com/justinwilaby/spellchecker-wasm?tab=readme...
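To illustrate why the edit distance setting is a coverage vs. cost choice rather than an accuracy property, here is a toy sketch of a symmetric-delete lookup. It is not SymSpell's code and not the API of the port linked above; the function names (deletes, build_index, candidates) are made up for this example, and unlike a real implementation it does not verify candidates with a true edit distance afterwards.

```python
from itertools import combinations


def deletes(word: str, max_distance: int) -> set[str]:
    """Return the word plus every variant with up to max_distance characters removed."""
    variants = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in keep))
    return variants


def build_index(dictionary: list[str], dictionary_edit_distance: int) -> dict[str, set[str]]:
    """Map every delete variant to the dictionary terms that produce it."""
    index: dict[str, set[str]] = {}
    for term in dictionary:
        for variant in deletes(term, dictionary_edit_distance):
            index.setdefault(variant, set()).add(term)
    return index


def candidates(index: dict[str, set[str]], term: str, max_edit_distance: int) -> set[str]:
    """Collect dictionary terms that share a delete variant with the input term."""
    found: set[str] = set()
    for variant in deletes(term, max_edit_distance):
        found |= index.get(variant, set())
    return found


# With both distances set to 2 the term is out of reach; with both set to 3 it is found.
index2 = build_index(["pronunciation"], 2)
index3 = build_index(["pronunciation"], 3)
print(candidates(index2, "pronnouncaition", 2))
print(candidates(index3, "pronnouncaition", 3))
```

A larger distance generates more delete variants per term, so the index grows and lookups get slower. That is the tradeoff, and it is independent of how well the corrector ranks the candidates it finds.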

pronnouncaition IS within edit distance 3, according to the Damerau-Levenshtein edit distance used by SymSpell. The reason is that an adjacent transposition is counted as a single edit. https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_di...
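A minimal sketch to check this by hand. It implements the optimal string alignment variant of Damerau-Levenshtein (the restricted form, assumed here for simplicity) and is not taken from the SymSpell source. The function name and the transpositions flag are made up for this example.

```python
def edit_distance(a: str, b: str, transpositions: bool = True) -> int:
    """Edit distance with unit-cost insert, delete and substitute.

    With transpositions=True, swapping two adjacent characters also costs 1
    (optimal string alignment); with False this is plain Levenshtein.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            if (
                transpositions
                and i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]


print(edit_distance("pronnouncaition", "pronunciation"))                        # 3
print(edit_distance("pronnouncaition", "pronunciation", transpositions=False))  # 4
```

The three edits are: delete one "n", delete the "o", and swap "ai" to "ia". Without the transposition rule the swap costs two edits, which gives 4.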

New comment by wolfgarbe in "Show HN: A spell-checker 380x faster than Hunspell, 5x faster than SymSpell"

wolfgarbe — Fri, 06 Feb 2026 08:55:41 +0000

Author of SymSpell here. Congrats on the launch of Lexiathan.

Unfortunately, the accuracy comparison of Lexiathan vs. SymSpell on your website is misleading.

1. SymSpell has two parameters to control the maximum edit distance. Once you set both to 3, terms with an edit distance of 3 are also corrected accurately:

  pronnouncaition -> pronunciation

  inndappendent -> independent

  unegspeccted -> unexpected

  soggtwaee -> software

2. SymSpell comes with dictionaries in several sizes. Once you load the 500_000 terms dictionary, the two remaining terms are corrected as well:

  maggnificntally -> magnificently

  annnesteasialgist -> anesthesiologist

https://github.com/wolfgarbe/SymSpell/blob/master/SymSpell.B...

SymSpell accurately corrects all of your examples if used properly with the correct parameters and dictionary.

Apart from that, your methodology of comparing correction accuracy by cherry-picking specific terms without statistical significance, where your product seemingly performs better, is questionable.

One would use large public corpora to measure the percentage of accurately corrected terms as well as the percentage of false positives.
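A rough sketch of what such a measurement could look like. The evaluate function, the pair format, and the tiny sample data are all made up for illustration; a real evaluation would use a large public corpus of misspellings and any corrector with a str -> str interface.

```python
from typing import Callable, Iterable


def evaluate(correct: Callable[[str], str], pairs: Iterable[tuple[str, str]]) -> dict[str, float]:
    """Measure a corrector on (input_term, expected_term) pairs.

    A pair with input_term == expected_term is a correctly spelled word;
    changing it counts as a false positive.
    """
    errors = fixed = valid = false_positives = 0
    for term, expected in pairs:
        suggestion = correct(term)
        if term == expected:
            valid += 1
            false_positives += suggestion != term
        else:
            errors += 1
            fixed += suggestion == expected
    return {
        "correction_accuracy": fixed / errors if errors else 0.0,
        "false_positive_rate": false_positives / valid if valid else 0.0,
    }


# Hypothetical corrector backed by a lookup table, just to exercise the function.
table = {"teh": "the", "recieve": "receive", "form": "from"}
sample = [("teh", "the"), ("recieve", "receive"), ("adress", "address"), ("form", "form"), ("the", "the")]
print(evaluate(lambda term: table.get(term, term), sample))
```

Reporting both numbers matters: a corrector that aggressively rewrites everything can score well on misspelled terms while damaging words that were already correct.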

Because SymSpell is Open-Source, everyone can integrate it into their applications for free, modify the code, use different dictionaries in various languages, or add terms to existing ones.

https://github.com/wolfgarbe/SymSpell

https://github.com/wolfgarbe/symspell_rs

New comment by wolfgarbe in "Building a Simple Search Engine That Works"

wolfgarbe — Mon, 17 Nov 2025 15:31:13 +0000

The stopword list in SeekStorm is purely optional; by default it is empty.

The query "to be or not to be" that you mentioned, consisting solely of stopwords, returns complete results and performs quite well in the benchmark: https://github.com/SeekStorm/SeekStorm?tab=readme-ov-file#be...

Both Lucene and Elastic still offer stopword filters: https://lucene.apache.org/core/10_3_2/analysis/common/org/ap... https://www.elastic.co/docs/reference/text-analysis/analysis...

New comment by wolfgarbe in "Show HN: I wrote a full text search engine in Go"

wolfgarbe — Fri, 10 Oct 2025 12:16:13 +0000

Can the index size exceed the RAM size (e.g., via memory mapping), or are index size and document number limited by RAM size? It would be good to mention those limitations in the README.

New comment by wolfgarbe in "Show HN: I wrote a full text search engine in Go"

wolfgarbe — Fri, 10 Oct 2025 12:09:59 +0000

Sure, but it says "High-performance" Full Text Search Engine. Shouldn't that claim be backed up by numbers, comparing it to the state of the art?

New comment by wolfgarbe in "Show HN: I wrote a full text search engine in Go"

wolfgarbe — Thu, 09 Oct 2025 18:39:18 +0000

Great work! Would be interesting to see how it compares to Lucene performance-wise, e.g. with a benchmark like https://github.com/quickwit-oss/search-benchmark-game

New comment by wolfgarbe in "Ask HN: Struggling to Understand DHTs – Any Good Resources?"

wolfgarbe — Sun, 16 Feb 2025 11:14:12 +0000

The most widely used DHT is Kademlia from Petar Maymounkov and David Mazières. It is used in Ethereum, IPFS, I2P, Gnutella DHT, and many other applications.

https://en.wikipedia.org/wiki/Kademlia

https://pdos.csail.mit.edu/~petar/papers/maymounkov-kademlia...

https://web.archive.org/web/20120128120732/http://www.cs.ric...

New comment by wolfgarbe in "Show HN: SeekStorm – open-source sub-millisecond search in Rust"

wolfgarbe — Sat, 14 Dec 2024 14:12:19 +0000

SeekStorm does not currently use io_uring, but it is on our roadmap. The challenge is cross-platform compatibility: Linux (io_uring) and Windows (IoRing) use different implementations, and other operating systems don't support it. There is no abstraction layer over those implementations in Rust, so we are on our own.

It would increase concurrent read and write speed (index loading, searching) by removing the need to lock seek and read/write.

But I would expect that the mmap implementations already use io_uring / IoRing.

Yes, lazy loading would be possible, but pure RAM access does not offer enough benefits to justify the effort to replicate much of the memory mapping.

New comment by wolfgarbe in "Show HN: SeekStorm – open-source sub-millisecond search in Rust"

wolfgarbe — Mon, 09 Dec 2024 18:56:22 +0000

SeekStorm comes with an HTTP interface.

The SeekStorm server features a REST API over HTTP: https://seekstorm.apidocumentation.com

It also comes with an embedded Web UI: https://github.com/SeekStorm/SeekStorm?tab=readme-ov-file#bu...

Or did you mean a web-based interface to create and manage indices, define index schemas, add documents, etc.?

New comment by wolfgarbe in "Show HN: SeekStorm – open-source sub-millisecond search in Rust"

wolfgarbe — Mon, 09 Dec 2024 18:47:34 +0000

>> The documentation seems a bit sparse.

We just released new OpenAPI-based documentation for the SeekStorm server REST API: https://seekstorm.apidocumentation.com

For the library we have the standard Rust docs: https://docs.rs/seekstorm/latest/seekstorm/

New comment by wolfgarbe in "Show HN: SeekStorm – open-source sub-millisecond search in Rust"

wolfgarbe — Tue, 03 Dec 2024 18:32:17 +0000

For the latency benchmarks we used vanilla BM25 (SimilarityType::Bm25f for a single field) for comparability, so there are no differences in terms of accuracy.
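For readers unfamiliar with it, here is a minimal sketch of vanilla BM25 scoring for a single field. This is the textbook formula with a commonly used IDF variant, not SeekStorm's implementation; the function name, the default values for k1 and b, and the pre-tokenized input format are assumptions for this example, and it expects a non-empty document list.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query terms with Okapi BM25."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates via k1; b controls document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [["rust", "search", "engine"], ["search", "search", "rust"], ["cooking", "recipes"]]
print(bm25_scores(["search"], docs))
```

Because every engine in the comparison ranks with the same formula, the benchmark differences are about latency, not about which results come back.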

For SimilarityType::Bm25fProximity which takes into account the proximity between query term matches within the document, we have so far only anecdotal evidence that it returns significantly more relevant results for many queries.

Systematic relevancy benchmarks like BeIR, MS MARCO are planned.

New comment by wolfgarbe in "Show HN: SeekStorm – open-source sub-millisecond search in Rust"

wolfgarbe — Tue, 03 Dec 2024 12:03:40 +0000

The SeekStorm library is 9 MB, and the SeekStorm server executable is 8 MB, depending on the features selected in cargo.

You add the library to your project via 'cargo add seekstorm', and you have to compile that project anyway.

As for the server, we may add binaries for the main operating systems in the future.

WASM and Python bindings are on our roadmap.

New comment by wolfgarbe in "Show HN: SeekStorm – open-source sub-millisecond search in Rust"

wolfgarbe — Tue, 03 Dec 2024 07:22:29 +0000

In SeekStorm you can choose per index whether to use Mmap or let SeekStorm fully control Ram access. The latter has a slight performance advantage, at the cost of a higher index load time compared to Mmap. https://docs.rs/seekstorm/latest/seekstorm/index/enum.Access...

New comment by wolfgarbe in "Show HN: SeekStorm – open-source sub-millisecond search in Rust"

wolfgarbe — Tue, 03 Dec 2024 07:15:36 +0000

The benchmark should be reasonably fair, as it was developed by Tantivy themselves (and Jason Wolfe). So, the choice of corpus and queries was theirs. But, of course, your mileage may vary. It is always best to benchmark it on your machine with your data and your queries.

Yes, WASM and Python bindings are on our roadmap.

New comment by wolfgarbe in "Show HN: SeekStorm – open-source sub-millisecond search in Rust"

wolfgarbe — Tue, 03 Dec 2024 06:59:44 +0000

Currently, you can choose between tokenizers with or without folding. But configurability per language or full customizability of the folding logic by the user is a good idea.

New comment by wolfgarbe in "Show HN: SeekStorm – open-source sub-millisecond search in Rust"

wolfgarbe — Tue, 03 Dec 2024 06:54:15 +0000

The code for the distributed search cluster is not yet stable enough to be published, but it will be released as open-source as well.

As for shared storage, do you mean something like NAS or rather something like Amazon S3? Cloud-native support for object storage and the separation of storage and compute are on our roadmap. The challenges will be maintaining latency and the need for more sophisticated caching.

New comment by wolfgarbe in "Show HN: SeekStorm – open-source sub-millisecond search in Rust"

wolfgarbe — Mon, 02 Dec 2024 23:44:13 +0000

It's not just about speed. Speed reflects efficiency. Efficiency is needed to serve more queries in parallel, to search within exponentially growing data, with less expensive hardware and fewer servers, consuming less energy. Therefore the pursuit of efficiency never gets outdated and has no limit.

New comment by wolfgarbe in "Show HN: SeekStorm – open-source sub-millisecond search in Rust"

wolfgarbe — Mon, 02 Dec 2024 22:10:21 +0000

PostgreSQL is an SQL database that also offers full-text search (FTS). With extensions like pg_search it also supports BM25 scoring, which is essential for lexical search. SeekStorm is centered around full-text search only; it doesn't offer SQL.

Performance-wise it would indeed be interesting to run a benchmark. The third-party open-source benchmark we are currently using (search_benchmark_game) does not yet support PostgreSQL. So yes, that comparison is still pending.

New comment by wolfgarbe in "Show HN: SeekStorm – open-source sub-millisecond search in Rust"

wolfgarbe — Mon, 02 Dec 2024 21:12:26 +0000

Yes. We waited a long time for AOT compilation to mature, to remove the need for the user to install the .NET framework. But two years ago, when we decided to switch, we still couldn't get the AOT compilation of our codebase to work without changes (perhaps it was somehow possible, but the available documentation did not say much about this). Also, there is still a performance gap. Of course, this doesn't matter for most applications, where the completeness and consistency of the framework and the number of programmers fluent in the language might matter more. But for a search server, we needed to squeeze out every bit of performance we could get. And other benchmarks seemed to echo our experience: https://programming-language-benchmarks.vercel.app/rust-vs-c...

New comment by wolfgarbe in "Show HN: SeekStorm – open-source sub-millisecond search in Rust"

wolfgarbe — Mon, 02 Dec 2024 20:26:49 +0000

The 2-4 speed ratio was not meant to denounce C#, which is a great language I loved to program in for over two decades, coming from Delphi. Unfortunately, C# does not have complete SIMD support. See our request to support the SSE4.2 _mm_cmpistrm instruction https://github.com/dotnet/runtime/discussions/63332, which we required for a vectorized intersection between two sorted 16-bit arrays. We did not make the switch from C# to Rust lightly, as porting a fairly large codebase is time-consuming. We just wanted to share our experience for our specific task, not as a general statement.
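For context, this is the operation in question in its plain scalar form: a two-pointer intersection of two sorted arrays of 16-bit values. It is only a baseline sketch in Python, not SeekStorm code and not vectorized; the function name is made up for this example. A SIMD version of the same idea compares several values per instruction instead of one pair per loop iteration.

```python
from array import array


def intersect_sorted(a: array, b: array) -> array:
    """Intersect two ascending arrays of unsigned 16-bit integers (typecode 'H')."""
    result = array("H")
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1  # a[i] cannot occur in the rest of b
        elif a[i] > b[j]:
            j += 1  # b[j] cannot occur in the rest of a
        else:
            result.append(a[i])  # value present in both arrays
            i += 1
            j += 1
    return result


left = array("H", [1, 3, 5, 7, 9])
right = array("H", [3, 4, 5, 9, 10])
print(intersect_sorted(left, right).tolist())  # [3, 5, 9]
```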