Hacker News: alamb

New comment by alamb in "Embedding a Tantivy Index in Parquet"

alamb — Thu, 25 Sep 2025 14:47:05 +0000

This demo extends a Parquet file by embedding a Tantivy full-text search index inside it. A custom DataFusion TableProvider implementation uses the embedded full-text index to optimize wildcard LIKE predicates.
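To make the "index narrows the scan, then the predicate is re-checked" idea concrete, here is a minimal sketch. It is not the demo's code and does not use Tantivy or DataFusion: it swaps in a plain trigram index, and every name in it (`trigrams`, `build_index`, `like_contains`) is hypothetical. The only point is that an index can cheaply rule out rows for a `LIKE '%needle%'` predicate, leaving the exact string match to run on far fewer candidates.

```python
from collections import defaultdict


def trigrams(s):
    """All 3-character substrings of s (empty set if s is shorter than 3)."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def build_index(rows):
    """Map each trigram to the set of row ids whose text contains it."""
    index = defaultdict(set)
    for row_id, text in enumerate(rows):
        for gram in trigrams(text):
            index[gram].add(row_id)
    return index


def like_contains(rows, index, needle):
    """Evaluate `text LIKE '%needle%'` using the index as a pre-filter."""
    grams = trigrams(needle)
    if grams:
        # A row containing the needle must contain every trigram of the needle.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Needle too short to index: fall back to checking every row.
        candidates = set(range(len(rows)))
    # The index can only over-approximate, so re-check the real predicate.
    return sorted(i for i in candidates if needle in rows[i])


rows = [
    "apache parquet file",
    "tantivy full-text index",
    "parquet with embedded index",
    "arrow",
]
index = build_index(rows)
matches = like_contains(rows, index, "index")
print(matches)
```

A real full-text index tokenizes and scores text rather than storing raw trigrams, so treat this as an illustration of the pruning step only.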

Embedding a Tantivy Index in Parquet

alamb — Thu, 25 Sep 2025 14:47:05 +0000

Article URL: https://github.com/jcsherin/datablok/tree/main/crates/parquet-embed-tantivy

Comments URL: https://news.ycombinator.com/item?id=45373253

Points: 1

# Comments: 1

New comment by alamb in "Embedding user-defined indexes in Apache Parquet"

alamb — Tue, 15 Jul 2025 11:01:36 +0000

> Note that the readers of Parquet need to be aware of any metadata to exploit it. But if not, nothing changes

The one downside of this approach, which is likely obvious but which I haven't seen mentioned, is that the resulting Parquet files are larger than they would be otherwise, and the increased size only benefits engines that know how to interpret the new index

(I am an author)
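A rough sketch of that tradeoff, using a toy container rather than real Parquet. The layout (magic bytes, body, footer, 4-byte footer length, magic bytes) only loosely mimics the Parquet trailer, the footer is JSON instead of Thrift, and the key names `my_index_offset` / `my_index_length` are made up for illustration. The reader that does not know about the index still gets the same data back; the file is simply bigger.

```python
import json
import struct
from typing import Optional

MAGIC = b"TOY1"  # stand-in for a file magic; not a real format


def write_file(data: bytes, index: Optional[bytes] = None) -> bytes:
    """Layout: MAGIC | data | [index] | footer JSON | footer length (u32 LE) | MAGIC."""
    body = MAGIC + data
    footer = {"data_offset": len(MAGIC), "data_length": len(data), "key_value": {}}
    if index is not None:
        # Record where the index lives so index-aware readers can find it.
        footer["key_value"]["my_index_offset"] = len(body)
        footer["key_value"]["my_index_length"] = len(index)
        body += index
    footer_bytes = json.dumps(footer).encode()
    return body + footer_bytes + struct.pack("<I", len(footer_bytes)) + MAGIC


def read_footer(buf: bytes) -> dict:
    """Locate and decode the footer from the end of the file."""
    assert buf[-4:] == MAGIC
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    return json.loads(buf[-8 - footer_len:-8])


def read_data(buf: bytes) -> bytes:
    """Index-unaware reader: uses only the offsets it understands."""
    footer = read_footer(buf)
    start = footer["data_offset"]
    return buf[start:start + footer["data_length"]]


def read_index(buf: bytes) -> Optional[bytes]:
    """Index-aware reader: returns the embedded index, or None if absent."""
    kv = read_footer(buf)["key_value"]
    if "my_index_offset" not in kv:
        return None
    start = kv["my_index_offset"]
    return buf[start:start + kv["my_index_length"]]


data = b"row data " * 100
index_blob = b"index bytes " * 10
plain = write_file(data)
indexed = write_file(data, index=index_blob)

# Both readers see identical data; only the index-aware one uses the extra bytes.
print(len(plain), len(indexed), read_data(indexed) == read_data(plain))
```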

New comment by alamb in "Embedding user-defined indexes in Apache Parquet"

alamb — Tue, 15 Jul 2025 10:57:29 +0000

> That is, start with Wild West and define specs as needed

Yes, this is my personal hope as well -- if there are new index types that are widespread, they can be incorporated formally into the spec

However, changing the spec is a non-trivial process and requires significant consensus and engineering

Thus the methods described in the blog can be used to add indexes prior to any spec change, and potentially as a way to prototype / prove out new potential indexes

(note I am an author)

New comment by alamb in "Embedding user-defined indexes in Apache Parquet"

alamb — Tue, 15 Jul 2025 10:52:27 +0000

We are actively working on supporting extension types. The mechanism is likely to be using the Arrow extension type mechanism (a logical annotation on top of existing Arrow types https://arrow.apache.org/docs/format/Columnar.html#format-me...)

I expect this to be used to support Variant https://github.com/apache/datafusion/issues/16116 and geometry types

(note I am an author)
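For anyone unfamiliar with what "a logical annotation on top of existing Arrow types" means: in the Arrow format an extension type is carried as field metadata under the keys `ARROW:extension:name` and `ARROW:extension:metadata`, while the data itself stays in an ordinary storage type. The sketch below models that with a plain dataclass. The `Field` class, `logical_type` helper, and the extension name `example.geometry` are hypothetical and are not the Arrow or DataFusion API.

```python
from dataclasses import dataclass, field

# Metadata keys defined by the Arrow columnar format for extension types.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"


@dataclass(frozen=True)
class Field:
    """Toy stand-in for a schema field: a name, a storage type, and metadata."""
    name: str
    storage_type: str
    metadata: dict = field(default_factory=dict)


def logical_type(f: Field) -> str:
    """Engines that know the extension see it; others fall back to storage."""
    return f.metadata.get(EXT_NAME_KEY, f.storage_type)


plain = Field("payload", "binary")
geom = Field(
    "geom",
    "binary",
    {EXT_NAME_KEY: "example.geometry", EXT_META_KEY: '{"crs": "EPSG:4326"}'},
)

# Both fields are stored as binary; only the annotation differs.
print(logical_type(plain), logical_type(geom), geom.storage_type)
```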

New comment by alamb in "Tpchgen-rs: TPC-H benchmark data generation in pure Rust"

alamb — Sun, 13 Apr 2025 15:19:31 +0000

New comment by alamb in "Apache DataFusion"

alamb — Thu, 16 Jan 2025 17:24:21 +0000

Specifically, DataFusion is faster when querying parquet directly.

Most of the ClickBench leaderboard is for database-specific file formats (that you first have to load the data into)

New comment by alamb in "Apache DataFusion"

alamb — Thu, 16 Jan 2025 17:23:24 +0000

I think you would pick DataFusion over DuckDB if you want to customize it substantially. Not just with user defined functions (which are quite easy to write in DataFusion and are very fast), but things like

* custom file formats (e.g. Spiral or Lance)
* custom query languages / sql dialects
* custom catalogs (e.g. other than a local file or prebuilt duckdb connectors)
* custom indexes (read only parts of parquet files based on extra information you store)
* etc.

If you are looking for the nicest "run SQL on local files" experience, DuckDB is pretty hard to beat

Disclaimer: I am the PMC chair of DataFusion

There are some other interesting FAQs here too: https://datafusion.apache.org/user-guide/faq.html

New comment by alamb in "Building Databases over a Weekend"

alamb — Thu, 21 Nov 2024 14:55:03 +0000

BTW here is a fun exercise that takes this idea to the extreme. Who can build a custom file format that gets the best ClickBench performance (on DataFusion):

https://github.com/apache/datafusion/issues/13448

Disclaimer: I am on the PMC of Apache DataFusion, so I am totally a fanboy.

New comment by alamb in "Using Parquet's Bloom Filters"

alamb — Wed, 29 May 2024 09:33:00 +0000

In general, if you can partition your datasets on your predicate column, sorting is likely the best option

For example, when you have a predicate like `where id = 'fdhah-4311-ddsdd-222aa'`, sorting on the `id` column will help

However, if you have predicates on multiple different sets of columns, such as another query on `state = 'MA'`, you can't pick an ideal sort order for all of them.

People often partition (sort) on the low cardinality columns first as that tends to improve compression significantly
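Here is a small simulation of why sorting helps a point predicate. It is a sketch, not Parquet code: `row_groups` and `groups_to_scan` are hypothetical helpers that mimic per-row-group min/max statistics, and the ids and chunk size are made up. With sorted data the min/max ranges barely overlap, so a reader can skip almost every chunk; with unsorted data nearly every chunk's range covers the target.

```python
import random


def row_groups(values, size):
    """Split values into fixed-size chunks and keep (min, max) per chunk,
    mimicking the statistics a writer stores for each row group."""
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    return [(min(chunk), max(chunk)) for chunk in chunks]


def groups_to_scan(stats, target):
    """Count chunks whose [min, max] range could contain the target."""
    return sum(1 for lo, hi in stats if lo <= target <= hi)


random.seed(0)
ids = [f"id-{n:06d}" for n in range(10_000)]
shuffled = ids[:]
random.shuffle(shuffled)

target = "id-004311"  # plays the role of: where id = '...'
unsorted_scans = groups_to_scan(row_groups(shuffled, 1_000), target)
sorted_scans = groups_to_scan(row_groups(sorted(shuffled), 1_000), target)

# Sorted data lets min/max statistics rule out all but one chunk.
print(unsorted_scans, sorted_scans)
```

The same file can only be sorted one way, which is the problem with a second predicate such as `state = 'MA'`: its min/max ranges stay wide in every chunk unless `state` happens to lead the sort order.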

New comment by alamb in "Bringing GPU acceleration to Polars DataFrames in the near future"

alamb — Fri, 05 Apr 2024 18:39:35 +0000

It would be amazing if the code for working with arrow on GPUs could be made open source -- I think that would drive a significant amount of adoption

New comment by alamb in "Show HN: Spice.ai – materialize, accelerate, and query SQL data from any source"

alamb — Thu, 28 Mar 2024 17:51:46 +0000

So great to see another project built on DataFusion!

New comment by alamb in "Apache Arrow DataFusion Comet"

alamb — Wed, 06 Mar 2024 12:06:33 +0000

The Apache Arrow PMC is pleased to announce the donation of the Comet project, a native Spark SQL Accelerator built on Apache Arrow DataFusion.

Apache Arrow DataFusion Comet

alamb — Wed, 06 Mar 2024 12:06:32 +0000

Article URL: https://arrow.apache.org/blog/2024/03/06/comet-donation/

Comments URL: https://news.ycombinator.com/item?id=39615022

Points: 6

# Comments: 1

New comment by alamb in "What I talk about when I talk about query optimizer (part 1): IR design"

alamb — Mon, 29 Jan 2024 17:23:27 +0000

CMU's database courses are online and excellent:

https://15445.courses.cs.cmu.edu/spring2024/

https://15721.courses.cs.cmu.edu/spring2023/

New comment by alamb in "What I talk about when I talk about query optimizer (part 1): IR design"

alamb — Mon, 29 Jan 2024 17:22:09 +0000

BTW you can see a version of what an industrial strength query optimizer / execution engine looks like in Rust https://arrow.apache.org/datafusion/

(can also use it in your own projects)

It is quite similar to what is described in this post

Pg_analytics: Transforming Postgres into a Fast Analytical Database

alamb — Mon, 29 Jan 2024 17:20:10 +0000

Article URL: https://docs.paradedb.com/blog/introducing_analytics

Comments URL: https://news.ycombinator.com/item?id=39179023

Points: 10

# Comments: 3

DataWeb: Virtual Data Unsiloing

alamb — Fri, 19 Jan 2024 13:35:52 +0000

Article URL: https://github.com/devinjdangelo/DataWeb

Comments URL: https://news.ycombinator.com/item?id=39055212

Points: 1

# Comments: 0

New comment by alamb in "Updates to the H2O.ai db-benchmark"

alamb — Mon, 06 Nov 2023 18:32:48 +0000

The following paper describes some of the tradeoffs between different formats

Deep Dive into Common Open Formats for Analytical DBMSs https://www.vldb.org/pvldb/vol16/p3044-liu.pdf

New comment by alamb in "Updates to the H2O.ai db-benchmark"

alamb — Mon, 06 Nov 2023 18:20:08 +0000

I do think it was important for duckdb to put out a new version of the results as the earlier version of that benchmark [1] went dormant with a very old version of duckdb with very bad performance, especially against polars.

[1] https://h2oai.github.io/db-benchmark/