<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: alamb</title><link>https://news.ycombinator.com/user?id=alamb</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 16 Apr 2026 08:07:05 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=alamb" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by alamb in "Embedding a Tantivy Index in Parquet"]]></title><description><![CDATA[
<p>This demo extends a Parquet file by embedding a Tantivy full-text search index inside it. A custom DataFusion TableProvider implementation uses the embedded full-text index to optimize wildcard LIKE predicates.</p>
]]></description><pubDate>Thu, 25 Sep 2025 14:47:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=45373254</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=45373254</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45373254</guid></item><item><title><![CDATA[Embedding a Tantivy Index in Parquet]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/jcsherin/datablok/tree/main/crates/parquet-embed-tantivy">https://github.com/jcsherin/datablok/tree/main/crates/parquet-embed-tantivy</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45373253">https://news.ycombinator.com/item?id=45373253</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Thu, 25 Sep 2025 14:47:05 +0000</pubDate><link>https://github.com/jcsherin/datablok/tree/main/crates/parquet-embed-tantivy</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=45373253</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45373253</guid></item><item><title><![CDATA[New comment by alamb in "Embedding user-defined indexes in Apache Parquet"]]></title><description><![CDATA[
<p>> Note that the readers of Parquet need to be aware of any metadata to exploit it. But if not, nothing changes<p>The one downside of this approach, which is likely obvious but which I haven't seen mentioned, is that the resulting Parquet files are larger than they would be otherwise, and the increased size only benefits engines that know how to interpret the new index.<p>(I am an author)</p>
]]></description><pubDate>Tue, 15 Jul 2025 11:01:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=44569881</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=44569881</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44569881</guid></item><item><title><![CDATA[New comment by alamb in "Embedding user-defined indexes in Apache Parquet"]]></title><description><![CDATA[
<p>> That is, start with Wild West and define specs as needed<p>Yes, this is my personal hope as well -- if new index types become widespread, they can be incorporated formally into the spec.<p>However, changing the spec is a nontrivial process that requires significant consensus and engineering.<p>Thus the methods described in the blog can be used to add indexes before any spec change, and potentially to prototype / prove out new indexes.<p>(note I am an author)</p>
]]></description><pubDate>Tue, 15 Jul 2025 10:57:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=44569858</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=44569858</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44569858</guid></item><item><title><![CDATA[New comment by alamb in "Embedding user-defined indexes in Apache Parquet"]]></title><description><![CDATA[
<p>We are actively working on supporting extension types. We will likely use the Arrow extension type mechanism (a logical annotation on top of existing Arrow types <a href="https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types" rel="nofollow">https://arrow.apache.org/docs/format/Columnar.html#format-me...</a>)<p>I expect this to be used to support Variant <a href="https://github.com/apache/datafusion/issues/16116">https://github.com/apache/datafusion/issues/16116</a> and geometry types<p>(note I am an author)</p>
]]></description><pubDate>Tue, 15 Jul 2025 10:52:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=44569838</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=44569838</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44569838</guid></item><item><title><![CDATA[New comment by alamb in "Tpchgen-rs: TPC-H benchmark data generation in pure Rust"]]></title><description><![CDATA[
<p>See also related blog: <a href="https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator/" rel="nofollow">https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-g...</a></p>
]]></description><pubDate>Sun, 13 Apr 2025 15:19:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=43673410</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=43673410</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43673410</guid></item><item><title><![CDATA[New comment by alamb in "Apache DataFusion"]]></title><description><![CDATA[
<p>Specifically, DataFusion is faster when querying Parquet directly.<p>Most of the ClickBench leaderboard is for database-specific file formats (which you first have to load the data into)</p>
]]></description><pubDate>Thu, 16 Jan 2025 17:24:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=42728160</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=42728160</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42728160</guid></item><item><title><![CDATA[New comment by alamb in "Apache DataFusion"]]></title><description><![CDATA[
<p>I think you would pick DataFusion over DuckDB if you want to customize it substantially. Not just with user-defined functions (which are quite easy to write in DataFusion and are very fast), but things like
* custom file formats (e.g. Spiral or Lance)
* custom query languages / sql dialects
* custom catalogs (e.g. other than a local file or prebuilt DuckDB connectors)
* custom indexes (read only parts of Parquet files based on extra information you store)
* etc.<p>If you are looking for the nicest "run SQL on local files" experience, DuckDB is pretty hard to beat<p>Disclaimer: I am the PMC chair of DataFusion<p>There are some other interesting FAQs here too: <a href="https://datafusion.apache.org/user-guide/faq.html" rel="nofollow">https://datafusion.apache.org/user-guide/faq.html</a></p>
]]></description><pubDate>Thu, 16 Jan 2025 17:23:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=42728151</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=42728151</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42728151</guid></item><item><title><![CDATA[New comment by alamb in "Building Databases over a Weekend"]]></title><description><![CDATA[
<p>BTW here is a fun exercise that takes this idea to the extreme. Who can build a custom file format that gets the best ClickHouse performance (on DataFusion):<p><a href="https://github.com/apache/datafusion/issues/13448">https://github.com/apache/datafusion/issues/13448</a><p>Disclaimer: I am on the PMC of Apache DataFusion, so I am totally a fanboy.</p>
]]></description><pubDate>Thu, 21 Nov 2024 14:55:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=42204856</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=42204856</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42204856</guid></item><item><title><![CDATA[New comment by alamb in "Using Parquet's Bloom Filters"]]></title><description><![CDATA[
<p>In general, if you can partition your datasets on your predicate column, sorting is likely the best option<p>For example, when you have a predicate like `where id = 'fdhah-4311-ddsdd-222aa'`, sorting on the `id` column will help<p>However, if you have predicates on multiple different sets of columns, such as another query on `state = 'MA'`, you can't pick an ideal sort order for all of them.<p>People often partition (sort) on the low cardinality columns first, as that tends to improve compression significantly</p>
]]></description><pubDate>Wed, 29 May 2024 09:33:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=40510104</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=40510104</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40510104</guid></item><item><title><![CDATA[New comment by alamb in "Bringing GPU acceleration to Polars DataFrames in the near future"]]></title><description><![CDATA[
<p>It would be amazing if the code for working with Arrow on GPUs could be made open source -- I think that would drive a significant amount of adoption</p>
]]></description><pubDate>Fri, 05 Apr 2024 18:39:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=39945607</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=39945607</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39945607</guid></item><item><title><![CDATA[New comment by alamb in "Show HN: Spice.ai – materialize, accelerate, and query SQL data from any source"]]></title><description><![CDATA[
<p>So great to see another project built on DataFusion!</p>
]]></description><pubDate>Thu, 28 Mar 2024 17:51:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=39855068</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=39855068</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39855068</guid></item><item><title><![CDATA[New comment by alamb in "Apache Arrow DataFusion Comet"]]></title><description><![CDATA[
<p>The Apache Arrow PMC is pleased to announce the donation of the Comet project, a native Spark SQL Accelerator built on Apache Arrow DataFusion.</p>
]]></description><pubDate>Wed, 06 Mar 2024 12:06:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=39615023</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=39615023</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39615023</guid></item><item><title><![CDATA[Apache Arrow DataFusion Comet]]></title><description><![CDATA[
<p>Article URL: <a href="https://arrow.apache.org/blog/2024/03/06/comet-donation/">https://arrow.apache.org/blog/2024/03/06/comet-donation/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=39615022">https://news.ycombinator.com/item?id=39615022</a></p>
<p>Points: 6</p>
<p># Comments: 1</p>
]]></description><pubDate>Wed, 06 Mar 2024 12:06:32 +0000</pubDate><link>https://arrow.apache.org/blog/2024/03/06/comet-donation/</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=39615022</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39615022</guid></item><item><title><![CDATA[New comment by alamb in "What I talk about when I talk about query optimizer (part 1): IR design"]]></title><description><![CDATA[
<p>CMU's database courses are online and excellent:<p><a href="https://15445.courses.cs.cmu.edu/spring2024/" rel="nofollow">https://15445.courses.cs.cmu.edu/spring2024/</a><p><a href="https://15721.courses.cs.cmu.edu/spring2023/" rel="nofollow">https://15721.courses.cs.cmu.edu/spring2023/</a></p>
]]></description><pubDate>Mon, 29 Jan 2024 17:23:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=39179083</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=39179083</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39179083</guid></item><item><title><![CDATA[New comment by alamb in "What I talk about when I talk about query optimizer (part 1): IR design"]]></title><description><![CDATA[
<p>BTW you can see a version of what an industrial-strength query optimizer / execution engine looks like in Rust: <a href="https://arrow.apache.org/datafusion/" rel="nofollow">https://arrow.apache.org/datafusion/</a><p>(you can also use it in your own projects)<p>It is quite similar to what is described in this post</p>
]]></description><pubDate>Mon, 29 Jan 2024 17:22:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=39179062</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=39179062</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39179062</guid></item><item><title><![CDATA[Pg_analytics: Transforming Postgres into a Fast Analytical Database]]></title><description><![CDATA[
<p>Article URL: <a href="https://docs.paradedb.com/blog/introducing_analytics">https://docs.paradedb.com/blog/introducing_analytics</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=39179023">https://news.ycombinator.com/item?id=39179023</a></p>
<p>Points: 10</p>
<p># Comments: 3</p>
]]></description><pubDate>Mon, 29 Jan 2024 17:20:10 +0000</pubDate><link>https://docs.paradedb.com/blog/introducing_analytics</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=39179023</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39179023</guid></item><item><title><![CDATA[DataWeb: Virtual Data Unsiloing]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/devinjdangelo/DataWeb">https://github.com/devinjdangelo/DataWeb</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=39055212">https://news.ycombinator.com/item?id=39055212</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 19 Jan 2024 13:35:52 +0000</pubDate><link>https://github.com/devinjdangelo/DataWeb</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=39055212</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39055212</guid></item><item><title><![CDATA[New comment by alamb in "Updates to the H2O.ai db-benchmark"]]></title><description><![CDATA[
<p>The following paper describes some of the tradeoffs between different formats:<p>Deep Dive into Common Open Formats for Analytical DBMSs
<a href="https://www.vldb.org/pvldb/vol16/p3044-liu.pdf" rel="nofollow noreferrer">https://www.vldb.org/pvldb/vol16/p3044-liu.pdf</a></p>
]]></description><pubDate>Mon, 06 Nov 2023 18:32:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=38166682</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=38166682</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38166682</guid></item><item><title><![CDATA[New comment by alamb in "Updates to the H2O.ai db-benchmark"]]></title><description><![CDATA[
<p>I do think it was important for DuckDB to put out a new version of the results, as the earlier version of that benchmark [1] went dormant while using a very old version of DuckDB that performed very poorly, especially against Polars.<p>[1] <a href="https://h2oai.github.io/db-benchmark/" rel="nofollow noreferrer">https://h2oai.github.io/db-benchmark/</a></p>
]]></description><pubDate>Mon, 06 Nov 2023 18:20:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=38166459</link><dc:creator>alamb</dc:creator><comments>https://news.ycombinator.com/item?id=38166459</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38166459</guid></item></channel></rss>