Hacker News: pauldix

Duck Hunt: Moving Bauplan from DuckDB to DataFusion

pauldix — Tue, 11 Nov 2025 14:44:17 +0000

Article URL: https://www.bauplanlabs.com/post/duck-hunt-moving-bauplan-from-duckdb-to-datafusion

Comments URL: https://news.ycombinator.com/item?id=45887797

Points: 17

# Comments: 1

New comment by pauldix in "Bloom filters are good for search that does not scale"

pauldix — Tue, 04 Nov 2025 15:18:00 +0000

I believe you could do this effectively with COBS (COmpact Bit Sliced signature index): https://panthema.net/2019/1008-COBS-A-Compact-Bit-Sliced-Sig...

It's a pretty neat algorithm from a paper in 2019 for the application "to index k-mers of DNA samples or q-grams from text documents". You can take a collection of bloom filters built for documents and then combine them together to have a single filter that will tell you which docs it maps to. Like an inverted index meets a bloom filter.

I'm using it in a totally different domain for an upcoming release in InfluxDB (time series database).

There's also code online here: https://github.com/bingmann/cobs
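
To make the "inverted index meets a bloom filter" idea a bit more concrete, here's a toy sketch of a bit-sliced index. This is not the COBS implementation (the real layout is more compact than this) and the class and method names are made up for illustration. It assumes every document gets a Bloom filter of the same size, and it stores the filters by bit position so that a lookup only reads the rows a term hashes to.

```python
import hashlib


class BitSlicedIndex:
    """Toy bit-sliced signature index (hypothetical names, not the COBS API).

    rows[i] is an integer bitmask: bit d is set when document d's
    Bloom filter has bit i set.
    """

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.rows = [0] * num_bits
        self.num_docs = 0

    def _positions(self, term):
        # Derive num_hashes bit positions for a term from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add_document(self, terms):
        # Set this document's column bit in every row its terms hash to.
        doc_id = self.num_docs
        self.num_docs += 1
        for term in terms:
            for pos in self._positions(term):
                self.rows[pos] |= 1 << doc_id
        return doc_id

    def query(self, term):
        # AND the rows for the term's positions. Surviving bits are candidate
        # documents: false positives are possible, false negatives are not.
        candidates = (1 << self.num_docs) - 1
        for pos in self._positions(term):
            candidates &= self.rows[pos]
        return [d for d in range(self.num_docs) if (candidates >> d) & 1]
```

Usage would look something like `idx = BitSlicedIndex()`, then `idx.add_document(["cpu", "mem"])` for each document, then `idx.query("cpu")` to get the candidate document ids. As with any Bloom filter, the candidates still need to be checked against the real data.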

New comment by pauldix in "Spiral"

pauldix — Thu, 11 Sep 2025 16:25:40 +0000

I've been following this team's work for a while and what they're doing is super interesting. The file format they created and put into the LF, Vortex, is a very welcome innovation in the space: https://github.com/vortex-data/vortex

I'm excited to start doing some experimentation with Vortex to see how it can improve our products.

Great stuff, congrats to Will and team!

LF AI and Data Hosts Vortex Project for Data Access for AI and Analytics

pauldix — Wed, 06 Aug 2025 16:34:09 +0000

Article URL: https://www.linuxfoundation.org/press/lf-ai-data-foundation-hosts-vortex-project-to-power-high-performance-data-access-for-ai-and-analytics

Comments URL: https://news.ycombinator.com/item?id=44814290

Points: 19

# Comments: 1

New comment by pauldix in "Timescale Is Now TigerData"

pauldix — Wed, 18 Jun 2025 14:25:06 +0000

InfluxDB Founder & CTO here. We worked hard to support InfluxQL in 3.x and it supports the v1 write API. Admittedly, it will be a migration to move and we haven't yet built the tooling, but we felt it was important to get the 3.0 release out even without it. Our plan is to have that available later this year.

The 2.x to 3.x move is, admittedly, much harder. This is because of the language Flux. We haven't been able to bring that over to 3.x in a way that makes it useful. We actually built a bridge for it in our cloud offering, but our experience is that the performance isn't good enough to be acceptable for customers wanting to upgrade. If they want to make the move, adopting SQL or InfluxQL is likely the only path.

We'll continue to develop 3.x and we'll build more migration tooling over time. I think we can build specialized tooling to help Flux users migrate over to 3.x with query translation tools, but there are more features we need to land in 3.x to enable that first.

We're committed to the technology stack (Apache Arrow & DataFusion) and the 3.x line. We have no plans for another major release. I'll be happy if we end up releasing 3.56.2 8 years from now.

New comment by pauldix in "Claude 4 System Card"

pauldix — Sun, 25 May 2025 14:33:39 +0000

Right now this is just in the AI Studio web UI. I have a few command-line scripts to put together a file or two and drop those in. So far I've put in about 450k of stuff there and then, over a very long conversation and iterations on a bunch of things, built up another 350k of tokens in that window.

Then start over again to clean things out. It's not flawless, but it is surprising what it'll remember from a while back in the conversation.

I've been meaning to pick up some of the more automated tooling and editors, but for the phase of the project I'm in right now, it's unnecessary and the web UI or the Claude app are good enough for what I'm doing.

New comment by pauldix in "Claude 4 System Card"

pauldix — Sun, 25 May 2025 13:17:51 +0000

My experience so far with Opus 4 is that it's very good. Based on a few days of using it for real work, I think it's better than Sonnet 3.5 or 3.7, which had been my daily drivers prior to Gemini 2.5 Pro switching me over just 3 weeks ago. It has solved some things that eluded Gemini 2.5 Pro.

Right now I'm swapping between Gemini and Opus depending on the task. Gemini's 1M token context window is really unbeatable.

But the quality of what Opus 4 produces is really good.

edit: forgot to mention that this is all for Rust based work on InfluxDB 3, a fairly large and complex codebase. YMMV

10 Years of Stable Rust: An Infrastructure Story

pauldix — Thu, 15 May 2025 15:40:41 +0000

Article URL: https://rustfoundation.org/media/10-years-of-stable-rust-an-infrastructure-story/

Comments URL: https://news.ycombinator.com/item?id=43996189

Points: 13

# Comments: 2

New comment by pauldix in "InfluxDB 3 Core and Enterprise GA"

pauldix — Tue, 15 Apr 2025 14:25:08 +0000

We're very excited about this release, over 4 years in the making. Over that time we adopted, contributed to, and helped lead parts of what we're calling the FDAP stack: Apache Arrow Flight, DataFusion, Arrow, and Parquet.

We wrote the Rust object store crate, which is used in this stack and by many others, and contributed it to the ASF.

This release is based on a "diskless" architecture that uses object storage for all durability. With DataFusion it has a columnar, vectorized, standards compliant SQL query engine. We also built support for InfluxQL on top of it.

The other big thing we brought in is an embedded Python VM using PyO3 and Python Build Standalone. This makes it possible to do data collection, ETL, monitoring, alerting, and all kinds of tasks inside the database at the point of collection.

Happy to answer any questions about the big project, what's next or anything time series related.

InfluxDB 3 Core and Enterprise GA

pauldix — Tue, 15 Apr 2025 14:25:08 +0000

Article URL: https://www.influxdata.com/blog/influxdb-3-oss-ga/

Comments URL: https://news.ycombinator.com/item?id=43693209

Points: 5

# Comments: 1

New comment by pauldix in "InfluxDB 3 Enterprise free for at-home use and an update on Core's 72-hour limit"

pauldix — Mon, 27 Jan 2025 16:23:21 +0000

Blog post author and InfluxDB creator and CTO here. Happy to answer any questions here or provide more technical detail.

New comment by pauldix in "Scaling to users requires Synapse Pro"

pauldix — Sat, 18 Jan 2025 20:25:59 +0000

Our intention with InfluxDB Core is that it's useful to a large audience. Just not the group of people seeking a historical TSDB. It's a collector, processor, and recent data TSDB. If you're familiar with the TICK stack from our 1.x line, it's like Telegraf (the data collector), Kapacitor (the processor and monitoring agent), and an InfluxDB that is better on the most recent data.

The InfluxDB part of it is more narrowly scoped than previous versions, but the Telegraf and Kapacitor parts are much more feature rich than those previous products.

New comment by pauldix in "InfluxDB 3 Open Source Now in Public Alpha Under MIT/Apache 2 License"

pauldix — Fri, 17 Jan 2025 17:07:00 +0000

Core doesn't index the metadata so it uses less RAM for higher cardinality data. However, if you have 100M series and you're writing to all of them at the same time, you're going to need some amount of RAM just to buffer it all up and then ship it off to storage as Parquet. The Enterprise product has a compactor that creates indexes as it goes, but those indexes are lighter weight than those in v1 and v2. Also, users can specify which columns they want to appear in those indexes, so they can leave out high cardinality ones if they want to save on RAM. In v3 you can brute force the query against high cardinality data, unlike v1 & v2, which would eat up a ton of RAM to do so.

New comment by pauldix in "InfluxDB 3 Open Source Now in Public Alpha Under MIT/Apache 2 License"

pauldix — Wed, 15 Jan 2025 16:45:45 +0000

I talk a little bit more about this in a comment on a different submission of this post: https://news.ycombinator.com/item?id=42704526

Can you say more about your use case?

New comment by pauldix in "InfluxDB 3 Open Source Now in Public Alpha Under MIT/Apache 2 License"

pauldix — Wed, 15 Jan 2025 00:57:36 +0000

2.0 was single server. Our paid offering of that is a usage based cloud platform that’s highly available and managed.

New comment by pauldix in "InfluxDB 3 Open Source Now in Public Alpha Under MIT/Apache 2 License"

pauldix — Tue, 14 Jan 2025 22:31:02 +0000

We're open core and have been since 2016. We've deliberately limited the scope of what the open source project is supposed to do. It should be great at this use case of collecting, processing, storing, and querying recently buffered data.

The commercial offering is the historical time series DB along with a bunch of other features around high availability, read replication, fine grained security, and the compaction engine which enables longer range queries and row level deletes.

I think Scylla had most of their DB in the open and then a small slice of Enterprise functionality (although I'm not super familiar with their product line).

Ideally, we'd have many open source users and even our commercial customers would use the open source in addition to the commercial offering.

But ultimately, it's about finding a sustainable business model that keeps more software coming. We have a preference for permissive open source over source available. In my view, we may as well create freemium rather than source available.

With this version of InfluxDB, we've been able to invest heavily into Apache projects that lie at the core of it: Arrow, DataFusion, Parquet, and the object store crate, which we developed and donated to the ASF.

We'd like to continue that work because we think that a highly performant, modular, vectorized query engine (i.e. DataFusion) should be a free commodity that's widely available and widely contributed to.

New comment by pauldix in "InfluxDB 3 Open Source Now in Public Alpha Under MIT/Apache 2 License"

pauldix — Tue, 14 Jan 2025 22:02:47 +0000

Post author, cofounder and creator of InfluxDB here. Happy to answer questions in this thread.

I'm guessing there will be questions about the 72 hour limit. There are two things we're looking at:

First, we're considering giving a free tier for at home and hobbyist usage of Enterprise, which doesn't have this limitation. So this would be kind of like what Tailscale does giving a free usage plan for their commercial software.

Second, for Core, the open source build, we're working on an update that will let it query any 72 hour window of historical data. Right now it doesn't evict data. It all still exists on disk or object storage as Parquet files, but we remove the metadata information from RAM to keep things optimized for the most recent 72 hours.

When the update is done, you'll be able to write and query for any period of time. But an individual query will be limited to a 72 hour time range. This is a service protection mechanism because of how the data is organized.

A file gets created for every 10 minute block of time for each table. So 72 hours is 432 files, which is a lot of GET requests to S3 for a single query. We don't want to increase the range because of that. Running multiple queries to cover a longer range, or accessing the data from third-party clients, is still possible.
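
For anyone who wants to check the arithmetic, or see what covering a longer range with multiple queries might look like, here's a small sketch. It only uses the numbers above (10 minute files, 72 hour query limit). The function names are hypothetical and this is not an InfluxDB API, just a client-side way to split a range into windows that each fit under the limit.

```python
from datetime import datetime, timedelta, timezone

FILE_BLOCK = timedelta(minutes=10)      # one file per table per 10 minute block
MAX_QUERY_RANGE = timedelta(hours=72)   # per-query time range limit


def files_per_table(time_range, block=FILE_BLOCK):
    """Number of 10 minute files a query over time_range touches, per table."""
    # Ceiling division on timedeltas: a partial block still needs its file.
    return -(-time_range // block)


def split_into_windows(start, end, max_range=MAX_QUERY_RANGE):
    """Split [start, end) into consecutive windows no longer than max_range."""
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + max_range, end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


if __name__ == "__main__":
    print(files_per_table(MAX_QUERY_RANGE))  # 432
    start = datetime(2025, 1, 1, tzinfo=timezone.utc)
    for lo, hi in split_into_windows(start, start + timedelta(days=7)):
        print(lo.isoformat(), "->", hi.isoformat())
```

A 7 day range comes out as three windows (72h, 72h, 24h), so a client would issue three queries and combine the results itself.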

In Enterprise, our commercial product, we have a compactor that collapses these files into larger time blocks and also creates an index that the query engine can use.

Doing it this way was a deliberate choice so that we could have a permissively licensed open source project separate from the commercial product. If we put the compactor into the open, we'd have to put it under a source available license to limit usage so that we can still sell the database.

Our hope is that there's still an audience of users that will find Core useful on its own, even without any commercial relationship with us. It's not a full historical TSDB, but it's not intended to be. It's meant to be a recent data engine that can collect, process, monitor, ship, and store data paired with a fast analytical query engine against the recent buffer (or recently persisted buffer).

Happy to answer any followup questions about this or the release generally.

New comment by pauldix in "InfluxDB 3 Open Source Now in Public Alpha Under MIT/Apache 2 License"

pauldix — Tue, 14 Jan 2025 20:46:08 +0000

That's right, compaction is the way to solve for performance over longer time ranges. This is what we have in our commercial Enterprise product.

New comment by pauldix in "InfluxDB 3 Open Source Now in Public Alpha Under MIT/Apache 2 License"

pauldix — Tue, 14 Jan 2025 16:20:08 +0000

We think that Core will fill some of the use cases of previous OSS versions of InfluxDB, but not all. But we also expect that Core will be useful in many more places where previous OSS versions of InfluxDB were not.

So Core isn't intended to be a full historical TSDB. It's more like a data collector, processing engine, data shipper and recent data buffer/DB.

For a full historical TSDB, that's the product we sell. Keeping the two separate gives us the ability to have real open source vs. combining them and requiring a different license that lets us do freemium.

We'll likely have a freemium tier for the commercial product (Enterprise), but that's separate from the open source project.

New comment by pauldix in "InfluxDB 3 Open Source Now in Public Alpha Under MIT/Apache 2 License"

pauldix — Mon, 13 Jan 2025 17:13:19 +0000

The data is persisted as Parquet files on object storage (or locally attached disk) and is queryable from any tool that can read Parquet. It isn't evicted by the DB. The 72 hour limit is just what is visible to the running database process.

It's a constraint that we could relax over time, but for now we wanted to limit the scope so we can focus on the recent data. We're also considering a free tier of Enterprise for at-home use cases (i.e. non-commercial hobbyist).

As for EOL on previous versions, we don't have anything planned at the moment. We're partnered with AWS on their hosted versions of InfluxDB 2.x OSS so I expect that to continue for quite some time.