Hacker News: ignoreusernames

New comment by ignoreusernames in "My LSM tree was slower than a B-tree. Then I profiled it"

ignoreusernames — Thu, 18 Jun 2026 20:04:13 +0000

Yeah, especially a Bloom filter, which has a pretty easy formula for its false positive rate.
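For reference, the usual approximation is p ≈ (1 - e^(-kn/m))^k for a filter with m bits, n inserted keys and k hash functions. A minimal sketch of that arithmetic follows; the function names are my own and not from the article, and the formula assumes independent, uniformly distributed hashes.

```python
import math


def bloom_false_positive_rate(m_bits: int, n_items: int, k_hashes: int) -> float:
    """Standard approximation: p ~= (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


def optimal_hash_count(m_bits: int, n_items: int) -> int:
    """Hash count that minimizes p for a given m and n: k = (m/n) * ln 2."""
    return max(1, round(m_bits / n_items * math.log(2)))


# Example: 10 bits per key with the matching hash count
k = optimal_hash_count(10_000, 1_000)
p = bloom_false_positive_rate(10_000, 1_000, k)
```

With 10 bits per key this lands a bit under 1%, which is why it is easy to sanity-check a measured false positive rate against the expected one.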

New comment by ignoreusernames in "--dangerously-skip-reading-code"

ignoreusernames — Sat, 23 May 2026 22:58:18 +0000

Don’t you think the LLM provider is also a dimension in these discussions about responsibility? We often talk about the tech itself (LLM-driven development), but how we access it is just as important imo. It’s either locked behind a non-trivial amount of hardware (for open models) or behind some monopolistic provider like OpenAI or Anthropic. In the provider case, it’s not really the LLM that will “own” the code, it’s the provider itself, and we’ll be at the mercy of whatever pricing model they shove down our throats.

New comment by ignoreusernames in "Nobody Reviews Compiler Output"

ignoreusernames — Thu, 07 May 2026 20:56:49 +0000

I think this argument only holds if you believe that LLMs are at a point where they can handle any combination of craziness that you throw at them.

My own experience working with agents is that there’s a “snowball of shit” effect: small mistakes that compound on each other. You can either

- review the code and try to prune some of the shit occasionally

- let the LLM handle everything

As of the current state of the industry, it’s very hard for me not to see the second option as extremely irresponsible. Coding agents’ limits are not well defined, and unless you’re running an open-weight model locally (most people aren’t), you just gave up all control over your code to a third party. If running local models were the norm, the argument that LLMs are just another layer of abstraction would hold a little better. Reusing the compiler analogy from the post, it’s like depending on a compiler where you pay a monthly premium to compile your code. Those did exist a while ago with closed licenses, but I think the majority of deployed code nowadays is on open-ish platforms. This walled-garden development paradigm already lost once.

New comment by ignoreusernames in "Async Rust never left the MVP state"

ignoreusernames — Tue, 05 May 2026 12:42:00 +0000

Can you elaborate on this please? Do you mean that it’s basically impossible for Rust’s std to provide a default runtime that makes “everyone” (embedded on one end and web on the other) happy?

New comment by ignoreusernames in "Async Rust never left the MVP state"

ignoreusernames — Tue, 05 May 2026 11:39:55 +0000

As of now I don’t think there’s an alternative. I’m not a Rust expert, but the core issue to me is that “async” goes beyond just having a futures scheduler. Async code usually needs network, disk and OS interaction plus future utilities (spawn), and these are all things the runtime (tokio) provides. It’s pretty hard for runtimes to be compatible with each other unless the language itself provides those.

New comment by ignoreusernames in "Async Rust never left the MVP state"

ignoreusernames — Tue, 05 May 2026 11:15:11 +0000

I may have missed something, but how does “sans-io” deal with CPU heavy code? For example, if there’s some heavy decoding/encoding required on the data? Does the event loop only drive the network side and the heavy part is done after the loop is finished?

New comment by ignoreusernames in "Async Rust never left the MVP state"

ignoreusernames — Tue, 05 May 2026 09:58:32 +0000

Agree with the other commenters that the title is a bit too dramatic. The content was well written and got the point across.

I still don’t have enough experience to have a strong opinion on Rust async, but some things did stand out.

On the good side, it’s nice being able to have explicit runtimes. Instead of polluting the whole project with async, you can do the opposite: be sync first and use the runtime on the IO “edges”. This was a great fit for a project that I’m working on, and it seems like a pretty similar strategy to what Zig is doing with IO code. This largely solved the function coloring problem in this particular case. Strict separation of IO-bound and CPU-bound code was a requirement regardless of the async stuff, so using the explicit IO runtime was natural.

On the bad side, it seems crazy to me how much the whole ecosystem depends on tokio. It’s almost as if Java’s GC were optional, but in practice everyone used the same third-party GC runtime, and pulling in any library forced you to use that runtime. This sort of central dependency is simply not healthy.

New comment by ignoreusernames in "Consistent hashing"

ignoreusernames — Fri, 03 Oct 2025 15:07:31 +0000

Another strategy to avoid redistribution is simply having a big enough number of partitions and assigning ranges instead of single partitions. It is a bit more complex on the coordination side, but it works well in other domains (distributed processing, for example).
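A rough sketch of what I mean, under my own assumptions: the partition count is fixed up front, keys map to partitions with a stable hash, and nodes own contiguous partition ranges. The names (NUM_PARTITIONS, hand_over, etc.) are made up for illustration, and a real system would need a coordinator to publish the range table.

```python
import bisect
import zlib

NUM_PARTITIONS = 1024  # fixed up front, much larger than the expected node count


def partition_of(key: str) -> int:
    # crc32 is stable across processes, unlike Python's built-in hash() for str
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def owner_of_partition(partition: int, ranges: list[tuple[int, str]]) -> str:
    """`ranges` is a sorted list of (range_start, node); a range ends where the next starts."""
    starts = [start for start, _ in ranges]
    return ranges[bisect.bisect_right(starts, partition) - 1][1]


def owner(key: str, ranges: list[tuple[int, str]]) -> str:
    return owner_of_partition(partition_of(key), ranges)


def hand_over(ranges: list[tuple[int, str]], start: int, new_node: str) -> list[tuple[int, str]]:
    """Give the partitions from `start` up to the next range boundary to `new_node`."""
    return sorted([r for r in ranges if r[0] != start] + [(start, new_node)])


ranges = [(0, "node-a"), (512, "node-b")]
ranges_after = hand_over(ranges, 768, "node-c")  # only partitions 768..1023 move
```

Keys never change partition; only the ownership of a range changes, so the data that moves is limited to the range that was handed over.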

New comment by ignoreusernames in "The two versions of Parquet"

ignoreusernames — Mon, 25 Aug 2025 11:21:57 +0000

> The reference implementation for Parquet is a gigantic Java library. I'm unconvinced this is a good idea.

I haven't thought much about it, but I believe the ideal reference implementation would be a highly optimized "service-like" process that you run alongside your engine, using Arrow to share zero-copy buffers between the engine and the Parquet service. Parquet predates Arrow by quite a few years and Java was (unfortunately) the standard for big data stuff back then, so they simply stuck with it.

> The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bitwidth, endianness and value-length

I think they did this to avoid Java's dynamic dispatch. With C++ or Rust something very similar would happen, but at the compiler level, which is a much saner way of doing this kind of thing.
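To make the contrast concrete, here is a minimal sketch of the generic version, where the bit width is a runtime parameter. It is not the parquet-java code; it assumes unsigned values packed LSB-first, and the function names are mine. Generated or monomorphized code would instead have one fixed sequence of shifts and masks per width.

```python
def pack_bits(values: list[int], bit_width: int) -> bytes:
    """Pack unsigned ints into `bit_width` bits each, least significant bit first."""
    mask = (1 << bit_width) - 1
    buffer = 0
    for i, value in enumerate(values):
        buffer |= (value & mask) << (i * bit_width)
    n_bytes = (len(values) * bit_width + 7) // 8  # round up to whole bytes
    return buffer.to_bytes(n_bytes, "little")


def unpack_bits(data: bytes, bit_width: int, count: int) -> list[int]:
    """Inverse of pack_bits: read `count` values of `bit_width` bits each."""
    mask = (1 << bit_width) - 1
    buffer = int.from_bytes(data, "little")
    return [(buffer >> (i * bit_width)) & mask for i in range(count)]


packed = pack_bits(list(range(8)), 3)  # eight 3-bit values fit in three bytes
values = unpack_bits(packed, 3, 8)
```

The loop body is the same for every width, so a specializing compiler can unroll it per width on its own, without tens of thousands of lines of checked-in source.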

New comment by ignoreusernames in "Databricks in talks to acquire startup Neon for about $1B"

ignoreusernames — Tue, 06 May 2025 13:53:51 +0000

> Folks I know in the industry are not very happy with databricks

Yeah, big companies gobbling up everything does not lead to a healthy ecosystem. Congrats to the founders on the acquisition, but everyone else loses with moves like this.

I'm still sour after their Redash purchase that instantly "killed" the open source version. The Tabular acquisition was also a bit controversial, since one of the founders is the PMC Chair for Iceberg, which "competes" directly with Databricks' own Delta Lake. The mere presence of these giants (mostly Databricks and Snowflake) makes the whole data ecosystem (both closed and open source) really hostile.

New comment by ignoreusernames in "Anatomy of a SQL Engine"

ignoreusernames — Sun, 27 Apr 2025 10:33:53 +0000

I recommend that anyone who works with databases write a simple engine. It's a lot simpler than you may think and it's a great exercise. If using Python, sqlglot (https://github.com/tobymao/sqlglot) lets you skip all the parsing, and it even does some simple optimizations. From the parsed query tree it's pretty straightforward to build a logical plan and execute that. You can even use Python's builtin ast module to convert SQL expressions into Python ones (so no need for a custom interpreter!)
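A minimal sketch of the ast trick, assuming Python 3.9+. To keep it self-contained I use a made-up tuple format as a stand-in for the parsed SQL expression tree (this is not sqlglot's node API), and the helper names are hypothetical.

```python
import ast

# Hypothetical mini expression tree standing in for a parsed SQL expression:
#   ("col", name) | ("lit", value) | (operator, left, right)
BIN_OPS = {"+": ast.Add, "-": ast.Sub, "*": ast.Mult}
CMP_OPS = {"=": ast.Eq, "<": ast.Lt, ">": ast.Gt}


def to_py(node):
    """Translate one expression node into the equivalent Python AST node."""
    kind = node[0]
    if kind == "col":  # column reference becomes row["name"]
        return ast.Subscript(
            value=ast.Name(id="row", ctx=ast.Load()),
            slice=ast.Constant(value=node[1]),
            ctx=ast.Load(),
        )
    if kind == "lit":
        return ast.Constant(value=node[1])
    left, right = to_py(node[1]), to_py(node[2])
    if kind in BIN_OPS:
        return ast.BinOp(left=left, op=BIN_OPS[kind](), right=right)
    if kind in CMP_OPS:
        return ast.Compare(left=left, ops=[CMP_OPS[kind]()], comparators=[right])
    raise ValueError(f"unsupported operator: {kind}")


def compile_expr(node):
    """Compile the tree into a Python function taking one dict argument, `row`."""
    args = ast.arguments(
        posonlyargs=[], args=[ast.arg(arg="row")], vararg=None,
        kwonlyargs=[], kw_defaults=[], kwarg=None, defaults=[],
    )
    tree = ast.Expression(body=ast.Lambda(args=args, body=to_py(node)))
    ast.fix_missing_locations(tree)  # compile() needs line/column info on every node
    return eval(compile(tree, "<sql-expr>", "eval"))


# WHERE a + 1 > 10
predicate = compile_expr((">", ("+", ("col", "a"), ("lit", 1)), ("lit", 10)))
rows = [{"a": 5}, {"a": 10}, {"a": 42}]
matching = [row for row in rows if predicate(row)]
```

The compiled predicate is a plain Python function, so a filter operator in the toy engine can call it per row without any interpreter loop of its own.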

New comment by ignoreusernames in "ClickHouse gets lazier and faster: Introducing lazy materialization"

ignoreusernames — Wed, 23 Apr 2025 13:25:18 +0000

Same thing with columnar/vectorized execution. It has been known for a long time that it's the "correct" way to process data for OLAP workflows, but it only became "mainstream" in the last few years (mostly due to Arrow).

It's awesome that ClickHouse is adopting it now, but it's a shame that it's not standard in everything that does analytics processing.

New comment by ignoreusernames in "Apache DataFusion"

ignoreusernames — Thu, 16 Jan 2025 10:57:32 +0000

> Spark is "in-memory" in the sense that it isn't forced to spill results to disk between operations

I see your point, but that's only true within a single stage. Any operator that requires repartitioning (groupBys and joins, for example) requires writing to disk.

> [...] which used to be a point of comparison to MapReduce specifically.

So each mapper in Hadoop wrote partial results to disk? LOL, this was way worse than I remember. It's been a long time since I've dealt with Hadoop.

> Not ground-breaking nowadays but when I was doing this stuff 10+ years

I would say that it wouldn't have been groundbreaking 20 years ago. I feel like Hadoop's influence held up our entire field for years. Most of the stuff that Arrow made mainstream and that is being used by a bunch of engines mentioned in this thread has been known for a long time. It's like, as a community, we had blindfolds on. Sorry about the rant, but I'm glad the Hadoop fog is finally dissipating.

New comment by ignoreusernames in "Apache DataFusion"

ignoreusernames — Thu, 16 Jan 2025 09:56:49 +0000

Just out of curiosity, why do you say that Spark is "in-memory"? I see a lot of people claiming that, including several that I've interviewed in the past few years, but that's not very accurate (at least in the default case). Spark SQL execution uses a bog-standard volcano-ish iterator model (with a pretty shitty codegen operator-merging part) built on top of their RDD engine. The exchange (shuffle) is disk based by default (both for SQL queries and lower-level RDD code), so unless you mount the shuffle directory on a ramdisk, I would say that Spark is disk based. You can try it out on spark shell:

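  // Write a small Parquet dataset; explode() names its output column "col"
  spark.sql("SELECT explode(sequence(0, 10000))").write.parquet("sample_data")
  // The groupBy forces an exchange (shuffle) before the final count
  spark.read.parquet("sample_data").groupBy($"col").count().count()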

After running the code, you should see a /tmp/blockmgr-{uuid} directory that holds the exchange data.

New comment by ignoreusernames in "Improving Parquet Dedupe on Hugging Face Hub"

ignoreusernames — Tue, 08 Oct 2024 18:15:12 +0000

> Most Parquet files are bulk exports from various data analysis pipelines or databases, often appearing as full snapshots rather than incremental updates

I'm not really familiar with how datasets are managed by them, but all of the table formats (Iceberg, Delta and Hudi) support appending and some form of "merge-on-read" deletes that could help with this use case. Instead of always fully replacing datasets on each dump, more granular operations could be done. The issue is that this requires changing pipelines and some extra knowledge about the datasets themselves. A fun idea might involve taking a table format like Iceberg and, instead of using Parquet to store the data, just storing the column data with the metadata defined externally somewhere else. On each new snapshot, a set of transformations (sorting, splitting blocks, etc.) could be applied that minimizes the potential byte diff against the previous snapshot.

New comment by ignoreusernames in "Sail – Unify stream processing, batch processing and compute-intensive workloads"

ignoreusernames — Tue, 10 Sep 2024 11:18:23 +0000

From the announcement “As of now, we have mined 1,580 PySpark tests from the Spark codebase, among which 838 (53.0%) are successful on Sail. We have also mined 2,230 Spark SQL statements or expressions, among which 1,396 (62.6%) can be parsed by Sail”

Kinda early to call this a drop-in replacement with those numbers, no?

But, with enough parity this project could be a dream for anybody dealing with spark’s dreadful performance. Kudos to the team

New comment by ignoreusernames in "Portugal brings back tax breaks for foreigners in bid to woo digital nomads"

ignoreusernames — Sun, 07 Jul 2024 12:39:56 +0000

This is a fair argument that's often brought up but I never see actual raw data backing it up. Housing is fucked in several places around the world, including a bunch of countries in Europe that don't have any tax breaks for specialized labor. I would love to look at some metrics like

- How many units of housing are built each year

- How many units are rented and to what demographic (Portuguese families, immigrants sharing rooms, students, etc.)

- How many migrants (legal and illegal)

- How many specialized migrants each year and the % of them that eventually buy a home

- How many units are bought up by funds and other financial entities

- How much in taxes and social security contributions is collected per year from specialized migrants and how that money is reinvested

- etc

I know it's basically impossible to have an accurate picture since those numbers are way too "politically loaded". Politics and facts don't mix very well, so we just default to whoever yells the loudest (especially true in Portugal, unfortunately).

EDIT: Format bullet points

New comment by ignoreusernames in "The AWS S3 Denial of Wallet Amplification Attack"

ignoreusernames — Wed, 01 May 2024 22:52:14 +0000

Early Athena (prestodb managed by AWS) had a similar bug when measuring columnar file scans. If it touched the file, it counted the whole file instead of just the column chunks read. If I’m not mistaken, this was a bug in presto itself, but the fix was a simple patch that had landed upstream a long time before we did the tests. This was the first and only time we considered using a relatively early AWS product. It was so bad that our half-assed self-deployed version outperformed Athena on every metric that we cared about.

New comment by ignoreusernames in "Science fiction and the death of the sun"

ignoreusernames — Wed, 03 Apr 2024 10:03:02 +0000

Great series. If I'm not mistaken, there's an additional layer to the unreliable narrator part because the book is supposed to be a translation of that biography. So, when certain words are used, the reader knows that they don't necessarily represent the literal meaning and it's only an approximation for the actual thing in the book universe (for example, a "horse" is not actually a "horse" as we know it). It certainly helped me digest the more outlandish ideas.

New comment by ignoreusernames in "DeWitt and Stonebraker's "MapReduce: A major step backwards" (2009)"

ignoreusernames — Sat, 30 Mar 2024 18:58:26 +0000

100% agree. The MapReduce hype always seemed strange to me because it's basically the Volcano paper from the 90s, but with custom user-defined operators instead of the pre-baked ones of a more traditional engine. To make everything worse, Hadoop came along, ignoring every industry advance of the past 40 years with its "one tuple at a time" iterator-based model on a garbage-collected language. I realize it's very easy for me to say those things in hindsight, but it's not like vectorized execution was a weird obscure secret by the time these things came out.

On a side note, it finally looks like the industry is moving towards saner tools that implement a lot of the things this article says MapReduce was missing.
