<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: ignoreusernames</title><link>https://news.ycombinator.com/user?id=ignoreusernames</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 20 Jun 2026 11:49:53 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=ignoreusernames" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by ignoreusernames in "My LSM tree was slower than a B-tree. Then I profiled it"]]></title><description><![CDATA[
<p>Yeah, especially a Bloom filter, which has a pretty easy formula for its false positive rate.</p>
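<p>For reference, a minimal sketch of that formula, assuming the standard approximation p ≈ (1 - e^(-kn/m))^k for m bits, n inserted items and k hash functions. The function and parameter names are made up for illustration and don't come from the article:</p>

```python
import math


def bloom_false_positive_rate(m_bits: int, n_items: int, k_hashes: int) -> float:
    """Approximate false positive rate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


def optimal_hash_count(m_bits: int, n_items: int) -> int:
    """Hash count that minimizes the rate: (m/n) * ln 2, rounded, at least 1."""
    return max(1, round(m_bits / n_items * math.log(2)))


if __name__ == "__main__":
    # Hypothetical sizing: 10 bits per item.
    m, n = 10_000, 1_000
    k = optimal_hash_count(m, n)
    print(k, bloom_false_positive_rate(m, n, k))
```

<p>With 10 bits per item this picks 7 hash functions and lands a bit under 1% false positives, so a measured rate far from that would point at a sizing or hashing problem.</p>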
]]></description><pubDate>Thu, 18 Jun 2026 20:04:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=48590769</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=48590769</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48590769</guid></item><item><title><![CDATA[New comment by ignoreusernames in "-​-dangerously-skip-reading-code"]]></title><description><![CDATA[
<p>Don’t you think that the provider of the LLM is also a dimension in these discussions about responsibility? We often talk about the tech itself (LLM-driven development), but how we access it is just as important imo. It’s either locked behind a non-trivial amount of hardware (for open models) or behind some monopolistic provider like OpenAI or Anthropic. In the provider case, it’s not really the LLM that will “own” the code, it’s the provider itself, and we’ll be at the mercy of whatever pricing model they shove down our throats.</p>
]]></description><pubDate>Sat, 23 May 2026 22:58:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=48252445</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=48252445</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48252445</guid></item><item><title><![CDATA[New comment by ignoreusernames in "Nobody Reviews Compiler Output"]]></title><description><![CDATA[
<p>I think this argument only holds if you believe that LLMs are at a point where they can handle any combination of craziness that you throw at them.<p>My own experience working with agents is that there’s a “snowball of shit” effect: small mistakes that compound on each other. You can either<p>- review the code and try to prune some of the shit occasionally
- let the LLM handle everything<p>Given the current state of the industry, it’s very hard for me not to see option 2 as extremely irresponsible. Coding agents’ limits are not well defined, and unless you’re running an open-weight model locally (most people aren’t) you just gave up all control over your code to a third party. If running local models were the norm, the argument that LLMs are just another layer of abstraction would hold a little better. Reusing the compiler analogy from the post, it’s like depending on a compiler where you pay a monthly premium to compile your code. Those did exist a while ago with closed licenses, but I think the majority of deployed code nowadays is on open-ish platforms. This walled-garden development paradigm already lost once.</p>
]]></description><pubDate>Thu, 07 May 2026 20:56:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=48054861</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=48054861</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48054861</guid></item><item><title><![CDATA[New comment by ignoreusernames in "Async Rust never left the MVP state"]]></title><description><![CDATA[
<p>Can you elaborate on this, please? Do you mean that it’s basically impossible for Rust’s std to provide a default runtime that makes “everyone” (embedded on one end and web on the other) happy?</p>
]]></description><pubDate>Tue, 05 May 2026 12:42:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=48021709</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=48021709</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48021709</guid></item><item><title><![CDATA[New comment by ignoreusernames in "Async Rust never left the MVP state"]]></title><description><![CDATA[
<p>As of now I don’t think there’s an alternative. I’m not a Rust expert, but the core issue to me is that “async” goes beyond just having a futures scheduler. Async stuff usually needs network, disk, OS interaction and future utilities (spawn), and these are all things the runtime (tokio) provides. It’s pretty hard for runtimes to be compatible with each other unless the language itself provides those.</p>
]]></description><pubDate>Tue, 05 May 2026 11:39:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=48021093</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=48021093</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48021093</guid></item><item><title><![CDATA[New comment by ignoreusernames in "Async Rust never left the MVP state"]]></title><description><![CDATA[
<p>I may have missed something, but how does “sans-io” deal with CPU-heavy code? For example, what if there’s some heavy decoding/encoding required on the data? Does the event loop only drive the network side, with the heavy part done after the loop is finished?</p>
]]></description><pubDate>Tue, 05 May 2026 11:15:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=48020874</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=48020874</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48020874</guid></item><item><title><![CDATA[New comment by ignoreusernames in "Async Rust never left the MVP state"]]></title><description><![CDATA[
<p>Agree with the other commenters that the title is a bit too dramatic. The content was well written and got the point across.<p>I still don’t have enough experience to have a strong opinion on Rust async, but some things did stand out.<p>On the good side, it’s nice being able to have explicit runtimes. Instead of polluting the whole project to be async, you can do the opposite: be sync first and use the runtime on IO “edges”. This was a great fit for a project that I’m working on, and it seems like a pretty similar strategy to what Zig is doing with IO code. This largely solved the function coloring problem in this particular case. Strict separation of IO and CPU-bound code was a requirement regardless of the async stuff, so using the explicit IO runtime was natural.<p>On the bad side, it seems crazy to me how much the whole ecosystem depends on tokio. It’s almost as if Java’s GC were optional, but in practice everyone just used the same third-party GC runtime, and pulling in any library forced you to use that runtime. This sort of central dependency is simply not healthy.</p>
]]></description><pubDate>Tue, 05 May 2026 09:58:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=48020260</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=48020260</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48020260</guid></item><item><title><![CDATA[New comment by ignoreusernames in "Consistent hashing"]]></title><description><![CDATA[
<p>Another strategy to avoid redistribution is simply having a big enough number of partitions and assigning ranges instead of single partitions. It’s a bit more complex on the coordination side but works well in other domains (distributed processing, for example).</p>
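<p>A rough sketch of what I mean, under my own assumptions: the partition count, the SHA-256 based hashing and every function name below are made up for illustration. Keys map to a fixed partition forever; only ownership of contiguous partition ranges changes when a node joins.</p>

```python
import hashlib

NUM_PARTITIONS = 1024  # assumed: fixed up front, much larger than the node count


def partition_of(key: str) -> int:
    """Stable key -> partition mapping; independent of cluster membership."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def assign_ranges(nodes: list[str]) -> dict[str, range]:
    """Split [0, NUM_PARTITIONS) into contiguous, near-equal ranges, one per node."""
    base, extra = divmod(NUM_PARTITIONS, len(nodes))
    ranges: dict[str, range] = {}
    start = 0
    for i, node in enumerate(nodes):
        size = base + (1 if i < extra else 0)
        ranges[node] = range(start, start + size)
        start += size
    return ranges


def add_node(ranges: dict[str, range], new_node: str) -> dict[str, range]:
    """Hand the upper half of the largest range to new_node; others stay untouched."""
    donor = max(ranges, key=lambda n: len(ranges[n]))
    donor_range = ranges[donor]
    mid = donor_range.start + len(donor_range) // 2
    updated = dict(ranges)
    updated[donor] = range(donor_range.start, mid)
    updated[new_node] = range(mid, donor_range.stop)
    return updated


def owner_of(key: str, ranges: dict[str, range]) -> str:
    """Find the node whose range contains the key's partition."""
    p = partition_of(key)
    for node, r in ranges.items():
        if p in r:
            return node
    raise LookupError(f"partition {p} is not assigned to any node")
```

<p>The coordination cost shows up in add_node: something has to decide which range to split and tell everyone about the new boundaries, but the only keys that move are the ones in the range that was handed over.</p>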
]]></description><pubDate>Fri, 03 Oct 2025 15:07:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=45463839</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=45463839</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45463839</guid></item><item><title><![CDATA[New comment by ignoreusernames in "The two versions of Parquet"]]></title><description><![CDATA[
<p>> The reference implementation for Parquet is a gigantic Java library. I'm unconvinced this is a good idea.<p>I haven't thought much about it, but I believe the ideal reference implementation would be a highly optimized "service-like" process that you run alongside your engine, using Arrow to share zero-copy buffers between the engine and the Parquet service. Parquet predates Arrow by quite a few years, and Java was (unfortunately) the standard for big data stuff back then, so they simply stuck with it.<p>> The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bitwidth, endianness and value-length<p>I think they did this to avoid the dynamic dispatch nature of Java. In C++ or Rust something very similar would happen, but at the compiler level, which is a much saner way of doing this kind of thing.</p>
]]></description><pubDate>Mon, 25 Aug 2025 11:21:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=45012630</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=45012630</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45012630</guid></item><item><title><![CDATA[New comment by ignoreusernames in "Databricks in talks to acquire startup Neon for about $1B"]]></title><description><![CDATA[
<p>> Folks I know in the industry are not very happy with databricks<p>Yeah, big companies gobbling up everything does not lead to a healthy ecosystem. Congrats to the founders on the acquisition, but everyone else loses with moves like this.<p>I'm still sour after their Redash purchase, which instantly "killed" the open source version. The Tabular acquisition was also a bit controversial, since one of the founders is the PMC Chair for Iceberg, which "competes" directly with Databricks' own Delta Lake. The mere presence of these giants (mostly Databricks and Snowflake) makes the whole data ecosystem (both closed and open source) really hostile.</p>
]]></description><pubDate>Tue, 06 May 2025 13:53:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=43905151</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=43905151</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43905151</guid></item><item><title><![CDATA[New comment by ignoreusernames in "Anatomy of a SQL Engine"]]></title><description><![CDATA[
<p>I recommend that anyone who works with databases write a simple engine. It's a lot simpler than you may think and it's a great exercise. If using Python, sqlglot (<a href="https://github.com/tobymao/sqlglot">https://github.com/tobymao/sqlglot</a>) lets you skip all the parsing, and it even does some simple optimizations. From the parsed query tree it's pretty straightforward to build a logical plan and execute that. You can even use Python's builtin ast module to convert SQL expressions into Python ones (so no need for a custom interpreter!)</p>
]]></description><pubDate>Sun, 27 Apr 2025 10:33:53 +0000</pubDate><link>https://news.ycombinator.com/item?id=43810873</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=43810873</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43810873</guid></item><item><title><![CDATA[New comment by ignoreusernames in "ClickHouse gets lazier and faster: Introducing lazy materialization"]]></title><description><![CDATA[
<p>Same thing with columnar/vectorized execution. It has been known for a long time that it's the "correct" way to process data for OLAP workflows, but it only became "mainstream" in the last few years (mostly due to Arrow).<p>It's awesome that ClickHouse is adopting it now, but a shame that it's not standard in everything that does analytics processing.</p>
]]></description><pubDate>Wed, 23 Apr 2025 13:25:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=43771894</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=43771894</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43771894</guid></item><item><title><![CDATA[New comment by ignoreusernames in "Apache DataFusion"]]></title><description><![CDATA[
<p>> Spark is "in-memory" in the sense that it isn't forced to spill results to disk between operations<p>I see your point, but that's only true within a single stage. Any operator that requires partitioning (groupBys and joins, for example) requires writing to disk<p>> [...] which used to be a point of comparison to MapReduce specifically.<p>So each mapper in Hadoop wrote partial results to disk? LOL, this was way worse than I remembered. It's been a long time since I've dealt with Hadoop<p>> Not ground-breaking nowadays but when I was doing this stuff 10+ years<p>I would say that it wouldn't have been groundbreaking 20 years ago. I feel like Hadoop's influence held up our entire field for years. Most of the stuff that Arrow made mainstream, and that is being used by a bunch of engines mentioned in this thread, has been known for a long time. It's like, as a community, we had blindfolds on. Sorry about the rant, but I'm glad the Hadoop fog is finally dissipating</p>
]]></description><pubDate>Thu, 16 Jan 2025 10:57:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=42723787</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=42723787</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42723787</guid></item><item><title><![CDATA[New comment by ignoreusernames in "Apache DataFusion"]]></title><description><![CDATA[
<p>Just out of curiosity, why do you say that Spark is "in-memory"? I see a lot of people claiming that, including several that I've interviewed in the past few years, but that's not very accurate (at least in the default case). Spark SQL execution uses a bog-standard volcano-ish iterator model (with a pretty shitty codegen operator merging part) built on top of their RDD engine. The exchange (shuffle) is disk based by default (both for SQL queries and lower-level RDD code), so unless you mount the shuffle directory on a ramdisk, I would say that Spark is disk based. You can try it out in the Spark shell:<p><pre><code>  spark.sql("SELECT explode(sequence(0, 10000))").write.parquet("sample_data")
  spark.read.parquet("sample_data").groupBy($"col").count().count()
</code></pre>
After running the code, you should see a /tmp/blockmgr-{uuid} directory that holds the exchange data.</p>
]]></description><pubDate>Thu, 16 Jan 2025 09:56:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=42723399</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=42723399</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42723399</guid></item><item><title><![CDATA[New comment by ignoreusernames in "Improving Parquet Dedupe on Hugging Face Hub"]]></title><description><![CDATA[
<p>> Most Parquet files are bulk exports from various data analysis pipelines or databases, often appearing as full snapshots rather than incremental updates<p>I'm not really familiar with how datasets are managed by them, but all of the table formats (Iceberg, Delta and Hudi) support appending and some form of "merge-on-read" deletes that could help with this use case. Instead of always fully replacing datasets on each dump, more granular operations could be done. The issue is that this requires changing pipelines and some extra knowledge about the datasets themselves.
A fun idea might involve taking a table format like Iceberg and, instead of using Parquet to store the data, just storing the column data with the metadata defined externally somewhere else. On each new snapshot, a set of transformations (sorting, splitting blocks, etc.) could be applied to minimize the potential byte diff from the previous snapshot.</p>
]]></description><pubDate>Tue, 08 Oct 2024 18:15:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=41780177</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=41780177</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41780177</guid></item><item><title><![CDATA[New comment by ignoreusernames in "Sail – Unify stream processing, batch processing and compute-intensive workloads"]]></title><description><![CDATA[
<p>From the announcement:
“As of now, we have mined 1,580 PySpark tests from the Spark codebase, among which 838 (53.0%) are successful on Sail. We have also mined 2,230 Spark SQL statements or expressions, among which 1,396 (62.6%) can be parsed by Sail”<p>Kinda early to call this a drop-in replacement with those numbers, no?<p>But with enough parity, this project could be a dream for anybody dealing with Spark’s dreadful performance. Kudos to the team</p>
]]></description><pubDate>Tue, 10 Sep 2024 11:18:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=41499602</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=41499602</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41499602</guid></item><item><title><![CDATA[New comment by ignoreusernames in "Portugal brings back tax breaks for foreigners in bid to woo digital nomads"]]></title><description><![CDATA[
<p>This is a fair argument that's often brought up, but I never see actual raw data backing it up. Housing is fucked in several places around the world, including a bunch of countries in Europe that don't have any tax breaks for specialized labor. I would love to look at some metrics like<p>- How many units of housing are built each year<p>- How many units are rented and to what demographic (Portuguese families, immigrants sharing rooms, students, etc)<p>- How many migrants arrive (legal and illegal)<p>- How many specialized migrants arrive each year and the % of them that eventually buy a home<p>- How many units are bought up by funds and other financial entities<p>- How much in taxes and social security contributions is collected per year from specialized migrants and how that money is reinvested<p>- etc<p>I know that it's basically impossible to get an accurate picture since those numbers are way too "politically loaded".
Politics and facts don't mix very well, so we just default to whoever yells the loudest (especially true in Portugal, unfortunately)<p>EDIT: Format bullet points</p>
]]></description><pubDate>Sun, 07 Jul 2024 12:39:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=40897219</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=40897219</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40897219</guid></item><item><title><![CDATA[New comment by ignoreusernames in "The AWS S3 Denial of Wallet Amplification Attack"]]></title><description><![CDATA[
<p>Early Athena (PrestoDB managed by AWS) had a similar bug when measuring columnar file scans. If it touched the file, it counted the whole file instead of just the column chunks read. If I’m not mistaken, this was a bug in Presto itself, but the fix was a simple patch that had landed upstream a long time before we did the tests. This was the first and only time we considered using a relatively early AWS product. It was so bad that our half-assed self-deployed version outperformed Athena on every metric that we cared about</p>
]]></description><pubDate>Wed, 01 May 2024 22:52:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=40230612</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=40230612</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40230612</guid></item><item><title><![CDATA[New comment by ignoreusernames in "Science fiction and the death of the sun"]]></title><description><![CDATA[
<p>Great series. If I'm not mistaken, there's an additional layer to the unreliable narrator part because the book is supposed to be a translation of that biography. So, when certain words are used, the reader knows that they don't necessarily represent the literal meaning and it's only an approximation for the actual thing in the book universe (for example, a "horse" is not actually a "horse" as we know it). It certainly helped me digest the more outlandish ideas.</p>
]]></description><pubDate>Wed, 03 Apr 2024 10:03:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=39915430</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=39915430</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39915430</guid></item><item><title><![CDATA[New comment by ignoreusernames in "DeWitt and Stonebraker's "MapReduce: A major step backwards" (2009)"]]></title><description><![CDATA[
<p>100% agree. The MapReduce hype always seemed strange to me because it's basically the Volcano paper from the 90s, but with custom user-defined operators instead of the pre-baked ones in a more traditional engine.
To make everything worse, Hadoop came along, ignoring every industry advance of the past 40 years with its "one tuple at a time" iterator-based model on a garbage-collected language. I realize it's very easy for me to say those things in hindsight, but it's not like vectorized execution was a weird obscure secret by the time these things came out.<p>On a side note, it finally looks like the industry is moving towards saner tools that implement a lot of the things that this article mentions MapReduce was missing</p>
]]></description><pubDate>Sat, 30 Mar 2024 18:58:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=39877595</link><dc:creator>ignoreusernames</dc:creator><comments>https://news.ycombinator.com/item?id=39877595</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39877595</guid></item></channel></rss>