Hacker News: mxmlnkn

New comment by mxmlnkn in "GitHub is investigating unauthorized access to their internal repositories"

mxmlnkn — Wed, 20 May 2026 10:25:46 +0000

Why not simply have both? This does not have to be an either-or decision. Have a default repository with vetted extensions, but leave the option to install from other sources open.

New comment by mxmlnkn in "Make ZIP files smaller with ZIP Shrinker"

mxmlnkn — Tue, 19 May 2026 13:07:36 +0000

For seekable gzip indexes in zip, there is SOZip: https://github.com/sozip/sozip-spec . However, it stores the indexes as files succeeding the actual file entry. To hide these index files and avoid extraction, they are not listed in the central directory. A linear scan of the local headers, which some misbehaving ZIP tools do and which might be necessary for recovering broken ZIP files, would still find those hidden indexes.
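
A minimal sketch of the difference between the two listing strategies, using only the Python standard library. This is not SOZip itself: the entry names are made up, and the hidden entry is placed before the regular one (SOZip places it after) only because that is the easiest layout to build by concatenation.

```python
import io
import struct
import zipfile

# Fixed 30-byte part of a ZIP local file header.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def make_zip(name: str, payload: bytes) -> bytes:
    """Create a one-entry, uncompressed ZIP archive in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        archive.writestr(name, payload)
    return buffer.getvalue()


def scan_local_headers(data: bytes) -> list[str]:
    """List entry names by walking the local headers from offset 0."""
    names, offset = [], 0
    while data[offset:offset + 4] == b"PK\x03\x04":
        fields = LOCAL_HEADER.unpack_from(data, offset)
        compressed_size, name_length, extra_length = fields[7], fields[9], fields[10]
        start = offset + LOCAL_HEADER.size
        names.append(data[start:start + name_length].decode())
        offset = start + name_length + extra_length + compressed_size
    return names


# Hypothetical hidden entry: keep only its local header and data, then append
# a normal archive whose central directory knows nothing about it.
hidden = make_zip("data.bin.index", b"seek points")
hidden_entry = hidden[:hidden.index(b"PK\x01\x02")]
combined = hidden_entry + make_zip("data.bin", b"payload")

with zipfile.ZipFile(io.BytesIO(combined)) as archive:
    listed = archive.namelist()       # central directory view
scanned = scan_local_headers(combined)  # linear scan view
print(listed, scanned)
```

The central directory view only reports the regular entry, while the linear scan also reports the hidden one.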

New comment by mxmlnkn in "Lessons for Agentic Coding: What should we do when code is cheap?"

mxmlnkn — Tue, 05 May 2026 09:20:22 +0000

> 10. Code is cheap, but maintenance, support, and security aren’t.

I also keep circling around this point. So many software repositories in the AI space seem to follow a publish-and-forget pattern. If you can simply show that you have the patience to maintain a project, ideally with manual intervention instead of a fully autonomous AI, then you already have an outstanding project.

New comment by mxmlnkn in "Tar Files Created on macOS Display Errors When Extracting on Linux (2024)"

mxmlnkn — Mon, 04 May 2026 09:49:48 +0000

The title seems misleading.

These are not errors. They are simply warnings about extended attributes being ignored when extracting files, which seems completely fine to me. Creating the tar without those extended attributes has exactly the same outcome; it just throws away the metadata at archive time instead of extraction time.

Furthermore, this is not an Apple/macOS issue. The tool used is bsdtar, so it would also affect all BSD-variants that default to bsdtar/libarchive, and those systems also have extended attributes, e.g., for SELinux, which would get added to the TAR.
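
A rough illustration of why the warnings are harmless, using Python's tarfile module rather than bsdtar or GNU tar. The pax keyword below only mimics the style of record that libarchive writes for extended attributes; the attribute name and value are made up.

```python
import io
import tarfile

# Hypothetical xattr record; the keyword and value are made up for illustration.
XATTR_HEADERS = {"LIBARCHIVE.xattr.com.example.note": "aGVsbG8="}


def make_tar(pax_headers: dict[str, str]) -> bytes:
    """Create an in-memory pax tar holding a single file, hello.txt."""
    payload = b"hello\n"
    info = tarfile.TarInfo("hello.txt")
    info.size = len(payload)
    info.pax_headers = dict(pax_headers)
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.PAX_FORMAT) as archive:
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_file(data: bytes) -> tuple[bytes, dict[str, str]]:
    """Return the file contents and the pax headers stored alongside it."""
    with tarfile.open(fileobj=io.BytesIO(data)) as archive:
        member = archive.getmember("hello.txt")
        return archive.extractfile(member).read(), dict(member.pax_headers)


with_xattr = read_file(make_tar(XATTR_HEADERS))
without_xattr = read_file(make_tar({}))
print(with_xattr, without_xattr)
```

The file contents are identical in both archives. Only the extra metadata record differs, and a reader that does not understand it can skip it.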

New comment by mxmlnkn in "Mounting tar archives as a filesystem in WebAssembly"

mxmlnkn — Fri, 24 Apr 2026 21:24:29 +0000

https://github.com/martinellimarco/indexed_zstd

https://github.com/martinellimarco/libzstd-seek

Note, however, that this can only seek to frames, and zstd still only creates files containing a single frame by default. pzstd did create multi-frame files, but it is not being developed anymore. Other alternatives for creating seekable zstd files are: zeekstd, t2sz, and zstd-seekable-format-go.

New comment by mxmlnkn in "Mounting tar archives as a filesystem in WebAssembly"

mxmlnkn — Fri, 24 Apr 2026 16:07:18 +0000

> Apparently the internal state is only 32kB

Exactly. And often this state is either highly compressible or non-compressible but only sparsely used. The latter can then be made compressible by replacing the unused bytes with zeros.

Ratarmount uses indexed_gzip, and when parallelization makes sense, it also uses rapidgzip. Rapidgzip implements the sparsity analysis to increase compressibility and then simply uses the gztool index format, i.e., compresses each 32 KiB using gzip itself, with unused bytes replaced with zeros where possible.
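
A toy version of the zeroing trick, not rapidgzip's actual implementation. The window is random data standing in for incompressible state, the "used" ranges are made up, and zlib.compress stands in for the gzip compression of the gztool index format.

```python
import random
import zlib

WINDOW_SIZE = 32 * 1024  # deflate back-reference window


def zero_unused(window: bytes, used_ranges: list[tuple[int, int]]) -> bytes:
    """Keep only the referenced byte ranges of a window and zero the rest."""
    sparse = bytearray(len(window))
    for start, stop in used_ranges:
        sparse[start:stop] = window[start:stop]
    return bytes(sparse)


# Incompressible stand-in for a window, plus made-up ranges that later
# back-references are assumed to touch.
window = random.Random(0).randbytes(WINDOW_SIZE)
used_ranges = [(100, 400), (9000, 9200), (30000, 30100)]

dense_size = len(zlib.compress(window))
sparse_size = len(zlib.compress(zero_unused(window, used_ranges)))
print(dense_size, sparse_size)
```

The dense window does not shrink at all, while the zeroed window compresses to roughly the size of the bytes that are still needed.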

indexed_gzip, gztool, and rapidgzip all support seeking in gzip streams, but all have some trade-offs, e.g., rapidgzip is parallelized but will have much higher memory usage because of that than indexed_gzip or gztool. It might be possible to compile either of these to WebAssembly if there is demand.

New comment by mxmlnkn in "Axios compromised on NPM – Malicious versions drop remote access trojan"

mxmlnkn — Tue, 31 Mar 2026 08:52:53 +0000

I like the idea of bubblewrap, but my pain point is that it is work to set it up correctly with bind mounts and forwarding necessary environment variables to make the program actually work usefully. Could you share your pip bwrap configuration? It sounds useful.

New comment by mxmlnkn in "What if AI doesn't need more RAM but better math?"

mxmlnkn — Sun, 29 Mar 2026 17:19:10 +0000

> The obvious one outside of KV caches as mentioned above is vector databases. Any RAG pipeline that stores embedding vectors for retrieval benefits from the same compression. TurboQuant reduces indexing time to “virtually zero” on vector search tasks and outperforms product quantisation and RabbiQ on recall benchmarks using GloVe vectors.

This part sounds especially cool. I did not think about this application when reading the other articles about TurboQuant. It would be cool to have access to this performance optimization for local RAG.

New comment by mxmlnkn in "The bot situation on the internet is worse than you could imagine"

mxmlnkn — Sun, 29 Mar 2026 17:15:20 +0000

After 2 minutes at 150 kHashes on mobile, I finally see the first pixel of the progress bar filling up. Seems like it will take hours or a day to finish. Some estimate would have been nice.

New comment by mxmlnkn in "Full Unicode Search at 50× ICU Speed with AVX‑512"

mxmlnkn — Tue, 16 Dec 2025 13:50:58 +0000

I never understood why the recommended replacement for ß is ss. It is a ligature of sz (similar to & being a ligature of et) and is even pronounced ess-zet. The only logical replacement would have been sz, and it would have avoided the clash of Masse (mass) and Maße (measurements). Then again, it only affects whether the vowel before it is pronounced short or long, and there are better ways to encode that in written language in the first place.

An Update on Pytype

mxmlnkn — Wed, 20 Aug 2025 17:04:51 +0000

Article URL: https://github.com/google/pytype

Comments URL: https://news.ycombinator.com/item?id=44963724

Points: 199

# Comments: 66

New comment by mxmlnkn in "The beauty of a text only webpage"

mxmlnkn — Fri, 15 Aug 2025 18:28:28 +0000

On the other hand, fonts can be an expression of your personality. Shouldn't it be preferable to centrally enable overriding fonts instead of forcing every site designer not to use custom fonts to express themselves? Theoretically, it is easier to remove formatting than it is to add it. Therefore, this functionality should be part of the browser, not the website. Firefox has this as an option: "Allow pages to choose their own fonts, instead of your selections above".

Personally, I quite like the site's design and its font. My gripe often is light gray text on a darker gray background. The bad readability that so many newer sites seem to prefer makes me question my eyes or my monitor capabilities. Reader mode in Firefox is also often very helpful.

New comment by mxmlnkn in "I prefer human-readable file formats"

mxmlnkn — Sat, 09 Aug 2025 13:56:09 +0000

I can confirm usual compression ratios of 10-20 for JSON. For example, wikidata-20220103.json.gz is quite fun to work with. It is 109 GB, which decompresses to 1.4 TB, and even the non-compressed index for random access with indexed_gzip is 11 GiB. The compressed random access index format, which gztool supports, would be 1.4 GB (compression ratio 8). And rapidgzip even supports the compressed gztool format with further file size reduction by doing a sparsity analysis of required seek point data and setting all unnecessary bytes to 0 to increase compressibility. The resulting index is only 536 MiB.

The trick for the mix of JSON with binary is a good reminder. That's how the ASAR file archive format works. That could indeed be usable for what I was working on: a new file format for random seek indexes. Although the gztool index format seems to suffice for now.
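
A sketch of the general idea: a length-prefixed JSON index followed by raw binary data. This is only loosely inspired by ASAR and does not reproduce its actual on-disk layout; the function names and the index fields are made up.

```python
import io
import json
import struct
from typing import BinaryIO


def pack(blobs: dict[str, bytes]) -> bytes:
    """Write a length-prefixed JSON index followed by the raw blobs."""
    index, offset = {}, 0
    for name, blob in blobs.items():
        index[name] = {"offset": offset, "size": len(blob)}
        offset += len(blob)
    header = json.dumps(index).encode()
    return struct.pack("<I", len(header)) + header + b"".join(blobs.values())


def read_blob(stream: BinaryIO, name: str) -> bytes:
    """Read one blob by seeking past everything else."""
    stream.seek(0)
    (header_size,) = struct.unpack("<I", stream.read(4))
    entry = json.loads(stream.read(header_size))[name]
    stream.seek(4 + header_size + entry["offset"])
    return stream.read(entry["size"])


container = pack({"a.png": b"\x89PNG\x00\xff", "b.bin": bytes(range(256))})
print(read_blob(io.BytesIO(container), "b.bin") == bytes(range(256)))
```

The metadata stays human-readable, the binary data needs no base64, and a reader can jump straight to the blob it wants.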

New comment by mxmlnkn in "I prefer human-readable file formats"

mxmlnkn — Sat, 09 Aug 2025 12:19:30 +0000

I concur with most of these arguments, especially about longevity. But this only applies to smallish files like configurations, because I don't agree with the last paragraph regarding efficiency.

I have had to work with large 1GB+ JSON files, and it is not fun. Amazing projects exist, such as jsoncons for streaming JSON and simdjson for parsing JSON with SIMD, but as far as I know, the latter still does not support streaming and even has an open issue for files larger than 4 GiB. So you cannot have streaming for memory efficiency and SIMD-parsing for computational efficiency at the same time. You want streaming because holding the whole JSON in memory is wasteful and sometimes not even possible. JSONL tries to change the format to fix that, but now you have another format that you need to support.
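
For illustration, this is roughly what JSONL buys you: each line is a complete JSON value, so only one record needs to be in memory at a time. A minimal sketch with the standard json module and made-up records:

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


# Stand-in for a large file opened with open(path, encoding="utf-8").
records = io.StringIO('{"id": 1, "size": 10}\n{"id": 2, "size": 32}\n')
total = sum(record["size"] for record in iter_jsonl(records))
print(total)
```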

I was also contemplating the mentioned formats for another project, but they are hardly usable when you need to store binary data, such as images, compressed data, or simply arbitrary data. Storing binary data as base64 strings seems wasteful. Random access into these files is also an issue, depending on the use case. Sometimes it would be a nice feature to jump over some data, but for JSON, you cannot do that without parsing everything in search of the closing bracket or quotes, accounting for escaped brackets and quotes, and nesting.
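
The base64 overhead is easy to quantify: every 3 input bytes become 4 output characters. A small check with random stand-in data and a made-up key name:

```python
import base64
import json
import os

blob = os.urandom(3 * 1024)  # stand-in for an image or compressed data
encoded = base64.b64encode(blob)
document = json.dumps({"image": encoded.decode("ascii")})

overhead = len(encoded) / len(blob)
print(len(blob), len(encoded), overhead)
```

That is a fixed cost of about a third on top of the binary data, before any JSON quoting, and the encoded string still has to be scanned to find its end.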

New comment by mxmlnkn in "I want everything local – Building my offline AI workspace"

mxmlnkn — Fri, 08 Aug 2025 22:49:53 +0000

This resonates. I have finally started looking into local inference a bit more recently.

I have tried Cursor a bit, and whatever it used worked somewhat alright to generate a starting point for a feature and for a large refactor, and to break through writer's block. It was fun to see it behave similarly to my workflow by creating step-by-step plans before doing work, then searching for functions to look for locations and change stuff. I feel like one could learn structured thinking approaches from looking at these agentic AI logs. There were lots of issues with both of these tasks, though, e.g., many missed locations for the refactor and spuriously deleted or indented code, but it was a starting point and somewhat workable with git. The refactoring usage caused me to reach free token limits in only two days. Based on the usage, it used millions of tokens in minutes, only rarely less than 100K tokens per request, and therefore probably needs a similarly large context length for best performance.

I wanted to replicate this with VSCodium and Cline or Continue because I want to use it without exfiltrating all my data to megacorps as payment, use it to work on non-open-source projects, and maybe even use it offline. Having Cursor start indexing everything, including possibly private data, in the project folder as soon as it starts left a bad taste, as useful as it is. But I quickly ran into context length problems with Cline, and Continue does not seem to work very well. Some models did not work at all, and DeepSeek was thinking for hours in loops (default temperature too high, should supposedly be <0.5). And even after getting tool use to work somewhat with qwen qwq 32B Q4, it feels like it does not have a full view of the codebase, even though it has been indexed. For one refactor request mentioning names from the project, it started by doing useless web searches. It might also be a context length issue. But larger contexts really eat up memory.

I am also contemplating a new system for local AI, but it is really hard to decide. You have the choice between fast GPU inference, e.g., an RTX 5090 if you have money, or 1-2 used RTX 3090s, or slow but qualitatively better CPU / unified memory integrated GPU inference with systems such as the DGX Spark, the Framework Desktop AMD Ryzen AI Max, or the Mac Pro systems. Neither is ideal (or cheap). However, my problems with context length and low-performing agentic models suggest that slower but more helpful models on a large unified memory are the better fit for my use case, which would mostly be agentic coding. Code completion does not seem to fit me because I find it distracting, and I don't require much boilerplating.

It also feels like the GPU is wasted, and local inference might be a red herring altogether. Given that a batch size of 1 is one of the worst cases for GPU computation and that the GPU would only be used in bursts, any cloud solution will easily be an order of magnitude or two more efficient, if I understand this correctly. Maybe local inference will therefore never fully take off, barring even more specialized hardware or hard requirements on privacy, e.g., for companies. To solve that, it would take something like computing on encrypted data, which seems impossible.

Then again, if a batch size of 1 is indeed as bad as I think it is, then why not simply generate a batch of results in parallel and choose the best of the answers? Maybe this is not a thing because it would increase memory usage even more.

New comment by mxmlnkn in "Ditching GitHub (2024)"

mxmlnkn — Thu, 07 Aug 2025 17:31:55 +0000

Yes. It can also be shown in the Activity tab or accessed and scanned for secrets or personal information via the API. See https://news.ycombinator.com/item?id=44452623

New comment by mxmlnkn in "I dumped Google for Kagi"

mxmlnkn — Tue, 05 Aug 2025 16:03:38 +0000

AI-overview was the straw that broke the camel's back for me recently. But I also suffered from dark mode issues for a long time. On almost every visit, it shows the outer background dark but the smaller search results background as white, and the search result text is still in light mode, ergo, it is not readable. After refreshing, it works, but this user experience is untenable for a trillion-dollar company. I changed to Startpage.com, though.

New comment by mxmlnkn in "Enough AI copilots, we need AI HUDs"

mxmlnkn — Mon, 28 Jul 2025 13:14:02 +0000

The anime Yukikaze (2002 - 2005) has some similar themes. It's about a fighter jet pilot using a new AI-supported jet to fight against aliens. It asserts that the combination of human intuition and artificial intelligence trumps either of the two on its own. If I remember correctly, the jet can pilot on its own, but when it becomes dangerous, the human pilot only uses the AI hints instead of letting it autopilot.

New comment by mxmlnkn in "Fun with gzip bombs and email clients"

mxmlnkn — Wed, 23 Jul 2025 00:08:30 +0000

> I'm sure there are zip-bomb equivalents in binary formats like .xlsx, PDF, .docx, etc.

Yes. Both docx and xlsx are literally just a zip of XML files with a different extension. PDF can contain zlib streams, which use deflate compression just like gzip, so all the mentioned methods apply to all three formats.
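
A sketch of how such a file looks from the outside, using Python's zipfile module. The archive below is a stand-in, not a valid .docx: it holds a single member, named like a typical document part, filled with 10 MiB of spaces.

```python
import io
import zipfile


def expansion_ratios(data: bytes) -> dict[str, float]:
    """Map each member to its declared uncompressed / compressed size ratio."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {
            info.filename: info.file_size / max(info.compress_size, 1)
            for info in archive.infolist()
        }


# Stand-in for a .docx: one XML member that is 10 MiB of spaces.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("word/document.xml", b" " * (10 * 1024 * 1024))

ratios = expansion_ratios(buffer.getvalue())
print(ratios)
```

The ratio comes from sizes declared in the archive metadata, so it is only a first hint. A robust reader would also cap the number of bytes it is willing to produce while decompressing.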

New comment by mxmlnkn in "Marathon fusion claims to invent alchemy, making 5000 kgs gold per gigawatt"

mxmlnkn — Sat, 19 Jul 2025 14:47:47 +0000

Yes, that section is fitting and interesting. It is the production-side view. I think I was more motivated by the comments envisioning an abundance of cheap gold, which does not seem anywhere near, or even possible, even with this approach, as cool and baffling as it is.

I don't think that it is of much use as waste disposal because again, it can only remove 10%, i.e., an insignificant amount. If it were even mined because of this, then more mercury waste would be produced than before, but increased mining would probably be many decades or centuries in the future, as long as there is still waste to reuse.

So, how long would the current midterm stockpile of 1400 t of 198Hg for the next 10 years last? At 5 t per 1 GW per year, i.e., 5 t per 8.76 TWh, and a current global electricity generation of ~30 PWh, replacing all energy production with fusion would correspond to about 3400 GW of continuous output and could transmute roughly 17,000 t of 198Hg per year, about twelve times the stockpile. Of course, there would be a myriad of other bottlenecks long before that, but consuming all the existing stockpile seems feasible in human time spans.
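
A quick check of this estimate, using only the figures quoted above and treating the ~30 PWh as annual generation. The intermediate count of gigawatt-years comes out near 3400, and multiplying it by the 5 t yield gives the transmuted mass.

```python
HOURS_PER_YEAR = 8760
YIELD_T_PER_GW_YEAR = 5.0          # claimed: 5 t per GW per year
GLOBAL_GENERATION_TWH = 30_000.0   # ~30 PWh per year
STOCKPILE_T = 1400.0               # quoted 198Hg stockpile

twh_per_gw_year = HOURS_PER_YEAR / 1000  # 1 GW for one year = 8.76 TWh
gw_years = GLOBAL_GENERATION_TWH / twh_per_gw_year
transmuted_t = gw_years * YIELD_T_PER_GW_YEAR
stockpile_multiple = transmuted_t / STOCKPILE_T

print(round(gw_years), round(transmuted_t), round(stockpile_multiple, 1))
# 3425 17123 12.2
```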

I am honestly impressed by the amount of transmutation that is possible with fusion. And it is a lucky coincidence that the half-life is only dozens of hours for the middle product. I never thought of that process or would have guessed grams of production instead of tons, probably because of the association with existing particle accelerators. It is quite amazing, but also presumably still decades off into the future.