<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: popinman322</title><link>https://news.ycombinator.com/user?id=popinman322</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 08 Apr 2026 10:57:46 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=popinman322" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by popinman322 in "Google releases Gemma 4 open models"]]></title><description><![CDATA[
<p>Does anyone know whether we'll be receiving transcoders for this batch of models? We got them for Gemma 3, but maybe that was a one-off.</p>
]]></description><pubDate>Thu, 02 Apr 2026 20:28:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=47619744</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=47619744</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47619744</guid></item><item><title><![CDATA[New comment by popinman322 in "Show HN: Ghidra MCP Server – 110 tools for AI-assisted reverse engineering"]]></title><description><![CDATA[
<p>I've found that Gemini models often produce pseudocode that seems good at first glance but is typically wrong or incomplete, especially for larger or more complex functions. It might produce pseudocode for 70% of the function, then silently drop the last 30%. Or it might elide the inside of switch blocks or if statements, leaving only a comment explaining what should happen.<p>By contrast, Claude Opus generally outputs actual code that includes more of the original functionality. Even Qwen3-30B-A3B performs better than Gemini, in my experience.<p>It's honestly really frustrating. The huge context size available with Gemini makes the model family seem like a boon for this task; PCode is very verbose, so it eats into the headroom needed for the model's response.</p>
]]></description><pubDate>Wed, 04 Feb 2026 18:35:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=46889729</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=46889729</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46889729</guid></item><item><title><![CDATA[New comment by popinman322 in "Auto-grading decade-old Hacker News discussions with hindsight"]]></title><description><![CDATA[
<p>It doesn't look like the code anonymizes usernames when sending the thread for grading. This likely biases the grades toward past and current prevailing opinions about certain users. It would be interesting to see the whole thing re-run with usernames randomly reassigned, to measure the bias, and again with procedurally generated pseudonyms, to see whether the bias can be removed that way.<p>I'd expect de-biasing to deflate grades for well-known users.<p>It might also be interesting to use a search-grounded model that provides citations for its grading claims. Gemini models have access to this via their API, for example.</p>
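<p>For illustration, a minimal sketch of the pseudonym pass. The thread shape and function name here are my own assumptions, not taken from the linked code:</p>
<pre><code>import random

# Hypothetical thread shape: a list of {"user": ..., "text": ...} dicts.
# Map each real username to a stable pseudonym so reply chains still line up.
def anonymize_thread(comments, seed=0):
    rng = random.Random(seed)
    users = sorted({c["user"] for c in comments})
    rng.shuffle(users)
    pseudonyms = {u: f"user_{i}" for i, u in enumerate(users)}
    # Caveat: this doesn't scrub usernames mentioned inside comment text,
    # which a real de-biasing pass would also need to replace.
    return [{"user": pseudonyms[c["user"]], "text": c["text"]} for c in comments]
</code></pre>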
]]></description><pubDate>Thu, 11 Dec 2025 04:43:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=46227755</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=46227755</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46227755</guid></item><item><title><![CDATA[New comment by popinman322 in "Mistral 3 family of models released"]]></title><description><![CDATA[
<p>They're comparing against open-weights models that are roughly a month away from the frontier. There's likely an implicit open-weights political stance here.<p>There are also plenty of reasons not to use proprietary US models for comparison:
the major US models haven't been living up to their benchmarks; their releases rarely include training & architectural details; they're not terribly cost-effective; they often fail to compare with non-US models; and the performance delta between model releases has plateaued.<p>A decent number of users in r/LocalLlama have reported switching back from Opus 4.5 to Sonnet 4.5 because Opus's real-world performance was worse. From my vantage point, it seems like trust in OpenAI, Anthropic, and Google is waning, and this lack of comparison is another symptom.</p>
]]></description><pubDate>Tue, 02 Dec 2025 16:10:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=46122669</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=46122669</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46122669</guid></item><item><title><![CDATA[New comment by popinman322 in "The Llama 4 herd"]]></title><description><![CDATA[
<p>You can swap experts in and out of VRAM; it just increases inference time substantially.<p>Depending on the routing function, you can figure out all the active experts ahead of the forward pass for a single token and pipeline the expert loading (rough sketch below).</p>
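<p>A rough sketch of that pipelining idea in PyTorch. The names are mine, and it assumes the expert weights live in pinned CPU memory, which is what makes the copy truly asynchronous:</p>
<pre><code>import torch

# Read the router's top-k choices for the next token, then start copying those
# experts' weights to the GPU on a side stream so the transfer overlaps with
# the rest of the forward pass.
copy_stream = torch.cuda.Stream()

def prefetch_experts(router, hidden, cpu_expert_weights, k=2):
    logits = router(hidden)                                  # [1, n_experts]
    top = torch.topk(logits, k, dim=-1).indices.flatten().tolist()
    with torch.cuda.stream(copy_stream):
        gpu_weights = {i: cpu_expert_weights[i].to("cuda", non_blocking=True)
                       for i in top}
    return top, gpu_weights

# Before the expert layers consume the weights, wait for the copies:
#   torch.cuda.current_stream().wait_stream(copy_stream)
</code></pre>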
]]></description><pubDate>Sat, 05 Apr 2025 23:58:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=43597835</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=43597835</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43597835</guid></item><item><title><![CDATA[New comment by popinman322 in "The young, inexperienced engineers aiding DOGE"]]></title><description><![CDATA[
<p>The executive branch is currently ignoring the law. Why would they start following it in 2029?</p>
]]></description><pubDate>Tue, 04 Feb 2025 00:41:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=42925700</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=42925700</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42925700</guid></item><item><title><![CDATA[New comment by popinman322 in "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL"]]></title><description><![CDATA[
<p>Not a fan of censorship here, but Chinese models are (subjectively) less propagandized than US models. If you ask US models about China, for instance, they'll tend toward the antagonistic perspective favored by US media. Chinese models typically seem to take a more moderate, considered tone when discussing similar subjects. US models also suffer from safety-based censorship, which is especially blatant when "safety" involves protection of corporate resources (e.g. not helping the user to download YouTube videos).</p>
]]></description><pubDate>Sat, 25 Jan 2025 22:20:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=42825490</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=42825490</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42825490</guid></item><item><title><![CDATA[New comment by popinman322 in "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL"]]></title><description><![CDATA[
<p>Assuming you're doing local inference, have you tried setting a token filter on the model?</p>
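<p>For example, with Hugging Face transformers you can ban token sequences at decode time via <i>bad_words_ids</i>. A minimal sketch, where the model id and the filtered phrases are placeholders:</p>
<pre><code>from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("your/local-model")      # placeholder id
model = AutoModelForCausalLM.from_pretrained("your/local-model")

# Token sequences the sampler is never allowed to emit.
banned = [tok(phrase, add_special_tokens=False).input_ids
          for phrase in ["phrase to suppress", "another one"]]

inputs = tok("Your prompt here", return_tensors="pt")
out = model.generate(**inputs, bad_words_ids=banned, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=True))
</code></pre>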
]]></description><pubDate>Sat, 25 Jan 2025 22:11:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=42825406</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=42825406</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42825406</guid></item><item><title><![CDATA[New comment by popinman322 in "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL"]]></title><description><![CDATA[
<p>DeepSeek was built on the foundations of public research, a major part of which is the Llama family of models. Prior to Llama, open-weights LLMs were considerably less performant; without Llama we might not have gotten Mistral, Qwen, or DeepSeek. This isn't meant to diminish DeepSeek's contributions, however: they've been doing great work on mixture-of-experts models and really pushing the community forward on that front. And, obviously, they've achieved incredible performance.<p>Llama models are also still best in class for specific tasks that require local data processing. They also maintain positions in the top 25 of the lmarena leaderboard (for what that's worth these days, with suspected gaming of the platform), which places them in competition with some of the best models in the world.<p>But, going back to my first point, Llama set the stage for almost all open-weights models after it. Meta spent millions on training runs whose artifacts will never see the light of day, testing theories too expensive for smaller players to contemplate exploring.<p>Pegging Llama as mediocre, or a waste of money (as implied elsewhere), feels incredibly myopic.</p>
]]></description><pubDate>Sat, 25 Jan 2025 22:07:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=42825375</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=42825375</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42825375</guid></item><item><title><![CDATA[New comment by popinman322 in "Supreme Court upholds TikTok ban, but Trump might offer lifeline"]]></title><description><![CDATA[
<p>It's always very interesting to see people pull out threads with low like counts (like 12k) and claim that the central idea of the post is widely held.<p>We're talking about platforms with tens of millions of users; wide appeal is at least a quarter million likes, and mass appeal at least a million. A local-scale influencer can gather 10-30k likes very easily on such a massive platform.</p>
]]></description><pubDate>Fri, 17 Jan 2025 18:53:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=42741923</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=42741923</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42741923</guid></item><item><title><![CDATA[New comment by popinman322 in "Voyage-code-3"]]></title><description><![CDATA[
<p>The LSP is limited in scope and doesn't provide access to things like the AST (which can vary by language). If you want to navigate by symbols, that can be done. If you want to know whether a given import is valid, to verify LLM output, that's not possible.<p>Similarly, you can't use the LSP to determine all valid in-scope objects for an assignment. You can get a hierarchy of symbol information from some servers, allowing selection of particular lexical scopes within the file, but you'll need to perform type analysis yourself to determine which of the available variables could make for a reasonable completion. That type analysis is also a bit tricky because you'll likely need a lot of information about the type hierarchy at that lexical scope, something you can't get from the LSP.<p>It might be feasible to edit an open source LSP implementation for your target language to expose the extra information you'd want, but language servers are relatively heavy pieces of software and, of course, they don't exist for all languages. Compared to the development cost of "just" using embeddings, it's pretty clear why teams choose embeddings.<p>Also, if you assume that the performance improvements we've seen in embeddings for retrieval will continue, it makes less sense to invest weeks of time on something that would otherwise improve passively with time.</p>
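<p>To be clear about what the LSP does give you: "navigate by symbols" is a single JSON-RPC request, which returns a symbol hierarchy with lexical ranges but no AST and no type information. A sketch of the request, with a placeholder URI:</p>
<pre><code>import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/documentSymbol",
    "params": {"textDocument": {"uri": "file:///path/to/module.py"}},
}
# Sent to the language server's stdin using LSP's Content-Length framing.
print(json.dumps(request))
</code></pre>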
]]></description><pubDate>Tue, 14 Jan 2025 09:50:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=42695511</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=42695511</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42695511</guid></item><item><title><![CDATA[New comment by popinman322 in "New LLM optimization technique slashes memory costs"]]></title><description><![CDATA[
<p>Google Trends makes it seem like we're out of the exponential growth phase for LLMs; search interest is possibly plateauing.<p>A decline in search interest outside of academia makes sense. The groups who can get by on APIs don't care so much how the sausage is made and just want to see prices come down. Interested parties have likely already found tools that work for them.<p>There's definitely some academic interest outside of CS in producing tools using LLMs. I know plenty of astro folks working to build domain-specific tools with open models as their backbone. They're typically not interested in more operational work, I guess because they assume relevant optimizations will eventually make their way into public inference engines.<p>And CS interest in these models will probably sustain for at least 5-10 more years, even if performance plateaus, as work continues into how LLMs function.<p>All that to say: maybe we're just seeing the trend die down for laypeople?</p>
]]></description><pubDate>Tue, 17 Dec 2024 01:34:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=42437475</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=42437475</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42437475</guid></item><item><title><![CDATA[New comment by popinman322 in "Amazon Nova"]]></title><description><![CDATA[
<p>Try LiteLLM; their core LLM proxy is open source. As an added bonus it also supports other major providers.</p>
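<p>A minimal sketch of a call through LiteLLM; the Bedrock model id for Nova is from memory and may differ:</p>
<pre><code>import litellm

# Same OpenAI-style call shape for every provider; swap the model string
# to switch backends.
resp = litellm.completion(
    model="bedrock/amazon.nova-pro-v1:0",  # assumed id for Nova Pro on Bedrock
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
</code></pre>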
]]></description><pubDate>Wed, 04 Dec 2024 02:03:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=42313840</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=42313840</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42313840</guid></item><item><title><![CDATA[New comment by popinman322 in "How to succeed in MrBeast production (Leaked PDF)"]]></title><description><![CDATA[
<p>Huge +1. If I'd understood this mantra earlier in my career, it would have saved me a large amount of hassle.<p>For juniors: any time you send something important to your manager, confirm they read the document. Don't ask "did you read it?" Don't rely on reactions in chat. Ask a specific question that would require them to have read the contents. For example, if you're sending over a quote from a vendor, and you'd already sent another quote before, you could ask "how does this quote compare to the previous one? [link to previous one]" Always get confirmation at least 24-48 hours in advance of the point of no return (e.g. launch, meeting, changing dates, company-wide emails), preferably in writing.<p>And for _very_ important meetings, ensure all parties have acknowledged understanding of the required information, or schedule pre-meeting briefings with individuals. There's nothing quite like getting thrown under the bus because someone showed up and couldn't figure out the subtleties & context on the fly. Unfortunately, you can't just say "it's a 12-page document for a reason" when your manager is confused in front of their manager.</p>
]]></description><pubDate>Mon, 16 Sep 2024 05:18:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=41552976</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=41552976</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41552976</guid></item><item><title><![CDATA[New comment by popinman322 in "Greppability is an underrated code metric"]]></title><description><![CDATA[
<p>Grep is also useful when IDE indexing isn't feasible for the entire project. At past employers I worked in monorepos where the sheer size of the index caused multiple seconds of delay in intellisense and UI stuttering; our devex team's preferred approach was to better integrate our IDE experience with the build system such that only symbols in scope of the module you were working on would be loaded. This was usually fine, and it works especially well for product teams, but it's a headache when you're doing cross-cutting work (e.g. for infrastructure projects/overhauls).<p>We also had a livegrep instance that we could use to grep any corporate repo, regardless of where it was hosted. That was extremely useful for investigating failures in build scripts that spanned multiple repositories (e.g. building a Go sidecar that relies on a service config in the Java monorepo).</p>
]]></description><pubDate>Tue, 03 Sep 2024 08:36:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=41432658</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=41432658</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41432658</guid></item><item><title><![CDATA[New comment by popinman322 in "Stripe's Monorepo Developer Environment"]]></title><description><![CDATA[
<p>It's possible to get stuck in merge hell, where all your reviewers OK the PR but someone merged a conflict 2 seconds ago, or where you've got a reviewer in Singapore while you're in SF and conflicts appeared overnight.<p>In general, though, it was pretty rare in my experience. The code bases were pretty well modularized.</p>
]]></description><pubDate>Mon, 19 Aug 2024 13:09:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=41290747</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=41290747</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41290747</guid></item><item><title><![CDATA[New comment by popinman322 in "OpenDevin: An Open Platform for AI Software Developers as Generalist Agents"]]></title><description><![CDATA[
<p>This is where supporting machinery & RAG are very useful.<p>You can auto-lint and test code before you set eyes on it, then re-run the prompt with either more context or an altered prompt (a sketch of that loop is below). With local models there are options like steering vectors, fine-tuning, and constrained decoding as well.<p>There's also evidence that multiple models of different lineages, if their outputs are rated and the best one is taken at each step, can together surpass the performance of better models. So if one model knows something the others don't, you can automatically fail over to the one that can actually handle the problem, and typically, once the knowledge is in the chat, the other models will pick it up.<p>I'm not saying any readily available software solves your specific problem, but there are approaches specific to it that go beyond current methods.</p>
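<p>The auto-lint-and-retry loop is simple to wire up. A sketch, with a placeholder <i>generate</i> standing in for whatever LLM call you already have, and ruff as the linter:</p>
<pre><code>import subprocess
import tempfile

def lint(code: str) -> str:
    """Return the linter's complaints, or "" if the code is clean."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(["ruff", "check", path], capture_output=True, text=True)
    return proc.stdout if proc.returncode != 0 else ""

def generate_checked(generate, prompt: str, max_tries: int = 3) -> str:
    code = generate(prompt)
    for _ in range(max_tries):
        errors = lint(code)
        if not errors:
            return code
        # Feed the linter output back so the next attempt can fix it.
        code = generate(prompt + "\n\nYour last attempt failed lint:\n"
                        + errors + "\nPlease fix it.")
    return code
</code></pre>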
]]></description><pubDate>Sun, 11 Aug 2024 18:59:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=41218542</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=41218542</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41218542</guid></item><item><title><![CDATA[New comment by popinman322 in "Ontario family doctor says new AI notetaking saved her job"]]></title><description><![CDATA[
<p>Tangent here: really? I've found base Whisper has concerning error rates for non-US English accents; I imagine the same is true for other languages where the source dataset skews heavily toward one regional accent.<p>Whisper + an LLM can recover some of the gaps by filling in contextually plausible bits, but then it's not a transcript and may contain hallucinations.<p>There are alternatives that share Whisper's internal states with an LLM to improve ASR, as well as approaches that sample N-best hypotheses from Whisper and fine-tune an LLM to distill the hypotheses into a single output. I haven't looked too much into these yet, given how expensive each component is to run independently.</p>
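<p>The N-best idea, as a sketch; <i>transcribe</i> and <i>llm</i> stand in for your actual Whisper and LLM calls, and resampling at several temperatures is one assumed way to get multiple hypotheses:</p>
<pre><code># Sample several transcripts of the same audio, then ask an LLM to reconcile
# them into a single best guess.
def fuse_transcripts(audio, transcribe, llm, temperatures=(0.0, 0.4, 0.8)):
    hypotheses = [transcribe(audio, temperature=t) for t in temperatures]
    numbered = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
    prompt = ("These are candidate transcripts of the same audio. "
              "Output the single most plausible transcript, keeping anything "
              "all candidates agree on unchanged:\n" + numbered)
    return llm(prompt)
</code></pre>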
]]></description><pubDate>Fri, 03 May 2024 08:06:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=40245340</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=40245340</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40245340</guid></item><item><title><![CDATA[New comment by popinman322 in "Iterative reasoning preference optimization"]]></title><description><![CDATA[
<p>Also, this is similar to Orca-Math, but without a teacher model. Orca-Math likewise followed an iterative DPO/KTO scheme, though with no length-normalized NLL loss term.</p>
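<p>For reference, the combined objective here (DPO on the preferred/rejected pair plus a length-normalized NLL term on the preferred response) is roughly, in my notation:</p>
<pre><code>\mathcal{L} = -\log\sigma\Big(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)}
                            - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\Big)
              - \alpha\,\frac{\log\pi_\theta(y_w|x)}{|y_w|}
</code></pre>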
]]></description><pubDate>Wed, 01 May 2024 04:48:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=40219610</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=40219610</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40219610</guid></item><item><title><![CDATA[New comment by popinman322 in "Haystack DB – 10x faster than FAISS with binary embeddings by default"]]></title><description><![CDATA[
<p>I remember stumbling upon an early discussion about this [0] a bit ago in the EleutherAI discord when searching for discussion about a paper; I'm glad to see it's turned into something public.<p>[0]: <a href="https://discord.com/channels/729741769192767510/730095596861521970/1227875814435979314" rel="nofollow">https://discord.com/channels/729741769192767510/730095596861...</a></p>
]]></description><pubDate>Mon, 29 Apr 2024 10:46:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=40196704</link><dc:creator>popinman322</dc:creator><comments>https://news.ycombinator.com/item?id=40196704</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40196704</guid></item></channel></rss>