<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: throwdbaaway</title><link>https://news.ycombinator.com/user?id=throwdbaaway</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Mon, 13 Apr 2026 12:09:55 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=throwdbaaway" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by throwdbaaway in "Pro Max 5x quota exhausted in 1.5 hours despite moderate usage"]]></title><description><![CDATA[
<p><a href="https://github.com/anthropics/claude-code/issues/46829#issuecomment-4231266649" rel="nofollow">https://github.com/anthropics/claude-code/issues/46829#issue...</a> - Have you checked with your colleague? (and his AI, of course)</p>
]]></description><pubDate>Sun, 12 Apr 2026 15:28:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=47740856</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=47740856</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47740856</guid></item><item><title><![CDATA[New comment by throwdbaaway in "Research-Driven Agents: When an agent reads before it codes"]]></title><description><![CDATA[
<p>> EC2 instances on shared hardware showed up to 30% variance between runs due to noisy neighbors.<p>Based on this finding, I suppose the better way is to rely on local hardware whenever possible?</p>
]]></description><pubDate>Fri, 10 Apr 2026 03:45:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=47713368</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=47713368</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47713368</guid></item><item><title><![CDATA[New comment by throwdbaaway in "Research-Driven Agents: When an agent reads before it codes"]]></title><description><![CDATA[
<p>Very nice TG improvement from the Flash Attention KQ fusion. Is this something that was already done in ik_llama.cpp? If not, it would be a welcome addition for hybrid CPU/GPU inference.</p>
]]></description><pubDate>Fri, 10 Apr 2026 02:59:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=47713094</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=47713094</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47713094</guid></item><item><title><![CDATA[New comment by throwdbaaway in "GLM-5.1: Towards Long-Horizon Tasks"]]></title><description><![CDATA[
<p><a href="https://github.com/THUDM/IndexCache" rel="nofollow">https://github.com/THUDM/IndexCache</a> - Some issues are probably to be expected when rolling this out. They don't have enough compute, so they have to innovate.</p>
]]></description><pubDate>Tue, 07 Apr 2026 21:19:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=47681493</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=47681493</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47681493</guid></item><item><title><![CDATA[New comment by throwdbaaway in "Can I run AI locally?"]]></title><description><![CDATA[
<p>90% of what you pay in agentic coding is for cached reads, which are free with local inference serving one user. This has been well known on r/LocalLLaMA for ages, and an article about it also hit the HN front page a few weeks ago.</p>
]]></description><pubDate>Sat, 14 Mar 2026 08:12:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=47374424</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=47374424</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47374424</guid></item><item><title><![CDATA[New comment by throwdbaaway in "No, it doesn't cost Anthropic $5k per Claude Code user"]]></title><description><![CDATA[
<p>What about the VRAM requirement for the KV cache? That may matter more than memory bandwidth. With these GPUs, compute capacity is more abundant than memory bandwidth, which in turn is more abundant than VRAM.<p>DeepSeek got MLA, and then DSA. Qwen got gated delta-net. These inventions allow efficient inference both at home and at scale. If Anthropic has nothing comparable, then their inference cost can be much higher.<p>DeepSeek also got <a href="https://github.com/deepseek-ai/3FS" rel="nofollow">https://github.com/deepseek-ai/3FS</a>, which makes cached reads a lot cheaper with a much longer TTL. If Anthropic didn't invent anything similar and instead uses some expensive solution like Redis, as the crappy TTL suggests, then that also contributes to higher inference cost.</p>
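A back-of-envelope sketch of why KV cache VRAM bites at long context. The model shape below is illustrative (roughly a 70B-class GQA model: 80 layers, 8 KV heads, head dim 128) and an fp16 cache is assumed:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per_elem=2):
    # K and V each hold layers * kv_heads * head_dim elements per token,
    # so the cache grows linearly with context length.
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem

# Illustrative 70B-class GQA shape at a 128k-token context:
full = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, context=128_000)
print(f"{full / 2**30:.1f} GiB")  # ~39.1 GiB for a single sequence
```

At that rate, a handful of concurrent long-context users exhaust an 80 GB card before you even count the weights, which is why cache-compressing schemes like MLA change the serving economics.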
]]></description><pubDate>Wed, 11 Mar 2026 14:49:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=47336346</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=47336346</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47336346</guid></item><item><title><![CDATA[New comment by throwdbaaway in "How to run Qwen 3.5 locally"]]></title><description><![CDATA[
<p>Yours is the only benchmark that puts 35B A3B above 27B. Time for human judgment to verify? For example, if you look at the thinking traces, there might be logical inconsistencies in the prompts that tripped up the 27B more during reasoning. That would also be reflected in the score when thinking is disabled, but with the thinking traces we can at least sort of debug it.</p>
]]></description><pubDate>Sun, 08 Mar 2026 11:31:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=47296501</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=47296501</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47296501</guid></item><item><title><![CDATA[New comment by throwdbaaway in "How to run Qwen 3.5 locally"]]></title><description><![CDATA[
<p>Using ik_llama.cpp to run a 27B 4bpw quant on a RTX 3090, I get 1312 tok/s PP and 40.7 tok/s TG at zero context, dropping to 1009 tok/s PP and 36.2 tok/s TG at 40960 context.<p>35B A3B is faster but didn't do too well in my limited testing.</p>
]]></description><pubDate>Sun, 08 Mar 2026 09:01:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=47295741</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=47295741</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47295741</guid></item><item><title><![CDATA[New comment by throwdbaaway in "How to run Qwen 3.5 locally"]]></title><description><![CDATA[
<p>There are Qwen3.5 27B quants in the range of 4 bits per weight, which fit into 16 GB of VRAM. The quality is comparable to Sonnet 4.0 from summer 2025. Inference speed is very good with ik_llama.cpp, and still decent with mainline llama.cpp.</p>
]]></description><pubDate>Sun, 08 Mar 2026 07:02:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=47295236</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=47295236</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47295236</guid></item><item><title><![CDATA[New comment by throwdbaaway in "Qwen3.5 122B and 35B models offer Sonnet 4.5 performance on local computers"]]></title><description><![CDATA[
<p>I don't quite get the low temperature coupled with the high penalty. We get thinking loops due to the low temperature, and then counter them with a high penalty. That seems backwards.<p>For Qwen3.5 27B, I got good results with --temp 1.0 --top-p 1.0 --top-k 40 --min-p 0.2, without any penalty. It lets the model explore (temp, top-p, top-k) without going off the rails (min-p) during reasoning. No loops so far.</p>
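A toy sketch of why this combination works (not the actual llama.cpp sampler code): temperature 1.0 keeps the distribution broad, and min-p then drops every token whose probability falls below a fraction of the top token's, pruning the tail that causes derailments.

```python
import math

def min_p_survivors(logits, temperature=1.0, min_p=0.2):
    # Softmax at the given temperature, then keep only tokens whose
    # probability is at least min_p times the most likely token's.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    cutoff = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= cutoff]

# Two strong candidates survive; the long tail is pruned before sampling.
print(min_p_survivors([5.0, 4.5, 1.0, 0.5, 0.0]))  # [0, 1]
```

Unlike top-k with a fixed k, the cutoff adapts: when the model is confident, only one or two tokens survive; when it is uncertain, many do, so exploration is preserved exactly where it is safe.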
]]></description><pubDate>Sun, 01 Mar 2026 04:32:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=47203738</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=47203738</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47203738</guid></item><item><title><![CDATA[New comment by throwdbaaway in "Qwen3.5 122B and 35B models offer Sonnet 4.5 performance on local computers"]]></title><description><![CDATA[
<p>We are all reasonable people here, and while you are (mostly) correct, I think we can all agree that Anthropic's documentation sucks. If I have to infer from the doc:<p>* Haiku 4.5 by default doesn't think, i.e. it has a default thinking budget of 0.<p>* By setting a non-zero thinking budget, Haiku 4.5 can think. My guess is that Claude Code may set this differently for different tasks, e.g. thinking for Explore, no thinking for Compact.<p>* This hybrid thinking is different from the adaptive thinking introduced in Opus 4.6, which, when enabled, can automatically adjust the thinking level based on task difficulty.</p>
]]></description><pubDate>Sun, 01 Mar 2026 04:12:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=47203617</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=47203617</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47203617</guid></item><item><title><![CDATA[New comment by throwdbaaway in "Qwen3.5 122B and 35B models offer Sonnet 4.5 performance on local computers"]]></title><description><![CDATA[
<p>For 27B, just get a used 3090 and hop on to r/LocalLLaMA. You can run a 4bpw quant at full context with Q8 KV cache.</p>
]]></description><pubDate>Sun, 01 Mar 2026 03:47:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=47203488</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=47203488</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47203488</guid></item><item><title><![CDATA[New comment by throwdbaaway in "Qwen3.5 122B and 35B models offer Sonnet 4.5 performance on local computers"]]></title><description><![CDATA[
<p>I would say 27B matches Sonnet 4.0, while 397B A17B matches Opus 4.1. They are indeed nowhere near Sonnet 4.5, but getting a 262144-token context length at good speed on modest hardware is huge for local inference.<p>Will check your updated ranking on Monday.</p>
]]></description><pubDate>Sun, 01 Mar 2026 03:02:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=47203209</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=47203209</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47203209</guid></item><item><title><![CDATA[New comment by throwdbaaway in "What AI coding costs you"]]></title><description><![CDATA[
<p>Can you describe a bit more how this works? I suppose the speed remains about the same, while the experience is more pleasant?<p>(Big fan of SQLAlchemy)</p>
]]></description><pubDate>Sun, 01 Mar 2026 01:53:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=47202779</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=47202779</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47202779</guid></item><item><title><![CDATA[New comment by throwdbaaway in "Claude Sonnet 4.6"]]></title><description><![CDATA[
<p>From quick testing on simple tasks, adaptive thinking with Sonnet 4.6 uses about 50% more reasoning tokens than Opus 4.6.<p>Let's see how long it will take for DeepSeek to crack this.</p>
]]></description><pubDate>Wed, 18 Feb 2026 00:18:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=47055378</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=47055378</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47055378</guid></item><item><title><![CDATA[New comment by throwdbaaway in "Two different tricks for fast LLM inference"]]></title><description><![CDATA[
<p>If you ask someone knowledgeable at r/LocalLLaMA about an inference configuration that can increase TG by *up to* 2.5x, particularly for a sample prompt that reads "*Refactor* this module to use dependency injection", then the answer is of course speculative decoding.<p>You don't have to work for a frontier lab to know that. You just have to be GPU poor.</p>
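A toy, greedy-only sketch of the idea (real implementations verify the draft's proposal in a single batched target forward pass and handle sampled, not just argmax, decoding):

```python
def speculative_decode(target, draft, prompt, k=4, max_new=8):
    # target/draft: callables mapping a token tuple to the next token.
    # The cheap draft proposes k tokens; the target keeps the prefix it
    # agrees with and substitutes its own token at the first mismatch.
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft(tuple(ctx))
            proposal.append(t)
            ctx.append(t)
        kept = []
        for t in proposal:
            expected = target(tuple(out + kept))
            kept.append(t if t == expected else expected)
            if t != expected:
                break
        out.extend(kept)
    return out[: len(prompt) + max_new]

# Toy "models" that alternate tokens; a perfect draft gets all k tokens
# accepted per verification step.
target = lambda ctx: "b" if ctx[-1] == "a" else "a"
print(speculative_decode(target, target, ["a"], max_new=4))  # ['a', 'b', 'a', 'b', 'a']
```

Refactoring prompts are the best case: much of the output copies the input verbatim, so even a tiny draft model predicts long runs that the target accepts wholesale, which is where the up-to-2.5x TG comes from.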
]]></description><pubDate>Mon, 16 Feb 2026 03:13:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=47030456</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=47030456</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47030456</guid></item><item><title><![CDATA[New comment by throwdbaaway in "We mourn our craft"]]></title><description><![CDATA[
<p>As mentioned by the sibling comment from godelski, it is about the lack of precision, not the lack of determinism. After all, we already got <a href="https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/" rel="nofollow">https://thinkingmachines.ai/blog/defeating-nondeterminism-in...</a>, which is not even an issue for single-user local inference.<p>Question: Have you tried using an LLM as a compiler?<p>Well, I sort of did, as a fun exercise. I came up with a very elaborate ~5000-token prompt, such that when fed a ~500-token function, I get back a ~600-token rewritten function.<p>The prompt contains 10+ examples, so the model learns the steps from the context. It starts by going through a series of yes/no questions to decide which rewrite pattern to apply. The tricky part here is the lack of precision: the "else" clause has to be reserved for the condition that is the hardest to communicate clearly in English. It then extracts the part that needs to be rewritten and introspects the formatting, again with a series of simple questions. Lastly, it proceeds, confidently, with the rewrite.<p>With this, I did some testing on 50+ randomly chosen functions, and I got back the exact same rewritten functions from about 20 models that are good at coding, down to the newlines and indentation. With a strong model, there might only be 1~2 output tokens in the whole test where the probability was less than 80%, so the lack of batch invariance wasn't even a problem. (temperature=0 usually messes up logprobs; go with top_k=1 or top_p=0.01)<p>So input + English = output works for multiple models from multiple companies.<p>But what's the point of writing so much English in the hope that it leaves no room for ambiguity? For now, I will stick with mitchellh's style of (occasional) LLM-assisted programming, jumping in to write the code when precision is needed.</p>
]]></description><pubDate>Mon, 09 Feb 2026 17:13:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=46947842</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=46947842</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46947842</guid></item><item><title><![CDATA[New comment by throwdbaaway in "My AI Adoption Journey"]]></title><description><![CDATA[
<p>Not using Hot Aisle for inference?</p>
]]></description><pubDate>Fri, 06 Feb 2026 06:18:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=46909714</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=46909714</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46909714</guid></item><item><title><![CDATA[New comment by throwdbaaway in "Coding assistants are solving the wrong problem"]]></title><description><![CDATA[
<p>I thought "iterate and improve" was exactly what Phil did.</p>
]]></description><pubDate>Tue, 03 Feb 2026 20:51:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=46877106</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=46877106</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46877106</guid></item><item><title><![CDATA[New comment by throwdbaaway in "Coding assistants are solving the wrong problem"]]></title><description><![CDATA[
<p>I call this the Groundhog Day loop</p>
]]></description><pubDate>Tue, 03 Feb 2026 14:56:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=46871743</link><dc:creator>throwdbaaway</dc:creator><comments>https://news.ycombinator.com/item?id=46871743</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46871743</guid></item></channel></rss>