<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: LuxBennu</title><link>https://news.ycombinator.com/user?id=LuxBennu</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Fri, 01 May 2026 10:18:32 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=LuxBennu" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by LuxBennu in "ChatGPT for Excel"]]></title><description><![CDATA[
<p>Chatgpt for Excel is still an office add-in running in the same sandbox though. strongpigeon described the exact bottleneck upthread, process boundary crossings, context.sync() roundtrips that take seconds on web. That's a platform limitation, not a model limitation.
Swapping AI behind the add-in doesn't fix the fundamental constraint that third-party add-ins can't deeply integrate with Excel's runtime the way a native feature can. If copilot is bad despite having more access to excel internals(I don't like how Copilot is designed or implemented tho), an add-in with less access is likely not be better.</p>
]]></description><pubDate>Wed, 15 Apr 2026 23:28:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=47786718</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47786718</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47786718</guid></item><item><title><![CDATA[Making prompts longer did not help. Making the task contract explicit did]]></title><description><![CDATA[
<p>Article URL: <a href="https://signaldepth.ai/">https://signaldepth.ai/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47778720">https://news.ycombinator.com/item?id=47778720</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 15 Apr 2026 13:31:03 +0000</pubDate><link>https://signaldepth.ai/</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47778720</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47778720</guid></item><item><title><![CDATA[New comment by LuxBennu in "Show HN: Gemma 4 Multimodal Fine-Tuner for Apple Silicon"]]></title><description><![CDATA[
<p>Yeah sorry that was unclear on my part. I chunk at the endpoint level, whisper itself obviously processes 30s windows. The memory/latency thing I was referring to is more about processing longer files end to end through the pipeline, not a single whisper pass. My fastapi wrapper just splits the audio and runs chunks sequentially so total wall time scales linearly with file length, nothing fancy.</p>
]]></description><pubDate>Wed, 08 Apr 2026 15:55:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=47691954</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47691954</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47691954</guid></item><item><title><![CDATA[New comment by LuxBennu in "Show HN: Ghost Pepper – Local hold-to-talk speech-to-text for macOS"]]></title><description><![CDATA[
<p>Oh nice, the pyannote coreml port is interesting. Last time I looked at pyannote it was pytorch only so getting it to run efficiently on apple silicon was kind of a pain. Does the coreml version handle diarization or just activity detection?</p>
]]></description><pubDate>Tue, 07 Apr 2026 22:36:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=47682234</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47682234</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47682234</guid></item><item><title><![CDATA[New comment by LuxBennu in "Show HN: Gemma 4 Multimodal Fine-Tuner for Apple Silicon"]]></title><description><![CDATA[
<p>Ah that makes sense, quadratic scaling is brutal. So with 96gb i'd probably get somewhere around 4-5k total sequence length before hitting the wall, which is still pretty limiting for anything multimodal. Do you do any gradient checkpointing or is that not worth the speed tradeoff at these sizes?</p>
]]></description><pubDate>Tue, 07 Apr 2026 22:35:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=47682225</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47682225</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47682225</guid></item><item><title><![CDATA[New comment by LuxBennu in "Show HN: Gemma 4 Multimodal Fine-Tuner for Apple Silicon"]]></title><description><![CDATA[
<p>I run whisper large-v3 on an m2 max 96gb and even with just inference the memory gets tight on longer audio, can only imagine what fine-tuning looks like. Does the 64gb vs 96gb make a meaningful difference for gemma 4 fine-tuning or does it just push the oom wall back a bit? Been wanting to try local fine-tuning on apple silicon but the tooling gap has kept me on inference only so far.</p>
]]></description><pubDate>Tue, 07 Apr 2026 20:28:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=47680929</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47680929</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47680929</guid></item><item><title><![CDATA[New comment by LuxBennu in "Show HN: Ghost Pepper – Local hold-to-talk speech-to-text for macOS"]]></title><description><![CDATA[
<p>Yeah that makes sense, chunking on silence would sidestep the latency issue pretty cleanly. I've been running it through a basic fastapi wrapper so it just takes whatever audio blob gets thrown at it, no chunking logic on the server side. Might be worth adding a vad pass before sending to whisper though, would cut down on processing dead air too.</p>
]]></description><pubDate>Mon, 06 Apr 2026 21:47:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=47667650</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47667650</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47667650</guid></item><item><title><![CDATA[New comment by LuxBennu in "Show HN: Ghost Pepper – 100% local hold-to-talk speech-to-text for macOS"]]></title><description><![CDATA[
<p>I've been running whisper large-v3 on an m2 max through a self-hosted endpoint and honestly the accuracy is good enough that i stopped bothering with cleanup models. The bigger annoyance for me was latency on longer chunks, like anything over 30 seconds starts feeling sluggish even with metal acceleration. Haven't tried whisperkit specifically but curious how it handles longer audio compared to the full model.</p>
]]></description><pubDate>Mon, 06 Apr 2026 20:48:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=47666857</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47666857</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47666857</guid></item><item><title><![CDATA[New comment by LuxBennu in "Ollama is now powered by MLX on Apple Silicon in preview"]]></title><description><![CDATA[
<p>that tracks with what i've noticed practically. shorter prompts feel basically the same between llama.cpp metal and what i'd expect from native mlx, but once context gets longer the overhead starts showing up. would be interesting to see if ollama's mlx path actually handles kv cache differently under the hood or if it just skips the buffer sync layer</p>
]]></description><pubDate>Wed, 01 Apr 2026 07:25:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=47597922</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47597922</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47597922</guid></item><item><title><![CDATA[New comment by LuxBennu in "Ollama is now powered by MLX on Apple Silicon in preview"]]></title><description><![CDATA[
<p>Roughly 8-12 token/s on generation depending on context length. Prompt processing is faster obviously. Haven't benchmarked it super carefully though, just eyeballing the llama.cpp output.</p>
]]></description><pubDate>Wed, 01 Apr 2026 07:21:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=47597902</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47597902</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47597902</guid></item><item><title><![CDATA[New comment by LuxBennu in "From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem"]]></title><description><![CDATA[
<p>yeah fair point, it's definitely model dependent. i've had good results with qwen but tried it on a smaller mistral variant once and the output quality dropped noticeably even at q8 for both. the speed hit from mixed types hasn't been bad on apple silicon in my experience but i can see it mattering more on cuda.</p>
]]></description><pubDate>Wed, 01 Apr 2026 07:20:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=47597894</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47597894</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47597894</guid></item><item><title><![CDATA[New comment by LuxBennu in "From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem"]]></title><description><![CDATA[
<p>good overview of the architecture side but worth mentioning there's another axis that stacks on top of all of this: you can quantize the kv cache itself at inference time. in llama.cpp you can run q8 for keys and q4 for values and it cuts cache memory roughly in half again on top of whatever gqa or mla already saves you. i run qwen 70b 4-bit on m2 max 96gb and the kv quant is what actually made longer contexts fit without running out of unified memory. keys need more precision because they drive attention scores but values are way more tolerant of lossy compression, so the asymmetry works out.</p>
]]></description><pubDate>Tue, 31 Mar 2026 19:29:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=47592257</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47592257</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47592257</guid></item><item><title><![CDATA[New comment by LuxBennu in "Show HN: Reprompt – Analyze what you type into AI tools, not what they output"]]></title><description><![CDATA[
<p>Thanks! Turns out structural signals get you surprisingly far. An LLM catches more, but speed is the feature.</p>
]]></description><pubDate>Tue, 31 Mar 2026 18:23:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=47591461</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47591461</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47591461</guid></item><item><title><![CDATA[New comment by LuxBennu in "Show HN: Reprompt – Analyze what you type into AI tools, not what they output"]]></title><description><![CDATA[
<p>I ran this on my own prompt history and three things surprised me. found 3 API keys buried in copy-pasted stack traces (`reprompt privacy`). 35% of my agent sessions had error loops -- the agent retrying the same failing approach 3+ times (`reprompt agent`). And 50-70% of my conversation turns were filler like "ok try that" (`reprompt distill`).<p><pre><code>    pip install reprompt-cli
    reprompt scan && reprompt
</code></pre>
Everything runs locally -- zero network calls, zero telemetry. Also works as an MCP server and GitHub Action.</p>
]]></description><pubDate>Tue, 31 Mar 2026 15:53:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=47589246</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47589246</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47589246</guid></item><item><title><![CDATA[Show HN: Reprompt – Analyze what you type into AI tools, not what they output]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/reprompt-dev/reprompt">https://github.com/reprompt-dev/reprompt</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47589133">https://news.ycombinator.com/item?id=47589133</a></p>
<p>Points: 3</p>
<p># Comments: 3</p>
]]></description><pubDate>Tue, 31 Mar 2026 15:46:25 +0000</pubDate><link>https://github.com/reprompt-dev/reprompt</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47589133</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47589133</guid></item><item><title><![CDATA[New comment by LuxBennu in "Ollama is now powered by MLX on Apple Silicon in preview"]]></title><description><![CDATA[
<p>Already running qwen 70b 4-bit on m2 max 96gb through llama.cpp and it's pretty solid for day to day stuff. The mlx switch is interesting because ollama was basically shelling out to llama.cpp on mac before, so native mlx should mean better memory handling on apple silicon. Curious to see how it compares on the bigger models vs the gguf path</p>
]]></description><pubDate>Tue, 31 Mar 2026 05:01:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=47582925</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47582925</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47582925</guid></item><item><title><![CDATA[New comment by LuxBennu in "New Apple Silicon M4 and M5 HiDPI Limitation on 4K External Displays"]]></title><description><![CDATA[
<p>Sadly I have the issue on a new m5 air. I have a 60hz 4k work monitor and two high refresh 4k gaming displays. The 60hz pairs fine with either gaming monitor, but the two gaming ones together and one just doesn't get recognized. Spent way too long trying new cables before realizing it's a bandwidth limitation.</p>
]]></description><pubDate>Mon, 30 Mar 2026 03:29:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=47570104</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47570104</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47570104</guid></item><item><title><![CDATA[New comment by LuxBennu in "Claude Code runs Git reset –hard origin/main against project repo every 10 mins"]]></title><description><![CDATA[
<p>This is true for prohibitions but claude.md works really well as positive documentation. I run custom mcp servers and documenting what each tool does and when to use it made claude pick the right ones way more reliably. Totally different outcome than a list of NEVER DO THIS rules though, for that you definitely need hooks or sandboxing.</p>
]]></description><pubDate>Mon, 30 Mar 2026 03:22:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=47570072</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47570072</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47570072</guid></item><item><title><![CDATA[New comment by LuxBennu in "AI overly affirms users asking for personal advice"]]></title><description><![CDATA[
<p>yeah that's a good way to put it. the "felt good in the moment" framing is basically the whole problem. the reward model was trained on human preferences and humans preferred the agreeable answer, so now that's what you get at inference time regardless of whether it's correct. the frustrating part is you can see it happen in real time if you log the outputs turn by turn, the model will literally contradict its own previous response just because the user sounded more confident.</p>
]]></description><pubDate>Sun, 29 Mar 2026 14:31:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=47563485</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47563485</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47563485</guid></item><item><title><![CDATA[New comment by LuxBennu in "AI overly affirms users asking for personal advice"]]></title><description><![CDATA[
<p>i tested this pretty extensively actually. built a pipeline that asks the same question rephrased across multiple turns and tracks how much the model shifts based on user tone. even when you tell it to be critical, the moment the user pushes back with any confidence the model just folds. it's not a prompting problem, it's baked into RLHF. you're right that LLMs will poke holes in stuff when the conversation starts neutral, but add any emotional charge and the sycophancy takes over immediately. that's exactly why the personal advice angle matters, that's peak emotional signal from the user.</p>
]]></description><pubDate>Sun, 29 Mar 2026 07:50:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=47561177</link><dc:creator>LuxBennu</dc:creator><comments>https://news.ycombinator.com/item?id=47561177</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47561177</guid></item></channel></rss>