<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: ggerganov</title><link>https://news.ycombinator.com/user?id=ggerganov</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 23 Apr 2026 04:33:00 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=ggerganov" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by ggerganov in "Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model"]]></title><description><![CDATA[
<p>llama-batched-bench -hf ggml-org/Qwen3.6-27B-GGUF -npp 512,1024,2048,4096,8192,16384,32768 -ntg 128 -npl 1 -c 36000<p>M2 Ultra, Q8_0<p><pre><code>  |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
  |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
  |   512 |    128 |    1 |    640 |    1.307 |   391.69 |    6.209 |    20.61 |    7.516 |    85.15 |
  |  1024 |    128 |    1 |   1152 |    2.534 |   404.16 |    6.227 |    20.56 |    8.760 |   131.50 |
  |  2048 |    128 |    1 |   2176 |    5.029 |   407.26 |    6.229 |    20.55 |   11.258 |   193.29 |
  |  4096 |    128 |    1 |   4224 |   10.176 |   402.52 |    6.278 |    20.39 |   16.454 |   256.72 |
  |  8192 |    128 |    1 |   8320 |   20.784 |   394.14 |    6.376 |    20.08 |   27.160 |   306.33 |
  | 16384 |    128 |    1 |  16512 |   43.513 |   376.53 |    6.532 |    19.59 |   50.046 |   329.94 |
  | 32768 |    128 |    1 |  32896 |   99.137 |   330.53 |    7.081 |    18.08 |  106.218 |   309.70 |

</code></pre>
DGX Spark, Q8_0<p><pre><code>  |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
  |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
  |   512 |    128 |    1 |    640 |    0.881 |   580.98 |   16.122 |     7.94 |   17.003 |    37.64 |
  |  1024 |    128 |    1 |   1152 |    1.749 |   585.43 |   16.131 |     7.93 |   17.880 |    64.43 |
  |  2048 |    128 |    1 |   2176 |    3.486 |   587.54 |   16.169 |     7.92 |   19.655 |   110.71 |
  |  4096 |    128 |    1 |   4224 |    7.018 |   583.64 |   16.245 |     7.88 |   23.263 |   181.58 |
  |  8192 |    128 |    1 |   8320 |   14.189 |   577.33 |   16.427 |     7.79 |   30.617 |   271.75 |
  | 16384 |    128 |    1 |  16512 |   29.015 |   564.68 |   16.749 |     7.64 |   45.763 |   360.81 |
  | 32768 |    128 |    1 |  32896 |   60.413 |   542.40 |   17.359 |     7.37 |   77.772 |   422.98 |</code></pre></p>
]]></description><pubDate>Wed, 22 Apr 2026 20:37:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=47868989</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=47868989</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47868989</guid></item><item><title><![CDATA[New comment by ggerganov in "Show HN: 1-Bit Bonsai, the First Commercially Viable 1-Bit LLMs"]]></title><description><![CDATA[
<p>Better keep the KV cache in full precision</p>
]]></description><pubDate>Wed, 01 Apr 2026 06:53:53 +0000</pubDate><link>https://news.ycombinator.com/item?id=47597729</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=47597729</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47597729</guid></item><item><title><![CDATA[New comment by ggerganov in "NVIDIA DGX Spark In-Depth Review: A New Standard for Local AI Inference"]]></title><description><![CDATA[
<p>Yes, I provided detailed numbers here: <a href="https://github.com/ggml-org/llama.cpp/discussions/16578" rel="nofollow">https://github.com/ggml-org/llama.cpp/discussions/16578</a></p>
]]></description><pubDate>Tue, 14 Oct 2025 14:56:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=45580859</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=45580859</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45580859</guid></item><item><title><![CDATA[New comment by ggerganov in "NVIDIA DGX Spark In-Depth Review: A New Standard for Local AI Inference"]]></title><description><![CDATA[
<p>FYI you should have used llama.cpp to do the benchmarks. It performs almost 20x faster than ollama for the gpt-oss-120b model. Here are some sample results on my Spark:<p><pre><code>  ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
  | model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
  | gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 |          pp4096 |       3564.31 ± 9.91 |
  | gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 |            tg32 |         53.93 ± 1.71 |
  | gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |          pp4096 |      1792.32 ± 34.74 |
  | gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |            tg32 |         38.54 ± 3.10 |</code></pre></p>
]]></description><pubDate>Tue, 14 Oct 2025 05:59:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=45576737</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=45576737</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45576737</guid></item><item><title><![CDATA[New comment by ggerganov in "Crush: Glamourous AI coding agent for your favourite terminal"]]></title><description><![CDATA[
<p>They should add "custom endpoint" support instead [0].<p>[0] <a href="https://github.com/microsoft/vscode/issues/249605">https://github.com/microsoft/vscode/issues/249605</a></p>
]]></description><pubDate>Wed, 30 Jul 2025 18:49:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=44738088</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=44738088</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44738088</guid></item><item><title><![CDATA[New comment by ggerganov in "Show HN: Refine – A Local Alternative to Grammarly"]]></title><description><![CDATA[
<p>Gemma 3n (the model used by this app) would run on any Apple Silicon device (even with 8GB RAM).</p>
]]></description><pubDate>Mon, 14 Jul 2025 10:54:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=44558584</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=44558584</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44558584</guid></item><item><title><![CDATA[New comment by ggerganov in "RamaLama"]]></title><description><![CDATA[
<p>The llama.cpp tools and examples download the models by default to an OS-specific cache folder [0]. We try to follow the HF standard (as discussed in the linked thread), though the layout of the llama.cpp cache is not the same atm. Not sure about the plans for RamaLama, but it might be something worth considering.<p>[0] <a href="https://github.com/ggerganov/llama.cpp/issues/7252">https://github.com/ggerganov/llama.cpp/issues/7252</a></p>
]]></description><pubDate>Fri, 31 Jan 2025 14:59:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=42888308</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=42888308</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42888308</guid></item><item><title><![CDATA[New comment by ggerganov in "Llama.vim – Local LLM-assisted text completion"]]></title><description><![CDATA[
<p>Yes, exactly. You can set --ctx-size to a smaller value if you know that you will not hit the limit of 32k - this will save you VRAM.<p>To control how much global context to keep in the ring buffer (i.e. the context that is being reused to enrich the local context), you can adjust the "ring_n_chunks" and "ring_chunk_size" settings. With the default settings, this amounts to about 8k tokens of context on our codebases when the ring buffer is full, which is conservative. Increasing these numbers makes the context bigger, which improves the quality but affects the performance.<p>There are a few other tricks to reduce the compute for the local context (i.e. the 1k batch of tokens), so that in practice a smaller amount is processed. This further saves compute during the prefill.</p>
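The sizing described above can be sketched as simple arithmetic. Note that the knob names come from the comment, but the concrete numbers below are assumptions for illustration, not the plugin's actual defaults:

```python
# Illustrative sketch (assumed values, not llama.vim's real defaults):
# the reused global context scales with both ring-buffer knobs.
def ring_context_tokens(ring_n_chunks: int, ring_chunk_size: int,
                        tokens_per_line: int = 8) -> int:
    """Rough upper bound on reused-context tokens: the ring buffer holds
    ring_n_chunks chunks of ring_chunk_size lines each, at an assumed
    average of tokens_per_line tokens per line of code."""
    return ring_n_chunks * ring_chunk_size * tokens_per_line

print(ring_context_tokens(16, 64))  # 8192, i.e. ~8k tokens with these assumed values
```

With these hypothetical numbers, doubling either knob doubles the full-buffer context, which is the quality/performance trade-off the comment describes.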
]]></description><pubDate>Fri, 24 Jan 2025 10:00:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=42811823</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=42811823</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42811823</guid></item><item><title><![CDATA[New comment by ggerganov in "Llama.vim – Local LLM-assisted text completion"]]></title><description><![CDATA[
<p>The primary tricks for reducing the latency are around context reuse, meaning that the computed KV cache of tokens from previous requests is reused for new requests, thus saving computation.<p>To get high-quality completions, you need to provide a large context of your codebase so that the generated suggestion is more in line with your style and implementation logic. However, naively increasing the context will quickly hit a computation limit, because each request would need to compute (a.k.a. prefill) a lot of tokens.<p>The KV cache shift used here is an approach to reuse the cache of old tokens by "shifting" them to new absolute positions in the new context. This way, a request that would normally require a context of, let's say, 10k tokens could be processed more quickly by computing just, let's say, 500 tokens and reusing the cache of the other 9.5k tokens, thus cutting the compute ~10-fold.<p>The --ctx-size 0 CLI arg simply tells the server to allocate memory buffers for the maximum context size supported by the model. For the Qwen Coder models, this corresponds to 32k tokens.<p>The batch sizes are related to how much local context around your cursor to use, along with the global context from the ring buffer. This is described in more detail in the links, but simply put: decreasing the batch size will make the completion faster, but with less quality.</p>
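The back-of-the-envelope savings described above can be sketched as follows. The numbers are the comment's own example; this is illustrative arithmetic, not llama.cpp code:

```python
# Sketch of the prefill savings from KV cache reuse (illustrative only):
# tokens whose cache can be "shifted" to new positions need no recompute.
def tokens_to_prefill(ctx_tokens: int, reused_tokens: int) -> int:
    """Tokens that still need computing after reusing a shifted KV cache."""
    return ctx_tokens - reused_tokens

# A 10k-token request reusing the cache of 9.5k tokens only prefills 500.
print(tokens_to_prefill(10_000, 9_500))  # 500
```

Only the 500 fresh tokens go through the prefill, which is where the large end-to-end speedup comes from.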
]]></description><pubDate>Fri, 24 Jan 2025 09:18:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=42811654</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=42811654</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42811654</guid></item><item><title><![CDATA[New comment by ggerganov in "Llama.vim – Local LLM-assisted text completion"]]></title><description><![CDATA[
<p>Appreciate the feedback!<p>Currently, there isn't a user-friendly way to disable the stats from showing apart from modifying the "'show_info': 0" value directly in the plugin implementation. These things will be improved with time and will become more user-friendly.<p>A few extra optimizations will soon land which will further improve the experience:<p>- Speculative FIM<p>- Multiple suggestions</p>
]]></description><pubDate>Fri, 24 Jan 2025 09:04:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=42811609</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=42811609</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42811609</guid></item><item><title><![CDATA[New comment by ggerganov in "Llama.vim – Local LLM-assisted text completion"]]></title><description><![CDATA[
<p>There are 4 stopping criteria atm:<p>- Generation time exceeded (configurable in the plugin config)<p>- Number of tokens exceeded (not the case, since you increased it)<p>- Indentation - stops generating if the next line has a shorter indent than the first line<p>- Small probability of the sampled token<p>Most likely you are hitting the last criterion. It's something that should be improved in some way, but I am not very sure how. Currently, it is using a very basic token sampling strategy with custom threshold logic to stop generating when the token probability is too low. Likely this logic is too conservative.</p>
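The last criterion above can be sketched as a simple threshold check. This is an assumed simplification for illustration; the plugin's actual threshold logic may differ:

```python
# Toy sketch (assumed logic, not the actual llama.vim implementation):
# stop generating once the sampled token's probability is too low.
def should_stop(sampled_prob: float, threshold: float = 0.1) -> bool:
    """Trigger the low-probability stopping criterion when the model's
    confidence in the sampled token drops below the threshold."""
    return sampled_prob < threshold

print(should_stop(0.05))  # True: a 5% token is below the 10% cutoff
print(should_stop(0.60))  # False: a confident token keeps generation going
```

A fixed threshold like this is exactly the kind of "too conservative" rule the comment mentions: it cuts off completions early whenever the model briefly becomes uncertain.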
]]></description><pubDate>Thu, 23 Jan 2025 20:23:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=42807696</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=42807696</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42807696</guid></item><item><title><![CDATA[New comment by ggerganov in "Llama.vim – Local LLM-assisted text completion"]]></title><description><![CDATA[
<p>Yes, I think it is surprising that it works.<p>I think a fairly large amount, though I can't give a good number. I have been using GitHub Copilot from the very early days, and with the release of Qwen Coder last year I fully switched to using local completions. I don't use the chat workflow to code though, only FIM.</p>
]]></description><pubDate>Thu, 23 Jan 2025 18:54:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=42806843</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=42806843</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42806843</guid></item><item><title><![CDATA[New comment by ggerganov in "Llama.vim – Local LLM-assisted text completion"]]></title><description><![CDATA[
<p>Hi HN, happy to see this here!<p>I highly recommend taking a look at the technical details of the server implementation that enables large context usage with this plugin - I think it is interesting and has some cool ideas [0].<p>Also, the same plugin is available for VS Code [1].<p>Let me know if you have any questions about the plugin - happy to explain. Btw, the performance has improved compared to what is seen in the README videos thanks to client-side caching.<p>[0] - <a href="https://github.com/ggerganov/llama.cpp/pull/9787">https://github.com/ggerganov/llama.cpp/pull/9787</a><p>[1] - <a href="https://github.com/ggml-org/llama.vscode">https://github.com/ggml-org/llama.vscode</a></p>
]]></description><pubDate>Thu, 23 Jan 2025 18:29:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=42806546</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=42806546</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42806546</guid></item><item><title><![CDATA[New comment by ggerganov in "Show HN: I made an app to use local AI as daily driver"]]></title><description><![CDATA[
<p>So far it's going great! Good community, having fun. Many ideas to explore :-)</p>
]]></description><pubDate>Wed, 28 Feb 2024 10:29:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=39536213</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=39536213</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39536213</guid></item><item><title><![CDATA[New comment by ggerganov in "Show HN: I made an app to use local AI as daily driver"]]></title><description><![CDATA[
<p>> Thanks to the amazing work of @ggerganov on llama.cpp which made this possible. If there is anything that you wish to exist in an ideal local AI app, I'd love to hear about it.<p>The app looks great! Likewise, if you have any requests or ideas for improving llama.cpp, please don't hesitate to open an issue / discussion in the repo</p>
]]></description><pubDate>Wed, 28 Feb 2024 09:14:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=39535722</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=39535722</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39535722</guid></item><item><title><![CDATA[New comment by ggerganov in "Show HN: LLaVaVision: An AI "Be My Eyes"-like web app with a llama.cpp backend"]]></title><description><![CDATA[
<p>I've found lowering the temperature and disabling the repetition penalty can help [0]. My explanation is that the repetition penalty penalizes the end of sentences and sort of forces the generation to go on instead of stopping.<p>[0] <a href="https://old.reddit.com/r/LocalLLaMA/comments/17e855d/llamacpp_server_now_supports_multimodal/k66jpb5/" rel="nofollow noreferrer">https://old.reddit.com/r/LocalLLaMA/comments/17e855d/llamacp...</a></p>
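The mechanism behind this explanation can be sketched as a toy repetition penalty. This is an assumed, simplified CTRL-style penalty for illustration, not llama.cpp's actual sampler code:

```python
# Toy sketch (assumed simplification): a repetition penalty divides the
# positive logit of any token that already appeared, so a sentence-ending
# token that occurred earlier gets suppressed and generation keeps going
# instead of stopping.
def apply_repetition_penalty(logits: dict, seen_tokens: set,
                             penalty: float = 1.3) -> dict:
    out = dict(logits)
    for tok in seen_tokens:
        if tok in out:
            # Divide positive logits, multiply negative ones, to always
            # push already-seen tokens toward lower probability.
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

logits = {"EOS": 2.0, "and": 1.5}
penalized = apply_repetition_penalty(logits, {"EOS"})
# The end-of-sequence logit drops from 2.0 to ~1.54, so continuing the
# sentence becomes relatively more likely.
```

Disabling the penalty (penalty = 1.0) leaves the end-of-sentence logit untouched, which matches the observation that generations then stop more naturally.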
]]></description><pubDate>Mon, 06 Nov 2023 08:34:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=38160077</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=38160077</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38160077</guid></item><item><title><![CDATA[New comment by ggerganov in "Talk-Llama"]]></title><description><![CDATA[
<p>Yes, I was planning to do this back then, but other stuff came up.
There are many different ways in which this simple example can be improved:<p>- better detection of when speech ends (currently a basic adaptive threshold)<p>- use a small LLM for a quick generic response while the big LLM computes<p>- TTS streaming in chunks or sentences<p>One of the better OSS versions of such a chatbot, I think, is <a href="https://github.com/yacineMTB/talk">https://github.com/yacineMTB/talk</a>.
Though probably many other similar projects also exist by now.</p>
]]></description><pubDate>Thu, 02 Nov 2023 17:38:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=38117233</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=38117233</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38117233</guid></item><item><title><![CDATA[New comment by ggerganov in "Talk-Llama"]]></title><description><![CDATA[
<p>Heh, funny to see this pop up here :)<p>The performance on Apple Silicon should be much better today compared to what is shown in the video, as whisper.cpp now runs fully on the GPU and there have been significant improvements in llama.cpp generation speed over the last few months.</p>
]]></description><pubDate>Thu, 02 Nov 2023 13:15:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=38113075</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=38113075</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38113075</guid></item><item><title><![CDATA[New comment by ggerganov in "Attention Is Off By One"]]></title><description><![CDATA[
<p>Yes, I assumed that checking the weights for the presence and amount of outliers is not something that is usually done, so effects on this can be overlooked. If my assumption is wrong and researchers do usually look at such metrics, then my question is not very relevant.<p>Agree - the "how" is straightforward</p>
]]></description><pubDate>Tue, 25 Jul 2023 04:53:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=36858038</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=36858038</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36858038</guid></item><item><title><![CDATA[New comment by ggerganov in "Attention Is Off By One"]]></title><description><![CDATA[
<p>> I don't recall the details exactly, but I don't think it ever did very much.<p>How would you have known if the trick actually reduces the outliers in the weights? Even if the transformer quality does not improve overall, having fewer outliers as a result is very beneficial for more accurate quantization of the data</p>
]]></description><pubDate>Mon, 24 Jul 2023 22:04:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=36854897</link><dc:creator>ggerganov</dc:creator><comments>https://news.ycombinator.com/item?id=36854897</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36854897</guid></item></channel></rss>