<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: typpo</title><link>https://news.ycombinator.com/user?id=typpo</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 10 Jun 2026 08:35:16 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=typpo" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[OpenAI frontier models and Codex are now available on AWS]]></title><description><![CDATA[
<p>Article URL: <a href="https://openai.com/index/openai-frontier-models-and-codex-are-now-available-on-aws/">https://openai.com/index/openai-frontier-models-and-codex-are-now-available-on-aws/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=48363132">https://news.ycombinator.com/item?id=48363132</a></p>
<p>Points: 370</p>
<p># Comments: 131</p>
]]></description><pubDate>Mon, 01 Jun 2026 21:50:02 +0000</pubDate><link>https://openai.com/index/openai-frontier-models-and-codex-are-now-available-on-aws/</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=48363132</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48363132</guid></item><item><title><![CDATA[New comment by typpo in "Advancing finance with Claude Opus 4.6"]]></title><description><![CDATA[
<p>Lately my company has been doing a lot of complex accounting and reporting in spreadsheets.  Overall, I was surprised by how well both GPT and Claude handled some of these extremely tedious tasks.  It's not uncommon for an hours-long task to be compressed to minutes.<p>My anecdotal experience is that GPT 5.2 Pro is decently ahead of Claude Opus 4.5 in this category when it gets to the tricky stuff, both in presentation and accuracy.  The long reasoning seems to help a lot.  Apparently, though, the benchmarks do not agree.<p>Edit - noticed OpenAI specifically focuses on finance use cases in their gpt-5.3-codex blog as well <a href="https://openai.com/index/introducing-gpt-5-3-codex/" rel="nofollow">https://openai.com/index/introducing-gpt-5-3-codex/</a></p>
]]></description><pubDate>Thu, 05 Feb 2026 18:15:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=46902752</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=46902752</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46902752</guid></item><item><title><![CDATA[How to replicate the Claude Code attack with Promptfoo]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.promptfoo.dev/blog/claude-code-attack/">https://www.promptfoo.dev/blog/claude-code-attack/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46006618">https://news.ycombinator.com/item?id=46006618</a></p>
<p>Points: 6</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 21 Nov 2025 17:30:08 +0000</pubDate><link>https://www.promptfoo.dev/blog/claude-code-attack/</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=46006618</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46006618</guid></item><item><title><![CDATA[New comment by typpo in "Visualizing 100k Years of Earth in WebGL"]]></title><description><![CDATA[
<p>Nice work!  This is like a much better version of Ancient Earth[0], which I made ~10 years ago using GPlates[1].  I like your approach of rendering the map itself from data, which makes it continuous, rather than just wrapping map textures around a globe.<p>[0] <a href="https://dinosaurpictures.org/ancient-earth#240" rel="nofollow">https://dinosaurpictures.org/ancient-earth#240</a><p>[1] <a href="https://www.gplates.org/" rel="nofollow">https://www.gplates.org/</a></p>
]]></description><pubDate>Mon, 19 May 2025 16:34:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=44031584</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=44031584</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44031584</guid></item><item><title><![CDATA[New comment by typpo in "Show HN: Time Portal – Get dropped into history, guess where you landed"]]></title><description><![CDATA[
<p>This is so fun and creative. Congrats on launching!</p>
]]></description><pubDate>Wed, 12 Mar 2025 20:44:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=43347478</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=43347478</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43347478</guid></item><item><title><![CDATA[Questions censored by DeepSeek]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.promptfoo.dev/blog/deepseek-censorship/">https://www.promptfoo.dev/blog/deepseek-censorship/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=42858552">https://news.ycombinator.com/item?id=42858552</a></p>
<p>Points: 384</p>
<p># Comments: 227</p>
]]></description><pubDate>Tue, 28 Jan 2025 21:54:36 +0000</pubDate><link>https://www.promptfoo.dev/blog/deepseek-censorship/</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=42858552</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42858552</guid></item><item><title><![CDATA[Llama 3.2]]></title><description><![CDATA[
<p>Article URL: <a href="https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf">https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=41650356">https://news.ycombinator.com/item?id=41650356</a></p>
<p>Points: 21</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 25 Sep 2024 18:23:54 +0000</pubDate><link>https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=41650356</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41650356</guid></item><item><title><![CDATA[New comment by typpo in "Open source AI is the path forward"]]></title><description><![CDATA[
<p>Thanks to Meta for their work on safety, particularly Llama Guard.  Llama Guard 3 adds defamation, elections, and code interpreter abuse as detection categories.<p>Having run many red teams recently as I build out promptfoo's red teaming featureset [0], I've noticed the Llama models punch above their weight in terms of accuracy when it comes to safety. People hate excessive guardrails and Llama seems to thread the needle.<p>Very bullish on open source.<p>[0] <a href="https://www.promptfoo.dev/docs/red-team/" rel="nofollow">https://www.promptfoo.dev/docs/red-team/</a></p>
]]></description><pubDate>Tue, 23 Jul 2024 16:21:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=41047795</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=41047795</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41047795</guid></item><item><title><![CDATA[Automated jailbreaking techniques with DALL-E]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.promptfoo.dev/blog/jailbreak-dalle/">https://www.promptfoo.dev/blog/jailbreak-dalle/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=40847860">https://news.ycombinator.com/item?id=40847860</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Mon, 01 Jul 2024 17:10:24 +0000</pubDate><link>https://www.promptfoo.dev/blog/jailbreak-dalle/</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=40847860</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40847860</guid></item><item><title><![CDATA[New comment by typpo in "Gemma 2: Improving Open Language Models at a Practical Size [pdf]"]]></title><description><![CDATA[
<p>If anyone is interested in evaling Gemma locally, this can be done pretty easily using ollama[0] and promptfoo[1] with the following config:<p><pre><code>  prompts:
    - 'Answer this coding problem in Python: {{ask}}'

  providers:
    - ollama:chat:gemma2:9b
    - ollama:chat:llama3:8b

  tests:
    - vars:
        ask: function to find the nth fibonacci number
    - vars:
        ask: calculate pi to the nth digit
    # ...
</code></pre>
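<p>A minimal sketch of how the tests above could be scored automatically instead of compared by eye, assuming promptfoo's per-test "assert" list.  The specific checks and values are illustrative choices, not part of the original config:</p>
<pre><code>  tests:
    - vars:
        ask: function to find the nth fibonacci number
      assert:
        # Assumption: a Python answer should define at least one function
        - type: contains
          value: 'def '
        # Illustrative: flag responses that open with a conversational preamble
        - type: not-icontains
          value: 'Sure, I can help'
</code></pre>
<p>With assertions in place, each provider gets a pass/fail result per test, which is easier to scan than raw outputs when the test list grows.</p>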
One small thing I've always appreciated about Gemma is that it doesn't include a "Sure, I can help you" preamble.  It just gets right into the code, and follows it with an explanation.  The training seems to emphasize response structure and ease of comprehension.<p>Also, best to run evals that don't rely on rote memorization of public code... so please substitute with your personal tests :)<p>[0] <a href="https://ollama.com/library/gemma2">https://ollama.com/library/gemma2</a><p>[1] <a href="https://github.com/promptfoo/promptfoo">https://github.com/promptfoo/promptfoo</a></p>
]]></description><pubDate>Thu, 27 Jun 2024 18:43:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=40813648</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=40813648</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40813648</guid></item><item><title><![CDATA[Show HN: Automated red teaming for your LLM app]]></title><description><![CDATA[
<p>Hi HN,<p>I built this open-source LLM red teaming tool based on my experience scaling LLMs at a big co to millions of users... and seeing all the bad things people did.<p>How it works:<p>- Uses an unaligned model to create toxic inputs<p>- Runs these inputs through your app using different techniques: raw, prompt injection, and a chain-of-thought jailbreak that tries to re-frame the request to trick the LLM.<p>- Probes a bunch of other failure cases (e.g. will your customer support bot recommend a competitor? Does it think it can process a refund when it can't?  Will it leak your user's address?)<p>- Built on top of promptfoo, a popular eval tool<p>One interesting thing about my approach is that almost none of the tests are hardcoded.  They are all tailored toward the specific purpose of your application, which makes the attacks more potent.<p>Some of these tests reflect fundamental, unsolved issues with LLMs.  Other failures can be solved pretty trivially by prompting or safeguards.<p>Most businesses will never ship LLMs without at least being able to quantify these types of risks.  So I hope this helps someone out.  Happy building!</p>
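<p>As a rough sketch, the steps above might map onto a promptfoo config along these lines.  The "redteam" block, the plugin names, and the strategy names are assumptions based on the promptfoo red team docs and may differ by version; the purpose string is a made-up example:</p>
<pre><code>  redteam:
    # Hypothetical description of the app under test, used to tailor the generated attacks
    purpose: Customer support bot for an online store
    plugins:
      - harmful           # toxic or harmful inputs
      - competitors       # recommending a competitor
      - excessive-agency  # claiming actions it cannot perform, e.g. refunds
      - pii               # leaking personal data such as addresses
    strategies:
      - prompt-injection
      - jailbreak
</code></pre>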
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=40671686">https://news.ycombinator.com/item?id=40671686</a></p>
<p>Points: 23</p>
<p># Comments: 2</p>
]]></description><pubDate>Thu, 13 Jun 2024 16:29:19 +0000</pubDate><link>https://www.promptfoo.dev/docs/red-team/</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=40671686</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40671686</guid></item><item><title><![CDATA[New comment by typpo in "Show HN: I built a backend so simple that it fits in a YAML file"]]></title><description><![CDATA[
<p>Care to explain why you think so?</p>
]]></description><pubDate>Sat, 01 Jun 2024 23:49:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=40550133</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=40550133</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40550133</guid></item><item><title><![CDATA[New comment by typpo in "Google scrambles to manually remove weird AI answers in search"]]></title><description><![CDATA[
<p>The problem in this case is not that it was trained on bad data. The AI summaries are just that - summaries - and there are bad results that it faithfully summarizes.<p>This is an attempt to reduce hallucinations coming full circle. A simple summarization model was meant to reduce hallucination risk, but now it's not discerning enough to exclude untruthful results from the summary.</p>
]]></description><pubDate>Sat, 25 May 2024 16:54:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=40476239</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=40476239</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40476239</guid></item><item><title><![CDATA[New comment by typpo in "Veo"]]></title><description><![CDATA[
<p>The amount of negativity in these comments is astounding.  Congrats to the teams at Google on what they have built, and hoping for more competition and progress in this space.</p>
]]></description><pubDate>Tue, 14 May 2024 19:08:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=40358903</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=40358903</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40358903</guid></item><item><title><![CDATA[New comment by typpo in "Ollama v0.1.33 with Llama 3, Phi 3, and Qwen 110B"]]></title><description><![CDATA[
<p>Paul's benchmarks are excellent and they're the first thing I look for to get a sense of a new model's performance :)<p>For those looking to create their own benchmarks, promptfoo[0] is one way to do this locally:<p><pre><code>  prompts:
    - "Write this in Python 3: {{ask}}"
  
  providers:
    - ollama:chat:llama3:8b
    - ollama:chat:phi3
    - ollama:chat:qwen:7b
    
  tests:
    - vars:
        ask: a function to determine if a number is prime
    - vars:
        ask: a function to split a restaurant bill given individual contributions and shared items
</code></pre>
Jumping in because I'm a big believer in (1) local LLMs, and (2) evals specific to individual use cases.<p>[0] <a href="https://github.com/typpo/promptfoo">https://github.com/typpo/promptfoo</a></p>
]]></description><pubDate>Mon, 29 Apr 2024 03:08:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=40194012</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=40194012</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40194012</guid></item><item><title><![CDATA[New comment by typpo in "Show HN: I made a website that converts YT videos into step-by-step guides"]]></title><description><![CDATA[
<p>Great idea and congrats on shipping the project!<p>I'm curious if you noticed certain models worked better for summarizing and converting to steps. For example, in my projects I've found that Gemini outperforms "better" models like GPT for similar use cases, which I guess makes sense given Google's interest in summarization.</p>
]]></description><pubDate>Sun, 21 Apr 2024 17:02:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=40107294</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=40107294</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40107294</guid></item><item><title><![CDATA[New comment by typpo in "Meta Llama 3"]]></title><description><![CDATA[
<p>Public benchmarks are broadly indicative, but devs really should run custom benchmarks on their own use cases.<p>Replicate created a Llama 3 API [0] very quickly.  This can be used to run simple benchmarks with promptfoo [1] comparing Llama 3 vs Mixtral, GPT, Claude, and others:<p><pre><code>  prompts:
    - 'Answer this programming question concisely: {{ask}}'

  providers:
    - replicate:meta/meta-llama-3-8b-instruct
    - replicate:meta/meta-llama-3-70b-instruct
    - replicate:mistralai/mixtral-8x7b-instruct-v0.1
    - openai:chat:gpt-4-turbo
    - anthropic:messages:claude-3-opus-20240229

  tests:
    - vars:
        ask: Return the nth element of the Fibonacci sequence
    - vars:
        ask: Write pong in HTML
    # ...
</code></pre>
Still testing things but Llama 3 8b is looking pretty good for my set of random programming qs at least.<p>Edit: ollama now supports Llama 3 8b, making it easy to run this eval locally.<p><pre><code>  providers:
    - ollama:chat:llama3
</code></pre>
[0] <a href="https://replicate.com/blog/run-llama-3-with-an-api">https://replicate.com/blog/run-llama-3-with-an-api</a><p>[1] <a href="https://github.com/typpo/promptfoo">https://github.com/typpo/promptfoo</a></p>
]]></description><pubDate>Thu, 18 Apr 2024 17:04:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=40078383</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=40078383</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40078383</guid></item><item><title><![CDATA[New comment by typpo in "Google CodeGemma: Open Code Models Based on Gemma [pdf]"]]></title><description><![CDATA[
<p>If anyone wants to eval this locally versus codellama, it's pretty easy with Ollama[0] and Promptfoo[1]:<p><pre><code>  prompts:
    - "Solve in Python: {{ask}}"

  providers:
    - ollama:chat:codellama:7b
    - ollama:chat:codegemma:instruct

  tests:
    - vars:
        ask: function to return the nth number in fibonacci sequence
    - vars:
        ask: convert roman numeral to number
    # ...
</code></pre>
YMMV based on your coding tasks, but I notice gemma is much less verbose by default.<p>[0] <a href="https://github.com/ollama/ollama">https://github.com/ollama/ollama</a><p>[1] <a href="https://github.com/promptfoo/promptfoo">https://github.com/promptfoo/promptfoo</a></p>
]]></description><pubDate>Tue, 09 Apr 2024 14:49:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=39980053</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=39980053</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39980053</guid></item><item><title><![CDATA[Benchmark Command R vs. GPT/Claude on your own data]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.promptfoo.dev/docs/guides/cohere-command-r-benchmark/">https://www.promptfoo.dev/docs/guides/cohere-command-r-benchmark/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=39978835">https://news.ycombinator.com/item?id=39978835</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 09 Apr 2024 12:46:21 +0000</pubDate><link>https://www.promptfoo.dev/docs/guides/cohere-command-r-benchmark/</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=39978835</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39978835</guid></item><item><title><![CDATA[DBRX vs. Mixtral vs. GPT: create your own benchmark]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.promptfoo.dev/docs/guides/dbrx-benchmark/">https://www.promptfoo.dev/docs/guides/dbrx-benchmark/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=39886088">https://news.ycombinator.com/item?id=39886088</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Sun, 31 Mar 2024 17:20:51 +0000</pubDate><link>https://www.promptfoo.dev/docs/guides/dbrx-benchmark/</link><dc:creator>typpo</dc:creator><comments>https://news.ycombinator.com/item?id=39886088</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39886088</guid></item></channel></rss>