Hacker News: typpo

OpenAI frontier models and Codex are now available on AWS

typpo — Mon, 01 Jun 2026 21:50:02 +0000

Article URL: https://openai.com/index/openai-frontier-models-and-codex-are-now-available-on-aws/

Comments URL: https://news.ycombinator.com/item?id=48363132

Points: 370

# Comments: 131

New comment by typpo in "Advancing finance with Claude Opus 4.6"

typpo — Thu, 05 Feb 2026 18:15:17 +0000

Lately my company has been doing a lot of complex accounting and reporting in spreadsheets. Overall was surprised by how well both GPT and Claude handled some of these extremely tedious tasks. Not uncommon to have an hours-long task compressed to minutes.

My anecdotal experience is GPT 5.2 Pro is decently ahead of Claude Opus 4.5 in this category when it gets to the tricky stuff, both in presentation and accuracy. The long reasoning seems to help a lot. But, apparently the benchmarks do not agree.

Edit - noticed OpenAI specifically focuses on finance use cases in their gpt-5.3-codex blog as well https://openai.com/index/introducing-gpt-5-3-codex/

How to replicate the Claude Code attack with Promptfoo

typpo — Fri, 21 Nov 2025 17:30:08 +0000

Article URL: https://www.promptfoo.dev/blog/claude-code-attack/

Comments URL: https://news.ycombinator.com/item?id=46006618

Points: 6

# Comments: 0

New comment by typpo in "Visualizing 100k Years of Earth in WebGL"

typpo — Mon, 19 May 2025 16:34:43 +0000

Nice work! This is like a much better version of Ancient Earth[0], which I made ~10 years ago using GPlates[1]. I like your approach of rendering the map itself from data, which makes it continuous, rather than just wrapping map textures around a globe.

[0] https://dinosaurpictures.org/ancient-earth#240

[1] https://www.gplates.org/

New comment by typpo in "Show HN: Time Portal – Get dropped into history, guess where you landed"

typpo — Wed, 12 Mar 2025 20:44:17 +0000

This is so fun and creative. Congrats on launching!

Questions censored by DeepSeek

typpo — Tue, 28 Jan 2025 21:54:36 +0000

Article URL: https://www.promptfoo.dev/blog/deepseek-censorship/

Comments URL: https://news.ycombinator.com/item?id=42858552

Points: 384

# Comments: 227

Llama 3.2

typpo — Wed, 25 Sep 2024 18:23:54 +0000

Article URL: https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf

Comments URL: https://news.ycombinator.com/item?id=41650356

Points: 21

# Comments: 0

New comment by typpo in "Open source AI is the path forward"

typpo — Tue, 23 Jul 2024 16:21:43 +0000

Thanks to Meta for their work on safety, particularly Llama Guard. Llama Guard 3 adds defamation, elections, and code interpreter abuse as detection categories.

Having run many red teams recently as I build out promptfoo's red teaming featureset [0], I've noticed the Llama models punch above their weight in terms of accuracy when it comes to safety. People hate excessive guardrails and Llama seems to thread the needle.

Very bullish on open source.

[0] https://www.promptfoo.dev/docs/red-team/

Automated jailbreaking techniques with DALL-E

typpo — Mon, 01 Jul 2024 17:10:24 +0000

Article URL: https://www.promptfoo.dev/blog/jailbreak-dalle/

Comments URL: https://news.ycombinator.com/item?id=40847860

Points: 2

# Comments: 0

New comment by typpo in "Gemma 2: Improving Open Language Models at a Practical Size [pdf]"

typpo — Thu, 27 Jun 2024 18:43:28 +0000

If anyone is interested in evaling Gemma locally, this can be done pretty easily using ollama[0] and promptfoo[1] with the following config:

  prompts:
    - 'Answer this coding problem in Python: {{ask}}'

  providers:
    - ollama:chat:gemma2:9b
    - ollama:chat:llama3:8b

  tests:
    - vars:
        ask: function to find the nth fibonacci number
    - vars:
        ask: calculate pi to the nth digit
    - # ...

One small thing I've always appreciated about Gemma is that it doesn't include a "Sure, I can help you" preamble. It just gets right into the code, and follows it with an explanation. The training seems to emphasize response structure and ease of comprehension.

Also, best to run evals that don't rely on rote memorization of public code... so please substitute with your personal tests :)

[0] https://ollama.com/library/gemma2

[1] https://github.com/promptfoo/promptfoo

Show HN: Automated red teaming for your LLM app

typpo — Thu, 13 Jun 2024 16:29:19 +0000

Hi HN,

I built this open-source LLM red teaming tool based on my experience scaling LLMs at a big co to millions of users... and seeing all the bad things people did.

How it works:

- Uses an unaligned model to create toxic inputs

- Runs these inputs through your app using different techniques: raw, prompt injection, and a chain-of-thought jailbreak that tries to re-frame the request to trick the LLM.

- Probes a bunch of other failure cases (e.g. will your customer support bot recommend a competitor? Does it think it can process a refund when it can't? Will it leak your user's address?)

- Built on top of promptfoo, a popular eval tool

One interesting thing about my approach is that almost none of the tests are hardcoded. They are all tailored toward the specific purpose of your application, which makes the attacks more potent.

Some of these tests reflect fundamental, unsolved issues with LLMs. Other failures can be solved pretty trivially by prompting or safeguards.

Most businesses will never ship LLMs without at least being able to quantify these types of risks. So I hope this helps someone out. Happy building!

Comments URL: https://news.ycombinator.com/item?id=40671686

Points: 23

# Comments: 2

New comment by typpo in "Show HN: I built a backend so simple that it fits in a YAML file"

typpo — Sat, 01 Jun 2024 23:49:05 +0000

Care to explain why you think so?

New comment by typpo in "Google scrambles to manually remove weird AI answers in search"

typpo — Sat, 25 May 2024 16:54:41 +0000

The problem in this case is not that it was trained on bad data. The AI summaries are just that - summaries - and there are bad results that it faithfully summarizes.

This is an attempt to reduce hallucinations coming full circle. A simple summarization model was meant to reduce hallucination risk, but now it's not discerning enough to exclude untruthful results from the summary.

New comment by typpo in "Veo"

typpo — Tue, 14 May 2024 19:08:44 +0000

The amount of negativity in these comments is astounding. Congrats to the teams at Google on what they have built, and hoping for more competition and progress in this space.

New comment by typpo in "Ollama v0.1.33 with Llama 3, Phi 3, and Qwen 110B"

typpo — Mon, 29 Apr 2024 03:08:28 +0000

Paul's benchmarks are excellent and they're the first thing I look for to get a sense of a new model performance :)

For those looking to create their own benchmarks, promptfoo[0] is one way to do this locally:

  prompts:
    - "Write this in Python 3: {{ask}}"
  
  providers:
    - ollama:chat:llama3:8b
    - ollama:chat:phi3
    - ollama:chat:qwen:7b
    
  tests:
    - vars:
        ask: a function to determine if a number is prime
    - vars:
        ask: a function to split a restaurant bill given individual contributions and shared items

Jumping in because I'm a big believer in (1) local LLMs, and (2) evals specific to individual use cases.

[0] https://github.com/typpo/promptfoo

New comment by typpo in "Show HN: I made a website that converts YT videos into step-by-step guides"

typpo — Sun, 21 Apr 2024 17:02:11 +0000

Great idea and congrats on shipping the project!

I'm curious if you noticed certain models worked better for summarizing and converting to steps. For example, in my projects I've found that Gemini outperforms "better" models like GPT for similar use cases, which I guess makes sense given Google's interest in summarization.

New comment by typpo in "Meta Llama 3"

typpo — Thu, 18 Apr 2024 17:04:39 +0000

Public benchmarks are broadly indicative, but devs really should run custom benchmarks on their own use cases.

Replicate created a Llama 3 API [0] very quickly. This can be used to run simple benchmarks with promptfoo [1] comparing Llama 3 vs Mixtral, GPT, Claude, and others:

  prompts:
    - 'Answer this programming question concisely: {{ask}}'

  providers:
    - replicate:meta/meta-llama-3-8b-instruct
    - replicate:meta/meta-llama-3-70b-instruct
    - replicate:mistralai/mixtral-8x7b-instruct-v0.1
    - openai:chat:gpt-4-turbo
    - anthropic:messages:claude-3-opus-20240229

  tests:
    - vars:
        ask: Return the nth element of the Fibonacci sequence
    - vars:
        ask: Write pong in HTML
    # ...

Still testing things but Llama 3 8b is looking pretty good for my set of random programming qs at least.

Edit: ollama now supports Llama 3 8b, making it easy to run this eval locally.

  providers:
    - ollama:chat:llama3

[0] https://replicate.com/blog/run-llama-3-with-an-api

[1] https://github.com/typpo/promptfoo

New comment by typpo in "Google CodeGemma: Open Code Models Based on Gemma [pdf]"

typpo — Tue, 09 Apr 2024 14:49:35 +0000

If anyone wants to eval this locally versus codellama, it's pretty easy with Ollama[0] and Promptfoo[1]:

  prompts:
    - "Solve in Python: {{ask}}"

  providers:
    - ollama:chat:codellama:7b
    - ollama:chat:codegemma:instruct

  tests:
    - vars:
        ask: function to return the nth number in fibonacci sequence
    - vars:
        ask: convert roman numeral to number
    # ...

YMMV based on your coding tasks, but I notice gemma is much less verbose by default.

[0] https://github.com/ollama/ollama

[1] https://github.com/promptfoo/promptfoo

Benchmark Command R vs. GPT/Claude on your own data

typpo — Tue, 09 Apr 2024 12:46:21 +0000

Article URL: https://www.promptfoo.dev/docs/guides/cohere-command-r-benchmark/

Comments URL: https://news.ycombinator.com/item?id=39978835

Points: 2

# Comments: 0

DBRX vs. Mixtral vs. GPT: create your own benchmark

typpo — Sun, 31 Mar 2024 17:20:51 +0000

Article URL: https://www.promptfoo.dev/docs/guides/dbrx-benchmark/

Comments URL: https://news.ycombinator.com/item?id=39886088

Points: 1

# Comments: 0