Hacker News: tedsanders

New comment by tedsanders in "Stanford report highlights growing disconnect between AI insiders and everyone"

tedsanders — Mon, 13 Apr 2026 22:09:41 +0000

> Meanwhile, everyone can read the news about layoffs attributed to AI and can see that hiring (especially of junior engineers) has slowed to a trickle.

According to FRED/Indeed[1], software job openings have been roughly flat for 2-3 years, and they've actually been slightly increasing again. What data source are you looking at?

[1] https://fred.stlouisfed.org/series/IHLIDXUSTPSOFTDEVE

New comment by tedsanders in "ChatGPT Pro now starts at $100/month"

tedsanders — Mon, 13 Apr 2026 06:42:37 +0000

Following up - I was wrong about 10x/40x. Here's how it actually works:

$20 = 1x

$100 = 5x (but temporarily 10x for just Codex until May 31st)

$200 = 20x

We'll send out new tweets and clarify our pricing page.
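
As a quick check on the arithmetic, here is a minimal sketch that divides each plan's multiplier by its price. It uses only the three standing figures listed above (the temporary Codex promotion is left out), and the function name is made up for illustration.

```python
# Usage multipliers relative to the $20 plan, as listed in the comment above.
# Keys are monthly prices in dollars; values are the standing multipliers.
PLANS = {20: 1, 100: 5, 200: 20}


def multiplier_per_dollar(price: int) -> float:
    """Return the usage multiplier obtained per dollar of monthly price."""
    return PLANS[price] / price


for price, multiplier in PLANS.items():
    print(f"${price}: {multiplier}x total, {multiplier_per_dollar(price):.2f}x per dollar")
```

Under these numbers the $20 and $100 plans work out to the same multiplier per dollar, and the $200 plan to twice that.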

New comment by tedsanders in "Exploiting the most prominent AI agent benchmarks"

tedsanders — Sun, 12 Apr 2026 19:28:43 +0000

> I remember the gpt-5 benchmarks and how wildly inaccurate they were data-wise. Linking one[0] that I found so that other people can remember what I am talking about. I remember some data being completely misleading or some reaching more than 100% (iirc)

Yeah, I found that slide very embarrassing. It wasn't intentionally inaccurate or misleading - just a design error made right before we went live. All the numbers on that slide were correct, and there was no problem in terms of research accuracy or data handling or reward hacking. A single bar height had the wrong value, set to its neighbor. Back then, we in the research team would generate data and graphs, and then hand them off to a separate design team, who remade the graphs in our brand style. After the GPT-5 launch with multiple embarrassingly bad graphs, I wrote an internal library so that researchers could generate graphs in our brand style directly, without the handoff. Since then our graphs have been much better.

I don't think it's unfair to assume our sloppiness in graphs translates to sloppiness in eval results. But they are different groups of people working on different timelines, so I hope it's at least plausible that our numbers are pretty honest, even if our design process occasionally results in sloppy graphs.

Regarding the DoW deal, I don't want to comment too publicly. I also can't say anything with confidence, as I wasn't part of the deal in any way, shape, or form. My perception from what I have read and heard is that both Anthropic and OpenAI have good intentions, both have loosened their prior policies over time to allow usage by the US military, and both have red lines to prohibit abuse by the US military. One place they differ is in the mechanisms employed to enforce those red lines (e.g. usage policies vs refusals vs human oversight). Each company asserts their methods are stronger than the other's, so I think we have to make our own judgments there. Accounts from the parties involved in the negotiations also conflict, so I don't think anyone's account can be trusted 100%. With that caveat, I thought this article on the DoW's POV was interesting (seems to support the notion that the breakdown wasn't over differing red lines, especially since they almost managed to salvage the deal): https://www.piratewires.com/p/inside-pentagon-anthropic-deal...

Lastly, I hope it's obvious to everyone that Anthropic is not at all a supply chain risk and the threats there were incredibly disappointing. I support them 100% and I'm glad to see them unhurt by the empty threats.

New comment by tedsanders in "Exploiting the most prominent AI agent benchmarks"

tedsanders — Sun, 12 Apr 2026 00:57:50 +0000

I work at OpenAI and I really don't find this to be the case.

We're pretty diligent about applying search blocklists, closing hacking loopholes, and reading model outputs to catch unanticipated hacks. If we wanted to, we could choose to close our eyes and plug our ears and report higher scores for Terminal-bench, SWE-bench, etc. that technically comply with the reference implementation but aren't aligned with real value delivered to users, but we don't do this. My impression is that Anthropic and other labs are similar. E.g., in the Sonnet 4.6 system card they use a model to detect potential contamination and manually score those outputs as 0 if human review agrees there was contamination. If all the labs cared about was marketing material, it would be quite easy not to do this extra work.
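
A minimal sketch of the scoring rule described above for the system card example: a result is zeroed only when a detector flags it and a human reviewer agrees. Every name here (`EvalResult`, `adjusted_score`, the sample tasks) is hypothetical and is not taken from any lab's actual harness.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    task_id: str
    raw_score: float       # score from the reference grader
    flagged: bool          # a detector model flagged possible contamination
    human_confirmed: bool  # a human reviewer agreed with the flag


def adjusted_score(result: EvalResult) -> float:
    """Zero out a result only when both the detector and a human agree."""
    if result.flagged and result.human_confirmed:
        return 0.0
    return result.raw_score


def benchmark_score(results: list[EvalResult]) -> float:
    """Mean of the adjusted scores across all tasks."""
    return sum(adjusted_score(r) for r in results) / len(results)


# Made-up sample data, purely for illustration.
results = [
    EvalResult("task-1", 1.0, flagged=False, human_confirmed=False),
    EvalResult("task-2", 1.0, flagged=True, human_confirmed=True),   # scored as 0
    EvalResult("task-3", 1.0, flagged=True, human_confirmed=False),  # flag rejected
    EvalResult("task-4", 0.0, flagged=False, human_confirmed=False),
]
print(benchmark_score(results))  # 0.5
```

The reported number can only go down under this rule, which is why doing the extra review costs a lab score rather than gaining it.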

There are a ton of other games you can play with evals too (e.g., test 100 different model checkpoints or run secret prompt optimization to steer away from failing behaviors), but by and large what I've seen inside OpenAI is trustworthy.

I won't say everything is 100% guaranteed bulletproof, as we could always hire 100 more SWEs to improve hack detection systems and manually read outputs. Mistakes do happen, in both directions. Plus there's always going to be a bit of unavoidable multiple model testing bias that's hard to precisely adjust for. Also, there are legitimate gray areas like what to do if your model asks genuinely useful clarifying questions that the original reference implementation scores as 0s, despite there being no instruction that clarifying questions are forbidden. Like, if you tell a model not to ask clarifying questions is that cheating or is that patching the eval to better align it with user value?
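
To illustrate the multiple model testing bias mentioned above, here is a small simulation sketch. It assumes every checkpoint has the same true score and that each eval run adds independent Gaussian noise. The true score, noise level, and trial count are invented for illustration and do not describe any real benchmark.

```python
import random
import statistics


def noisy_eval(true_score: float, noise_sd: float, rng: random.Random) -> float:
    """One eval run: the true score plus Gaussian measurement noise."""
    return true_score + rng.gauss(0.0, noise_sd)


def best_of_n(n: int, true_score: float, noise_sd: float, rng: random.Random) -> float:
    """Evaluate n equally good checkpoints and report only the best score."""
    return max(noisy_eval(true_score, noise_sd, rng) for _ in range(n))


rng = random.Random(0)  # fixed seed so the run is repeatable
TRUE_SCORE, NOISE_SD, TRIALS = 0.70, 0.02, 2000  # assumed values

single = statistics.mean(
    best_of_n(1, TRUE_SCORE, NOISE_SD, rng) for _ in range(TRIALS)
)
best_100 = statistics.mean(
    best_of_n(100, TRUE_SCORE, NOISE_SD, rng) for _ in range(TRIALS)
)

print(f"average reported score, 1 checkpoint:     {single:.3f}")
print(f"average reported score, best of 100 runs: {best_100:.3f}")
```

Even though no checkpoint is actually better than another in this setup, reporting the best of 100 inflates the average reported score by a few noise standard deviations, and the size of that inflation depends on noise that is hard to measure precisely.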

New comment by tedsanders in "ChatGPT Pro now starts at $100/month"

tedsanders — Fri, 10 Apr 2026 06:13:48 +0000

I'm honestly not sure, as I don't work on it. My understanding from afar is:

- There was a 2x promotion in March that ended on April 2, so limits have felt tighter since then

- We sometimes reset rate limits after bugs or milestones or because Tibo feels generous, which can make some days feel different than others (they are typically announced here: https://x.com/thsottiaux)

- Recently Plus was tweaked to have a smaller 5h limit but an increased weekly limit

- Lastly, as part of the new Pro launch, the $100 & $200 Pro tiers are getting a 2x promotion, meaning they are temporarily 10x/40x instead of 5x/20x

I've asked our team to clarify the pricing page. Agree it's not clear.

New comment by tedsanders in "ChatGPT Pro now starts at $100/month"

tedsanders — Thu, 09 Apr 2026 22:53:32 +0000

All good, I interpreted it as postulation and not accusation. :)

I do like the job! Much more organic than yanking tickets, though I'm on the model training side of things, rather than product side. Always a balance between short-term sprints patching bad behaviors for the next model vs long-term investments in infra and science that make future work easier. Sometimes the negative press gets to me a bit (it's a very different feeling than 2022 or 2023), but my goal is just to make the most useful product I can for people. It's been wild how much Codex has already changed my day-to-day work, and I'm so curious to see what it looks like in 2030 or 2040.

New comment by tedsanders in "ChatGPT Pro now starts at $100/month"

tedsanders — Thu, 09 Apr 2026 18:35:40 +0000

Nope, it's just that a lot of people (especially those using Codex) asked us for a medium-sized $100 plan. $20 felt too restrictive and $200 felt like a big jump.

Pricing strategy is always a bit of an art, without a perfect optimum for everyone:

- pay-per-token makes every query feel stressful

- a single plan overcharges light users and annoyingly blocks heavy users

- a zillion plans are confusing / annoying to navigate and change

This change mostly just adds a medium-sized plan for people doing medium-sized amounts of work. People were asking for this, and we're happy to deliver.

(I work at OpenAI.)

New comment by tedsanders in "ChatGPT won't let you type until Cloudflare reads your React state"

tedsanders — Mon, 30 Mar 2026 07:35:19 +0000

For what it's worth, the big AI companies do have opt out mechanisms for scraping and search.

OpenAI documents how to opt out of scraping here: https://developers.openai.com/api/docs/bots

Anthropic documents how to opt out of scraping here: https://privacy.claude.com/en/articles/8896518-does-anthropi...

I'm not sure if Gemini lets you opt out without also delisting you from Google search rankings.

New comment by tedsanders in "ChatGPT won't let you type until Cloudflare reads your React state"

tedsanders — Mon, 30 Mar 2026 07:26:34 +0000

It's documented here: https://developers.openai.com/api/docs/bots

New comment by tedsanders in "Astral to Join OpenAI"

tedsanders — Thu, 19 Mar 2026 15:48:55 +0000

I work at OpenAI. Software developers are not obsoleted by Codex or Claude Code, nor will they be soon.

For our teams, Codex is a massive productivity booster that actually increases the value of each dev. If you check our hiring page, you’ll see we are still hiring aggressively. Our ambitions are bigger than our current workforce, and we continue to pay top dollar for talented devs who want to join us in transforming how silicon chips provide value to humans.

Akin to how compilers reduced the demand for assembly but increased the demand for software engineering, I see Codex reducing the demand for hand-typed code but increasing the demand for software engineering. Codex can read and write code faster than you or me, but it still lacks a lot of intelligence and wisdom and context to do whole jobs autonomously.

New comment by tedsanders in "GPT-5.4"

tedsanders — Thu, 05 Mar 2026 20:34:40 +0000

In the text, we did share one hallucination benchmark: Claim-level errors fell by 33% and responses with an error fell by 18%, on a set of error-prone ChatGPT prompts we collected (though of course the rate will vary a lot across different types of prompts).

Hallucinations are the #1 problem with language models and we are working hard to keep bringing the rate down.

(I work at OpenAI.)

New comment by tedsanders in "GPT-5.4"

tedsanders — Thu, 05 Mar 2026 18:41:14 +0000

Yeah, long context vs compaction is always an interesting tradeoff. More information isn't always better for LLMs, as each token adds distraction, cost, and latency. There's no single optimum for all use cases.

For Codex, we're making 1M context experimentally available, but we're not making it the default experience for everyone, as from our testing we think that shorter context plus compaction works best for most people. If anyone here wants to try out 1M, you can do so by overriding `model_context_window` and `model_auto_compact_token_limit`.
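
A minimal sketch of what such an override might look like, rendered as flat `key = value` lines. The two key names come from the comment above. The values, and the assumption that the settings go in a TOML-style config file, are illustrative guesses and not confirmed settings.

```python
# Hypothetical values: the comment names the two keys but not what to set them to.
overrides = {
    "model_context_window": 1_000_000,          # assumed: the 1M window, in tokens
    "model_auto_compact_token_limit": 900_000,  # assumed: compact before the window fills
}


def to_toml(settings: dict[str, int]) -> str:
    """Render flat integer settings as TOML-style `key = value` lines."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items())


print(to_toml(overrides))
```

Keeping the compaction limit below the context window is a design guess here: it leaves headroom so compaction can trigger before the window is exhausted.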

Curious to hear if people have use cases where they find 1M works much better!

(I work at OpenAI.)

New comment by tedsanders in "GPT‑5.3 Instant"

tedsanders — Tue, 03 Mar 2026 19:11:19 +0000

Yeah, for a while ChatGPT Plus has been powered by two series of models under the hood.

One series is the Instant series, which is faster and more tuned to ChatGPT, but less accurate.

The second series is the Thinking series, which is more accurate and more tuned to professional knowledge work, but slower (because it uses more reasoning tokens).

We'd also prefer to have a simple experience with just one option, but picking just one would pull back the pareto frontier for some group of people/preferences. So for now we continue to serve two models, with manual control for people who want to choose and an imperfect auto switcher for people who don't want to be bothered. Could change down the road - we'll see.

(I work at OpenAI.)

New comment by tedsanders in "OpenAI agrees with Dept. of War to deploy models in their classified network"

tedsanders — Sat, 28 Feb 2026 09:48:42 +0000

The supply chain risk stuff is bogus. Anthropic is a great, trustworthy company, and no enemy of America. I genuinely root for Anthropic, because its success benefits consumers and all the charities that Anthropic employees have pledged equity toward.

Whether Anthropic’s clear mistreatment means that all other companies should refrain from doing business with the US government isn’t as clear to me. I can see arguments on both sides and I acknowledge it’s probably impossible to eliminate all possible bias within myself.

One thing I hope we can agree on is that it would be good if the contract (or its relevant portions) is made public so that people can judge for themselves, without having to speculate about who’s being honest and who’s lying.

New comment by tedsanders in "OpenAI agrees with Dept. of War to deploy models in their classified network"

tedsanders — Sat, 28 Feb 2026 06:58:03 +0000

I agree it makes little sense, and I think if all players were rational it never would have played out this way. My understanding is that there are other reasons (i.e., beyond differing red lines) that made the OpenAI deal more palatable, but unfortunately the information shared with me has not been made public so I won't comment on specifics. I know that's unsatisfying, but I hope it serves as some very mild evidence that it's not all a big fat lie.

New comment by tedsanders in "OpenAI agrees with Dept. of War to deploy models in their classified network"

tedsanders — Sat, 28 Feb 2026 06:27:15 +0000

I'm an OpenAI employee and I'll go out on a limb with a public comment. I agree AI shouldn't be used for mass surveillance or autonomous weapons. I also think Anthropic has been treated terribly and has acted admirably. My understanding is that the OpenAI deal disallows domestic mass surveillance and autonomous weapons, and that OpenAI is asking for the same terms for other AI companies (so that we can continue competing on the basis of differing services and not differing scruples). Given this understanding, I don't see why I should quit. If it turns out that the deal is being misdescribed or that it won't be enforced, I can see why I should quit, but so far I haven't seen any evidence that's the case.

New comment by tedsanders in "Detecting and Preventing Distillation Attacks"

tedsanders — Mon, 23 Feb 2026 20:45:41 +0000

The people who would otherwise be affected by spam calls, spam messages, ransomware / computer viruses, fake / deceptive websites, or bioengineered viruses.

The risk of these could plausibly increase in a world with powerful AI. Obviously the risk isn't high now, and there are benefits to trade off against these costs, but all powerful technologies have costs.

New comment by tedsanders in "Detecting and Preventing Distillation Attacks"

tedsanders — Mon, 23 Feb 2026 18:33:51 +0000

One consequence of creating a country of geniuses in a data center is that you now have a country of geniuses who can potentially help your competitors catch up on research, coding, and data labeling. It's a tough problem for the industry and, more importantly, for long-term safety.

We're obviously nowhere close now, but if we get to a world where AI becomes powerful, and powerful AI can be used to create misaligned powerful AI, you may have to start regulating powerful AI like refined uranium processing tech, which is regulated more heavily than refined uranium itself.

Why SWE-bench Verified no longer measures frontier coding capabilities

tedsanders — Mon, 23 Feb 2026 18:08:55 +0000

Article URL: https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/

Comments URL: https://news.ycombinator.com/item?id=47126205

Points: 10

# Comments: 0

New comment by tedsanders in "Uncovering insiders and alpha on Polymarket with AI"

tedsanders — Fri, 20 Feb 2026 22:09:42 +0000

Bribing employees to disclose confidential information entrusted to them is neither kosher nor wholesome. I consider corporate insider trading on these markets to be analogous - if you're an employee and you trade, you are selling your employer's info for money. Nearly every employer would fire employees caught giving away confidential information for personal bribes.

In the stock market, Matt Levine likes to say that insider trading is about theft, not fairness. You can be prosecuted for merely sharing info with a friend on a golf course who then proceeds to trade. Your crime is not trading (you didn't even trade), but misappropriating information you were entrusted with and not authorized to sell.