<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: kcorbitt</title><link>https://news.ycombinator.com/user?id=kcorbitt</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 02 May 2026 11:54:13 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=kcorbitt" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[Codex, File My Taxes. Make No Mistakes]]></title><description><![CDATA[
<p>Article URL: <a href="https://corbt.com/posts/codex-file-my-taxes-make-no-mistakes">https://corbt.com/posts/codex-file-my-taxes-make-no-mistakes</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47327098">https://news.ycombinator.com/item?id=47327098</a></p>
<p>Points: 5</p>
<p># Comments: 1</p>
]]></description><pubDate>Tue, 10 Mar 2026 18:33:14 +0000</pubDate><link>https://corbt.com/posts/codex-file-my-taxes-make-no-mistakes</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=47327098</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47327098</guid></item><item><title><![CDATA[A Pocket Guide to Surviving the Robot Apocalypse]]></title><description><![CDATA[
<p>Article URL: <a href="https://corbt.com/posts/a-pocket-guide-to-surviving-the-robot-apocalypse/">https://corbt.com/posts/a-pocket-guide-to-surviving-the-robot-apocalypse/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47040155">https://news.ycombinator.com/item?id=47040155</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Mon, 16 Feb 2026 20:50:03 +0000</pubDate><link>https://corbt.com/posts/a-pocket-guide-to-surviving-the-robot-apocalypse/</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=47040155</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47040155</guid></item><item><title><![CDATA[New comment by kcorbitt in "My AI Adoption Journey"]]></title><description><![CDATA[
<p>And lately, the sweet spot has been moving upwards every 6-8 weeks with the model release cycle.</p>
]]></description><pubDate>Thu, 05 Feb 2026 22:52:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=46906564</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=46906564</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46906564</guid></item><item><title><![CDATA[New comment by kcorbitt in "Do not mistake a resilient global economy for populist success"]]></title><description><![CDATA[
<p>Is it?</p>
]]></description><pubDate>Fri, 09 Jan 2026 07:05:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=46550890</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=46550890</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46550890</guid></item><item><title><![CDATA[New comment by kcorbitt in "Show HN: RULER – Easily apply RL to any agent"]]></title><description><![CDATA[
<p>Dang, hadn't seen that. Namespace collision strikes again.</p>
]]></description><pubDate>Fri, 11 Jul 2025 23:41:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=44537934</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=44537934</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44537934</guid></item><item><title><![CDATA[New comment by kcorbitt in "Show HN: RULER – Easily apply RL to any agent"]]></title><description><![CDATA[
<p>I really like RLPR for when you have a known-good answer to compare to as well!</p>
]]></description><pubDate>Fri, 11 Jul 2025 23:41:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=44537930</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=44537930</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44537930</guid></item><item><title><![CDATA[New comment by kcorbitt in "Show HN: RULER – Easily apply RL to any agent"]]></title><description><![CDATA[
<p>No, we don't currently do anything to correct for ordering. Theoretically we could judge several times with different orderings.<p>We could measure order bias really easily though; we just need to look at the average score by rollout position across many runs. I'll add that to my list of experiments!</p>
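<p>The measurement itself is just something like this (a quick sketch; assumes we've saved one score list per judging call, each ordered by the position the rollout appeared in):</p>
<pre><code>    from statistics import mean

    def avg_score_by_position(runs):
        # runs: list of score lists, one per judging call,
        # each ordered by rollout position within the judge prompt.
        return [mean(scores) for scores in zip(*runs)]
</code></pre>
<p>A flat result across positions would suggest little order bias; a consistent slope would mean rollouts get rewarded or penalized just for where they appear.</p>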
]]></description><pubDate>Fri, 11 Jul 2025 23:40:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=44537925</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=44537925</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44537925</guid></item><item><title><![CDATA[New comment by kcorbitt in "Show HN: RULER – Easily apply RL to any agent"]]></title><description><![CDATA[
<p>Thanks! If there are any topics that you'd find particularly interesting, let me know and I can try to find time. :)</p>
]]></description><pubDate>Fri, 11 Jul 2025 21:06:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=44536788</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=44536788</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44536788</guid></item><item><title><![CDATA[Show HN: RULER – Easily apply RL to any agent]]></title><description><![CDATA[
<p>Hey HN, Kyle here, one of the co-founders of OpenPipe.<p>Reinforcement learning is one of the best techniques for making agents more reliable, and has been widely adopted by frontier labs. However, adoption in the outside community has been slow because it's so hard to implement.<p>One of the biggest challenges when adapting RL to a new task is the need for a task-specific "reward function" (way of measuring success). This is often difficult to define, and requires either high-quality labeled data and/or significant domain expertise to generate.<p>RULER is a drop-in reward function that works across different tasks without any of that complexity.<p>It works by showing N trajectories to an LLM judge and asking it to rank them relative to each other. This sidesteps the calibration issues that plague most LLM-as-judge approaches. Combined with GRPO (which only cares about relative scores within groups), it just works (surprisingly well!).<p>We have a full writeup on the blog, including results on 4 production tasks. On all 4 tasks, small Qwen 2.5 models trained with RULER+GRPO beat the best prompted frontier model, despite being significantly smaller and cheaper to run. Surprisingly, they even beat models trained with hand-crafted reward functions on 3/4 tasks! <a href="https://openpipe.ai/blog/ruler">https://openpipe.ai/blog/ruler</a><p>Repo: <a href="https://github.com/OpenPipe/ART">https://github.com/OpenPipe/ART</a></p>
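<p>To make the core idea concrete, here's a rough sketch of the relative-scoring step (not the actual RULER code; the judge call and prompt format are placeholders):</p>
<pre><code>    import json

    def ruler_style_rewards(trajectories, judge):
        # Show all N rollouts to the judge in a single call and ask for
        # relative scores; GRPO only needs the within-group ordering.
        prompt = "Score each trajectory from 0 to 1 relative to the others."
        for i, t in enumerate(trajectories):
            prompt += f"\n\n[Trajectory {i}]\n{t}"
        prompt += f"\n\nReply with a JSON list of {len(trajectories)} floats."
        scores = json.loads(judge(prompt))  # judge() stands in for any LLM call
        assert len(scores) == len(trajectories)
        return scores
</code></pre>
<p>Because the judge only ever compares rollouts against each other, its absolute calibration doesn't matter, which is what makes this workable as a drop-in reward.</p>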
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44535078">https://news.ycombinator.com/item?id=44535078</a></p>
<p>Points: 81</p>
<p># Comments: 11</p>
]]></description><pubDate>Fri, 11 Jul 2025 17:47:36 +0000</pubDate><link>https://openpipe.ai/blog/ruler</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=44535078</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44535078</guid></item><item><title><![CDATA[New comment by kcorbitt in "Lossless LLM 3x Throughput Increase by LMCache"]]></title><description><![CDATA[
<p>Looks cool! With vLLM v1, prefix caching is enabled by default and seems quite performant. Is the advantage of LMCache the fact that you can offload to CPU and disk as well? How much is throughput/latency affected if you need to pull a large KV cache from disk/cpu instead of GPU RAM?<p>Also, how realistic would it be to share the KV cache across vllm nodes within a data center? It would be really nice to be able to freely distribute requests to a pool of vLLM workers without worrying about prefix-aware routing, but maybe that isn't the right approach because moving the KV cache around would be too slow?</p>
]]></description><pubDate>Sat, 28 Jun 2025 14:57:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=44405139</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=44405139</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44405139</guid></item><item><title><![CDATA[New comment by kcorbitt in "Fault Tolerant Llama training"]]></title><description><![CDATA[
<p>I was curious about this so I had o3 do a bit of research. Turns out 300 L40s have more compute than any supercomputer before 2013 (and arguably before 2016, depending on how you count reduced-precision FLOPs).<p><a href="https://chatgpt.com/share/685dea79-26ec-8002-bd62-7ed83aedf4a5" rel="nofollow">https://chatgpt.com/share/685dea79-26ec-8002-bd62-7ed83aedf4...</a></p>
]]></description><pubDate>Fri, 27 Jun 2025 00:49:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=44392891</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=44392891</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44392891</guid></item><item><title><![CDATA[New comment by kcorbitt in "Self-Adapting Language Models"]]></title><description><![CDATA[
<p>The real answer is that nobody trusts their automated evals enough to be confident that any given automatically-trained release actually improves performance, even if eval scores go up. So for now everyone batches up updates and vibe-checks them before rolling them out.</p>
]]></description><pubDate>Fri, 13 Jun 2025 23:26:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=44273151</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=44273151</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44273151</guid></item><item><title><![CDATA[Everything I know about reward hacking]]></title><description><![CDATA[
<p>Article URL: <a href="https://openpipe.ai/blog/reward-hacking">https://openpipe.ai/blog/reward-hacking</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44260189">https://news.ycombinator.com/item?id=44260189</a></p>
<p>Points: 3</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 12 Jun 2025 17:14:46 +0000</pubDate><link>https://openpipe.ai/blog/reward-hacking</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=44260189</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44260189</guid></item><item><title><![CDATA[New comment by kcorbitt in "Look Ma, No Bubbles: Designing a Low-Latency Megakernel for Llama-1B"]]></title><description><![CDATA[
<p>It seems like the speedups here are most useful for small models, since on larger models a smaller fraction of the total time would be spent swapping between kernels? Would be interesting to see at least theoretical results for LLMs in the 14-70B parameter range, which is what most folks deploy in practice.<p>And of course the effect on throughput at larger batch sizes, which they allude to at the end.<p>Overall a very interesting result!</p>
]]></description><pubDate>Wed, 28 May 2025 02:22:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=44112248</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=44112248</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44112248</guid></item><item><title><![CDATA[New comment by kcorbitt in "Sorry, grads: Entry-level tech jobs are getting wiped out"]]></title><description><![CDATA[
<p>There are <i>many</i> industries where you need lots of experience before you're a net contributor to productivity. This is true for everything from hairdressers to doctors. We have ways of dealing with this (e.g. taking out loans to undergo years of training).<p>The problem comes if the number of years of experience you need to outperform the frontier AI models grows by more than one year per year, which is not out of the question.</p>
]]></description><pubDate>Thu, 22 May 2025 14:22:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=44062358</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=44062358</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44062358</guid></item><item><title><![CDATA[New comment by kcorbitt in "Gemma 3n preview: Mobile-first AI"]]></title><description><![CDATA[
<p>I wonder if they've trained the model to operate with a shallower stack; e.g. the full model may be composed of 24 transformer blocks, but they've also trained it to accept embeddings at layer 8, so it can be operated with just 16 transformer blocks on lower-resourced devices.<p>Experimenters in the open source tinkering community have done the opposite (copy/pasting layers in existing models to make them deeper) and it seems to work... fine, with minimal post-training on the new, deeper model required to exceed the performance of the original model. So it's not a crazy idea.</p>
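<p>In pseudocode, the mechanism I'm speculating about is just this (illustrative only; not Gemma's actual implementation):</p>
<pre><code>    def forward_from(x, blocks, entry_layer=0):
        # Run only a suffix of the transformer stack; a model trained
        # to accept embeddings at entry_layer can skip the earlier blocks.
        for block in blocks[entry_layer:]:
            x = block(x)
        return x
</code></pre>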
]]></description><pubDate>Tue, 20 May 2025 20:46:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=44045751</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=44045751</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44045751</guid></item><item><title><![CDATA[New comment by kcorbitt in "Windsurf SWE-1: Our First Frontier Models"]]></title><description><![CDATA[
<p>It's very unlikely that they're doing their own pre-training, which is the longest and most expensive part of creating a frontier model (if they were, they'd likely brag about it).<p>Most likely they built this as a post-train of an open model that is already strong on coding like Qwen 2.5.</p>
]]></description><pubDate>Fri, 16 May 2025 06:18:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=44002309</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=44002309</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44002309</guid></item><item><title><![CDATA[New comment by kcorbitt in "The unreasonable effectiveness of an LLM agent loop with tool use"]]></title><description><![CDATA[
<p>For "that last 10% of reliability" RL is actually working pretty well right now too! <a href="https://openpipe.ai/blog/art-e-mail-agent">https://openpipe.ai/blog/art-e-mail-agent</a></p>
]]></description><pubDate>Thu, 15 May 2025 21:40:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=43999593</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=43999593</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43999593</guid></item><item><title><![CDATA[New comment by kcorbitt in "Show HN: ART – a new open-source RL framework for training agents"]]></title><description><![CDATA[
<p>Ok good questions here.<p>By fine-tuning in this context I assume you mean "supervised fine-tuning", or SFT. SFT trains a model to produce a specific string of output tokens, given an input. With SFT, if you were trying to train an assistant to solve math problems using a code interpreter, you might train it on a dataset that looks like:<p><pre><code>    input: 'What is 934+1208'  
    output: `print(934+1208)`

    input: 'how many "r"s in strawberry'
    output: `print(len([l for l in "strawberry" if l == 'r']))`
</code></pre>
etc, etc.<p>RL, on the other hand, just means training a model not to produce a concrete string of output tokens, but rather to create an output that maximizes some reward function (you get to decide on the reward).<p>For the example above, you might create the following dataset for RL training:<p><pre><code>    input: 'What is 934+1208'
    ground_truth: 2142

    input: 'how many "r"s in strawberry'
    ground_truth: 3
</code></pre>
You would then train the model to write python code that produces the ground_truth output. Your training code would take the model's output, run the python it produced, and then check whether the output matches the expected ground_truth. Importantly, this doesn't require you to actually write the code to solve the problem (you don't even have to know if it's solvable, technically!). Over time, the training loop would make the model more likely to produce outputs that get high rewards, which hopefully means it gets better at producing valid and applicable python.<p>This is useful in lots of domains where it's easier to check the answer than actually produce it. In the blog post[1] linked above, we train the agent to effectively use keyword search to try to find the correct emails in an inbox. As the model trainer, I didn't actually know what the right strategy was to choose keywords that would most quickly find the relevant email, but through training with RL, the model was able to figure it out on its own!<p>[1]: <a href="https://openpipe.ai/blog/art-e-mail-agent?refresh=1746030513873">https://openpipe.ai/blog/art-e-mail-agent?refresh=1746030513...</a></p>
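<p>Concretely, the reward function for the example above can be as simple as this sketch (run_in_sandbox is a stand-in for whatever sandboxed code runner your training setup provides):</p>
<pre><code>    def reward(model_output, ground_truth):
        # Execute the python the model wrote and compare its printed
        # output to the known-good answer: 1.0 on a match, else 0.0.
        try:
            printed = run_in_sandbox(model_output)  # hypothetical sandboxed runner
        except Exception:
            return 0.0
        return 1.0 if printed.strip() == str(ground_truth) else 0.0
</code></pre>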
]]></description><pubDate>Wed, 30 Apr 2025 19:30:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=43849667</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=43849667</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43849667</guid></item><item><title><![CDATA[New comment by kcorbitt in "Show HN: ART – a new open-source RL framework for training agents"]]></title><description><![CDATA[
<p>Figured now was a good time to post this since we recently got surprisingly good results on training an email research agent. Link is above, but will put it here as well since I think it's a good example of RL's promise: <a href="https://openpipe.ai/blog/art-e-mail-agent">https://openpipe.ai/blog/art-e-mail-agent</a></p>
]]></description><pubDate>Wed, 30 Apr 2025 17:47:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=43848552</link><dc:creator>kcorbitt</dc:creator><comments>https://news.ycombinator.com/item?id=43848552</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43848552</guid></item></channel></rss>