<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: bradhilton</title><link>https://news.ycombinator.com/user?id=bradhilton</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 08 Apr 2026 12:34:10 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=bradhilton" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by bradhilton in "“Captain Gains” on Capitol Hill"]]></title><description><![CDATA[
<p>The problem is that they have a lot of time to report their purchases. If they were required to report before they purchased, the problem would probably resolve itself.</p>
]]></description><pubDate>Wed, 03 Dec 2025 14:36:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=46134942</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=46134942</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46134942</guid></item><item><title><![CDATA[New comment by bradhilton in "“Captain Gains” on Capitol Hill"]]></title><description><![CDATA[
<p>The S&P 500 is probably the most popular investment in America, perhaps aside from housing. Wouldn't hurt to have lawmakers' fortunes broadly aligned with the whole market rather than narrowly aligned with specific corporations.</p>
]]></description><pubDate>Wed, 03 Dec 2025 14:33:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=46134912</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=46134912</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46134912</guid></item><item><title><![CDATA[New comment by bradhilton in "Show HN: ART – a new open-source RL framework for training agents"]]></title><description><![CDATA[
<p>Awesome! If you run into any problems or have questions, feel free to open an issue or drop by the Discord server [1].<p>[1] <a href="https://discord.gg/zbBHRUpwf4" rel="nofollow">https://discord.gg/zbBHRUpwf4</a></p>
]]></description><pubDate>Thu, 01 May 2025 11:01:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=43856094</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43856094</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43856094</guid></item><item><title><![CDATA[New comment by bradhilton in "Show HN: ART – a new open-source RL framework for training agents"]]></title><description><![CDATA[
<p>Hi, we don't have reliable documentation for the HTTP API endpoints yet, mostly because they are still subject to change.<p>However, to briefly provide some context: `/_train_model` returns a stream of line-delimited JSON objects, one per gradient step, as the model trains on the provided trajectories, so the client can monitor progress. The final version of this endpoint may offer both streaming & non-streaming responses, and/or potentially return a "training job" that can be polled instead.</p>
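<p>A minimal sketch of consuming that stream, assuming a local server; the base URL and request payload shape are assumptions, only the endpoint path comes from the comment:</p><pre><code>
import json
import requests

# POST the trajectories and read one JSON object per gradient step.
with requests.post(
    "http://localhost:8000/_train_model",            # assumed base URL
    json={"model": "my-agent", "trajectories": []},  # assumed payload shape
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line:
            step = json.loads(line)  # e.g. loss/metrics for this gradient step
            print(step)
</code></pre>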
]]></description><pubDate>Wed, 30 Apr 2025 19:03:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=43849385</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43849385</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43849385</guid></item><item><title><![CDATA[New comment by bradhilton in "Show HN: ART – a new open-source RL framework for training agents"]]></title><description><![CDATA[
<p>Contributor here, we developed the Agent Reinforcement Trainer (ART) library to make it easy to train LLMs for anything.<p>No callbacks or straitjacket flows. Instead we serve an OpenAI API-compatible endpoint that you can use as a drop-in replacement for any proprietary APIs you may be hitting.<p>After collecting responses from the inference API, you can tune the model with your own custom rewards and repeat the process as long as you like, until performance converges. We believe this level of flexibility will make it easier for you to train state-of-the-art models for your own use cases, much like Kyle's new email agent[1].<p>Also happy to answer any questions you have about the framework.<p>[1] <a href="https://openpipe.ai/blog/art-e-mail-agent">https://openpipe.ai/blog/art-e-mail-agent</a></p>
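<p>For a rough sense of the loop, here's a hedged sketch: the OpenAI client call is real, but the base URL, model name, prompts, and toy reward are assumptions, and the actual training-submission API is ART-specific:</p><pre><code>
from openai import OpenAI

# Assumed local ART-served endpoint; any OpenAI-compatible client works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
tasks = ["What is 2 + 2?"]  # your own task prompts

def my_reward_fn(text: str) -> float:  # toy reward; yours will differ
    return 1.0 if "4" in text else 0.0

completions = [
    client.chat.completions.create(
        model="my-agent",
        messages=[{"role": "user", "content": t}],
    )
    for t in tasks
]
rewards = [my_reward_fn(c.choices[0].message.content) for c in completions]
# Then submit (completions, rewards) back for a tuning step and repeat until
# performance converges; the exact submission call is ART-specific.
</code></pre>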
]]></description><pubDate>Wed, 30 Apr 2025 18:08:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=43848831</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43848831</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43848831</guid></item><item><title><![CDATA[New comment by bradhilton in "ART·E: how we built an email research agent that beats o3"]]></title><description><![CDATA[
<p>I could see training your own email agent being beneficial for products like this:<p><a href="https://x.com/advaitpaliwal/status/1913290027897131084" rel="nofollow">https://x.com/advaitpaliwal/status/1913290027897131084</a></p>
]]></description><pubDate>Tue, 29 Apr 2025 18:13:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=43836090</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43836090</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43836090</guid></item><item><title><![CDATA[New comment by bradhilton in "The Llama 4 herd"]]></title><description><![CDATA[
<p>I know Google DeepMind ran experiments with 10M-token contexts a while ago, but I think this will be the first legit, released 10M context window model.</p>
]]></description><pubDate>Sat, 05 Apr 2025 19:09:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=43596000</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43596000</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43596000</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”"]]></title><description><![CDATA[
<p>Yes, pedantically, it is! But as I said, everything's on a spectrum. Online-ish data can still work just fine.</p>
]]></description><pubDate>Fri, 07 Mar 2025 14:06:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=43290247</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43290247</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43290247</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue""]]></title><description><![CDATA[
<p>We used about 58 hours on 4xH100s and about 19 hours on 8xH100s to get the very best result with the 32B model. We trained for about another 16 hours before finishing the run, but we could have stopped earlier after it was apparent the model was regressing. Actual dollar costs are provider dependent.</p>
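<p>For a rough sense of scale: 58 h × 4 GPUs + 19 h × 8 GPUs = 232 + 152 = 384 H100-hours for the best checkpoint; at an illustrative $2–3 per H100-hour, that works out to roughly $770–$1,150, excluding the extra 16 hours at the end of the run.</p>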
]]></description><pubDate>Fri, 07 Mar 2025 14:04:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=43290228</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43290228</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43290228</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”"]]></title><description><![CDATA[
<p>Well, in this case there is a much more straightforward method with the same CP-SAT solver used to create the puzzles. This is more of a fun experiment to see if we can train LLMs to solve these kinds of logical deduction problems.</p>
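<p>For flavor, here's a minimal CP-SAT sketch, assuming Google OR-Tools; the toy clues are illustrative, not the actual Temporal Clue generator:</p><pre><code>
from ortools.sat.python import cp_model

model = cp_model.CpModel()
# Three suspects, each in a distinct room (0, 1, or 2).
rooms = [model.NewIntVar(0, 2, f"room_{s}") for s in ("mustard", "plum", "scarlet")]
model.AddAllDifferent(rooms)
model.Add(rooms[0] != 1)             # clue: Mustard was not in room 1
model.Add(rooms[1] == rooms[0] + 1)  # clue: Plum was one room after Mustard

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print([solver.Value(r) for r in rooms])  # a consistent assignment
</code></pre>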
]]></description><pubDate>Fri, 07 Mar 2025 04:42:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=43287410</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43287410</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43287410</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”"]]></title><description><![CDATA[
<p>Technically, yes: a gradient step is only an online step if the data was sampled from the exact same weights.<p>With our training recipe this can easily be done by accumulating the gradients across the entire batch and taking only one optimizer step before sampling more responses.<p>In our experiments, however, we found the advantages of doing multiple gradient steps outweighed any potential drift in policy.<p>Ultimately the online-ness of data is on a spectrum, and while more online data is better, other factors may be more important.</p>
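<p>A toy PyTorch sketch of the distinction; the model and loss here are stand-ins for the real policy and GRPO objective:</p><pre><code>
import torch

model = torch.nn.Linear(8, 2)  # stand-in for the policy model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
batch = [torch.randn(4, 8) for _ in range(16)]  # all sampled from current weights

optimizer.zero_grad()
for minibatch in batch:
    loss = model(minibatch).logsumexp(-1).mean()  # placeholder for the GRPO loss
    (loss / len(batch)).backward()  # accumulate gradients across the whole batch
optimizer.step()  # a single optimizer step keeps the next samples on-policy
</code></pre>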
]]></description><pubDate>Fri, 07 Mar 2025 04:13:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=43287341</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43287341</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43287341</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”"]]></title><description><![CDATA[
<p>The model is rewarded for accuracy. For each puzzle there are a few multiple-choice questions. If it got 1 out of 4 correct, for example, its reward would be 0.25.<p>Then group-relative advantages are calculated. If you have 16 different responses and the average accuracy is 0.5, then you subtract that from each reward and divide by the standard deviation. Say the standard deviation is also 0.25. Then the advantage for our example would be (0.25 - 0.5) / 0.25 = -1.<p>The advantages are then used to increase (or decrease) the probability of sampling those tokens again. Since our example was negative, we penalize the model for underperforming with that response.</p>
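<p>The same arithmetic in a few lines of Python; the four rewards are illustrative, chosen to reproduce the mean of 0.5 and standard deviation of 0.25 from the example above:</p><pre><code>
rewards = [0.25, 0.75, 0.25, 0.75]  # group of sampled responses
mean = sum(rewards) / len(rewards)  # 0.5
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5  # 0.25
advantages = [(r - mean) / std for r in rewards]
print(advantages)  # [-1.0, 1.0, -1.0, 1.0]
</code></pre>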
]]></description><pubDate>Fri, 07 Mar 2025 04:07:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=43287326</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43287326</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43287326</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”"]]></title><description><![CDATA[
<p>Yeah, the takeaway shouldn't be "our model is smarter," but that we were able to train weak models to perform as well as or better than the best models on this specific task. It depends on what you're doing, but sometimes that is enough.</p>
]]></description><pubDate>Fri, 07 Mar 2025 01:26:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=43286723</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43286723</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43286723</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue""]]></title><description><![CDATA[
<p>We updated the first paragraph to define the acronym. Thanks again for the feedback!</p>
]]></description><pubDate>Thu, 06 Mar 2025 23:05:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=43285942</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43285942</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43285942</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”"]]></title><description><![CDATA[
<p>Great question! So the dataset includes prompts and solutions, but no "gold" answer per se to use for SFT. You could sample responses from larger models and then train the smaller model on their answers, but as outlined in the benchmarks, there is still a lot of headroom on this task, and I wouldn't expect that to get the same results. At the very least you would probably want to do rejection sampling to discard bad results. It would definitely be a good experiment!</p>
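<p>A hedged sketch of what that rejection-sampling step could look like; the function names and threshold are placeholders, not anything we've implemented:</p><pre><code>
def build_sft_dataset(puzzles, sample_fn, score_fn, threshold=1.0):
    """Keep only larger-model responses whose verifier score clears the bar."""
    dataset = []
    for puzzle in puzzles:
        for response in sample_fn(puzzle, n=8):  # sample from the larger model
            if score_fn(puzzle, response) >= threshold:  # e.g. all answers correct
                dataset.append({"prompt": puzzle, "completion": response})
    return dataset
</code></pre>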
]]></description><pubDate>Thu, 06 Mar 2025 22:59:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=43285893</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43285893</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43285893</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue""]]></title><description><![CDATA[
<p>Great point! Thanks for the feedback.</p>
]]></description><pubDate>Thu, 06 Mar 2025 22:07:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=43285540</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43285540</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43285540</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”"]]></title><description><![CDATA[
<p>Yeah, it may help. In this paper[1], the author used a KL penalty of 0.01 for general tasks and 0.001 for mathematical ones. I tend to think it's probably not very important unless you're trying to optimize for human preferences.<p>As for response length, I think the model internalizes the logic and doesn't deliberate over its answers by generating additional context. I don't think this is necessarily good for general reasoning, but for a specific task it would cut down inference costs. Just depends on what you're optimizing for. To encourage more general reasoning, I think broader training and validation sets would be helpful.<p>[1] <a href="https://arxiv.org/html/2501.03262v1" rel="nofollow">https://arxiv.org/html/2501.03262v1</a></p>
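<p>For concreteness, a sketch of how such a KL penalty is commonly folded into the loss; this uses the k3 estimator popular in RLHF implementations and is an assumption about the form, not our actual training code:</p><pre><code>
import torch

def penalized_loss(pg_loss, logprobs, ref_logprobs, beta=0.001):
    # Unbiased k3 estimator of KL(policy || reference).
    log_ratio = ref_logprobs - logprobs
    kl = (log_ratio.exp() - 1 - log_ratio).mean()
    return pg_loss + beta * kl  # beta = 0.01 or 0.001 per the cited paper
</code></pre>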
]]></description><pubDate>Thu, 06 Mar 2025 22:05:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=43285525</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43285525</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43285525</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”"]]></title><description><![CDATA[
<p>We trained all the parameters. Those would definitely be interesting ablations. I would also like to see how much of a performance hit we would take with PEFT methods like LoRA.</p>
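<p>That LoRA ablation could start from something like this; the Hugging Face peft API is real, but the model name and hyperparameters are illustrative:</p><pre><code>
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # tiny fraction vs. full fine-tuning
</code></pre>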
]]></description><pubDate>Thu, 06 Mar 2025 21:58:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=43285466</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43285466</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43285466</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue""]]></title><description><![CDATA[
<p>No meaningful changes to the hyperparameters; we just changed the tasks per iteration to 16 and trained on the same first 16 training tasks each iteration.<p>We only tested this with the 14B model. You can see the run here:<p><a href="https://wandb.ai/bradhilton/rl-experiments/runs/062" rel="nofollow">https://wandb.ai/bradhilton/rl-experiments/runs/062</a><p>Performance peaked after 21 iterations at 45% accuracy instead of the final 59%, but that's still a significant increase from very few samples.</p>
]]></description><pubDate>Thu, 06 Mar 2025 21:46:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=43285379</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43285379</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43285379</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue""]]></title><description><![CDATA[
<p>As for why they dropped <i>suddenly</i>, I don't really know. Sometimes models develop degenerate behaviors, but even when forking from the best checkpoint and lowering the learning rate or changing other hyperparameters, performance still drops. It's as if its fate was sealed many iterations ago.</p>
]]></description><pubDate>Thu, 06 Mar 2025 21:25:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=43285213</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43285213</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43285213</guid></item></channel></rss>