<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: bradhilton</title><link>https://news.ycombinator.com/user?id=bradhilton</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 08 Apr 2026 12:34:10 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=bradhilton" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by bradhilton in "“Captain Gains” on Capitol Hill"]]></title><description><![CDATA[
<p>The problem is that they have a lot of time to report their purchases. If they were required to report before they purchased, the problem would probably resolve itself.</p>
]]></description><pubDate>Wed, 03 Dec 2025 14:36:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=46134942</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=46134942</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46134942</guid></item><item><title><![CDATA[New comment by bradhilton in "“Captain Gains” on Capitol Hill"]]></title><description><![CDATA[
<p>The S&P 500 is probably the most popular investment in America, perhaps aside from housing. Wouldn't hurt to have lawmakers' fortunes broadly aligned with the whole market rather than narrowly aligned with specific corporations.</p>
]]></description><pubDate>Wed, 03 Dec 2025 14:33:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=46134912</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=46134912</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46134912</guid></item><item><title><![CDATA[New comment by bradhilton in "Show HN: ART – a new open-source RL framework for training agents"]]></title><description><![CDATA[
<p>Awesome! If you run into any problems or have questions, feel free to open an issue or drop by the Discord server [1].<p>[1] <a href="https://discord.gg/zbBHRUpwf4" rel="nofollow">https://discord.gg/zbBHRUpwf4</a></p>
]]></description><pubDate>Thu, 01 May 2025 11:01:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=43856094</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43856094</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43856094</guid></item><item><title><![CDATA[New comment by bradhilton in "Show HN: ART – a new open-source RL framework for training agents"]]></title><description><![CDATA[
<p>Hi, we don't have reliable documentation for the HTTP API endpoints yet, mostly because they are still subject to change.<p>However, to briefly provide some context: `/_train_model` returns a stream of line-delimited JSON objects, one per gradient step, as the model trains on the provided trajectories, so the client can monitor progress. The final version of this endpoint may offer both streaming & non-streaming responses, and/or potentially return a "training job" that can be polled instead.</p>
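<p>A minimal sketch of consuming that stream, assuming a local server; the base URL and request payload shape are assumptions, only the endpoint path comes from the comment:</p><pre><code>
import json
import requests

# POST the trajectories and read one JSON object per gradient step.
with requests.post(
    "http://localhost:8000/_train_model",            # assumed base URL
    json={"model": "my-agent", "trajectories": []},  # assumed payload shape
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line:
            step = json.loads(line)  # e.g. loss/metrics for this gradient step
            print(step)
</code></pre>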
]]></description><pubDate>Wed, 30 Apr 2025 19:03:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=43849385</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43849385</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43849385</guid></item><item><title><![CDATA[New comment by bradhilton in "Show HN: ART – a new open-source RL framework for training agents"]]></title><description><![CDATA[
<p>Contributor here, we developed the Agent Reinforcement Trainer (ART) library to make it easy to train LLMs for anything.<p>No callbacks or straitjacket flows. Instead we serve an OpenAI API-compatible endpoint that you can use as a drop-in replacement for any proprietary APIs you may be hitting.<p>After collecting responses from the inference API, you can tune the model with your own custom rewards and repeat the process as long as you like, until performance converges. We believe this level of flexibility will make it easier for you to train state-of-the-art models for your own use cases, much like Kyle's new email agent[1].<p>Also happy to answer any questions you have about the framework.<p>[1] <a href="https://openpipe.ai/blog/art-e-mail-agent">https://openpipe.ai/blog/art-e-mail-agent</a></p>
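<p>For a rough sense of the loop, here's a hedged sketch: the OpenAI client call is real, but the base URL, model name, prompts, and toy reward are assumptions, and the actual training-submission API is ART-specific:</p><pre><code>
from openai import OpenAI

# Assumed local ART-served endpoint; any OpenAI-compatible client works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
tasks = ["What is 2 + 2?"]  # your own task prompts

def my_reward_fn(text: str) -> float:  # toy reward; yours will differ
    return 1.0 if "4" in text else 0.0

completions = [
    client.chat.completions.create(
        model="my-agent",
        messages=[{"role": "user", "content": t}],
    )
    for t in tasks
]
rewards = [my_reward_fn(c.choices[0].message.content) for c in completions]
# Then submit (completions, rewards) back for a tuning step and repeat until
# performance converges; the exact submission call is ART-specific.
</code></pre>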
]]></description><pubDate>Wed, 30 Apr 2025 18:08:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=43848831</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43848831</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43848831</guid></item><item><title><![CDATA[New comment by bradhilton in "ART·E: how we built an email research agent that beats o3"]]></title><description><![CDATA[
<p>I could see training your own email agent being beneficial for products like this:<p><a href="https://x.com/advaitpaliwal/status/1913290027897131084" rel="nofollow">https://x.com/advaitpaliwal/status/1913290027897131084</a></p>
]]></description><pubDate>Tue, 29 Apr 2025 18:13:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=43836090</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43836090</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43836090</guid></item><item><title><![CDATA[New comment by bradhilton in "The Llama 4 herd"]]></title><description><![CDATA[
<p>I know Google DeepMind ran experiments with 10M-token contexts a while ago, but I think this will be the first legit, released 10M context window model.</p>
]]></description><pubDate>Sat, 05 Apr 2025 19:09:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=43596000</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43596000</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43596000</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”"]]></title><description><![CDATA[
<p>Yes, pedantically, it is! But as I said, everything's on a spectrum. Online-ish data can still work just fine.</p>
]]></description><pubDate>Fri, 07 Mar 2025 14:06:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=43290247</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43290247</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43290247</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue""]]></title><description><![CDATA[
<p>We used about 58 hours on 4xH100s and about 19 hours on 8xH100s to get the very best result with the 32B model. We trained for about another 16 hours before finishing the run, but we could have stopped earlier after it was apparent the model was regressing. Actual dollar costs are provider dependent.</p>
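<p>For a rough sense of scale: 58 h × 4 GPUs + 19 h × 8 GPUs = 232 + 152 = 384 H100-hours for the best checkpoint; at an illustrative $2–3 per H100-hour, that works out to roughly $770–$1,150, excluding the extra 16 hours at the end of the run.</p>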
]]></description><pubDate>Fri, 07 Mar 2025 14:04:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=43290228</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43290228</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43290228</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”"]]></title><description><![CDATA[
<p>Well, in this case there is a much more straightforward method with the same CP-SAT solver used to create the puzzles. This is more of a fun experiment to see if we can train LLMs to solve these kinds of logical deduction problems.</p>
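<p>For flavor, here's a minimal CP-SAT sketch, assuming Google OR-Tools; the toy clues are illustrative, not the actual Temporal Clue generator:</p><pre><code>
from ortools.sat.python import cp_model

model = cp_model.CpModel()
# Three suspects, each in a distinct room (0, 1, or 2).
rooms = [model.NewIntVar(0, 2, f"room_{s}") for s in ("mustard", "plum", "scarlet")]
model.AddAllDifferent(rooms)
model.Add(rooms[0] != 1)             # clue: Mustard was not in room 1
model.Add(rooms[1] == rooms[0] + 1)  # clue: Plum was one room after Mustard

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print([solver.Value(r) for r in rooms])  # a consistent assignment
</code></pre>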
]]></description><pubDate>Fri, 07 Mar 2025 04:42:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=43287410</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43287410</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43287410</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”"]]></title><description><![CDATA[
<p>Technically, yes: a gradient step is only an online step if the data was sampled from the exact same weights.<p>With our training recipe this can easily be done by accumulating the gradients across the entire batch and taking only one optimizer step before sampling more responses.<p>In our experiments, however, we found the advantages of doing multiple gradient steps outweighed any potential drift in policy.<p>Ultimately the online-ness of data is on a spectrum, and while more online data is better, other factors may be more important.</p>
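<p>A toy PyTorch sketch of the distinction; the model and loss here are stand-ins for the real policy and GRPO objective:</p><pre><code>
import torch

model = torch.nn.Linear(8, 2)  # stand-in for the policy model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
batch = [torch.randn(4, 8) for _ in range(16)]  # all sampled from current weights

optimizer.zero_grad()
for minibatch in batch:
    loss = model(minibatch).logsumexp(-1).mean()  # placeholder for the GRPO loss
    (loss / len(batch)).backward()  # accumulate gradients across the whole batch
optimizer.step()  # a single optimizer step keeps the next samples on-policy
</code></pre>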
]]></description><pubDate>Fri, 07 Mar 2025 04:13:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=43287341</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43287341</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43287341</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”"]]></title><description><![CDATA[
<p>The model is rewarded for accuracy. For each puzzle there are a few multiple-choice questions. If it got 1 out of 4 correct, for example, its reward would be 0.25.<p>Then group-relative advantages are calculated. If you have 16 different responses and the average accuracy is 0.5, then you subtract that from each reward and divide by the standard deviation. Say the standard deviation is also 0.25. Then the advantage for our example would be (0.25 - 0.5) / 0.25 = -1.<p>The advantages are then used to increase (or decrease) the probability of sampling those tokens again. Since our example was negative, we penalize the model for underperforming with that response.</p>
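<p>The same arithmetic in a few lines of Python; the four rewards are illustrative, chosen to reproduce the mean of 0.5 and standard deviation of 0.25 from the example above:</p><pre><code>
rewards = [0.25, 0.75, 0.25, 0.75]  # group of sampled responses
mean = sum(rewards) / len(rewards)  # 0.5
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5  # 0.25
advantages = [(r - mean) / std for r in rewards]
print(advantages)  # [-1.0, 1.0, -1.0, 1.0]
</code></pre>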
]]></description><pubDate>Fri, 07 Mar 2025 04:07:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=43287326</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43287326</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43287326</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”"]]></title><description><![CDATA[
<p>Yeah, the takeaway shouldn't be "our model is smarter," but that we were able to train weak models to perform as well as or better than the best models on this specific task. It depends on what you're doing, but sometimes that is enough.</p>
]]></description><pubDate>Fri, 07 Mar 2025 01:26:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=43286723</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43286723</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43286723</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue""]]></title><description><![CDATA[
<p>We updated the first paragraph to define the acronym. Thanks again for the feedback!</p>
]]></description><pubDate>Thu, 06 Mar 2025 23:05:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=43285942</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43285942</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43285942</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”"]]></title><description><![CDATA[
<p>Great question! So the dataset includes prompts and solutions, but no "gold" answer per se to use for SFT. You could sample responses from larger models and then train the smaller model on their answers, but as outlined in the benchmarks, there is still a lot of headroom on this task, and I wouldn't expect that to get the same results. At the very least you would probably want to do rejection sampling to discard bad results. It would definitely be a good experiment!</p>
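<p>A hedged sketch of what that rejection-sampling step could look like; the function names and threshold are placeholders, not anything we've implemented:</p><pre><code>
def build_sft_dataset(puzzles, sample_fn, score_fn, threshold=1.0):
    """Keep only larger-model responses whose verifier score clears the bar."""
    dataset = []
    for puzzle in puzzles:
        for response in sample_fn(puzzle, n=8):  # sample from the larger model
            if score_fn(puzzle, response) >= threshold:  # e.g. all answers correct
                dataset.append({"prompt": puzzle, "completion": response})
    return dataset
</code></pre>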
]]></description><pubDate>Thu, 06 Mar 2025 22:59:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=43285893</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43285893</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43285893</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue""]]></title><description><![CDATA[
<p>Great point! Thanks for the feedback.</p>
]]></description><pubDate>Thu, 06 Mar 2025 22:07:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=43285540</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43285540</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43285540</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”"]]></title><description><![CDATA[
<p>Yeah, it may help. In this paper[1], the author used a KL penalty of 0.01 for general tasks and 0.001 for mathematical ones. I tend to think it's probably not very important unless you're trying to optimize for human preferences.<p>As for response length, I think the model internalizes the logic and doesn't deliberate over its answers by generating additional context. I don't think this is necessarily good for general reasoning, but for a specific task it would cut down inference costs. Just depends on what you're optimizing for. To encourage more general reasoning, I think broader training and validation sets would be helpful.<p>[1] <a href="https://arxiv.org/html/2501.03262v1" rel="nofollow">https://arxiv.org/html/2501.03262v1</a></p>
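<p>For concreteness, a sketch of how such a KL penalty is commonly folded into the loss; this uses the k3 estimator popular in RLHF implementations and is an assumption about the form, not our actual training code:</p><pre><code>
import torch

def penalized_loss(pg_loss, logprobs, ref_logprobs, beta=0.001):
    # Unbiased k3 estimator of KL(policy || reference).
    log_ratio = ref_logprobs - logprobs
    kl = (log_ratio.exp() - 1 - log_ratio).mean()
    return pg_loss + beta * kl  # beta = 0.01 or 0.001 per the cited paper
</code></pre>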
]]></description><pubDate>Thu, 06 Mar 2025 22:05:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=43285525</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43285525</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43285525</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”"]]></title><description><![CDATA[
<p>We trained all the parameters. Those would definitely be interesting ablations. I would also like to see how much of a performance hit we would take with PEFT methods like LoRA.</p>
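<p>That LoRA ablation could start from something like this; the Hugging Face peft API is real, but the model name and hyperparameters are illustrative:</p><pre><code>
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # tiny fraction vs. full fine-tuning
</code></pre>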
]]></description><pubDate>Thu, 06 Mar 2025 21:58:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=43285466</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43285466</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43285466</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue""]]></title><description><![CDATA[
<p>No meaningful changes to the hyperparameters; we just changed the tasks per iteration to 16 and trained on the same first 16 training tasks each iteration.<p>We only tested this with the 14B model. You can see the run here:<p><a href="https://wandb.ai/bradhilton/rl-experiments/runs/062" rel="nofollow">https://wandb.ai/bradhilton/rl-experiments/runs/062</a><p>Performance peaked after 21 iterations at 45% accuracy instead of the final 59%, but that's still a significant increase from very few samples.</p>
]]></description><pubDate>Thu, 06 Mar 2025 21:46:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=43285379</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43285379</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43285379</guid></item><item><title><![CDATA[New comment by bradhilton in "Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue""]]></title><description><![CDATA[
<p>As for why they dropped <i>suddenly</i>, I don't really know. Sometimes models develop degenerate behaviors, but even when forking from the best checkpoint and lowering the learning rate or changing other hyperparameters, performance still drops. It's as if its fate was sealed many iterations ago.</p>
]]></description><pubDate>Thu, 06 Mar 2025 21:25:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=43285213</link><dc:creator>bradhilton</dc:creator><comments>https://news.ycombinator.com/item?id=43285213</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43285213</guid></item></channel></rss>