Hacker News: someguy101010

New comment by someguy101010 in "Codex is now in the ChatGPT mobile app"

someguy101010 — Fri, 15 May 2026 03:18:10 +0000

if i didn't have to prompt it to learn from its mistakes and it just "intuitively" knew to do that

Slowing Down My Coding Agents to Get More Done

someguy101010 — Wed, 13 May 2026 14:39:01 +0000

Article URL: https://www.robw.fyi/2026/05/11/slowing-down-my-coding-agents-to-get-more-done/

Comments URL: https://news.ycombinator.com/item?id=48122557

Points: 3

# Comments: 0

Coloring Code: How Compilers Use Graph Theory [video]

someguy101010 — Wed, 29 Apr 2026 15:15:40 +0000

Article URL: https://www.youtube.com/watch?v=K3mi2m7ccDQ

Comments URL: https://news.ycombinator.com/item?id=47949610

Points: 1

# Comments: 0

New comment by someguy101010 in "Launch HN: Vela (YC W26) – AI for complex scheduling"

someguy101010 — Thu, 05 Mar 2026 18:29:33 +0000

have built in this space, which led me to develop a minizinc mcp server [0] for scheduling bocce tournaments [1]. scheduling with constraints is an NP-hard problem, so it makes sense that people struggle. tools exist to solve this problem, but they are complex and hard to use for non-technical folks, and even for technical folks. am hoping a tool like this can bridge the gap and would like to bring it to your awareness if you aren't already thinking about the problem this way :)

edit: after reading a bit more of the description, it looks like y'all are taking a similar approach, kudos!

[0] https://github.com/r33drichards/minizinc-mcp

[1] https://github.com/r33drichards/bocce-scheduler
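to make the "scheduling with constraints" point above concrete, here is a rough sketch in plain Python (not MiniZinc, and not the code from either linked repo). the function name `schedule` and the rules it enforces (every pair of teams meets once, a team plays at most once per round, a round has a fixed number of courts) are assumptions chosen for illustration. it uses naive backtracking, which is fine for a handful of teams but blows up quickly as the tournament grows.

```python
from itertools import combinations


def schedule(teams, rounds, courts):
    """Assign every pairing to a round, or return None if no schedule fits.

    Illustrative constraints (assumed, not taken from the linked projects):
      - every pair of teams plays exactly once
      - a team plays at most one match per round
      - at most `courts` matches run in the same round
    """
    matches = list(combinations(teams, 2))
    slots = [[] for _ in range(rounds)]  # slots[r] = matches played in round r

    def place(i):
        # All matches placed: a valid schedule has been found.
        if i == len(matches):
            return True
        a, b = matches[i]
        for r in range(rounds):
            busy = {team for match in slots[r] for team in match}
            if len(slots[r]) < courts and a not in busy and b not in busy:
                slots[r].append((a, b))
                if place(i + 1):
                    return True
                slots[r].pop()  # undo the choice and try the next round
        return False

    return slots if place(0) else None
```

calling `schedule(["A", "B", "C", "D"], rounds=3, courts=2)` returns a list of three rounds, while `rounds=2` returns `None` because six matches cannot fit into four slots. a dedicated solver earns its keep once you add softer preferences (rest time, court fairness) on top of hard rules like these.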

New comment by someguy101010 in "When does MCP make sense vs CLI?"

someguy101010 — Sun, 01 Mar 2026 19:08:14 +0000

yep! that's the motivation behind https://github.com/r33drichards/mcp-js

I want to be able to give agents access to computation in a secure way without giving them full access to a computer

New comment by someguy101010 in "Show HN: A MitM proxy to see what your LLM tools are sending"

someguy101010 — Thu, 29 Jan 2026 02:01:09 +0000

Does this support bedrock?

Show HN: Cua-Bench – a benchmark for AI agents in GUI environments

someguy101010 — Mon, 26 Jan 2026 17:46:22 +0000

Hey HN, we're excited to share Cua-Bench ( https://github.com/trycua/cua ), an open-source framework for evaluating and training computer-use agents across different environments.

Computer-use agents show massive performance variance across different UIs—an agent with 90% success on Windows 11 might drop to 9% on Windows XP for the same task. The problem is OS themes, browser versions, and UI variations that existing benchmarks don't capture.

The existing benchmarks (OSWorld, Windows Agent Arena, AndroidWorld) were great but operated in silos—different harnesses, different formats, no standardized way to test the same agent across platforms. More importantly, they were evaluation-only. We needed environments that could generate training data and run RL loops, not just measure performance. Cua-Bench takes a different approach: it's a unified framework that standardizes environments across platforms and supports the full agent development lifecycle—benchmark, train, deploy.

With Cua-Bench, you can:

- Evaluate agents across multiple benchmarks with one CLI (native tasks + OSWorld + Windows Agent Arena adapters)

- Test the same agent on different OS variations (Windows 11/XP/Vista, macOS themes, Linux, Android via QEMU)

- Generate new tasks from natural language prompts

- Create simulated environments for RL training (shell apps like Spotify, Slack with programmatic rewards)

- Run oracle validations to verify environments before agent evaluation

- Monitor agent runs in real-time with traces and screenshots

All of this works on macOS, Linux, Windows, and Android, and is self-hostable.

To get started:

Install cua-bench:

% pip install cua-bench

Run a basic evaluation:

% cb run dataset datasets/cua-bench-basic --agent demo

Open the monitoring dashboard:

% cb run watch

For parallelized evaluations across multiple workers:

% cb run dataset datasets/cua-bench-basic --agent your-agent --max-parallel 8

Want to test across different OS variations? Just specify the environment:

% cb run task slack_message --agent your-agent --env windows_xp

% cb run task slack_message --agent your-agent --env macos_sonoma

Generate new tasks from prompts:

% cb task generate "book a flight on kayak.com"

Validate environments with oracle implementations:

% cb run dataset datasets/cua-bench-basic --oracle

The simulated environments are particularly useful for RL training—they're HTML/JS apps that render across 10+ OS themes with programmatic reward verification. No need to spin up actual VMs for training loops.

We're seeing teams use Cua-Bench for:

- Training computer-use models on mobile and desktop environments

- Generating large-scale training datasets (working with labs on millions of screenshots across OS variations)

- RL fine-tuning with shell app simulators

- Systematic evaluation across OS themes and browser versions

- Building task registries (collaborating with Snorkel AI on task design and data curation, similar to their Terminal-Bench work)

Cua-Bench is 100% open-source under the MIT license. We're actively developing it as part of Cua (https://github.com/trycua/cua), our Computer Use Agent SDK, and we'd love your feedback, bug reports, or feature ideas.

GitHub: https://github.com/trycua/cua

Docs: https://cua.ai/docs/cuabench

Technical Report: https://cuabench.ai

We'll be here to answer any technical questions and look forward to your comments!

Comments URL: https://news.ycombinator.com/item?id=46768906

Points: 40

# Comments: 8

Solve Hi-Q with AlphaZero and Curriculum Learning

someguy101010 — Mon, 29 Dec 2025 14:06:33 +0000

Article URL: https://www.robw.fyi/2025/12/28/solve-hi-q-with-alphazero-and-curriculum-learning/

Comments URL: https://news.ycombinator.com/item?id=46420882

Points: 1

# Comments: 0

New comment by someguy101010 in "Skills for organizations, partners, the ecosystem"

someguy101010 — Thu, 18 Dec 2025 17:57:11 +0000

Is it possible to provide an LLM a skill through the MCP resource feature?

New comment by someguy101010 in "Why Windows XP is the ultimate AI benchmark"

someguy101010 — Tue, 16 Dec 2025 16:31:03 +0000

as an infrastructure engineer the idea of being able to train computer use agents without provisioning infrastructure sounds amazing!

a common use case i run into is configuring corporate VPN software on windows machines. is there a link for a getting started guide i could try this out with?

New comment by someguy101010 in "Dagger: Define software delivery workflows and dev environments"

someguy101010 — Sun, 14 Dec 2025 14:30:02 +0000

have used it, and i do like it, but the licensing situation is not great. it's open source but it's not free software by any means.

New comment by someguy101010 in "The "confident idiot" problem: Why AI needs hard rules, not vibe checks"

someguy101010 — Mon, 08 Dec 2025 14:06:05 +0000

wrote about this a bit too in https://www.robw.fyi/2025/10/24/simple-control-flow-for-auto...

ran into this when writing agents to fix unit tests. they would often just give up early, so i started writing the verifiers directly into the agent's control flow, and this produced much more reliable results. i believe claude code has hooks that do something similar as well.
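a minimal sketch of what putting a verifier into the control flow can look like. the names here are hypothetical and not taken from the linked post: `attempt` stands in for one agent call, and `verify` stands in for something like a unit test run that returns a pass/fail flag plus feedback. the loop only returns once the verifier passes, so the agent cannot declare success on its own.

```python
from typing import Callable, Optional, Tuple


def run_until_verified(
    attempt: Callable[[Optional[str]], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Call the agent until the verifier passes or attempts run out.

    `attempt` receives the previous verifier feedback (None on the first
    call) and returns a candidate result. `verify` returns (ok, feedback).
    Both are hypothetical stand-ins for a real agent and a real test run.
    """
    feedback: Optional[str] = None
    for n in range(1, max_attempts + 1):
        result = attempt(feedback)
        ok, feedback = verify(result)
        if ok:
            # Success is decided by the verifier, not by the agent.
            return result, n
    raise RuntimeError(
        f"verifier still failing after {max_attempts} attempts: {feedback}"
    )
```

in practice `verify` might shell out to the test suite and pass the failure output back as feedback; that wiring is left out here.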

New comment by someguy101010 in "Isn't WSL2 just a VM?"

someguy101010 — Mon, 01 Dec 2025 20:01:36 +0000

clearly you have never worked in enterprise

New comment by someguy101010 in "Ghostty compiled to WASM with xterm.js API compatibility"

someguy101010 — Mon, 01 Dec 2025 19:16:10 +0000

nice one kyle! you could add https://github.com/wasmerio/webassembly.sh and have a fully featured in browser shell with support for installing packages!

New comment by someguy101010 in "The Thinking Game Film – Google DeepMind documentary"

someguy101010 — Sun, 30 Nov 2025 18:44:04 +0000

reposting this from youtube comment

From 1:14:55-1:15:20, within the span of 25 seconds, the way Demis spoke about releasing all known sequences without a shred of doubt was so amazing to see. There wasn't a single second where he worried about the business side of it (profits, earnings, shareholders, investors) —he just knew it had to be open source for the betterment of the world. Gave me goosebumps. I watched that on repeat for more than 10 times.

Simple Control Flow for Automatically Steering Agents

someguy101010 — Sun, 26 Oct 2025 16:48:16 +0000

Article URL: https://www.robw.fyi/2025/10/24/simple-control-flow-for-automatically-steering-agents/

Comments URL: https://news.ycombinator.com/item?id=45713265

Points: 1

# Comments: 0

New comment by someguy101010 in "Constraint satisfaction to optimize item selection for bundles in Minecraft"

someguy101010 — Sun, 12 Oct 2025 23:19:35 +0000

I opt for the greedy strategy in most game play scenarios for pretty much the reasons you described here. I was considering making a mod to perform this action for me and was looking for a more "correct" solution but greedy is way simpler and just as effective for most cases.

New comment by someguy101010 in "Constraint satisfaction to optimize item selection for bundles in Minecraft"

someguy101010 — Sun, 12 Oct 2025 22:22:52 +0000

Thanks for catching this!

Constraint satisfaction to optimize item selection for bundles in Minecraft

someguy101010 — Sun, 12 Oct 2025 18:31:51 +0000

Article URL: https://www.robw.fyi/2025/10/12/using-constraint-satisfaction-to-optimize-item-selection-for-bundles-in-minecraft/

Comments URL: https://news.ycombinator.com/item?id=45560535

Points: 41

# Comments: 11

New comment by someguy101010 in "Kitten TTS: 25MB CPU-Only, Open-Source Voice Model"

someguy101010 — Wed, 06 Aug 2025 02:49:17 +0000

if the people who develop and release these models were all optimizing for the same goals, they could converge on strategies or behaviors, without coordinating.