Hacker News: monoid73

New comment by monoid73 in "Show HN: Browser AI agent platform designed for reliability"

monoid73 — Thu, 07 Aug 2025 18:04:50 +0000

for the hybrid workflows, curious how you decide which parts need AI reasoning vs. which can be hardcoded? is it adaptive or manual config?

Show HN: Open Operator Evals – real-world benchmarks for LLM web agents

monoid73 — Thu, 19 Jun 2025 13:03:56 +0000

We’ve open-sourced a benchmark for LLM-driven web agent setups.

It evaluates real-world tasks, like logging in, scraping dashboards, and submitting forms, using structured criteria: success rate, latency, and task reliability.
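To make those criteria concrete, here is a minimal sketch of how per-task numbers could be aggregated from repeated runs. This is not the repo's actual code: the `Run` record, the `summarize` helper, the task names, and the definition of "reliable" (every repeated run of a task succeeded) are all assumptions for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    task: str         # hypothetical task name, e.g. "login"
    success: bool     # did the agent complete the task on this run?
    latency_s: float  # wall-clock time for this run, in seconds


def summarize(runs):
    """Group runs by task and compute success rate, mean latency, reliability."""
    by_task = {}
    for r in runs:
        by_task.setdefault(r.task, []).append(r)

    report = {}
    for task, rs in by_task.items():
        report[task] = {
            "success_rate": mean([1.0 if r.success else 0.0 for r in rs]),
            "mean_latency_s": mean([r.latency_s for r in rs]),
            # Assumption: a task counts as reliable only if every run succeeded.
            "reliable": all(r.success for r in rs),
        }
    return report


# Made-up example data, not real benchmark results.
runs = [
    Run("login", True, 4.0),
    Run("login", True, 6.0),
    Run("submit_form", True, 10.0),
    Run("submit_form", False, 20.0),
]
report = summarize(runs)
```

Running each task more than once is what lets reliability be reported separately from a single pass/fail result.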

Everything is fully reproducible, with all outputs, logs, and evaluation data available.

https://github.com/nottelabs/open-operator-evals

Feedback, critiques, or contributions welcome :)

Comments URL: https://news.ycombinator.com/item?id=44318333

Points: 3

# Comments: 1

New comment by monoid73 in "Void: Open-source Cursor alternative"

monoid73 — Thu, 08 May 2025 18:26:32 +0000

Another one? People saw that $3B Windsurf money.

New comment by monoid73 in "Ask HN: How to find a job as Java software developer in USA?"

monoid73 — Sun, 27 Apr 2025 11:34:43 +0000

think the visa hurdle is the big one. even if you have a strong background, a lot of companies hesitate unless they already have an immigration pipeline set up. another angle could be looking for remote roles at US companies first, then trying to convert that into relocation later. a bit longer path but sometimes more realistic. good luck.

New comment by monoid73 in "Acquisitions, consolidation, and innovation in AI"

monoid73 — Thu, 24 Apr 2025 21:14:40 +0000

I think the UX of ChatGPT works because it's familiar, not because it's good. It lowers friction for new users but doesn't scale well for more complex workflows. If you're building anything beyond Q&A or simple tasks, you run into limitations fast. There's still plenty of space for apps that treat the model as a backend and build real interaction layers on top, especially for use cases that aren't served by a chat metaphor.

New comment by monoid73 in "AI is right about em-dashes"

monoid73 — Thu, 24 Apr 2025 21:11:44 +0000

funny enough, i started noticing em dashes mostly through using GPT. wasn’t really part of my writing before, but now i find them super useful for managing rhythm and flow. definitely earned their place — not because LLMs use them, but because they actually work. (says ChatGPT in response to this post)

New comment by monoid73 in "Teaching LLMs how to solid model"

monoid73 — Wed, 23 Apr 2025 20:30:54 +0000

this is one of the more compelling "LLM meets real-world tool" use cases i've seen. OpenSCAD makes a great testbed since it's text-based and deterministic, but i wonder what the limits are once you get into more complex assemblies or freeform surfacing.

curious if the real unlock long-term will come from hybrid workflows: LLMs proposing parameterized primitives, humans refining them in a UI, then LLMs iterating on the feedback. kind of like pair programming, but for CAD.

New comment by monoid73 in "Can a single AI model advance any field of science?"

monoid73 — Tue, 22 Apr 2025 20:52:16 +0000

exactly. hindsight bias makes it really hard to separate genuine inference from subtle prompt leakage. even framing the question can accidentally steer it toward the right answer. would be interesting to try with completely synthetic problems first just to test the method.

New comment by monoid73 in "Sapphire: Rust based package manager for macOS"

monoid73 — Tue, 22 Apr 2025 20:44:17 +0000

same here. brew’s been great historically but it’s gotten bloated and kinda slow. curious to see if sapphire can keep things lean without sacrificing compatibility.

New comment by monoid73 in "Local LLM inference – impressive but too hard to work with"

monoid73 — Mon, 21 Apr 2025 22:21:01 +0000

yeah, that'd be nice: some kind of self-bootstrapping system where you start with a strong cloud model, then fine-tune a smaller local one over time until it's good enough to take over. tricky part is managing quality drift and deciding when it's 'good enough' without tanking UX. edge hardware's catching up though, so feels more feasible by the day.
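rough sketch of what the 'good enough' check could look like, assuming you can score each local answer as agreeing or not with the cloud answer. `HandoffGate`, the window size, and the threshold are all made up for illustration, not taken from any real system:

```python
from collections import deque


class HandoffGate:
    """Rolling check of how often the local model agrees with the cloud model."""

    def __init__(self, window=50, threshold=0.9):
        self.window = window
        self.threshold = threshold
        # Only the most recent `window` comparisons are kept, so drift shows up.
        self.scores = deque(maxlen=window)

    def record(self, agrees):
        """Store one comparison between the local and cloud answers."""
        self.scores.append(1.0 if agrees else 0.0)

    def agreement(self):
        """Fraction of recent comparisons where the two models agreed."""
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    def use_local(self):
        """Hand off only with a full window at or above the threshold."""
        return len(self.scores) == self.window and self.agreement() >= self.threshold


gate = HandoffGate(window=4, threshold=0.75)
for agrees in [True, True, False, True]:
    gate.record(agrees)
ready = gate.use_local()  # 3 of the last 4 comparisons agreed
```

since the window keeps rolling, a run of disagreements drops the score back under the threshold and traffic goes back to the cloud model, which is one cheap way to catch quality drift.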