<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: mchonedev</title><link>https://news.ycombinator.com/user?id=mchonedev</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 06 May 2026 08:29:34 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=mchonedev" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by mchonedev in "GameStop makes $55.5B takeover offer for eBay"]]></title><description><![CDATA[
<p>If I understand correctly, I think the collectibles market is more in line with what GameStop is looking at here. They recently moved into trading cards, including grading services via PSA.</p>
]]></description><pubDate>Mon, 04 May 2026 10:49:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=48006992</link><dc:creator>mchonedev</dc:creator><comments>https://news.ycombinator.com/item?id=48006992</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48006992</guid></item><item><title><![CDATA[New comment by mchonedev in "Building an internal agent: Code-driven vs. LLM-driven workflows"]]></title><description><![CDATA[
<p>Not quite unit tests. Evals should be created by humans, as they are measuring the quality of the solution.<p>Let's take the example of the GitHub PR Slack bot from the blog post. I would expect 2-3 evals out of that.<p>Starting at the core, the first eval could be that, given a list of Slack messages, the AI correctly identifies the PRs and calls the correct tool to look up the status of each PR. None of this has to be real and the tool doesn't have to be called, but we can write a test, much like a unit test, that confirms the AI responds correctly in that instance.<p>Next, we can set up another scenario using effectively mocked history that shows what happens when the AI finds Slack messages with open PRs, with merged PRs, and with no PR links, and check again whether the AI adds the correct reaction given our expectations.<p>These are both deterministic, code-based evals that you could use to iterate on your solution.<p>The use for an LLM-as-a-Judge eval is more nuanced and usually there to measure subjective results. Things like: did the LLM make assumptions not present in the context window (hallucinate), or did it respond with something completely out of context? These should be simple yes or no questions that would be easy for a human but hard to code up as a deterministic test case.<p>Once you have your evals defined, you can begin running them with some regularity, and you're at a point where you can iterate on your prompts with a higher level of confidence than vibes.<p>Edit: I did want to share that if you can make something deterministic, you probably should. The Slack PR example is something I'd just handle with a simple script on a cron schedule, but it was an easy one to pull on as an example.</p>
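<p>A minimal sketch of what that first, deterministic eval could look like in code. Everything here is hypothetical: the <code>get_pr_status</code> tool name, the message fixtures, and the stub agent (a regex stands in for the real model call so the harness itself is runnable):</p>

```python
import re

# Fixture: mocked Slack messages, some containing PR links.
MESSAGES = [
    "Can someone review https://github.com/acme/app/pull/42 ?",
    "Lunch at noon?",
    "Follow-up on https://github.com/acme/app/pull/7",
]

def stub_agent(messages):
    """Stand-in for the real model call. A real eval would record the
    model's proposed tool calls; here a regex plays that role so the
    harness runs without an LLM."""
    calls = []
    for msg in messages:
        for pr in re.findall(r"github\.com/[\w-]+/[\w-]+/pull/(\d+)", msg):
            calls.append(("get_pr_status", int(pr)))
    return calls

def eval_identifies_prs(agent):
    """Deterministic, unit-test-style eval: did the agent find every PR
    and request the right tool for each one?"""
    calls = agent(MESSAGES)
    expected = [("get_pr_status", 42), ("get_pr_status", 7)]
    return calls == expected

print(eval_identifies_prs(stub_agent))  # True for the stub
```

<p>The point is the harness shape: fixtures in, recorded tool calls out, exact comparison. Swap the stub for a real model call and the same assertion becomes your eval.</p>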
]]></description><pubDate>Thu, 01 Jan 2026 23:50:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=46459572</link><dc:creator>mchonedev</dc:creator><comments>https://news.ycombinator.com/item?id=46459572</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46459572</guid></item><item><title><![CDATA[New comment by mchonedev in "Building an internal agent: Code-driven vs. LLM-driven workflows"]]></title><description><![CDATA[
<p>This is absolutely possible but likely not desirable for a large enough population of customers, such that current LLM inference providers don't offer it. You can get closer by lowering one variable: temperature. This is typically a floating point number 0-1 or 0-2. The lower this number, the less noise in responses, but even 0 still does not guarantee identical responses due to other sources of variability.<p>In response to the idea of iterative development, it is still possible, actually! You run something more akin to integration tests and measure the output against either deterministic processes or have an LLM judge its own output. These are called evals, and in my experience they are a pretty hard requirement for trusting deployed AI.</p>
]]></description><pubDate>Thu, 01 Jan 2026 22:30:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=46458877</link><dc:creator>mchonedev</dc:creator><comments>https://news.ycombinator.com/item?id=46458877</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46458877</guid></item><item><title><![CDATA[New comment by mchonedev in "Pixel 6a gets a mandatory Android update next week with battery reduction"]]></title><description><![CDATA[
<p>Oh goodie. My 6a has had terrible battery life (and actually overheated about a month and a half ago while charging), yet my phone doesn't qualify for the replacement for whatever reason.<p>This is my 3rd Google phone in a row with issues:<p>Nexus 6p - Had to have the battery replaced and then inexplicably died, dead to the world
Pixel 4a - Similar battery issues + a screen that physically fell out of the phone
Pixel 6a - Battery woes even BEFORE this upcoming update</p>
]]></description><pubDate>Wed, 02 Jul 2025 20:30:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=44448482</link><dc:creator>mchonedev</dc:creator><comments>https://news.ycombinator.com/item?id=44448482</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44448482</guid></item><item><title><![CDATA[New comment by mchonedev in "Claude Code does our releases now"]]></title><description><![CDATA[
<p>I know you used the /s, but it's quite common to believe that 0 temperature is deterministic. For others coming across this thread: it's not deterministic; it is simply less likely to return different tokens (it still absolutely will).</p>
]]></description><pubDate>Mon, 26 May 2025 17:23:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=44099490</link><dc:creator>mchonedev</dc:creator><comments>https://news.ycombinator.com/item?id=44099490</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44099490</guid></item><item><title><![CDATA[Is anyone else experiencing networking problems in Azure North Central US?]]></title><description><![CDATA[
<p>Since Friday afternoon, we've been experiencing some really odd issues in the NCUS region of Azure. Despite Microsoft's status page saying that everything is green, we're seeing odd behavior such as:<p>- TLS timeouts when pushing to ACR repositories<p>- Databricks queries taking far longer than they did last week<p>- ADO builds hanging indefinitely until killed<p>- Application Insights dashboards only half loading / the other half loading on a refresh<p>- Random, intermittent authentication issues in the Azure portal that resolve if retried immediately, without any change.<p>It feels like there are gremlins running around in Azure, but I don't have any proof outside of these experiences of mine and of other colleagues (who are working in different tenants).</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43002742">https://news.ycombinator.com/item?id=43002742</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Mon, 10 Feb 2025 17:25:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=43002742</link><dc:creator>mchonedev</dc:creator><comments>https://news.ycombinator.com/item?id=43002742</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43002742</guid></item></channel></rss>