Hacker News: drothlis

New comment by drothlis in "Launch HN: Skyvern (YC S23) – open-source AI agent for browser automations"

drothlis — Fri, 25 Oct 2024 09:21:36 +0000

> Claude's ability to count pixels and interact with a screen using precise coordinate

I guess you mean its "Computer use" API that can (if I understand correctly) send mouse clicks at specific coordinates?

I got excited thinking Claude can finally do accurate object detection, but alas no. Here's its output:

> Looking at the image directly, the SPACE key appears near the bottom left of the keyboard interface, but I cannot determine its exact pixel coordinates just by looking at the image. I can see it's positioned below the letter grid and appears wider than the regular letter keys, but I apologize - I cannot reliably extract specific pixel coordinates from just viewing the screenshot.

This is 3.5 Sonnet (their most current model).

And they explicitly call out spatial reasoning as a limitation:

> Claude’s spatial reasoning abilities are limited. It may struggle with tasks requiring precise localization or layouts, like reading an analog clock face or describing exact positions of chess pieces.

--https://docs.anthropic.com/en/docs/build-with-claude/vision#...

Since 2022 I occasionally dip in and test this use-case with the latest models but haven't seen much progress on the spatial reasoning. The multi-modality has been a neat addition though.

New comment by drothlis in "Launch HN: GPT Driver (YC S21) – End-to-end app testing in natural language"

drothlis — Fri, 25 Oct 2024 08:14:26 +0000

I noticed in your demo it generated the prompt "tap on the 'Log in' button located directly below the 'Facebook Password' field".

Does your model consistently get the positions right (above, below, etc.)? Every time I play with ChatGPT, even GPT-4o, it can't do basic spatial reasoning. For example, here's a typical output (emphasis mine):

> If YouTube is to the upper *left* of ESPN, press "Up" once, then *"Right"* to move the focus.

(I test TV apps where the input is a remote control, rather than tapping directly on the UI elements.)
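For comparison, here is a minimal sketch of the answer I'd expect. It assumes the focus starts on ESPN, that the tiles sit on a simple grid, and that one key press moves the focus by one tile. The `presses` function and the grid coordinates are made up for illustration, not taken from any real TV app.

```python
def presses(focus, target):
    """Return the remote-control key presses that move focus to target.

    Positions are hypothetical (column, row) grid cells with the origin
    at the top-left, so a smaller row is higher up and a smaller column
    is further left.
    """
    (fx, fy), (tx, ty) = focus, target
    dx, dy = tx - fx, ty - fy
    vertical = ["Up"] * -dy if dy < 0 else ["Down"] * dy
    horizontal = ["Left"] * -dx if dx < 0 else ["Right"] * dx
    return vertical + horizontal


espn = (1, 1)
youtube = (0, 0)  # one tile up and one tile left of ESPN

print(presses(espn, youtube))
```

With YouTube to the upper left, the horizontal press has to be "Left", not "Right".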

New comment by drothlis in "The virtuous mean between time drunkenness and work martyrdom"

drothlis — Mon, 26 Feb 2024 12:29:51 +0000

Beautiful.

New comment by drothlis in "Automated Unit Test Improvement Using Large Language Models at Meta"

drothlis — Sat, 17 Feb 2024 08:37:06 +0000

https://en.wikipedia.org/wiki/Characterization_test

aka snapshot tests.

New comment by drothlis in "ViperGPT: Visual Inference via Python Execution for Reasoning"

drothlis — Mon, 20 Mar 2023 11:50:19 +0000

According to the ViperGPT paper their "ImagePatch.find()" uses GLIP.

According to the GLIP paper,† accuracy on a test set not seen during training is around 60%, so... neat demos, but whether it'll be reliable enough depends on your application.

† https://arxiv.org/abs/2206.05836

New comment by drothlis in "Even the Pylint codebase uses Ruff"

drothlis — Mon, 06 Mar 2023 09:30:36 +0000

Could you implement (some of) astroid's inference using stack graphs? [1],[2]

That would allow a lot of caching optimisations, as you can "index" each file in isolation.

[1]: https://github.blog/2021-12-09-introducing-stack-graphs/

[2]: https://github.com/github/stack-graphs

New comment by drothlis in "Show HN: Touca – a better alternative to snapshot testing"

drothlis — Tue, 28 Feb 2023 08:56:52 +0000

It side-steps the problem of git conflicts, I suppose. You'd have to use their tool (`touca diff`? I don't know if that exists) instead of `git diff`.

New comment by drothlis in "Show HN: Touca – a better alternative to snapshot testing"

drothlis — Tue, 28 Feb 2023 08:41:55 +0000

Some ideas I got from Jeremias Rößler's talk: https://t.co/xWtA58Q9q5

- Snapshot testing is like version-control but for the outputs rather than the inputs (source code).

- Asserts in traditional unit tests are like "block lists" specifying which changes aren't allowed. Instead, snapshot testing allows you to specify an "allow list" of acceptable differences (e.g. timestamps).
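A rough sketch of the "allow list" idea, assuming the only acceptable difference is an ISO 8601 UTC timestamp. The names `ALLOWED_DIFFERENCES`, `normalise` and `matches_snapshot` are invented for this example and don't come from Touca or any other snapshot-testing tool.

```python
import re

# Hypothetical allow list: text matching these patterns may differ between runs.
ALLOWED_DIFFERENCES = [
    re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"),  # ISO 8601 UTC timestamps
]


def normalise(output: str) -> str:
    """Replace every allowed difference with a fixed placeholder."""
    for pattern in ALLOWED_DIFFERENCES:
        output = pattern.sub("<ALLOWED>", output)
    return output


def matches_snapshot(actual: str, snapshot: str) -> bool:
    """True if the outputs are identical apart from allowed differences."""
    return normalise(actual) == normalise(snapshot)


snapshot = "2023-02-28T08:41:55Z INFO server started on port 8080"
same_behaviour = "2023-03-01T10:00:00Z INFO server started on port 8080"
changed_behaviour = "2023-03-01T10:00:00Z INFO server started on port 9090"

print(matches_snapshot(same_behaviour, snapshot))
print(matches_snapshot(changed_behaviour, snapshot))
```

Anything not on the allow list (here, the port number) shows up as a failure without anyone having written an assert for it.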

New comment by drothlis in "GPT is all you need for the back end"

drothlis — Tue, 24 Jan 2023 13:44:01 +0000

Obviously a sensationalised title, but it's a neat illustration of how you'd apply the language models of the future to real tasks.

GPT is all you need for the back end

drothlis — Tue, 24 Jan 2023 13:44:01 +0000

Article URL: https://github.com/TheAppleTucker/backend-GPT

Comments URL: https://news.ycombinator.com/item?id=34503418

Points: 252

# Comments: 264

New comment by drothlis in "Software testing, and why I'm unhappy about it"

drothlis — Thu, 19 Jan 2023 09:54:25 +0000

Think systems integrators and compliance tests. I would imagine that each of the individual systems being "integrated" does have its own unit tests, upstream, in its own repo.

New comment by drothlis in "Software testing, and why I'm unhappy about it"

drothlis — Tue, 17 Jan 2023 16:08:34 +0000

Some good ideas here for when your tests are in a separate repo from the system under test (GPUs/drivers/compilers in the case of the author, but it's applicable to a variety of industries).

Software testing, and why I'm unhappy about it

drothlis — Tue, 17 Jan 2023 16:06:09 +0000

Article URL: http://nhaehnle.blogspot.com/2023/01/software-testing-and-why-im-unhappy.html

Comments URL: https://news.ycombinator.com/item?id=34414193

Points: 78

# Comments: 73

New comment by drothlis in "Cross-Branch Testing"

drothlis — Mon, 16 Jan 2023 11:55:57 +0000

Related: I think it was Kernighan & Pike's "The Practice Of Programming" where I read the idea of testing a complex implementation by comparing its output against a simpler but less performant implementation.
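A small sketch of that idea, using maximum-subarray-sum as a stand-in problem (my choice of example, not one from the book): the fast version is Kadane's algorithm, and the reference is a brute-force loop that is slow but easy to trust.

```python
import random


def reference_max_subarray(xs):
    """Simple O(n^2) reference: try every non-empty contiguous slice."""
    best = xs[0]
    for i in range(len(xs)):
        total = 0
        for j in range(i, len(xs)):
            total += xs[j]
            best = max(best, total)
    return best


def fast_max_subarray(xs):
    """Kadane's algorithm, O(n): the implementation under test."""
    best = current = xs[0]
    for x in xs[1:]:
        current = max(x, current + x)
        best = max(best, current)
    return best


def check_against_reference(trials=1000, seed=0):
    """Compare both implementations on random inputs; return the trial count."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-10, 10) for _ in range(rng.randint(1, 20))]
        assert fast_max_subarray(xs) == reference_max_subarray(xs), xs
    return trials


check_against_reference()
```

The test needs no hand-written expected values: the simple implementation acts as the oracle, and a failing input is reported in the assertion message.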

New comment by drothlis in "Cross-Branch Testing"

drothlis — Mon, 16 Jan 2023 11:47:55 +0000

Interesting thought, somewhat related to the articles on "snapshot testing" that have been trending on HN lately.

Cross-Branch Testing

drothlis — Mon, 16 Jan 2023 11:47:30 +0000

Article URL: https://buttondown.email/hillelwayne/archive/cross-branch-testing/

Comments URL: https://news.ycombinator.com/item?id=34399691

Points: 2

# Comments: 2

New comment by drothlis in "“Expect tests” make test-writing feel like a REPL session"

drothlis — Mon, 16 Jan 2023 10:32:37 +0000

"Regression testing" can also refer to a process: When the QA team says they're doing regression testing, it means they're testing that existing functionality hasn't regressed (as opposed to testing a new feature).

I'm not particularly wedded to any of these terms, I'm just pointing out that "regression testing" has an established meaning, and it isn't snapshot testing (outside of certain industries, at least). I do find it amusing that one implementation of snapshot testing (https://pypi.org/project/pytest-regtest/) links to https://en.wikipedia.org/wiki/Regression_testing but that article doesn't describe snapshot testing at all! Maybe the article changed? Oh well, language changes too. ¯\_(ツ)_/¯

New comment by drothlis in "Ubuntu 22.04 LTS servers and phased apt updates"

drothlis — Sun, 15 Jan 2023 10:48:58 +0000

In the article they don't change /etc/machine-id, but APT::Machine-ID in apt.conf.

New comment by drothlis in "“Expect tests” make test-writing feel like a REPL session"

drothlis — Sat, 14 Jan 2023 21:26:56 +0000

https://approvaltests.com/

New comment by drothlis in "“Expect tests” make test-writing feel like a REPL session"

drothlis — Sat, 14 Jan 2023 17:10:54 +0000

...and my favourite term, "characterization test": https://en.wikipedia.org/wiki/Characterization_test

"Regression test" means something else, at least at the companies I've worked at: It means a test that was written after a defect was found in production, to ensure that the same defect doesn't happen again (that the fix doesn't "regress"). It can be a manual test or an automated test. https://en.wikipedia.org/wiki/Regression_testing