Hacker News: fiso64

New comment by fiso64 in "Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers"

fiso64 — Thu, 02 Jul 2026 15:59:44 +0000

That's why we should simulate changing requirements, for example with an LLM roleplaying as a human who's co-developing with an agent. Simply asking the LLM to add one big feature is not enough. I don't see why we shouldn't be able to build a more advanced benchmark. Attempting to benchmark "taste" is not the way.

New comment by fiso64 in "Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers"

fiso64 — Thu, 02 Jul 2026 12:03:21 +0000

Yes, it is relevant and testable. It's exactly what I meant by "a measurable increase in quality of the final product". In fact, a proper test harness would reveal that problem. You are forgetting that with LLMs, testing software does not have to end at the usual unit/integration/e2e level.

New comment by fiso64 in "Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers"

fiso64 — Thu, 02 Jul 2026 12:02:30 +0000

Yes, please do leave. The thing is that this isn't even necessarily about software engineering as much as it is about benchmarking/epistemology in general.

New comment by fiso64 in "Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers"

fiso64 — Thu, 02 Jul 2026 10:12:37 +0000

>Maintainability is important because you can never know if a feature will be built upon in the future or not.

Of course maintainability is important. It's almost like saying good code is important (duh). The issue is that what is or isn't maintainable depends on the problem at hand. Sometimes you need to build heavier abstractions or refactor existing code when implementing a feature because it will pay off later. Other times, that exact same approach is horrible over-engineering because a simple, direct fix was all that was needed, so in fact you introduced a maintenance burden. You cannot reliably decide whether a patch is "bloated" or "tasteful" when looking at a diff without knowing where the project is headed.

>You can write extremely poor code that has no bugs, it doesn't make it tasteful.

You can, but it becomes increasingly hard to do so as you try to add features and maintain it. Taste, whatever that is, should ultimately lead to a measurable increase in the quality of the final product; if it doesn't, then your definition of "taste" is irrelevant. What I'm proposing is to skip trying to measure this ill-defined concept and only assess the quality of the final product, after the agent has spent a significant amount of time working on it and a reviewer has spent a significant amount of time testing it. Agents should be assessed on their ability to build entire projects (e.g., many large features or even an entire app), not just a single feature. If an agent has no taste, then given a sufficiently large scope its bad decisions will compound and result in it stalling, or in its output having more bugs and performing worse.

New comment by fiso64 in "Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers"

fiso64 — Thu, 02 Jul 2026 07:07:24 +0000

I think benchmarks like this are too subjective and narrow to be useful. For example, whether a patch "bloats" the codebase really depends on the situation: if it's building a feature that will grow in the future, or refactoring code that has a long history of bugs, then a larger patch might in fact be good. It's not clear from the blog just how much context the LLM judge receives about the long-term project goals and history. Benchmarks should focus on evaluating the final result only. Maybe ask the coder to build a full app, or implement many new large features for an existing app in sequence, with a larger set of requirements, or have another LLM roleplay as the human to make the instructions a little more underspecified. When done, ask a reviewer harness to spend 5 hours testing the product, not the code. Count the number of bugs and weight them by severity. "Taste" would then become an automatic consequence of correctness.

(Full disclosure, I'm not a software engineer.)
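The "count the bugs and weight them by severity" step above could look roughly like the sketch below. The severity labels, the weights, and the names `Bug`, `SEVERITY_WEIGHTS` and `weighted_bug_score` are all made up for illustration; nothing here comes from the benchmark being discussed.

```python
from dataclasses import dataclass

# Hypothetical severity weights; the comment above does not specify any.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 20}


@dataclass(frozen=True)
class Bug:
    description: str
    severity: str  # must be one of the keys in SEVERITY_WEIGHTS


def weighted_bug_score(bugs):
    """Sum the severity weights of all reported bugs; lower is better."""
    return sum(SEVERITY_WEIGHTS[bug.severity] for bug in bugs)


# Example reviewer report for one finished product (invented data).
report = [
    Bug("tooltip text is truncated", "minor"),
    Bug("export silently drops rows", "critical"),
    Bug("settings page crashes on reload", "major"),
]

print(weighted_bug_score(report))  # 1 + 20 + 5 = 26
```

Two agents given the same task list could then be compared on this single number, with the caveat that the choice of weights is itself a judgment call.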

New comment by fiso64 in "DeepSWE: A contamination-free benchmark for long-horizon coding agents"

fiso64 — Wed, 27 May 2026 06:50:19 +0000

The fact that Claude and GPT 5.5 have nearly the same scores tells me your benchmark is not capturing a significant gap in capability between these two. What the linked page says about Claude is true in my experience: it frequently forgets important instructions and likes to take lazy shortcuts. GPT, by contrast, is much more attentive and takes its time when needed to deliver a complete and robust solution. I have tested both models on two private repos (C#, Go) on two long-horizon tasks with well-defined stop conditions and observed the same pattern in both cases. Both models still require a large harness to reduce shortcuts and architecturally unclean code, but GPT performs much better, to the point where I find Claude unusable for any significant work.

New comment by fiso64 in "TeX Live 2026 is available for download now"

fiso64 — Fri, 06 Mar 2026 08:23:22 +0000

Worth noting that LLMs are very bad at writing CeTZ code, even if you try to feed them all the docs. I had to use TikZ and import the resulting PDFs for some of the more complex illustrations in my thesis.

New comment by fiso64 in "There Is No Future for Online Safety Without Privacy and Security"

fiso64 — Tue, 23 Dec 2025 16:58:39 +0000

How do you prevent people from using their keys to set up servers that remotely provide tokens to anyone?

New comment by fiso64 in "Garfield's proof of the Pythagorean Theorem"

fiso64 — Sat, 29 Nov 2025 11:11:19 +0000

I don't get his "modern" proof. Specifically the step where he says "it's easy to see geometrically that these matrices differ by a rotation" seems to be doing a lot of heavy lifting. The first matrix transforms e1 to (a,-b), the second scales e1 to (c,0). If you can see that you obtain one of these vectors by rotating the other, then you've shown that their lengths are equal (i.e. a²+b²=c²), which is what we want to show in the first place.
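Written out, the circularity looks like this. This is only a sketch of the step as described above: $R$ stands for the claimed rotation (so $R^{\top}R = I$), and the two vectors are the images of $e_1$ under the two matrices; the notation is assumed, not taken from the article.

```latex
\[
R \begin{pmatrix} a \\ -b \end{pmatrix} = \begin{pmatrix} c \\ 0 \end{pmatrix},
\qquad R^{\top} R = I
\;\Longrightarrow\;
\left\| \begin{pmatrix} a \\ -b \end{pmatrix} \right\|^{2}
= \left\| \begin{pmatrix} c \\ 0 \end{pmatrix} \right\|^{2}
\;\Longrightarrow\;
a^{2} + b^{2} = c^{2}.
\]
```

So asserting that such a rotation exists already amounts to asserting the theorem.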

New comment by fiso64 in "What we talk about when we talk about sideloading"

fiso64 — Wed, 29 Oct 2025 07:58:36 +0000

And if you do have root, there is a good chance you're blocked from using common services on your phone such as mobile banking.

New comment by fiso64 in "Typst 0.14"

fiso64 — Sun, 26 Oct 2025 10:56:56 +0000

I have a laptop with a good-ish CPU that is only a few years old, and on page 3 tinymist is already starting to struggle. There is a noticeable delay between me pressing a key and the character being typed and the preview updating. I think it's more of a tinymist issue though, as it has no debouncing and apparently also runs the preview updates on the same thread as VS Code's input handling.

New comment by fiso64 in "4chan will refuse to pay daily online safety fines, lawyer tells BBC"

fiso64 — Fri, 22 Aug 2025 16:21:49 +0000

At this point your hypothesis is unfalsifiable. I was on /g/ before and after the hack and didn't detect any big changes. It was as shitty as ever.

New comment by fiso64 in "Firing programmers for AI is a mistake"

fiso64 — Tue, 11 Feb 2025 17:54:48 +0000

>Causal models require machinery which is symbolic, which is able to generate hypotheses and test and prove statements about a world. LLMs are not yet capable of this and the fundamental architecture of the llm machine is not built for it.

Prove that the human brain does symbolic computation.

New comment by fiso64 in "Right to root access"

fiso64 — Mon, 13 Jan 2025 13:56:45 +0000

I'm actually confused about why banks are so aggressive in denying users the ability to use their apps while rooted. Unlike with Google and Apple, I can't think of any financial incentive for this, and the security argument is quite obviously nonsense, as I don't think there has been a single person in history who fell for a scam that made them follow the complicated procedure of rooting a smartphone. Nevertheless, there is a clear, continuous effort to develop new root detection methods to keep me from using their apps.

New comment by fiso64 in "LineageOS 22"

fiso64 — Tue, 31 Dec 2024 17:59:29 +0000

>The last thing I want, or the bank wants, is some grandmother downloading the "Wells Fargo Bank Plus with Giant Legible Accessible Text" app she saw in an ad as an APK, installing it, and being a victim of silent fraud for years.

I don't think this happens nowadays. Android will either block by default or give you a million prompts and warnings before it allows you to install an APK from an unknown source. It's far, far easier to install it from Google Play. I don't think any grandmother would manage to accidentally ignore the first 3 pages of genuine links on Google and then push the right buttons that enable sideloading.

New comment by fiso64 in "Chain-of-thought can hurt performance on tasks where thinking makes humans worse"

fiso64 — Thu, 31 Oct 2024 09:14:35 +0000

A framing that is longer, far harder to parse, and carries less information.

New comment by fiso64 in "Spain sentences 15 schoolchildren over AI-generated naked images"

fiso64 — Tue, 09 Jul 2024 19:00:48 +0000

>Each of the defendants was handed a year’s probation and ordered to attend classes on gender and equality awareness

Sounds like it has absolutely nothing to do with gender inequality and everything to do with a teenager's spiking hormones.

New comment by fiso64 in "Pop culture has become an oligopoly (2022)"

fiso64 — Mon, 17 Jun 2024 17:16:13 +0000

This doesn't explain why this has started happening in the last 40 years for media types that have existed for far longer than that. Even movies have been around for long enough to build up a large catalogue and for investors to catch on, so why didn't we start seeing a rise in remakes and sequels until recently?

New comment by fiso64 in "DOJ: Man sentenced to 14 years for possession of deepfake CSAM"

fiso64 — Sat, 18 May 2024 15:29:32 +0000

>I know people are sitting in jail for drawings

People are taking this for granted, but I have yet to see an actual case. Every time this topic appeared on HN, it also turned out that the offender had real CSAM on his devices as well.

New comment by fiso64 in "DOJ: Man sentenced to 14 years for possession of deepfake CSAM"

fiso64 — Sat, 18 May 2024 12:51:27 +0000

>Is there any actual research on this? Surely psychologists must have studied this.

That's not an easy thing to study; not many pedophiles would be willing to participate in studies. Existing research mostly examines just the ones who have offended. There's some research about non-contact offenders though: https://sci-hub.se/https://journals.sagepub.com/doi/abs/10.1...

TLDR

> while some have argued that exposure to child pornography may promote contact sexual offending by validating and reinforcing attitudes surrounding the sexualization of children (Bourke & Hernandez, 2009), others have argued that child pornography acts as a substitute for contact offending, thereby preventing the direct sexual victimization of children (Riegel, 2004). Although plausible, such causative positions are yet to be directly examined or established within the existing empirical literature base, limiting the strength of these arguments.

> Nonetheless, the available evidence does not appear to support the idea of a direct causal relationship between child pornography and contact sexual offending, at least in the short-term. This is consistent with the findings of McCarthy (2010), who reported that the majority of dual offenders in her sample (84%) had committed contact sexual offenses prior to, rather than following, their involvement with child pornography. Furthermore, if child pornography directly promoted contact sexual offending, one would reasonably expect rates of contact sexual offending to have similarly increased over the last two decades (Glasgow, 2010). Fortunately, official crime statistics indicate that this has not been the case (Brennan, 2012; Motivans & Kyckelhahn, 2007; Victoria Police, 2014).

> Taken together, these findings suggest that although some CPOs do go on to commit sexual offenses against children, engaging in child pornography offending does not inevitably lead to the direct sexual victimization of children.