<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: fiso64</title><link>https://news.ycombinator.com/user?id=fiso64</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Fri, 03 Jul 2026 00:57:07 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=fiso64" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by fiso64 in "Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers"]]></title><description><![CDATA[
<p>That's why we should simulate changing requirements, for example with an LLM roleplaying as a human who's co-developing with an agent. Simply asking the LLM to add one big feature is not enough. I don't see why we shouldn't be able to build a more advanced benchmark. Attempting to benchmark "taste" is not the way.</p>
]]></description><pubDate>Thu, 02 Jul 2026 15:59:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=48763479</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=48763479</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48763479</guid></item><item><title><![CDATA[New comment by fiso64 in "Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers"]]></title><description><![CDATA[
<p>Yes, it is relevant and testable. It's exactly what I meant by "a measurable increase in quality of the final product". In fact, a proper test harness would reveal that problem. You are forgetting that with LLMs, testing software does not have to end at the usual unit/integration/e2e level.</p>
]]></description><pubDate>Thu, 02 Jul 2026 12:03:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=48760109</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=48760109</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48760109</guid></item><item><title><![CDATA[New comment by fiso64 in "Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers"]]></title><description><![CDATA[
<p>Yes, please do leave. The thing is that this isn't even necessarily about software engineering as much as it is about benchmarking/epistemology in general.</p>
]]></description><pubDate>Thu, 02 Jul 2026 12:02:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=48760103</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=48760103</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48760103</guid></item><item><title><![CDATA[New comment by fiso64 in "Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers"]]></title><description><![CDATA[
<p>>Maintainability is important because you can never know if a feature will be built upon in the future or not.<p>Of course maintainability is important. It's almost like saying good code is important (duh). The issue is that what is or isn't maintainable depends on the problem at hand. Sometimes you need to build heavier abstractions or refactor existing code when implementing a feature because it will pay off later. Other times, that exact same approach is horrible over-engineering because a simple, direct fix was all that was needed, so in fact you introduced a maintenance burden. You cannot reliably decide whether a patch is "bloated" or "tasteful" when looking at a diff without knowing where the project is headed.<p>>You can write extremely poor code that has no bugs, it doesn't make it tasteful.<p>You can, but it becomes increasingly hard to do so as you try to add features and maintain it. Taste, whatever that is, should ultimately lead to a measurable increase in the quality of the final product; if it doesn't, then your definition of "taste" is <i>irrelevant</i>. What I'm proposing is to skip trying to measure this ill-defined concept and only assess the quality of the final product, after the agent spent a significant amount of time working on it, and a reviewer spent a significant amount of time testing it. Agents should be assessed on their ability to build entire projects (e.g., many large features or even an entire app), not just a single feature. If an agent has no taste, then its bad decisions will compound and result in it stalling, or its output having more bugs and performing worse, given a sufficiently large scope.</p>
]]></description><pubDate>Thu, 02 Jul 2026 10:12:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=48759066</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=48759066</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48759066</guid></item><item><title><![CDATA[New comment by fiso64 in "Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers"]]></title><description><![CDATA[
<p>I think benchmarks like this are too subjective and narrow to be useful. For example, whether a patch "bloats" the codebase really depends on the situation: if it's building a feature that will grow in the future, or refactoring code that has a long history of bugs, then a larger patch might in fact be good. It's not clear from the blog just how much context the LLM judge receives about the long-term project goals and history. Benchmarks should be focused on evaluating the final result only. Maybe ask the coder to build a full app, or implement many new large features for an existing app in sequence, with a larger set of requirements, or have another LLM roleplay as the human to make the instructions a little more underspecified. When done, ask a reviewer harness to test the product for 5 hours, not the code. Count the number of bugs and weight them by severity. "Taste" would then become an automatic consequence of correctness.<p>(Full disclosure, I'm not a software engineer.)</p>
]]></description><pubDate>Thu, 02 Jul 2026 07:07:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=48757602</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=48757602</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48757602</guid></item><item><title><![CDATA[New comment by fiso64 in "DeepSWE: A contamination-free benchmark for long-horizon coding agents"]]></title><description><![CDATA[
<p>The fact that Claude and GPT 5.5 have nearly the same scores tells me your benchmark is not capturing a significant gap in capability between these two. What the linked page says about Claude is true in my experience: it frequently forgets important instructions and likes to take lazy shortcuts. GPT, by contrast, is much more attentive and takes its time when needed to deliver a complete and robust solution. I have tested both models on two private repos (C#, Go) on two long-horizon tasks with well-defined stop conditions and observed the same pattern in both cases. Both models still require a large harness to reduce shortcuts and architecturally unclean code, but GPT performs much better, to the point where I find Claude unusable for any significant work.</p>
]]></description><pubDate>Wed, 27 May 2026 06:50:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=48290611</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=48290611</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48290611</guid></item><item><title><![CDATA[New comment by fiso64 in "TeX Live 2026 is available for download now"]]></title><description><![CDATA[
<p>Worth noting that LLMs are very bad at writing CeTZ code, even if you try to feed them all the docs. I had to use TikZ and import the resulting PDFs for some of the more complex illustrations in my thesis.</p>
]]></description><pubDate>Fri, 06 Mar 2026 08:23:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=47272395</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=47272395</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47272395</guid></item><item><title><![CDATA[New comment by fiso64 in "There Is No Future for Online Safety Without Privacy and Security"]]></title><description><![CDATA[
<p>How do you prevent people from using their keys to set up servers that remotely provide tokens to anyone?</p>
]]></description><pubDate>Tue, 23 Dec 2025 16:58:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=46366903</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=46366903</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46366903</guid></item><item><title><![CDATA[New comment by fiso64 in "Garfield's proof of the Pythagorean Theorem"]]></title><description><![CDATA[
<p>I don't get his "modern" proof. Specifically the step where he says "it's easy to see geometrically that these matrices differ by a rotation" seems to be doing a lot of heavy lifting. The first matrix transforms e1 to (a,-b), the second scales e1 to (c,0). If you can see that you obtain one of these vectors by rotating the other, then you've shown that their lengths are equal (i.e. a²+b²=c²), which is what we want to show in the first place.</p>
]]></description><pubDate>Sat, 29 Nov 2025 11:11:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=46086683</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=46086683</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46086683</guid></item><item><title><![CDATA[New comment by fiso64 in "What we talk about when we talk about sideloading"]]></title><description><![CDATA[
<p>And if you do have root, there is a good chance you're blocked from using common services on your phone such as mobile banking.</p>
]]></description><pubDate>Wed, 29 Oct 2025 07:58:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=45743939</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=45743939</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45743939</guid></item><item><title><![CDATA[New comment by fiso64 in "Typst 0.14"]]></title><description><![CDATA[
<p>I have a laptop with a good-ish CPU that is only a few years old, and on page 3 tinymist is already starting to struggle. There is a noticeable input delay between me pressing a key on the keyboard, and the key getting typed & the preview updating. I think it's more of a tinymist issue though, as it has no debouncing and apparently also runs the preview updates on the same thread as vscode's input handling.</p>
]]></description><pubDate>Sun, 26 Oct 2025 10:56:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=45710740</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=45710740</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45710740</guid></item><item><title><![CDATA[New comment by fiso64 in "4chan will refuse to pay daily online safety fines, lawyer tells BBC"]]></title><description><![CDATA[
<p>At this point your hypothesis is unfalsifiable. I was on /g/ before and after the hack and didn't detect any big changes. It was as shitty as ever.</p>
]]></description><pubDate>Fri, 22 Aug 2025 16:21:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=44986435</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=44986435</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44986435</guid></item><item><title><![CDATA[New comment by fiso64 in "Firing programmers for AI is a mistake"]]></title><description><![CDATA[
<p>>Causal models require machinery which is symbolic, which is able to generate hypotheses and test and prove statements about a world. LLMs are not yet capable of this and the fundamental architecture of the llm machine is not built for it.<p>Prove that the human brain does symbolic computation.</p>
]]></description><pubDate>Tue, 11 Feb 2025 17:54:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=43015844</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=43015844</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43015844</guid></item><item><title><![CDATA[New comment by fiso64 in "Right to root access"]]></title><description><![CDATA[
<p>I'm actually confused about why banks are so aggressive in denying users the ability to use their apps while rooted. Unlike with Google and Apple, I can't think of any financial incentive for this, and the security argument is quite obviously nonsense, as I don't think there has been a single person in history who managed to fall for a scam that made them follow the complicated procedure of rooting a smartphone. Nevertheless, there is a clear, continuous effort to develop new root detection methods to keep me from using their apps.</p>
]]></description><pubDate>Mon, 13 Jan 2025 13:56:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=42683441</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=42683441</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42683441</guid></item><item><title><![CDATA[New comment by fiso64 in "LineageOS 22"]]></title><description><![CDATA[
<p>>The last thing I want, or the bank wants, is some grandmother downloading the "Wells Fargo Bank Plus with Giant Legible Accessible Text" app she saw in an ad as an APK, installing it, and being a victim of silent fraud for years.<p>I don't think this happens nowadays. Android will either block by default or give you a million prompts and warnings before it allows you to install an APK from an unknown source. It's far, far easier to install it from Google Play. I don't think any grandmother would manage to accidentally ignore the first 3 pages of genuine links on Google and then push the right buttons that enable sideloading.</p>
]]></description><pubDate>Tue, 31 Dec 2024 17:59:29 +0000</pubDate><link>https://news.ycombinator.com/item?id=42560450</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=42560450</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42560450</guid></item><item><title><![CDATA[New comment by fiso64 in "Chain-of-thought can hurt performance on tasks where thinking makes humans worse"]]></title><description><![CDATA[
<p>A framing that is longer, far harder to parse, and carries less information.</p>
]]></description><pubDate>Thu, 31 Oct 2024 09:14:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=42004840</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=42004840</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42004840</guid></item><item><title><![CDATA[New comment by fiso64 in "Spain sentences 15 schoolchildren over AI-generated naked images"]]></title><description><![CDATA[
<p>>Each of the defendants was handed a year’s probation and ordered to attend classes on gender and equality awareness<p>Sounds like it has absolutely nothing to do with gender inequality and everything to do with a teenager's spiking hormones.</p>
]]></description><pubDate>Tue, 09 Jul 2024 19:00:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=40919594</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=40919594</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40919594</guid></item><item><title><![CDATA[New comment by fiso64 in "Pop culture has become an oligopoly (2022)"]]></title><description><![CDATA[
<p>This doesn't explain why this has started happening in the last 40 years for media types that have existed for far longer than that. Even movies have been around for long enough to build up a large catalogue and for investors to catch on, so why didn't we start seeing a rise in remakes and sequels until recently?</p>
]]></description><pubDate>Mon, 17 Jun 2024 17:16:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=40708200</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=40708200</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40708200</guid></item><item><title><![CDATA[New comment by fiso64 in "DOJ: Man sentenced to 14 years for posession of deepfake CSAM"]]></title><description><![CDATA[
<p>>I know people are sitting in jail for drawings<p>People are taking this for granted but I have yet to see an actual case. Every time this topic appeared on HN it also turned out that the offender had real CSAM on his devices as well.</p>
]]></description><pubDate>Sat, 18 May 2024 15:29:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=40399812</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=40399812</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40399812</guid></item><item><title><![CDATA[New comment by fiso64 in "DOJ: Man sentenced to 14 years for posession of deepfake CSAM"]]></title><description><![CDATA[
<p>>Is there any actual research on this? Surely psychologists must have studied this.<p>That's not an easy thing to study, not many pedophiles would be willing to participate in studies. Existing research mostly examines just the ones who have offended. There's some research about non-contact offenders though: <a href="https://sci-hub.se/https://journals.sagepub.com/doi/abs/10.1177/1079063215603690" rel="nofollow">https://sci-hub.se/https://journals.sagepub.com/doi/abs/10.1...</a><p>TLDR<p>> while some have argued that exposure to child pornography may promote contact sexual offending by validating and reinforcing attitudes surrounding the sexualization of children (Bourke & Hernandez, 2009), others have argued that child pornography acts as a substitute for contact offending, thereby preventing the direct sexual victimization of children (Riegel, 2004). Although plausible, such causative positions are yet to be directly examined or established within the existing empirical literature base, limiting the strength of these arguments.<p>> Nonetheless, the available evidence does not appear to support the idea of a direct causal relationship between child pornography and contact sexual offending, at least in the short-term. This is consistent with the findings of McCarthy (2010), who reported that the majority of dual offenders in her sample (84%) had committed contact sexual offenses prior to, rather than following, their involvement with child pornography. Furthermore, if child pornography directly promoted contact sexual offending, one would reasonably expect rates of contact sexual offending to have similarly increased over the last two decades (Glasgow, 2010). 
Fortunately, official crime statistics indicate that this has not been the case (Brennan, 2012; Motivans & Kyckelhahn, 2007; Victoria Police, 2014).<p>> Taken together, these findings suggest that although some CPOs do go on to commit sexual offenses against children, engaging in child pornography offending does not inevitably lead to the direct sexual victimization of children.</p>
]]></description><pubDate>Sat, 18 May 2024 12:51:27 +0000</pubDate><link>https://news.ycombinator.com/item?id=40398598</link><dc:creator>fiso64</dc:creator><comments>https://news.ycombinator.com/item?id=40398598</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40398598</guid></item></channel></rss>