<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: Turn_Trout</title><link>https://news.ycombinator.com/user?id=Turn_Trout</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 09 Apr 2026 05:18:07 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=Turn_Trout" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by Turn_Trout in "System Card: Claude Mythos Preview [pdf]"]]></title><description><![CDATA[
<p>I agree that they called many things remarkably well! That doesn't change the fact that AI 2027 is not a thing which happened, so it isn't valid to point out "this killed us in AI 2027." There are many reasons to want to preserve CoT monitorability. Instead of AI 2027, I'd point to <a href="https://arxiv.org/html/2507.11473" rel="nofollow">https://arxiv.org/html/2507.11473</a>.</p>
]]></description><pubDate>Wed, 08 Apr 2026 15:31:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=47691608</link><dc:creator>Turn_Trout</dc:creator><comments>https://news.ycombinator.com/item?id=47691608</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47691608</guid></item><item><title><![CDATA[New comment by Turn_Trout in "System Card: Claude Mythos Preview [pdf]"]]></title><description><![CDATA[
<p>AI 2027 is not a real thing which happened. At best, it is informed speculation.</p>
]]></description><pubDate>Wed, 08 Apr 2026 01:27:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=47683624</link><dc:creator>Turn_Trout</dc:creator><comments>https://news.ycombinator.com/item?id=47683624</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47683624</guid></item><item><title><![CDATA[Automatic Alt Text Generation]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/alexander-turner/alt-text-llm">https://github.com/alexander-turner/alt-text-llm</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46019580">https://news.ycombinator.com/item?id=46019580</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Sun, 23 Nov 2025 00:23:32 +0000</pubDate><link>https://github.com/alexander-turner/alt-text-llm</link><dc:creator>Turn_Trout</dc:creator><comments>https://news.ycombinator.com/item?id=46019580</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46019580</guid></item><item><title><![CDATA[An Opinionated Guide to Privacy Despite Authoritarianism]]></title><description><![CDATA[
<p>Article URL: <a href="https://turntrout.com/privacy-despite-authoritarianism">https://turntrout.com/privacy-despite-authoritarianism</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45750849">https://news.ycombinator.com/item?id=45750849</a></p>
<p>Points: 14</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 29 Oct 2025 18:14:31 +0000</pubDate><link>https://turntrout.com/privacy-despite-authoritarianism</link><dc:creator>Turn_Trout</dc:creator><comments>https://news.ycombinator.com/item?id=45750849</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45750849</guid></item><item><title><![CDATA[New comment by Turn_Trout in "Persona vectors: Monitoring and controlling character traits in language models"]]></title><description><![CDATA[
<p>No one has empirically validated the so-called "most forbidden" descriptor. It's a theoretical worry which may or may not be correct. We should run experiments to find out.</p>
]]></description><pubDate>Mon, 04 Aug 2025 22:04:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=44791856</link><dc:creator>Turn_Trout</dc:creator><comments>https://news.ycombinator.com/item?id=44791856</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44791856</guid></item><item><title><![CDATA[English Writes Numbers Backwards]]></title><description><![CDATA[
<p>Article URL: <a href="https://turntrout.com/english-numbers-are-backwards">https://turntrout.com/english-numbers-are-backwards</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44689547">https://news.ycombinator.com/item?id=44689547</a></p>
<p>Points: 3</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 25 Jul 2025 23:04:52 +0000</pubDate><link>https://turntrout.com/english-numbers-are-backwards</link><dc:creator>Turn_Trout</dc:creator><comments>https://news.ycombinator.com/item?id=44689547</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44689547</guid></item><item><title><![CDATA[New comment by Turn_Trout in "Chain of thought monitorability: A new and fragile opportunity for AI safety"]]></title><description><![CDATA[
<p>As someone who did their PhD in RL and alignment, it was not obvious to me a priori if, or when, or how badly obfuscation would be a problem. Yes, it's been predicted (and was predicted significantly before that Zvi post). But many other alignment fears have been _predicted_, and those didn't actually happen.<p>I don't think the existence of specification gaming in unrelated settings was strong evidence that obfuscation would occur in modern CoT supervision. Speculatively, I think CoT obfuscation happens due to the internal structure of LLMs and it being inductively "easier" to reweight model circuits to not admit wrongthink, rather than to rewire circuits to solve problems in entirely different ways.</p>
]]></description><pubDate>Fri, 18 Jul 2025 18:00:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=44607871</link><dc:creator>Turn_Trout</dc:creator><comments>https://news.ycombinator.com/item?id=44607871</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44607871</guid></item><item><title><![CDATA[New comment by Turn_Trout in "Guess I'm a rationalist now"]]></title><description><![CDATA[
<p>> The #1 comment says that the rationality community is about "trying to reason about things from first principle", when if fact it is the opposite.<p>Oh? Eliezer Yudkowsky (the most prominent Rationalist) bragged about how he was able to figure out AI was dangerous (the most stark Rationalist claim) from "the null string as input."[1]<p>[1] <a href="https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities" rel="nofollow">https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a...</a></p>
]]></description><pubDate>Wed, 25 Jun 2025 19:20:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=44380957</link><dc:creator>Turn_Trout</dc:creator><comments>https://news.ycombinator.com/item?id=44380957</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44380957</guid></item><item><title><![CDATA[Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models]]></title><description><![CDATA[
<p>Article URL: <a href="https://turntrout.com/self-fulfilling-misalignment">https://turntrout.com/self-fulfilling-misalignment</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=43270214">https://news.ycombinator.com/item?id=43270214</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 05 Mar 2025 18:18:04 +0000</pubDate><link>https://turntrout.com/self-fulfilling-misalignment</link><dc:creator>Turn_Trout</dc:creator><comments>https://news.ycombinator.com/item?id=43270214</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43270214</guid></item><item><title><![CDATA[New comment by Turn_Trout in "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs [pdf]"]]></title><description><![CDATA[
<p>They ran (at least) two control conditions. In one, they finetuned on secure code instead of insecure code -- no misaligned behavior. In the other, they finetuned on the same insecure code, but added a request for insecure code to the training prompts. Also no misaligned behavior.<p>So it isn't catastrophic forgetting due to training on 6K examples.</p>
]]></description><pubDate>Tue, 25 Feb 2025 22:49:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=43178582</link><dc:creator>Turn_Trout</dc:creator><comments>https://news.ycombinator.com/item?id=43178582</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43178582</guid></item><item><title><![CDATA[New comment by Turn_Trout in "Representation Engineering: Mistral-7B on Acid"]]></title><description><![CDATA[
<p>I'm the author of the GPT-2 work. This is a nice post, thanks for making it more available. :)<p>Li et al[1] and I independently derived this technique last spring, and also someone else independently derived it last fall. Something is in the air.<p>Regarding your footnote 2 re capabilities: I considered these kinds of uses before releasing the technique. Ultimately, practically successful real-world alignment techniques will let you do new things (which is generally good IMO). The technique so far seems to be delivering the new things I was hoping for.<p>[1] <a href="https://openreview.net/forum?id=aLLuYpn83y" rel="nofollow">https://openreview.net/forum?id=aLLuYpn83y</a></p>
]]></description><pubDate>Mon, 19 Feb 2024 23:41:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=39436215</link><dc:creator>Turn_Trout</dc:creator><comments>https://news.ycombinator.com/item?id=39436215</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39436215</guid></item><item><title><![CDATA[New comment by Turn_Trout in "Optimal AI agents tend to seek power"]]></title><description><![CDATA[
<p>First author here. Thanks for your comment!<p>> there's a lot hidden in the "if physically possible" part of the quote from the paper: "Average-optimal agents would generally stop us from deactivating them, if physically possible".<p>Let me check that I'm understanding correctly. Your main objection is that even optimal agents wouldn't be able to find plans which screw us over, as long as they don't start off with much power. Is that roughly correct?<p>> Theories on optimal policies have no bearing if<p>See my followup work [1] extending this to learned policies and suboptimal decision-making procedures. Optimality is not a necessary criterion, just a sufficient one.<p>> if as we start understanding ML models better, we can do things like hardware-block policies that lead to certain predicted outcome sequences (blocking an off switch, harming a human, etc.)<p>I'm a big fan of interpretability research. I don't think we'll scale it far enough for it to give us this capability, and even if it did, I think there are some very, very alignment-theoretically difficult problems with robustly blocking certain bad outcomes.<p>My other line of PhD work has been on negative side effect avoidance. [2] In my opinion, it's hard and probably doesn't admit a good enough solution for us to say "and now we've blocked the bad thing!" and be confident we succeeded.<p>[1] <a href="https://www.alignmentforum.org/posts/nZY8Np759HYFawdjH/satisficers-tend-to-seek-power-instrumental-convergence-via" rel="nofollow">https://www.alignmentforum.org/posts/nZY8Np759HYFawdjH/satis...</a><p>[2] <a href="https://avoiding-side-effects.github.io/" rel="nofollow">https://avoiding-side-effects.github.io/</a></p>
]]></description><pubDate>Tue, 07 Dec 2021 19:00:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=29476725</link><dc:creator>Turn_Trout</dc:creator><comments>https://news.ycombinator.com/item?id=29476725</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=29476725</guid></item><item><title><![CDATA[New comment by Turn_Trout in "Optimal AI agents tend to seek power"]]></title><description><![CDATA[
<p>Maybe you should read the paper, and/or the reviewer threads (as we discussed the nomenclature, and eventually agreed that "power" was accurate). We straightforwardly formalize a mainstream definition of power and show how it's more intuitive than the current standard measurement (information-theoretic empowerment).</p>
]]></description><pubDate>Tue, 07 Dec 2021 06:55:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=29469719</link><dc:creator>Turn_Trout</dc:creator><comments>https://news.ycombinator.com/item?id=29469719</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=29469719</guid></item></channel></rss>