<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: JoshMandel</title><link>https://news.ycombinator.com/user?id=JoshMandel</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Mon, 04 May 2026 17:26:42 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=JoshMandel" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by JoshMandel in "Claude Code is suddenly everywhere inside Microsoft"]]></title><description><![CDATA[
<p>Same. Sometimes even repeated nudges don't help. The underlying 3.0 Pro model is great to talk and ideate with, but its inability to deliver within the Gemini CLI harness is ... almost comical.</p>
]]></description><pubDate>Mon, 02 Feb 2026 17:59:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=46859015</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=46859015</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46859015</guid></item><item><title><![CDATA[New comment by JoshMandel in "You should write an agent"]]></title><description><![CDATA[
<p>I think that's basically fair, and I often write simple agents using exactly the technique you describe. I typically provide a TypeScript interface for the available tools and just ask the model to respond with a JSON block, and it works fine.<p>That said, it's worth understanding that the current generation of models is extensively RL-trained on how to make tool calls... so they may in fact be better at issuing tool calls in the specific format their training has focused on (using special internal tokens to demarcate where a tool call begins/ends, etc.). Intuitively, there's probably a lot of transfer learning between this format and any ad-hoc format you might request inline in your prompt.<p>There may be recent literature quantifying the performance gap here. And certainly if you're doing anything performance-sensitive you'll want to characterize this for your use case, with benchmarks. But conceptually, I think your model is spot on.</p>
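A minimal sketch of that ad-hoc approach (the tool names and the reply string here are invented for illustration; the model call itself is out of scope):

```typescript
// Ad-hoc tool calling: describe the tools as a TypeScript interface in the
// prompt, then parse a JSON block out of the model's free-text reply.
interface ToolCall {
  tool: "readFile" | "done"; // hypothetical tools, for illustration only
  args: { path?: string; answer?: string };
}

function parseToolCall(reply: string): ToolCall {
  // Grab the first {...} block in the reply and parse it as JSON.
  const match = reply.match(/\{[\s\S]*\}/);
  if (!match) throw new Error("no JSON block in model reply");
  return JSON.parse(match[0]) as ToolCall;
}

// A reply in the shape the prompt asks for:
const reply =
  'Sure, reading it now:\n{"tool": "readFile", "args": {"path": "notes.txt"}}';
const call = parseToolCall(reply);
```

The dispatch loop is then just a switch over `call.tool`, feeding each tool's result back into the next model turn.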
]]></description><pubDate>Fri, 07 Nov 2025 17:37:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=45848802</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=45848802</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45848802</guid></item><item><title><![CDATA[New comment by JoshMandel in "Opening up ‘Zero-Knowledge Proof’ technology"]]></title><description><![CDATA[
<p>But to be clear, mdoc already accounts for this through its selective disclosure protocol, without the need for zero-knowledge proof technology. When you share an mdoc you are really just sharing a signed pile of hashes (the "mobile security object"), and then you can choose which salted pre-images to share along with that pile of hashes. So for example your name and your birth date are two separate data elements; sharing your MSO will share the hashes for both, but you might only choose to share the pre-image representing your birth date, or even a simple boolean claim that you are over 21 years old.<p>What you don't get with this scheme (and which zero-knowledge proofs can provide) is protection against correlation: if you sign into the same site twice, or into different sites, can the site owners recognize that it's the same user? With the design of the core mdoc selective disclosure protocol, the answer is yes.</p>
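A sketch of the salted-hash idea (the shape of the scheme, not the actual ISO mdoc wire format — claim names, encoding, and the omitted issuer signature are all illustrative):

```typescript
import { createHash, randomBytes } from "node:crypto";

// Each claim gets a random salt; the issuer signs only the digests.
type Disclosure = { salt: string; value: string };

function digest(d: Disclosure): string {
  return createHash("sha256").update(d.salt + ":" + d.value).digest("hex");
}

// Issuer side: salt every claim and sign the pile of hashes (signature omitted).
const claims: Record<string, Disclosure> = {
  name: { salt: randomBytes(16).toString("hex"), value: "Alice Example" },
  birthDate: { salt: randomBytes(16).toString("hex"), value: "1990-01-01" },
};
const signedDigests = Object.fromEntries(
  Object.entries(claims).map(([k, d]) => [k, digest(d)])
);

// Holder side: present all digests, but reveal only the birthDate pre-image.
const shared = {
  digests: signedDigests,
  revealed: { birthDate: claims.birthDate },
};

// Verifier side: hash the revealed (salt, value) pair and compare.
const ok = digest(shared.revealed.birthDate) === shared.digests.birthDate;
```

The salt is what prevents the verifier from dictionary-guessing the undisclosed values behind the remaining hashes.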
]]></description><pubDate>Thu, 03 Jul 2025 20:24:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=44458872</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=44458872</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44458872</guid></item><item><title><![CDATA[New comment by JoshMandel in "GitHub MCP exploited: Accessing private repositories via MCP"]]></title><description><![CDATA[
<p>Last week I tried Google's Jules coding agent and saw that it requests broad GitHub OAuth permissions -- essentially "full access to everything your account can do." When you authorize it, you're granting access to all your repositories.<p>This is partly driven by developer convenience on the agent side, but it's also driven by GitHub's OAuth flow. It should be easier to create a downscoped approval during authorization that still allows the app to request additional access later. It should be easy to let an agent submit an authorization request scoped to a specific repository, etc.<p>Instead, I had to create a companion GitHub account (<a href="https://github.com/jmandel-via-jules">https://github.com/jmandel-via-jules</a>) with explicit access to only the repositories and permissions I want Jules to touch. It's pretty inconvenient, but I don't see another way to safely use these agents without potentially exposing everything.<p>GitHub does endorse creating "machine users" as dedicated accounts for applications, which validates this approach, but it shouldn't be necessary for basic repository scoping.<p>Please let me know if there's an easier way that I'm just missing.</p>
]]></description><pubDate>Tue, 27 May 2025 13:50:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=44107013</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=44107013</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44107013</guid></item><item><title><![CDATA[New comment by JoshMandel in "Advent of Code 2024"]]></title><description><![CDATA[
<p>Fair enough! I'll document my progress at <a href="https://github.com/jmandel/advent-of-claude/tree/main">https://github.com/jmandel/advent-of-claude/tree/main</a>, though I may not keep up.</p>
]]></description><pubDate>Mon, 02 Dec 2024 02:09:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=42292436</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=42292436</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42292436</guid></item><item><title><![CDATA[New comment by JoshMandel in "Advent of Code 2024"]]></title><description><![CDATA[
<p>My personal challenge last year was to solve everything on my mobile phone, using LLMs (mostly ChatGPT-4 with code interpreter; I didn't paste in the problems, but rather described the code I wanted.)<p>This year I'm declaring "Advent of Claude"!<p>Challenge: Write a Claude custom style to solve Advent of Code puzzles within Claude's UI.<p>Score: # adventofcode.com stars earned in 2 daily conversation turns.<p>Fine print: web app artifacts are allowed, including paste of your custom input into the artifact UI; one click only.<p>Per <a href="https://adventofcode.com/2024/about" rel="nofollow">https://adventofcode.com/2024/about</a>, wait until the daily <a href="http://adventofcode.com" rel="nofollow">http://adventofcode.com</a> leaderboard is full before submitting LLM-generated solutions!<p>Of course, feel free to use ChatGPT custom instructions, static prompts, etc.<p>Day 1: two stars, <a href="https://claude.site/artifacts/d16e6bdb-f697-45fe-930c-7f58b2b5bb16" rel="nofollow">https://claude.site/artifacts/d16e6bdb-f697-45fe-930c-7f58b2...</a></p>
]]></description><pubDate>Mon, 02 Dec 2024 01:06:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=42292097</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=42292097</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42292097</guid></item><item><title><![CDATA[New comment by JoshMandel in "Thinking about recipe formats more than anyone should"]]></title><description><![CDATA[
<p>I find "higher level" format issues to be of greater concern. These are issues like: is the recipe structured in a way that makes the prep/process flow clear? Does it make it obvious when a certain ingredient needs to be prepped but divided into multiple parts for use in different stages, or when different stages produce components that are combined at subsequent points in the workflow?<p>A recent example: I really like the Hainanese chicken recipe at <a href="https://www.google.com/amp/s/amp.theguardian.com/food/article/2024/may/11/yotam-ottolenghi-five-ingredient-or-thereabouts-recipes-chicken-rice-spring-onion-broad-beans" rel="nofollow">https://www.google.com/amp/s/amp.theguardian.com/food/articl...</a> ... But I find it very hard to follow in this format.<p>Using o1-preview to restructure it, I get something I find much easier to follow during my cooking workflow: <a href="https://chatgpt.com/share/6733e594-df28-8009-ac80-d5dabd1ae01b" rel="nofollow">https://chatgpt.com/share/6733e594-df28-8009-ac80-d5dabd1ae0...</a><p>But getting from a well-written recipe to structured data is now pretty straightforward... if/when you need structured data.</p>
]]></description><pubDate>Tue, 12 Nov 2024 23:33:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=42121177</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=42121177</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42121177</guid></item><item><title><![CDATA[New comment by JoshMandel in "LLMD: A Large Language Model for Interpreting Longitudinal Medical Records"]]></title><description><![CDATA[
<p>There's so much good stuff here, and I agree it's an important message for you to get across.<p>I think trying to convey these ideas through a quantitative benchmark result (particularly a benchmark which has a clear common interpretation that you're essentially redefining) risks 1) misleading readers, and 2) failing to convey the rich and detailed analysis you've included here in your HN comment.<p>I'd suggest you restrict your quantitative PubMedQA analysis to report previously published numbers for other models (so you're not in the role of having to defend choices that might cripple other models) or a very straightforward log probs analysis if no outside numbers are available (making it clear which numbers you've produced vs sourced externally). Then separately explain that many of the small models with high benchmark scores exhibit poor instruction following capabilities (which will not be a surprise for many readers, since these models aren't necessarily tuned or evaluated for that), and you can make the point that some of them are so poor at instruction following that they're very hard to deploy in contexts that require it; you could even demonstrate that they're only able to follow an instruction to "conclude answers with 'Final Answer: [ABCDE]'" on x% of questions, given a standard prompt that you've created and published. In other words, if it's clear that the problem is instruction following, analyze that.<p>(Not all abstraction pipelines leveraging an LLM need it to exhibit instruction following, and in your own case, I'm not sure you can claim that your model follows instructions well on the basis of its PubMedQA or abstraction performance, since you've fine-tuned on (prompt, answer) pairs in both domains. You'd need a different baseline for comparison to really explore this claim.)<p>Then I'd suggest creating a detailed table of wrong/surprising stuff that frontier models don't understand about healthcare data, but which your model does understand. Categorize them, show examples in the table, and explain them in narrative much like you've done here.</p>
]]></description><pubDate>Sun, 20 Oct 2024 13:22:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=41895211</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=41895211</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41895211</guid></item><item><title><![CDATA[New comment by JoshMandel in "LLMD: A Large Language Model for Interpreting Longitudinal Medical Records"]]></title><description><![CDATA[
<p>I appreciate the response!<p>I can't understand your methods without example prompts or code, so it's hard for me to interpret the data in figure 6. It will be important to document the methodology carefully to avoid concerns that your "text response" methodology is unfairly punishing other models.<p>In any case, since the methodology that Anthropic applied is documented and straightforward, it would be possible to do an apples to apples comparison with your model.<p>(I'm also very curious to know how 3.5 Sonnet performs.)<p>Is your text methodology based on CoT (like the "PubMedQA training dataset enriched with CoT" you trained on) or a forced single token completion like Anthropic used in their evaluation? In the latter case, I'm not sure how "text responses" differ from log probabilities at Temperature T=0 (i.e., isn't the most likely token always going to be the text response?)</p>
]]></description><pubDate>Sat, 19 Oct 2024 02:16:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=41885139</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=41885139</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41885139</guid></item><item><title><![CDATA[New comment by JoshMandel in "LLMD: A Large Language Model for Interpreting Longitudinal Medical Records"]]></title><description><![CDATA[
<p>>LLMD-8B achieves state of the art responses on PubMedQA over all models<p>Hang on -- while this is a cool result, beating a limited number of models that you chose to include in your comparison does not qualify LLMD-8B as SOTA. (For example, Claude 3 Sonnet scores 10 percentage points higher.)<p>>This result confirms the power of continued pretraining and suggests that records themselves have content useful for improving benchmark performance.<p>In support of this conclusion, it would be informative to include an ablation study, e.g. evaluating a continued pre-training dataset of the same total size but omitting medical record content from the data mix.</p>
]]></description><pubDate>Fri, 18 Oct 2024 22:08:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=41883940</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=41883940</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41883940</guid></item><item><title><![CDATA[New comment by JoshMandel in "Jetstream: Shrinking the AT Protocol Firehose by >99%"]]></title><description><![CDATA[
<p>Server-Sent Events (SSE) with standard gzip compression could be a simpler solution -- or maybe I'm missing something about the websocket + zstd approach.<p>SSE benefits: standard HTTP protocol, built-in gzip compression, simpler client implementation.</p>
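For concreteness, here's the SSE framing the comment has in mind (the event name and payload are made up; with `Content-Encoding: gzip` negotiated, a server just pipes these frames through a gzip stream and flushes per event):

```typescript
// An SSE event is plain text over a long-lived HTTP response:
// "event: <name>\ndata: <payload>\n\n". Browsers consume it via EventSource.
function sseFrame(event: string, payload: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(payload)}\n\n`;
}

// Hypothetical firehose-style event:
const frame = sseFrame("commit", { repo: "at://did:plc:example/post/1" });
```

The trade-off versus websockets is that SSE is server-to-client only, which is all a firehose consumer needs.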
]]></description><pubDate>Tue, 24 Sep 2024 12:54:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=41635989</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=41635989</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41635989</guid></item><item><title><![CDATA[New comment by JoshMandel in "Reflection 70B, the top open-source model"]]></title><description><![CDATA[
<p>I'm surprised this does so well in benchmarks, given the intuition I'm getting about its behavior from quick testing.<p>I gave it a medium-complexity design problem: Design the typescript interface for the state of a react app that manages a tree of chat turns/responses and displays the current path through the tree. (In other words, the kind of state that sits logically behind the ChatGPT or Claude Web UI, where previous conversation turns can be edited and used as a branching off point for new turns.)<p>Reflection-70B suffered from a bad initial idea, just as Llama 70B generally does (proposing to duplicate state between the "tree of all messages" and the "path to currently displayed message"), which is a very common error. The automated reflection process identified a whole bunch of nitpicks but missed the glaring logical bug. Furthermore the final output was missing many of the details included in the initial reflection / chain-of-thought scratchpad, even though the UI hides the scratchpad as though it's unimportant for the user to read.</p>
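One sketch of the state shape the comment is arguing for (names are invented): store the tree once, keyed by id, and derive the displayed path from a single leaf pointer rather than duplicating messages in a separate "current path" array.

```typescript
interface ChatTurn {
  id: string;
  parentId: string | null; // null for the root turn
  role: "user" | "assistant";
  content: string;
}

interface ChatState {
  turns: Record<string, ChatTurn>; // single source of truth for the tree
  currentLeafId: string | null;    // the displayed path is derived, not stored
}

// Walk parent pointers from the current leaf back to the root.
function currentPath(state: ChatState): ChatTurn[] {
  const path: ChatTurn[] = [];
  let id = state.currentLeafId;
  while (id !== null) {
    const turn = state.turns[id];
    path.unshift(turn);
    id = turn.parentId;
  }
  return path;
}

const state: ChatState = {
  turns: {
    a: { id: "a", parentId: null, role: "user", content: "hi" },
    b: { id: "b", parentId: "a", role: "assistant", content: "hello" },
  },
  currentLeafId: "b",
};
```

Editing an earlier turn then just means adding a sibling node and moving `currentLeafId`; no second copy of the messages ever exists to drift out of sync.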
]]></description><pubDate>Thu, 05 Sep 2024 20:53:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=41460326</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=41460326</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41460326</guid></item><item><title><![CDATA[New comment by JoshMandel in "10 > 64, in QR Codes"]]></title><description><![CDATA[
<p>We used essentially this technique in the SMART Health Cards specification for vaccine and lab result QRs.<p><a href="https://spec.smarthealth.cards/#encoding-qrs" rel="nofollow">https://spec.smarthealth.cards/#encoding-qrs</a><p>It's well supported by scanners but can create unwieldy values for users to copy/paste.<p>For more recent work with dynamic content (and the assumption that a web server is involved in the flow), we're just limiting the payload size and using ordinary byte mode (<a href="https://docs.smarthealthit.org/smart-health-links/spec" rel="nofollow">https://docs.smarthealthit.org/smart-health-links/spec</a>)</p>
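The encoding itself is tiny (a sketch of the scheme described at spec.smarthealth.cards; error handling for out-of-range characters is omitted): each base64url JWS character has a char code of at least 45 (`-`), so `ord(c) - 45` fits in two decimal digits, and the whole payload can ride in QR numeric mode.

```typescript
// Encode a base64url JWS string as a digit string for QR numeric mode.
function toNumericMode(jws: string): string {
  return [...jws]
    .map((c) => (c.charCodeAt(0) - 45).toString().padStart(2, "0"))
    .join("");
}

// Invert: consume two digits at a time and add the offset back.
function fromNumericMode(digits: string): string {
  let out = "";
  for (let i = 0; i < digits.length; i += 2) {
    out += String.fromCharCode(Number(digits.slice(i, i + 2)) + 45);
  }
  return out;
}
```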
]]></description><pubDate>Tue, 02 Apr 2024 23:33:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=39912005</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=39912005</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39912005</guid></item><item><title><![CDATA[New comment by JoshMandel in "HuggingChat: Chat with Open Source Models"]]></title><description><![CDATA[
<p>I'm very pleased this UX includes "can edit any previous conversation turn" functionality, making conversations a tree rather than a list.<p>For me this is one of the highest-impact and most-often-overlooked features of the ChatGPT Web UI (so much so that OpenAI does not even include this feature in their native clients).</p>
]]></description><pubDate>Wed, 21 Feb 2024 15:42:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=39455163</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=39455163</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39455163</guid></item><item><title><![CDATA[New comment by JoshMandel in "Don't upload your PWA to the app stores"]]></title><description><![CDATA[
<p>I <i>wish</i> the native ChatGPT app on Android had all the functionality of the web app. I dearly miss the ability to navigate conversations as a tree, going back and editing any prior turns to try out different ideas.</p>
]]></description><pubDate>Thu, 11 Jan 2024 03:05:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=38946982</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=38946982</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38946982</guid></item><item><title><![CDATA[New comment by JoshMandel in "The GPT Store"]]></title><description><![CDATA[
<p>My experience putting together <a href="https://chat.openai.com/g/g-bdnABvG92-reci-pop" rel="nofollow">https://chat.openai.com/g/g-bdnABvG92-reci-pop</a> (transcribes recipes as succinct bullet lists, suitable for scrolling during meal prep) was that the Actions configuration for custom GPTs is quite brittle.<p>OpenAI has implemented controls to stop the model from adding hallucinated parameters to an action payload... but this results in user-facing failures.<p>I initially worked around the user-facing failures by wrapping the entire payload in a {"request": {... payload}} structure (which helps because the controls only perform a shallow check).<p>It is frustrating that users have no way to view the action response, even though users can view the action request. Not infrequently, the model will take an essentially empty or irrelevant response and silently ignore it, hallucinating an answer as though the response had been informative... so it's hard for users to trust what they see in the generated output.<p>It would be so easy to enable a toggle for users to inspect the response, but I think the OpenAI team wants to somehow "protect" the IP or internal decisions of custom GPT "creators". It would at least be nice to have a toggle for developers who <i>don't</i> feel proprietary about those details. And maybe a fork button :-)</p>
]]></description><pubDate>Wed, 10 Jan 2024 23:52:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=38945337</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=38945337</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38945337</guid></item><item><title><![CDATA[New comment by JoshMandel in "Pushing ChatGPT's Structured Data Support to Its Limits"]]></title><description><![CDATA[
<p>Yes -- the distinction with "function calling" is that you have to play a game of telephone where you describe your target schema in JSON Schema (only, apparently, for OpenAI to turn it into a TypeScript interface internally) vs describing it more directly and succinctly (with opportunities to include inline comments, order fields however you want, and use advanced TS features... or even use an ad-hoc schema "language").</p>
]]></description><pubDate>Wed, 27 Dec 2023 16:58:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=38783901</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=38783901</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38783901</guid></item><item><title><![CDATA[New comment by JoshMandel in "Pushing ChatGPT's Structured Data Support to Its Limits"]]></title><description><![CDATA[
<p>FWIW, I've seen stronger performance from gpt-4-1106-preview when I use `response_format: { type: "json_object" },` (providing a target typescript interface in context), vs the "tools" API.<p>More flexible, and (evaluating non-scientifically!) qualitatively better answers & instruction following -- particularly for deeply nested or complex schemas, which typescript expresses very clearly and succinctly.<p>Example from a hack week project earlier this month (using a TS-ish schema description that's copy/pasted from healthcare's FHIR standard): <a href="https://github.com/microsoft-healthcare-madison/hackweek-2023-12/blob/c0e8b0235409ee173b8b1cbb24007427ea66707d/day-1-fhir-questionnaire-creation/src/prompts/generate.js#L139">https://github.com/microsoft-healthcare-madison/hackweek-202...</a><p>Or a more complex example with one model call to invent a TS schema on-the-fly and another call to abstract clinical data into it: <a href="https://github.com/microsoft-healthcare-madison/hackweek-2023-12/blob/c0e8b0235409ee173b8b1cbb24007427ea66707d/day-2-answer-suggestion/outputs/provide-all-details-needed-to-determine-the-breast-cancer-stage.md">https://github.com/microsoft-healthcare-madison/hackweek-202...</a></p>
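The prompt-assembly side of this pattern looks something like the following (the `Recipe` interface and the system-message wording are invented for illustration; the actual model call with `response_format: { type: "json_object" }` is stubbed out):

```typescript
// Use a TypeScript interface, inline comments and all, as the schema the
// model is asked to conform to in json_object mode.
const schema = `
interface Recipe {
  title: string;
  ingredients: { name: string; quantity: string }[]; // inline comments allowed
  steps: string[];
}`;

function buildMessages(userText: string) {
  return [
    {
      role: "system",
      content:
        "Respond with a single JSON object conforming to this TypeScript interface:\n" +
        schema,
    },
    { role: "user", content: userText },
  ];
}

const messages = buildMessages("Transcribe this recipe: ...");
```

These `messages` then go to the chat completions API along with `response_format: { type: "json_object" }`, and the reply is parsed with `JSON.parse`.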
]]></description><pubDate>Wed, 27 Dec 2023 16:42:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=38783709</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=38783709</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38783709</guid></item><item><title><![CDATA[New comment by JoshMandel in "Show HN: Talk Paper Scissors"]]></title><description><![CDATA[
<p>ChatGPT 4 can not only play, it can design and implement a commitment scheme to make the game more interesting (...as long as you don't peek at the code interpreter output -- that's foul play... and the entropy in that nonce is suspect ;-))<p>Opponent conversation:
<a href="https://chat.openai.com/share/f545b885-4176-4824-831b-7680cc8d240c" rel="nofollow noreferrer">https://chat.openai.com/share/f545b885-4176-4824-831b-7680cc...</a><p>Helper conversation: <a href="https://chat.openai.com/share/1d2455f3-3dbc-40c9-a5ae-3b9f1aca90a8" rel="nofollow noreferrer">https://chat.openai.com/share/1d2455f3-3dbc-40c9-a5ae-3b9f1a...</a></p>
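The commitment scheme it came up with boils down to something like this sketch (a generic hash commitment, not the exact code from the linked conversation):

```typescript
import { createHash, randomBytes } from "node:crypto";

// Commit to a move with H(move || nonce); reveal both after the opponent
// has moved. Without a high-entropy nonce the commitment is guessable --
// which is exactly the suspect-entropy jab above.
function commit(move: string, nonce: string): string {
  return createHash("sha256").update(`${move}:${nonce}`).digest("hex");
}

const nonce = randomBytes(32).toString("hex"); // don't skimp on entropy
const commitment = commit("rock", nonce);

// After both players have committed, reveal (move, nonce) and verify:
const valid = commit("rock", nonce) === commitment;
const forged = commit("paper", nonce) === commitment; // changing the move fails
```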
]]></description><pubDate>Sun, 24 Dec 2023 00:03:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=38749563</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=38749563</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38749563</guid></item><item><title><![CDATA[New comment by JoshMandel in "Climbing 50 steps a day can cut your risk of heart disease"]]></title><description><![CDATA[
<p>> So... we're actually worse off<p>Careful about inferring causality here.<p>What kind of active person suddenly stops being active? You presumably don't want to be that kind of person... but the stopping isn't necessarily a causal factor of the cardiovascular events (and the initial "being active" seems still far less likely to be a causal factor).</p>
]]></description><pubDate>Sun, 26 Nov 2023 23:23:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=38426197</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=38426197</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38426197</guid></item></channel></rss>