<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: JoshMandel</title><link>https://news.ycombinator.com/user?id=JoshMandel</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Mon, 04 May 2026 17:26:42 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=JoshMandel" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by JoshMandel in "Claude Code is suddenly everywhere inside Microsoft"]]></title><description><![CDATA[
<p>Same. Sometimes even repeated nudges don't help. The underlying 3.0 Pro model is great to talk and ideate with, but its inability to deliver within the Gemini CLI harness is ... almost comical.</p>
]]></description><pubDate>Mon, 02 Feb 2026 17:59:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=46859015</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=46859015</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46859015</guid></item><item><title><![CDATA[New comment by JoshMandel in "You should write an agent"]]></title><description><![CDATA[
<p>I think that's basically fair, and I often write simple agents using exactly the technique you describe. I typically provide a TypeScript interface for the available tools and just ask the model to respond with a JSON block, and it works fine.<p>That said, it's worth understanding that the current generation of models is extensively RL-trained on how to make tool calls... so they may in fact be better at issuing tool calls in the specific format their training has focused on (using special internal tokens to demarcate where a tool call begins/ends, etc.). Intuitively, there's probably a lot of transfer learning between this format and any ad-hoc format you might request inline in your prompt.<p>There may be recent literature quantifying the performance gap here. And certainly if you're doing anything performance-sensitive you'll want to characterize this for your use case, with benchmarks. But conceptually, I think your model is spot on.</p>
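A minimal sketch of that ad-hoc approach (the tool names and the reply string here are invented for illustration; the model call itself is out of scope):

```typescript
// Ad-hoc tool calling: describe the tools as a TypeScript interface in the
// prompt, then parse a JSON block out of the model's free-text reply.
interface ToolCall {
  tool: "readFile" | "done"; // hypothetical tools, for illustration only
  args: { path?: string; answer?: string };
}

function parseToolCall(reply: string): ToolCall {
  // Grab the first {...} block in the reply and parse it as JSON.
  const match = reply.match(/\{[\s\S]*\}/);
  if (!match) throw new Error("no JSON block in model reply");
  return JSON.parse(match[0]) as ToolCall;
}

// A reply in the shape the prompt asks for:
const reply =
  'Sure, reading it now:\n{"tool": "readFile", "args": {"path": "notes.txt"}}';
const call = parseToolCall(reply);
```

The dispatch loop is then just a switch over `call.tool`, feeding each tool's result back into the next model turn.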
]]></description><pubDate>Fri, 07 Nov 2025 17:37:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=45848802</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=45848802</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45848802</guid></item><item><title><![CDATA[New comment by JoshMandel in "Opening up ‘Zero-Knowledge Proof’ technology"]]></title><description><![CDATA[
<p>But to be clear, mdoc already accounts for this through its selective disclosure protocol, without the need for zero-knowledge proof technology. When you share an mdoc you are really just sharing a signed pile of hashes (the "mobile security object"), and then you can choose which salted pre-images to share along with that pile of hashes. So for example your name and your birth date are two separate data elements; sharing your MSO will share the hashes for both, but you might only choose to share the pre-image representing your birth date, or even a simple boolean claim that you are over 21 years old.<p>What you don't get with this scheme (and which zero-knowledge proofs can provide) is protection against correlation: if you sign into the same site twice, or into different sites, can the site owners recognize that it's the same user? With the design of the core mdoc selective disclosure protocol, the answer is yes.</p>
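A sketch of the salted-hash idea (the shape of the scheme, not the actual ISO mdoc wire format — claim names, encoding, and the omitted issuer signature are all illustrative):

```typescript
import { createHash, randomBytes } from "node:crypto";

// Each claim gets a random salt; the issuer signs only the digests.
type Disclosure = { salt: string; value: string };

function digest(d: Disclosure): string {
  return createHash("sha256").update(d.salt + ":" + d.value).digest("hex");
}

// Issuer side: salt every claim and sign the pile of hashes (signature omitted).
const claims: Record<string, Disclosure> = {
  name: { salt: randomBytes(16).toString("hex"), value: "Alice Example" },
  birthDate: { salt: randomBytes(16).toString("hex"), value: "1990-01-01" },
};
const signedDigests = Object.fromEntries(
  Object.entries(claims).map(([k, d]) => [k, digest(d)])
);

// Holder side: present all digests, but reveal only the birthDate pre-image.
const shared = {
  digests: signedDigests,
  revealed: { birthDate: claims.birthDate },
};

// Verifier side: hash the revealed (salt, value) pair and compare.
const ok = digest(shared.revealed.birthDate) === shared.digests.birthDate;
```

The salt is what prevents the verifier from dictionary-guessing the undisclosed values behind the remaining hashes.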
]]></description><pubDate>Thu, 03 Jul 2025 20:24:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=44458872</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=44458872</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44458872</guid></item><item><title><![CDATA[New comment by JoshMandel in "GitHub MCP exploited: Accessing private repositories via MCP"]]></title><description><![CDATA[
<p>Last week I tried Google's Jules coding agent and saw that it requests broad GitHub OAuth permissions -- essentially "full access to everything your account can do." When you authorize it, you're granting access to all your repositories.<p>This is partly driven by developer convenience on the agent side, but it's also driven by GitHub's OAuth flow. It should be easier to create a downscoped approval during authorization that still allows the app to request additional access later. It should be easy to let an agent submit an authorization request scoped to a specific repository, etc.<p>Instead, I had to create a companion GitHub account (<a href="https://github.com/jmandel-via-jules">https://github.com/jmandel-via-jules</a>) with explicit access to only the repositories and permissions I want Jules to touch. It's pretty inconvenient, but I don't see another way to safely use these agents without potentially exposing everything.<p>GitHub does endorse creating "machine users" as dedicated accounts for applications, which validates this approach, but it shouldn't be necessary for basic repository scoping.<p>Please let me know if there's an easier way that I'm just missing.</p>
]]></description><pubDate>Tue, 27 May 2025 13:50:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=44107013</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=44107013</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44107013</guid></item><item><title><![CDATA[New comment by JoshMandel in "Advent of Code 2024"]]></title><description><![CDATA[
<p>Fair enough! I'll document my progress at <a href="https://github.com/jmandel/advent-of-claude/tree/main">https://github.com/jmandel/advent-of-claude/tree/main</a>, though I may not keep up.</p>
]]></description><pubDate>Mon, 02 Dec 2024 02:09:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=42292436</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=42292436</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42292436</guid></item><item><title><![CDATA[New comment by JoshMandel in "Advent of Code 2024"]]></title><description><![CDATA[
<p>My personal challenge last year was to solve everything on my mobile phone, using LLMs (mostly ChatGPT-4 with code interpreter; I didn't paste in the problems, but rather described the code I wanted.)<p>This year I'm declaring "Advent of Claude"!<p>Challenge: Write a Claude custom style to solve Advent of Code puzzles within Claude's UI.<p>Score: # adventofcode.com stars earned in 2 daily conversation turns.<p>Fine print: web app artifacts are allowed, including paste of your custom input into the artifact UI; one click only.<p>Per <a href="https://adventofcode.com/2024/about" rel="nofollow">https://adventofcode.com/2024/about</a>, wait until the daily <a href="http://adventofcode.com" rel="nofollow">http://adventofcode.com</a> leaderboard is full before submitting LLM-generated solutions!<p>Of course, feel free to use ChatGPT custom instructions, static prompts, etc.<p>Day 1: two stars, <a href="https://claude.site/artifacts/d16e6bdb-f697-45fe-930c-7f58b2b5bb16" rel="nofollow">https://claude.site/artifacts/d16e6bdb-f697-45fe-930c-7f58b2...</a></p>
]]></description><pubDate>Mon, 02 Dec 2024 01:06:05 +0000</pubDate><link>https://news.ycombinator.com/item?id=42292097</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=42292097</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42292097</guid></item><item><title><![CDATA[New comment by JoshMandel in "Thinking about recipe formats more than anyone should"]]></title><description><![CDATA[
<p>I find "higher level" format issues to be of greater concern. These are issues like: is the recipe structured in a way that makes the prep/process flow clear? Does it make it obvious when a certain ingredient needs to be prepped but divided into multiple parts for use in different stages, or when different stages produce components that are combined at subsequent points in the workflow?<p>A recent example: I really like the Hainanese chicken recipe at <a href="https://www.google.com/amp/s/amp.theguardian.com/food/article/2024/may/11/yotam-ottolenghi-five-ingredient-or-thereabouts-recipes-chicken-rice-spring-onion-broad-beans" rel="nofollow">https://www.google.com/amp/s/amp.theguardian.com/food/articl...</a> ... But I find it very hard to follow in this format.<p>Using o1-preview to restructure it, I get something I find much easier to follow during my cooking workflow: <a href="https://chatgpt.com/share/6733e594-df28-8009-ac80-d5dabd1ae01b" rel="nofollow">https://chatgpt.com/share/6733e594-df28-8009-ac80-d5dabd1ae0...</a><p>But getting from a well-written recipe to structured data is now pretty straightforward... if/when you need structured data.</p>
]]></description><pubDate>Tue, 12 Nov 2024 23:33:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=42121177</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=42121177</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42121177</guid></item><item><title><![CDATA[New comment by JoshMandel in "LLMD: A Large Language Model for Interpreting Longitudinal Medical Records"]]></title><description><![CDATA[
<p>There's so much good stuff here, and I agree it's an important message for you to get across.<p>I think trying to convey these ideas through a quantitative benchmark result (particularly a benchmark which has a clear common interpretation that you're essentially redefining) risks 1) misleading readers, and 2) failing to convey the rich and detailed analysis you've included here in your HN comment.<p>I'd suggest you restrict your quantitative PubMedQA analysis to report previously published numbers for other models (so you're not in the role of having to defend choices that might cripple other models) or a very straightforward log probs analysis if no outside numbers are available (making it clear which numbers you've produced vs sourced externally). Then separately explain that many of the small models with high benchmark scores exhibit poor instruction following capabilities (which will not be a surprise for many readers, since these models aren't necessarily tuned or evaluated for that), and you can make the point that some of them are so poor at instruction following that they're very hard to deploy in contexts that require it; you could even demonstrate that they're only able to follow an instruction to "conclude answers with 'Final Answer: [ABCDE]'" on x% of questions, given a standard prompt that you've created and published. In other words, if it's clear that the problem is instruction following, analyze that.<p>(Not all abstraction pipelines leveraging an LLM need it to exhibit instruction following, and in your own case, I'm not sure you can claim that your model follows instructions well on the basis of its PubMedQA or abstraction performance, since you've fine-tuned on (prompt, answer) pairs in both domains. You'd need a different baseline for comparison to really explore this claim.)<p>Then I'd suggest creating a detailed table of wrong/surprising stuff that frontier models don't understand about healthcare data, but which your model does understand. Categorize them, show examples in the table, and explain them in narrative much like you've done here.</p>
]]></description><pubDate>Sun, 20 Oct 2024 13:22:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=41895211</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=41895211</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41895211</guid></item><item><title><![CDATA[New comment by JoshMandel in "LLMD: A Large Language Model for Interpreting Longitudinal Medical Records"]]></title><description><![CDATA[
<p>I appreciate the response!<p>I can't understand your methods without example prompts or code, so it's hard for me to interpret the data in figure 6. It will be important to document the methodology carefully to avoid concerns that your "text response" methodology is unfairly punishing other models.<p>In any case, since the methodology that Anthropic applied is documented and straightforward, it would be possible to do an apples to apples comparison with your model.<p>(I'm also very curious to know how 3.5 Sonnet performs.)<p>Is your text methodology based on CoT (like the "PubMedQA training dataset enriched with CoT" you trained on) or a forced single token completion like Anthropic used in their evaluation? In the latter case, I'm not sure how "text responses" differ from log probabilities at Temperature T=0 (i.e., isn't the most likely token always going to be the text response?)</p>
]]></description><pubDate>Sat, 19 Oct 2024 02:16:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=41885139</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=41885139</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41885139</guid></item><item><title><![CDATA[New comment by JoshMandel in "LLMD: A Large Language Model for Interpreting Longitudinal Medical Records"]]></title><description><![CDATA[
<p>>LLMD-8B achieves state of the art responses on PubMedQA over all models<p>Hang on -- while this is a cool result, beating a limited number of models that you chose to include in your comparison does not qualify LLMD-8B as SOTA. (For example, Claude 3 Sonnet scores 10 percentage points higher.)<p>>This result confirms the power of continued pretraining and suggests that records themselves have content useful for improving benchmark performance.<p>In support of this conclusion, it would be informative to include an ablation study, e.g. evaluating a continued pre-training dataset of the same total size but omitting medical record content from the data mix.</p>
]]></description><pubDate>Fri, 18 Oct 2024 22:08:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=41883940</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=41883940</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41883940</guid></item><item><title><![CDATA[New comment by JoshMandel in "Jetstream: Shrinking the AT Protocol Firehose by >99%"]]></title><description><![CDATA[
<p>Server-Sent Events (SSE) with standard gzip compression could be a simpler solution -- or maybe I'm missing something about the websocket + zstd approach.<p>SSE benefits: standard HTTP protocol, built-in gzip compression, simpler client implementation.</p>
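For concreteness, here's the SSE framing the comment has in mind (the event name and payload are made up; with `Content-Encoding: gzip` negotiated, a server just pipes these frames through a gzip stream and flushes per event):

```typescript
// An SSE event is plain text over a long-lived HTTP response:
// "event: <name>\ndata: <payload>\n\n". Browsers consume it via EventSource.
function sseFrame(event: string, payload: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(payload)}\n\n`;
}

// Hypothetical firehose-style event:
const frame = sseFrame("commit", { repo: "at://did:plc:example/post/1" });
```

The trade-off versus websockets is that SSE is server-to-client only, which is all a firehose consumer needs.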
]]></description><pubDate>Tue, 24 Sep 2024 12:54:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=41635989</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=41635989</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41635989</guid></item><item><title><![CDATA[New comment by JoshMandel in "Reflection 70B, the top open-source model"]]></title><description><![CDATA[
<p>I'm surprised this does so well in benchmarks, given the intuition I'm getting about its behavior from quick testing.<p>I gave it a medium-complexity design problem: Design the typescript interface for the state of a react app that manages a tree of chat turns/responses and displays the current path through the tree. (In other words, the kind of state that sits logically behind the ChatGPT or Claude Web UI, where previous conversation turns can be edited and used as a branching off point for new turns.)<p>Reflection-70B suffered from a bad initial idea, just as Llama 70B generally does (proposing to duplicate state between the "tree of all messages" and the "path to currently displayed message"), which is a very common error. The automated reflection process identified a whole bunch of nitpicks but missed the glaring logical bug. Furthermore the final output was missing many of the details included in the initial reflection / chain-of-thought scratchpad, even though the UI hides the scratchpad as though it's unimportant for the user to read.</p>
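One sketch of the state shape the comment is arguing for (names are invented): store the tree once, keyed by id, and derive the displayed path from a single leaf pointer rather than duplicating messages in a separate "current path" array.

```typescript
interface ChatTurn {
  id: string;
  parentId: string | null; // null for the root turn
  role: "user" | "assistant";
  content: string;
}

interface ChatState {
  turns: Record<string, ChatTurn>; // single source of truth for the tree
  currentLeafId: string | null;    // the displayed path is derived, not stored
}

// Walk parent pointers from the current leaf back to the root.
function currentPath(state: ChatState): ChatTurn[] {
  const path: ChatTurn[] = [];
  let id = state.currentLeafId;
  while (id !== null) {
    const turn = state.turns[id];
    path.unshift(turn);
    id = turn.parentId;
  }
  return path;
}

const state: ChatState = {
  turns: {
    a: { id: "a", parentId: null, role: "user", content: "hi" },
    b: { id: "b", parentId: "a", role: "assistant", content: "hello" },
  },
  currentLeafId: "b",
};
```

Editing an earlier turn then just means adding a sibling node and moving `currentLeafId`; no second copy of the messages ever exists to drift out of sync.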
]]></description><pubDate>Thu, 05 Sep 2024 20:53:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=41460326</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=41460326</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41460326</guid></item><item><title><![CDATA[New comment by JoshMandel in "10 > 64, in QR Codes"]]></title><description><![CDATA[
<p>We used essentially this technique in the SMART Health Cards specification for vaccine and lab result QRs.<p><a href="https://spec.smarthealth.cards/#encoding-qrs" rel="nofollow">https://spec.smarthealth.cards/#encoding-qrs</a><p>It's well supported by scanners but can create unwieldy values for users to copy/paste.<p>For more recent work with dynamic content (and the assumption that a web server is involved in the flow), we're just limiting the payload size and using ordinary byte mode (<a href="https://docs.smarthealthit.org/smart-health-links/spec" rel="nofollow">https://docs.smarthealthit.org/smart-health-links/spec</a>)</p>
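The encoding itself is tiny (a sketch of the scheme described at spec.smarthealth.cards; error handling for out-of-range characters is omitted): each base64url JWS character has a char code of at least 45 (`-`), so `ord(c) - 45` fits in two decimal digits, and the whole payload can ride in QR numeric mode.

```typescript
// Encode a base64url JWS string as a digit string for QR numeric mode.
function toNumericMode(jws: string): string {
  return [...jws]
    .map((c) => (c.charCodeAt(0) - 45).toString().padStart(2, "0"))
    .join("");
}

// Invert: consume two digits at a time and add the offset back.
function fromNumericMode(digits: string): string {
  let out = "";
  for (let i = 0; i < digits.length; i += 2) {
    out += String.fromCharCode(Number(digits.slice(i, i + 2)) + 45);
  }
  return out;
}
```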
]]></description><pubDate>Tue, 02 Apr 2024 23:33:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=39912005</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=39912005</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39912005</guid></item><item><title><![CDATA[New comment by JoshMandel in "HuggingChat: Chat with Open Source Models"]]></title><description><![CDATA[
<p>I'm very pleased this UX includes "can edit any previous conversation turn" functionality, making conversations a tree rather than a list.<p>For me this is one of the highest-impact and most-often-overlooked features of the ChatGPT Web UI (so much so that OpenAI does not even include this feature in their native clients).</p>
]]></description><pubDate>Wed, 21 Feb 2024 15:42:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=39455163</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=39455163</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39455163</guid></item><item><title><![CDATA[New comment by JoshMandel in "Don't upload your PWA to the app stores"]]></title><description><![CDATA[
<p>I <i>wish</i> the native ChatGPT app on Android had all the functionality of the web app. I dearly miss the ability to navigate conversations as a tree, going back and editing any prior turns to try out different ideas.</p>
]]></description><pubDate>Thu, 11 Jan 2024 03:05:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=38946982</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=38946982</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38946982</guid></item><item><title><![CDATA[New comment by JoshMandel in "The GPT Store"]]></title><description><![CDATA[
<p>My experience putting together <a href="https://chat.openai.com/g/g-bdnABvG92-reci-pop" rel="nofollow">https://chat.openai.com/g/g-bdnABvG92-reci-pop</a> (transcribes recipes as succinct bullet lists, suitable for scrolling during meal prep) was that the Actions configuration for custom GPTs is quite brittle.<p>OpenAI has implemented controls to stop the model from adding hallucinated parameters to an action payload... but this results in user-facing failures.<p>I initially worked around the user-facing failures by wrapping the entire payload in a {"request": {... payload}} structure (which helps because the controls only perform a shallow check).<p>It is frustrating that users have no way to view the action response, even though users can view the action request. Not infrequently, the model will take an essentially empty or irrelevant response and silently ignore it, hallucinating an answer as though the response had been informative... so it's hard for users to trust what they see in the generated output.<p>It would be so easy to enable a toggle for users to inspect the response, but I think the OpenAI team wants to somehow "protect" the IP or internal decisions of custom GPT "creators". It would at least be nice to have a toggle for developers who <i>don't</i> feel proprietary about those details. And maybe a fork button :-)</p>
]]></description><pubDate>Wed, 10 Jan 2024 23:52:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=38945337</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=38945337</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38945337</guid></item><item><title><![CDATA[New comment by JoshMandel in "Pushing ChatGPT's Structured Data Support to Its Limits"]]></title><description><![CDATA[
<p>Yes -- the distinction with "function calling" is that you have to play a game of telephone where you describe your target schema in JSON Schema (only, apparently, for OpenAI to turn it into a TypeScript interface internally) vs describing it more directly and succinctly (with opportunities to include inline comments, order fields however you want, and use advanced TS features... or even use an ad-hoc schema "language").</p>
]]></description><pubDate>Wed, 27 Dec 2023 16:58:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=38783901</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=38783901</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38783901</guid></item><item><title><![CDATA[New comment by JoshMandel in "Pushing ChatGPT's Structured Data Support to Its Limits"]]></title><description><![CDATA[
<p>FWIW, I've seen stronger performance from gpt-4-1106-preview when I use `response_format: { type: "json_object" },` (providing a target typescript interface in context), vs the "tools" API.<p>More flexible, and (evaluating non-scientifically!) qualitatively better answers & instruction following -- particularly for deeply nested or complex schemas, which typescript expresses very clearly and succinctly.<p>Example from a hack week project earlier this month (using a TS-ish schema description that's copy/pasted from healthcare's FHIR standard): <a href="https://github.com/microsoft-healthcare-madison/hackweek-2023-12/blob/c0e8b0235409ee173b8b1cbb24007427ea66707d/day-1-fhir-questionnaire-creation/src/prompts/generate.js#L139">https://github.com/microsoft-healthcare-madison/hackweek-202...</a><p>Or a more complex example with one model call to invent a TS schema on-the-fly and another call to abstract clinical data into it: <a href="https://github.com/microsoft-healthcare-madison/hackweek-2023-12/blob/c0e8b0235409ee173b8b1cbb24007427ea66707d/day-2-answer-suggestion/outputs/provide-all-details-needed-to-determine-the-breast-cancer-stage.md">https://github.com/microsoft-healthcare-madison/hackweek-202...</a></p>
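The prompt-assembly side of this pattern looks something like the following (the `Recipe` interface and the system-message wording are invented for illustration; the actual model call with `response_format: { type: "json_object" }` is stubbed out):

```typescript
// Use a TypeScript interface, inline comments and all, as the schema the
// model is asked to conform to in json_object mode.
const schema = `
interface Recipe {
  title: string;
  ingredients: { name: string; quantity: string }[]; // inline comments allowed
  steps: string[];
}`;

function buildMessages(userText: string) {
  return [
    {
      role: "system",
      content:
        "Respond with a single JSON object conforming to this TypeScript interface:\n" +
        schema,
    },
    { role: "user", content: userText },
  ];
}

const messages = buildMessages("Transcribe this recipe: ...");
```

These `messages` then go to the chat completions API along with `response_format: { type: "json_object" }`, and the reply is parsed with `JSON.parse`.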
]]></description><pubDate>Wed, 27 Dec 2023 16:42:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=38783709</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=38783709</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38783709</guid></item><item><title><![CDATA[New comment by JoshMandel in "Show HN: Talk Paper Scissors"]]></title><description><![CDATA[
<p>ChatGPT 4 can not only play, it can design and implement a commitment scheme to make the game more interesting (...as long as you don't peek at the code interpreter output -- that's foul play... and the entropy in that nonce is suspect ;-))<p>Opponent conversation:
<a href="https://chat.openai.com/share/f545b885-4176-4824-831b-7680cc8d240c" rel="nofollow noreferrer">https://chat.openai.com/share/f545b885-4176-4824-831b-7680cc...</a><p>Helper conversation: <a href="https://chat.openai.com/share/1d2455f3-3dbc-40c9-a5ae-3b9f1aca90a8" rel="nofollow noreferrer">https://chat.openai.com/share/1d2455f3-3dbc-40c9-a5ae-3b9f1a...</a></p>
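The commitment scheme it came up with boils down to something like this sketch (a generic hash commitment, not the exact code from the linked conversation):

```typescript
import { createHash, randomBytes } from "node:crypto";

// Commit to a move with H(move || nonce); reveal both after the opponent
// has moved. Without a high-entropy nonce the commitment is guessable --
// which is exactly the suspect-entropy jab above.
function commit(move: string, nonce: string): string {
  return createHash("sha256").update(`${move}:${nonce}`).digest("hex");
}

const nonce = randomBytes(32).toString("hex"); // don't skimp on entropy
const commitment = commit("rock", nonce);

// After both players have committed, reveal (move, nonce) and verify:
const valid = commit("rock", nonce) === commitment;
const forged = commit("paper", nonce) === commitment; // changing the move fails
```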
]]></description><pubDate>Sun, 24 Dec 2023 00:03:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=38749563</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=38749563</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38749563</guid></item><item><title><![CDATA[New comment by JoshMandel in "Climbing 50 steps a day can cut your risk of heart disease"]]></title><description><![CDATA[
<p>> So... we're actually worse off<p>Careful about inferring causality here.<p>What kind of active person suddenly stops being active? You presumably don't want to be that kind of person... but the stopping isn't necessarily a causal factor of the cardiovascular events (and the initial "being active" seems still far less likely to be a causal factor).</p>
]]></description><pubDate>Sun, 26 Nov 2023 23:23:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=38426197</link><dc:creator>JoshMandel</dc:creator><comments>https://news.ycombinator.com/item?id=38426197</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38426197</guid></item></channel></rss>