<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: a1j9o94</title><link>https://news.ycombinator.com/user?id=a1j9o94</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Fri, 17 Apr 2026 16:14:10 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=a1j9o94" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by a1j9o94 in "Codex for almost everything"]]></title><description><![CDATA[
<p>Disclaimer: I work at Zapier, but we're doing a ton of this. I have an agent that runs every morning and creates prep documents for my calls, then a separate one that runs at the end of every week to give me feedback.</p>
]]></description><pubDate>Thu, 16 Apr 2026 20:43:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=47799229</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=47799229</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47799229</guid></item><item><title><![CDATA[New comment by a1j9o94 in "Codex for almost everything"]]></title><description><![CDATA[
<p>This is effectively how I treat my AI agents. A lot of the reason this doesn't work well for people today is the context/memory/harness management, which makes it too complex to set up unless you want a full-time second job or just like to tinker.<p>If you productize that, it will be an experience a lot of people like.<p>And on the UI piece, I think most people will just interact through text and voice interfaces, wherever they already spend time: SMS, WhatsApp, etc.</p>
]]></description><pubDate>Thu, 16 Apr 2026 20:38:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=47799188</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=47799188</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47799188</guid></item><item><title><![CDATA[New comment by a1j9o94 in "Introspective Diffusion Language Models"]]></title><description><![CDATA[
<p>You would only use the base model during training. This is a distillation technique.</p>
]]></description><pubDate>Tue, 14 Apr 2026 11:33:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=47764230</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=47764230</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47764230</guid></item><item><title><![CDATA[New comment by a1j9o94 in "Convincing Is Not Persuading"]]></title><description><![CDATA[
<p>I fall into this trap a lot. The platonic ideal argument is a fun mental exercise, but it doesn't get anything done.</p>
]]></description><pubDate>Sun, 22 Mar 2026 14:36:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=47477962</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=47477962</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47477962</guid></item><item><title><![CDATA[New comment by a1j9o94 in "Show HN: VR.dev – Open-source verifiers for what AI agents did"]]></title><description><![CDATA[
<p>This is an interesting space. Right now we've gotten to the point where agents can do most tasks, but they will get lazy and skip steps if you're not precise in the requirements. We need ways to validate that expand beyond software tests. This is a good direction, but a few thoughts:
1. From what I can tell, the agent that does the task is also running the validation. Keeping the validator in a separate context keeps it from conflating what the software is supposed to do with what it actually does (see the sketch below).
2. There's a lot of prior art around org structures for validating things, built out over the last ~100 years, that we can apply in this space. E.g. look at the way blind trials are run.</p>
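<p>A rough sketch of what I mean in point 1, with a hypothetical run_agent helper (this is not VR.dev's actual API):<pre><code># Hypothetical harness sketch (not VR.dev's API): the validator runs in a
# fresh context and never sees the task agent's transcript, only the spec
# and the observed diff, roughly like a blinded reviewer.

def run_agent(system_prompt: str, messages: list[str]) -> str:
    """Stand-in for whatever LLM call your harness makes."""
    raise NotImplementedError

def build_then_validate(spec: str, get_repo_diff) -> str:
    # 1. The task agent sees the spec and does the work.
    run_agent("You are the implementer. Complete the task.", [spec])

    # 2. The validator gets a separate context: the spec plus the
    #    observed diff, not the implementer's claims about it.
    diff = get_repo_diff()
    return run_agent(
        "You are the validator. Judge what the code actually does "
        "(the diff) against what it is supposed to do (the spec).",
        [spec, diff],
    )
</code></pre>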
]]></description><pubDate>Wed, 11 Mar 2026 14:03:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=47335713</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=47335713</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47335713</guid></item><item><title><![CDATA[Show HN: Sales Agent Benchmark – SWE-Bench for sales AI agents (open source)]]></title><description><![CDATA[
<p>Live leaderboard: <a href="https://sales-agent-benchmarks.fly.dev/benchmark" rel="nofollow">https://sales-agent-benchmarks.fly.dev/benchmark</a>
GitHub: <a href="https://github.com/a1j9o94/sales-agent-benchmark" rel="nofollow">https://github.com/a1j9o94/sales-agent-benchmark</a><p>I built an open-source benchmark for evaluating LLMs as sales agents. The idea came from noticing that every sales AI tool demos well on clean summaries but falls apart on real deal data — and there was no rigorous way to measure that gap.<p>How it works<p>You register an API endpoint. We send your agent deal context (anonymized real B2B deals), it returns structured recommendations (risks, next steps, stakeholder analysis). A multi-judge panel (Claude, GPT, Gemini via OpenRouter) scores against ground truth — what actually happened in the deal.<p>Two evaluation modes:<p>Summary Benchmark — Pre-digested checkpoint summaries. Single-turn. 15 deals, 36 checkpoints, 4 scoring dimensions. Models score 68–81%. This is the easy mode.<p>Artifact-Based Benchmark — Raw call transcripts, email threads, CRM snapshots, Slack messages, documents. Multi-turn (agent can request specific artifacts before answering). 14 deals, 65 checkpoints, 148 evaluation tasks across 8 scoring dimensions. Models score 26–38%.<p>Every model we tested drops roughly in half when switching from summaries to real artifacts.<p>The interesting findings<p>Risk Identification collapses. Best model goes from 8.0/10 on summaries to 2.3/10 on real data. Models confidently identify risks that don't exist in the source material.<p>Hallucinated stakeholders. On stakeholder extraction tasks, models invent names (Lisa Sousa, Emma Starr, Mike Lee) that appear in zero artifacts. The actual stakeholders are in the transcripts — models just don't extract them.<p>Structured frameworks survive. MEDDPICC qualification scoring holds up at 7.5/10. Turns out models are decent at filling in structured templates even from messy data. It's the open-ended analysis that falls apart.<p>Communication quality is fine. Models score 5–8/10 on drafting follow-up emails and call summaries. The writing is good. The reasoning behind it isn't.<p>Technical details<p>Stack: Bun, TypeScript, React, Postgres (Neon), deployed on Fly.io<p>Evaluation: Task-specific judge prompts per artifact type. Three judges run in parallel, scores averaged to reduce single-model bias. Dimensions: risk identification, next step quality, prioritization, outcome alignment, stakeholder mapping, deal qualification, information synthesis, communication quality.<p>Artifact types: TranscriptArtifact (speaker-labeled turns from Granola AI), EmailArtifact (threaded messages with metadata), CrmSnapshotArtifact (HubSpot deal properties + stage history), DocumentArtifact (proposals, decks), SlackThreadArtifact, CalendarEventArtifact<p>Multi-turn protocol: Artifact-based requests include turnNumber/maxTurns. Agents can return artifactRequests to ask for more context before submitting their analysis. The benchmark runner handles the conversation loop.<p>API contract: POST your endpoint, receive { version: 2, artifacts: [...], stakeholders: [...], evaluationTask: {...} }, return structured JSON with risks, next steps, and dimension-specific analysis.<p>What I'm looking for<p>Try it. Register an endpoint and benchmark your agent: <a href="https://sales-agent-benchmarks.fly.dev/benchmark" rel="nofollow">https://sales-agent-benchmarks.fly.dev/benchmark</a><p>Data partners. The dataset is small (29 deals). 
If you have anonymized deal artifacts — call transcripts, email exports, CRM data with outcomes — I'd love to process them through the pipeline and credit you as a founding contributor.<p>Feedback on evaluation methodology. The multi-judge approach works but I'm not confident the prompts are optimal. Happy to discuss the judge prompt design in issues.<p>The gap between summary performance and real-artifact performance seems like a general problem beyond sales. If anyone's seen similar benchmark work in other domains (legal document analysis, medical records, etc.), I'd be interested to compare notes.</p>
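<p>For concreteness, here is a minimal sketch of a registered agent endpoint. It's in Python/FastAPI rather than the benchmark's own TypeScript, and any field or helper not named in the contract above (e.g. run_llm, the exact response keys) is an illustrative assumption:<pre><code># Hypothetical agent endpoint for the v2 contract described above.
# Documented pieces: version, artifacts, stakeholders, evaluationTask,
# turnNumber/maxTurns, artifactRequests. Response keys are assumptions.
from typing import Any
from fastapi import FastAPI

app = FastAPI()

def run_llm(artifacts: list, stakeholders: list, task: dict) -> dict:
    """Stand-in for your agent's actual model call."""
    return {"risks": [], "nextSteps": [], "analysis": {}}

@app.post("/agent")
async def handle(payload: dict[str, Any]) -> dict[str, Any]:
    artifacts = payload.get("artifacts", [])
    turn = payload.get("turnNumber", 1)
    max_turns = payload.get("maxTurns", 1)

    # Multi-turn: if no raw transcripts arrived yet and turns remain,
    # ask the runner for more context instead of answering.
    has_transcript = any(a.get("type") == "TranscriptArtifact" for a in artifacts)
    if not has_transcript and turn < max_turns:
        return {"artifactRequests": [{"type": "TranscriptArtifact"}]}

    # Final turn: return the structured JSON the judge panel scores.
    result = run_llm(artifacts, payload.get("stakeholders", []),
                     payload.get("evaluationTask", {}))
    return {"risks": result["risks"],
            "nextSteps": result["nextSteps"],
            "analysis": result["analysis"]}
</code></pre>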
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46946742">https://news.ycombinator.com/item?id=46946742</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Mon, 09 Feb 2026 16:05:55 +0000</pubDate><link>https://sales-agent-benchmarks.fly.dev/benchmark</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46946742</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46946742</guid></item><item><title><![CDATA[New comment by a1j9o94 in "I Tried to Give AI "Imagination" to Solve Physics Problems"]]></title><description><![CDATA[
<p>Honestly just didn't think about it. Added it.</p>
]]></description><pubDate>Sun, 25 Jan 2026 16:59:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=46755745</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46755745</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46755745</guid></item><item><title><![CDATA[New comment by a1j9o94 in "I Tried to Give AI "Imagination" to Solve Physics Problems"]]></title><description><![CDATA[
<p>Hey HN,<p>I spent the last few weeks exploring whether AI systems could benefit from generating video predictions before making decisions, like how humans mentally simulate "what happens if I pour this coffee?" before acting.<p>The idea: Show an AI an image, ask "what happens if I push this?", have it generate a video prediction, then compare that prediction to reality. If the prediction looks wrong, maybe the AI could catch its own mistakes.<p>The result: Current models can't do this. But I learned some interesting things along the way.<p>What I tested:
- 7 different architectures for predicting future video frames from VLM latent space
- Whether perceptual similarity (LPIPS) between predicted and actual video correlates with correctness (see the sketch below)
- Self-correction loops where the model gets feedback on its predictions<p>Key findings:
1. VLMs can't predict the future – Every architecture I tried performed worse than just copying the current frame as the "prediction." The model understands what's in an image but can't predict what will change.
2. Visual similarity ≠ semantic correctness – This one surprised me. Wrong predictions often looked MORE similar to reality than correct ones (LPIPS correlation: 0.106). You can't use "does it look right?" to catch mistakes.
3. Some things worked – Hybrid encoders (DINOv2 + VLM) preserve spatial information that VLMs lose. VLMs understand generated video well (93% semantic retention). Small adapters (10M params) work better than large ones (100M).<p>I'm releasing this as a benchmark proposal. Video generation is improving fast; capabilities that don't exist today might emerge in future models. Seems worth tracking.<p>Links:
- Demo video: <a href="https://youtu.be/YJxDt_zCrUI" rel="nofollow">https://youtu.be/YJxDt_zCrUI</a>
- Code + paper: <a href="https://github.com/a1j9o94/foresight" rel="nofollow">https://github.com/a1j9o94/foresight</a>
- Live demo: <a href="https://foresight-demo-kappa.vercel.app" rel="nofollow">https://foresight-demo-kappa.vercel.app</a><p>Built with Qwen2.5-VL, LTX-Video, Modal (GPUs), and the Something-Something v2 dataset.<p>Happy to answer questions about the experiments or methodology.</p>
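<p>If you want to poke at finding 2 yourself, here's a minimal sketch of the LPIPS-vs-correctness check, assuming the lpips, torch, and scipy packages, with toy random frames standing in for real predictions (this is not the repo's actual evaluation code):<pre><code># Sketch: does perceptual similarity between predicted and actual frames
# track answer correctness? Toy data only; assumes `pip install lpips scipy`.
import lpips
import torch
from scipy.stats import pearsonr

loss_fn = lpips.LPIPS(net="alex")  # perceptual distance; lower = more similar

def perceptual_distance(pred: torch.Tensor, actual: torch.Tensor) -> float:
    # Both inputs are (3, H, W) frames scaled to [-1, 1].
    return loss_fn(pred.unsqueeze(0), actual.unsqueeze(0)).item()

# Toy stand-ins: (predicted frame, actual frame, was the answer correct?)
examples = [(torch.rand(3, 64, 64) * 2 - 1,
             torch.rand(3, 64, 64) * 2 - 1,
             i % 2) for i in range(20)]

distances = [perceptual_distance(p, a) for p, a, _ in examples]
correct = [c for _, _, c in examples]

# A near-zero correlation (0.106 in the findings above) means "looks
# right" carries almost no signal about "is right".
r, _ = pearsonr(distances, correct)
print(f"LPIPS vs correctness: r = {r:.3f}")
</code></pre>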
]]></description><pubDate>Sun, 25 Jan 2026 16:09:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=46755306</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46755306</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46755306</guid></item><item><title><![CDATA[I Tried to Give AI "Imagination" to Solve Physics Problems]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/a1j9o94/foresight">https://github.com/a1j9o94/foresight</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46755305">https://news.ycombinator.com/item?id=46755305</a></p>
<p>Points: 2</p>
<p># Comments: 3</p>
]]></description><pubDate>Sun, 25 Jan 2026 16:09:41 +0000</pubDate><link>https://github.com/a1j9o94/foresight</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46755305</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46755305</guid></item><item><title><![CDATA[New comment by a1j9o94 in "AI Police Reports: Year in Review"]]></title><description><![CDATA[
<p>Pretty much every major LLM client has web search built in. They aren't just using what's in their weights to generate the answers.<p>When it gives you a link, it literally takes you to the part of the page that it got its answer from. That's how we can quickly validate.</p>
]]></description><pubDate>Sat, 27 Dec 2025 15:17:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=46402439</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46402439</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46402439</guid></item><item><title><![CDATA[New comment by a1j9o94 in "I know you didn't write this"]]></title><description><![CDATA[
<p>I would argue that's just your coworker giving you a bad answer. If you prompt a chatbot with the right business context, look at what it spits out, and layer in your judgment before you hit send, then it's fine if the AI typed it out.<p>If they answer your question with irrelevant context, then that's the problem, not that it was AI.</p>
]]></description><pubDate>Mon, 22 Dec 2025 20:07:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=46358378</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46358378</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46358378</guid></item><item><title><![CDATA[New comment by a1j9o94 in "I know you didn't write this"]]></title><description><![CDATA[
<p>Honestly, if you have a working relationship/communication norms where that's expected, I agree: just send the 5 bullets.<p>In most of my work contexts, people want more formal documents with clean headings and titles, and detailed risks, even if they're the same risks we've put on every project.</p>
]]></description><pubDate>Mon, 22 Dec 2025 20:04:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=46358335</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46358335</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46358335</guid></item><item><title><![CDATA[New comment by a1j9o94 in "I know you didn't write this"]]></title><description><![CDATA[
<p>I know I'm an outlier on HN, but I really don't care if AI was used to write something I'm reading. I just care whether or not the ideas are good and clear. And if we're talking about work output, 99% of what people were putting out before AI wasn't particularly good. In my genuine experience, AI's output is better than the things people I worked with would spend hours and days on.<p>I feel like more time is wasted trying to catch your coworkers using AI than just engaging with the plan. If it's a bad plan, say that and make sure your coworker is held accountable for presenting a bad plan. But it shouldn't matter if he gave 5 bullets to ChatGPT and it expanded them into a full page with a detailed plan.</p>
]]></description><pubDate>Mon, 22 Dec 2025 19:09:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=46357566</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46357566</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46357566</guid></item><item><title><![CDATA[New comment by a1j9o94 in "History LLMs: Models trained exclusively on pre-1913 texts"]]></title><description><![CDATA[
<p>Not the person you're responding to, but I think there's a non-trivial argument that our thoughts are just autocomplete: what is the next most likely word, based on what you're seeing? Ever watched a movie and guessed the plot? Or read a comment and known where it was going by the end?<p>And I know not everyone thinks in a literal stream of words all the time (I do), but I would argue that those people's brains are just using a different "token".</p>
]]></description><pubDate>Fri, 19 Dec 2025 05:57:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=46322678</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46322678</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46322678</guid></item><item><title><![CDATA[New comment by a1j9o94 in "Gemini 3 Pro: the frontier of vision AI"]]></title><description><![CDATA[
<p>Having one tool that you can use to do all of these things makes a big difference. If I'm a financial analyst at a company, I don't need to know how to implement and use 5 different specialized ML models; I can just ask one tool (which can still use tools on the backend to complete the task efficiently).</p>
]]></description><pubDate>Sat, 06 Dec 2025 14:08:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=46173481</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46173481</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46173481</guid></item><item><title><![CDATA[New comment by a1j9o94 in "Google Antigravity exfiltrates data via indirect prompt injection attack"]]></title><description><![CDATA[
<p>yy</p>
]]></description><pubDate>Tue, 25 Nov 2025 22:29:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=46051605</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46051605</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46051605</guid></item><item><title><![CDATA[New comment by a1j9o94 in "Kiro: A new agentic IDE"]]></title><description><![CDATA[
<p>The above is saying more precise, not completely precise. The overall point they're making is that you are still responsible for the code you commit.<p>If they're saying the code in this project was in line with what they would have written, I lean towards trusting their assessment.</p>
]]></description><pubDate>Tue, 15 Jul 2025 02:51:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=44567427</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=44567427</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44567427</guid></item><item><title><![CDATA[New comment by a1j9o94 in "At Amazon, some coders say their jobs have begun to resemble warehouse work"]]></title><description><![CDATA[
<p>Why do you say that? I would argue that as long as your tests and interfaces are clearly defined, there's no reason it couldn't scale indefinitely.</p>
]]></description><pubDate>Mon, 26 May 2025 13:09:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=44097119</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=44097119</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44097119</guid></item><item><title><![CDATA[New comment by a1j9o94 in "At Amazon, some coders say their jobs have begun to resemble warehouse work"]]></title><description><![CDATA[
<p>I agree with this completely. I get the impression that a lot of people here think of software development as a craft, which is great for your own learning and development but not relevant from the company's perspective. It just has to work well enough.<p>Your point about management being vibe coding is spot on. I have hired people to build something and just had to hope that they built it the way I wanted. I honestly feel like AI is better than most of the outsourced code work I've paid for.<p>One last piece: if anyone does have trouble getting value out of AI tools, I would encourage you to talk to/guide them like you would a junior team member. Actually "discuss" what you're trying to accomplish, lay out a plan, build your tests, and only then start working on the output. Most examples I see of people trying to get AI to do things fail because of poor communication.</p>
]]></description><pubDate>Mon, 26 May 2025 13:08:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=44097107</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=44097107</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44097107</guid></item><item><title><![CDATA[New comment by a1j9o94 in "At Amazon, some coders say their jobs have begun to resemble warehouse work"]]></title><description><![CDATA[
<p>The point is that devs aren't sales/client facing, so from the customer's perspective, it's just a delivery detail.</p>
]]></description><pubDate>Mon, 26 May 2025 12:54:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=44097016</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=44097016</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44097016</guid></item></channel></rss>