<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: a1j9o94</title><link>https://news.ycombinator.com/user?id=a1j9o94</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Fri, 17 Apr 2026 16:14:10 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=a1j9o94" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by a1j9o94 in "Codex for almost everything"]]></title><description><![CDATA[
<p>Disclaimer: I work at Zapier, but we're doing a ton of this. I have an agent that runs every morning and creates prep documents for my calls, then a separate one that runs at the end of every week to give me feedback.</p>
]]></description><pubDate>Thu, 16 Apr 2026 20:43:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=47799229</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=47799229</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47799229</guid></item><item><title><![CDATA[New comment by a1j9o94 in "Codex for almost everything"]]></title><description><![CDATA[
<p>This is effectively how I treat my AI agents. A lot of the reason this doesn't work well for people today is the context/memory/harness management, which makes it too complex to set up unless you want a full-time second job or just like to tinker.<p>If you productize that, it will be an experience a lot of people like.<p>And on the UI piece, I think most people will just interact through text and voice interfaces, wherever they already spend time: SMS, WhatsApp, etc.</p>
]]></description><pubDate>Thu, 16 Apr 2026 20:38:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=47799188</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=47799188</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47799188</guid></item><item><title><![CDATA[New comment by a1j9o94 in "Introspective Diffusion Language Models"]]></title><description><![CDATA[
<p>You would only use the base model during training. This is a distillation technique.</p>
]]></description><pubDate>Tue, 14 Apr 2026 11:33:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=47764230</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=47764230</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47764230</guid></item><item><title><![CDATA[New comment by a1j9o94 in "Convincing Is Not Persuading"]]></title><description><![CDATA[
<p>I fall into this trap a lot. The platonic ideal argument is a fun mental exercise, but it doesn't get anything done.</p>
]]></description><pubDate>Sun, 22 Mar 2026 14:36:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=47477962</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=47477962</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47477962</guid></item><item><title><![CDATA[New comment by a1j9o94 in "Show HN: VR.dev – Open-source verifiers for what AI agents did"]]></title><description><![CDATA[
<p>This is an interesting space. Right now we've gotten to the point where agents can do most tasks, but they will get lazy and skip steps if you're not precise in the requirements. We need ways to validate that expand beyond software tests. This is a good direction, but a few thoughts:
1. From what I can tell, the agent that does the task is also running the validation. Keeping the validator in a separate context keeps it from conflating what the software is supposed to do with what it actually does (see the sketch below).
2. There's a lot of prior art around org structures for validating things, built out over the last ~100 years, that we can apply in this space. E.g. look at the way blind trials are run.</p>
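<p>A rough sketch of what I mean in point 1, with a hypothetical run_agent helper (this is not VR.dev's actual API):<pre><code># Hypothetical harness sketch (not VR.dev's API): the validator runs in a
# fresh context and never sees the task agent's transcript, only the spec
# and the observed diff, roughly like a blinded reviewer.

def run_agent(system_prompt: str, messages: list[str]) -> str:
    """Stand-in for whatever LLM call your harness makes."""
    raise NotImplementedError

def build_then_validate(spec: str, get_repo_diff) -> str:
    # 1. The task agent sees the spec and does the work.
    run_agent("You are the implementer. Complete the task.", [spec])

    # 2. The validator gets a separate context: the spec plus the
    #    observed diff, not the implementer's claims about it.
    diff = get_repo_diff()
    return run_agent(
        "You are the validator. Judge what the code actually does "
        "(the diff) against what it is supposed to do (the spec).",
        [spec, diff],
    )
</code></pre>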
]]></description><pubDate>Wed, 11 Mar 2026 14:03:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=47335713</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=47335713</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47335713</guid></item><item><title><![CDATA[Show HN: Sales Agent Benchmark – SWE-Bench for sales AI agents (open source)]]></title><description><![CDATA[
<p>Live leaderboard: <a href="https://sales-agent-benchmarks.fly.dev/benchmark" rel="nofollow">https://sales-agent-benchmarks.fly.dev/benchmark</a>
GitHub: <a href="https://github.com/a1j9o94/sales-agent-benchmark" rel="nofollow">https://github.com/a1j9o94/sales-agent-benchmark</a><p>I built an open-source benchmark for evaluating LLMs as sales agents. The idea came from noticing that every sales AI tool demos well on clean summaries but falls apart on real deal data — and there was no rigorous way to measure that gap.<p>How it works<p>You register an API endpoint. We send your agent deal context (anonymized real B2B deals), it returns structured recommendations (risks, next steps, stakeholder analysis). A multi-judge panel (Claude, GPT, Gemini via OpenRouter) scores against ground truth — what actually happened in the deal.<p>Two evaluation modes:<p>Summary Benchmark — Pre-digested checkpoint summaries. Single-turn. 15 deals, 36 checkpoints, 4 scoring dimensions. Models score 68–81%. This is the easy mode.<p>Artifact-Based Benchmark — Raw call transcripts, email threads, CRM snapshots, Slack messages, documents. Multi-turn (agent can request specific artifacts before answering). 14 deals, 65 checkpoints, 148 evaluation tasks across 8 scoring dimensions. Models score 26–38%.<p>Every model we tested drops roughly in half when switching from summaries to real artifacts.<p>The interesting findings<p>Risk Identification collapses. Best model goes from 8.0/10 on summaries to 2.3/10 on real data. Models confidently identify risks that don't exist in the source material.<p>Hallucinated stakeholders. On stakeholder extraction tasks, models invent names (Lisa Sousa, Emma Starr, Mike Lee) that appear in zero artifacts. The actual stakeholders are in the transcripts — models just don't extract them.<p>Structured frameworks survive. MEDDPICC qualification scoring holds up at 7.5/10. Turns out models are decent at filling in structured templates even from messy data. It's the open-ended analysis that falls apart.<p>Communication quality is fine. Models score 5–8/10 on drafting follow-up emails and call summaries. The writing is good. The reasoning behind it isn't.<p>Technical details<p>Stack: Bun, TypeScript, React, Postgres (Neon), deployed on Fly.io<p>Evaluation: Task-specific judge prompts per artifact type. Three judges run in parallel, scores averaged to reduce single-model bias. Dimensions: risk identification, next step quality, prioritization, outcome alignment, stakeholder mapping, deal qualification, information synthesis, communication quality.<p>Artifact types: TranscriptArtifact (speaker-labeled turns from Granola AI), EmailArtifact (threaded messages with metadata), CrmSnapshotArtifact (HubSpot deal properties + stage history), DocumentArtifact (proposals, decks), SlackThreadArtifact, CalendarEventArtifact<p>Multi-turn protocol: Artifact-based requests include turnNumber/maxTurns. Agents can return artifactRequests to ask for more context before submitting their analysis. The benchmark runner handles the conversation loop.<p>API contract: POST your endpoint, receive { version: 2, artifacts: [...], stakeholders: [...], evaluationTask: {...} }, return structured JSON with risks, next steps, and dimension-specific analysis.<p>What I'm looking for<p>Try it. Register an endpoint and benchmark your agent: <a href="https://sales-agent-benchmarks.fly.dev/benchmark" rel="nofollow">https://sales-agent-benchmarks.fly.dev/benchmark</a><p>Data partners. The dataset is small (29 deals). 
If you have anonymized deal artifacts — call transcripts, email exports, CRM data with outcomes — I'd love to process them through the pipeline and credit you as a founding contributor.<p>Feedback on evaluation methodology. The multi-judge approach works but I'm not confident the prompts are optimal. Happy to discuss the judge prompt design in issues.<p>The gap between summary performance and real-artifact performance seems like a general problem beyond sales. If anyone's seen similar benchmark work in other domains (legal document analysis, medical records, etc.), I'd be interested to compare notes.</p>
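<p>For concreteness, here is a minimal sketch of a registered agent endpoint. It's in Python/FastAPI rather than the benchmark's own TypeScript, and any field or helper not named in the contract above (e.g. run_llm, the exact response keys) is an illustrative assumption:<pre><code># Hypothetical agent endpoint for the v2 contract described above.
# Documented pieces: version, artifacts, stakeholders, evaluationTask,
# turnNumber/maxTurns, artifactRequests. Response keys are assumptions.
from typing import Any
from fastapi import FastAPI

app = FastAPI()

def run_llm(artifacts: list, stakeholders: list, task: dict) -> dict:
    """Stand-in for your agent's actual model call."""
    return {"risks": [], "nextSteps": [], "analysis": {}}

@app.post("/agent")
async def handle(payload: dict[str, Any]) -> dict[str, Any]:
    artifacts = payload.get("artifacts", [])
    turn = payload.get("turnNumber", 1)
    max_turns = payload.get("maxTurns", 1)

    # Multi-turn: if no raw transcripts arrived yet and turns remain,
    # ask the runner for more context instead of answering.
    has_transcript = any(a.get("type") == "TranscriptArtifact" for a in artifacts)
    if not has_transcript and turn < max_turns:
        return {"artifactRequests": [{"type": "TranscriptArtifact"}]}

    # Final turn: return the structured JSON the judge panel scores.
    result = run_llm(artifacts, payload.get("stakeholders", []),
                     payload.get("evaluationTask", {}))
    return {"risks": result["risks"],
            "nextSteps": result["nextSteps"],
            "analysis": result["analysis"]}
</code></pre>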
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46946742">https://news.ycombinator.com/item?id=46946742</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Mon, 09 Feb 2026 16:05:55 +0000</pubDate><link>https://sales-agent-benchmarks.fly.dev/benchmark</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46946742</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46946742</guid></item><item><title><![CDATA[New comment by a1j9o94 in "I Tried to Give AI "Imagination" to Solve Physics Problems"]]></title><description><![CDATA[
<p>Honestly just didn't think about it. Added it.</p>
]]></description><pubDate>Sun, 25 Jan 2026 16:59:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=46755745</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46755745</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46755745</guid></item><item><title><![CDATA[New comment by a1j9o94 in "I Tried to Give AI "Imagination" to Solve Physics Problems"]]></title><description><![CDATA[
<p>Hey HN,<p>I spent the last few weeks exploring whether AI systems could benefit from generating video predictions before making decisions, like how humans mentally simulate "what happens if I pour this coffee?" before acting.<p>The idea: Show an AI an image, ask "what happens if I push this?", have it generate a video prediction, then compare that prediction to reality. If the prediction looks wrong, maybe the AI could catch its own mistakes.<p>The result: Current models can't do this. But I learned some interesting things along the way.<p>What I tested:
- 7 different architectures for predicting future video frames from VLM latent space
- Whether perceptual similarity (LPIPS) between predicted and actual video correlates with correctness (see the sketch below)
- Self-correction loops where the model gets feedback on its predictions<p>Key findings:
1. VLMs can't predict the future – Every architecture I tried performed worse than just copying the current frame as the "prediction." The model understands what's in an image but can't predict what will change.
2. Visual similarity ≠ semantic correctness – This one surprised me. Wrong predictions often looked MORE similar to reality than correct ones (LPIPS correlation: 0.106). You can't use "does it look right?" to catch mistakes.
3. Some things worked – Hybrid encoders (DINOv2 + VLM) preserve spatial information that VLMs lose. VLMs understand generated video well (93% semantic retention). Small adapters (10M params) work better than large ones (100M).<p>I'm releasing this as a benchmark proposal. Video generation is improving fast; capabilities that don't exist today might emerge in future models. Seems worth tracking.<p>Links:
- Demo video: <a href="https://youtu.be/YJxDt_zCrUI" rel="nofollow">https://youtu.be/YJxDt_zCrUI</a>
- Code + paper: <a href="https://github.com/a1j9o94/foresight" rel="nofollow">https://github.com/a1j9o94/foresight</a>
- Live demo: <a href="https://foresight-demo-kappa.vercel.app" rel="nofollow">https://foresight-demo-kappa.vercel.app</a><p>Built with Qwen2.5-VL, LTX-Video, Modal (GPUs), and the Something-Something v2 dataset.<p>Happy to answer questions about the experiments or methodology.</p>
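<p>If you want to poke at finding 2 yourself, here's a minimal sketch of the LPIPS-vs-correctness check, assuming the lpips, torch, and scipy packages, with toy random frames standing in for real predictions (this is not the repo's actual evaluation code):<pre><code># Sketch: does perceptual similarity between predicted and actual frames
# track answer correctness? Toy data only; assumes `pip install lpips scipy`.
import lpips
import torch
from scipy.stats import pearsonr

loss_fn = lpips.LPIPS(net="alex")  # perceptual distance; lower = more similar

def perceptual_distance(pred: torch.Tensor, actual: torch.Tensor) -> float:
    # Both inputs are (3, H, W) frames scaled to [-1, 1].
    return loss_fn(pred.unsqueeze(0), actual.unsqueeze(0)).item()

# Toy stand-ins: (predicted frame, actual frame, was the answer correct?)
examples = [(torch.rand(3, 64, 64) * 2 - 1,
             torch.rand(3, 64, 64) * 2 - 1,
             i % 2) for i in range(20)]

distances = [perceptual_distance(p, a) for p, a, _ in examples]
correct = [c for _, _, c in examples]

# A near-zero correlation (0.106 in the findings above) means "looks
# right" carries almost no signal about "is right".
r, _ = pearsonr(distances, correct)
print(f"LPIPS vs correctness: r = {r:.3f}")
</code></pre>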
]]></description><pubDate>Sun, 25 Jan 2026 16:09:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=46755306</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46755306</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46755306</guid></item><item><title><![CDATA[I Tried to Give AI "Imagination" to Solve Physics Problems]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/a1j9o94/foresight">https://github.com/a1j9o94/foresight</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46755305">https://news.ycombinator.com/item?id=46755305</a></p>
<p>Points: 2</p>
<p># Comments: 3</p>
]]></description><pubDate>Sun, 25 Jan 2026 16:09:41 +0000</pubDate><link>https://github.com/a1j9o94/foresight</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46755305</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46755305</guid></item><item><title><![CDATA[New comment by a1j9o94 in "AI Police Reports: Year in Review"]]></title><description><![CDATA[
<p>Pretty much every major LLM client has web search built in. They aren't just using what's in their weights to generate the answers.<p>When it gives you a link, it literally takes you to the part of the page that it got its answer from. That's how we can quickly validate.</p>
]]></description><pubDate>Sat, 27 Dec 2025 15:17:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=46402439</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46402439</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46402439</guid></item><item><title><![CDATA[New comment by a1j9o94 in "I know you didn't write this"]]></title><description><![CDATA[
<p>I would argue that's just your coworker giving you a bad answer. If you prompt a chatbot with the right business context, look at what it spits out, and layer in your judgment before you hit send, then it's fine if the AI typed it out.<p>If they answer your question with irrelevant context, then that's the problem, not that it was AI.</p>
]]></description><pubDate>Mon, 22 Dec 2025 20:07:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=46358378</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46358378</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46358378</guid></item><item><title><![CDATA[New comment by a1j9o94 in "I know you didn't write this"]]></title><description><![CDATA[
<p>Honestly, if you have a working relationship/communication norms where that's expected, I agree: just send the 5 bullets.<p>In most of my work contexts, people want more formal documents with clean headings and titles, and detailed risks, even if they're the same risks we've put on every project.</p>
]]></description><pubDate>Mon, 22 Dec 2025 20:04:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=46358335</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46358335</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46358335</guid></item><item><title><![CDATA[New comment by a1j9o94 in "I know you didn't write this"]]></title><description><![CDATA[
<p>I know I'm an outlier on HN, but I really don't care if AI was used to write something I'm reading. I just care whether or not the ideas are good and clear. And if we're talking about work output, 99% of what people were putting out before AI wasn't particularly good. In my genuine experience, AI's output is better than the things people I worked with would spend hours and days on.<p>I feel like more time is wasted trying to catch your coworkers using AI than just engaging with the plan. If it's a bad plan, say that and make sure your coworker is held accountable for presenting a bad plan. But it shouldn't matter if he gave 5 bullets to ChatGPT and it expanded them into a full page with a detailed plan.</p>
]]></description><pubDate>Mon, 22 Dec 2025 19:09:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=46357566</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46357566</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46357566</guid></item><item><title><![CDATA[New comment by a1j9o94 in "History LLMs: Models trained exclusively on pre-1913 texts"]]></title><description><![CDATA[
<p>Not the person you're responding to, but I think there's a non-trivial argument that our thoughts are just autocomplete: what is the next most likely word, based on what you're seeing? Ever watched a movie and guessed the plot? Or read a comment and known where it was going by the end?<p>And I know not everyone thinks in a literal stream of words all the time (I do), but I would argue that those people's brains are just using a different "token".</p>
]]></description><pubDate>Fri, 19 Dec 2025 05:57:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=46322678</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46322678</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46322678</guid></item><item><title><![CDATA[New comment by a1j9o94 in "Gemini 3 Pro: the frontier of vision AI"]]></title><description><![CDATA[
<p>Having one tool that you can use to do all of these things makes a big difference. If I'm a financial analyst at a company, I don't need to know how to implement and use 5 different specialized ML models; I can just ask one tool (which can still use tools on the backend to complete the task efficiently).</p>
]]></description><pubDate>Sat, 06 Dec 2025 14:08:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=46173481</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46173481</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46173481</guid></item><item><title><![CDATA[New comment by a1j9o94 in "Google Antigravity exfiltrates data via indirect prompt injection attack"]]></title><description><![CDATA[
<p>yy</p>
]]></description><pubDate>Tue, 25 Nov 2025 22:29:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=46051605</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=46051605</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46051605</guid></item><item><title><![CDATA[New comment by a1j9o94 in "Kiro: A new agentic IDE"]]></title><description><![CDATA[
<p>The above is saying more precise, not completely precise. The overall point they're making is that you are still responsible for the code you commit.<p>If they're saying the code in this project was in line with what they would have written, I lean towards trusting their assessment.</p>
]]></description><pubDate>Tue, 15 Jul 2025 02:51:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=44567427</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=44567427</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44567427</guid></item><item><title><![CDATA[New comment by a1j9o94 in "At Amazon, some coders say their jobs have begun to resemble warehouse work"]]></title><description><![CDATA[
<p>Why do you say that? I would argue that as long as your tests and interfaces are clearly defined, there's no reason it couldn't scale indefinitely.</p>
]]></description><pubDate>Mon, 26 May 2025 13:09:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=44097119</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=44097119</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44097119</guid></item><item><title><![CDATA[New comment by a1j9o94 in "At Amazon, some coders say their jobs have begun to resemble warehouse work"]]></title><description><![CDATA[
<p>I agree with this completely. I get the impression that a lot of people here think of software development as a craft, which is great for your own learning and development but not relevant from the company's perspective. It just has to work well enough.<p>Your point about management being vibe coding is spot on. I have hired people to build something and just had to hope that they built it the way I wanted. I honestly feel like AI is better than most of the outsourced code work I've paid for.<p>One last piece: if anyone does have trouble getting value out of AI tools, I would encourage you to talk to/guide them like you would a junior team member. Actually "discuss" what you're trying to accomplish, lay out a plan, build your tests, and only then start working on the output. Most examples I see of people trying to get AI to do things fail because of poor communication.</p>
]]></description><pubDate>Mon, 26 May 2025 13:08:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=44097107</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=44097107</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44097107</guid></item><item><title><![CDATA[New comment by a1j9o94 in "At Amazon, some coders say their jobs have begun to resemble warehouse work"]]></title><description><![CDATA[
<p>The point is that devs aren't sales/client facing, so from the customer's perspective, it's just a delivery detail.</p>
]]></description><pubDate>Mon, 26 May 2025 12:54:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=44097016</link><dc:creator>a1j9o94</dc:creator><comments>https://news.ycombinator.com/item?id=44097016</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44097016</guid></item></channel></rss>