New comment by tangweigang in "Lessons from Building Evals for Financial AI Agents"

tangweigang — Mon, 22 Jun 2026 09:05:53 +0000

A useful distinction would be whether the agent ships with an evaluation surface, not just a workflow surface.

For finance, I would look for: the exact task class it claims to handle, the data snapshot used for each answer, the tool calls it was allowed to make, a failure taxonomy, and examples where the agent chooses not to answer. If those are visible, the agent is much easier to compare with other finance agents. If they are not, the difference between agents is mostly UI and product positioning.
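
To make the checklist concrete, here is a minimal sketch in Python of what such an evaluation surface could look like as a record. The names `EvalSurface` and `missing_items` and the field names are hypothetical, chosen only to mirror the five items above; they do not come from any existing tool or agent.

```python
from dataclasses import dataclass, fields
from typing import List, Tuple


@dataclass(frozen=True)
class EvalSurface:
    """Hypothetical record of what an agent exposes for evaluation."""
    task_class: str = ""                         # exact task class the agent claims to handle
    data_snapshot: str = ""                      # identifier of the data used for an answer
    allowed_tool_calls: Tuple[str, ...] = ()     # tools the agent was permitted to call
    failure_taxonomy: Tuple[str, ...] = ()       # named failure categories
    abstention_examples: Tuple[str, ...] = ()    # cases where the agent declined to answer


def missing_items(surface: EvalSurface) -> List[str]:
    """Return the names of the items that are empty, i.e. not visible."""
    return [f.name for f in fields(surface) if not getattr(surface, f.name)]


# Illustrative values only: an agent that publishes two of the five items.
partial = EvalSurface(
    task_class="ratio extraction from annual filings",
    data_snapshot="filings-snapshot-A",
)
print(missing_items(partial))
```

An empty result from `missing_items` would mean all five items are visible, which is the case where comparing agents becomes practical. A non-empty result lists what a reviewer would still have to ask for.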

Hacker News: tangweigang