<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: nikhilpareek13</title><link>https://news.ycombinator.com/user?id=nikhilpareek13</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 30 Apr 2026 08:47:55 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=nikhilpareek13" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by nikhilpareek13 in "GitHub CLI now collects pseudoanonymous telemetry"]]></title><description><![CDATA[
<p>Telemetry in a CLI is one of those things that sounds harmless until you remember how often CLIs end up inside CI, internal tooling, and security-sensitive workflows. If GitHub wanted trust from the people who use gh most, default off with a plain schema would have landed much better than pseudoanonymous by default.</p>
]]></description><pubDate>Thu, 23 Apr 2026 09:42:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=47873798</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=47873798</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47873798</guid></item><item><title><![CDATA[New comment by nikhilpareek13 in "[dead]"]]></title><description><![CDATA[
<p>We’ve been building production RAG systems and kept running into the same failure patterns, so we documented everything in a free handbook.<p>It covers hybrid retrieval (vector + BM25 with rank fusion), knowledge-graph integration, semantic/AST-based chunking, multi-stage reranking pipelines, domain-specific RAG for code/SQL/legal/medical, evaluation without ground-truth labels, agentic self-correction, and production observability.<p>118 pages, 16 chapters, free PDF. Happy to discuss any of the architectural trade-offs. Particularly interested in feedback on the hybrid retrieval section (Ch 2) and the evaluation frameworks (Ch 11).</p>
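<p>For readers unfamiliar with rank fusion: the vector + BM25 combination mentioned above is commonly done with reciprocal rank fusion (RRF). A minimal sketch, with hypothetical doc IDs standing in for real index results:</p>

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it
    appears in; k=60 is the constant from the original RRF paper.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from a vector index and a BM25 index:
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# doc_b ranks first: it is near the top of both lists.
```

<p>RRF needs no score normalization across the two retrievers, which is why it is a popular default before a learned reranker.</p>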
]]></description><pubDate>Tue, 17 Feb 2026 18:29:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=47051096</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=47051096</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47051096</guid></item><item><title><![CDATA[New comment by nikhilpareek13 in "[dead]"]]></title><description><![CDATA[
<p>Over the past few weeks, we rebuilt synthetic data generation at Future AGI.<p>Recent updates:<p>- Outputs anchored to uploaded knowledge bases<p>- ~90% adherence to source material observed<p>- 1.78× faster dataset creation (1,000+ rows in ~10 mins)<p>- Edit columns before/during/after runs<p>- Better diversity beyond 5,000 rows<p>- SOP uploads converted into structured evaluation scenarios<p>- One-click synthetic variable generation for prompt testing<p>For teams evaluating LLM systems under data constraints, this has reduced iteration friction significantly.<p>Curious how others are validating grounding + diversity at scale.</p>
]]></description><pubDate>Fri, 13 Feb 2026 18:14:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=47005785</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=47005785</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47005785</guid></item><item><title><![CDATA[New comment by nikhilpareek13 in "[dead]"]]></title><description><![CDATA[
<p>The more skilled you are at writing prompts, the more dangerous your process becomes.<p>Because you stop measuring.<p>Expert intuition works on 10 examples.
It doesn’t generalize to 10,000 inputs and three interacting failure modes.<p>When you optimize by feel:<p>- results aren’t reproducible<p>- changes aren’t versioned<p>- trade-offs aren’t quantified<p>- regressions slip in silently<p>This isn’t a prompting problem. It’s an optimization problem.<p>Treat prompts like hyperparameters.<p>Dataset → Evaluator → Optimizer → Ranked prompts.<p>Once you introduce an objective function, intuition becomes optional.<p>We wrote a cookbook that lays out the full workflow step by step for teams moving beyond manual iteration.</p>
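<p>The Dataset → Evaluator → Optimizer → Ranked prompts loop can be sketched in a few lines. The stand-in functions below are placeholders, not any particular product’s API; in practice run_model would call an LLM and evaluate would be your objective function:</p>

```python
def rank_prompts(candidate_prompts, dataset, run_model, evaluate):
    """Score each candidate prompt against a fixed dataset and rank them.

    run_model(prompt, example) -> model output (stub for a real LLM call)
    evaluate(output, example)  -> float score (the objective function)
    """
    results = []
    for prompt in candidate_prompts:
        scores = [evaluate(run_model(prompt, ex), ex) for ex in dataset]
        results.append((sum(scores) / len(scores), prompt))
    # Highest mean score first: the ranking, not intuition, picks the winner.
    return sorted(results, reverse=True)

# Toy stand-ins so the loop runs end to end:
dataset = [{"input": "2+2", "target": "4"}, {"input": "3+3", "target": "6"}]
run_model = lambda prompt, ex: ex["target"] if "exact" in prompt else "?"
evaluate = lambda out, ex: 1.0 if out == ex["target"] else 0.0
ranked = rank_prompts(["be exact", "be brief"], dataset, run_model, evaluate)
```

<p>Once the loop exists, versioning the dataset and the evaluator gives you reproducibility and regression detection for free.</p>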
]]></description><pubDate>Thu, 12 Feb 2026 19:00:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=46993378</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=46993378</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46993378</guid></item><item><title><![CDATA[New comment by nikhilpareek13 in "[dead]"]]></title><description><![CDATA[
<p>When working with image generation or vision pipelines, a common issue is that model outputs aren’t visible where the prompt is defined. Reviewing quality and comparing runs often requires exporting outputs and switching tools.<p>We’ve added native image rendering inside Datasets and Prompt Workbench so generated images appear inline, next to the prompts that produced them.<p>This allows:<p>- Faster output review<p>- Easier comparison across runs<p>- Tighter iteration loops without context switching<p>Curious how others are handling evaluation and iteration for multimodal pipelines today.</p>
]]></description><pubDate>Tue, 10 Feb 2026 20:12:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=46966124</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=46966124</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46966124</guid></item><item><title><![CDATA[Why text-based evals fail for vision-language models]]></title><description><![CDATA[
<p>Text hallucination gets most of the attention, but image hallucination may be the larger long-term problem.<p>In vision-language systems, hallucination often means inventing objects, attributes, or actions that are not present in the image at all.<p>Examples:<p>- Describing people who don’t exist
- Inferring actions that never occurred
- Assigning attributes unsupported by visual evidence<p>As these models are increasingly used for e-commerce listings, accessibility captions, document extraction, and medical imaging, the consequences escalate quickly.<p>Most evaluation pipelines are still text-centric. They don’t verify whether the generated description is actually grounded in the image.<p>Detecting image hallucination requires multimodal evaluation that reasons over both the image and the output jointly.<p>Curious how teams here are approaching hallucination detection for vision-language models today.</p>
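<p>One concrete way to evaluate jointly over image and output is to check the caption’s claimed objects against an object detector’s labels. A minimal sketch under stated assumptions (lowercase noun lists in, no synonym matching; a real pipeline would also need confidence thresholds):</p>

```python
def grounding_report(claimed_objects, detected_objects):
    """Flag caption claims that no detector output supports.

    claimed_objects:  object nouns extracted from the generated caption
    detected_objects: labels from an object detector run on the image
    Both are assumed to be lowercase strings; real pipelines would also
    handle synonyms ("couch" vs "sofa") and detector confidence.
    """
    detected = set(detected_objects)
    ungrounded = [obj for obj in claimed_objects if obj not in detected]
    return {
        "ungrounded": ungrounded,
        "hallucination_rate": len(ungrounded) / max(len(claimed_objects), 1),
    }

# Caption mentions a dog that the detector never saw:
report = grounding_report(["person", "dog", "bench"],
                          ["person", "bench", "tree"])
```

<p>This only covers object-level hallucination; attribute and action hallucinations need a model that can reason over the pixels directly.</p>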
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46508362">https://news.ycombinator.com/item?id=46508362</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 06 Jan 2026 03:17:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=46508362</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=46508362</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46508362</guid></item><item><title><![CDATA[New comment by nikhilpareek13 in "[dead]"]]></title><description><![CDATA[
<p>While working with teams building voice agents in SIMULATE, we kept seeing the same pattern:<p>PMs and engineers would run large batches of tests, then immediately jump into the workflow graph to replay calls and figure out one thing:<p>Where exactly did the agent’s path diverge from the intended flow?<p>The only way to answer that was manual flow tracing — stepping through nodes, comparing expected vs actual paths, and trying to spot the turn where behavior shifted. It was slow but necessary work.<p>We turned that repeated behavior into a feature called Flow Analysis.<p>For each test run, Flow Analysis computes and surfaces:<p>- The exact path the agent took through the workflow
- The node where it diverged from the expected path
- How the rest of the conversation evolved after that point<p>This makes debugging more of an engineering task (fix the specific node/logic/prompt) instead of an investigation across the whole graph.<p>If you’re working with voice agents or complex conversational flows and still manually scanning graphs to debug failures, we’d be interested in your thoughts on this approach.<p>More details: <a href="https://app.futureagi.com/dashboard/simulate/agent-definitions?utm_source=0112HNflowanalysis&utm_medium=organic&utm_campaign=content_distribution" rel="nofollow">https://app.futureagi.com/dashboard/simulate/agent-definitio...</a></p>
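<p>At its core, the divergence point described above is a comparison of expected vs actual node sequences. A minimal illustrative sketch (node IDs are made up, not from any real workflow):</p>

```python
def first_divergence(expected_path, actual_path):
    """Return (index, expected_node, actual_node) at the first mismatch,
    or None if the actual path followed the expected flow.

    Paths are sequences of workflow node IDs; one path ending early
    (node is None) also counts as a divergence.
    """
    for i in range(max(len(expected_path), len(actual_path))):
        exp = expected_path[i] if i < len(expected_path) else None
        act = actual_path[i] if i < len(actual_path) else None
        if exp != act:
            return (i, exp, act)
    return None

expected = ["greet", "verify_identity", "collect_issue", "resolve"]
actual = ["greet", "verify_identity", "escalate"]
div = first_divergence(expected, actual)
# The agent escalated where it should have collected the issue.
```

<p>Real flows branch, so production tooling has to compare against a graph of allowed paths rather than a single expected sequence, but the first-mismatch idea is the same.</p>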
]]></description><pubDate>Mon, 01 Dec 2025 18:18:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=46110864</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=46110864</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46110864</guid></item><item><title><![CDATA[We built a black box X-Ray for AI Agents]]></title><description><![CDATA[
<p>Article URL: <a href="https://devhunt.org/tool/agent-compass-by-future-agi">https://devhunt.org/tool/agent-compass-by-future-agi</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45883590">https://news.ycombinator.com/item?id=45883590</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 11 Nov 2025 02:42:51 +0000</pubDate><link>https://devhunt.org/tool/agent-compass-by-future-agi</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=45883590</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45883590</guid></item><item><title><![CDATA[New comment by nikhilpareek13 in "[dead]"]]></title><description><![CDATA[
<p>We built Agent Compass after hitting the same wall over and over: agents generate thousands of traces with branching tool calls and no obvious pattern. APMs (Datadog/New Relic) tell you infra health. LLM observability tools (LangSmith/Arize) show trace detail. But the gap remained: turning all that data into a fast, defensible root cause and a concrete fix.<p>What it is:<p>- Automatic error clustering for AI agents<p>- Symptom → likely root cause mapping<p>- Actionable fix suggestions you can validate with a focused eval loop<p>Why it’s different:<p>- You debug categories of failures, not one-off traces<p>- It ranks hypotheses (e.g., threshold too high, retrieval drift, prompt regression, guardrail friction)<p>- It proposes small, surgical changes you can A/B and roll back quickly<p>How it works (high level):<p>- We instrument LLM calls, tool invocations, retrieval hits, guardrail events, and outputs as spans (OpenTelemetry compatible)<p>- We build semantic signatures of failure states and cluster them<p>- We label clusters and map them to ranked hypotheses using a mix of rules and learned patterns from historical fixes<p>- We attach a minimal eval set per cluster so you can confirm the fix without re-running your whole suite</p>
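<p>To make the “semantic signatures of failure states and cluster them” step concrete, here is a minimal sketch of similarity-based clustering over failure embeddings. This is illustrative only, not Agent Compass internals: it uses tiny hand-written vectors and a greedy single pass where a real system would use learned embeddings and a proper clustering algorithm:</p>

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_signatures(signatures, threshold=0.9):
    """Greedy single-pass clustering of failure-state embeddings.

    Each signature joins the first cluster whose seed it resembles
    (cosine >= threshold); otherwise it starts a new cluster.
    Returns lists of member indices, one list per cluster.
    """
    clusters = []  # list of (seed_vector, [member_indices])
    for idx, sig in enumerate(signatures):
        for seed, members in clusters:
            if cosine(sig, seed) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((sig, [idx]))
    return [members for _, members in clusters]

# Three traces: two near-identical retrieval failures, one distinct timeout:
sigs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
groups = cluster_signatures(sigs)
```

<p>The payoff of clustering is the debugging unit: you triage one representative per cluster instead of every trace.</p>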
]]></description><pubDate>Wed, 29 Oct 2025 20:04:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=45752318</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=45752318</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45752318</guid></item><item><title><![CDATA[AI is probabilistic. Your testing can't stay deterministic]]></title><description><![CDATA[
<p>Article URL: <a href="https://docs.futureagi.com/future-agi/get-started/evaluation/evaluate-ci-cd-pipeline#evaluate-via-ci-cd-pipeline">https://docs.futureagi.com/future-agi/get-started/evaluation/evaluate-ci-cd-pipeline#evaluate-via-ci-cd-pipeline</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45330720">https://news.ycombinator.com/item?id=45330720</a></p>
<p>Points: 2</p>
<p># Comments: 2</p>
]]></description><pubDate>Mon, 22 Sep 2025 08:50:45 +0000</pubDate><link>https://docs.futureagi.com/future-agi/get-started/evaluation/evaluate-ci-cd-pipeline#evaluate-via-ci-cd-pipeline</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=45330720</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45330720</guid></item><item><title><![CDATA[The only evals that matter while agent testing are the ones you write yourself]]></title><description><![CDATA[
<p>Article URL: <a href="https://app.futureagi.com/dashboard/evaluations">https://app.futureagi.com/dashboard/evaluations</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45186499">https://news.ycombinator.com/item?id=45186499</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 09 Sep 2025 18:35:40 +0000</pubDate><link>https://app.futureagi.com/dashboard/evaluations</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=45186499</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45186499</guid></item><item><title><![CDATA[New comment by nikhilpareek13 in "From Theory to Reality: A Handbook on Scaling RAG for Enterprises"]]></title><description><![CDATA[
<p>I wrote a free handbook on enterprise RAG: not the theory, but what happens when you try to scale it in production.<p>Inside, you’ll find practical guidance on chunking methodologies, re-ranking systems, embedding techniques, hallucination control, implementation patterns, and evaluation strategies, among many other topics.<p>You’ll also learn:<p>- Frameworks to reduce hallucinations
- Enterprise evaluation practices
- ROI optimization via metrics<p>Would love feedback from this community.</p>
]]></description><pubDate>Mon, 01 Sep 2025 04:24:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=45089433</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=45089433</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45089433</guid></item><item><title><![CDATA[From Theory to Reality: A Handbook on Scaling RAG for Enterprises]]></title><description><![CDATA[
<p>Article URL: <a href="https://futureagi.com/mastering-agentic-rag">https://futureagi.com/mastering-agentic-rag</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45089432">https://news.ycombinator.com/item?id=45089432</a></p>
<p>Points: 2</p>
<p># Comments: 1</p>
]]></description><pubDate>Mon, 01 Sep 2025 04:24:18 +0000</pubDate><link>https://futureagi.com/mastering-agentic-rag</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=45089432</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45089432</guid></item></channel></rss>