<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: nikhilpareek13</title><link>https://news.ycombinator.com/user?id=nikhilpareek13</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 30 Apr 2026 08:47:55 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=nikhilpareek13" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by nikhilpareek13 in "GitHub CLI now collects pseudoanonymous telemetry"]]></title><description><![CDATA[
<p>Telemetry in a CLI is one of those things that sounds harmless until you remember how often CLIs end up inside CI, internal tooling, and security-sensitive workflows. If GitHub wanted trust from the people who use gh most, default off with a plain schema would have landed much better than pseudoanonymous by default.</p>
]]></description><pubDate>Thu, 23 Apr 2026 09:42:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=47873798</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=47873798</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47873798</guid></item><item><title><![CDATA[New comment by nikhilpareek13 in "[dead]"]]></title><description><![CDATA[
<p>We’ve been building production RAG systems and kept running into the same failure patterns, so we documented everything in a free handbook.<p>It covers hybrid retrieval (vector + BM25 with rank fusion), knowledge-graph integration, semantic/AST-based chunking, multi-stage reranking pipelines, domain-specific RAG for code/SQL/legal/medical, evaluation without ground-truth labels, agentic self-correction, and production observability.<p>118 pages, 16 chapters, free PDF. Happy to discuss any of the architectural trade-offs. Particularly interested in feedback on the hybrid retrieval section (Ch 2) and the evaluation frameworks (Ch 11).</p>
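<p>For readers unfamiliar with rank fusion: the vector + BM25 combination mentioned above is commonly done with reciprocal rank fusion (RRF). A minimal sketch, with hypothetical doc IDs standing in for real index results:</p>

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it
    appears in; k=60 is the constant from the original RRF paper.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from a vector index and a BM25 index:
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# doc_b ranks first: it is near the top of both lists.
```

<p>RRF needs no score normalization across the two retrievers, which is why it is a popular default before a learned reranker.</p>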
]]></description><pubDate>Tue, 17 Feb 2026 18:29:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=47051096</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=47051096</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47051096</guid></item><item><title><![CDATA[New comment by nikhilpareek13 in "[dead]"]]></title><description><![CDATA[
<p>Over the past few weeks, we rebuilt synthetic data generation at Future AGI.<p>Recent updates:<p>- Outputs anchored to uploaded knowledge bases<p>- ~90% adherence to source material observed<p>- 1.78× faster dataset creation (1,000+ rows in ~10 mins)<p>- Edit columns before/during/after runs<p>- Better diversity beyond 5,000 rows<p>- SOP uploads converted into structured evaluation scenarios<p>- One-click synthetic variable generation for prompt testing<p>For teams evaluating LLM systems under data constraints, this has reduced iteration friction significantly.<p>Curious how others are validating grounding + diversity at scale.</p>
]]></description><pubDate>Fri, 13 Feb 2026 18:14:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=47005785</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=47005785</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47005785</guid></item><item><title><![CDATA[New comment by nikhilpareek13 in "[dead]"]]></title><description><![CDATA[
<p>The more skilled you are at writing prompts, the more dangerous your process becomes.<p>Because you stop measuring.<p>Expert intuition works on 10 examples.
It doesn’t generalize to 10,000 inputs and three interacting failure modes.<p>When you optimize by feel:<p>- results aren’t reproducible<p>- changes aren’t versioned<p>- trade-offs aren’t quantified<p>- regressions slip in silently<p>This isn’t a prompting problem. It’s an optimization problem.<p>Treat prompts like hyperparameters.<p>Dataset → Evaluator → Optimizer → Ranked prompts.<p>Once you introduce an objective function, intuition becomes optional.<p>We wrote a cookbook that lays out the full workflow step by step for teams moving beyond manual iteration.</p>
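<p>The Dataset → Evaluator → Optimizer → Ranked prompts loop can be sketched in a few lines. The stand-in functions below are placeholders, not any particular product’s API; in practice run_model would call an LLM and evaluate would be your objective function:</p>

```python
def rank_prompts(candidate_prompts, dataset, run_model, evaluate):
    """Score each candidate prompt against a fixed dataset and rank them.

    run_model(prompt, example) -> model output (stub for a real LLM call)
    evaluate(output, example)  -> float score (the objective function)
    """
    results = []
    for prompt in candidate_prompts:
        scores = [evaluate(run_model(prompt, ex), ex) for ex in dataset]
        results.append((sum(scores) / len(scores), prompt))
    # Highest mean score first: the ranking, not intuition, picks the winner.
    return sorted(results, reverse=True)

# Toy stand-ins so the loop runs end to end:
dataset = [{"input": "2+2", "target": "4"}, {"input": "3+3", "target": "6"}]
run_model = lambda prompt, ex: ex["target"] if "exact" in prompt else "?"
evaluate = lambda out, ex: 1.0 if out == ex["target"] else 0.0
ranked = rank_prompts(["be exact", "be brief"], dataset, run_model, evaluate)
```

<p>Once the loop exists, versioning the dataset and the evaluator gives you reproducibility and regression detection for free.</p>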
]]></description><pubDate>Thu, 12 Feb 2026 19:00:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=46993378</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=46993378</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46993378</guid></item><item><title><![CDATA[New comment by nikhilpareek13 in "[dead]"]]></title><description><![CDATA[
<p>When working with image generation or vision pipelines, a common issue is that model outputs aren’t visible where the prompt is defined. Reviewing quality and comparing runs often requires exporting outputs and switching tools.<p>We’ve added native image rendering inside Datasets and Prompt Workbench so generated images appear inline, next to the prompts that produced them.<p>This allows:<p>- Faster output review<p>- Easier comparison across runs<p>- Tighter iteration loops without context switching<p>Curious how others are handling evaluation and iteration for multimodal pipelines today.</p>
]]></description><pubDate>Tue, 10 Feb 2026 20:12:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=46966124</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=46966124</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46966124</guid></item><item><title><![CDATA[Why text-based evals fail for vision-language models]]></title><description><![CDATA[
<p>Text hallucination gets most of the attention, but image hallucination may be the larger long-term problem.<p>In vision-language systems, hallucination often means inventing objects, attributes, or actions that are not present in the image at all.<p>Examples:<p>- Describing people who don’t exist
- Inferring actions that never occurred
- Assigning attributes unsupported by visual evidence<p>As these models are increasingly used for e-commerce listings, accessibility captions, document extraction, and medical imaging, the consequences escalate quickly.<p>Most evaluation pipelines are still text-centric. They don’t verify whether the generated description is actually grounded in the image.<p>Detecting image hallucination requires multimodal evaluation that reasons over both the image and the output jointly.<p>Curious how teams here are approaching hallucination detection for vision-language models today.</p>
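<p>One concrete way to evaluate jointly over image and output is to check the caption’s claimed objects against an object detector’s labels. A minimal sketch under stated assumptions (lowercase noun lists in, no synonym matching; a real pipeline would also need confidence thresholds):</p>

```python
def grounding_report(claimed_objects, detected_objects):
    """Flag caption claims that no detector output supports.

    claimed_objects:  object nouns extracted from the generated caption
    detected_objects: labels from an object detector run on the image
    Both are assumed to be lowercase strings; real pipelines would also
    handle synonyms ("couch" vs "sofa") and detector confidence.
    """
    detected = set(detected_objects)
    ungrounded = [obj for obj in claimed_objects if obj not in detected]
    return {
        "ungrounded": ungrounded,
        "hallucination_rate": len(ungrounded) / max(len(claimed_objects), 1),
    }

# Caption mentions a dog that the detector never saw:
report = grounding_report(["person", "dog", "bench"],
                          ["person", "bench", "tree"])
```

<p>This only covers object-level hallucination; attribute and action hallucinations need a model that can reason over the pixels directly.</p>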
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46508362">https://news.ycombinator.com/item?id=46508362</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 06 Jan 2026 03:17:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=46508362</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=46508362</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46508362</guid></item><item><title><![CDATA[New comment by nikhilpareek13 in "[dead]"]]></title><description><![CDATA[
<p>While working with teams building voice agents in SIMULATE, we kept seeing the same pattern:<p>PMs and engineers would run large batches of tests, then immediately jump into the workflow graph to replay calls and figure out one thing:<p>Where exactly did the agent’s path diverge from the intended flow?<p>The only way to answer that was manual flow tracing — stepping through nodes, comparing expected vs actual paths, and trying to spot the turn where behavior shifted. It was slow but necessary work.<p>We turned that repeated behavior into a feature called Flow Analysis.<p>For each test run, Flow Analysis computes and surfaces:<p>- The exact path the agent took through the workflow
- The node where it diverged from the expected path
- How the rest of the conversation evolved after that point<p>This makes debugging more of an engineering task (fix the specific node/logic/prompt) instead of an investigation across the whole graph.<p>If you’re working with voice agents or complex conversational flows and still manually scanning graphs to debug failures, we’d be interested in your thoughts on this approach.<p>More details: <a href="https://app.futureagi.com/dashboard/simulate/agent-definitions?utm_source=0112HNflowanalysis&utm_medium=organic&utm_campaign=content_distribution" rel="nofollow">https://app.futureagi.com/dashboard/simulate/agent-definitio...</a></p>
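<p>At its core, the divergence point described above is a comparison of expected vs actual node sequences. A minimal illustrative sketch (node IDs are made up, not from any real workflow):</p>

```python
def first_divergence(expected_path, actual_path):
    """Return (index, expected_node, actual_node) at the first mismatch,
    or None if the actual path followed the expected flow.

    Paths are sequences of workflow node IDs; one path ending early
    (node is None) also counts as a divergence.
    """
    for i in range(max(len(expected_path), len(actual_path))):
        exp = expected_path[i] if i < len(expected_path) else None
        act = actual_path[i] if i < len(actual_path) else None
        if exp != act:
            return (i, exp, act)
    return None

expected = ["greet", "verify_identity", "collect_issue", "resolve"]
actual = ["greet", "verify_identity", "escalate"]
div = first_divergence(expected, actual)
# The agent escalated where it should have collected the issue.
```

<p>Real flows branch, so production tooling has to compare against a graph of allowed paths rather than a single expected sequence, but the first-mismatch idea is the same.</p>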
]]></description><pubDate>Mon, 01 Dec 2025 18:18:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=46110864</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=46110864</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46110864</guid></item><item><title><![CDATA[We built a black box X-Ray for AI Agents]]></title><description><![CDATA[
<p>Article URL: <a href="https://devhunt.org/tool/agent-compass-by-future-agi">https://devhunt.org/tool/agent-compass-by-future-agi</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45883590">https://news.ycombinator.com/item?id=45883590</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 11 Nov 2025 02:42:51 +0000</pubDate><link>https://devhunt.org/tool/agent-compass-by-future-agi</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=45883590</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45883590</guid></item><item><title><![CDATA[New comment by nikhilpareek13 in "[dead]"]]></title><description><![CDATA[
<p>We built Agent Compass after hitting the same wall over and over: agents generate thousands of traces with branching tool calls and no obvious pattern. APMs (Datadog/New Relic) tell you infra health. LLM observability tools (LangSmith/Arize) show trace detail. But the gap remained: turning all that data into a fast, defensible root cause and a concrete fix.<p>What it is:<p>- Automatic error clustering for AI agents<p>- Symptom → likely root cause mapping<p>- Actionable fix suggestions you can validate with a focused eval loop<p>Why it’s different:<p>- You debug categories of failures, not one-off traces<p>- It ranks hypotheses (e.g., threshold too high, retrieval drift, prompt regression, guardrail friction)<p>- It proposes small, surgical changes you can A/B and roll back quickly<p>How it works (high level):<p>- We instrument LLM calls, tool invocations, retrieval hits, guardrail events, and outputs as spans (OpenTelemetry compatible)<p>- We build semantic signatures of failure states and cluster them<p>- We label clusters and map them to ranked hypotheses using a mix of rules and learned patterns from historical fixes<p>- We attach a minimal eval set per cluster so you can confirm the fix without re-running your whole suite</p>
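<p>To make the “semantic signatures of failure states and cluster them” step concrete, here is a minimal sketch of similarity-based clustering over failure embeddings. This is illustrative only, not Agent Compass internals: it uses tiny hand-written vectors and a greedy single pass where a real system would use learned embeddings and a proper clustering algorithm:</p>

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_signatures(signatures, threshold=0.9):
    """Greedy single-pass clustering of failure-state embeddings.

    Each signature joins the first cluster whose seed it resembles
    (cosine >= threshold); otherwise it starts a new cluster.
    Returns lists of member indices, one list per cluster.
    """
    clusters = []  # list of (seed_vector, [member_indices])
    for idx, sig in enumerate(signatures):
        for seed, members in clusters:
            if cosine(sig, seed) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((sig, [idx]))
    return [members for _, members in clusters]

# Three traces: two near-identical retrieval failures, one distinct timeout:
sigs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
groups = cluster_signatures(sigs)
```

<p>The payoff of clustering is the debugging unit: you triage one representative per cluster instead of every trace.</p>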
]]></description><pubDate>Wed, 29 Oct 2025 20:04:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=45752318</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=45752318</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45752318</guid></item><item><title><![CDATA[AI is probabilistic. Your testing can't stay deterministic]]></title><description><![CDATA[
<p>Article URL: <a href="https://docs.futureagi.com/future-agi/get-started/evaluation/evaluate-ci-cd-pipeline#evaluate-via-ci-cd-pipeline">https://docs.futureagi.com/future-agi/get-started/evaluation/evaluate-ci-cd-pipeline#evaluate-via-ci-cd-pipeline</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45330720">https://news.ycombinator.com/item?id=45330720</a></p>
<p>Points: 2</p>
<p># Comments: 2</p>
]]></description><pubDate>Mon, 22 Sep 2025 08:50:45 +0000</pubDate><link>https://docs.futureagi.com/future-agi/get-started/evaluation/evaluate-ci-cd-pipeline#evaluate-via-ci-cd-pipeline</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=45330720</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45330720</guid></item><item><title><![CDATA[The only evals that matter while agent testing are the ones you write yourself]]></title><description><![CDATA[
<p>Article URL: <a href="https://app.futureagi.com/dashboard/evaluations">https://app.futureagi.com/dashboard/evaluations</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45186499">https://news.ycombinator.com/item?id=45186499</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 09 Sep 2025 18:35:40 +0000</pubDate><link>https://app.futureagi.com/dashboard/evaluations</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=45186499</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45186499</guid></item><item><title><![CDATA[New comment by nikhilpareek13 in "From Theory to Reality: A Handbook on Scaling RAG for Enterprises"]]></title><description><![CDATA[
<p>I wrote a free handbook on enterprise RAG: not the theory, but what happens when you try to scale it in production.<p>Inside, you’ll find practical guidance on chunking methodologies, re-ranking systems, embedding techniques, hallucination control, implementation patterns, and evaluation strategies, among many other topics.<p>You’ll also learn:<p>- Frameworks to reduce hallucinations
- Enterprise evaluation practices
- ROI optimization via metrics<p>Would love feedback from this community.</p>
]]></description><pubDate>Mon, 01 Sep 2025 04:24:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=45089433</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=45089433</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45089433</guid></item><item><title><![CDATA[From Theory to Reality: A Handbook on Scaling RAG for Enterprises]]></title><description><![CDATA[
<p>Article URL: <a href="https://futureagi.com/mastering-agentic-rag">https://futureagi.com/mastering-agentic-rag</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45089432">https://news.ycombinator.com/item?id=45089432</a></p>
<p>Points: 2</p>
<p># Comments: 1</p>
]]></description><pubDate>Mon, 01 Sep 2025 04:24:18 +0000</pubDate><link>https://futureagi.com/mastering-agentic-rag</link><dc:creator>nikhilpareek13</dc:creator><comments>https://news.ycombinator.com/item?id=45089432</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45089432</guid></item></channel></rss>