Hacker News: gauravvij137

New comment by gauravvij137 in "Ask HN: What are tools you have made for yourself since the advent of AI?"

gauravvij137 — Sun, 14 Jun 2026 09:38:53 +0000

A fully automated prompt optimizer - https://github.com/gauravvij/autoprompter

Built with inspiration from Karpathy's AutoResearch + PromptFoo.

New comment by gauravvij137 in "AWS Bedrock to require sharing data with Anthropic for Mythos and future models"

gauravvij137 — Wed, 10 Jun 2026 12:22:17 +0000

The data leaving AWS boundary kills this for any regulated workload. We've been running side-by-side evals of open models against Claude on private test suites, using Neo as the orchestration layer. Keeps everything in-house and gives us objective comparison data.

New comment by gauravvij137 in "What's inside the trending "skills" repos for Claude Code"

gauravvij137 — Wed, 03 Jun 2026 13:00:48 +0000

Five of the top 10 AI repos on GitHub trending this week are "skills" packs for Claude Code. The label is doing wildly different work across them.

forrestchang/andrej-karpathy-skills (~70k stars). One CLAUDE.md file, four behavioral rules, derived from Karpathy's January tweet about agent coding failure modes (silent wrong assumptions, over-complication, not surfacing tradeoffs). Karpathy didn't write the file or endorse it. The README has had a typo in the install command (andrej-karpthy-skills, missing the second "a") since launch. A second repo, multica-ai/andrej-karpathy-skills, is trending in parallel with the same content republished.

mattpocock/skills (~115k stars). Matt Pocock's personal .claude/skills/ folder, published. About 10 small SKILL.md files: tdd, to-issues, to-prd, triage, zoom-out, setup-matt-pocock-skills. Each one is a self-contained markdown prompt with YAML frontmatter declaring when it should auto-fire. Third-party writeups describe it as a reference implementation of Anthropic's SKILL.md format.

affaan-m/everything-claude-code (~175k stars; plus a second repo affaan-m/ECC at ~205k stars which is the same project under a renamed identifier). 48 agent definitions, 182 SKILL.md files, 68 legacy slash-command shims, hooks, rules, MCP configurations, npm packages (ecc-universal, ecc-agentshield), a Tkinter desktop dashboard, and a security scanner (1282 tests, 102 static analysis rules). Includes per-harness adapters for Claude Code, Codex CLI, Codex macOS app, Cursor, OpenCode, Gemini CLI, and Antigravity. Anthropic hackathon winner.

Three orders of magnitude in scope under one word.

What's underneath is Anthropic's SKILL.md format: markdown with YAML frontmatter, auto-loaded at session start. The frontmatter declares when the skill should fire; the harness picks relevant skills based on the description and injects only those into context. It's RAG-over-prompts using model-based routing on descriptions rather than vector stores. The format works well enough that you can mix skills from different authors in the same .claude/ folder without the harness caring, which is the actual reason this took off. Trending packs ship per-harness adapters on top of that substrate so the same skill content installs into Codex, Cursor, OpenCode, etc., with per-harness rewrites.

The trending list is measuring three different things under one label: small high-leverage CLAUDE.md edits (karpathy-skills, cost-to-try wins), curated personal reference sets (mattpocock, distribution-by-reputation wins), and full framework distributions (ECC, comprehensive-catalog-marketing wins). Stars are not telling us which of these are surviving on actual reuse.

The defensibility implications are uncomfortable. When a startup pitches "our agent does X better because of our prompting and workflow," and the artifact is a folder of markdown files with YAML frontmatter, that's a Notion template, not a moat. Karpathy's four rules will be absorbed into the default behavior of the next Claude release. mattpocock's TDD skill is sharp but copyable. ECC's 182-skill catalog is impressive engineering, but the prompts inside can be diffed and ported in an afternoon.

What does seem to hold value after reading these: harness ergonomics (install paths, hook plumbing, cross-tool sync scripts, MCP-server lifecycle), distribution (mattpocock-the-person is a moat, the markdown isn't), and security tooling around skill files specifically (AgentShield, ECC's scanner, is a real product even if the skills it scans aren't). None of those are the prompt.

Repos: github.com/forrestchang/andrej-karpathy-skills, github.com/mattpocock/skills, github.com/affaan-m/everything-claude-code.

What's inside the trending "skills" repos for Claude Code

gauravvij137 — Wed, 03 Jun 2026 13:00:48 +0000

Article URL: https://aisignals.heyneo.com/

Comments URL: https://news.ycombinator.com/item?id=48383438

Points: 4

# Comments: 1

Show HN: Host any GGUF model in one command

gauravvij137 — Tue, 31 Mar 2026 12:44:46 +0000

Running a GGUF model locally usually means writing custom inference code or wrestling with llama.cpp's CLI flags every time you want to test something.

Existing OpenAI-compatible servers often require Docker, complex configuration files, or GPU support.

The gap between "I have a .gguf file" and "I have a working API endpoint" is wider than it should be.

A simple CLI tool to serve GGUF models as an endpoint: gguf-serve

To cut this short, we asked Neo to build gguf-serve.

Point it at any .gguf file, run the server, and immediately get OpenAI-compatible endpoints that work with any client library or tool that speaks the OpenAI API format.

Comments URL: https://news.ycombinator.com/item?id=47586549

Points: 3

# Comments: 0

Show HN: FC-Eval – CLI to Benchmark Local or Cloud LLMs on Function Calling

gauravvij137 — Tue, 17 Mar 2026 14:02:44 +0000

I built FC-Eval to have a repeatable way to evaluate how well different LLMs handle function calling before using them in agent workflows.

It runs models through 30 test cases covering single-turn, multi-turn, and agentic scenarios, modeled loosely after the Berkeley Function Calling Leaderboard methodology.

Validation uses AST matching rather than string comparison to avoid false positives from formatting variations.

Supports two backends: OpenRouter for cloud models (GPT-5.2, Claude, Qwen 3.5, Mistral, etc.) and Ollama for local models with no API key needed.

Tests for best of N trials giving you a reliable score alongside raw accuracy.

Results export to JSON, TXT, CSV, or Markdown.

Quick start commands: Via Openrouter: `fc-eval --provider openrouter --models openai/gpt-5.2 anthropic/claude-sonnet-4.6`

Via Ollama: `fc-eval --provider ollama --models llama3.2`

GitHub repo: https://github.com/gauravvij/function-calling-cli

Happy to answer questions, especially around the test case design or validation logic.

Comments URL: https://news.ycombinator.com/item?id=47412836

Points: 3

# Comments: 0

Show HN: Auto LLM Ranker – Describe a task in English and get ranked models

gauravvij137 — Mon, 09 Mar 2026 12:44:46 +0000

I got tired of picking LLMs based on vibes and leaderboards that don't reflect real workloads, so I built this.

You describe a task in plain English. The tool generates a test suite for that specific task, discovers candidate models via OpenRouter, benchmarks them in parallel, and uses a Judge LLM to score every response across 5 dimensions: accuracy, hallucination, grounding, tool-calling, and clarity.

Output is a ranked top 3 with average latency per model and a task-specific system prompt optimized for the winner.

A few things I learned while building it:

- Score and latency rarely correlate. The best model for accuracy on coding tasks was almost never the fastest. This tradeoff is completely task-dependent and impossible to see from benchmarks that don't reflect your workload. - The Judge LLM approach is surprisingly consistent but introduces positional and familiarity bias. Using one model to score others isn't perfect, but it's far more reproducible than manual eval. Open to ideas on how to reduce judge bias without blowing up the cost. - Model discovery matters more than I expected. The top performers on generic benchmarks often weren't the top performers on narrow tasks.

Stack: Python, OpenRouter for model access, MIT licensed.

https://github.com/gauravvij/llm-evaluator

Happy to answer questions on the design decisions.

Comments URL: https://news.ycombinator.com/item?id=47308325

Points: 3

# Comments: 0

Show HN: GitHub Repo Agent – an agent that explores and reasons on GitHub repos

gauravvij137 — Tue, 03 Mar 2026 06:43:20 +0000

Built a small agent that can explore a GitHub repository, understand it in-depth, and answer questions about the codebase.

The idea is simple. When you open a new repo, most of the time goes into figuring out: - Where the main logic lives - How modules connect - How to run or debug things

This agent clones a repo, indexes files, and lets an LLM reason over the structure so you can ask questions or automate tasks.

Useful for: - Onboarding large codebases - Understanding OSS repos - Debugging unfamiliar projects - Building higher-level code agents

This came out of experiments we were doing with NEO AI for building autonomous AI Agents where agents need to read repos before modifying them.

Looking for feedback on repo indexing strategies, eval benchmarks, or similar tools people have built.

Comments URL: https://news.ycombinator.com/item?id=47228967

Points: 4

# Comments: 0

Show HN: Kitten TTS Based Low-Latency Streaming Voice Assistant on CPU

gauravvij137 — Thu, 26 Feb 2026 12:42:18 +0000

We asked Neo AI to build a small voice assistant pipeline that runs with low latency on CPU instead of requiring a GPU.

The goal was to see how responsive a LLM → speech system can be on normal laptops or edge devices.

It includes: - Voice Activity Detection - CPU-friendly LLM + TTS streaming - Async pipeline to reduce latency

Modular LLM backend

Useful for local assistants, robotics prototypes, privacy-first setups, or benchmarking STT/LLM/TTS latency.

We’ve been experimenting with similar CPU-first pipelines inside NEO workflows for on-device agents, and this repo is a minimal standalone version.

Would love suggestions on lightweight STT/TTS models or latency tricks people have used on CPU.

Comments URL: https://news.ycombinator.com/item?id=47165276

Points: 3

# Comments: 0

Show HN: LLM Council – Run multiple LLMs with critique and consensus eval

gauravvij137 — Wed, 25 Feb 2026 13:04:27 +0000

Building reliable LLM systems often means not trusting a single model.

We open-sourced LLM Council: https://github.com/abhishekgandhi-neo/llm_council

It’s a small framework we internally built with Neo to run multiple LLMs on the same task, let them critique each other, and produce a structured final answer.

Useful for tasks like: • Comparing local vs API models on your own dataset • Validating RAG outputs • Prompt regression testing • Dataset labeling with model-as-judge • Catching hallucinations in code or research summaries

A few practical details: • Async parallel calls so latency stays close to one model • Structured outputs with each model’s answer and critiques • Provider-agnostic configs for local + hosted models • Built to plug into evaluation pipelines, not just demos

We built this using Neo. We’ve been experimenting with similar council setups to catch silent failures in ML workflows, and this repo is a cleaned-up version of that idea.

If you’ve built multi-LLM evaluation pipelines, would love to hear what aggregation or critique strategies worked well for you.

Comments URL: https://news.ycombinator.com/item?id=47150967

Points: 4

# Comments: 0

New comment by gauravvij137 in "Show HN: CLI tool to analyze your Vector Embeddings!"

gauravvij137 — Fri, 20 Feb 2026 06:50:54 +0000

Working with embeddings (RAG, semantic search, clustering, recommendations, etc.), means: - Generate embeddings - Compute cosine similarity - Run retrieval - Hope it "works"

But then I stumbled upon the issue of not being able to determine why my RAG responses felt off, retrieval quality being inconsistent and clustering results looked weird.

Debugging embeddings was painful.

To solve this issue, we built this Embedding evaluation CLI tool to audit embedding spaces, not just generate them.

Instead of guessing whether your vectors make sense, it: - Detects semantic outliers - Identifies cluster inconsistencies - Flags global embedding collapse - Highlights ambiguous boundary tokens - Generates heatmaps and cluster visualizations - Produces structured reports (JSON / Markdown)

Checkout the tool and feel free to share your feedback: https://github.com/dakshjain-1616/Embedding-Evaluator

This is especially useful for: - RAG pipelines - Vector DB systems - Semantic search products - Embedding model comparisons - Fine-tuning experiments

It surfaces structural problems in the geometry of your embeddings before they break your system downstream.

Show HN: CLI tool to analyze your Vector Embeddings!

gauravvij137 — Fri, 20 Feb 2026 06:50:54 +0000

Article URL: https://github.com/dakshjain-1616/Embedding-Evaluator

Comments URL: https://news.ycombinator.com/item?id=47084592

Points: 2

# Comments: 1

New comment by gauravvij137 in "AI Agent swarm for Stock trading simulation"

gauravvij137 — Tue, 17 Feb 2026 16:16:29 +0000

We just released an open source project for agent swarm based stock trading simulations built and tested by heyneo.so - Fully autonomous ML engineering agent.

The agent swarm self coordinates with each other via an asynchronous message bus.

There are around 10 agents with distinct roles: - 3 Analyst Agents → Generate BUY/SELL signals (SMA crossovers, volume trends) - 4 Trader Agents → Execute trades, manage $250K portfolios each - 2 Risk Managers → Validate orders, enforce stop-loss rules - 1 Reporter Agent → Aggregate P&L and generate reports

The simulation consists of capital allocation, risk checks like stop-losses and order blocking, and reporting baked into the flow. The system backtests over ~250 trading days, starts with a fixed $1M capital, and logs things like drawdown, blocked orders, and approval rates.

Repo here if anyone wants to dig into the implementation or poke holes in the design: https://github.com/dakshjain-1616/Stock-trading-Agent-Swarm-...

Extend the project in your VS Code IDE with Neo: https://marketplace.visualstudio.com/items?itemName=NeoResea...

AI Agent swarm for Stock trading simulation

gauravvij137 — Tue, 17 Feb 2026 16:16:29 +0000

Article URL: https://github.com/dakshjain-1616/Stock-trading-Agent-Swarm---BY-NEO

Comments URL: https://news.ycombinator.com/item?id=47049140

Points: 3

# Comments: 1

New comment by gauravvij137 in "9x MobileNet V2 size reduction with Quantization aware training"

gauravvij137 — Mon, 16 Feb 2026 19:10:43 +0000

This project implements Quantization-Aware Training (QAT) for MobileNetV2, enabling deployment on resource-constrained edge devices. Built autonomously by [NEO](https://heyneo.so), the system achieves exceptional model compression while maintaining high accuracy.

Solution Highlights: - 9.08x Model Compression: 23.5 MB → 2.6 MB (far exceeds 4x target) - 77.2% Test Accuracy: Minimal 3.8% drop from baseline - Full INT8 Quantization: All weights, activations, and operations - Edge-Ready: TensorFlow Lite format optimized for deployment - Single-Command Pipeline: End-to-end automation

Training can be performed on newer Datasets as well.

Project is accessible here: https://github.com/dakshjain-1616/Quantisation-Awareness-tra...

9x MobileNet V2 size reduction with Quantization aware training

gauravvij137 — Mon, 16 Feb 2026 19:10:43 +0000

Article URL: https://github.com/dakshjain-1616/Quantisation-Awareness-training-by-NEO

Comments URL: https://news.ycombinator.com/item?id=47038969

Points: 2

# Comments: 2

New comment by gauravvij137 in "Machine learning agent in VS Code IDE"

gauravvij137 — Sat, 24 Jan 2026 18:04:58 +0000

Founder here. I built NEO, an AI agent designed specifically for AI and ML engineering workflows, after repeatedly hitting the same wall with existing tools: they work for short, linear tasks, but fall apart once workflows become long-running, stateful, and feedback-driven. In real ML work, you don’t just generate code and move on. You explore data, train models, evaluate results, adjust assumptions, rerun experiments, compare metrics, generate artifacts, and iterate; often over hours or days.

Most modern coding agents already go beyond single prompts. They can plan steps, write files, run commands, and react to errors. Where things still break down is when ML workflows become long-running and feedback-heavy. Training jobs, evaluations, retries, metric comparisons, and partial failures are still treated as ephemeral side effects rather than durable state.

Once a workflow spans hours, multiple experiments, or iterative evaluation, you either babysit the agent or restart large parts of the process. Feedback exists, but it is not something the system can reliably resume from.

NEO tries to model ML work the way it actually happens.

It is an AI agent that executes end-to-end ML workflows, not just code generation. Work is broken into explicit execution steps with state, checkpoints, and intermediate results. Feedback from metrics, evaluations, or failures feeds directly into the next step instead of forcing a full restart. You can pause a run, inspect what happened, tweak assumptions, and resume from where it left off.

Here's an example as well for your reference: You might ask NEO to explore a dataset, train a few baseline models, compare their performance, and generate plots and a short report. NEO will load the data, run EDA, train models, evaluate them, notice if something underperforms or fails, adjust, and continue. If training takes an hour and one model crashes at 45 minutes, you do not start over. Neo inspects the failure, fixes it, and resumes.

Happy to answer questions about Neo.

Machine learning agent in VS Code IDE

gauravvij137 — Sat, 24 Jan 2026 18:04:57 +0000

Article URL: https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo

Comments URL: https://news.ycombinator.com/item?id=46745907

Points: 3

# Comments: 1

Show HN: First autonomous ML and AI engineering Agent

gauravvij137 — Thu, 22 Jan 2026 19:50:16 +0000

In real ML work, you don’t just generate code and move on. You explore data, train models, evaluate results, adjust assumptions, rerun experiments, compare metrics, generate artifacts, and iterate; often over hours or days.

NEO tries to model ML work the way it actually happens.

Docs for the extension: https://docs.heyneo.so/#/vscode

Happy to answer questions about Neo.

Comments URL: https://news.ycombinator.com/item?id=46724298

Points: 5

# Comments: 0

New comment by gauravvij137 in "Neo (Autonomous ML engineer) is leading the MLE Bench with 34.2% score"

gauravvij137 — Tue, 19 Aug 2025 06:02:22 +0000

Did you miss reading the official leaderboard score on openai/mle-bench page where it clearly states that Neo has the best score on mlebench?